Data type analyzed

What Data You Analyzed – KDnuggets Poll Results and Trends

Learning Excel well is already plenty.


Data science


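Westerners are good at visualization. People who make decisions surely have this kind of analytical ability, at least conceptually. People who make decisions carelessly probably use none of these concepts.

The larger the organization, the more these analyses need to be formalized.

The smaller the organization, the more these analyses need to be conceptualized and abstracted: keep the substantive meaning, and simplify the forms and processes.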

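Periodic table of data science

[Figure: the periodic table of data science, followed by close-ups of its regions]

  • Bottom right: community Q&A
  • Top: data management and visualization
  • Right: data science news
  • Left: data science courses and conferences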

 

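They have actually turned data science into a periodic table. That takes a rich visual imagination.

The leftmost section is courses and the rightmost is news. I already read KDnuggets and Hacker News regularly.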

(Source : https://www.datacamp.com/community/blog/data-science-periodic-table#gs.FeNjSMA)

Modern data scientist


It’s interesting and reasonable

Hadoop

Read: http://hadoop.apache.org/

I have written about this before.

  • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
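  • Scalability: it scales from a single server to thousands of machines.
  • Failures can be detected and handled at the application layer.

Four modules: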

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
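
To make the MapReduce module in the list above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is plain Python, not Hadoop's API: the function names and the sample documents are assumptions made for illustration, and a real Hadoop job would spread these steps across a cluster with the data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: each string stands in for one input split.
documents = ["big data big decisions", "small data good decisions"]
counts = word_count(documents)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both steps could in principle run in parallel on different machines, which is the point of the programming model.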

Other Hadoop-related projects:

  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A Scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

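Of the projects above, Ambari is the one you will need whenever you run Hadoop.

A structured database like HBase looks well suited to national health insurance data.

Spark is presumably already in use in Taiwan.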

Hadoop download : http://hadoop.apache.org/releases.html

  • Read this document: Apache Hadoop 3.0.0-alpha2 // http://hadoop.apache.org/docs/current/

This version is not stable yet, but it is fine for reading.

Read: HDFS, MapReduce, YARN, Tools.

 

 

Algorithm in decision making

 


  • One of the biggest problems in predictive modeling is the conflation between classic hypothesis testing with careful model specification vis-a-vis pure data mining.
  • “The key bottleneck is that most data comparison algorithms today rely on a human expert to specify what ‘features’ of the data are relevant for comparison.”
  • There are many viable approaches to model building that leverage, e.g., Lasso, LAR, stepwise algorithms or “elephant models” that use all of the available information. The reality is that, even with AWS or a supercomputer, you can’t use all of the available information at the same time.
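
As an illustration of how one of the approaches named above selects variables, here is a minimal sketch of the Lasso fitted by coordinate descent in plain Python. It is a teaching sketch under stated assumptions, not the method used in the cited article: the toy data, the penalty `alpha = 0.5`, and all function names are hypothetical. The data has one strong feature and one weak feature, and the L1 penalty shrinks the weak coefficient to exactly zero, which is what makes the Lasso usable for variable selection.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it is inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, alpha, n_iterations=50):
    """Minimize (1 / (2n)) * sum((y - Xw)^2) + alpha * sum(|w|) one weight at a time."""
    n_samples = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(n_iterations):
        for j in range(n_features):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for row, target in zip(rows, targets):
                prediction_without_j = sum(
                    weights[k] * row[k] for k in range(n_features) if k != j
                )
                rho += row[j] * (target - prediction_without_j)
                z += row[j] ** 2
            rho /= n_samples
            z /= n_samples
            weights[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return weights


# Hypothetical toy data: y = 3 * x0 + 0.2 * x1, with x0 and x1 uncorrelated.
x0 = [1, 2, 3, 4, -1, -2, -3, -4]
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
rows = [[a, b] for a, b in zip(x0, x1)]
targets = [3 * a + 0.2 * b for a, b in zip(x0, x1)]

weights = lasso_coordinate_descent(rows, targets, alpha=0.5)
print(weights)  # the second weight is exactly 0.0: the weak feature is dropped
```

Ordinary least squares would keep the weak feature with a coefficient of 0.2. The penalty trades a little bias on the strong coefficient (it ends slightly below 3) for a simpler model.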

 

(Source from http://www.kdnuggets.com/2016/06/data-science-variable-selection-review.html?utm_content=buffer1cd7f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)

 

Data and skills sets


Open source application for big data

1. Hadoop

  • I know this one; I need to go through the website and documentation carefully.
  • OS: Windows, Linux, and OS X
  • Website: http://hadoop.apache.org

2. Hypertable

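  • Hypertable is popular among internet companies. It is modeled on Google's Bigtable design and is intended to improve database scalability.
  • Compatible with Hadoop; commercial support and training are available.
  • OS: Linux and OS X
  • Website: http://www.hypertable.com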

3. Mesos

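  • Apache Mesos is a resource abstraction tool that lets an enterprise treat its entire data center as a single pool of resources. It is popular among companies that run Hadoop, Spark, and similar applications.
  • OS: Linux and OS X
  • Website: http://mesos.apache.org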

4. Presto 

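  • Presto, developed by Facebook, describes itself as "an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes."
  • OS: Linux
  • Website: https://prestodb.io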

5. Solr

  • This "lightning-fast" enterprise search platform claims to be highly reliable, scalable, and fault tolerant.
  • OS: OS-independent
  • Website: http://Lucene.apache.org/solr/

6. Spark

  • I have written about this one.
  • Apache Spark claims that it runs programs "up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."
  • OS: Windows, Linux, and OS X
  • Website: http://spark.apache.org

7. Storm

  • Apache Storm is used to process real-time data.
  • OS: Linux
  • Website: https://storm.apache.org

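I have discussed before how to look at these technologies. The ones that interest me are 1, 4, 5, and 6, but I have no personal use for them right now, and I have to weigh the opportunity cost of my time. Learning things blindly does not solve problems. How many companies in Taiwan actually need big data to make decisions? I worry about how far the market applications really go. I only want to find applications that solve problems; frankly, how big the data is does not matter to me. Black cat or white cat, the one that catches mice is a good cat.

I will keep reading these documents, though.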

Machine Learning for Marketing

The marketing big data ecosystem is being impacted by machine learning in four major areas:

  1. Automated data visualization (including ML results) will become more rich, and user-friendly.
  2. Content analysis (textual, lexical, multimedia/rich) will be used to drive better marketing conversations.
  3. Incremental ML techniques will become more prevalent, leading to real-time, not just on-going and automated, changes in marketing execution.
  4. Learning from ML results will accelerate the growth and skills of marketing professionals.
  • Automated Data Visualization tools: Tableau and Qlikview

Predictive model: The objective of ML is to build a predictive model for forecasting.

Incremental ML: the ability to modify a solution that is already in place by introducing new data, rather than having to stop using the current solution before building a new model from scratch.
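
The idea of updating a solution in place with new data, instead of rebuilding it from scratch, can be sketched with a simple linear regression that stores only running sums. This is a minimal illustration under stated assumptions, not a technique taken from the cited article: the class name and the sample numbers (think of x as ad spend and y as sales) are hypothetical.

```python
class IncrementalLinearModel:
    """Simple linear regression y = intercept + slope * x kept as running sums."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, x, y):
        """Fold one new observation into the model without revisiting old data."""
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def coefficients(self):
        """Return (intercept, slope) of the least-squares line for all data seen."""
        denominator = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denominator == 0:
            raise ValueError("need at least two observations with distinct x")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denominator
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return intercept, slope

    def predict(self, x):
        intercept, slope = self.coefficients()
        return intercept + slope * x


model = IncrementalLinearModel()

# Hypothetical initial data: the model is already "in place" after these points.
for x, y in [(1, 3), (2, 5), (3, 7)]:
    model.update(x, y)
before = model.coefficients()

# New data arrives later; the model is updated in place, not rebuilt.
for x, y in [(4, 10), (5, 13)]:
    model.update(x, y)
after = model.coefficients()

print(before, after)
```

The updated coefficients match what a full refit on all five points would give, but the model never had to be taken out of service and the old observations never had to be stored.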

(Source from How Machine Learning Will Be Used For Marketing In 2017)

Big data process


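This process is the same as industry and company analysis; the data just varies in size and depth. For decision making, any data that solves the problem is good data. The goal is to use data to make decisions more rational and more precise.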

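  • By function, big data reports fall into four common types:
  1. Market and industry analysis
  2. Users (user demographic profile)
  3. Competitors
  4. Business operations and special topics: analysis of major strategic decisions in running the business, or a focused analysis of one specific business problem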

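Big data supports business decisions. People in the industry should already be quite familiar with this part; what needs strengthening is technical mastery and hands-on practice.

Too much data is a nuisance for me; "just enough" is my principle. But so-called big data really does involve huge volumes, and setting up dozens to hundreds of servers for parallel processing is common. Business decisions do not always need that much data, but it is a problem if the need arises and you do not know how to handle it. These analyses can also be found on industry blogs. I will probably still keep building my understanding of the technical side.

(See: http://p.t.qq.com/longweibo/page.php?lid=18427992910191656282)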
