Data business



Data security and privacy: hot industries of the future

Data type analyzed

What Data You Analyzed – KDnuggets Poll Results and Trends

Learning Excel well is already quite enough.

Data science


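Westerners are good at visualization. People who make decisions must, at least conceptually, have this kind of analytical ability. People who make decisions haphazardly probably use none of these concepts.

The larger the organization, the more these analyses need to be formalized.

The smaller the organization, the more these analyses need to be conceptualized and abstracted: they should carry real meaning, with simplified forms and processes.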



table of data science


table down right

Community Q&A

table up


table down

table right


table left




The leftmost column is courses and the rightmost is news. I already read KDnuggets and Hacker News regularly.

(Source :

Modern data scientist


It’s interesting and reasonable


Read:


  • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
  • Scalability
  • Failures can be detected and handled at the application layer.

Four modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
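To make the "simple programming model" behind MapReduce concrete for myself, here is a minimal single-machine sketch of the map, shuffle, and reduce steps using word counting. It does not use Hadoop at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are my own illustrative names, not Hadoop APIs, and a real job would run these phases in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical input splits, standing in for blocks of a file on HDFS.
    splits = ["big data big decisions", "data beats opinions"]
    print(word_count(splits))
```

The point of the split into phases is that each call to `map_phase` and each key in `reduce_phase` is independent, which is what lets a framework distribute the work.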

  • Other Hadoop-related projects:

  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

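Among the projects above, anyone who uses Hadoop will end up needing Ambari.

A structured database like HBase looks well suited to national health insurance data.

Spark is probably already in use in Taiwan.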

Hadoop download :

  • Read this document: Apache Hadoop 3.0.0-alpha2


Read: HDFS, MapReduce, YARN, Tools.



Algorithm in decision making



  • One of the biggest problems in predictive modeling is the conflation between classic hypothesis testing with careful model specification vis-a-vis pure data mining.
  • “The key bottleneck is that most data comparison algorithms today rely on a human expert to specify what ‘features’ of the data are relevant for comparison.”
  • There are many viable approaches to model building that leverage, e.g., Lasso, LAR, stepwise algorithms or “elephant models” that use all of the available information. The reality is that, even with AWS or a supercomputer, you can’t use all of the available information at the same time.
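As a note to myself on what Lasso does in this context: it fits a linear model while pushing the weights of weak features to exactly zero, so the algorithm, rather than a human expert, decides which features stay. Below is a minimal pure-Python sketch using coordinate descent. The solver choice, the function names, and the toy data are my own assumptions for illustration; it has no intercept, so it assumes centered data, and a real project would use a tested library instead.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it lies within the threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, alpha, n_iterations=200):
    """Minimize (1 / 2n) * sum((y - Xw)^2) + alpha * sum(|w|) one weight at a time."""
    n_samples = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(n_iterations):
        for j in range(n_features):
            rho = 0.0  # correlation of feature j with the residual that ignores feature j
            z = 0.0    # mean squared value of feature j
            for row, target in zip(rows, targets):
                prediction_without_j = sum(
                    weights[k] * row[k] for k in range(n_features) if k != j
                )
                rho += row[j] * (target - prediction_without_j)
                z += row[j] ** 2
            rho /= n_samples
            z /= n_samples
            weights[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return weights


if __name__ == "__main__":
    # Toy data (assumed): y = 3 * x0 + 0.2 * x1, with centered, orthogonal features.
    rows = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
    targets = [3.2, -2.8, 2.8, -3.2]
    print(lasso_coordinate_descent(rows, targets, alpha=0.5))
```

On this toy data the strong feature keeps a shrunken weight (about 2.5 instead of 3) and the weak feature's weight is set to exactly zero, which is the feature-selection effect the quote is talking about.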


(Source from


Data and skills sets


Open source applications for big data

1. Hadoop

  • I know this one; I need to go through the website and documentation in detail.
  • OS: Windows, Linux, and OS X
  • website:

2. Hypertable

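  • Hypertable is very popular among internet companies. It is modeled on Google's Bigtable design and is used to improve database scalability.
  • It is compatible with Hadoop, and commercial support and training are available.
  • OS: Linux and OS X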
  • website:

3. Mesos

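  • Apache Mesos is a resource abstraction tool that lets an enterprise treat an entire data center as a single resource pool. It is popular among companies running Hadoop, Spark, and similar applications.
  • OS: Linux and OS X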
  • website:

4. Presto 

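  • Presto, developed by Facebook, describes itself as “an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.”
  • OS: Linux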
  • website:

5. Solr

  • This “lightning-fast” enterprise search platform claims to be highly reliable, scalable, and fault tolerant.
  • OS: OS-independent
  • website:

6. Spark

  • I have written about this one before.
  • Apache Spark claims that it “runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.”
  • OS: Windows, Linux, and OS X
  • website:

7. Storm

  • Apache Storm is used to process real-time data.
  • OS: Linux
  • website:

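I have mentioned before how I look at these technologies. The ones I am interested in are 1, 4, 5, and 6, but I have no personal use for them right now, and I have to weigh the opportunity cost of my time. Learning things blindly does not solve problems. How many companies in Taiwan actually need big data to make decisions? I worry about how far the market applications really go. I just want to find applications that solve problems; frankly, how big the data is does not matter to me. Black cat or white cat, the one that catches the mouse is a good cat.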


Machine Learning for Marketing

The marketing big data ecosystem is being impacted by machine learning in four major areas:

  1. Automated data visualization (including ML results) will become more rich, and user-friendly.
  2. Content analysis (textual, lexical, multimedia/rich) will be used to drive better marketing conversations.
  3. Incremental ML techniques will become more prevalent, leading to real-time, not just on-going and automated, changes in marketing execution.
  4. Learning from ML results will accelerate the growth and skills of marketing professionals.
  • Automated Data Visualization tools: Tableau and Qlikview

Predictive model: the objective of ML is to build a predictive model for forecasting.

Incremental ML: the ability to modify a solution that is already in place by introducing new data, rather than having to stop using the current solution before building a new model from scratch.
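The idea of modifying a solution that is already in place by feeding it new data can be sketched with an online linear model that is updated one observation at a time. This is only my own minimal illustration using stochastic gradient descent; the class name `OnlineLinearModel`, the method name `partial_fit`, the helper `train_stream`, and the toy data are all assumptions, not something taken from the source article.

```python
class OnlineLinearModel:
    """Linear regression updated one observation at a time with a gradient step."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def partial_fit(self, features, target):
        """Nudge the existing weights toward the new observation; no retraining from scratch."""
        error = self.predict(features) - target
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


def train_stream(model, stream, passes=1):
    """Feed (features, target) pairs to the model in order, optionally several times."""
    for _ in range(passes):
        for features, target in stream:
            model.partial_fit(features, target)
    return model


if __name__ == "__main__":
    model = OnlineLinearModel(n_features=1)

    # Initial data (assumed): y = 2x + 1.
    initial = [([float(x)], 2.0 * x + 1.0) for x in range(4)]
    train_stream(model, initial, passes=500)
    print(model.predict([4.0]))

    # New data arrives with a shifted relationship (assumed): y = 2x + 3.
    # The same model object keeps serving predictions while it is updated.
    new = [([float(x)], 2.0 * x + 3.0) for x in range(4)]
    train_stream(model, new, passes=500)
    print(model.predict([4.0]))
```

The design choice that matters here is that `partial_fit` only touches the current weights, so the model in production never has to be taken offline while a replacement is built.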

(Source from How Machine Learning Will Be Used For Marketing In 2017)
