Open source applications for big data

1. Hadoop

  • I know this one; I need to read the website and documentation more closely
  • OS: Windows, Linux, and OS X
  • Website: http://hadoop.apache.org

2. Hypertable

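  • Hypertable is very popular among Internet companies. It is modeled on Google's Bigtable design and is meant to improve database scalability
  • It is compatible with Hadoop, and commercial support and training are available
  • OS: Linux and OS X
  • Website: http://www.hypertable.com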

3. Mesos

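  • Apache Mesos is a resource abstraction tool that lets an enterprise treat its entire data center as a single pool of resources. It is popular with companies running Hadoop, Spark, and similar applications
  • OS: Linux and OS X
  • Website: http://mesos.apache.org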

4. Presto 

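  • Presto, developed by Facebook, describes itself as "an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes"
  • OS: Linux
  • Website: https://prestodb.io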

5. Solr

  • This "lightning-fast" enterprise search platform claims to be highly reliable, scalable, and fault tolerant
  • OS: OS-independent
  • Website: http://Lucene.apache.org/solr/

6. Spark

  • I have written about this one before
  • Apache Spark claims that it "runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk"
  • OS: Windows, Linux, and OS X
  • Website: http://spark.apache.org

7. Storm

  • Apache Storm is used to process real-time data
  • OS: Linux
  • Website: https://storm.apache.org

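I have mentioned before how to look at these technologies. The ones I am interested in are 1, 4, 5, and 6, but I have no personal use for them at the moment, and the opportunity cost of my time has to be considered. Learning blindly does not solve problems. How many companies in Taiwan need big data to make decisions? I worry about how far the market will actually apply it. I only want to find applications that solve problems; frankly, how big the data has to be does not matter to me. Black cat or white cat, a cat that catches mice is a good cat.

I will keep reading these documents.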

A first look at Ansible

Ansible (note1)

  • An open-source automation engine that automates software provisioning, configuration management, and application deployment (note2)
  • An IT automation engine that organizations use to drive complexity out of their environments and accelerate DevOps initiatives.
  • Ansible, an open source community project sponsored by Red Hat, is the simplest way to automate IT.
  • Ansible is the only automation language that can be used across entire IT teams – from systems and network administrators to developers and managers.
  • Ansible by Red Hat provides enterprise-ready solutions to automate your entire application lifecycle – from servers to clouds to containers and everything in between.
  • Ansible Tower by Red Hat is a commercial offering that helps teams manage complex multi-tier deployments by adding control, knowledge, and delegation to Ansible-powered environments.

The goal of an automated process is to roll out updates without reducing operational capacity.

IT automation, Agile development, DevOps, Deployment, Application update, Testing

  • CD, CI (Continuous Delivery, Continuous Integration)

CI systems, such as Jenkins (jenkins.io), are build systems that watch various source control repositories for changes, run any applicable tests, and automatically build (and ideally test) the latest version of the application from each source control change.

The key handoff for CD is that the build system can invoke Ansible upon a successful build.
Users who also run unit or integration tests on code as a result of the build will also be one step ahead of the game.

Jenkins can use Tower to deploy the built artifact into multiple environments, and a QA/stage environment modeled after production ups the ante and substantially improves predictability along the lifecycle. The data returned by Ansible can then be referenced and directly correlated to a Tower job from the build system's job.

Ansible’s unique multi-tier, multi-step orchestration capabilities, combined with its push-based architecture, allow for extremely rapid execution of these types of complex workflows
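The handoff described above (the build system invokes Ansible only after a successful build) can be sketched roughly as follows. This is a minimal illustration, not how Jenkins or Tower implement it: the function name `deploy_command`, the paths `inventories/stage` and `deploy.yml`, and the variable `app_version` are all hypothetical. Only the `ansible-playbook` command with its `-i` and `--extra-vars` options is taken from the real CLI, and the sketch builds the command without running it.

```python
from typing import List, Optional


def deploy_command(build_ok: bool, inventory: str, playbook: str,
                   version: str) -> Optional[List[str]]:
    """Return the ansible-playbook command for a successful build, else None."""
    if not build_ok:
        # A failed build must never reach the deployment step.
        return None
    return [
        "ansible-playbook",
        "-i", inventory,  # inventory describing the target environment
        playbook,
        "--extra-vars", f"app_version={version}",  # hypothetical variable name
    ]


if __name__ == "__main__":
    # Hypothetical paths and version, for illustration only.
    cmd = deploy_command(True, "inventories/stage", "deploy.yml", "1.4.2")
    if cmd is not None:
        print(" ".join(cmd))
```

A CI job would pass the resulting list to its own process runner. Keeping the decision in one small function makes the "deploy only on green" rule easy to test.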

  • Ansible feature (note3)


  • What is Ansible?

 

  • Ansible automated configuration techniques

 

 

(note1: www.ansible.com)

(note2: https://en.wikipedia.org/wiki/Ansible_(software))

(Why Ansible : https://www.ansible.com/it-automation)

(http://www.slideshare.net/joeywchou/ansible-73452143)

(note3: What is Ansible? // https://www.ansible.com/quick-start-video)

Google, AI, Machine Learning

Google cloud platform (note5)

Target : the enterprise cloud market

  • Position- To be a developer-friendly platform
  • Weakness

1) Weak presence and little mindshare in cloud services and the enterprise segment

2) Historically not seen as a strong contributor to the open source community.

(note1)

  • Strength: ASIC, GPU, and TPU hardware in its cloud
  • Opportunity

1) begin to work with open source projects (note3)

  • Kubernetes, the popular open source container orchestration system, donated to the Cloud Native Computing Foundation, which is run by the Linux Foundation (note4)

Other partners in the new foundation include AT&T, Box, Cisco, Cloud Foundry Foundation, CoreOS, Cycle Computing, Docker, eBay, Goldman Sachs, Huawei, IBM, Intel, Joyent, Kismatic, Mesosphere, Red Hat, Switch SUPERNAP, Twitter, Univa, VMware and Weaveworks.

(Absent: Microsoft, Amazon, Pivotal, and Taiwanese tech companies)

  • TensorFlow for machine learning
  • Spanner for launching massive distributed databases
  • Draco for 3D graphics compression

2) To be a developer-friendly platform

  • "Openness"

1) letting customers run whatever open source stack they choose on Google’s infrastructure,

2) releasing and supporting open source projects and building the ecosystem,

3) making the partners who build tools and technologies on top of GCP first-class citizens on the platform,

4) treating them as part of the whole, so that customers can bring the technology they want and use Google technology or any of the [partner] services

Key success factors (KSF):

1) Being open in order to win developers' mindshare

2) Being much more supportive of the open source community makes people feel better about Google, and makes developers feel better about working with its tools because they can avoid lock-in

  • Threat: competitors: AWS (launched 2006; first public cloud, market leader, first mover), Microsoft, IBM
  • Strategy

Using Kubernetes, the popular open source container orchestration system, Google offers robust open source tools, something that surprised some people in this market.

  • 4 ways Google will enable enterprises to adopt machine learning and AI (note2)

1. Machine learning computing in Google Cloud

A deep learning algorithm can have tens of millions of parameters, so training these machine learning models requires enormous computational resources.

The Cloud Machine Learning Engine is designed for companies with data scientists and machine learning experts who are able to build their own unique machine learning models with libraries such as TensorFlow.

Google positions its infrastructure as the solution to speed up training and improve the return on investment. Google has specialized ASIC, GPU, and TPU hardware in its cloud to accelerate training and improve ROI with on-demand cloud resource utilization. After the model is trained, it can be deployed on a range of platforms, from on-premises servers to mobile devices.
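To give a feel for the "tens of millions of parameters" figure, here is a small sketch that counts the weights and biases of a plain fully connected network. The layer sizes are hypothetical and chosen only for illustration; they do not describe any particular Google model, and the function `dense_param_count` is an illustrative helper, not a TensorFlow API.

```python
from typing import Sequence


def dense_param_count(layer_sizes: Sequence[int]) -> int:
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per input/output pair, plus one bias per output unit.
        total += n_in * n_out + n_out
    return total


if __name__ == "__main__":
    # Hypothetical layer sizes: 4096 inputs, two hidden layers, 1000 outputs.
    sizes = [4096, 4096, 4096, 1000]
    print(dense_param_count(sizes))  # 37659624
```

Even this small three-layer example reaches roughly 37.7 million parameters, each of which is updated on every training step. That is why on-demand accelerators matter for training time.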

2. Algorithms and pretrained machine learning models

Building your own ML model requires the Machine Learning Engine. Alternatively, companies can use Google’s pre-trained models through APIs to add machine learning capability to their applications, such as understanding natural language and images.

A beta API for understanding videos is also available.

Demo: a 3-minute video demonstrating the Cloud Video Intelligence beta

3. Google acquires Kaggle for data

Google acquired Kaggle for data sets and talent. Kaggle, founded in 2010, is a community of 850,000 data scientists from around the world that hosts competitions to create the most accurate predictive models and market models, as well as to acquire new public data sets in a variety of fields.

4. Expertise

The Advanced Solutions Lab serves customers with ambitious goals who want to develop machine learning to solve complex problems.

(note1: https://techcrunch.com/2017/03/09/google-in-the-cloud/)

(note2: http://www.networkworld.com/article/3179127/cloud-computing/4-ways-google-cloud-will-bring-ai-machine-learning-to-the-enterprise.html)

(note3:https://techcrunch.com/2015/07/21/as-kubernetes-hits-1-0-google-donates-technology-to-newly-formed-cloud-native-computing-foundation-with-ibm-intel-twitter-and-others/)

(note4: The mission of this new foundation, the Cloud Native Computing Foundation, is to “help facilitate collaboration among developers and operators on common technologies for deploying cloud native applications and services.”)

(note5: https://cloud.google.com/)

(Reference: Google Cloud Platform 入門 [Introduction to Google Cloud Platform])

(Reference: https://technews.tw/tag/google-cloud-platform/)

 

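Big data-related open source applications

The ones I am interested in are 1, 4, 5, and 6; I have written about 6 before. The gaming industry adopted Hadoop early on. The key is still the scale of the application. The technology landscape is vast, so I can only pick out the key points and learn as I go. JavaScript's npm has more than 200,000 modules, so learning linearly is simply impossible. One has to read widely and select carefully.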

1. Hadoop
OS: Windows, Linux, and OS X
Reference: http://hadoop.apache.org

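2. Hypertable

Hypertable is very popular among Internet companies. It is modeled on Google's Bigtable design and is meant to improve database scalability.
It is compatible with Hadoop, and commercial support and training are available
OS: Linux and OS X
Reference: http://www.hypertable.com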

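3. Mesos

Apache Mesos is a resource abstraction tool that lets an enterprise treat its entire data center as a single pool of resources. It is popular with companies running Hadoop, Spark, and similar applications
OS: Linux and OS X
Reference: http://mesos.apache.org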

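4. Presto

Presto, developed by Facebook, describes itself as "an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes." Facebook says it uses Presto to query a 300 PB data warehouse

OS: Linux
Reference: https://prestodb.io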

5. Solr

This "lightning-fast" enterprise search platform claims to be highly reliable, scalable, and fault tolerant

OS: OS-independent
Reference: http://Lucene.apache.org/solr/

6. Spark

Apache Spark claims that it "runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."

OS: Windows, Linux, and OS X
Reference: http://spark.apache.org

7. Storm

Apache Storm is used to process real-time data
OS: Linux
Reference: https://storm.apache.org

Apache Flink

This is a compute engine, billed as the "4G of Big Data" (note1): fast, easy to use, open source, and high-performing, but with no storage system of its own.

  • Batch Processing
  • Interactive processing
  • Real-time stream processing
  • Graph Processing
  • Iterative Processing
  • In-memory processing

Flink is an alternative to MapReduce, and it is claimed to process data more than 100 times faster than MapReduce.

Flink is independent of Hadoop, but it can use HDFS to read, write, store, and process data. Flink does not provide its own data storage system; it takes data from distributed storage.

Flink ecosystem (note2):

[Figure: Apache Flink ecosystem components]

Storage: reading from and writing to other vendors' data stores is generally not a problem

  • HDFS – Hadoop Distributed File System
  • Local-FS – Local File System
  • S3 – Simple Storage Service from Amazon
  • HBase – NoSQL Database in Hadoop ecosystem
  • MongoDB – NoSQL Database
  • RDBMS – Any relational database
  • Kafka – Distributed messaging Queue
  • RabbitMQ – Messaging Queue
  • Flume – Data Collection and Aggregation Tool

All of the above are supported.

Deploy: Flink can allocate and deploy resources in these modes:

  • Local mode – On single node, in single JVM
  • Cluster – On a multi-node cluster, with one of the following resource managers
    • Standalone – This is the default resource manager which is shipped with Flink
    • YARN – This is very popular resource manager, it is part of Hadoop, introduced in Hadoop 2.x
    • Mesos – This is a generalized resource manager.
  • Cloud – on Amazon or Google cloud

Runtime :

The Distributed Streaming Dataflow, also called the kernel of Apache Flink, is the core layer of Flink. It provides distributed processing, fault tolerance, reliability, native iterative processing capability, and more.

Master-slave architecture:

[Figure: Flink master-slave architecture diagram]

 

Features:

  • Streaming – Flink is a true stream processing engine.
  • High performance – Flink’s data streaming runtime provides very high throughput.
  • Low latency – Flink can process data with sub-second latency.
  • Event time and out-of-order events – Flink supports stream processing and windowing where events arrive delayed or out of order.
  • Lightning-fast speed – Flink processes data at lightning-fast speed (hence the nickname "4G of Big Data").
  • Fault tolerance – The failure of hardware, a node, software, or a process does not bring down the cluster.
  • Memory management – Flink works in managed memory, which is designed to avoid out-of-memory exceptions.
  • Broad integration – Flink can be integrated with various storage systems to process their data, and it can be deployed with various resource management tools. It can also be integrated with several BI tools for reporting.
  • Stream processing – As a true streaming engine, Flink can process live streams at sub-second intervals.
  • Program optimizer – Flink ships with an optimizer; a program is optimized before it is executed.
  • Scalable – Flink is highly scalable; the cluster can be scaled out as requirements grow.
  • Rich set of operators – Flink has many predefined operators for processing data. All the common operations can be done with these operators.
  • Exactly-once semantics – Flink can maintain custom state during computation with exactly-once guarantees.
  • Highly flexible streaming windows – In Flink, windows can be customized with flexible trigger conditions to capture the required streaming patterns. We can define time-based windows (for example, from t1 to t5) as well as data-driven windows.
  • Continuous streaming model with backpressure – Data streaming applications are executed with continuous (long-lived) operators. Flink’s streaming engine naturally handles backpressure.
  • One runtime for streaming and batch processing – Batch processing and data streaming share a common runtime in Flink.
  • Easy and understandable programmable APIs – Flink’s APIs are designed to cover all the common operations, so programmers can use them efficiently.
  • Little tuning required – Memory, network, and serializer settings need little manual configuration.
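The event-time and windowing features in the list above can be illustrated with a small plain-Python sketch. This is not the Flink API: `tumbling_windows` is a hypothetical helper that only shows why assigning events to windows by their own timestamps, rather than by arrival order, lets a late event still land in the correct window. Real Flink additionally uses watermarks to decide when a window may be closed, which this sketch leaves out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event_time_in_seconds, payload)


def tumbling_windows(events: Iterable[Event], size: int) -> Dict[int, List[str]]:
    """Group events into fixed-size windows keyed by window start time."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for event_time, payload in events:
        # Window assignment depends on the event's timestamp, not arrival order.
        window_start = (event_time // size) * size
        windows[window_start].append(payload)
    return dict(windows)


if __name__ == "__main__":
    # Events arrive out of order: "b" (t=3) shows up after "c" (t=7).
    arrived = [(1, "a"), (7, "c"), (3, "b"), (12, "d")]
    for start, items in sorted(tumbling_windows(arrived, size=5).items()):
        print(start, items)
```

With a window size of 5 seconds, "a" and "b" share the window starting at 0 even though "b" arrived late, "c" falls into the window starting at 5, and "d" into the one starting at 10.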

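At first glance, Apache Flink is something TV stations will need as they transform. Live broadcasts used to rely on SNG trucks and satellite uplinks; now they are moving to streaming technology, and that alone saves an untold amount of cost. It has a wide range of uses: it can also handle dirty data, power product recommendations, and make predictions.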

 

(note1: http://data-flair.training/blogs/apache-flink-production-fortune-500-companies-top-real-world-use-cases/)

(note2: data-flair.training/blogs/apache-flink-comprehensive-guide-tutorial-for-beginners/)


 

 

 

OpenStack

  • for IaaS

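OpenStack is open source cloud software jointly created by NASA and Rackspace. It is licensed under the Apache License, is a free and open source software project, and is used to build Infrastructure as a Service (IaaS) (note2)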

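  • 3 major modules:

A compute module, a networking module, and a storage module, plus a centrally managed dashboard module, combine to form the set of OpenStack shared services. Compute resources are exposed as virtual machines, which makes it easy to scale out or reallocate them flexibly (note1)

So this is for network administrators. For networking hardware, becoming programmable and virtualized is inevitable: hardware turns into software and gets virtualized.

For each module (package), see (note1)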

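  • Networking module (package): Neutron

The Neutron package provides Network-Connectivity-as-a-Service to other OpenStack services, such as OpenStack Compute, and gives tenants an API to define and use networks. Its plugin-based architecture supports many networking vendors and technologies. IT staff can assign IP addresses, either static or dynamic, and can also use SDN technologies such as the OpenFlow protocol to build larger-scale or multi-tenant network environments

In addition, it allows other network services to be deployed and managed, such as intrusion detection systems (IDS), load balancing, firewalls, and VPNs.

It is similar to a VPC on Amazon AWS.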
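As a rough illustration of what assigning IP addresses within a tenant network means, the sketch below hands out the first few usable host addresses of a subnet using Python's standard `ipaddress` module. It does not call the Neutron API; the function `allocate_ips` and the CIDR block `10.0.0.0/24` are hypothetical and serve only as an example.

```python
import ipaddress
from itertools import islice
from typing import List


def allocate_ips(cidr: str, count: int) -> List[str]:
    """Return the first `count` usable host addresses in a subnet."""
    network = ipaddress.ip_network(cidr)
    # hosts() skips the network and broadcast addresses of the subnet.
    return [str(ip) for ip in islice(network.hosts(), count)]


if __name__ == "__main__":
    # Hypothetical tenant subnet, for illustration only.
    for address in allocate_ips("10.0.0.0/24", 3):
        print(address)
```

A real networking service also tracks which addresses are already leased and to which port or instance. This sketch covers only the address arithmetic.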

This is surely important for the transformation of networking hardware vendors.

 

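  • Nova – compute project
  • Swift – object storage project
  • Glance – virtual machine disk image (Virtual Machine Image) delivery service
  • Horizon – provides a simple web interface and management console
  • Cinder – provides block storage access
  • Keystone – provides identity authentication
  • Neutron – provides network management
  • Trove – provides database management
  • Sahara – provides deployment of big data processing
  • Ceilometer – provides metering and monitoring
  • Heat – provides automatic scaling of virtual machines

For network administrators.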

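  • Trove database service package (Database as a Service)

Trove is mainly responsible for simplifying access to actual databases, providing OpenStack services with a scalable and reliable Cloud Database-as-a-Service. The database service covers both traditional relational databases and newer non-relational databases.

  It uses Ubuntu as the OS.

This is a good thing and the right direction: it saves money and adds value. Adoption in Asia has been fairly quick.

References: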

(note1: https://kairen.gitbooks.io/openstack-liberty/content/conceptions/index.html)

(note2: https://zh.wikipedia.org/wiki/OpenStack)

(note3:http://www.ithome.com.tw/newstream/109292)

(OpenStack resource roundup: https://kairen.gitbooks.io/openstack-liberty/content/openstack-resource.html)

Linux alternative


GPL


WordPress accuses Wix (note2)

Cause:
Wix used Automattic’s GPL’d Rich Text Editor project in their app, then open sourced the modified and updated version (but not the complete app) on GitHub under the permissive MIT license.

  • The GNU General Public License (GNU GPL or GPL)

A widely used free software license, which guarantees end users the freedom to run, study, share and modify the software (note1)

The GPL is a copyleft license (a play on the word copyright), also called a viral license.

This means that regardless of how many GPL’d components you use in your code, once you distribute the software you have to release its source code, as well as the rights to modify and distribute the entire code.

Furthermore, according to the GPL family of licenses, you also need to release the source code under “the same GPL license” (hence the name viral license). (note2)

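So Wix used GPL code in its own product and published the modified version on GitHub, but under the MIT license.

Using open source code is fine, but after using it you must also open your code for others to use. That is the spirit of the GPL.

Because the app includes GPL code and was distributed, the “entire” thing needs to be GPL.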

  • Open question:

How much of your software do you need to release as open source under the GPL license?

GPL v2 requires you to share the code of all ‘derivative work’.

  • Open question: the definition of ‘derivative work’

 

(note1: https://en.wikipedia.org/wiki/GNU_General_Public_License)
(note2: http://www.whitesourcesoftware.com/whitesource-blog/wordpress-wix-fiasco/?utm_source=linkedin&utm_medium=social&utm_term=blog-wordpress-wix-fiasco&utm_content=blog-wordpress-wix-fiasco&utm_campaign=pat-gr)

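10 web tools that boost productivity

Of these 10 tools,

I already use Asana,

and I used Zoho in the past but never adopted it fully.

OTRS: a powerful open source tool that can handle nearly any size of company and any number of field engineers or internal IT staff. Every ticket generated by the system retains a history and can automatically generate an alert email to the assigned technician. So OTRS is a work reminder tool, I suppose; people with a lot of free time would probably love to try it…

FreshBooks: an accounting solution, used for issuing invoices

Collabtive: an open source project management tool

The other tools cover accounting and project management. In a company, the CIO should evaluate them, judge their fit, and roll them out.

Large enterprises may have some doubts about open source tools, but they may suit entrepreneurs and small and medium-sized businesses well.

I am bookmarking this article; one day I will consider using these handy tools.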
