[Overview] Spark and Hadoop Are Friends, Not Foes

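June was an exciting month for Spark. At the Hadoop Summit in San Jose, Spark was a frequent topic of conversation and the subject of many talks. On June 15, IBM also announced that it would invest heavily in Spark-related technology.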

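This announcement helped kick off the Spark Summit in San Francisco, where one could see more and more engineers learning Spark, and more and more companies experimenting with and adopting it.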

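Investment in Spark and adoption of Spark form a virtuous cycle that is rapidly driving the maturity and development of this important technology, to the benefit of the entire big data community. However, the growing attention on Spark has led some people to a strange and stubborn misconception: that Spark can replace Hadoop rather than complement it. The misconception shows up in headlines such as "Companies Move On From Big Data Technology Hadoop."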

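As a long-time big data practitioner and now the CEO of a big-data-as-a-service company, I would like to share my view on this misconception and clear a few things up.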

Spark and Hadoop work well together.

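Hadoop is increasingly the enterprise platform of choice for processing big data. Spark is an in-memory processing solution that runs on top of Hadoop. The largest Hadoop users, including eBay and Yahoo, run Spark inside their own Hadoop clusters. Cloudera and Hortonworks also include Spark in their Hadoop distributions. Our customers at Altiscale have been using Hadoop with Spark since we first launched.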

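Positioning Spark in opposition to Hadoop is like saying that your new electric car is so cool that it does not need electricity at all. In fact, electric cars will drive demand for more electricity.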

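Why does this confusion arise? Today's Hadoop consists of two main parts. The first is a large-scale storage system called the Hadoop Distributed File System (HDFS), which stores data efficiently and at low cost and is optimized for the volume, variety and velocity of big data. The second is a computation engine called YARN, which can run massively parallel programs over the data stored in HDFS.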

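YARN can host any number of programming frameworks. The original framework was MapReduce, invented at Google to help process massive web crawl data. Spark is another such framework, and there is also a newer one called Tez. When people talk about a "showdown" between Spark and Hadoop, what they really mean is that programmers now prefer Spark to the older MapReduce framework.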

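However, MapReduce should not be equated with Hadoop. MapReduce is only one of the many ways a Hadoop cluster can process data, and Spark can be used in its place. Business analysts avoid both of these frameworks, which are low-level tools meant for programmers. Instead, they use high-level languages such as SQL to work with Hadoop more easily.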

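Over the past four years, Hadoop-based big data technology has produced a dizzying amount of innovation. Hadoop has evolved from batch SQL to interactive queries, and from a single framework (MapReduce) to multiple frameworks (such as MapReduce, Spark and others).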

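The performance and security of HDFS have also improved enormously, and many tools have appeared on top of these technologies, such as Datameer, H2O and Tableau. These tools greatly widen the range of people who can use big data infrastructure, making it accessible to data scientists and business users as well.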

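Spark will not replace Hadoop. On the contrary, Hadoop is the foundation on which Spark is built. As organizations look for the broadest and most robust platform for turning their data assets into actionable business insight, their adoption of both Hadoop and Spark will keep growing.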

English original:

June was an exciting month for Apache Spark. At Hadoop Summit San Jose, it was a frequent topic of conversation, as well as the subject of many session presentations. On June 15, IBM announced plans to make a massive investment in Spark-related technology.

This announcement helped kick off the Spark Summit in San Francisco, where one could witness the increasing number of engineers learning about Spark — and the increasing number of companies experimenting with and adopting Spark.

The virtuous cycle of Spark investment and adoption is rapidly driving the maturity and capabilities of this important technology, to the benefit of the entire big data community. However, the growing attention directed toward Spark has also given rise to a strange and stubborn misconception: that Spark is somehow an alternative to Apache Hadoop, instead of a complement to it. This misconception can be seen in headlines like “Newer Software Aims to Crunch Hadoop’s Numbers” and “Companies Move On From Big Data Technology Hadoop.”

As a long-time big data practitioner, an early advocate for investment in Hadoop by Yahoo! and now CEO of a company that provides big data as a service for the enterprise, I’d like to bring some perspective and clarity to this conversation.

Spark and Hadoop work together.

Hadoop is increasingly the enterprise platform of choice for big data. Spark is an in-memory processing solution that runs on top of Hadoop. The largest users of Hadoop — including eBay and Yahoo! — both run Spark inside their Hadoop clusters. Cloudera and Hortonworks ship Spark as part of their Hadoop distributions. And our own customers here at Altiscale have been using Spark on Hadoop since we launched.

To position Spark in opposition to Hadoop is like saying that your new electric car is so cool that you won’t need electricity anymore. If anything, electric cars will drive demand for more electricity.

Why the confusion? Modern-day Hadoop consists of two main components. The first is a large-scale storage system called the Hadoop Distributed File System (HDFS), which stores data in a low-cost, high-performance manner optimized for the volume, variety and velocity of big data. The second component is a computation engine called YARN, which can run massively parallel programs on top of the data stored in HDFS.

YARN can host any number of programming frameworks. The original such framework was MapReduce, invented at Google to help process massive web crawls. Spark is another such framework, as is another new one called Tez. When people talk about Spark “crushing” Hadoop, what they really mean is that programmers now prefer using Spark to the older MapReduce framework.
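To make the MapReduce programming model mentioned above more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API and involves no cluster; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and exist only to illustrate the three conceptual steps that a framework of this kind runs in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Spark runs on Hadoop", "Hadoop stores data", "Spark processes data"]
    print(word_count(sample))
```

In a real cluster, the map and reduce functions would run as many parallel tasks over data stored in HDFS. The sketch only shows the shape of the computation that programmers write against such a framework.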

However, MapReduce should not be equated with Hadoop. MapReduce is just one of many ways to process your data in a Hadoop cluster. Spark can be used as an alternative. Looking more broadly, business analysts — a growing base of big data practitioners — avoid both of these frameworks, which are low-level toolkits meant for programmers. Instead, they use high-level languages like SQL that make Hadoop more accessible.
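As a rough illustration of why SQL is considered more accessible than a low-level framework, the sketch below expresses a word-frequency count as a single declarative query. It uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; it is an assumption made for illustration, not one of the SQL engines that run on Hadoop, and the table name `words` and the function `count_words_sql` are hypothetical.

```python
import sqlite3


def count_words_sql(words):
    """Count word frequencies with one declarative GROUP BY query."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE words (word TEXT)")
        conn.executemany(
            "INSERT INTO words (word) VALUES (?)", [(w,) for w in words]
        )
        # The analyst states what result is wanted; the engine decides how
        # to group and aggregate the rows.
        rows = conn.execute(
            "SELECT word, COUNT(*) AS n FROM words "
            "GROUP BY word ORDER BY n DESC, word"
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    print(count_words_sql(["spark", "hadoop", "spark", "data"]))
```

The query says nothing about how the work is split up or combined, which is the part a programmer would otherwise have to write by hand in a framework such as MapReduce or Spark.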

In the last four years, Hadoop-based big data technology has seen an unprecedented level of innovation. We’ve gone from batch SQL to interactive; from one framework (MapReduce) to multiple frameworks (e.g., MapReduce, Spark and many others).

We’ve seen enormous performance and security improvements in HDFS, and we’ve seen an explosion of tools that sit on top of all of this (such as Datameer, H2O and Tableau) that make all of this big data infrastructure usable by a far broader range of data scientists and business users.

Spark isn’t a challenger that’s going to replace Hadoop. Rather, Hadoop is a foundation that makes Spark possible. We expect to see increasing adoption of both as organizations seek the broadest and most robust platform possible for turning their data assets into actionable business insight.

Originally published on the WeChat official account PPV课数据科学社区 (ppvke123)

Original publication date: 2015-07-24
