
Some Small Notes on Spark

Author: 哒呵呵
Published 2018-08-06 15:42:27

This is an informal, incomplete summary from reading the Spark documentation and experimenting with Spark myself.

1. Advantages of the DAG model over MapReduce

First, the DAG model can be seen as an iterated form of MapReduce. One of its advantages is that a DAG allows optimization across the whole job, which Hadoop's MapReduce was not designed to do.

MapReduce offers only Map and Reduce: a single MR job can perform only one simple aggregation over the data; anything more complex requires chaining jobs, which is exactly what a DAG expresses.

A Hadoop MapReduce job has to read data from HDFS, apply map and reduce, and write the result back to HDFS, so intermediate results always hit disk. A DAG engine can make much better use of memory.
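A toy, pure-Python sketch (no Spark or Hadoop involved; the variable names are illustrative only) of why a DAG helps: two chained "map" steps can be fused into a single pass over the data, instead of materializing an intermediate result the way one MapReduce job hands data to the next via HDFS.

```python
# MapReduce style: each job materializes its output before the next job reads it.
data = [1, 2, 3]
step1 = [x + 1 for x in data]      # job 1 output, written out to storage
step2 = [x * 2 for x in step1]     # job 2 re-reads the intermediate result

# DAG style: a planner that sees both steps can fuse them into one pass,
# keeping the intermediate values in memory.
fused = [(x + 1) * 2 for x in data]

print(step2, fused)                # both are [4, 6, 8]
```

Both paths compute the same result; the difference is that the fused version never writes or re-reads the intermediate list.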

2. Understanding the concept of an RDD

An RDD is a dataset which is distributed, that is, divided into "partitions". Each of these partitions can be present in memory or on disk on different machines. If you want Spark to process the RDD, Spark launches one task per partition of the RDD. It is best that each task be sent to the machine that holds the partition it is supposed to process. In that case, the task can read the partition's data from the local machine. Otherwise, the task has to pull the partition data over the network from a different machine, which is less efficient. This scheduling of tasks (that is, allocation of tasks to machines) such that tasks can read data "locally" is known as "locality-aware scheduling".
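The "one task per partition" idea can be sketched in plain Python (no Spark required; `split_into_partitions` and `run_task` are illustrative helpers, not Spark APIs): the data is split into partitions, one task processes each partition independently, and the driver combines the partial results.

```python
def split_into_partitions(data, num_partitions):
    """Round-robin the records into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(data):
        partitions[i % num_partitions].append(record)
    return partitions

def run_task(partition):
    """One 'task' works on exactly one partition, here by summing it."""
    return sum(partition)

data = list(range(1, 11))                      # the records 1..10
parts = split_into_partitions(data, 3)         # 3 partitions
partial_sums = [run_task(p) for p in parts]    # one task per partition
total = sum(partial_sums)                      # driver combines partial results
print(parts)
print(total)                                   # 55
```

In real Spark each element of `parts` would live on a different machine, which is why sending the task to the data (rather than the data to the task) matters.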

3. Notes on PySpark

1. At a high level, every Spark application consists of a driver program that runs the user's main function and executes parallel operations on a cluster.

2. Initializing Spark:
from pyspark import SparkContext, SparkConf
sc = SparkContext(appName="CollectFemaleInfo")

3. Two ways to create an RDD: parallelizing an existing collection, or referencing a dataset in an external storage system:

4.rdd = sc.parallelize([1, 2, 3, 4])

5.distFile = sc.textFile("data.txt")

6. If you use a local file path, make sure the file is accessible at the same path on every node of the cluster.

7.RDD操作

RDDs support two kinds of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
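A key point behind this split is that transformations are lazy and actions force evaluation. That behavior can be mimicked without Spark using a Python generator (a rough analogue only; Spark's laziness is implemented very differently):

```python
data = [1, 2, 3, 4]

# Like a transformation: describes the computation, nothing runs yet.
squared = (x * x for x in data)

# Like an action: forces the computation and returns a value to the caller.
result = sum(squared)
print(result)   # 30
```

In Spark, `rdd.map(lambda x: x * x)` similarly builds up a plan, and only an action such as `reduce`, `count`, or `collect` triggers the actual work.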

Functions can be passed to these operations in three ways:

lambda expressions

functions defined with def

classes and modules

For example:

>>> a = sc.parallelize(["hello world", "dafas"])
>>> def m(s): print(s)
...
>>> a.map(m).collect()
hello world
dafas
[None, None]

(The printing happens on the executors; in local mode it shows up in the driver console. Note that collect() returns the mapped values, here [None, None], so foreach is usually the better fit for side effects like printing.)

All work on objects should be done inside a function; external variables can be passed in as arguments.

Do not update global variables inside such a function: the global exists only on the driver node, not on the executors.

Use collect with care, because it gathers the entire dataset onto a single machine (the driver).
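The warning about globals can be illustrated without Spark (a crude sketch; real Spark serializes the closure and ships it to each executor): every executor works on its own copy of the variable, so the driver's original is never updated.

```python
counter = 0   # lives on the "driver"

def executor_task(partition, counter):
    # Each "executor" receives its own copy of counter; increments here
    # never reach the driver's variable.
    for _ in partition:
        counter += 1
    return counter

partitions = [[1, 2], [3, 4, 5]]
results = [executor_task(p, counter) for p in partitions]
print(counter)   # still 0 on the driver
print(results)   # [2, 3] -- each copy counted only its own partition
```

For a real cluster-wide counter, Spark provides accumulators (sc.accumulator), which are designed exactly for this.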

Shuffle operations

The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

Operations that can trigger a shuffle include repartition operations, ByKey operations, and join operations.

These perform an all-to-all operation: Spark must read from all partitions to find all the values for every key, and then bring values together across partitions to compute the final result for each key.

This is what makes the shuffle costly: it involves disk I/O, data serialization, and network I/O.
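The all-to-all nature of a ByKey operation can be sketched in plain Python (no Spark; the grouping loop stands in for the network shuffle): values for the same key must be gathered from every partition before they can be combined.

```python
from collections import defaultdict

# Key-value pairs spread across three partitions, as they might be in an RDD.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]

# "Shuffle": read every partition to bring together all values per key.
grouped = defaultdict(list)
for partition in partitions:
    for key, value in partition:
        grouped[key].append(value)

# Reduce step, like reduceByKey(lambda x, y: x + y).
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)   # {'a': 4, 'b': 7, 'c': 4}
```

In a cluster, the grouping loop is where data gets serialized and moved between executors, which is why ByKey operations dominate the cost of many jobs.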

Originally published 2017-07-02, shared from the WeChat public account 鸿的学习笔记.
