These are incomplete notes from reading the Spark documentation and experimenting with Spark myself.
1. Advantages of the DAG model over MapReduce
First, a DAG generalizes MapReduce: any multi-stage pipeline can be expressed as a single directed acyclic graph of operations. One advantage is that the engine can optimize the whole graph globally, something Hadoop MapReduce, which plans each job in isolation, cannot do.
MapReduce offers only two primitives, Map and Reduce. A single MR job can perform only one simple aggregation over the data; anything more complex has to be expressed by chaining jobs, and that chain is exactly what a DAG captures.
A Hadoop MapReduce job must read data from HDFS, apply map and reduce, and write the result back to HDFS, so a chained pipeline pays that disk round trip at every stage. A DAG engine can make much better use of memory by keeping intermediate results there instead.
2. Understanding the RDD concept
RDD is a dataset which is distributed, that is, it is divided into "partitions". Each of these partitions can be present in the memory or on the disk of different machines. If you want Spark to process the RDD, then Spark needs to launch one task per partition of the RDD. It is best that each task be sent to the machine that has the partition that task is supposed to process. In that case, the task will be able to read the data of the partition from the local machine. Otherwise, the task would have to pull the partition data over the network from a different machine, which is less efficient. This scheduling of tasks (that is, allocation of tasks to machines) such that the tasks can read data "locally" is known as "locality aware scheduling".
3. spark-python notes
1. At a high level, every Spark application consists of a driver program that runs the main function and executes parallel operations on a cluster.
2. from pyspark import SparkContext, SparkConf — initialize Spark with sc = SparkContext(appName = "CollectFemaleInfo")
3. Two ways to create an RDD: parallelizing an existing collection, or referencing a dataset in an external storage system
4.rdd = sc.parallelize([1, 2, 3, 4])
5.distFile = sc.textFile("data.txt")
6. If you use a local file path, make sure the file is accessible at the same path on every worker node in the cluster
7. RDD operations
RDDs support two kinds of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
There are three ways to pass functions into these operations:
lambda expressions
functions defined with def
top-level functions in a module (or methods of a class)
For example (here a is assumed to be an RDD of strings; the printed lines come from the worker processes, and collect() returns a list of None values because m returns nothing):
>>> def m(s): print(s)
...
>>> a.map(m).collect()
hello world
dafas
Everything a passed function uses should be created inside the function itself; external variables can be passed in, and Spark serializes them into the closure that is shipped to the executors.
Do not update global variables inside such functions: each executor mutates its own copy, while the original variable lives only on the driver node, so the driver never sees the change.
Use collect() with caution: it pulls the entire dataset onto a single machine (the driver), which can exhaust its memory.
Shuffle operations
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Operations that trigger a shuffle include repartition operations (such as repartition and coalesce), ByKey operations (such as groupByKey and reduceByKey), and join operations (such as join and cogroup).
These perform an all-to-all operation: Spark must read from all partitions to find all the values for each key, and then bring values together across partitions to compute the final result for each key.
This is what makes the shuffle costly: it involves disk I/O, data serialization, and network I/O.