Content sourced from Stack Overflow, translated and used under the CC BY-SA 3.0 license.
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
However, I want to do something like this:
foreach occurrence-in-the-rdd{ //do stuff with the array found on location n of the RDD }
You can call various methods on the RDD:
import org.apache.spark.{SparkConf, SparkContext}

// Set up an example: an RDD of arrays.
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
val testRDD = sc.parallelize(testData, 2)

// Print the size of each array in the RDD.
testRDD.collect().foreach(a => println(a.size))

// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)

// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))

// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)

// Print each remaining array on its own line.
bigRDD.collect().foreach(a => {
  a.foreach(e => print(e + " "))
  println()
})
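The "foreach occurrence" pseudocode in the question maps fairly directly onto `RDD.foreach`. The following is a minimal sketch, not a definitive implementation: the object name `ForeachExample` and the helper `render` are hypothetical names introduced only for illustration, and the data is the same small example RDD of arrays used above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForeachExample {
  // Hypothetical helper: formats one array as a space-separated line.
  def render(a: Array[Int]): String = a.mkString(" ")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("ForeachExample")
    val sc = new SparkContext(conf)
    try {
      val testRDD = sc.parallelize(Seq(Array(1, 2, 3), Array(4, 5, 6, 7, 8)), 2)

      // foreach runs the function on the executors. With a "local" master
      // the output appears in this console; on a cluster it goes to the
      // executor logs instead.
      testRDD.foreach(a => println(render(a)))

      // When the results are needed on the driver, transform first and
      // then collect() the (small) result.
      val lines = testRDD.map(render).collect()
      lines.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that an RDD created with `sc.wholeTextFiles` holds `(path, content)` pairs rather than arrays, so for that RDD the function passed to `foreach` would take a tuple, for example `{ case (path, content) => ... }`.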