How do I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) into a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it, I want to put it back into a DataFrame. How can I do this?
Posted on 2015-07-08 02:27:12
Assuming your RDD[Row] is called rdd, you can use:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
rdd.toDF()
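Note that the implicit toDF() conversion works for RDDs of Product types (case classes and tuples); for an RDD[Row] specifically, you generally need to pass the schema explicitly via createDataFrame. A minimal sketch, assuming a hypothetical two-column (name, age) schema for illustration:

```scala
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sqlContext = new SQLContext(sc)

// Assumed example schema -- replace with the columns your rows actually contain
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val rowRDD = sc.parallelize(Seq(Row("Alice", 2), Row("Bob", 5)))

// createDataFrame pairs the RDD[Row] with the schema you provide
val df = sqlContext.createDataFrame(rowRDD, schema)
df.show()
```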
Posted on 2016-11-16 14:39:05
Method 1: (Scala)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df_2 = sc.parallelize(Seq((1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c"))).toDF("x", "y", "z")
Method 2: (Scala)
case class temp(val1: String, val3: Double)
val rdd = sc.parallelize(Seq(
  Row("foo", 0.5), Row("bar", 0.0)
))
val rows = rdd.map { case Row(val1: String, val3: Double) => temp(val1, val3) }.toDF()
rows.show()
Method 1: (Python)
from pyspark.sql import Row
l = [('Alice',2)]
Person = Row('name','age')
rdd = sc.parallelize(l)
person = rdd.map(lambda r:Person(*r))
df2 = sqlContext.createDataFrame(person)
df2.show()
Method 2: (Python)
from pyspark.sql.types import *
l = [('Alice',2)]
rdd = sc.parallelize(l)
schema = StructType([StructField("name", StringType(), True),
                     StructField("age", IntegerType(), True)])
df3 = sqlContext.createDataFrame(rdd, schema)
df3.show()
Extract the values from the Row objects, then apply a case class to convert the RDD to a DataFrame:
// attrib1, attrib2 and result are assumed to be existing RDD[Row]s from earlier processing
val temp1 = attrib1.map { case Row(key: Int) => s"$key" }
val temp2 = attrib2.map { case Row(key: Int) => s"$key" }
case class RLT(id: String, attrib_1: String, attrib_2: String)
import hiveContext.implicits._
// Row fields are typed Any, so convert them to String explicitly before building the case class
val df = result.map { s => RLT(s(0).toString, s(1).toString, s(2).toString) }.toDF()
Posted on 2015-12-05 17:11:35
Here is a simple example of converting a List into a Spark RDD and then converting that RDD into a DataFrame.
Note that I used the Scala REPL of spark-shell to execute the following code; here sc is an instance of SparkContext, which is implicitly available in spark-shell. I hope this answers your question.
scala> val numList = List(1,2,3,4,5)
numList: List[Int] = List(1, 2, 3, 4, 5)
scala> val numRDD = sc.parallelize(numList)
numRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[80] at parallelize at <console>:28
scala> val numDF = numRDD.toDF
numDF: org.apache.spark.sql.DataFrame = [_1: int]
scala> numDF.show
+---+
| _1|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
+---+
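For what it's worth, on Spark 2.x and later the entry point is SparkSession rather than SQLContext; a sketch of the equivalent conversion, assuming a spark-shell session where the spark variable is provided:

```scala
// In spark-shell on Spark 2.x+, `spark` (a SparkSession) is available implicitly
import spark.implicits._

// toDF accepts column names, avoiding the default _1 label seen above
val numDF = sc.parallelize(List(1, 2, 3, 4, 5)).toDF("value")
numDF.show()
```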
https://stackoverflow.com/questions/29383578