Spark DataFrame基本操作

sparkle123

发布于 2018-04-26 05:25:23

1K00

代码可运行

文章被收录于专栏：大数据-Hadoop、Spark大数据-Hadoop、Spark

运行总次数：0

代码可运行

DataFrame的概念来自R/Pandas语言，不过R/Pandas只是runs on One Machine，DataFrame是分布式的，接口简单易用。

Threshold: Spark RDD API VS MapReduce API
One Machine:R/Pandas

官网的说明 http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#datasets-and-dataframes 拔粹如下：

A Dataset is a distributed collection of data：分布式的数据集
A DataFrame is a Dataset organized into named columns. （RDD with Schema）以列（列名、列的类型、列值）的形式构成的分布式数据集，按照列赋予不同的名称
An abstraction for selecting,filtering,aggregation and plotting structured data
It is conceptually equivalent to a table in a relational database or a data frame in R/Python

RDD与DataFrame对比：

RDD运行起来，速度根据执行语言不同而不同：

java/scala  ==> jvm
python ==> python runtime

DataFrame运行起来，执行语言不同，但是运行速度一样:

java/scala/python ==> Logic Plan

根据官网的例子来了解下DataFrame的基本操作，

import org.apache.spark.sql.SparkSession

/**
  * DataFrame API基本操作
  */
object DataFrameApp {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("DataFrameApp")
      .master("local[2]")
      .getOrCreate();

    // 将json文件加载成一个dataframe
    val peopleDF = spark.read.json("C:\\Users\\Administrator\\IdeaProjects\\SparkSQLProject\\spark-warehouse\\people.json");
    // Prints the schema to the console in a nice tree format.
    peopleDF.printSchema();

    // 输出数据集的前20条记录
    peopleDF.show();

    //查询某列所有的数据： select name from table
    peopleDF.select("name").show();

    // 查询某几列所有的数据，并对列进行计算： select name, age+10 as age2 from table
    peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 10).as("age2")).show();

    //根据某一列的值进行过滤： select * from table where age>19
    peopleDF.filter(peopleDF.col("age") > 19).show();

    //根据某一列进行分组，然后再进行聚合操作： select age,count(1) from table group by age
    peopleDF.groupBy("age").count().show();

 spark.stop();
  }
}