<blockquote>
 Question: What is the difference between these two approaches?
 Is there any performance gain from using the DataFrame APIs?
</blockquote>

<hr>

Answer:

Hortonworks published a comparative study of the three approaches. <a href="https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html" rel="noreferrer">source</a>...

<blockquote>
 The gist: depending on the situation, each approach can be the right one. There is no
 hard and fast rule for deciding. Please go through the details below.
</blockquote>

<h3>RDDs, DataFrames, and SparkSQL (in fact three approaches, not just two):</h3>

At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDDs:

<ul>
<li>Resilient - if data in memory is lost, it can be recreated </li>
<li>Distributed - an immutable collection of objects in memory, partitioned across many data nodes in a cluster </li>
<li>Dataset - the initial data can come from files, be created programmatically, come from data in memory, or come from another RDD</li>
</ul>
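
To make the three properties concrete, here is a toy model in plain Python. It is only a sketch of the idea and not how Spark is implemented: the class `ToyRDD` and its methods `lose` and `partition` are made up for illustration. Each derived collection remembers its parent and the function that produced it, so a lost partition can be recomputed from that lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus the recipe that produced it."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # base partitions (stands in for files on disk)
        self._parent = parent    # the ToyRDD this one was derived from, if any
        self._fn = fn            # function applied to each parent element
        self._cache = {}         # partition index -> rows currently "in memory"

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, fn):
        # Never mutates self: returns a new ToyRDD that records its lineage.
        return ToyRDD(parent=self, fn=fn)

    def partition(self, i):
        # Recompute the partition from the lineage when it is not in memory.
        if i not in self._cache:
            if self._parent is None:
                rows = list(self._source[i])
            else:
                rows = [self._fn(x) for x in self._parent.partition(i)]
            self._cache[i] = rows
        return self._cache[i]

    def lose(self, i):
        # Simulate losing the in-memory copy of one partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.partition(i)]


base = ToyRDD(source=[[1, 2], [3, 4]])   # two partitions
doubled = base.map(lambda x: x * 2)

before = doubled.collect()
doubled.lose(1)                          # drop partition 1 from memory
after = doubled.collect()                # partition 1 is rebuilt from base
```

After the simulated loss, `after` holds the same values as `before`, because the missing partition is rebuilt from `base` rather than restored from a copy.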

The DataFrames API is a data abstraction framework that organizes your data into named columns:

<ul>
<li>Creates a schema for the data </li>
<li>Conceptually equivalent to a table in a relational database </li>
<li>Can be constructed from many sources, including structured data files, tables in Hive, external databases, or existing RDDs</li>
<li>Provides a relational view of the data for easy SQL-like data manipulations and aggregations </li>
<li>Under the hood, it is an RDD of Row objects </li>
</ul>

SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:

<ul>
<li>SQL </li>
<li>DataFrames API </li>
<li>Datasets API</li>
</ul>

<h3>Test results:</h3>

<ul>
<li>RDDs outperformed DataFrames and SparkSQL for certain types of data processing</li>
<li>DataFrames and SparkSQL performed almost the same, although SparkSQL had a slight advantage in analysis involving aggregation and sorting</li>
<li>Syntactically speaking, DataFrames and SparkSQL are much more intuitive than RDDs</li>
<li>The best of 3 runs was taken for each test</li>
<li>Times were consistent and not much variation between tests</li>
<li>Jobs were run individually with no other jobs running</li>
</ul>

The two tests were:

<ul>
<li>A random lookup of 1 order ID among 9 million unique order IDs</li>
<li>Grouping all the different products with their total counts, sorted descending by product name</li>
</ul>

<a href="https://i.stack.imgur.com/NkrUy.png" rel="noreferrer"><img src="https://i.stack.imgur.com/NkrUy.png" alt="enter image description here"></a>
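
For reference, here is roughly what the grouping test looks like written both ways in PySpark. This is a sketch under assumptions, not the benchmark's code: the table name `orders`, the column name `product`, and the helper `build_both` are hypothetical, and `ctx` is assumed to be a `HiveContext` (Spark 1.6) or a `SparkSession` (Spark 2+), both of which expose `.sql()` and `.table()`.

```python
# Hypothetical table and column names; substitute your own Hive objects.
HQL = (
    "SELECT product, COUNT(*) AS cnt "
    "FROM orders GROUP BY product ORDER BY product DESC"
)


def build_both(ctx):
    """Build the same aggregation from a SQL string and with the DataFrame API.

    Nothing is executed here: both results are lazy DataFrames until an
    action such as show() or collect() is called on them.
    """
    # Approach 1: hand the existing HQL string to Spark SQL unchanged.
    sql_df = ctx.sql(HQL)

    # Approach 2: express the same query as DataFrame method calls.
    api_df = (
        ctx.table("orders")
        .groupBy("product")
        .count()                                # adds a column named "count"
        .withColumnRenamed("count", "cnt")
        .orderBy("product", ascending=False)
    )
    return sql_df, api_df
```

Calling `explain(True)` on each result prints its logical and physical plans. Since both forms go through the same optimizer, I would expect the plans to be very similar, but that is worth confirming on your own Spark version and data.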

With Spark SQL string queries, you won't learn of a syntax error until runtime (which can be costly), whereas with the DataFrame API syntax errors can be caught at compile time.
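
The point can be illustrated without Spark at all. In the sketch below, the helper `compiles` and the two source snippets are made up for illustration. Python's built-in `compile()` plays the role of the host-language compiler: it accepts a line whose SQL string is nonsense, because to the compiler it is just a string, but it rejects a malformed method chain. Note the limit of the analogy: Python only checks syntax here, while a Scala or Java compiler would also reject a misspelled method name.

```python
def compiles(source):
    """Return True if the host language (Python here) accepts the source text."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# The SQL keywords are misspelled, but the Python around them is valid.
bad_sql_string = 'df = sqlContext.sql("SELEC product FRM orders")'

# The method chain itself is malformed: the last parenthesis is never closed.
bad_api_chain = 'df = sqlContext.table("orders").select("product"'

sql_checked_early = not compiles(bad_sql_string)
api_checked_early = not compiles(bad_api_chain)
```

Here `sql_checked_early` is `False` and `api_checked_early` is `True`: the broken SQL string would only fail once Spark parses it at runtime.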

If the query is lengthy, writing and running it efficiently becomes difficult.
The DataFrame API, on the other hand, together with the Column API, helps developers write compact code, which is ideal for ETL applications.

Also, all operations (e.g. greater than, less than, select, where, etc.) run through a DataFrame build an abstract syntax tree (AST), which is then passed to Catalyst for further optimization. (Source: Spark SQL Whitepaper, Section 3.3)
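
A toy version of that idea in plain Python looks like this. It is only a sketch and has nothing to do with Catalyst's real classes: `Expr`, `col`, `lit` and `simplify` are made-up names. The comparison operators build tree nodes instead of computing a result, and a separate rewrite pass then works on the tree before anything runs.

```python
class Expr:
    """Node of a toy expression tree; operators return new nodes, not values."""

    def __init__(self, op, *children):
        self.op = op
        self.children = children

    def __gt__(self, other):
        return Expr(">", self, lit(other))

    def __lt__(self, other):
        return Expr("<", self, lit(other))

    def __and__(self, other):
        return Expr("AND", self, lit(other))

    def __repr__(self):
        if not self.children:
            return str(self.op)
        return "(" + f" {self.op} ".join(map(repr, self.children)) + ")"


def col(name):
    """Leaf node referring to a column by name."""
    return Expr(name)


def lit(value):
    """Wrap a plain Python value as a leaf node; pass Expr nodes through."""
    return value if isinstance(value, Expr) else Expr(repr(value))


def simplify(e):
    """Toy rewrite rule: drop leaves whose op is the string 'True' from an AND."""
    kids = [simplify(c) for c in e.children]
    if e.op == "AND":
        kept = [k for k in kids if not (k.op == "True" and not k.children)]
        kids = kept or [Expr("True")]
        if len(kids) == 1:
            return kids[0]
    return Expr(e.op, *kids)


predicate = (col("amount") > 100) & (col("qty") < 5)
redundant = (col("amount") > 100) & True

tree_text = repr(predicate)
simplified_text = repr(simplify(redundant))
```

`tree_text` is `((amount > 100) AND (qty < 5))`, and the rewrite pass reduces `redundant` to `(amount > 100)`. A SQL string ends up as the same kind of tree once it is parsed, which is why both routes can share one optimizer.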

A couple more additions: DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames. With the Dataset API, you have more control over the actual execution plan than with SparkSQL.

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the Stage, Raw and Application layers in HDFS and performs CDC (change data capture). It is currently written as Hive queries and executed via Oozie. It needs to migrate to a Spark application (current version 1.6). The other sections of the code will migrate later.
In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as they are (like <code>sqlContext.sql(&quot;my hive hql&quot;)</code>). The other way would be to use the DataFrame APIs and rewrite the HQL that way.
What is the difference between these two approaches?
Is there any performance gain from using the DataFrame APIs?
Some people have suggested that there is an extra SQL layer that the Spark core engine has to go through when using &quot;SQL&quot; queries directly, which may impact performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame APIs, but when I have all my HQL queries handy, is it really worth rewriting the complete code with the DataFrame API?
Thank You.

Writing SQL vs using Dataframe APIs in Spark SQL
