
Use the DataFrame returned by:

<pre><code>yourDF.orderBy("Account")
</code></pre>

There is no explicit way to use <code>partitionBy</code> on a DataFrame, only on a PairRDD. When you sort a DataFrame, however, the sort becomes part of its LogicalPlan, and that will help when you need to make calculations on each Account.

I just stumbled upon exactly the same issue, with a DataFrame that I want to partition by account.
I assume that when you say "want to have the data partitioned so that all of the transactions for an account are in the same Spark partition", you want it for scale and performance, but your code doesn't depend on it (as it would if you used <code>mapPartitions()</code>, for example), right?


So, to start with some kind of answer: you can't.

I am not an expert, but as far as I understand DataFrames, they are not equivalent to RDDs, and a DataFrame has no such thing as a Partitioner.

Generally, the idea of a DataFrame is to provide another level of abstraction that handles such problems itself. Queries on a DataFrame are translated into a logical plan, which is further translated into operations on RDDs. The partitioning you suggested will probably be applied automatically, or at least it should be.

If you don't trust SparkSQL to produce a reasonably optimal job, you can always transform the DataFrame to an RDD[Row], as suggested in one of the comments.
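The RDD[Row] route mentioned above could look roughly like the following. This is only a sketch under assumptions: it assumes the DataFrame has an integer column named <code>Account</code>, and <code>AccountPartitioning</code> and <code>partitionByAccount</code> are hypothetical names chosen for illustration.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object AccountPartitioning {
  // Hypothetical helper: key every row by its "Account" column (assumed to be
  // an Int) and hash-partition, so rows with equal keys share a partition.
  def partitionByAccount(df: DataFrame, numPartitions: Int): RDD[(Int, Row)] =
    df.rdd
      .map(row => (row.getAs[Int]("Account"), row))
      .partitionBy(new HashPartitioner(numPartitions))
}
```

The result is a pair RDD rather than a DataFrame, so per-account work would then go through RDD operations such as <code>mapPartitions</code>.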


I was able to do this using an RDD, but I don't know whether that is an acceptable solution for you.
Once you have the DataFrame available as an RDD, you can apply <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions" rel="noreferrer"><code>repartitionAndSortWithinPartitions</code></a> to perform custom repartitioning of the data.

Here is a sample I used:

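<pre><code>import org.apache.spark.Partitioner
import org.apache.spark.sql.SaveMode

// Routes each record to a partition derived from its Long key (a start time).
class DatePartitioner(partitions: Int) extends Partitioner {

  override def getPartition(key: Any): Int = {
    val startTime: Long = key.asInstanceOf[Long]
    // Hash the key's value rather than a freshly allocated Array (whose hash
    // code is identity-based), and keep the result in [0, partitions).
    val raw = startTime.hashCode % partitions
    if (raw &lt; 0) raw + partitions else raw
  }

  override def numPartitions: Int = partitions
}

// myRDD is a pair RDD keyed by the Long start time.
// toDF() needs sqlContext.implicits._ in scope.
myRDD
  .repartitionAndSortWithinPartitions(new DatePartitioner(24))
  .map { v =&gt; v._2 }
  .toDF()
  .write.mode(SaveMode.Overwrite)
</code></pre>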
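A custom partitioner like the one above has to return the same index for equal keys and stay within the range from 0 up to, but not including, the number of partitions. On the JVM, <code>%</code> keeps the sign of the dividend, so a negative hash code gives a negative index unless it is corrected. Here is a minimal, Spark-free sketch of that index calculation; <code>PartitionIndex</code>, <code>nonNegativeMod</code> and <code>forKey</code> are illustrative names, not part of Spark.

```scala
object PartitionIndex {
  // Map any Int hash into the range [0, numPartitions).
  def nonNegativeMod(hash: Int, numPartitions: Int): Int = {
    val raw = hash % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Derive a partition index from a Long key, for example an epoch timestamp.
  // Equal keys always yield the same index because only the value is hashed.
  def forKey(key: Long, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)
}
```

The same body could be dropped into <code>getPartition</code> after casting the key to <code>Long</code>.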

I've started using Spark SQL and DataFrames in Spark 1.4.0. I want to define a custom partitioner on DataFrames, in Scala, but I don't see how to do this.

One of the data tables I'm working with contains a list of transactions, by account, similar to the following example.

<pre><code>Account Date Type Amount
1001 2014-04-01 Purchase 100.00
1001 2014-04-01 Purchase 50.00
1001 2014-04-05 Purchase 70.00
1001 2014-04-01 Payment -150.00
1002 2014-04-01 Purchase 80.00
1002 2014-04-02 Purchase 22.00
1002 2014-04-04 Payment -120.00
1002 2014-04-04 Purchase 60.00
1003 2014-04-02 Purchase 210.00
1003 2014-04-03 Purchase 15.00
</code></pre>

At least initially, most of the calculations will occur between the transactions within an account. So I would want to have the data partitioned so that all of the transactions for an account are in the same Spark partition.

But I'm not seeing a way to define this. The DataFrame class has a method called 'repartition(Int)', where you can specify the number of partitions to create. But I'm not seeing any method available to define a custom partitioner for a DataFrame, such as can be specified for an RDD.

The source data is stored in Parquet. I did see that when writing a DataFrame to Parquet, you can specify a column to partition by, so presumably I could tell Parquet to partition its data by the 'Account' column. But there could be millions of accounts, and if I'm understanding Parquet correctly, it would create a distinct directory for each Account, so that didn't sound like a reasonable solution.

Is there a way to get Spark to partition this DataFrame so that all data for an Account is in the same partition?

How to define partitioning of DataFrame?
