
Judging by the logic of the syntax, the first approach should use less space, since flatMap expands to .map().flatten, with both operating on an argument of equal size. It compiles to the same Java bytecode in the Scala REPL (edit: that was with a pet example, which obviously is no substitute for actually testing it with comparably large data).
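As a rough illustration of the equivalence mentioned above, here is a minimal sketch on plain Scala collections (no Spark involved). The <code>Person</code> class, the sample rows and the <code>keepAdults</code> helper are made up for the example; it only shows that the three formulations return the same elements, and says nothing about their cost on a large Dataset.

```scala
// Plain Scala collections only; Person, the sample data and keepAdults
// are hypothetical stand-ins for the classes in the question.
object FlatMapEquivalence {
  final case class Person(name: String, age: Int)
  final case class Subset(name: String, age: Int)

  val people: List[Person] =
    List(Person("Ann", 31), Person("Bob", 19), Person("Cid", 47))

  // Wraps a matching row in a one-element Seq, otherwise returns an empty Seq.
  def keepAdults(p: Person): Seq[Subset] =
    if (p.age > 25) Seq(Subset(p.name, p.age)) else Seq.empty

  // flatMap in one step.
  val viaFlatMap: List[Subset] = people.flatMap(keepAdults)

  // The same thing spelled out as map followed by flatten.
  val viaMapFlatten: List[Subset] = people.map(keepAdults).flatten

  // The filter + map formulation from approach 1.
  val viaFilterMap: List[Subset] =
    people.filter(_.age > 25).map(p => Subset(p.name, p.age))
}
```

All three values hold the same list, so the choice between them is about cost rather than result.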


Update: My original answer contained an error: Spark does support <code>Seq</code> as the result of a <code>flatMap</code> (and converts the result back into a <code>Dataset</code>). Apologies for the confusion. I also added more information on improving the performance of your analysis.

Update 2: I missed that you're using a <code>Dataset</code> rather than an <code>RDD</code> (doh!). This doesn't affect the answer significantly.

Spark is a distributed system that partitions data across multiple nodes and processes it in parallel. In terms of efficiency, operations that result in re-partitioning (requiring data to be transferred between nodes) are far more expensive at run time than in-place modifications. Also, you should note that operations that merely transform data, such as <code>filter</code>, <code>map</code>, <code>flatMap</code>, etc., are only recorded and do not execute until an action is performed (such as <code>reduce</code>, <code>fold</code>, <code>aggregate</code>, etc.). Consequently, neither alternative actually does anything as things stand.
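To make the point about deferred execution concrete, here is a minimal sketch, assuming Spark SQL is on the classpath and a local master is acceptable. The trimmed-down <code>Complete</code> class (with a single hypothetical <code>field1</code>), the in-memory rows and the accumulator are all assumptions for illustration, not the question's real data. The accumulator simply counts how many rows the <code>filter</code> function has seen.

```scala
import org.apache.spark.sql.SparkSession

object LazyTransformations {
  // Hypothetical, trimmed-down version of the question's Complete class.
  final case class Complete(name: String, field1: String, age: Int)
  final case class Subset(name: String, age: Int)

  // Returns (rows seen before the action, rows seen after it, rows kept).
  def run(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    // Small in-memory Dataset standing in for the CSV file.
    val ds = Seq(
      Complete("Ann", "x", 31),
      Complete("Bob", "y", 19),
      Complete("Cid", "z", 47)
    ).toDS()

    // Incremented every time the filter function is invoked.
    val seen = spark.sparkContext.longAccumulator("rows seen by filter")

    // Only a plan is built here; no row is touched yet.
    val subset = ds
      .filter { x => seen.add(1); x.age > 25 }
      .map(x => Subset(x.name, x.age))

    val seenBeforeAction = seen.sum

    // count() is an action, so this is where the work happens.
    val kept = subset.count()

    (seenBeforeAction, seen.sum, kept)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-transformations")
      .getOrCreate()
    val (before, after, kept) = run(spark)
    println(s"seen before action: $before, seen after action: $after, rows kept: $kept")
    spark.stop()
  }
}
```

The same reasoning applies to any timing wrapped around the transformations alone: it measures plan construction, not data processing, so only the time taken by the action is meaningful.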

When an action is performed on the result of these transformations, I would expect the <code>filter</code> operation to be far more efficient: it only processes data (using the subsequent <code>map</code> operation) that passes the predicate <code>x =&gt; x.age &gt; 25</code> (more typically written as <code>_.age &gt; 25</code>). While it may appear that <code>filter</code> creates an intermediary collection, it executes lazily. As a result, Spark appears to fuse the <code>filter</code> and <code>map</code> operations together.
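The fusing behaviour can be approximated with ordinary Scala iterators. This is only an analogy in plain Scala, not Spark's actual execution engine, and <code>Person</code>, the sample rows and the <code>mapCalls</code> counter are invented for the sketch. It shows that a lazy <code>filter</code> followed by <code>map</code> builds no intermediate collection and calls the mapping function only for elements that pass the predicate.

```scala
// Plain Scala analogy; Person, the data and mapCalls are hypothetical.
object FusedFilterMap {
  final case class Person(name: String, age: Int)
  final case class Subset(name: String, age: Int)

  val people: List[Person] =
    List(Person("Ann", 31), Person("Bob", 19), Person("Cid", 47))

  // Counts how often the mapping function actually runs.
  var mapCalls = 0

  // Iterator.filter and Iterator.map are lazy: each returns a new iterator
  // that pulls one element at a time from the previous one.
  val pipeline: Iterator[Subset] =
    people.iterator
      .filter(_.age > 25)
      .map { p => mapCalls += 1; Subset(p.name, p.age) }

  // Nothing has been pulled through the pipeline yet.
  val callsBeforeConsuming: Int = mapCalls

  // Consuming the iterator drives filter and map together, element by element.
  val result: List[Subset] = pipeline.toList

  // The mapping function ran only for the elements that passed the filter.
  val callsAfterConsuming: Int = mapCalls
}
```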

Your <code>flatMap</code> operation is, frankly, hideous. It forces processing, sequence creation and subsequent flattening for every data item, which will definitely increase the overall processing cost.

That said, the best way to improve the performance of your analysis is to control the partitioning so that the data is split roughly evenly over as many nodes as possible. Refer to <a href="https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html" rel="noreferrer">this guide</a> as a good starting point.
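As a starting point for inspecting partitioning, here is a minimal sketch, again assuming Spark SQL on the classpath and a local master. The <code>Person</code> class, the generated rows and the target of 8 partitions are arbitrary choices for illustration; a sensible partition count depends on your cluster and data. Note that <code>repartition</code> itself shuffles data between nodes, so it is only worth doing when the resulting balance pays for that cost.

```scala
import org.apache.spark.sql.SparkSession

object PartitionCheck {
  // Hypothetical row type for the sketch.
  final case class Person(name: String, age: Int)

  // Repartitions a small generated Dataset and returns the row count per partition.
  def partitionSizes(spark: SparkSession, numPartitions: Int): Array[Int] = {
    import spark.implicits._

    // Generated rows standing in for the real data.
    val ds = (1 to 1000).map(i => Person(s"p$i", i % 90)).toDS()

    // repartition(n) triggers a shuffle and redistributes the rows
    // across n partitions.
    val balanced = ds.repartition(numPartitions)

    // Count the rows held by each partition to see how even the split is.
    balanced.rdd.mapPartitions(rows => Iterator(rows.size)).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-check")
      .getOrCreate()
    val sizes = partitionSizes(spark, 8)
    println(s"${sizes.length} partitions with sizes: ${sizes.mkString(", ")}")
    spark.stop()
  }
}
```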

I have quite a big dataset (100 million+ records with hundreds of columns) that I am processing with Spark. I am reading the data into a Spark Dataset, and I want to filter this Dataset and map a subset of its fields to a case class.

The code looks something like this:

<pre><code>case class Subset(name:String,age:Int)
case class Complete(name:String,field1:String,field2....,age:Int)

val ds = spark.read.format("csv").load("data.csv").as[Complete]

// approach 1
ds.filter(x =&gt; x.age &gt; 25).map(x =&gt; Subset(x.name, x.age))

// approach 2
ds.flatMap(x =&gt; if (x.age &gt; 25) Seq(Subset(x.name, x.age)) else Seq.empty)

</code></pre>

Which approach is better? Any additional hints on how I can make this code more performant? 

Thanks!

Edit

I ran some tests to compare the runtimes, and it looks like approach 2 is considerably faster. The code I used to get the runtimes is as follows:

<pre><code>val subset = spark.time {
 ds.filter(x=&gt;x.age&gt;25).map(x=&gt; Subset(x.name,x.age))
}

spark.time {
 subset.count()
}

// and

val subset2 = spark.time {
 ds.flatMap(x=&gt;if(x.age&gt;25) Seq(Subset(x.name,x.age)) else Seq.empty)
}

spark.time {
 subset2.count()
}
</code></pre>

Does flatmap give better performance than filter+map?
