
Hard to know for sure without doing some actual profiling, but I have two theories:

First, you may be losing some benefits of the range, specifically its near-zero memory usage. When you write <code>(0L until N * N)</code>, you create a lazy range object (for <code>Long</code> bounds, a <code>NumericRange</code>): it stores only the bounds and the step, not an object for every number in the range. Note that <code>map</code> on a range is strict and does build a collection of results, unless you go through <code>.iterator</code> or <code>.view</code> first. <code>sum</code> then adds the numbers one at a time, so it allocates barely any memory.
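To make the allocation behaviour concrete, here is a minimal sketch of a sequential count that stores nothing proportional to the number of samples. The helper name <code>countHitsLazily</code> and the fixed seed are my own choices for illustration, not part of the question.

```scala
import scala.util.Random

// Hypothetical helper: counts random points that fall inside the unit quarter circle.
// Going through .iterator makes map lazy, so no intermediate collection is built.
def countHitsLazily(samples: Long, rng: Random): Long =
  (0L until samples).iterator
    .map { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      if (x * x + y * y <= 1.0) 1L else 0L
    }
    .sum

val samples = 1000000L
val hits = countHitsLazily(samples, new Random(42))
println(4.0 * hits / samples) // an estimate of pi; the exact digits depend on the seed
```

Without the <code>.iterator</code> call, the same pipeline would first materialize one result per sample and only then sum them.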

I'm not sure <code>ParRange</code> is equally frugal with memory. It seems it would have to allocate some amount per split, and after <code>map</code> is called it may have to hold intermediate results in memory while "neighboring" splits wait for each other to complete. The heap space exception in particular makes me think something like this is happening, in which case you lose a lot of time to GC and the like.

Second, the calls to <code>rng.nextDouble</code> are probably by far the most expensive part of that inner function. But the Java and Scala <code>Random</code> classes are not designed for concurrent use: a single instance is thread-safe, yet every call atomically updates one shared seed, so threads calling it at the same time contend with each other. So you won't gain much from parallelism anyway, and in fact lose some to overhead.
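One way around the shared generator is <code>java.util.concurrent.ThreadLocalRandom</code>, which gives each thread its own generator. The following is only a sketch under that assumption; the object name <code>LocalRngDemo</code> and its methods are illustrative and not taken from the question.

```scala
import java.util.concurrent.ThreadLocalRandom
import java.util.concurrent.atomic.AtomicLong

// Illustrative names throughout; not part of the question's code.
object LocalRngDemo {
  // Draws one point using the calling thread's own generator.
  def insideQuarterCircle(): Boolean = {
    val rng = ThreadLocalRandom.current()
    val x = rng.nextDouble()
    val y = rng.nextDouble()
    x * x + y * y <= 1.0
  }

  // Runs `threads` workers, each drawing `perThread` points, and estimates pi.
  def estimate(threads: Int, perThread: Int): Double = {
    val totalHits = new AtomicLong(0L)
    val workers = (0 until threads).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var hits = 0L
          var k = 0
          while (k < perThread) {
            if (insideQuarterCircle()) hits += 1
            k += 1
          }
          totalHits.addAndGet(hits) // one shared update per thread, not per sample
          ()
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    4.0 * totalHits.get() / (threads.toLong * perThread)
  }
}

println(LocalRngDemo.estimate(threads = 2, perThread = 100000))
```

Each worker touches shared state exactly once, when it adds its local count to the total, so the threads do not contend while sampling.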


There is not enough work per task: the task granularity is too fine.

Creating each task requires some overhead:

<ul>
<li>Some object representing the task must be created</li>
<li>It must be ensured that each task is executed by only one thread</li>
<li>If some threads become idle, a work-stealing procedure must be invoked.</li>
</ul>

For N = 10000, you instantiate 100,000,000 tiny tasks. Each of those tasks does almost nothing: it generates two random numbers, performs some basic arithmetic, and takes an if-branch. The overhead of creating a task is out of all proportion to the work that each task does.

The tasks must be much larger, so that each thread has enough work to do. Furthermore, it is probably faster to make the RNG thread-local, so that the threads can work in parallel without constantly contending for one shared random number generator.

Here is an example:

<pre><code>import java.util.concurrent.ThreadLocalRandom
import scala.util.Random

// Sequential baseline: one tiny map step per sample.
def pi_random(N: Long): Double = {
  val rng = new Random
  val count = (0L until N * N)
    .map { _ =&gt;
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y &lt;= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

// Naive parallel version: N * N tiny tasks sharing one generator.
def pi_random_parallel(N: Long): Double = {
  val rng = new Random
  val count = (0L until N * N)
    .par
    .map { _ =&gt;
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y &lt;= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

// Coarse-grained version: n tasks of n samples each, thread-local generator.
def pi_random_properly(n: Long): Double = {
  val count = (0L until n).par.map { _ =&gt;
    val rng = ThreadLocalRandom.current()
    var sum = 0
    var idx = 0
    while (idx &lt; n) {
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y &lt;= 1.0) sum += 1
      idx += 1
    }
    sum
  }.sum
  4 * count.toDouble / (n * n)
}
</code></pre>
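For setups where <code>.par</code> is not available out of the box (in Scala 2.13 it lives in the separate scala-parallel-collections module), the same coarse-grained idea can be sketched with plain <code>Future</code>s. This is an alternative I am adding for illustration, not the code that was timed; <code>ChunkedPi</code>, <code>estimate</code>, <code>samples</code> and <code>chunks</code> are hypothetical names, and the sketch assumes <code>samples</code> is at least <code>chunks</code>.

```scala
import java.util.concurrent.ThreadLocalRandom
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical sketch: a few large tasks instead of one task per sample.
object ChunkedPi {
  def estimate(samples: Long, chunks: Int): Double = {
    val perChunk = samples / chunks // any remainder is simply dropped
    val tasks = (0 until chunks).map { _ =>
      Future {
        val rng = ThreadLocalRandom.current() // the executing thread's own generator
        var hits = 0L
        var i = 0L
        while (i < perChunk) {
          val x = rng.nextDouble()
          val y = rng.nextDouble()
          if (x * x + y * y <= 1.0) hits += 1
          i += 1
        }
        hits
      }
    }
    // Wait for all chunks and add up their local counts.
    val totalHits = Await.result(Future.sequence(tasks), Duration.Inf).sum
    4.0 * totalHits / (perChunk * chunks)
  }
}

println(ChunkedPi.estimate(4000000L, Runtime.getRuntime.availableProcessors))
```

Using roughly one chunk per core keeps the task-creation overhead negligible compared with the sampling loop inside each task.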

Here is a little demo and timings:

<pre><code>def measureTime[U](repeats: Long)(block: =&gt; U): Double = {
  val start = System.currentTimeMillis

  var iteration = 0
  while (iteration &lt; repeats) {
    iteration += 1
    block
  }

  val end = System.currentTimeMillis
  (end - start).toDouble / repeats
}

// basic sanity check that all algos return roughly same result
println(pi_random(2000))
println(pi_random_parallel(2000))
println(pi_random_properly(2000))

// time comparison (N = 2k, 10 repetitions for each algorithm)
val N = 2000
val Reps = 10
println("Sequential:  " + measureTime(Reps)(pi_random(N)))
println("Naive:       " + measureTime(Reps)(pi_random_parallel(N)))
println("My proposal: " + measureTime(Reps)(pi_random_properly(N)))
</code></pre>

Output:

<pre><code>3.141333
3.143418
3.14142
Sequential: 621.7
Naive: 3032.6
My proposal: 44.7
</code></pre>

Now the parallel version is roughly an order of magnitude faster than the sequential version (result will obviously depend on the number of cores etc.).
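The timings above come from a simple loop around <code>System.currentTimeMillis</code>. If you want to reduce the influence of JIT warm-up on such numbers, a variant along the following lines may help. It is a sketch, not what produced the figures above, and <code>measureTimeWarm</code> and its <code>warmup</code> parameter are names I made up.

```scala
// Hypothetical variant of measureTime: runs `warmup` untimed iterations first,
// then reports the mean wall-clock time per timed iteration in milliseconds.
def measureTimeWarm[U](warmup: Int, repeats: Int)(block: => U): Double = {
  var i = 0
  while (i < warmup) { block; i += 1 }

  val start = System.nanoTime()
  i = 0
  while (i < repeats) { block; i += 1 }
  (System.nanoTime() - start) / 1e6 / repeats
}

println(measureTimeWarm(warmup = 5, repeats = 10)(math.sqrt(2.0)))
```

For anything beyond a rough comparison, a dedicated benchmarking harness would be the safer choice.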

I couldn't test it with N = 10000, because the naively parallelized version crashed everything with a "GC overhead limit exceeded" error, which also illustrates that the overhead of creating the tiny tasks is too large.

In my implementation, I've additionally replaced the inner <code>map</code> with a plain <code>while</code> loop: you need only one counter, which can live in a register, and there is no need to create a huge collection by <code>map</code>ping over the range.

<hr>

Edit: Replaced everything with <code>ThreadLocalRandom</code>. It now shouldn't matter whether your compiler version supports SAM conversion or not, so it should work with earlier versions of 2.11 too.

I am very naively trying to use Scala <code>.par</code>, and the result turns out to be slower than the non-parallel version, by quite a bit. What is the explanation for that? 

Note: the question is not to make this faster, but to understand why this naive use of <code>.par</code> doesn't yield an immediate speed-up.

Note 2: timing method: I ran both methods with N = 10000. The first one returned in about 20s. The second one I killed after 3 minutes. Not even close. If I let it run longer I get into a Java heap space exception.

<pre><code>// Assumption: the question does not show where rng is defined;
// a single shared generator like this one is implied.
val rng = new scala.util.Random

def pi_random(N: Long): Double = {
  val count = (0L until N * N)
    .map { _ =&gt;
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y &lt;= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}

def pi_random_parallel(N: Long): Double = {
  val count = (0L until N * N)
    .par
    .map { _ =&gt;
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x*x + y*y &lt;= 1) 1 else 0
    }
    .sum
  4 * count.toDouble / (N * N)
}
</code></pre>

Scala parallel collections
