If you want a sample of an exact size, try:
<pre><code>a.takeSample(false, 1000)
</code></pre>
Note, however, that this returns an <code>Array</code>, not an <code>RDD</code>.
As for why <code>a.sample(false, 0.1)</code> doesn't return the same sample size each time: Spark internally uses <a href="https://en.wikipedia.org/wiki/Bernoulli_sampling" rel="nofollow noreferrer">Bernoulli sampling</a> to take the sample. The <code>fraction</code> argument is not the fraction of the RDD's size that ends up in the sample. It is the probability that each element of the population is selected for the sample, and as Wikipedia says:
<blockquote>
Because each element of the population is considered separately for the sample, the sample size is not fixed but rather follows a binomial distribution.
</blockquote>
In other words, the number of sampled elements varies from run to run.
If you set the first argument to <code>true</code>, Spark uses <a href="https://en.wikipedia.org/wiki/Poisson_sampling" rel="nofollow noreferrer">Poisson sampling</a> instead, which also produces a sample whose size is not deterministic.
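To make the varying sample size concrete, here is a minimal sketch in plain Scala (no Spark required) that imitates Bernoulli sampling. The helper <code>bernoulliSample</code> is a hypothetical name used for illustration only; it mirrors the idea behind <code>sample(false, fraction)</code> and is not Spark's actual implementation. It can be run as a Scala script or pasted into the REPL.

```scala
import scala.util.Random

// Bernoulli sampling: keep each element independently with probability `fraction`.
// Hypothetical helper that mimics the idea behind sample(false, fraction);
// this is not Spark's own code.
def bernoulliSample[T](data: Seq[T], fraction: Double, rng: Random): Seq[T] =
  data.filter(_ => rng.nextDouble() < fraction)

val population = 1 to 10000
val rng = new Random(42L)

// Sizes of five independent samples; they cluster around 10000 * 0.1 = 1000
// but are generally not equal to it.
val sizes = Seq.fill(5)(bernoulliSample(population, 0.1, rng).size)
println(sizes)

// The size follows Binomial(n, p): mean n * p, standard deviation sqrt(n * p * (1 - p)).
val n = population.size
val p = 0.1
val mean = n * p
val stdDev = math.sqrt(n * p * (1 - p))
println(f"expected size = $mean%.0f, standard deviation = $stdDev%.1f")
```

For 10000 elements and a probability of 0.1, the standard deviation works out to 30, so counts somewhat above or below 1000 are entirely normal.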
Update
If you want to stick with the <code>sample</code> method, you can specify a larger probability for the <code>fraction</code> parameter and then call <code>take</code>, as in:
<pre><code>a.sample(false, 0.2).take(1000)
</code></pre>
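Most of the time, though not necessarily always, this yields a sample of exactly 1000 elements, provided the population is large enough. Keep in mind that <code>take</code> also returns an <code>Array</code>, and that it reads elements in partition order, so the 1000 elements it returns come from the front of the sampled RDD.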
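As a rough way to judge whether a chosen <code>fraction</code> leaves enough headroom, the sketch below measures how far the target size sits below the expected size of a Bernoulli sample, in standard deviations. It assumes the binomial model quoted above, and <code>marginInStdDevs</code> is a hypothetical helper name, not part of Spark.

```scala
// How many standard deviations the target size lies below the expected
// size of a Bernoulli sample of `n` elements taken with probability `fraction`.
// Hypothetical helper, assuming the sample size is Binomial(n, fraction).
def marginInStdDevs(n: Long, fraction: Double, target: Int): Double = {
  val mean = n * fraction
  val stdDev = math.sqrt(n * fraction * (1 - fraction))
  (mean - target) / stdDev
}

// fraction = 0.2: expected size 2000, standard deviation 40, margin of about 25
println(marginInStdDevs(10000L, 0.2, 1000))

// fraction = 0.1: expected size equals the target, so there is no margin
// and roughly half of the runs would fall short of 1000 elements
println(marginInStdDevs(10000L, 0.1, 1000))
```

A larger margin makes it less likely that the sample comes up short. With a small population or a fraction close to target / n the margin shrinks, and <code>take(1000)</code> may return fewer than 1000 elements.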

Another way is to call <code>takeSample</code> first and then turn the resulting array back into an RDD. This might be slow with large datasets.

<pre><code>sc.makeRDD(a.takeSample(false, 1000, 1234))
</code></pre>

Why does the <code>rdd.sample()</code> function on a Spark RDD return a different number of elements even though the fraction parameter is the same? For example, if my code looks like this:

<pre><code>val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1).count
</code></pre>

Every time I run the second line of the code it returns a different number not equal to 1000. Actually I expect to see 1000 every time although the 1000 elements might be different. Can anyone tell me how I can get a sample with the sample size exactly equal to 1000? Thank you very much.

How to get a sample with an exact sample size in Spark RDD?
