
For arrays:

<pre><code>import scala.util.Random
import scala.reflect.ClassTag

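def takeSample[T: ClassTag](a: Array[T], n: Int, seed: Long): Array[T] = {
  // Seeded generator, so the same seed reproduces the same sample
  val rnd = new Random(seed)
  // Draw n random indices in [0, a.length) and look each one up
  Array.fill(n)(a(rnd.nextInt(a.length)))
}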
</code></pre>

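Create a random number generator (<code>rnd</code>) from your seed. Then fill a new array of length <code>n</code>, drawing for each slot a random index from 0 (inclusive) up to the size of the input array (exclusive).

The last step applies the indexing operator of the input array to each random index. Because indices can repeat, this samples with replacement, so <code>n</code> may exceed the length of the array. Using it in the REPL could look as follows: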

<pre><code>scala&gt; val myArray = Array(1,3,5,7,8,9,10)
myArray: Array[Int] = Array(1, 3, 5, 7, 8, 9, 10)

scala&gt; takeSample(myArray,20,System.currentTimeMillis)
res0: Array[Int] = Array(7, 8, 7, 3, 8, 3, 9, 1, 7, 10, 7, 10,
1, 1, 3, 1, 7, 1, 3, 7)
</code></pre>

For lists, I would simply convert the list to an <code>Array</code> and use the same function. I doubt you can get much more efficient for lists anyway.

Note that running the same function directly on a list is much slower: indexing into a <code>List</code> takes time linear in the index, so drawing n samples from a list of length n takes O(n^2) time, whereas converting the list to an array first brings this down to O(n).
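
The conversion advice for lists can be sketched as follows. This is a minimal sketch rather than part of the original answer: the name <code>takeSampleFromList</code> is hypothetical, and it converts to a <code>Vector</code> instead of an <code>Array</code> only so that no <code>ClassTag</code> is needed.

```scala
import scala.util.Random

// Hypothetical helper: sample n elements with replacement from a list.
def takeSampleFromList[T](xs: List[T], n: Int, seed: Long): List[T] = {
  require(xs.nonEmpty, "cannot sample from an empty list")
  val rnd = new Random(seed)
  // One linear-time conversion; after that each lookup is effectively constant time
  val indexed = xs.toVector
  List.fill(n)(indexed(rnd.nextInt(indexed.length)))
}
```

Calling it twice with the same seed should return the same sample, which helps when a run has to be reproducible.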


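An easy-to-understand version would look like this. Note that shuffling samples without replacement, so <code>take(n)</code> returns at most as many elements as the collection contains: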

<pre><code>import scala.util.Random

Random.shuffle(list).take(n)
Random.shuffle(array.toList).take(n)

// Seeded version
val r = new Random(seed)
r.shuffle(...)
</code></pre>
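
To make the seeded variant concrete, here is a small sketch. The name <code>shuffleSample</code> is an assumption made for illustration; the only library call involved is <code>shuffle</code> on a <code>scala.util.Random</code> instance.

```scala
import scala.util.Random

// Hypothetical helper: pick up to n distinct positions using a seeded shuffle.
def shuffleSample[T](xs: List[T], n: Int, seed: Long): List[T] =
  new Random(seed).shuffle(xs).take(n)
```

Asking for more elements than the list holds simply returns the whole list in shuffled order, so this approach does not cover sample sizes larger than the input.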


Using a for comprehension, for a given array <code>xs</code>:

<pre><code>for (i &lt;- 1 to sampleSize; r = (Math.random * xs.size).toInt) yield xs(r)
</code></pre>

Note that the random generator here produces values in the unit interval [0, 1), which are scaled to the size of the array and converted to <code>Int</code> so they can be used as array indices.

Note: For a purely functional random generator, consider for instance the State monad approach from <a href="https://www.manning.com/books/functional-programming-in-scala" rel="nofollow noreferrer">Functional Programming in Scala</a>, discussed <a href="https://stackoverflow.com/questions/31818787/pure-functional-random-number-generator-state-monad">here</a>.

Note: Consider also <a href="https://github.com/NICTA/rng" rel="nofollow noreferrer">NICTA</a>, another purely functional random value generator; its use is illustrated for instance <a href="https://stackoverflow.com/a/25655709/3189923">here</a>.
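
To give a feel for the purely functional style, here is a rough sketch in the spirit of the linear congruential generator used in Functional Programming in Scala. It passes the generator state along by hand rather than through a State monad, and <code>SimpleRNG</code> and <code>pureSample</code> are names chosen here for illustration. It assumes a non-empty input.

```scala
// Linear congruential generator: nextInt returns a value plus the next state,
// so no hidden mutable state is involved.
final case class SimpleRNG(seed: Long) {
  def nextInt: (Int, SimpleRNG) = {
    val newSeed = (seed * 0x5DEECE66DL + 0xBL) & 0xFFFFFFFFFFFFL
    ((newSeed >>> 16).toInt, SimpleRNG(newSeed))
  }
}

// Hypothetical helper: draw n elements with replacement, threading the state explicitly.
def pureSample[T](xs: IndexedSeq[T], n: Int, rng: SimpleRNG): (Vector[T], SimpleRNG) =
  (0 until n).foldLeft((Vector.empty[T], rng)) { case ((acc, g), _) =>
    val (i, next) = g.nextInt
    // floorMod maps any Int into [0, xs.length); this carries a slight modulo bias
    (acc :+ xs(java.lang.Math.floorMod(i, xs.length)), next)
  }
```

The function returns the final generator alongside the sample, so the caller can keep drawing from where it left off.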


Using classical recursion.

<pre><code>import scala.util.Random

def takeSample[T](a: List[T], n: Int): List[T] = {
    n match {
      case n: Int if n &lt;= 0 =&gt; List.empty[T]
      case n: Int =&gt; a(Random.nextInt(a.size)) :: takeSample(a, n - 1)
    }
}
</code></pre>
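
The recursion above is not tail recursive and indexes into the list on every step, so very large sample sizes may overflow the stack or run slowly. One possible variant, sketched here under the hypothetical name <code>takeSampleTailRec</code>, uses an accumulator and converts the list once. It assumes the list is non-empty whenever <code>n</code> is positive.

```scala
import scala.annotation.tailrec
import scala.util.Random

// Hypothetical stack-safe variant of the recursive sampler.
def takeSampleTailRec[T](a: List[T], n: Int): List[T] = {
  // Convert once so each lookup no longer walks the list
  val indexed = a.toVector

  @tailrec
  def loop(remaining: Int, acc: List[T]): List[T] =
    if (remaining <= 0) acc
    else loop(remaining - 1, indexed(Random.nextInt(indexed.length)) :: acc)

  loop(n, Nil)
}
```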


<pre><code>package your.pkg

import your.pkg.SeqHelpers.SampleOps

import scala.collection.generic.CanBuildFrom
import scala.collection.mutable
import scala.language.{higherKinds, implicitConversions}
import scala.util.Random

trait SeqHelpers {

  implicit def withSampleOps[E, CC[_] &lt;: Seq[_]](cc: CC[E]): SampleOps[E, CC] = SampleOps(cc)
}

object SeqHelpers extends SeqHelpers {

  case class SampleOps[E, CC[_] &lt;: Seq[_]](cc: CC[_]) {

    private def recurse(n: Int, builder: mutable.Builder[E, CC[E]]): CC[E] = n match {
      case 0 =&gt; builder.result
      case _ =&gt;
        val element = cc(Random.nextInt(cc.size)).asInstanceOf[E]
        recurse(n - 1, builder += element)
    }

    def sample(n: Int)(implicit cbf: CanBuildFrom[CC[_], E, CC[E]]): CC[E] = {
      require(n &gt;= 0, "Cannot take less than 0 samples")
      recurse(n, cbf.apply)
    }
  }
}
</code></pre>

Either: 

<ul>
<li>Mix in <code>SeqHelpers</code>, for example, with a ScalaTest spec</li>
<li>Include <code>import your.pkg.SeqHelpers._</code></li>
</ul>

Then the following should work:

<pre><code>Seq(1 to 100: _*) sample 10 foreach { println }
</code></pre>

Edits to remove the cast are welcome. 

Also, if there is a way to create an empty instance of the collection for the accumulator without knowing the concrete type ahead of time, please comment. That said, the builder is probably more efficient.
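
One way to avoid the cast, at the cost of no longer preserving the input collection type, is to always return a <code>Vector</code>. The sketch below is an assumption-laden simplification and not a drop-in replacement: <code>SampleSyntax</code> and <code>SeqSampleOps</code> are hypothetical names, and the input is assumed non-empty whenever <code>n</code> is positive. It also does not depend on <code>CanBuildFrom</code>, which was removed in Scala 2.13.

```scala
import scala.util.Random

object SampleSyntax {
  // Hypothetical enrichment: adds `sample` to any Seq, without CanBuildFrom or casts.
  implicit class SeqSampleOps[E](xs: Seq[E]) {
    def sample(n: Int): Vector[E] = {
      require(n >= 0, "Cannot take less than 0 samples")
      // Index into a Vector so lookups stay cheap even if xs is a List
      val indexed = xs.toVector
      Vector.fill(n)(indexed(Random.nextInt(indexed.length)))
    }
  }
}
```

After <code>import SampleSyntax._</code>, a call such as <code>(1 to 100).sample(10).foreach(println)</code> should work.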


If you want to sample without replacement: zip the elements with random numbers, sort by them (<code>O(n*log(n))</code>), discard the random numbers, and take the first few elements.

<pre><code>import scala.util.Random
val l = Seq("a", "b", "c", "d", "e")
val ran = l.map(x =&gt; (Random.nextFloat(), x))
  .sortBy(_._1)
  .map(_._2)
  .take(3)
</code></pre>
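
The same steps can be wrapped in a reusable function. This is only a sketch: the name <code>sampleWithoutReplacement</code> and the choice to pass the generator in as a parameter are assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical helper wrapping the zip / sort / discard / take steps.
def sampleWithoutReplacement[T](xs: Seq[T], k: Int, rnd: Random): Seq[T] =
  xs.map(x => (rnd.nextDouble(), x)) // pair each element with a random key
    .sortBy(_._1)                    // O(n log n) sort by the key
    .map(_._2)                       // drop the keys again
    .take(k)                         // keep the first k elements
```

Passing <code>new Random(seed)</code> makes the result reproducible, while passing a fresh <code>new Random()</code> gives a different sample on each call.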


I have not tested this for performance, but the following code is a simple and elegant way to do the sampling, and I believe it can help many who come here just to get sampling code. Just change the range according to the size of your final sample. If the pseudo-randomness is not enough for your needs, you can use take(1) on the inner list and increase the range.

<code>Random.shuffle((1 to 100).toList.flatMap(x =&gt; (Random.shuffle(yourList))))</code>
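
A parameterized take on this idea might look like the sketch below, where the number of shuffled copies is derived from the requested sample size instead of being fixed at 100. The name <code>sampleByRepeatedShuffle</code> is hypothetical. Note that this is not the same as independent sampling with replacement, since each element can appear at most once per shuffled copy.

```scala
import scala.util.Random

// Hypothetical helper: concatenate enough shuffled copies to cover n, then trim.
def sampleByRepeatedShuffle[T](xs: List[T], n: Int, rnd: Random): List[T] =
  if (n <= 0 || xs.isEmpty) Nil
  else {
    // Number of shuffled copies needed to reach at least n elements
    val rounds = (n + xs.size - 1) / xs.size
    rnd.shuffle(List.fill(rounds)(rnd.shuffle(xs)).flatten).take(n)
  }
```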

I want to randomly sample from a Scala list or array (not an RDD). The sample size can be much larger than the length of the list or array. How can I do this efficiently? The sample size can be very big, and the sampling (on different lists/arrays) needs to be done a large number of times.

I know for a Spark RDD we can use takeSample() to do it, is there an equivalent for Scala list/array?

Thank you very much.

How to randomly sample from a Scala list or array?

Spark 
