Q: How to compute percentiles in Apache Spark

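Stack Overflow user

Asked 2015-03-02 16:43:26

6 answers · 30.9K views · 0 followers · 25 votes

I have an RDD of integers (i.e. RDD[Int]), and I want to compute the following percentiles: [0th, 10th, 20th, ..., 90th, 100th]. What is the most efficient way to do this?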

apache-spark

6 Answers

Stack Overflow user

Posted 2015-08-19 05:59:28

I found this gist:

https://gist.github.com/felixcheung/92ae74bc349ea83a9e29

It contains the following function:

  import org.apache.spark.rdd.RDD

  /**
   * Compute a percentile from an unsorted Spark RDD.
   * @param data input data set of Long integers (must be non-empty)
   * @param tile percentile to compute (e.g. 85 for the 85th percentile)
   * @return value of the input data at the specified percentile
   */
  def computePercentile(data: RDD[Long], tile: Double): Double = {
    // NIST method; data must be sorted in ascending order
    val r = data.sortBy(x => x)
    val c = r.count()
    if (c == 1) r.first()
    else {
      // Fractional 1-based rank: k is the integer part, d the remainder
      val n = (tile / 100d) * (c + 1d)
      val k = math.floor(n).toLong
      val d = n - k
      if (k <= 0) r.first()
      else {
        // Key each value by its 0-based position in the sorted order
        val index = r.zipWithIndex().map(_.swap)
        val last = c
        if (k >= c) {
          index.lookup(last - 1).head
        } else {
          // Interpolate linearly between ranks k and k + 1
          index.lookup(k - 1).head + d * (index.lookup(k).head - index.lookup(k - 1).head)
        }
      }
    }
  }
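
To make the interpolation rule in the function above concrete, here is a minimal pure-Python sketch of the same arithmetic on a small in-memory list, without Spark. The name `nist_percentile` is hypothetical and not part of the gist; it only mirrors the steps (fractional rank n = tile/100 * (count + 1), split into an integer part k and a fraction d).

```python
import math

def nist_percentile(values, tile):
    """Hypothetical helper: same arithmetic as computePercentile, on a list."""
    xs = sorted(values)              # ascending order, as the method requires
    c = len(xs)
    if c == 0:
        raise ValueError("values must be non-empty")
    if c == 1:
        return float(xs[0])
    n = (tile / 100.0) * (c + 1)     # fractional 1-based rank
    k = math.floor(n)                # integer part of the rank
    d = n - k                        # fractional part of the rank
    if k <= 0:
        return float(xs[0])          # below the first rank: smallest value
    if k >= c:
        return float(xs[-1])         # at or past the last rank: largest value
    # Linear interpolation between the k-th and (k+1)-th smallest values
    return xs[k - 1] + d * (xs[k] - xs[k - 1])

sample = [50, 10, 40, 20, 30]
print(nist_percentile(sample, 25))   # rank 1.5, halfway between 10 and 20
print(nist_percentile(sample, 50))   # rank 3.0, the middle value
```

Note that the Spark version performs up to two `lookup` calls per percentile on top of the sort, so it is probably better suited to a single percentile than to a whole list of them.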

Votes: 4

Stack Overflow user

Posted 2016-03-22 04:31:40

Here is my Python implementation on Spark for computing a percentile of an RDD that contains the values of interest.

import numpy as np

def percentile_threshold(ardd, percentile):
    assert percentile > 0 and percentile <= 100, "percentile should be larger than 0 and smaller than or equal to 100"

    # Sort, key each value by its 0-based index, then look up the
    # nearest-rank index ceil(count / 100 * percentile - 1).
    return ardd.sortBy(lambda x: x).zipWithIndex().map(lambda x: (x[1], x[0])) \
            .lookup(int(np.ceil(ardd.count() / 100 * percentile - 1)))[0]

# Now test it out (assumes `sc` is an existing SparkContext)
randlist = list(range(1, 10001))
np.random.shuffle(randlist)
ardd = sc.parallelize(randlist)

print(percentile_threshold(ardd, 0.001))
print(percentile_threshold(ardd, 1))
print(percentile_threshold(ardd, 60.11))
print(percentile_threshold(ardd, 99))
print(percentile_threshold(ardd, 99.999))
print(percentile_threshold(ardd, 100))

# output:
# 1
# 100
# 6011
# 9900
# 10000
# 10000

In addition, I defined the following function to get the 10th through 100th percentiles.

def get_percentiles(rdd, stepsize=10):
    percentiles = []
    count = rdd.count()
    # Sort the input once, key each value by its 0-based index, and cache
    # the result because it is looked up once per percentile.
    sortedrdd = rdd.sortBy(lambda x: x).zipWithIndex().map(lambda x: (x[1], x[0])).cache()

    for p in range(0, 101, stepsize):
        if p == 0:
            pass
            # I am not aware of a formal definition of the 0th percentile;
            # you can put a placeholder like this if you want:
            # percentiles.append(sortedrdd.lookup(0)[0] - 1)
        else:
            # Nearest-rank index; multiplying before dividing keeps the
            # arithmetic exact, so p == 100 maps to the last element.
            idx = int(np.ceil(count * p / 100.0)) - 1
            percentiles.append(sortedrdd.lookup(idx)[0])

    return percentiles

randlist = list(range(1, 10001))
np.random.shuffle(randlist)
ardd = sc.parallelize(randlist)
get_percentiles(ardd, 10)

# [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
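
Since the question asks about efficiency, one possible refinement is to compute all the wanted indices up front and fetch them in a single pass instead of one `lookup` per percentile. The sketch below shows only the index arithmetic in plain Python; `nearest_rank_indices` is a hypothetical helper name, and it assumes the same nearest-rank rule used in the functions above.

```python
import math

def nearest_rank_indices(count, percentiles):
    """Hypothetical helper: map each percentile in (0, 100] to the 0-based
    index of its nearest-rank value in a sorted data set of size `count`."""
    indices = {}
    for p in percentiles:
        if not 0 < p <= 100:
            raise ValueError("percentile must be in (0, 100]")
        # Nearest rank is ceil(p / 100 * count), at least 1; subtract 1
        # to turn the 1-based rank into a 0-based index.
        rank = max(1, math.ceil(count * p / 100.0))
        indices[p] = rank - 1
    return indices

# For 10,000 sorted values, the 10th percentile sits at index 999, and so on.
print(nearest_rank_indices(10000, range(10, 101, 10)))
```

With PySpark, the resulting indices could then be gathered in one job, for example with `wanted = set(indices.values())` followed by `rdd.sortBy(lambda x: x).zipWithIndex().filter(lambda vi: vi[1] in wanted).collect()`. This is only a sketch of the idea and has not been benchmarked against the `lookup` version.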

Votes: 3

Stack Overflow user

Posted 2015-03-02 17:46:24

Convert the RDD to an RDD of doubles, then use the .histogram(10) operation. See the DoubleRDD ScalaDoc.
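
Note that `histogram(10)` returns ten equal-width buckets with a count for each, not the percentile values themselves, so a further step is needed. Below is a rough sketch, in plain Python, of estimating a percentile from such output by interpolating linearly inside the bucket that contains the target rank. The function name `percentile_from_histogram` is hypothetical, and the input shape (a list of bucket boundaries plus a list of counts) is assumed to match the pair that the histogram call returns. The result is only an approximation whose accuracy depends on the bucket width.

```python
def percentile_from_histogram(boundaries, counts, p):
    """Hypothetical helper: estimate the p-th percentile from histogram output.

    boundaries has len(counts) + 1 entries; bucket i spans
    boundaries[i] .. boundaries[i + 1] and holds counts[i] values.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = total * p / 100.0       # how many values lie at or below the percentile
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= target:
            # Assume values are spread evenly inside the bucket
            fraction = (target - cumulative) / c
            return boundaries[i] + fraction * (boundaries[i + 1] - boundaries[i])
        cumulative += c
    return boundaries[-1]

# Two buckets, [0, 10) and [10, 20], holding 5 values each
print(percentile_from_histogram([0, 10, 20], [5, 5], 75))
```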

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/28805602
