Q: Pandas/PySpark representative sample with its occurrence in a different column

Stack Overflow user

Asked 2020-07-30 09:04:38

Answers: 1 · Views: 264 · Followers: 0 · Votes: 0

I want to create a representative sample from data in a specific format.

My df looks like this:

query    | total
---------|-------
facebook | 123456
monkey   | 3456
iphone   | 54321
laptop   | 1234
headset  | 3333
plates   | 4333
girl     | 11222
.
.
.

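As I understand it, "representative" means the following:

Each query can appear at most once.
Each query has a probability of being selected that is proportional to its number of occurrences (the total column in my case).

If I built one huge list (dataset) containing facebook 123456 times, monkey 3456 times, and so on, and then ran df.sample(something), it would probably work, but I think it would be very inefficient (given a huge dataset, expanding the counts into such a list could produce billions of rows).

Is there a different (more efficient) way to get a representative sample? It can be done in pandas or PySpark.

Example:

A df with 8 queries and 20 occurrences.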

q1 | 1
q2 | 1
q3 | 5
q4 | 1
q5 | 2
q6 | 4
q7 | 4
q8 | 2

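Suppose I need a representative sample of 5 queries. Each query has a chance of being selected: q1 has 5%, q2 has 5%, q3 has 25%, and so on.
1. After the first roll, exactly one query is selected, for example q3. q3 is added to the final output and we roll again.
2. The second roll selects q5, which is added to the final output.
3. The third roll selects q3 again, but it is not added to the final output because it is already there.
We keep rolling like this until the final output holds 5 queries.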

The point is that queries with more occurrences have a higher chance of ending up in the final output.

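Unfortunately, my dataset is much bigger, and I cannot select queries this way: the expanded list of entries (123456x facebook, 3456x monkey, ...) would be too large.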
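The roll-and-skip procedure described above can be tried on the aggregated counts directly, without expanding them into a list. The following is only a minimal sketch on the 8-query toy table: it assumes that pandas' `DataFrame.sample` with `weights=` and `replace=False` is an acceptable stand-in for the manual re-rolling (it normalizes the weights itself and never returns the same row twice). The names `toy` and `picked` are made up for illustration, and `random_state` is fixed only so the run is repeatable.

```python
import pandas as pd

# Toy data from the example above: 8 queries, 20 occurrences in total.
toy = pd.DataFrame({
    "query": ["q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"],
    "total": [1, 1, 5, 1, 2, 4, 4, 2],
})

# Weighted draw without replacement: a query can be picked at most once,
# and queries with a larger "total" are more likely to be picked.
picked = toy.sample(n=5, weights="total", replace=False, random_state=0)

print(picked["query"].tolist())
```

Because the weights are read from the `total` column, the memory cost depends on the number of distinct queries rather than on the sum of the counts.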

python

pandas

dataframe

pyspark

sample

1 Answer

Stack Overflow user

Answered 2020-07-30 14:26:25

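If you accept that a representative sample may contain the same value more than once, you can compute the CDF (cumulative distribution function) of the counts to generate intervals, draw normalized samples with np.random, and use pd.cut to assign each sample to its corresponding interval: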

import numpy as np
import pandas as pd

def draw_representative_samples(df, names_col='query', counts_col='total', n_samples=10):
    # Compute the cumulative distribution function (CDF) from the counts, normalized to [0, 1]
    df_cdf = df[counts_col].cumsum() / df[counts_col].sum()

    # Draw uniform samples in [0, 1)
    samples = np.random.rand(n_samples)

    # Assign each sample to the corresponding interval using the CDF
    return pd.cut(
        x=pd.Series(samples),
        # Prepend 0 as the left edge of the first interval
        # (Series.append was removed in pandas 2.0, so concatenate arrays instead)
        bins=np.concatenate(([0.0], df_cdf.to_numpy())),
        labels=df[names_col].to_list(),   # Label the samples using names_col
        include_lowest=True,              # Keep a sample that is exactly 0.0
        precision=20)                     # Precision used to store and display bin labels

We can try it with an example:

# Generating the dataframe from the example above
df = pd.DataFrame({'query':{0: 'facebook',  1: 'monkey',  2: 'iphone',  3: 'laptop', 
                            4: 'headset',  5: 'plates',  6: 'girl'},
                   'total':{0:123456, 1:3456, 2:54321, 3:1234, 4:3333, 5:4333, 6:11222}})

# getting our samples. Each run will give you different samples!
draw_representative_samples(df, n_samples=10).astype(str)

#> 0    facebook
#  1    facebook
#  2      iphone
#  3    facebook
#  4      iphone
#  5      iphone
#  6    facebook
#  7      monkey
#  8        girl
#  9      iphone
#  dtype: object
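The function above draws with replacement, so the same query can show up several times (facebook appears four times in the output). If each query must appear at most once, as the question asks, one option is the Efraimidis-Spirakis weighted sampling scheme: give every row a random key `log(u) / weight` and keep the rows with the largest keys. The sketch below is not part of the original answer. The function name `weighted_sample_unique` and its `seed` parameter are hypothetical, and it assumes every count is strictly positive.

```python
import numpy as np
import pandas as pd

def weighted_sample_unique(df, counts_col="total", n_samples=5, seed=None):
    """Weighted sample without replacement (Efraimidis-Spirakis keys)."""
    rng = np.random.default_rng(seed)
    # One uniform draw per row; 1 - random() lies in (0, 1], so log() is finite.
    u = 1.0 - rng.random(len(df))
    # Key in log form: log(u) / weight. Assumes all weights are > 0.
    keys = np.log(u) / df[counts_col].to_numpy(dtype=float)
    # The rows with the largest keys form the sample, largest key first.
    top = np.argsort(keys)[-n_samples:][::-1]
    return df.iloc[top]

queries = pd.DataFrame({
    "query": ["facebook", "monkey", "iphone", "laptop", "headset", "plates", "girl"],
    "total": [123456, 3456, 54321, 1234, 3333, 4333, 11222],
})

unique_sample = weighted_sample_unique(queries, n_samples=5, seed=42)
print(unique_sample["query"].tolist())
```

Each key depends only on its own row, so the same idea could plausibly be expressed as a computed column followed by a top-n selection in PySpark; that variant is an untested assumption here.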

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/63169798
