How to set the values of a dataframe slice faster?

Stack Overflow user
Asked 2021-04-01 13:00:35
Answers: 1 · Views: 77 · Followers: 0 · Votes: 0

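I have a huge dataset with hundreds of thousands of rows for each client. I want to summarize this data into another dataframe in which a single row holds the summary of all rows for that client.

The problem is that this is not the only code: there are more than 1,000 lines of similar code, and it takes a very long time to execute. When I run the same logic in R, it is about 10 times faster. The R code is attached for reference.

Is there a way to make the Python code as fast as the R code?

Python code: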

    for i in range(len(client)):
        print(i)
        
        sub = data.loc[data['Client Name']==client['Client Name'][i],:]
        client['requests'][i] = len(sub)
        client['ppt_req'][i] = len(sub)/sub['CID'].nunique()
    
        client['approval'][i] = (((sub['verify']=='Yes').sum())/client['requests'][i])*100
        client['denial'][i] = (((sub['verify']=='No').sum())/client['requests'][i])*100
    
        client['male'][i] = (((sub['gender']=='Male').sum())/client['requests'][i])*100
        client['female'][i] = (((sub['gender']=='Female').sum())/client['requests'][i])*100

R code:

for(i in 1: nrow(client))
{print(i)
  #i=1
  sub<-subset(data,data$Client.Name==client$Client.Name[i])
 
  
  client$requests[i]<-nrow(sub)
  client$ppt_req[i]<-nrow(sub)/(length(unique(sub$CID)))
  client$approval[i]<-((as.numeric(table(sub$verify=="Yes")["TRUE"]))/client$requests[i])*100
  
  client$denial[i]<-((as.numeric(table(sub$verify=="No")["TRUE"]))/client$requests[i])*100
  client$male[i]<-((as.numeric(table(sub$gender)["Male"]))/client$requests[i])*100
  client$female[i]<-((as.numeric(table(sub$gender)["Female"]))/client$requests[i])*100
}

1 Answer

Stack Overflow user

Answered 2021-04-01 20:11:48

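Iterating with Python loops is very slow. However, the main problem comes from the line data.loc[data['Client Name']==client['Client Name'][i],:], which scans the whole dataframe data once for every client. This means the line ends up iterating over more than 100,000 strings more than 100,000 times, so it performs tens of billions of costly string comparisons. Not to mention that the group computations are repeated for each client.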
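To make that cost concrete, here is a minimal sketch on invented toy data. The frame contents and the names `comparisons`, `loop_counts` and `single_pass_counts` are assumptions for illustration and do not come from the question. It counts how many row comparisons the per-client boolean mask performs and checks that a single pass over the column gives the same per-client row counts.

```python
import pandas as pd

# Toy stand-ins for the question's frames (values invented for illustration).
data = pd.DataFrame({
    "Client Name": ["A", "B", "A", "C", "B", "A"],
    "CID": [1, 2, 1, 3, 4, 5],
})
client = pd.DataFrame({"Client Name": ["A", "B", "C"]})

comparisons = 0
loop_counts = {}
for name in client["Client Name"]:
    # The mask compares *every* row of `data` against one client name.
    mask = data["Client Name"] == name
    comparisons += len(mask)
    loop_counts[name] = int(mask.sum())

# A single pass over the column produces the same per-client row counts.
single_pass_counts = data["Client Name"].value_counts().to_dict()

print(comparisons)  # len(client) * len(data)
```

The loop's work grows with the number of clients multiplied by the number of rows, whereas the single pass grows only with the number of rows.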

You can solve this by using a groupby on the client name, followed by a merge.

Here is a sketch of the code (untested):

import pandas as pd

# If `data` contains many more client names than `client` does, one can
# filter `data` before applying the `groupby` below, e.g. by keeping only the
# rows whose name is in client['Client Name'].unique()

# Build a compact dataframe with one summary row for each client name
# that appears in `data`.
clientDataInfos = pd.DataFrame([
    {
        'Client Name': name,  # key column, needed by the merge below
        'requests': len(group),
        'ppt_req': len(group) / group['CID'].nunique(),
        'approval': (((group['verify']=='Yes').sum()) / len(group)) * 100,
        'denial': (((group['verify']=='No').sum()) / len(group)) * 100,
        'male': (((group['gender']=='Male').sum()) / len(group)) * 100,
        # Assumes the column is 'gender' and the label is 'Female', as in the R code.
        'female': (((group['gender']=='Female').sum()) / len(group)) * 100
    } for name, group in data.groupby('Client Name')
])

# Extend `client` with the precomputed information in `clientDataInfos`.
# The added columns should not already exist in `client`.
client = client.merge(clientDataInfos, on='Client Name')
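As a possible variation on the sketch above, the per-group Python loop can be avoided entirely with pandas named aggregation. This is a minimal sketch on invented toy data, not the answer author's method. The helper columns (`is_yes`, `is_no`, `is_male`, `is_female`, `unique_cids`), the names `flags`, `summary` and `result`, the `'Female'` label and the `how="left"` merge are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the question's frames (values invented for illustration).
data = pd.DataFrame({
    "Client Name": ["A", "A", "A", "B", "B"],
    "CID": [1, 1, 2, 3, 4],
    "verify": ["Yes", "No", "Yes", "No", "No"],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
})
client = pd.DataFrame({"Client Name": ["A", "B", "C"]})

# Optional pre-filter: keep only rows for clients listed in `client`.
data = data[data["Client Name"].isin(client["Client Name"].unique())]

# Boolean helper columns: the mean of a boolean column is the share of True.
flags = data.assign(
    is_yes=data["verify"].eq("Yes"),
    is_no=data["verify"].eq("No"),
    is_male=data["gender"].eq("Male"),
    is_female=data["gender"].eq("Female"),
)

# One grouped pass computes every statistic for every client.
summary = flags.groupby("Client Name").agg(
    requests=("CID", "size"),
    unique_cids=("CID", "nunique"),
    approval=("is_yes", "mean"),
    denial=("is_no", "mean"),
    male=("is_male", "mean"),
    female=("is_female", "mean"),
)
summary["ppt_req"] = summary["requests"] / summary["unique_cids"]
pct_cols = ["approval", "denial", "male", "female"]
summary[pct_cols] = summary[pct_cols] * 100  # shares -> percentages
summary = summary.drop(columns="unique_cids").reset_index()

# A left merge keeps clients that have no rows in `data` (their stats are NaN).
result = client.merge(summary, on="Client Name", how="left")
print(result)
```

The left merge is a design choice: the default inner merge would instead drop clients that never appear in `data`, so pick whichever matches the intended output.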
Votes: 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/66905022