Q: Counting occurrences of pairs of elements in a DataFrame column

Stack Overflow user

Asked 2021-04-06 16:20:14

Answers 1 · Views 52 · Followers 0 · Votes 0

Here is a sample of the dataset I am working with:

{'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
 'director': {0: nan,
  1: 'Jorge Michel Grau',
  2: 'Gilbert Chan',
  3: 'Shane Acker',
  4: 'Robert Luketic'},
 'country': {0: 'Brazil',
  1: 'Mexico',
  2: 'Singapore',
  3: 'Poland, United States',
  4: 'Norway, Poland, United States'},
 'rating': {0: 'TV-MA', 1: 'TV-MA', 2: 'R', 3: 'PG-13', 4: 'PG-13'}}
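
To reproduce the sample, the dictionary above can be loaded back into a DataFrame. This is a minimal sketch, assuming the dictionary was produced by `df.to_dict()` and that the bare `nan` stands for `numpy.nan`; the variable names `data` and `df` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; nan is written as np.nan (assumption).
data = {
    'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
    'director': {0: np.nan,
                 1: 'Jorge Michel Grau',
                 2: 'Gilbert Chan',
                 3: 'Shane Acker',
                 4: 'Robert Luketic'},
    'country': {0: 'Brazil',
                1: 'Mexico',
                2: 'Singapore',
                3: 'Poland, United States',
                4: 'Norway, Poland, United States'},
    'rating': {0: 'TV-MA', 1: 'TV-MA', 2: 'R', 3: 'PG-13', 4: 'PG-13'},
}

# Each inner dict maps row index -> value, so this rebuilds the five rows.
df = pd.DataFrame(data)
print(df)
```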

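What I want to do is count the number of times two countries (or even more than two, if possible) collaborated on a movie or TV show. With the sample provided, I would find that the United States and Poland collaborated twice, the United States and Norway collaborated once, and no other countries collaborated. Here is the code I managed to write: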

from collections import Counter

import pandas as pd

# This function would count the number of occurrences of each country in the column
def count(data, column) :
    return Counter([thing.strip() for thing in data[column].fillna('missing') for thing in thing.split(',')])

# And this one would count the occurrences of couples of countries together
def count_tuple(data, column) :
    a, b = zip(*count(data, column).most_common())
    s = pd.DataFrame(columns=a, index=a)
    
    for l in a :
        mask = data[column].fillna('missing').apply(lambda z: l in z)
        df = data[mask]
        c, d = zip(*count(df, column).most_common())
        for k in c :
            if k !='' :
                occur = count(df, column)[k]
                s.loc[l,k] = occur
    return s.fillna(0)

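This function returns a DataFrame with the number of times every two countries occur together. As usual, I don't think this approach is efficient. Is there another way to do this?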

python

pandas

dataframe

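1 Answer

Stack Overflow user

Answered 2021-04-06 16:41:09

Here is a solution that builds all combinations of the values in the country column with a nested list comprehension. Each combination is converted to a frozenset before counting, so order does not matter: (United States, Poland) is counted as the same combination as (Poland, United States):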

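from itertools import chain, combinations
from collections import Counter

import pandas as pd

# https://stackoverflow.com/a/5898031/2901002
def all_subsets(ss):
    # All combinations of ss that contain at least two elements
    return list(chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1))))

# frozensets make the order of countries within a combination irrelevant
L = [frozenset(y) for x in df['country'].fillna('missing')
                  for y in all_subsets(x.split(', '))]
print(L)

out = Counter(L)

# Use a new name so the original df is not overwritten
result = pd.DataFrame({'col1': list(out.keys()), 'col2': list(out.values())})
print(result)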
                              col1  col2
0          (United States, Poland)     2
1                 (Norway, Poland)     1
2          (United States, Norway)     1
3  (United States, Norway, Poland)     1
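
To make the two ingredients of this approach concrete, the sketch below shows what `all_subsets` returns for a single row and why frozensets merge differently ordered pairs into one counter key. It restates `all_subsets` with `chain.from_iterable` (equivalent behavior, assumed here for readability); the variable names `countries`, `subsets`, `a`, `b`, and `counts` are illustrative only.

```python
from collections import Counter
from itertools import chain, combinations

def all_subsets(ss):
    """Return every combination of ss that has at least two elements."""
    sizes = range(2, len(ss) + 1)
    return list(chain.from_iterable(combinations(ss, n) for n in sizes))

# One row of the 'country' column, split into individual countries.
countries = "Norway, Poland, United States".split(", ")
subsets = all_subsets(countries)
print(subsets)  # three pairs followed by the single triple

# A row with one country yields no combinations, so it adds nothing to the count.
print(all_subsets(["Brazil"]))

# frozensets compare equal regardless of element order,
# so both spellings of the pair land on the same Counter key.
a = frozenset(("United States", "Poland"))
b = frozenset(("Poland", "United States"))
counts = Counter([a, b])
print(a == b, counts[a])
```

This is also why rows filled with 'missing' do not show up in the result: a single-element list has no subsets of size two or more.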

Alternatively, you can sort the values and store them as tuples:

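from itertools import chain, combinations
from collections import Counter

import pandas as pd

def all_subsets(ss):
    # All combinations of ss that contain at least two elements
    return list(chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1))))

# Sorting each combination gives one canonical tuple per group of countries
L = [tuple(sorted(y)) for x in df['country'].fillna('missing')
                  for y in all_subsets(x.split(', '))]
print(L)

out = Counter(L)

# Use a new name so the original df is not overwritten
result = pd.DataFrame({'col1': list(out.keys()), 'col2': list(out.values())})
print(result)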
                              col1  col2
0          (Poland, United States)     2
1                 (Norway, Poland)     1
2          (Norway, United States)     1
3  (Norway, Poland, United States)     1
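
If only pairs are needed, and a square table like the one the question builds is preferred, the same counting idea can be restricted to `combinations(..., 2)` and then spread into a symmetric DataFrame. The following is a sketch under stated assumptions, not part of the original answer: the Series `country` is a hypothetical stand-in for `df['country']`, and the names `pair_counts`, `names`, and `matrix` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical stand-in for the question's 'country' column.
country = pd.Series([
    "Brazil",
    "Mexico",
    "Singapore",
    "Poland, United States",
    "Norway, Poland, United States",
])

# Count each unordered pair once per row; sorting makes (a, b) canonical,
# and the set removes a country listed twice in the same row.
pair_counts = Counter(
    pair
    for value in country.dropna()
    for pair in combinations(sorted({c.strip() for c in value.split(",")}), 2)
)

# Build a symmetric co-occurrence table from the pair counts.
# Countries that never appear in a pair are left out of the table.
names = sorted({name for pair in pair_counts for name in pair})
matrix = pd.DataFrame(0, index=names, columns=names)
for (a, b), n in pair_counts.items():
    matrix.loc[a, b] = n
    matrix.loc[b, a] = n

print(pair_counts)
print(matrix)
```

The diagonal stays at zero here because a country is not paired with itself; the question's own function fills the diagonal with each country's total count instead, so the two tables differ on that point.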

Votes 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/66965052
