Here is a sample of the dataset I am working with:
{'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
'director': {0: nan,
1: 'Jorge Michel Grau',
2: 'Gilbert Chan',
3: 'Shane Acker',
4: 'Robert Luketic'},
'country': {0: 'Brazil',
1: 'Mexico',
2: 'Singapore',
3: 'Poland, United States',
4: 'Norway, Poland, United States'},
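'rating': {0: 'TV-MA', 1: 'TV-MA', 2: 'R', 3: 'PG-13', 4: 'PG-13'}}

What I want to do is count how many times two countries (or even more than two, if possible) collaborated on a movie/TV show. With the sample provided, I would find that the United States and Poland collaborated twice, the United States and Norway once (as did Norway and Poland), and no other countries collaborated. Here is the code I managed to write: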
from collections import Counter

import pandas as pd

# This function counts the number of occurrences of each country in the column
def count(data, column):
    return Counter(
        name.strip()
        for cell in data[column].fillna('missing')
        for name in cell.split(',')
    )

# And this one counts the occurrences of pairs of countries together
def count_tuple(data, column):
    a, b = zip(*count(data, column).most_common())
    s = pd.DataFrame(columns=a, index=a)
    for l in a:
        # Keep only the rows whose country string contains l (substring match)
        mask = data[column].fillna('missing').apply(lambda z: l in z)
        df = data[mask]
        c, d = zip(*count(df, column).most_common())
        for k in c:
            if k != '':
                occur = count(df, column)[k]
                s.loc[l, k] = occur
    return s.fillna(0)

This function returns a DataFrame with the number of times each pair of countries appears together. As usual, I don't think this approach is efficient. Is there another way to do this?
Answered 2021-04-06 16:41:09
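Here is a solution that builds all combinations of the values in the country column with a nested list comprehension. Each combination is converted to a frozenset for counting, so order does not matter: (United States, Poland) counts the same as (Poland, United States).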
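from itertools import chain, combinations
from collections import Counter

import pandas as pd

# https://stackoverflow.com/a/5898031/2901002
def all_subsets(ss):
    # All combinations of 2 or more elements of ss
    return list(chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1))))

# frozensets make the order of the countries irrelevant when counting
L = [frozenset(y) for x in df['country'].fillna('missing')
                  for y in all_subsets(x.split(', '))]
print(L)

out = Counter(L)
# Stored under a new name so the input df is not overwritten
result = pd.DataFrame({'col1': out.keys(), 'col2': out.values()})
print(result)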
                              col1  col2
0          (United States, Poland)     2
1                 (Norway, Poland)     1
2          (United States, Norway)     1
3  (United States, Norway, Poland)     1

Alternatively, you can sort the values into tuples:
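from itertools import chain, combinations
from collections import Counter

import pandas as pd

def all_subsets(ss):
    # All combinations of 2 or more elements of ss
    return list(chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1))))

# Sorting each combination gives one canonical tuple per group of countries
L = [tuple(sorted(y)) for x in df['country'].fillna('missing')
                      for y in all_subsets(x.split(', '))]
print(L)

out = Counter(L)
# Stored under a new name so the input df is not overwritten
result = pd.DataFrame({'col1': out.keys(), 'col2': out.values()})
print(result)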
                              col1  col2
0          (Poland, United States)     2
1                 (Norway, Poland)     1
2          (Norway, United States)     1
3  (Norway, Poland, United States)     1

https://stackoverflow.com/questions/66965052
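Since the question asked for a DataFrame indexed by country, the same counting idea can be restricted to pairs and reshaped into a symmetric matrix. The following is only a sketch under assumptions: the helper names `pair_counts` and `cooccurrence_matrix` and the sample `data` frame are hypothetical and not part of the answer above, missing values are simply skipped, and only countries that appear in at least one pair end up in the matrix.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Sample data mirroring the 'country' column from the question (assumed name: data).
data = pd.DataFrame({
    'country': ['Brazil', 'Mexico', 'Singapore',
                'Poland, United States',
                'Norway, Poland, United States'],
})

def pair_counts(series):
    """Count how often each unordered pair of countries shares a row."""
    counts = Counter()
    for cell in series.dropna():
        # Split on commas, strip whitespace, drop empty names and duplicates.
        names = sorted({name.strip() for name in cell.split(',') if name.strip()})
        # Sorted input means each pair comes out in one canonical order.
        counts.update(combinations(names, 2))
    return counts

def cooccurrence_matrix(series):
    """Build a symmetric DataFrame of pair counts (the diagonal stays 0)."""
    counts = pair_counts(series)
    labels = sorted({name for pair in counts for name in pair})
    matrix = pd.DataFrame(0, index=labels, columns=labels)
    for (a, b), n in counts.items():
        matrix.loc[a, b] = n
        matrix.loc[b, a] = n
    return matrix

counts = pair_counts(data['country'])
matrix = cooccurrence_matrix(data['country'])
print(matrix)
```

With the sample data, `matrix.loc['Poland', 'United States']` is 2 and `matrix.loc['Norway', 'Poland']` is 1. Splitting on `','` and stripping each name, rather than splitting on `', '`, tolerates inconsistent spacing in the column.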