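Below are the first few rows of a larger dataframe I'm working with. I have code (thanks to another user) that combines consecutive words for as long as the speaker does not change, keeping the "start" value of the first word and the "stop" value of the last word in each combination. This code:
df.groupby([(df['speaker'] != df['speaker'].shift()).cumsum(), df['speaker']], as_index=False).agg({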
'word': ' '.join,
'start': 'min',
'stop': 'max'
})
turns this dataframe:
        word  start   stop  speaker
0        but   2.72   2.85        2
1     that's   2.85   3.09        2
2    alright   3.09   3.47        2
3      we'll   8.43   8.69        1
4       have   8.69   8.97        1
5         to   8.97   9.07        1
6       okay   9.19  10.01        2
7       sure  10.02  11.01        2
8      what?  11.02  12.00        1
into this:
                 word  start   stop  speaker
0  but that's alright   2.72   3.47        2
1       we'll have to   8.43   9.07        1
2           okay sure   9.19  11.01        2
3               what?  11.02  12.00        1
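For anyone wondering why the grouping key works, here is a minimal sketch that rebuilds the nine sample rows by hand (instead of reading them from a file) and prints the intermediate run labels. The variable names `changed` and `run_id` are mine, introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "word": ["but", "that's", "alright", "we'll", "have", "to", "okay", "sure", "what?"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00],
    "speaker": [2, 2, 2, 1, 1, 1, 2, 2, 1],
})

# True wherever the speaker differs from the previous row
# (the first row compares against NaN, so it is always True).
changed = df["speaker"] != df["speaker"].shift()

# The running total of those flags gives every uninterrupted run its own label.
run_id = changed.cumsum()
print(run_id.tolist())

# Grouping on the run label joins the words of each run.
joined = df.groupby(run_id)["word"].agg(" ".join).tolist()
print(joined)
```

Grouping on `run_id` rather than on `speaker` alone is what keeps the two separate turns of speaker 2 from being merged into one row.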
This works great. However, I would like to limit the total number of words that get combined into each new "word" entry. Specifically, I would like each combined entry to average around 4 words.
For example, if I have this:
     word  start   stop  speaker
0     but   2.72   2.85        2
1  that's   2.85   3.09        2
2 alright   3.09   3.47        2
3   we'll   8.43   8.69        1
4    have   8.69   8.97        1
5      to   8.97   9.07        1
6    okay   9.19  10.01        2
7    sure  10.02  11.01        2
8   what?  11.02  12.00        1
9       i  12.01  13.00        2
10   want  13.01  14.00        2
11     to  14.01  15.00        2
12     go  15.01  16.00        2
13  there  16.01  17.00        2
14  where  17.01  18.00        1
15     is  18.01  19.00        1
16     it  19.01  20.00        1
17    you  20.01  21.00        1
18  would  21.01  22.00        1
19   like  22.01  23.00        1
20     to  23.01  24.00        1
21     go  24.01  25.00        1
I would get this:
                 word  start   stop  speaker
0  but that's alright   2.72   3.47        2
1       we'll have to   8.43   9.07        1
2           okay sure   9.19  11.01        2
3               what?  11.02  12.00        1
4  I want to go there  12.01  17.00        2
5     where is it you  17.01  21.00        1
6    would like to go  21.01  25.00        1
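Thanks!
Posted on 2019-06-20 19:44:49
Building on the code you ended up with, I think I have this working. The idea is to split each speaker's run into fixed-size partitions and group on those instead of on "speaker" alone.
Note that my example uses 2 words per group rather than 4, because that is easier to demonstrate with the sample data.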
import pandas as pd

z = pd.read_clipboard()

# Label each uninterrupted run of rows spoken by the same speaker.
run_id = (z['speaker'] != z['speaker'].shift()).cumsum()
# Position of each word within its run, cut into chunks of 2 words: 0, 0, 1, 1, ...
chunk = z.groupby(run_id).cumcount() // 2
# Combine speaker and chunk number into one key that changes every 2 words.
z['speaker2'] = z['speaker'].astype(str) + '_' + chunk.astype(str)

z.groupby([(z['speaker2'] != z['speaker2'].shift()).cumsum(), z['speaker2']], as_index=False).agg({
'word': ' '.join,
'start': 'min',
'stop': 'max'
})

Output:
         word  start   stop
0  but that's   2.72   3.09
1     alright   3.09   3.47
2  we'll have   8.43   8.97
3          to   8.97   9.07
4   okay sure   9.19  11.01
5       what?  11.02  12.00
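If the goal is groups that average around 4 words rather than a hard cut every N words, one possible variation is to decide per run how many pieces it should be split into and then spread the words evenly over those pieces. The sketch below is only an illustration: the function name `combine_words`, the `target` parameter and the rounding rule are my own choices and not something from the question, and it assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

def combine_words(df, target=4):
    """Merge consecutive same-speaker words into groups of roughly `target` words."""
    # Label each uninterrupted run of one speaker.
    run_id = (df["speaker"] != df["speaker"].shift()).cumsum()
    pos = df.groupby(run_id).cumcount()           # 0-based position inside the run
    size = run_id.map(run_id.value_counts())      # length of the run each row belongs to
    # Pieces per run: run length divided by target, rounded, but never fewer than 1.
    pieces = (size / target).round().clip(lower=1).astype(int)
    # Spread the positions evenly over the pieces (e.g. 8 words, 2 pieces -> 4 + 4).
    chunk = pos * pieces // size
    return (
        df.groupby([run_id.rename("run"), chunk.rename("chunk")])
          .agg(word=("word", " ".join), start=("start", "min"),
               stop=("stop", "max"), speaker=("speaker", "first"))
          .reset_index(drop=True)
    )

# The 22 sample rows from the question, typed in by hand.
sample = pd.DataFrame({
    "word": ("but that's alright we'll have to okay sure what? i want to go there "
             "where is it you would like to go").split(),
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02, 12.01, 13.01,
              14.01, 15.01, 16.01, 17.01, 18.01, 19.01, 20.01, 21.01, 22.01, 23.01, 24.01],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00, 13.00, 14.00,
             15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00, 23.00, 24.00, 25.00],
    "speaker": [2] * 3 + [1] * 3 + [2] * 2 + [1] + [2] * 5 + [1] * 8,
})

result = combine_words(sample, target=4)
print(result)
```

On the sample data this should keep the five-word run of speaker 2 together and split the eight-word run of speaker 1 into two groups of four, which matches the seven rows asked for in the question (with the lowercase "i" from the input).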
https://stackoverflow.com/questions/56692408