I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'text': ['set an alarm for [time : two hours from now]','wake me up at [time : nine am] on [date : friday]','check email from [person : john]']})
print(df)

Original data:
text
0 set an alarm for [time : two hours from now]
1 wake me up at [time : nine am] on [date : friday]
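2 check email from [person : john]

If a bracketed list contains more than one value, I want to repeat the brackets and the label (date, time, or person) for every value in it.

Expected output: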
new_text
0 set an alarm for [time : two] [time : hours] [time : from] [time : now]
1 wake me up at [time : nine] [time : am] on [date : friday]
2 check email from [person : john]

So far I have tried separating the bracketed lists from the original column, but I don't know how to continue.
df['separated_list'] = df.text.str.split(r"\s(?![^[]*])|[|]").apply(lambda x: [y for y in x if '[' in y])

Posted on 2022-10-12 13:32:31
You can use a regex replacement with a custom function:

df['new_text'] = df.text.str.replace(
    r"\[([^\[\]]*?)\s*:\s*([^\[\]]*)\]",
    lambda m: ' '.join([f'[{m.group(1)} : {x}]'
                        for x in m.group(2).split()]),  # new chunk for each word
    regex=True)

Output:
text new_text
0 set an alarm for [time : two hours from now] set an alarm for [time : two] [time : hours] [time : from] [time : now]
1 wake me up at [time : nine am] on [date : friday] wake me up at [time : nine] [time : am] on [date : friday]
2 check email from [person : john] check email from [person : john]

Posted on 2022-10-12 17:13:40
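Use a lookbehind and a lookahead to find the [ and ], match the content between them with a repeated group, and then split that content as follows: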
import re
import pandas as pd

df = pd.DataFrame({'text': ['set an alarm for [time : two hours from now]','wake me up at [time : nine am] on [date : friday]','check email from [person : john]']})
data = df['text']
for item in data:
    print(item)
    # The lookbehind and lookahead keep the brackets out of the match itself
    matches = re.findall(r'(?<=\[)(?:[\w+\s*]+\:[\w+\s*]+)(?=\])', item)
    for match in matches:
        # Split "label : content" into its two parts
        parts = match.split(":")
        print(parts)

Output:
set an alarm for [time : two hours from now]
['time ', ' two hours from now']
wake me up at [time : nine am] on [date : friday]
['time ', ' nine am']
['date ', ' friday']
check email from [person : john]
['person ', ' john']

https://stackoverflow.com/questions/74042649
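The loop in the second answer stops at printing each label and its content. As a minimal sketch of how the same find-and-split idea might be carried through to the new_text column from the question, the helper below rebuilds each bracketed chunk word by word with re.sub. The function name repeat_label, the constant CHUNK, and the simplified pattern are illustrative assumptions and are not part of either answer.

```python
import re

# Simplified pattern (an assumption for this sketch): a label, a colon,
# then the content, all inside one pair of square brackets.
CHUNK = re.compile(r"\[\s*(\w+)\s*:\s*([^\[\]]*?)\s*\]")


def repeat_label(text):
    """Repeat the label of every bracketed chunk for each word it contains."""
    def expand(match):
        label, content = match.group(1), match.group(2)
        words = content.split()
        if not words:
            # Nothing to repeat: leave an empty chunk untouched
            return match.group(0)
        return " ".join(f"[{label} : {word}]" for word in words)

    return CHUNK.sub(expand, text)


rows = [
    "set an alarm for [time : two hours from now]",
    "wake me up at [time : nine am] on [date : friday]",
    "check email from [person : john]",
]
for row in rows:
    print(repeat_label(row))
```

With the DataFrame from the question, the helper could be applied as df['new_text'] = df['text'].apply(repeat_label). Text outside the brackets is passed through unchanged, so single-word chunks such as [person : john] come back as they were.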