嗨,我正在寻找从不同产品的数量从Df列到一个新的列。目前,这些数字是在产品类型之后出现的。
数据如下:
PRODUCTS
PULSAR AT 20 MG ORAL 30 TAB RECUB
LIPITOR 40 MG 1+1 ORAL 15 TAB
LOFTYL 150 MG ORAL 30 TAB
SOMAZINA 500 MG ORAL 10 COMP RECUB
LOFTYL 30 TAB 150 MG ORAL
*Keeps going more entries...*
我的功能如下:
df['PZ'] = df['PRODUCTS'].str.extract('([\d]*\.*[\d]+)\s*[tab|cap|grag|past|sob]',flags=re.IGNORECASE)
产品可以是TAB,COMP,AMP,SOB,过去,GRAG .和其他人
我想得到这样的东西:
PRODUCTS PZ
PULSAR AT 20 MG ORAL 30 TAB RECUB 30
LIPITOR 40 MG 1+1 ORAL 15 TAB 15
LOFTYL 150 MG ORAL 30 TAB 30
SOMAZINA 500 MG ORAL 10 COMP RECUB 10
LOFTYL 30 TAB 150 MG ORAL 30
我可以在我的行中做些什么来得到以下信息呢?
谢谢你读到我和你的帮助。
发布于 2021-07-29 13:56:21
您可以使用
import pandas as pd
df = pd.DataFrame({'PRODUCTS':['PULSAR AT 20 MG ORAL 30 TAB RECUB','LIPITOR 40 MG 1+1 ORAL 15 TAB','LOFTYL 150 MG ORAL 30 TAB','SOMAZINA 500 MG ORAL 10 COMP RECUB','LOFTYL 30 TAB 150 MG ORAL']})
rx = r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)'
df['PZ'] = df['PRODUCTS'].str.extract(rx)
>>> df
PRODUCTS PZ
0 PULSAR AT 20 MG ORAL 30 TAB RECUB 30
1 LIPITOR 40 MG 1+1 ORAL 15 TAB 15
2 LOFTYL 150 MG ORAL 30 TAB 30
3 SOMAZINA 500 MG ORAL 10 COMP RECUB 10
4 LOFTYL 30 TAB 150 MG ORAL 30
>>>
如果像tab
、cap
等这样的单词是完整的单词,并且不能是较长的单词的一部分,那么您需要在模式的末尾添加一个单词边界,即rx = r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)\b'
。
见regex演示。详细信息
(?i)
-一个不区分大小写的内联修饰符(\d*\.?\d+)
-第1组:零或多个数字,一个可选的.
,然后一个或多个数字\s*
-零或更多空格字符(?:tab|cap|grag|past|sob|comp)
-一个非捕获组(为了不干扰Series.str.extract
输出),与其内的任何替代子字符串相匹配。\b
-一个词的边界。发布于 2021-07-29 14:26:30
也许.
给出了一个数据格式(注:我使产品在一行中出现了两次,以防万一发生这种情况).
PRODUCTS
0 PULSAR AT 20 MG ORAL 30 GRAG RECUB
1 LIPITOR 40 MG 1+1 ORAL 15 TAB
2 LOFTYL 150 GRAG ORAL 30 TAB
3 SOMAZINA 500 MG ORAL 10 COMP RECUB
4 LOFTYL 30 TAB 150 MG ORAL
5 *Keeps going more entries...*
代码:
import pandas as pd
import re
data = {'PRODUCTS' : ["PULSAR AT 20 MG ORAL 30 GRAG RECUB", "LIPITOR 40 MG 1+1 ORAL 15 TAB", \
"LOFTYL 150 GRAG ORAL 30 TAB", "SOMAZINA 500 MG ORAL 10 COMP RECUB", \
"LOFTYL 30 TAB 150 MG ORAL" , "*Keeps going more entries...*"]}
df = pd.DataFrame(data)
# maintain a list of products to find
products = ['TAB', 'COMP', 'AMP', 'SOB', 'PAST', 'GRAG']
def getProduct(x):
found = list()
for product in products:
pattern = r'(\d+)' + ' ' + str(product)
found.append(re.findall(pattern, x))
found = list(filter(None, found))
found = [item for sublist in found for item in sublist]
found = ", ".join(str(item) for item in found)
return found
df['PZ'] = [getProduct(row) for row in df['PRODUCTS']]
print(df)
输出:
PRODUCTS PZ
0 PULSAR AT 20 MG ORAL 30 GRAG RECUB 30
1 LIPITOR 40 MG 1+1 ORAL 15 TAB 15
2 LOFTYL 150 GRAG ORAL 30 TAB 30, 150
3 SOMAZINA 500 MG ORAL 10 COMP RECUB 10
4 LOFTYL 30 TAB 150 MG ORAL 30
5 *Keeps going more entries...*
https://stackoverflow.com/questions/68583167
复制相似问题