问熊猫栏中的提取正则表达式
EN

Stack Overflow用户

提问于 2021-07-29 13:51:31

回答 2查看 80关注 0票数 1

嗨，我正在寻找从不同产品的数量从Df列到一个新的列。目前，这些数字是在产品类型之后出现的。

数据如下：

PRODUCTS
PULSAR AT 20 MG ORAL 30 TAB RECUB
LIPITOR 40 MG 1+1 ORAL 15 TAB
LOFTYL 150 MG ORAL 30 TAB
SOMAZINA 500 MG ORAL 10 COMP RECUB
LOFTYL 30 TAB 150 MG ORAL 
*Keeps going more entries...*

我的功能如下：

df['PZ'] = df['PRODUCTS'].str.extract('([\d]*\.*[\d]+)\s*[tab|cap|grag|past|sob]',flags=re.IGNORECASE)

产品可以是TAB，COMP，AMP，SOB，过去，GRAG .和其他人

我想得到这样的东西：

PRODUCTS                              PZ
PULSAR AT 20 MG ORAL 30 TAB RECUB     30
LIPITOR 40 MG 1+1 ORAL 15 TAB         15
LOFTYL 150 MG ORAL 30 TAB             30
SOMAZINA 500 MG ORAL 10 COMP RECUB    10
LOFTYL 30 TAB 150 MG ORAL             30

我可以在我的行中做些什么来得到以下信息呢？

谢谢你读到我和你的帮助。

python

regex

pandas

series

回答 2

Stack Overflow用户

回答已采纳

发布于 2021-07-29 13:56:21

您可以使用

import pandas as pd
df = pd.DataFrame({'PRODUCTS':['PULSAR AT 20 MG ORAL 30 TAB RECUB','LIPITOR 40 MG 1+1 ORAL 15 TAB','LOFTYL 150 MG ORAL 30 TAB','SOMAZINA 500 MG ORAL 10 COMP RECUB','LOFTYL 30 TAB 150 MG ORAL']})
rx = r'(?i)(\d*\.?\d+)\s*(?:tab|cap|grag|past|sob|comp)'
df['PZ'] = df['PRODUCTS'].str.extract(rx)
>>> df
                             PRODUCTS  PZ
0   PULSAR AT 20 MG ORAL 30 TAB RECUB  30
1       LIPITOR 40 MG 1+1 ORAL 15 TAB  15
2           LOFTYL 150 MG ORAL 30 TAB  30
3  SOMAZINA 500 MG ORAL 10 COMP RECUB  10
4           LOFTYL 30 TAB 150 MG ORAL  30
>>>

见regex演示。详细信息

(?i) -一个不区分大小写的内联修饰符
(\d*\.?\d+) -第1组:零或多个数字，一个可选的.，然后一个或多个数字
\s* -零或更多空格字符
(?:tab|cap|grag|past|sob|comp) -一个非捕获组(为了不干扰Series.str.extract输出)，与其内的任何替代子字符串相匹配。
\b -一个词的边界。

票数 0

Stack Overflow用户

发布于 2021-07-29 14:26:30

也许.

给出了一个数据格式(注:我使产品在一行中出现了两次，以防万一发生这种情况).

    PRODUCTS
0   PULSAR AT 20 MG ORAL 30 GRAG RECUB
1   LIPITOR 40 MG 1+1 ORAL 15 TAB
2   LOFTYL 150 GRAG ORAL 30 TAB
3   SOMAZINA 500 MG ORAL 10 COMP RECUB
4   LOFTYL 30 TAB 150 MG ORAL
5   *Keeps going more entries...*

代码：

import pandas as pd
import re

data = {'PRODUCTS' : ["PULSAR AT 20 MG ORAL 30 GRAG RECUB", "LIPITOR 40 MG 1+1 ORAL 15 TAB", \
                      "LOFTYL 150 GRAG ORAL 30 TAB", "SOMAZINA 500 MG ORAL 10 COMP RECUB", \
                      "LOFTYL 30 TAB 150 MG ORAL" , "*Keeps going more entries...*"]}

df = pd.DataFrame(data)

# maintain a list of products to find
products = ['TAB', 'COMP', 'AMP', 'SOB', 'PAST', 'GRAG']

def getProduct(x):
    found = list()
    for product in products:
        pattern = r'(\d+)' + ' ' + str(product)
        found.append(re.findall(pattern, x))
    found = list(filter(None, found))
    found = [item for sublist in found for item in sublist]
    found = ", ".join(str(item) for item in found)
    return found

df['PZ'] = [getProduct(row) for row in df['PRODUCTS']]

print(df)

输出：

    PRODUCTS                            PZ
0   PULSAR AT 20 MG ORAL 30 GRAG RECUB  30
1   LIPITOR 40 MG 1+1 ORAL 15 TAB       15
2   LOFTYL 150 GRAG ORAL 30 TAB         30, 150
3   SOMAZINA 500 MG ORAL 10 COMP RECUB  10
4   LOFTYL 30 TAB 150 MG ORAL           30
5   *Keeps going more entries...*