Q: Vectorized column-wise regex matching in pandas

Stack Overflow user

Asked 2020-04-26 05:41:37

3 answers · 159 views · 0 followers · 2 votes

Part 1

Suppose I have a dataset df like this:

x   | y     
----|--------
foo | 1.foo-ya
bar | 2.bar-ga
baz | 3.ha-baz
qux | None

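I want to keep the rows where y contains x strictly in the middle (neither at the start nor at the end, i.e. matching the pattern '^.+\w+.+$', which hits rows 1 and 2), excluding None/NaN: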

x   | y
----|-----
foo | 1.foo-ya
bar | 2.bar-ga

This is a typical pairwise string comparison, which is easy to do in SQL:

select x, y from df where y like concat('_%', x, '_%');

Or in R:

library(dplyr)
library(stringr)
library(glue)
df %>% filter(str_detect(y, glue('^.+{x}.+$')))

However, I am no pandas expert, and there does not seem to be a similarly simple "vectorized" regex matching method in pandas. I resorted to a lambda:

import pandas as pd
import re
df.loc[df.apply(lambda row: bool(re.search(
                '^.+' + row.x + '.+$', row.y)) 
       if row.x and row.y else False, axis=1), :]

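Is there a more elegant way to do this in pandas?

Part 2

In addition, I want to extract the leading numbers (1, 2, ...) from the matching records produced in Part 1: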

x   | y        |  z
----|----------|---
foo | 1.foo-ya |  1
bar | 2.bar-ga |  2

In R, I can do this with straightforward pipe wrangling:

df %>%
  filter(str_detect(y, glue('^.+{x}.+$'))) %>%
  mutate(z=str_replace(y, glue('^(\\d+)\\.{x}.+$'), '\\1') %>%
           as.numeric)

But in pandas, the lambda approach is the only one I know. Is there a "better" way?

a = df.loc[df.apply(lambda row: bool(
                re.search('^.+' + row.x + '.+$', row.y))
                if row.x and row.y else False, axis=1), 
       ['x', 'y']]
a['z'] = a.apply(lambda row: re.sub(
       r'^(\d+)\.' + row.x + '.+$', r'\1', row.y), axis=1).astype('int')
a

By the way, the assign method does not work:

df.loc[df.apply(lambda row: bool(re.search(
                '^.+' + row.x + '.+$', row.y))
                if row.x and row.y else False, axis=1), 
       ['x', 'y']].assign(z=lambda row: re.sub(
                r'^(\d+)\.' + row.x + '.+$', r'\1', row.y))

Thanks!

regex

pandas

dplyr

vectorization

3 Answers

Stack Overflow user

Answered 2020-04-26 08:29:00

pandas string operations are built on top of Python's string and re modules. Give this a try and see if it is what you are after:

import re
import numpy as np

# pair each value in column x with its value in column y,
# stripping surrounding whitespace
pairs = list(zip(df.x.str.strip(), df.y.str.strip()))

# find out if values in column x are in column y
# according to the pattern you wrote in the question;
# non-string values (None/NaN) are skipped instead of raising TypeError
pattern = [re.match(fr'^.+{a}.+$', b) if isinstance(b, str) else None
           for a, b in pairs]

match = [ent.group() if ent is not None else np.nan for ent in pattern]

# extract the digit immediately preceding the value in column x
ext = [re.search(fr'\d(?=\.{a})', b) if isinstance(b, str) else None
       for a, b in pairs]

extract = [ent.group() if ent is not None else np.nan for ent in ext]

df['match'], df['extract'] = match, extract

     x     y        match   extract
1   foo 1.foo-ya    1.foo-ya    1
2   bar 2.bar-ga    2.bar-ga    2
3   baz 3.ha-baz      NaN      NaN
4   qux    None       NaN      NaN
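
Since the comprehensions above are plain Python, both steps can also be folded into a single pass. The following is only a sketch and not part of the original answer: the helper name leading_number is hypothetical, re.escape is added on the assumption that x holds literal text rather than regex syntax, and it applies the stricter Part 2 pattern (digits, a dot, then x), which selects the same two rows on the sample data.

```python
import re

import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "x": ["foo", "bar", "baz", "qux"],
    "y": ["1.foo-ya", "2.bar-ga", "3.ha-baz", None],
})


def leading_number(x, y):
    """Hypothetical helper: return the leading digits of y when they are
    followed by a dot, then x, then at least one more character; else NaN."""
    if not isinstance(x, str) or not isinstance(y, str):
        return np.nan  # missing values never match
    m = re.match(r"^(\d+)\." + re.escape(x) + r".+$", y)
    return float(m.group(1)) if m else np.nan


# One pass over the paired columns, then drop the rows without a match
z = [leading_number(x, y) for x, y in zip(df["x"], df["y"])]
result = df.assign(z=z).dropna(subset=["z"])
print(result)
```

This is still a Python-level loop, so it is a readability improvement over the row-wise apply rather than true vectorization.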

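Votes: 1

Stack Overflow user

Answered 2020-04-26 11:55:29

Thanks for all the inspiring replies. I have to say that although Python excels in many ways, I prefer R when it comes to this kind of vectorized operation. So I reinvented the wheel for this case.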

import re
from typing import List

import numpy as np
import pandas as pd

def _recycle(string: pd.Series, pattern: pd.Series) -> list:
    """pad pattern with its last element, or truncate it, to len(string)
    """
    pattern = list(pattern)  # a Series has no .extend and no positional [-1]
    if pattern and len(string) > len(pattern):
        pattern.extend([pattern[-1]] * (len(string) - len(pattern)))
    return pattern[:len(string)]

def str_detect(string: pd.Series, pattern: pd.Series) -> List[bool]:
    """mimic str_detect in R
    """
    # non-string values (None/NaN) count as no match
    return [bool(re.match(y, x))
            if isinstance(x, str) and isinstance(y, str) else False
            for x, y in zip(string, _recycle(string, pattern))]

def str_extract(string: pd.Series, pattern: pd.Series) -> List[str]:
    """mimic str_extract in R
    """
    o = [re.search(y, x)
         if isinstance(x, str) and isinstance(y, str) else None
         for x, y in zip(string, _recycle(string, pattern))]

    return [x.group() if x else np.nan for x in o]

Then:

df.loc[str_detect(
    df['y'], '^.+' + df['x']+'.+$'), ['x', 'y']]
(df
  .assign(z=str_extract(df['y'], r'^(\d+)(?=\.' + df['x'] + ')'))
  .dropna(subset=['z'])
  .loc[:, ['x', 'y', 'z']])
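
The assign call above works because it receives a ready-made list. For comparison, when assign is given a callable, pandas passes it the whole DataFrame rather than one row at a time, which is why a row-style lambda fails there. A minimal sketch, assuming the two matching rows from the question; the names extract_z and seen are made up for illustration, and re.escape assumes x is literal text:

```python
import re

import pandas as pd

df = pd.DataFrame({
    "x": ["foo", "bar"],
    "y": ["1.foo-ya", "2.bar-ga"],
})

seen = []  # records the type of object that assign hands to the callable


def extract_z(frame):
    # `frame` is the entire DataFrame, so frame["x"] is a Series, not a str;
    # the callable therefore has to loop over the paired columns itself.
    seen.append(type(frame).__name__)
    return [int(re.sub(r"^(\d+)\." + re.escape(x) + r".+$", r"\1", y))
            for x, y in zip(frame["x"], frame["y"])]


out = df.assign(z=extract_z)
print(seen)
print(out)
```

Returning a list (or Series) of the same length as the frame is what lets assign attach the result as a new column.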

Votes: 0

Stack Overflow user

Answered 2021-06-25 00:28:11

Is this the kind of thing you are after? It almost replicates what you did in R:

>>> from numpy import vectorize
>>> from pipda import register_func
>>> from datar.all import f, tribble, filter, grepl, paste0, mutate, sub, as_numeric
[2021-06-24 17:27:16][datar][WARNING] Builtin name "filter" has been overriden by datar.
>>> 
>>> df = tribble(
...   f.x,   f.y,
...   "foo", "1.foo-ya",
...   "bar", "2.bar-ga",
...   "baz", "3.ha-baz",
...   "qux", None
... )
>>> 
>>> @register_func(None)
... @vectorize
... def str_detect(text, pattern):
...   return grepl(pattern, text)
... 
>>> @register_func(None)
... @vectorize
... def str_replace(text, pattern, replacement):
...   return sub(pattern, replacement, text)
... 
>>> df >> \
...   filter(str_detect(f.y, paste0('^.+', f.x, '.+$'))) >> \
...   mutate(z=as_numeric(str_replace(f.y, paste0(r'^(\d+)\.', f.x, '.+$'), r'\1')))
         x         y         z
  <object>  <object> <float64>
0      foo  1.foo-ya       1.0
1      bar  2.bar-ga       2.0
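
The same numpy.vectorize idea can be tried without datar. The sketch below is an assumption-laden illustration rather than this answer's method: _detect and str_detect are hypothetical names, and missing values are treated as non-matches. Note that the NumPy documentation describes vectorize as a convenience whose implementation is essentially a for loop, so it tidies the call site without making the matching faster.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": ["foo", "bar", "baz", "qux"],
    "y": ["1.foo-ya", "2.bar-ga", "3.ha-baz", None],
})


def _detect(text, pattern):
    # Treat missing values (None/NaN) as non-matches
    if not isinstance(text, str) or not isinstance(pattern, str):
        return False
    return re.search(pattern, text) is not None


# otypes is given explicitly so NumPy does not have to infer the output
# type by calling the function on the first element
str_detect = np.vectorize(_detect, otypes=[bool])

# Build one pattern per row, then apply the function element-wise
patterns = ("^.+" + df["x"] + ".+$").to_numpy(dtype=object)
mask = str_detect(df["y"].to_numpy(dtype=object), patterns)
filtered = df[mask]
print(filtered)
```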