Python-Polars: how to filter a categorical column with a list of strings

Stack Overflow user

Asked 2022-08-28 15:16:00

Answers: 2 · Views: 150 · Followers: 0 · Votes: 0

I have a Polars DataFrame like the one below:

df_cat = pl.DataFrame(
[
    pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
    pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ ---   ┆ ---   │
│ cat   ┆ cat   │
╞═══════╪═══════╡
│ c     ┆ F     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ G     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ E     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c     ┆ G     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ G     │
└───────┴───────┘

The following filter works fine:

print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ ---   ┆ ---   │
│ cat   ┆ cat   │
╞═══════╪═══════╡
│ c     ┆ F     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c     ┆ G     │
└───────┴───────┘

What I want is to run the filter more efficiently by passing a list of strings. So I tried the following and got this error message:

print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))

File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
   2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
   2182     predicate = pli.Series(predicate)
   2184 return (
-> 2185     self.lazy()
   2186     .filter(predicate)  # type: ignore[arg-type]
   2187     .collect(no_optimization=True, string_cache=False)
   2188 )

File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
    650     projection_pushdown = False
    652 ldf = self._ldf.optimization_toggle(
    653     type_coercion,
    654     predicate_pushdown,
   (...)
    658     slice_pushdown,
    659 )
--> 660 return pli.wrap_df(ldf.collect())

ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache

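From this Stack Overflow link I understand that "you need to set a global string cache to compare categoricals created in different columns/lists". But my questions are:

Why does the == filter with a single string work?
What is the proper way to filter a categorical column with a list of strings?

Thanks!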

python-polars

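Stack Overflow user

Answered 2022-08-28 19:13:23

Actually, you don't need to set a global string cache to compare strings with categorical variables. You can use cast to accomplish this.

Let's use this data. I've included the integer values that underlie the categorical variables so that I can demonstrate something later.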

import polars as pl

df_cat = (
    pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix('_phys')
    )
)
df_cat

shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ ---   ┆ ---   ┆ ---        ┆ ---        │
│ cat   ┆ cat   ┆ u32        ┆ u32        │
╞═══════╪═══════╪════════════╪════════════╡
│ c     ┆ F     ┆ 0          ┆ 0          │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a     ┆ G     ┆ 1          ┆ 1          │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b     ┆ E     ┆ 2          ┆ 2          │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c     ┆ S     ┆ 0          ┆ 3          │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X     ┆ X     ┆ 3          ┆ 4          │
└───────┴───────┴────────────┴────────────┘

Comparing categorical variables to strings

If we cast a categorical variable back to its string values, we can make any comparison we need. For example:

df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))

shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ ---   ┆ ---   ┆ ---        ┆ ---        │
│ cat   ┆ cat   ┆ u32        ┆ u32        │
╞═══════╪═══════╪════════════╪════════════╡
│ c     ┆ F     ┆ 0          ┆ 0          │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a     ┆ G     ┆ 1          ┆ 1          │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c     ┆ S     ┆ 0          ┆ 3          │
└───────┴───────┴────────────┴────────────┘

Or, in a filter step, we can compare the string values of two categorical variables that do not share the same string cache:

df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))

shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ ---   ┆ ---   ┆ ---        ┆ ---        │
│ cat   ┆ cat   ┆ u32        ┆ u32        │
╞═══════╪═══════╪════════════╪════════════╡
│ X     ┆ X     ┆ 3          ┆ 4          │
└───────┴───────┴────────────┴────────────┘

Note that it is the string values that are compared (not the integers that underlie the two categorical variables).
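To illustrate why the underlying integers cannot be compared directly, here is a minimal pure-Python sketch of dictionary encoding. It is a simplified model and not Polars' actual implementation: the `encode` helper is hypothetical, and it assumes codes are assigned in order of first appearance, which matches the `_phys` columns shown above.

```python
def encode(values):
    """Dictionary-encode strings as integer codes, in order of first appearance.

    Hypothetical helper for illustration only; not part of Polars.
    """
    categories = {}
    codes = []
    for value in values:
        if value not in categories:
            categories[value] = len(categories)
        codes.append(categories[value])
    return codes, list(categories)


# Each column gets its own, independent string-to-code mapping.
a_codes, a_cats = encode(["c", "a", "b", "c", "X"])
b_codes, b_cats = encode(["F", "G", "E", "S", "X"])

pairs = list(zip(a_codes, b_codes))

# Comparing raw codes matches rows whose strings are actually different.
code_matches = [i for i, (a, b) in enumerate(pairs) if a == b]

# Decoding back to strings first gives the intended comparison.
string_matches = [i for i, (a, b) in enumerate(pairs) if a_cats[a] == b_cats[b]]

print(code_matches)    # [0, 1, 2]
print(string_matches)  # [4]
```

In this model, "X" is code 3 in one column and code 4 in the other, so only a comparison on the decoded strings finds the matching last row.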

The equality operator on categorical variables

The following statements are equivalent:

df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))

The former is syntactic sugar for the latter, since it is such a common use case.

Votes: 1


Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/73519899
