问如何在python cuDF中使用自定义函数进行分组？
EN

Stack Overflow用户

提问于 2022-07-30 08:23:34

回答 1查看 100关注 0票数 0

我刚开始使用GPU进行数据操作，并且一直在努力复制cuDF中的一些功能。例如，我希望为数据集中的每个组获取一个模式值。在Pandas中，使用自定义函数很容易实现：

df = pd.DataFrame({'group': [1, 2, 2, 1, 3, 1, 2],
                   'value': [10, 10, 30, 20, 20, 10, 30]}

| group | value |
| ----- | ----- |
| 1     | 10    |
| 2     | 10    |
| 2     | 30    |
| 1     | 20    |
| 3     | 20    |
| 1     | 10    |
| 2     | 30    |

def get_mode(customer):
    freq = {}
    for category in customer:
        freq[category] = freq.get(category, 0) + 1
    key = max(freq, key=freq.get)
    return [key, freq[key]]

df.groupby('group').agg(get_mode)

| group | value |
| ----- | ----- |
| 1     | 10    |
| 2     | 30    |
| 3     | 20    |

但是，我似乎无法在cuDF中复制相同的功能。尽管似乎有一种方法，我已经找到了一些例子，但它在我的情况下不起作用。例如，下面是我试图为cuDF使用的函数：

def get_mode(group, mode):
    print(group)
    freq = {}
    for i in range(cuda.threadIdx.x, len(group), cuda.blockDim.x):
        category = group[i]
        freq[category] = freq.get(category, 0) + 1
    mode = max(freq, key=freq.get)
    max_freq = freq[mode]
    
df.groupby('group').apply_grouped(get_mode, incols=['group'],
                                   outcols=dict((mode=np.float64))

有谁能帮我了解一下这里出了什么问题，以及如何解决它？试图运行上面的代码会引发以下错误(希望我成功地将其置于扰流板之下)：

Error code

TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Failed in cuda mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_get at 0x7fa8f0500710>) found for signature:

>>> impl_get(DictType[undefined,undefined]<iv={}>, int64, Literal[int](0))

There are 2 candidate implementations:
    - Of which 1 did not match due to:
    Overload in function 'impl_get': File: numba/typed/dictobject.py: Line 710.
      With argument(s): '(DictType[undefined,undefined]<iv=None>, int64, int64)':
     Rejected as the implementation raised a specific error:
       TypingError: Failed in nopython mode pipeline (step: nopython frontend)
     non-precise type DictType[undefined,undefined]<iv=None>
     During: typing of argument at /opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py (719)
     
     File "../../opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py", line 719:
         def impl(dct, key, default=None):
             castedkey = _cast(key, keyty)
             ^

raised from /opt/conda/lib/python3.7/site-packages/numba/core/typeinfer.py:1086
    - Of which 1 did not match due to:
    Overload in function 'impl_get': File: numba/typed/dictobject.py: Line 710.
      With argument(s): '(DictType[undefined,undefined]<iv={}>, int64, Literal[int](0))':
     Rejected as the implementation raised a specific error:
       TypingError: Failed in nopython mode pipeline (step: nopython frontend)
     non-precise type DictType[undefined,undefined]<iv={}>
     During: typing of argument at /opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py (719)
     
     File "../../opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py", line 719:
         def impl(dct, key, default=None):
             castedkey = _cast(key, keyty)

During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.DictType'>, 'get') for DictType[undefined,undefined]<iv={}>)
During: typing of call at /tmp/ipykernel_33/2595976848.py (6)


File "../../tmp/ipykernel_33/2595976848.py", line 6:
<source missing, REPL/exec in use?>

During: resolving callee type: type(<numba.cuda.compiler.Dispatcher object at 0x7fa8afe49520>)
During: typing of call at <string> (10)


File "<string>", line 10:
<source missing, REPL/exec in use?>

rapids

cudf

Stack Overflow用户

回答已采纳

发布于 2022-07-30 21:35:57

cuDF建立在Numba的CUDA目标之上，以支持UDF。这不支持在UDF中使用字典，但是您的用例可以通过将cuDF和value_counts结合起来，用熊猫或drop_duplicates的内置操作来表示。

import pandas as pd

df = pd.DataFrame(
    {
        'group': [1, 2, 2, 1, 3, 1, 2],
        'value': [10, 10, 30, 20, 20, 10, 30]
    }
)

out = (
    df
    .value_counts()
    .reset_index(name="count")
    .sort_values(["group", "count"], ascending=False)
    .drop_duplicates(subset="group", keep="first")
)
print(out[["group", "value"]])
   group  value
4      3     20
1      2     30
0      1     10

票数 0

查看全部 1 条回答

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/73174015

复制

相似问题

问如何在python cuDF中使用自定义函数进行分组？
EN

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何在python cuDF中使用自定义函数进行分组？EN

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何在python cuDF中使用自定义函数进行分组？
EN