
Inconsistent results from OrdinalEncoder with np.nan in the input array

Stack Overflow user
Asked 2021-10-29 04:55:14
1 answer, 858 views, 0 followers, 0 votes

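I want to use OrdinalEncoder to encode some ordinal data of the form ["6-10","11-15","1-5",...,np.nan], with the encoding order given through the categories parameter as ["1-5","6-10","11-15",...] and np.nan ignored (I would like to encode the feature before imputing the NaNs).

According to the user guide, sklearn should ignore np.nan in the input array.

From https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features

However, I get inconsistent results across three cases: a plain list, an np.array, and a plain list with the categories parameter specified: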

!pip install -U scikit-learn
!pip install -U numpy

import sklearn
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

print(sklearn.__version__)

dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
dummy_array2 = np.array(["1-5","6-10","10-15","6-10","10-15","10-15","1-5",np.nan])
enc_order = ["1-5","6-10","10-15"]
enc1 = OrdinalEncoder()
enc2 = OrdinalEncoder()
enc3 = OrdinalEncoder(categories=[enc_order])
print(enc1.fit_transform(dummy_array))
print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
print(enc3.fit_transform(dummy_array))

Output:

Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (1.0.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.21.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (3.0.0)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (1.21.3)
1.0.1
[[ 0.]
 [ 2.]
 [ 1.]
 [ 2.]
 [ 1.]
 [ 1.]
 [ 0.]
 [nan]]
[[0.]
 [2.]
 [1.]
 [2.]
 [1.]
 [1.]
 [0.]
 [3.]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-c460949a3bd3> in <module>()
     16 print(enc1.fit_transform(dummy_array))
     17 print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
---> 18 print(enc3.fit_transform(dummy_array))

2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    845         if y is None:
    846             # fit method of arity 1 (unsupervised transformation)
--> 847             return self.fit(X, **fit_params).transform(X)
    848         else:
    849             # fit method of arity 2 (supervised transformation)

/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    884 
    885         # `_fit` will only raise an error when `self.handle_unknown="error"`
--> 886         self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
    887 
    888         if self.handle_unknown == "use_encoded_value":

/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
    114                             " during fit".format(diff, i)
    115                         )
--> 116                         raise ValueError(msg)
    117             self.categories_.append(cats)
    118 

ValueError: Found unknown categories [nan] in column 0 during fit

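Since I do not have much experience with sklearn and numpy, I do not know why these three cases give different results. As I understand it, the first two cases should both produce the result below, and the third case should not raise an error: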

[[ 0.]
 [ 2.]
 [ 1.]
 [ 2.]
 [ 1.]
 [ 1.]
 [ 0.]
 [nan]] 

Any help would be much appreciated, thanks!

1: https://i.stack.imgur.com/Gba8X.png


1 Answer

Stack Overflow user

Accepted answer

Posted 2021-10-29 18:43:11

You need to state explicitly how unknown (missing) values should be handled:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
enc_order = ["1-5","6-10","10-15"]

# unknown_value is mandatory when handle_unknown is given
enc3 = OrdinalEncoder(categories=[enc_order], 
                      handle_unknown='use_encoded_value', 
                      unknown_value=np.nan)  

enc3.fit_transform(dummy_array)

This yields:

array([[ 0.],
       [ 1.],
       [ 2.],
       [ 1.],
       [ 2.],
       [ 2.],
       [ 0.],
       [nan]])

The default value of handle_unknown is "error", which is why you got the ValueError.
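As a side note, the codes in this output differ from the 0, 2, 1 pattern in the question's first two results because an explicit categories list fixes the order, whereas the default categories='auto' takes the unique values in sorted order, and strings sort lexicographically. A minimal pure-Python sketch of that ordering (it does not call scikit-learn; auto_order and auto_codes are illustrative names only):

```python
# Strings compare character by character, so "10-15" sorts before "6-10".
labels = ["1-5", "6-10", "10-15"]

# Sorted unique labels, then one integer code per label in that order.
auto_order = sorted(labels)
auto_codes = {label: code for code, label in enumerate(auto_order)}

print(auto_order)   # ['1-5', '10-15', '6-10']
print(auto_codes)   # {'1-5': 0, '10-15': 1, '6-10': 2}
```

This is why "6-10" was encoded as 2 in the question's first two outputs, and why passing categories=[enc_order] is needed to get the intended ordinal order.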

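The documentation states:

handle_unknown: {'error', 'use_encoded_value'}, default='error'. When set to 'error', an error will be raised if an unknown categorical feature is present during transform. When set to 'use_encoded_value', the encoded value of unknown categories will be set to the value given for the parameter unknown_value.

The help for unknown_value is:

unknown_value: int or np.nan, default=None. When the parameter handle_unknown is set to 'use_encoded_value', this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories. If set to np.nan, the dtype parameter must be a float dtype.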
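To make the two modes concrete, here is a rough pure-Python sketch of the documented behaviour. It is an illustration only, not scikit-learn's implementation: encode_ordinal is a hypothetical helper, and its error message is simplified.

```python
def encode_ordinal(values, order, handle_unknown="error", unknown_value=None):
    """Illustrative stand-in for the documented behaviour, not scikit-learn code."""
    # Each known category gets its position in the user-supplied order.
    codes = {category: float(i) for i, category in enumerate(order)}
    encoded = []
    for value in values:
        if value in codes:
            encoded.append(codes[value])
        elif handle_unknown == "use_encoded_value":
            # Unknown categories (here: NaN) get the caller-chosen code.
            encoded.append(unknown_value)
        else:
            # Default mode: anything outside `order` is an error.
            raise ValueError(f"Found unknown categories [{value}]")
    return encoded


values = ["1-5", "6-10", "10-15", float("nan")]
order = ["1-5", "6-10", "10-15"]

result = encode_ordinal(values, order, "use_encoded_value", float("nan"))
print(result)  # [0.0, 1.0, 2.0, nan]
```

Calling encode_ordinal(values, order) with the default mode raises ValueError instead, which mirrors the failure in the question's third case.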

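dummy_array2 produces an encoded value for every element, including the NaN, because the input is a NumPy array of strings: np.nan is converted to the string 'nan', since the other elements are strings and a NumPy array needs a single dtype. In this case the dtype is "U32". The string 'nan' is therefore treated as just another category, and all values are encoded as integers (well, floats).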
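The coercion is easy to check with NumPy alone. The short sketch below (the variable name mixed is arbitrary) builds the same kind of array and shows that the missing value has become an ordinary string:

```python
import numpy as np

# Mixing strings with np.nan forces a single string dtype for the whole array.
mixed = np.array(["1-5", "6-10", "10-15", np.nan])

print(mixed.dtype.kind)     # 'U': a fixed-width Unicode string dtype
print(mixed[-1] == "nan")   # True: the NaN is now the text 'nan'
print(np.unique(mixed))     # 'nan' sorts after the labels that start with a digit
```

Because 'nan' sorts last among these labels, it receives the highest code, which matches the 3 in the question's second output. Building the array with dtype=object instead keeps np.nan as a float.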

2 votes
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/69763817
