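I want to use OrdinalEncoder to encode some ordinal data of the form ["6-10","11-15","1-5",...,np.nan], with the encoding order given through the categories parameter as ["1-5","6-10","11-15",...] and np.nan ignored (I want to encode the feature before imputing the NaNs).
According to the user guide, sklearn should ignore np.nan in the input array:
From https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
However, I get inconsistent results from a plain list, an np.array, and a plain list with the categories parameter specified: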
!pip install -U scikit-learn
!pip install -U numpy
import sklearn
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
print(sklearn.__version__)
dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
dummy_array2 = np.array(["1-5","6-10","10-15","6-10","10-15","10-15","1-5",np.nan])
enc_order = ["1-5","6-10","10-15"]
enc1 = OrdinalEncoder()
enc2 = OrdinalEncoder()
enc3 = OrdinalEncoder(categories=[enc_order])
print(enc1.fit_transform(dummy_array))
print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
print(enc3.fit_transform(dummy_array))
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (1.0.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.21.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (3.0.0)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (1.21.3)
1.0.1
[[ 0.]
[ 2.]
[ 1.]
[ 2.]
[ 1.]
[ 1.]
[ 0.]
[nan]]
[[0.]
[2.]
[1.]
[2.]
[1.]
[1.]
[0.]
[3.]]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-c460949a3bd3> in <module>()
16 print(enc1.fit_transform(dummy_array))
17 print(enc2.fit_transform(dummy_array2.reshape(-1,1)))
---> 18 print(enc3.fit_transform(dummy_array))
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
845 if y is None:
846 # fit method of arity 1 (unsupervised transformation)
--> 847 return self.fit(X, **fit_params).transform(X)
848 else:
849 # fit method of arity 2 (supervised transformation)
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
884
885 # `_fit` will only raise an error when `self.handle_unknown="error"`
--> 886 self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
887
888 if self.handle_unknown == "use_encoded_value":
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
114 " during fit".format(diff, i)
115 )
--> 116 raise ValueError(msg)
117 self.categories_.append(cats)
118
ValueError: Found unknown categories [nan] in column 0 during fit
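Since I don't have much experience with numpy and sklearn, I don't know what causes the different results in these three cases. As I understand it, the first two cases should both give the following result, and the third case should not raise an error: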
[[ 0.]
[ 2.]
[ 1.]
[ 2.]
[ 1.]
[ 1.]
[ 0.]
[nan]]
Any help would be greatly appreciated, thanks!
[1]: https://i.stack.imgur.com/Gba8X.png
Answered 2021-10-29 18:43:11
You need to state explicitly how unknown (missing) values should be handled:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

dummy_array = [["1-5"],["6-10"],["10-15"],["6-10"],["10-15"],["10-15"],["1-5"],[np.nan]]
enc_order = ["1-5","6-10","10-15"]

# unknown_value is mandatory when handle_unknown='use_encoded_value'
enc3 = OrdinalEncoder(categories=[enc_order],
                      handle_unknown='use_encoded_value',
                      unknown_value=np.nan)
enc3.fit_transform(dummy_array)
This yields:
array([[ 0.],
[ 1.],
[ 2.],
[ 1.],
[ 2.],
[ 2.],
[ 0.],
[nan]])
The default value of handle_unknown is "error", which is why you got the ValueError.
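The documentation states:
handle_unknown: {'error', 'use_encoded_value'}, default='error'
When set to 'error', an error will be raised if an unknown categorical feature is present during transform. When set to 'use_encoded_value', the encoded value of unknown categories will be set to the value given for the parameter unknown_value.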
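The help for unknown_value says:
unknown_value: int or np.nan, default=None
When the parameter handle_unknown is set to 'use_encoded_value', this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories. If set to np.nan, the dtype parameter must be a float dtype.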
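As a minimal sketch of how these two parameters interact, the snippet below fits the encoder on the known categories and then transforms a label it has never seen. The label "16-20" and the sentinel -1 are made up for illustration and do not come from the question.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc_order = ["1-5", "6-10", "10-15"]

# -1 is an arbitrary sentinel chosen for this sketch; it only has to differ
# from the codes 0, 1, 2 that the known categories receive.
enc = OrdinalEncoder(
    categories=[enc_order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
enc.fit([["1-5"], ["6-10"], ["10-15"]])

# "16-20" is a hypothetical label that was not listed in enc_order.
encoded = enc.transform([["1-5"], ["10-15"], ["16-20"]])
print(encoded.ravel())  # known labels keep their position in enc_order
```

Note that with unknown_value=np.nan, as in the answer's code, a genuinely unseen label would also be encoded as NaN, so it could not be told apart from a missing value afterwards.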
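dummy_array2 outputs an encoded value for every element, including NaN, because the input is a NumPy array of strings: np.nan is converted to the string 'nan' because the other elements are strings and a NumPy array needs a single dtype. In this case the dtype is "U32". So all values are correctly encoded as integers (well, floats).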
https://stackoverflow.com/questions/69763817