I'm running into a problem normalizing a dask.dataframe.core.DataFrame with dask_ml.preprocessing.MinMaxScaler. I can use sklearn.preprocessing.MinMaxScaler, but I'd like to use dask so the workflow scales to larger data.

Minimal, reproducible example:
import dask.dataframe as dd

# Get data (contents of test.csv are listed under "Additional information" below)
ddf = dd.read_csv('test.csv')
ddf = ddf.set_index('index')
# Pivot
ddf = ddf.categorize(columns=['item', 'name'])
ddf_p = ddf.pivot_table(index='item', columns='name', values='value', aggfunc='mean')
col = ddf_p.columns.to_list()
# sklearn version
from sklearn.preprocessing import MinMaxScaler
scaler_s = MinMaxScaler()
scaled_ddf_s = scaler_s.fit_transform(ddf_p[col]) # Works!
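# Note: sklearn "works" here presumably because its input validation coerces the
# dask DataFrame into an in-memory NumPy array, which defeats the point of dask.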
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Doesn't work
Error message:
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
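(For reference, this error comes from pandas and can be reproduced with any unordered categorical; the snippet below is an illustrative sketch, not part of the original pipeline:)

import pandas as pd

s = pd.Series(pd.Categorical(['A', 'B', 'C']))
# s.min() raises: TypeError: Categorical is not ordered for operation min
print(s.cat.as_ordered().min())  # prints 'A' once the categorical is ordered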
Not sure which "Categorical" in the pivoted table this refers to, but I tried .as_ordered() on the index:
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p = ddf_p.index.cat.as_ordered()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])
But then I got this error message:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
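(This second error is likely a side effect of the line above: ddf_p = ddf_p.index.cat.as_ordered() rebinds ddf_p to the index itself instead of reordering the dataframe's index, so ddf_p[col] ends up indexing a dask Index/Series with a list of column labels, which dask does not support.)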
Additional information:
test.csv
index,item,name,value
2015-01-01,item_1,A,1
2015-01-01,item_1,B,2
2015-01-01,item_1,C,3
2015-01-01,item_1,D,4
2015-01-01,item_1,E,5
2015-01-02,item_2,A,10
2015-01-02,item_2,B,20
2015-01-02,item_2,C,30
2015-01-02,item_2,D,40
2015-01-02,item_2,E,50
Posted on 2020-11-30 17:29:44
pivot_table produces a column index that is categorical, because the original column ('name' here) was made categorical. When the index is written back into the dataframe on reset_index, pandas cannot add a new value to the column index, because it is categorical. You can avoid this with ddf.columns = list(ddf.columns).
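A minimal pandas sketch of that behavior (illustrative data, not from the question):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=pd.CategoricalIndex(['A', 'B', 'C']))
try:
    df['D'] = 4  # 'D' is not one of the existing categories
except TypeError as e:
    # pandas as of late 2020 refused to insert a new label into a categorical
    # column index; newer versions may instead cast the index to object
    print(e)

df.columns = list(df.columns)  # a plain object Index accepts any label
df['D'] = 4  # fine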
So adding ddf_p.columns = list(ddf_p.columns) solved the problem:
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p.columns = list(ddf_p.columns)
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Works!
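(Unlike the sklearn call, which returns a NumPy array, dask_ml's fit_transform on a dask DataFrame should return a lazy dask collection; call scaled_values_d.compute() when the scaled values are actually needed in memory.)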
https://stackoverflow.com/questions/65077370