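Scaling and normalization are two operations we perform all the time during data preprocessing, but they are frequently confused with each other, and much of the discussion online is muddled as well.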
import pandas as pd
import numpy as np
# for Box-Cox Transformation
from scipy import stats
# for min_max scaling
from mlxtend.preprocessing import minmax_scaling
from sklearn import preprocessing
# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
# set seed for reproducibility
np.random.seed(0)
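Feature scaling changes the range of the data but leaves the shape of its distribution unchanged. Examples are min-max scaling and Z-score standardization (there are four main methods; for details see: Feature_scaling).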
Min-Max scale:
original_data = np.random.beta(5, 1, 1000) * 60
# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])
# or, equivalently, with scikit-learn
scaled_data = preprocessing.minmax_scale(original_data)
# plot both together to compare
fig, ax = plt.subplots(1,2)
sns.histplot(original_data, kde=True, stat="density", ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(scaled_data, kde=True, stat="density", ax=ax[1])
ax[1].set_title("Scaled data")
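As a quick check on the claim that scaling leaves the shape of the distribution alone, the sketch below applies the min-max formula (x - min) / (max - min) by hand with NumPy and compares skewness before and after. The variable names, the use of `np.random.default_rng`, and the choice of skewness as a shape measure are illustrative assumptions, not part of the walkthrough above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.beta(5, 1, 1000) * 60

# min-max scaling by hand: the smallest value maps to 0, the largest to 1
x_scaled = (x - x.min()) / (x.max() - x.min())

# skewness does not change under a positive linear rescaling,
# so the two numbers should agree up to floating-point error
skew_before = stats.skew(x)
skew_after = stats.skew(x_scaled)

print("range after scaling:", x_scaled.min(), x_scaled.max())
print("skewness before/after:", skew_before, skew_after)
```

Only the axis changes; the histogram keeps the same outline, which is what the two plots show.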
Z-score:
s_scaler = preprocessing.StandardScaler(with_mean=True, with_std=True)
df_s = s_scaler.fit_transform(original_data.reshape(-1,1))
# plot both together to compare
fig, ax = plt.subplots(1,2)
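sns.histplot(original_data, kde=True, stat="density", ax=ax[0])
ax[0].set_title("Original Data")
# StandardScaler returns a 2-D array, so flatten it before plotting
sns.histplot(df_s.ravel(), kde=True, stat="density", ax=ax[1])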
ax[1].set_title("Scaled data")
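To make the Z-score step less of a black box, here is a minimal sketch that computes (x - mean) / std by hand and compares it with `StandardScaler`. It assumes the same beta-distributed sample as above, regenerated with a local generator; the names `z_manual` and `z_sklearn` are made up for illustration.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
x = rng.beta(5, 1, 1000) * 60

# z-score by hand: subtract the mean, divide by the population standard deviation
z_manual = (x - x.mean()) / x.std()

# StandardScaler expects a 2-D array of shape (n_samples, n_features)
z_sklearn = preprocessing.StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print("manual and sklearn agree:", np.allclose(z_manual, z_sklearn))
print("mean and std after scaling:", z_manual.mean(), z_manual.std())
```

The result has mean 0 and standard deviation 1, but it is still just a shifted and rescaled copy of the original data, so it is no closer to a normal distribution than before.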
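Normalization, by contrast, changes the shape of the distribution. For example, the Box-Cox transformation can reshape strictly positive data so that it is approximately normally distributed.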
# normalize the beta-distributed data with boxcox
normalized_data = stats.boxcox(original_data)
# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.histplot(original_data, kde=True, stat="density", ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], kde=True, stat="density", ax=ax[1])
ax[1].set_title("Normalized data")
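The reason the code indexes `normalized_data[0]` is that `stats.boxcox`, when called without `lmbda`, returns a pair: the transformed data and the fitted lambda. The sketch below unpacks both and writes out the transform y = (x^lambda - 1) / lambda by hand for comparison. It assumes the fitted lambda is nonzero (for lambda = 0 the transform is log x instead), and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.beta(5, 1, 1000) * 60  # strictly positive, left-skewed sample

# without an explicit lmbda, boxcox returns (transformed data, fitted lambda)
y, fitted_lambda = stats.boxcox(x)

# the same transform written out by hand (valid only for lambda != 0)
y_manual = (x ** fitted_lambda - 1) / fitted_lambda

print("fitted lambda:", fitted_lambda)
print("skewness before/after:", stats.skew(x), stats.skew(y))
print("manual formula matches scipy:", np.allclose(y, y_manual))
```

Unlike the scaling examples, the skewness moves toward zero here, because the transform is nonlinear and actually reshapes the distribution.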
Now let's try a different distribution:
original_data = np.random.exponential(size=1000)
# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)
# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.histplot(original_data, kde=True, stat="density", ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], kde=True, stat="density", ax=ax[1])
ax[1].set_title("Normalized data")
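Two practical points about Box-Cox are worth a small sketch: the transform can be undone if the fitted lambda is kept, and it only accepts strictly positive input. The example assumes `scipy.special.inv_boxcox` for the inverse; the tiny array with a negative value and the `rejected` flag are made up purely to trigger the error.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
x = rng.exponential(size=1000)

y, fitted_lambda = stats.boxcox(x)

# undo the transform with the same lambda to return to the original scale
x_recovered = inv_boxcox(y, fitted_lambda)
print("round trip recovers the data:", np.allclose(x, x_recovered))

# Box-Cox is defined only for strictly positive values
rejected = False
try:
    stats.boxcox(np.array([-1.0, 2.0, 3.0]))
except ValueError as err:
    rejected = True
    print("boxcox rejected non-positive data:", err)
```

If the data contains zeros or negative values, one common workaround is to shift it by a constant before transforming, though that shift then becomes one more parameter to keep track of.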
References: