# The Importance of Preprocessing in Data Science and the Machine Learning Pipeline (I): Centering, Scaling and k-Nearest Neighbors

## A Visual Description of k-Nearest Neighbors

`from IPython.display import Image; Image(url='http://36.media.tumblr.com/d100eff8983aae7c5654adec4e4bb452/tumblr_inline_nlhyibOF971rnd3q0_500.png')`

## Implementing k-NN in Python (scikit-learn)

```python
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values  # drop target variable
y1 = df['quality'].values  # original target variable
pd.DataFrame.hist(df, figsize=[15, 15]);
```

```python
y = y1 <= 5  # is the rating <= 5?
# plot histograms of original target variable
# and aggregated target variable
plt.figure(figsize=(20, 5));
plt.subplot(1, 2, 1);
plt.hist(y1);
plt.xlabel('original target value')
plt.ylabel('count')
plt.subplot(1, 2, 2);
plt.hist(y)
plt.xlabel('aggregated target value')
plt.show()
```

## k-NN: Performance in Practice and the Train-Test Split

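```python
# train_test_split lives in sklearn.model_selection;
# the older sklearn.cross_validation module has been removed.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```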

```python
from sklearn import neighbors, linear_model
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn_model_1 = knn.fit(X_train, y_train)
print('k-NN accuracy for test set: %f' % knn_model_1.score(X_test, y_test))
```
`k-NN accuracy for test set: 0.612500`

```python
from sklearn.metrics import classification_report
y_true, y_pred = y_test, knn_model_1.predict(X_test)
print(classification_report(y_true, y_pred))
```
```
             precision    recall  f1-score   support

      False       0.66      0.64      0.65       179
       True       0.56      0.57      0.57       141

avg / total       0.61      0.61      0.61       320
```

## Preprocessing: Scaling in Practice

```python
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

Xs = scale(X)  # center each feature to mean 0 and scale it to unit variance
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)
knn_model_2 = knn.fit(Xs_train, y_train)
print('k-NN score for test set: %f' % knn_model_2.score(Xs_test, y_test))
print('k-NN score for training set: %f' % knn_model_2.score(Xs_train, y_train))
y_true, y_pred = y_test, knn_model_2.predict(Xs_test)
print(classification_report(y_true, y_pred))
```
```
k-NN score for test set: 0.712500
k-NN score for training set: 0.814699
             precision    recall  f1-score   support

      False       0.72      0.79      0.75       179
       True       0.70      0.62      0.65       141

avg / total       0.71      0.71      0.71       320
```

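1. Predictor variables may have very different ranges, and in some cases, such as when using k-NN, their values need to be brought onto a comparable scale so that no single feature dominates the algorithm;
2. You want your features to be unit-independent, that is, not tied to the unit of measurement: for example, you might have a feature expressed in meters while I have the same feature expressed in centimeters. If we each scale our data, the feature will look the same for both of us.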
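Both points above can be checked with a small, self-contained sketch. It uses only the Python standard library rather than scikit-learn; the helper names `standardize` and `nearest`, and all the numbers, are hypothetical values chosen for illustration. The standardization here subtracts the mean and divides by the population standard deviation, which is assumed to mirror what `scale` does per feature.

```python
import math
import statistics

def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def nearest(query, points):
    """Return the index of the point closest to query (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))

# Point 2: the same hypothetical lengths, recorded in meters and in centimeters.
lengths_m = [1.0, 1.5, 2.0, 4.0]
lengths_cm = [100 * v for v in lengths_m]
same_after_scaling = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(standardize(lengths_m), standardize(lengths_cm))
)
print(same_after_scaling)  # True: the unit no longer matters

# Point 1: hypothetical (height, weight in kg) rows; only the height unit differs.
query_m, points_m = (1.70, 60.0), [(1.90, 61.0), (1.71, 63.0)]
query_cm, points_cm = (170.0, 60.0), [(190.0, 61.0), (171.0, 63.0)]
nearest_m = nearest(query_m, points_m)     # weight dominates the distance
nearest_cm = nearest(query_cm, points_cm)  # height dominates the distance
print(nearest_m, nearest_cm)  # 0 1: the nearest neighbor depends on the unit
```

On unscaled data the nearest neighbor changes when height switches from meters to centimeters, which is exactly the kind of dependence that scaling is meant to remove.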

```python
import pandas as pd
from sklearn import neighbors
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Set the number of neighbors for k-NN
n_neig = 5

# Set sc = True if you want to scale your features
sc = False  # assignment missing in the source; False matches the unscaled output below

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values  # drop target variable

# Here we scale, if desired
if sc:
    X = scale(X)

# Target value
y1 = df['quality'].values  # original target variable
y = y1 <= 5  # new target variable: is the rating <= 5?

# Split the data into a test set and a training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train k-NN model and print performance on the test set
knn = neighbors.KNeighborsClassifier(n_neighbors=n_neig)
knn_model = knn.fit(X_train, y_train)
y_true, y_pred = y_test, knn_model.predict(X_test)
print('k-NN accuracy for test set: %f' % knn_model.score(X_test, y_test))
print(classification_report(y_true, y_pred))
```
```
<script.py> output:
k-NN accuracy for test set: 0.612500
             precision    recall  f1-score   support

      False       0.66      0.64      0.65       179
       True       0.56      0.57      0.57       141

avg / total       0.61      0.61      0.61       320
```

### Glossary

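k-Nearest Neighbors (k-NN): an algorithm for classification tasks in which a data point's label is decided by a majority vote among its k nearest neighbors, that is, the k closest labeled data points.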
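The voting rule in this definition can be sketched in a few lines of plain Python. This is a minimal illustration and not scikit-learn's implementation: the function name `knn_predict`, the toy points, and the labels are all made up for the example, and it assumes Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Order training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    # Count the labels of the k closest points; on a tie, the label
    # encountered first (the one belonging to the nearer point) wins.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters labeled "a" and "b".
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, (0.15, 0.15), k=3)
print(prediction)  # a
```

Because the vote depends entirely on distances, any feature with a much larger numeric range will drive the result, which is why scaling matters so much for this algorithm.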
