# Digit Recognition: From KNN, LR, SVM, and RF to Deep Learning

@蜡笔小轩V

## Reading the Kaggle Data

```
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

# read data
dataset = pd.read_csv("./data/train.csv")
X_train = dataset.values[0:, 1:]
y_train = dataset.values[0:, 0]

# keep a small subset for fast evaluation
X_train_small = X_train[:10000, :]
y_train_small = y_train[:10000]
```

pandas' DataFrame is of course a great tool, but learning it exhaustively takes too long; start with the handful of operations you actually use and pick up the rest as needed.
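For this task, a few DataFrame operations cover most of what you need. A minimal sketch (the `label`/`pixel*` column names follow Kaggle's train.csv layout):

```python
import pandas as pd

# a tiny frame shaped like Kaggle's train.csv: a 'label' column plus pixel columns
df = pd.DataFrame({'label': [3, 7], 'pixel0': [0, 255], 'pixel1': [128, 64]})

print(df.shape)                    # (rows, columns)
print(df.head())                   # first rows for a quick look
print(df['label'].value_counts())  # class distribution

# .values hands the raw ndarray to sklearn: features vs. label column
X = df.values[:, 1:]
y = df.values[:, 0]
```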

## KNN

KNN has a very direct intuition here: whichever digit a sample looks most like is the digit it gets labeled as. It may look crude, but the reasoning is perfectly sound, and it is a refreshingly different way to attack the problem. KNN has to keep the raw training samples around, a kind of rote memorization (it is a non-parametric model). Perhaps cats, dogs, and other animals "learn" in much the same way.
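The "whichever training digit it looks most like" idea can be written out directly as a 1-nearest-neighbor lookup in NumPy. A toy sketch with fake 4-pixel "images" (not the sklearn implementation used below):

```python
import numpy as np

def predict_1nn(x, X_train, y_train):
    """Label a sample with the class of its closest training image (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every stored sample
    return y_train[np.argmin(dists)]             # the most similar one wins

# toy 4-pixel "images": two dark (class 0), two bright (class 1)
X = np.array([[0, 0, 0, 0], [10, 10, 0, 0], [200, 200, 255, 255], [255, 255, 255, 255]])
y = np.array([0, 0, 1, 1])

print(predict_1nn(np.array([5, 5, 0, 0]), X, y))          # near the dark ones -> 0
print(predict_1nn(np.array([250, 250, 250, 250]), X, y))  # near the bright ones -> 1
```

sklearn's `KNeighborsClassifier` below does the same thing with k neighbors, distance weighting, and a kd-tree to avoid the brute-force scan.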

```
# knn
from sklearn.neighbors import KNeighborsClassifier

# begin time
start = time.time()

knn_clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', weights='distance', p=3)
score = cross_val_score(knn_clf, X_train_small, y_train_small, cv=3)
print(score.mean())

# end time
elapsed = time.time() - start
print("Time used:", int(elapsed), "s")

# k=3: 0.942300738697; 0.946100822903 with weights='distance'; 0.950799888775 with p=3
# k=5: 0.939899237556; 0.94259888029
# k=7: 0.935395994386; 0.938997377902
# k=9: 0.933897851978
```

```
clf = knn_clf

# the Kaggle test set is read the same way as train.csv (it has no label column)
X_test = pd.read_csv("./data/test.csv").values

start = time.time()
clf.fit(X_train, y_train)
elapsed = time.time() - start
print("Training Time used:", int(elapsed / 60), "min")

start = time.time()
result = clf.predict(X_test)
result = np.c_[range(1, len(result) + 1), result.astype(int)]
df_result = pd.DataFrame(result, columns=['ImageId', 'Label'])

df_result.to_csv('./results.knn.csv', index=False)

# end time
elapsed = time.time() - start
print("Test Time used:", int(elapsed / 60), "min")
```

## LR

```
# LR also works!
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

# begin time
start = time.time()

lr_clf = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='multinomial', max_iter=800, C=0.2)
# lr_clf = LogisticRegression(penalty='l1', multi_class='ovr', max_iter=400, C=4)

parameters = {'penalty': ['l2'], 'C': [2e-2, 4e-2, 8e-2, 12e-2, 2e-1]}
# parameters = {'penalty': ['l1'], 'C': [2e0, 2e1, 2e2]}
gs_clf = GridSearchCV(lr_clf, parameters, n_jobs=1, verbose=True)

gs_clf.fit(X_train_small.astype('float') / 256, y_train_small)

print()
for mean_score, std, params in zip(gs_clf.cv_results_['mean_test_score'],
                                   gs_clf.cv_results_['std_test_score'],
                                   gs_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, std * 2, params))
print()

# end time
elapsed = time.time() - start
print("Time used:", elapsed)

# the learned coefficients can be printed and inspected, e.g.
# clf.coef_[1, :]
```

```
0.870 (+/-0.004) for {'penalty': 'l2', 'C': 0.002}
0.900 (+/-0.005) for {'penalty': 'l2', 'C': 0.02}
0.905 (+/-0.001) for {'penalty': 'l2', 'C': 0.2}
0.890 (+/-0.003) for {'penalty': 'l2', 'C': 2.0}
Time used: 114.5217506956833

0.900 (+/-0.005) for {'penalty': 'l2', 'C': 0.02}
0.904 (+/-0.006) for {'penalty': 'l2', 'C': 0.04}
0.908 (+/-0.005) for {'penalty': 'l2', 'C': 0.08}
0.908 (+/-0.005) for {'penalty': 'l2', 'C': 0.12}
0.905 (+/-0.001) for {'penalty': 'l2', 'C': 0.2}
```

## SVM

```
# svc
from sklearn.svm import SVC, NuSVC
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

# begin time
start = time.time()

parameters = {'nu': (0.05, 0.02), 'gamma': [3e-2, 2e-2, 1e-2]}

svc_clf = NuSVC(nu=0.1, kernel='rbf', verbose=True)
gs_clf = GridSearchCV(svc_clf, parameters, n_jobs=1, verbose=True)

gs_clf.fit(X_train_small.astype('float') / 256, y_train_small)

print()
for mean_score, std, params in zip(gs_clf.cv_results_['mean_test_score'],
                                   gs_clf.cv_results_['std_test_score'],
                                   gs_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, std * 2, params))
print()

# end time
elapsed = time.time() - start
print("Time used:", elapsed)
```

```
Fitting 3 folds for each of 6 candidates, totalling 18 fits
0.968 (+/-0.001) for {'nu': 0.05, 'gamma': 0.03}
0.968 (+/-0.001) for {'nu': 0.02, 'gamma': 0.03}
0.967 (+/-0.003) for {'nu': 0.05, 'gamma': 0.02}
0.968 (+/-0.002) for {'nu': 0.02, 'gamma': 0.02}
0.961 (+/-0.002) for {'nu': 0.05, 'gamma': 0.01}
0.963 (+/-0.002) for {'nu': 0.02, 'gamma': 0.01}
Time used: 819.6633204167592
```

```
optimization finished, #iter = 1456
C = 2.065921
obj = 160.316989, rho = 0.340949
nSV = 599, nBSV = 15

Training Time used: 6 min
Test Time used: 12 min
```

## Random Forest

```
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# begin time
start = time.time()

parameters = {'criterion': ['gini', 'entropy'], 'max_features': ['auto', 12, 100]}

rf_clf = RandomForestClassifier(n_estimators=400, n_jobs=4, verbose=1)
gs_clf = GridSearchCV(rf_clf, parameters, n_jobs=1, verbose=True)

gs_clf.fit(X_train_small.astype('int'), y_train_small)

print()
for mean_score, std, params in zip(gs_clf.cv_results_['mean_test_score'],
                                   gs_clf.cv_results_['std_test_score'],
                                   gs_clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, std * 2, params))
print()

# end time
elapsed = time.time() - start
print("Time used:", elapsed)
```

```
0.946 (+/-0.002) for {'max_features': 'auto', 'criterion': 'gini'}
0.945 (+/-0.001) for {'max_features': 12, 'criterion': 'gini'}
0.943 (+/-0.005) for {'max_features': 100, 'criterion': 'gini'}
0.944 (+/-0.004) for {'max_features': 'auto', 'criterion': 'entropy'}
0.944 (+/-0.006) for {'max_features': 12, 'criterion': 'entropy'}
0.942 (+/-0.007) for {'max_features': 100, 'criterion': 'entropy'}
Time used: 342.1534636337892

0.946 (+/-0.005) for {'max_features': 'auto', 'criterion': 'gini', 'max_depth': None}
0.889 (+/-0.004) for {'max_features': 'auto', 'criterion': 'gini', 'max_depth': 6}
0.945 (+/-0.004) for {'max_features': 'auto', 'criterion': 'gini', 'max_depth': 18}
0.946 (+/-0.004) for {'max_features': 'auto', 'criterion': 'gini', 'max_depth': 32}
Test Time used: 1 min
```

## Deep Learning

```
# DL, modified from keras's example
'''Train a simple convnet on the MNIST dataset.
Run on GPU: THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_cnn.py
Get to 99.25% test accuracy after 12 epochs (there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

# from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils

batch_size = 128
nb_classes = 10
nb_epoch = 60

# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3

# (X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
X_train = X_train.astype('float32')

X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
X_test = X_test.astype('float32')

X_test /= 255
X_train /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
# Y_test = np_utils.to_categorical(y_test, nb_classes)

# model definition as in Keras's mnist_cnn example (old Keras 0.x API)
model = Sequential()
model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
                        border_mode='valid',
                        input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adadelta')

model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          show_accuracy=True, verbose=1)

test_result = model.predict_classes(X_test, batch_size=128, verbose=1)
result = np.c_[range(1, len(test_result) + 1), test_result.astype(int)]
df_result = pd.DataFrame(result[:, 0:2], columns=['ImageId', 'Label'])

df_result.to_csv('./results.dl.csv', index=False)
```

## Summary

The linear LR model is clearly the weakest, while neural networks are currently the strongest at this kind of image problem. The SVM's support vectors pull their weight here, accurately picking out the most discriminative "feature images". RF is something of a cure-all for nonlinear problems, and its defaults already do well here, only slightly worse than KNN because it uses only local pixel information. Of course, this comparison applies only to digit recognition; other problems may rank the models differently, so analyze each problem on its own terms and pick a model whose characteristics fit it.
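The "compare models on the concrete problem" point is easy to operationalize: cross-validate several candidates on the same subset and rank them. A minimal sketch using sklearn's built-in 8x8 digits as a stand-in for the Kaggle data (swap in `X_train_small` / `y_train_small` for the real run; the hyperparameters here are illustrative, not tuned):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# sklearn's small built-in digits dataset stands in for the Kaggle CSVs here
X, y = load_digits(return_X_y=True)

candidates = {
    'knn': KNeighborsClassifier(n_neighbors=5),
    'lr': LogisticRegression(max_iter=1000),
    'rf': RandomForestClassifier(n_estimators=100, random_state=0),
}

# same data, same folds -> directly comparable mean accuracies
scores = {name: cross_val_score(clf, X, y, cv=3).mean() for name, clf in candidates.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("%-4s %.3f" % (name, s))
```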
