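# An Overview of Machine Learning Algorithms (with Python and R Code)

"Google's self-driving cars and robots get a lot of attention, but our real future lies in machine learning, the technology that makes computers smarter and more personal." - Eric Schmidt (Google CEO)

## Common machine learning algorithms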

1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. Support Vector Machine (SVM)
5. Naive Bayes
6. K-Nearest Neighbors (KNN)
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms

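### 1. Linear Regression

In the line equation Y = a*X + b:

• Y: dependent variable
• a: slope
• X: independent variable
• b: intercept

The coefficients a and b are found by minimizing the sum of squared errors between the observed and predicted values of the dependent variable (the least squares method).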
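To make the least squares step concrete, here is a minimal sketch of the closed-form solution for a single feature in Y = a*X + b. The function name `fit_line` and the sample data are made up for illustration and are not part of the original article.

```python
import numpy as np

def fit_line(x, y):
    """Return slope a and intercept b that minimize sum((y - (a*x + b))**2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    b = y_mean - a * x_mean
    return a, b

# Hypothetical data lying exactly on Y = 2X + 1
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

With more than one feature the same idea generalizes to solving a linear system, which is what a library such as scikit-learn does for you.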

Python code

```python
# Import libraries (add others such as pandas or numpy as needed)
from sklearn import linear_model

# Load the train and test datasets, then identify the feature and response
# variables. Values must be numeric NumPy arrays.
# The names on the right are placeholders for your own data.
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets

# Create the linear regression object
linear = linear_model.LinearRegression()

# Train the model on the training set and check the score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)

# Equation coefficient and intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

# Predict output
predicted = linear.predict(x_test)
```

R code

```r
# Load the train and test datasets, then identify the feature and response
# variables. Values must be numeric.
# The names on the right are placeholders for your own data.
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)

# Train the model on the training set and check the summary
linear <- lm(y_train ~ ., data = x)
summary(linear)

# Predict output
predicted <- predict(linear, x_test)
```

### 2. Logistic Regression

```
odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
```
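The logit relation can be inverted to recover the probability p from the linear combination b0 + b1X1 + ... + bkXk. The sketch below shows that inversion; the function names and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

def predict_probability(x, b0, b):
    """Compute p from logit(p) = b0 + b1*x1 + ... + bk*xk."""
    z = b0 + sum(bi * xi for bi, xi in zip(b, x))
    return sigmoid(z)

# Hypothetical coefficients: the log-odds is -1.0 + 0.5*2.0 + 0.0*1.0 = 0
p = predict_probability([2.0, 1.0], b0=-1.0, b=[0.5, 0.0])
print(p)
```

A log-odds of zero corresponds to even odds, so a fitted model predicts the positive class whenever the linear combination is above zero (at the usual 0.5 threshold).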

Python code

```python
# Import library
from sklearn.linear_model import LogisticRegression

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the logistic regression object
model = LogisticRegression()

# Train the model on the training set and check the score
model.fit(X, y)
model.score(X, y)

# Equation coefficient and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)

# Predict output
predicted = model.predict(x_test)
```

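R code

```r
x <- cbind(x_train, y_train)

# Train the model on the training set and check the summary
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)

# Predict output; type = "response" returns probabilities
# (the default returns values on the log-odds scale)
predicted <- predict(logistic, x_test, type = "response")
```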

Options to try for improving the model:

• Add interaction terms
• Remove features
• Apply regularization
• Use a non-linear model
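As a rough illustration of two of the options listed above (interaction terms and regularization), the sketch below compares a plain logistic regression with one that also receives the product of the two features. The synthetic data and variable names are assumptions made for this example; `C` is scikit-learn's inverse regularization strength, so smaller values mean stronger regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the class depends on the product of the two features,
# which a model without interaction terms cannot represent
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Baseline: the two raw features only
plain = LogisticRegression(C=1.0).fit(X, y)

# Same model, with the interaction term x0*x1 added as a third feature
with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(C=1.0),  # lower C to regularize more strongly
).fit(X, y)

plain_acc = plain.score(X, y)
inter_acc = with_interactions.score(X, y)
print(plain_acc, inter_acc)
```

Scores here are measured on the training data purely to show the effect; on a real problem you would compare the variants on held-out data.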

### 3. Decision Tree

Python code

```python
# Import libraries (add others such as pandas or numpy as needed)
from sklearn import tree

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the tree object for classification. The criterion can be 'gini'
# or 'entropy' (information gain); the default is 'gini'.
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor()  # for regression

# Train the model on the training set and check the score
model.fit(X, y)
model.score(X, y)

# Predict output
predicted = model.predict(x_test)
```

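R code

```r
library(rpart)
x <- cbind(x_train, y_train)

# Grow the tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)

# Predict output; type = "class" returns labels
# (the default for a classification tree returns class probabilities)
predicted <- predict(fit, x_test, type = "class")
```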

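### 4. Support Vector Machine (SVM)

You can think of this algorithm as a game of JezzBall in n-dimensional space, with a few changes:

• You can draw the dividing line or plane at any angle (the classic game only allows vertical and horizontal ones).
• The goal of the game is now to separate balls of different colors into different regions.
• The balls do not move.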
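To connect the analogy to code, here is a small sketch that fits a linear SVM to two groups of fixed points in 2-D and reads back the dividing line. The six points are hypothetical and chosen so the groups are clearly separable.

```python
import numpy as np
from sklearn import svm

# Two hypothetical groups of motionless "balls" in 2-D
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel draws one straight dividing line between the groups
model = svm.SVC(kernel="linear")
model.fit(X, y)

# The dividing line is w . x + b = 0; it can lie at any angle
w, b = model.coef_[0], model.intercept_[0]
print("w =", w, "b =", b)

# New points are classified by which side of the line they fall on
predictions = model.predict([[0.5, 0.5], [3.5, 3.5]])
print(predictions)
```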

Python code

```python
# Import library
from sklearn import svm

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the SVM classification object. Many options are available;
# this is the simplest classification setup.
model = svm.SVC()

# Train the model on the training set and check the score
model.fit(X, y)
model.score(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
library(e1071)
x <- cbind(x_train, y_train)

# Fit the model
fit <- svm(y_train ~ ., data = x)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
```

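### 5. Naive Bayes

In Bayes' theorem, P(c|x) = P(x|c) * P(c) / P(x):

• P(c|x) is the posterior probability of class c given feature x.
• P(c) is the prior probability of class c.
• P(x|c) is the likelihood: the probability of feature x given class c.
• P(x) is the prior probability of feature x.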
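A short worked example may help tie the four quantities together through Bayes' theorem, P(c|x) = P(x|c) * P(c) / P(x). All the probabilities below are made-up numbers for a two-class problem, and the function name `posterior` is hypothetical.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x

# Hypothetical inputs for a two-class problem
p_c = 0.3              # P(c): prior probability of class c
p_x_given_c = 0.8      # P(x|c): likelihood of feature x within class c
p_x_given_not_c = 0.1  # likelihood of feature x outside class c

# P(x) by the law of total probability over the two classes
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

p_c_given_x = posterior(p_c, p_x_given_c, p_x)
print(p_c_given_x)
```

Observing x raises the probability of class c from the prior 0.3 to 0.24 / 0.31, roughly 0.77.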

Python code

```python
# Import library
from sklearn.naive_bayes import GaussianNB

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the Gaussian Naive Bayes object. Other variants exist for other
# feature distributions, such as MultinomialNB and BernoulliNB.
model = GaussianNB()

# Train the model on the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
library(e1071)
x <- cbind(x_train, y_train)

# Fit the model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
```

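### 6. KNN (K-Nearest Neighbors)

KNN has many everyday analogues. For example, to learn about someone you have never met, you might look at their close friends and the circles they move in.

• KNN is computationally expensive.
• All features should be scaled to a comparable range; otherwise features with larger magnitudes dominate the distance calculation.
• Preprocess the data before applying KNN, for example by removing outliers and noise.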
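The scaling point is easy to demonstrate. The sketch below is a bare-bones nearest-neighbor classifier written with NumPy (not the scikit-learn implementation), applied to hypothetical data in which the first feature determines the class and the second, much larger in magnitude, is noise.

```python
import numpy as np

def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical data: column 0 separates the classes, column 1 is large-scale noise
train_X = np.array([[1.0, 1000], [2.0, 5000], [1.5, 9000],
                    [8.0, 1200], [9.0, 5200], [8.5, 9100]])
train_y = np.array([0, 0, 0, 1, 1, 1])
query = np.array([[8.2, 5050]])

# Without scaling, the large second column decides which neighbor is nearest
raw_pred = knn_predict(train_X, train_y, query[0], k=1)

# After scaling, both columns contribute on a comparable footing
scaled_X, scaled_query = standardize(train_X, query)
scaled_pred = knn_predict(scaled_X, train_y, scaled_query[0], k=1)
print(raw_pred, scaled_pred)
```

The query sits among the class 1 points on the first feature, yet the unscaled distance picks a class 0 neighbor because of a small difference in the second feature.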

Python code

```python
# Import library
from sklearn.neighbors import KNeighborsClassifier

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the KNeighbors classifier object
model = KNeighborsClassifier(n_neighbors=6)  # default n_neighbors is 5

# Train the model on the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
```

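R code

```r
library(class)

# knn() from the class package has no separate fitting step: it takes the
# training features, the test features and the training labels, and returns
# the predicted labels for the test set directly
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
```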

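### 7. K-Means

How K-means forms clusters:

1. Choose K data points as the initial centroids, one per cluster.
2. Assign each data point to its nearest centroid, which produces K clusters.
3. Compute the centroid of each new cluster; these become the new centroids.
4. Repeat steps 2 and 3 until convergence, that is, until the centroids no longer change.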
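The four steps above can be sketched directly in NumPy. This is a minimal illustration rather than the scikit-learn implementation; the function name `k_means`, its parameters, and the sample points are hypothetical.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

On data with less obvious structure the result depends on the initial centroids, which is why library implementations typically run several initializations.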

Python code

```python
# Import library
from sklearn.cluster import KMeans

# Assumes you have X (attributes) for the training set
# and x_test (attributes) for the test set

# Create the K-means object
model = KMeans(n_clusters=3, random_state=0)

# Fit the model on the training set
model.fit(X)

# Predict the cluster of each test point
predicted = model.predict(x_test)
```

R code

```r
# kmeans() is part of the stats package, which R loads by default
fit <- kmeans(X, 3)  # 3-cluster solution
```

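### 8. Random Forest

1. If the training set contains N samples, draw N samples at random with replacement. This sample is the training set for growing the tree.
2. If there are M features, choose a number m << M. At each node, m features are selected at random and used to split that node. The value of m is held constant while the forest is grown.
3. Each tree is grown to the largest extent possible, with no pruning.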
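The two sampling steps can be illustrated in a few lines of NumPy. This sketch shows only the random draws for one tree and one node, not a full forest; the sizes N and M are hypothetical, and taking m as the square root of M is a common convention assumed here rather than something stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 10, 16        # hypothetical: 10 training samples, 16 features
m = int(np.sqrt(M))  # assumed choice of m, with m much smaller than M

# Step 1: bootstrap sample of N row indices, drawn with replacement,
# so some rows repeat and others are left out
bootstrap_rows = rng.integers(0, N, size=N)

# Step 2: at a node, pick m distinct candidate features out of M;
# the split is then chosen among these candidates only
candidate_features = rng.choice(M, size=m, replace=False)

print(bootstrap_rows)
print(candidate_features)
```

In scikit-learn's `RandomForestClassifier`, these choices correspond to the `bootstrap` and `max_features` parameters, and leaving `max_depth=None` grows each tree without a depth limit.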


Python code

```python
# Import library
from sklearn.ensemble import RandomForestClassifier

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the random forest object
model = RandomForestClassifier()

# Train the model on the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
library(randomForest)
x <- cbind(x_train, y_train)

# Fit the model (y_train should be a factor for classification)
fit <- randomForest(y_train ~ ., x, ntree = 500)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
```

### 9. Dimensionality Reduction Algorithms

Python code

```python
# Import library
from sklearn import decomposition

# Assumes you have training and test datasets named train and test

# Create the PCA object; k is the number of components to keep.
# If n_components is omitted, it defaults to min(n_samples, n_features).
pca = decomposition.PCA(n_components=k)

# For factor analysis:
# fa = decomposition.FactorAnalysis()

# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)

# Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
```

R code

```r
library(stats)

# Fit PCA on the training data, then project both datasets
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)
```

Python code (gradient boosting classifier)

```python
# Import library
from sklearn.ensemble import GradientBoostingClassifier

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the gradient boosting classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                   max_depth=1, random_state=0)

# Train the model on the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code (gradient boosting via caret)

```r
library(caret)
x <- cbind(x_train, y_train)

# Fit the model with repeated cross-validation
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm",
             trControl = fitControl, verbose = FALSE)

# Predicted probability of the second class
predicted <- predict(fit, x_test, type = "prob")[, 2]
```
