# Essentials of 10 Machine Learning Algorithms (with Python and R Code)

1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. K-Nearest Neighbors (KNN)
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting

## 1. Linear Regression

The regression line is represented by the linear equation Y = a*x + b, where:

• Y: dependent variable
• a: slope
• x: independent variable
• b: intercept
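As a quick sanity check with made-up numbers, the slope and intercept of such a line can be recovered from data by least squares:

```python
import numpy as np

# Tiny hypothetical dataset lying exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit is ordinary least-squares linear regression;
# polyfit returns coefficients highest-degree first: (slope, intercept)
a, b = np.polyfit(x, y, deg=1)
```

Here `a` comes back as 2 and `b` as 1 (up to floating-point error), matching the line the data was generated from.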

Python code

```python
# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import linear_model

# Load Train and Test datasets
# Identify feature and response variable(s); values must be numeric numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets

# Create linear regression object
linear = linear_model.LinearRegression()

# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)

# Equation coefficient and intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

# Predict output
predicted = linear.predict(x_test)
```

R code

```r
# Load Train and Test datasets
# Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)

# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)

# Predict output
predicted <- predict(linear, x_test)
```

## 2. Logistic Regression

```
odds = p / (1 - p) = probability of event occurrence / probability of no event occurrence
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
```
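Plugging a hypothetical p = 0.8 into the formulas above:

```python
import math

p = 0.8                          # probability of the event
odds = p / (1 - p)               # 0.8 / 0.2 = 4
logit = math.log(p / (1 - p))    # ln(4) ≈ 1.386

# The inverse mapping (the sigmoid) turns the linear term
# b0 + b1*X1 + ... + bk*Xk back into a probability
p_back = 1 / (1 + math.exp(-logit))
```

`p_back` recovers the original 0.8, which is exactly what logistic regression does with its fitted linear predictor.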

Python code

```python
# Import Library
from sklearn.linear_model import LogisticRegression
# Assumed you have X (predictor) and Y (target) for the training data set
# and x_test (predictor) of the test dataset

# Create logistic regression object
model = LogisticRegression()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

# Equation coefficient and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
x <- cbind(x_train, y_train)

# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)

# Predict output
predicted <- predict(logistic, x_test)
```

Steps that can be tried to improve the model:

• include interaction terms
• remove features
• use regularization techniques
• use a non-linear model
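The first bullet (interaction terms) is plain feature engineering; a minimal sketch with hypothetical values, appending the product X1*X2 as a new column:

```python
import numpy as np

# Toy design matrix with two features (hypothetical values)
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Add an interaction term X1*X2 as a third column
interaction = (X[:, 0] * X[:, 1]).reshape(-1, 1)
X_aug = np.hstack([X, interaction])   # shape (2, 3)
```

The augmented matrix can then be passed to the same logistic regression fit as before.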

## 3. Decision Tree

Python code

```python
# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import tree
# Assumed you have X (predictor) and Y (target) for the training data set
# and x_test (predictor) of the test dataset

# Create tree object for classification; the criterion can be 'gini' or
# 'entropy' (information gain); by default it is 'gini'
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
library(rpart)
x <- cbind(x_train, y_train)

# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
```

## 4. SVM (Support Vector Machine)

• Instead of only being able to draw lines horizontally or vertically as before, you can now draw lines or planes at any angle.
• The goal of the game becomes separating the balls of different colors into different regions.
• The positions of the balls do not change.

Python code

```python
# Import Library
from sklearn import svm
# Assumed you have X (predictor) and Y (target) for the training data set
# and x_test (predictor) of the test dataset

# Create SVM classification object; there are various options associated
# with it, this is a simple one for classification
model = svm.SVC()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
library(e1071)
x <- cbind(x_train, y_train)

# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
```

## 5. Naive Bayes

By Bayes' theorem, P(c|x) = P(x|c) P(c) / P(x), where:

• P(c|x) is the posterior probability of the class (target) given the predictor (attribute)
• P(c) is the prior probability of the class
• P(x|c) is the likelihood, i.e. the probability of the predictor given the class
• P(x) is the prior probability of the predictor
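A worked number check of Bayes' theorem with the four terms above (all probabilities hypothetical):

```python
# P(c|x) = P(x|c) * P(c) / P(x)
p_c = 0.3           # prior probability of the class, P(c)
p_x_given_c = 0.5   # likelihood, P(x|c)
p_x = 0.25          # prior probability of the predictor, P(x)

p_c_given_x = p_x_given_c * p_c / p_x   # posterior = 0.6
```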

Python code

```python
# Import Library
from sklearn.naive_bayes import GaussianNB
# Assumed you have X (predictor) and Y (target) for the training data set
# and x_test (predictor) of the test dataset

# Create a Gaussian Naive Bayes object; there are other distributions for
# multinomial classes, like Bernoulli Naive Bayes
model = GaussianNB()

# Train the model using the training sets and check score
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
library(e1071)
x <- cbind(x_train, y_train)

# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
```

## 6. KNN (K-Nearest Neighbors)

• KNN is computationally expensive.
• Variables should be normalized first, or else variables with larger ranges will bias the result.
• Before using KNN, put extra effort into preprocessing such as outlier removal and noise removal.
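The normalization in bullet 2 can be as simple as a z-score per column; a sketch in plain NumPy with made-up data on very different scales:

```python
import numpy as np

# Two features on wildly different ranges (hypothetical data);
# without scaling, the second column would dominate any distance metric
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

# z-score: subtract the column mean, divide by the column std
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this, every column has mean 0 and standard deviation 1, so each feature contributes comparably to the KNN distance.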

#### Python code

```python
# Import Library
from sklearn.neighbors import KNeighborsClassifier
# Assumed you have X (predictor) and Y (target) for the training data set
# and x_test (predictor) of the test dataset

# Create KNeighbors classifier object; default value for n_neighbors is 5
model = KNeighborsClassifier(n_neighbors=6)

# Train the model using the training sets and check score
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
# knn() lives in the class package (there is no 'knn' package);
# it trains and predicts in one call rather than via a formula interface
library(class)
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
```

## 7. K-Means

K-means is an unsupervised learning algorithm that solves clustering problems. It follows a simple procedure to group a data set into a certain number of clusters (assume k clusters). Data points inside one cluster are homogeneous, and heterogeneous with respect to other clusters.

How K-means forms clusters:

1. K-means picks k points, known as centroids, one per cluster.
2. Each data point forms a cluster with the closest centroid, giving k clusters.
3. The centroid of each cluster is recomputed from its current members. Now we have new centroids.
4. With the new centroids, repeat steps 2 and 3: find the closest centroid for each data point and associate it with the new k clusters. Repeat this process until convergence, i.e. until the centroids no longer change.

In K-means each cluster has its own centroid. The sum of squared distances between the centroid and the data points within a cluster constitutes that cluster's sum of squares. Adding up the sums of squares of all the clusters gives the total within-cluster sum of squares for the cluster solution.
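The within-cluster sum of squares described above can be computed directly; a sketch with two hand-made 1-D clusters:

```python
import numpy as np

# Two tiny clusters (hypothetical points); the centroid is each cluster's mean
clusters = [np.array([1.0, 2.0, 3.0]), np.array([10.0, 12.0])]

total_wss = 0.0
for pts in clusters:
    centroid = pts.mean()
    total_wss += ((pts - centroid) ** 2).sum()  # this cluster's sum of squares
```

K-means tries to make `total_wss` small; plotting it against k (the "elbow" plot) is the usual way to pick the number of clusters.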

#### Python code

```python
# Import Library
from sklearn.cluster import KMeans
# Assumed you have X (attributes) for the training data set
# and x_test (attributes) of the test dataset

# Create KMeans object
model = KMeans(n_clusters=3, random_state=0)

# Train the model using the training sets and check score
model.fit(X)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
library(cluster)
fit <- kmeans(X, 3)  # 3 cluster solution
```

## 8. Random Forest

1. If the number of cases in the training set is N, a sample of N cases is taken at random with replacement. This sample is the training set for growing the tree.
2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. m is held constant while the forest grows.
3. Each tree is grown as large as possible. There is no pruning.
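Step 1 above is plain bootstrap sampling; a sketch of drawing N cases with replacement:

```python
import random

random.seed(0)
N = 10
cases = list(range(N))  # indices of the N training cases

# Bootstrap sample: N draws with replacement; duplicates are expected,
# and on average about 37% of the cases are left out of each sample
bootstrap = [random.choice(cases) for _ in range(N)]
```

Each tree in the forest gets its own such sample, which is what de-correlates the trees.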

1. Introduction to Random Forests: a simplified version
2. Comparing a CART model to Random Forest (Part 1)
3. Comparing Random Forest to a CART model (Part 2)
4. Tuning the parameters of your Random Forest model

Python code

```python
# Import Library
from sklearn.ensemble import RandomForestClassifier
# Assumed you have X (predictor) and Y (target) for the training data set
# and x_test (predictor) of the test dataset

# Create Random Forest object
model = RandomForestClassifier()

# Train the model using the training sets and check score
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
```

R code

```r
library(randomForest)
x <- cbind(x_train, y_train)

# Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)

# Predict output
predicted <- predict(fit, x_test)
```

## 9. Dimensionality Reduction Algorithms

Python code

```python
# Import Library
from sklearn import decomposition
# Assumed you have training and test data sets as train and test

# Create PCA object; default value of k = min(n_sample, n_features)
pca = decomposition.PCA(n_components=k)

# For Factor analysis
# fa = decomposition.FactorAnalysis()

# Reduce the dimension of the training dataset using PCA
train_reduced = pca.fit_transform(train)

# Reduce the dimension of the test dataset
test_reduced = pca.transform(test)
```
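How `n_components` (the k above) is usually chosen: keep enough components to cover some target share of the variance, say 95%. A sketch of that rule via NumPy's SVD on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # stand-in for the training data
Xc = X - X.mean(axis=0)         # PCA requires centered data

# Squared singular values are proportional to the variance
# explained by each principal component
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / (s**2).sum()

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
```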

R code

```r
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)
```

## 10. Gradient Boosting

#### Python code

```python
# Import Library
from sklearn.ensemble import GradientBoostingClassifier
# Assumed you have X (predictor) and Y (target) for the training data set
# and x_test (predictor) of the test dataset

# Create Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                   max_depth=1, random_state=0)

# Train the model using the training sets and check score
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
```

#### R code

```r
library(caret)
x <- cbind(x_train, y_train)

# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)

# Predict output
predicted <- predict(fit, x_test, type = "prob")[,2]
```
