A conceptual implementation of the k-nearest neighbors (KNN) classifier with sklearn

Concept

The KNN (k-nearest neighbors) classifier is among the simplest machine learning algorithms; it is a non-parametric, instance-based method rather than one that fits an explicit probabilistic model. The basic idea: at prediction time, compute the Euclidean (geometric) distance from the input vector to every training sample, take the K nearest training samples, and predict whichever class appears most often among those K samples (majority vote).
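
To make the voting idea concrete, here is a minimal from-scratch sketch of a single KNN prediction (not the sklearn implementation), assuming NumPy arrays X_train and y_train and a query vector x:

import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    # Euclidean distance from the query point x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # majority vote over the labels of the k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]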

Code implementation

Loading the dataset: the iris dataset

from sklearn.datasets import load_iris
dataset = load_iris()
print(dataset.data.shape)  # 150 samples, 4 features
print(dataset.DESCR)       # full dataset description
(150, 4)
Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
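
Before preprocessing, it can help to sanity-check the class balance. A small optional sketch (not in the original walkthrough) using the attributes load_iris exposes:

import numpy as np
print(dataset.feature_names)        # the four measurement columns
print(dataset.target_names)         # setosa, versicolor, virginica
print(np.bincount(dataset.target))  # expect 50 samples per class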

Data preprocessing

Splitting the data

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer sklearn
x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.25, random_state=1)
print(x_train.shape)
print(x_test.shape)
(112, 4)
(38, 4)

Standardization

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training set only, then apply the same transform to the test set
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
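
As a quick sanity check (optional, not in the original walkthrough), each feature of the scaled training set should now have roughly zero mean and unit standard deviation:

print(x_train.mean(axis=0))  # ~0 for every feature
print(x_train.std(axis=0))   # ~1 for every feature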

Calling the k-nearest neighbors classifier

from sklearn.neighbors import KNeighborsClassifier
# Defaults: n_neighbors=5, Minkowski metric with p=2 (i.e. Euclidean distance)
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
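
The walkthrough keeps the default n_neighbors=5. If you wanted to tune K instead, one option is a cross-validated grid search; a sketch under the assumption that the same x_train and y_train are in scope:

from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': list(range(1, 16)), 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)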

Model evaluation

Built-in scoring

print(knn.score(x_test, y_test))  # mean accuracy on the test set
0.973684210526

Evaluation with the metrics module

from sklearn.metrics import classification_report
y_pre = knn.predict(x_test)
print(classification_report(y_test, y_pre, target_names=dataset.target_names))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        13
 versicolor       1.00      0.94      0.97        16
  virginica       0.90      1.00      0.95         9

avg / total       0.98      0.97      0.97        38
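
To see exactly which classes get confused with which, a confusion matrix complements the report (a small addition, not in the original):

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pre))  # rows: true class, columns: predicted class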
