
Machine Learning Notes: KNN Classification

Author: 数据小磨坊
Published 2018-07-25 11:51:47
From the column: 数据小魔方

The KNN classifier is one of the most intuitive supervised classification algorithms and is widely used across classification tasks.

The core idea of KNN is simple: a point is judged by the company it keeps. For each test sample, the algorithm computes the Euclidean distance to every training sample, takes the K nearest training points (K is a user-chosen neighbour count, and its value affects the result), tallies the class frequencies among those K points, and assigns the most frequent class as the predicted class of the test sample.

This means every test point must compute a distance to every training point, so the algorithm is computationally expensive.

Its pseudocode is as follows:

  1. Compute the distance between the current point and every point in the labelled dataset;
  2. Sort the distances in increasing order;
  3. Take the k points nearest to the current point;
  4. Count the frequency of each class among those k points;
  5. Return the most frequent class as the predicted class of the current point.
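The five steps above can be sketched in a few lines of Python (a minimal illustration with made-up 2-D points and a throwaway `knn_predict` helper, not the full implementation shown later):

```python
import numpy as np
from collections import Counter

def knn_predict(x, train_x, train_y, k=3):
    # Step 1: Euclidean distance from x to every training point
    dists = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    # Steps 2-3: indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: majority vote among their labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_x = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array(["A", "A", "B", "B"])
print(knn_predict(np.array([0.2, 0.1]), train_x, train_y, k=3))  # prints A
```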

Its main advantage is that it is simple to understand and requires no training phase. On the other hand, the result is sensitive to the class distribution of the training data (class imbalance skews the vote); the chosen value of k also changes the final classification; and as the training and test sets grow, the time and memory costs grow with them.

The distance metric in KNN can be Euclidean distance or cosine distance (common in text classification). Euclidean distance is affected by the magnitude of each feature, so the data should be standardized before training.
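Why standardization matters for Euclidean distance can be seen in a small sketch (the numbers are made up; `euclid` is a throwaway helper):

```python
import numpy as np

# Two features on very different scales, e.g. income vs. a ratio in [0, 1]
a = np.array([50_000.0, 0.20])
b = np.array([50_100.0, 0.90])
c = np.array([52_000.0, 0.21])

def euclid(u, v):
    return np.sqrt(((u - v) ** 2).sum())

# Raw distances: the large-magnitude feature dominates entirely,
# so b looks 20x closer to a than c does, despite its very different ratio
print(euclid(a, b), euclid(a, c))

# Standardize each feature (zero mean, unit variance), as R's scale() does;
# after scaling, both features contribute comparably to the distance
X = np.vstack([a, b, c])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclid(Xs[0], Xs[1]), euclid(Xs[0], Xs[2]))
```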

This exercise uses the iris dataset (clean, well structured, and small enough to train on a single machine).

R Code:

## !/user/bin/env RStudio 1.1.423
## -*- coding: utf-8 -*-
## KNN Model

library("dplyr")
library('caret')

rm(list = ls())
gc()

# Data preparation: load, standardize, split train/test, separate features and labels

Data_Input <- function(file_path = "D:/R/File/iris.csv", p = .75){
    data = read.csv(file_path, stringsAsFactors = FALSE, check.names = FALSE)
    names(data) <- c('sepal_length','sepal_width','petal_length','petal_width','class')
    data[,-ncol(data)] <- scale(data[,-ncol(data)])
    data['class_c'] = as.numeric(as.factor(data$class))
    x = data[,1:(ncol(data)-2)]; y = data$class_c
    samples = sample(nrow(data), floor(p * nrow(data)))
    train_data = x[samples,];  train_target = y[samples]
    test_data  = x[-samples,]; test_target  = y[-samples]
    return(
        list(
            data = data,
            train_data = train_data, 
            test_data = test_data,
            train_target = train_target,
            test_target = test_target
            )
        )
 }
    # Classifier: compute distances, sort, count class frequencies, return the modal class

kNN_Classify <- function(test_data, test_target, train_data, train_target, k){
    # step 1: compute Euclidean distances (test_target is unused here)
    centr_matrix = unlist(rep(test_data, times = nrow(train_data)), use.names = FALSE) %>%
        matrix(byrow = TRUE, ncol = ncol(train_data))
    diff = as.matrix(train_data) - centr_matrix
    squaredDist = apply(diff^2, 1, sum)
    distance = as.numeric(squaredDist ^ 0.5)
    # step 2: sort the distances (order() is robust to ties, unlike rank())
    sortedDistIndices = order(distance)
    # step 3: take the labels of the k nearest neighbours
    classCount = train_target[sortedDistIndices[1:k]]
    # step 4: count class frequencies and return the most frequent class
    Max_count = plyr::count(classCount) %>% arrange(-freq) %>% .[1,1]
    return(Max_count)
}

data_source  <- Data_Input()
train_data   <- data_source$train_data
test_data  <- data_source$test_data
train_target   <- data_source$train_target
test_target <- data_source$test_target
    
# Single-sample test

kNN_Classify(
    test_data = test_data[1,] ,
    test_target = test_target,
    train_data = train_data,
    train_target = train_target,
    k = 5
    )

# Full-sample classification: scan every test point, output the confusion matrix and predictions

datingClassTest <- function(test_data, train_data, train_target, test_target, k = 5){
    m = nrow(test_data)
    errorCount = 0
    test_predict = c()
    for (i in 1:m){
        classifierResult = kNN_Classify(
            test_data    = test_data[i,],
            train_data   = train_data,
            train_target = train_target,
            k = k
            )
        if (classifierResult != test_target[i]){
            errorCount = errorCount + 1
        }
        test_predict = c(test_predict, classifierResult)
    }
    test_data[['test_predict']] = test_predict
    test_data[['class']] = test_target
    target_names = c('setosa', 'versicolor', 'virginica')
    # fix the levels explicitly so the two factors line up even if a class is absent
    print(confusionMatrix(factor(test_predict, levels = 1:3, labels = target_names),
                          factor(test_target, levels = 1:3, labels = target_names)))
    confusion_matrix = table(test_predict, test_target)
    dimnames(confusion_matrix) <- list(target_names, target_names)
    cat(sprintf("in datingClassTest, the total error rate is: %f", errorCount/m), sep = '\n')
    # %d requires an integer; sprintf() errors on a double here
    cat(sprintf("in datingClassTest, errorCount: %d", as.integer(errorCount)), sep = '\n')
    return(list(test_data = test_data, confusion_matrix = confusion_matrix))
}
    
# Run the classification task
result <- datingClassTest(
    test_data = test_data,
    train_data = train_data,
    train_target = train_target,
    test_target = test_target
    )

Collect the predictions and print the confusion matrix:
result$test_data
result$confusion_matrix

From this run, the overall accuracy is 92.1%: three points were misclassified, an error rate of 7.89%. Because the train/test split is random, the class balance of the sample varies and each run may give a different result (set a random seed to make the sampling reproducible). The value of K should be chosen by cross-validation.
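That K-selection step can be sketched with scikit-learn's off-the-shelf `KNeighborsClassifier` and `cross_val_score` (using sklearn's bundled iris copy rather than the local CSV, so it runs stand-alone; this swaps in the library classifier for the hand-written one):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import scale

X, y = load_iris(return_X_y=True)
X = scale(X)  # standardize features first, as in the article

# 5-fold cross-validated accuracy for each candidate k
scores = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```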

Python:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import time
import csv
import numpy as np
import pandas as pd
from collections import Counter
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# note: `from time import time` would shadow the time module used below,
# and the unused tile/OrderedDict/neighbors/confusion_matrix imports are dropped

'''KNN classifier'''

## Load the data, split train/test samples, and standardize:

def Data_Input():
    data = pd.read_csv("D:/Python/File/iris.csv")
    data.columns = ['sepal_length','sepal_width','petal_length','petal_width','class']
    data.iloc[:,0:-1] = preprocessing.scale(data.iloc[:,0:-1])
    data['class_c'] =  pd.factorize(data['class'])[0]
    x,y = data.iloc[:,0:-2],data.iloc[:,-1]  
    print(data.shape,'\n',data.head())
    train_data,test_data,train_target,test_target = train_test_split(x,y,test_size = 0.25)
    return train_data,test_data,train_target,test_target
    
# KNN classifier function
def kNN_Classify(test_data, train_data, train_target, k):
    # step 1: compute Euclidean distances
    # (test_data is a (1, n_features) array, tiled against train_data)
    diff = train_data - np.repeat(test_data, repeats=len(train_data), axis=0)
    squaredDist = np.sum(diff ** 2, axis=1)
    distance = squaredDist ** 0.5

    # step 2: sort the distances
    sortedDistIndices = np.argsort(distance.values)
    classCount = []
    for i in range(k):
        # step 3: take the k nearest neighbours
        target_sort = train_target.values[sortedDistIndices[i]]
        classCount.append(target_sort)
    # step 4: count each class among the k nearest neighbours
    counter = Counter(classCount)
    # step 5: return the most frequent class label
    Max_count = counter.most_common(1)[0][0]
    return Max_count
    
# Single-sample test (interactive example; run after Data_Input() has produced
# test_data / train_data / train_target, otherwise these names are undefined):
# kNN_Classify(test_data.values[0].reshape(1, 4), train_data, train_target, k=5)

# Full-sample classifier: output predictions and the confusion matrix:

def datingClassTest(test_data, train_data, train_target, test_target, k=5):
    m = test_data.shape[0]
    w = test_data.shape[1]
    errorCount = 0.0
    test_predict = []
    for i in range(m):
        classifierResult = kNN_Classify(
            test_data    = test_data.values[i].reshape(1, w),
            train_data   = train_data,
            train_target = train_target,
            k = k
            )
        if classifierResult != test_target.values[i]:
            errorCount += 1.0
        test_predict.append(classifierResult)
    test_data = test_data.copy()  # avoid mutating the caller's frame
    test_data['test_predict'] = test_predict
    test_data['class'] = test_target
    # cross-tabulate true labels against predictions
    # (the original passed train_target here, which has the wrong length)
    confusion_matrix = pd.crosstab(test_target, np.array(test_predict))
    print("in datingClassTest, the total error rate is: %f" % (errorCount / float(m)))
    print('in datingClassTest, errorCount:', errorCount)
    target_names = ['setosa', 'versicolor', 'virginica']
    print(classification_report(test_target, test_predict, target_names=target_names))
    return test_data, confusion_matrix
    
# Run the classification task and write the results to disk:

if __name__ == "__main__":
    # start timing
    t0 = time.time()
    train_data, test_data, train_target, test_target = Data_Input()
    test_result, confusion_matrix = datingClassTest(test_data, train_data, train_target, test_target, k=5)
    name = "KNN" + str(int(time.time())) + ".csv"
    print("Generating results file:", name)
    with open("D:/Python/File/" + name, "w", newline='') as csvfile:
        open_file_object = csv.writer(csvfile)
        open_file_object.writerow(['sepal_length','sepal_width','petal_length','petal_width','test_predict','class'])
        open_file_object.writerows(test_result.values)
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))

This is only a first attempt at hand-writing KNN; the code is not yet well encapsulated or tuned. It is a small start to hands-on coding, and future posts will focus more on feature selection and model optimization.

Reference: https://www.cnblogs.com/ybjourney/p/4702562.html

GitHub source:

https://github.com/ljtyduyu/MachineLearning/tree/master/Model/KNN

Originally published 2018-06-24 on the 数据小魔方 WeChat public account.