首页
学习
活动
专区
工具
TVP
发布
精选内容/技术社群/优惠产品,尽在小程序
立即前往

Machine Learning in Action(3)——KNN algorithm in practice(1)

1. Testing algorithm(classifier) —— error rate

To test out a classifier, we start with some known data so we can hide the answer from the classifier and ask the classifier for its best guess. We can add up the number of times the classifier was wrong and divide it by the total number of tests we gave it. This will give us the error rate, which is a common measure to gauge how good a classifier is doing on a data set. An error rate of 0 means we have a perfect classifier, and an error rate of 1.0 means the classifier is always wrong.

2. Main steps when applying KNN algorithm in practice.

(1) In most cases data collected is in a text file, so how to process the text with python, extract data from the text. We make an assumption that each line in the text file represents a piece of data.

Common way : open file -> loop all the lines of the file-> for each line, use strip() and split() to split the line into a list of elements.

(2) Analyze the data with making plot of data. Create scatter plots with Matplotlib. When we make plots for the data, we should consider the relationship between different features and how to distinguish labels of each piece of data.

(3) In some cases it’s important to normalize the data set first to eliminate the dimensional effects.

(4) Choose the test examples and the training examples in the data set. And use the simple KNN algorithm above to train algorithm and calculate the error rate.

(5) Use the algorithm to predict class label for a newly given data.

3. Using KNN on results from a dating site

Description :

Hellen wants us to help her classify boys she meet in the dating site so as to decide whom she is going to date with.

Now she has collected some information(three features) about the boys she has dated with in the past and has given every boy a label according to her degree of love (three labels) in advance.

Hellen has been collecting these data for a while and has 1,000 entries. A new sample is on each line, and Hellen has recorded the following features:

■ Number of frequent flyer miles earnedper year

■ Percentage of time spent playingvideo games

■ Liters of ice cream consumed per week

Three labels are:

■ People she didn’t like

■ People she liked in small doses

■ People she liked in large doses

Recently she meets another boy and she hope that we can make a decision for her.

Implementation:

(1) Parsing data from a text file——extract data from text file to a matrix of training example and a vector of class labels.

(2) Creating scatter plots with matplotlib

In this code segment, we use the function pyplot.scatter() and pyplot.subplot()

(3) Normalizethe dataset.When calculating the distance the largest-value term is going to make the most difference. That is the largest-value term will dominate even though the other terms have the largest differences.But actually unless it’s specified, every term should be at the same importance. Thus we should normalize the data set to make all terms equally important. And a common way to normalize is to normalize them to ranges 0 to 1 or -1 to 1. To scale everything from 0 to 1, we need to apply the following formula:

newValue = (oldValue-min)/(max-min)

Notice that we’d better return not only the normMat but also the ranges and minimum for future using in testing the algorithm. Because not only the training example need to be normalized, test examples also need to be normalized before testing.

Annotations about functions of python:

A. functionmax() and function min()

B. constant shape of array and matrix and function shape() of iterable object , constant shape is of type tuple and the return type of the function shape() is also tuple.

(4) Testing the classifier

One common task in machine learning is evaluating an algorithm’s accuracy. One way you can use the existing data is to take some portion, say 90%, to train the classifier. Then you’ll take the remaining 10% to test the classifier and see how accurate it is. The 10% to be held back should be randomly selected. Our data isn’t stored in a specific sequence, so you can take the first 10% or last 10% without upsetting any statistics professors.

We will measure the performance of a classifier with the error rate.

Noticethat function KNN.classify0() is the core KNN algorithm built at the beginning . It is used to calculate the distances and choose the top k similar examples and decide the class label.

(5) Putting together a useful system.

  • 发表于:
  • 原文链接https://kuaibao.qq.com/s/20180710G0D1U000?refer=cp_1026
  • 腾讯「腾讯云开发者社区」是腾讯内容开放平台帐号(企鹅号)传播渠道之一,根据《腾讯内容开放平台服务协议》转载发布内容。
  • 如有侵权,请联系 cloudcommunity@tencent.com 删除。

扫码

添加站长 进交流群

领取专属 10元无门槛券

私享最新 技术干货

扫码加入开发者社群
领券