We can see that the decision boundaries above are not very good: although each of them separates the dataset completely, they are clearly not optimal.
In the figure, beta is perpendicular to the separating line (w here plays the role of beta). From the figure we can see that once we obtain w (or beta) and compute the bias b, we have the decision boundary for the dataset.
This is an optimization problem with inequality constraints. Using the method of Lagrange multipliers and the corresponding dual problem, we can transform the optimization objective so as to maximize the margin between the two classes.
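For reference, the standard hard-margin formulation and its Lagrangian dual (textbook equations, not part of this report's code):

```latex
% Hard-margin primal: maximize the margin 2/\|w\|
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b)\ \ge\ 1,\qquad i = 1,\dots,m

% Dual, obtained via Lagrange multipliers \alpha_i \ge 0
\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i
  \;-\; \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j\,y_i y_j\,x_i^\top x_j
\quad\text{s.t.}\quad \sum_{i=1}^{m}\alpha_i y_i = 0,\ \ \alpha_i \ge 0
```

Samples with \(\alpha_i > 0\) are the support vectors, and the weight vector is recovered as \(w = \sum_i \alpha_i y_i x_i\) (this is what the commented-out W computation in the appendix sketches).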
We use LIBSVM in MATLAB to perform linear SVM classification of the dataset. LIBSVM is a simple, easy-to-use, and efficient software package for SVM classification and regression, developed by Prof. Lin Chih-Jen's group at National Taiwan University.
LIBSVM must be installed first; installation is straightforward and is not covered here.
[trainlabels, trainfeatures] = libsvmread('twofeature.txt');
libsvmread imports the data and returns two variables: the training-set labels and the training-set features.
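For context, each line of a LIBSVM-format file is `label index:value index:value ...`, where only nonzero features are listed (which is why libsvmread returns a sparse matrix). The values below are made up for illustration, not taken from twofeature.txt:

```
+1 1:2.30 2:1.10
-1 1:0.52 2:3.27
```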
The dataset is visualized as follows:
pos = find(trainlabels == 1); neg = find(trainlabels == -1);
scatter(trainfeatures(pos,1), trainfeatures(pos,2), 36, 'b', 'filled'); hold on
scatter(trainfeatures(neg,1), trainfeatures(neg,2), 36, 'g', 'filled');
legend({'pos','neg'});
title('dataset');
xlabel('xfeature');
ylabel('yfeature');
Note that trainfeatures is returned as a sparse matrix, as shown below:
It cannot be fed directly into SVM model training; it must first be converted to an ordinary (full) matrix.
% Convert the sparse feature matrix to a full 51x2 matrix
x = full(trainfeatures);
Training with the fitcsvm function:
model = fitcsvm(x,trainlabels,'KernelFunction','linear','BoxConstraint',1);
Set the parameters: the kernel is linear, and BoxConstraint is set to 1.
The return value model is a ClassificationSVM object.
sv=model.SupportVectors;
figure;
scatter(trainfeatures(pos,1), trainfeatures(pos,2), 36, 'b', 'filled'); hold on
scatter(trainfeatures(neg,1), trainfeatures(neg,2), 36, 'g', 'filled');
title('dataset');
xlabel('xfeature');
ylabel('yfeature');
plot(sv(:,1), sv(:,2), 'ro', 'MarkerSize', 10);
legend({'pos','neg','support vectors'});
The support vectors are marked as follows:
Computing the decision boundary:
beta = model.Beta;
b = model.Bias;
x_plot = linspace(0, 4.5, 200);    % x range for plotting
y_plot = -(beta(1)/beta(2))*x_plot - b/beta(2);
Plotting the decision boundary:
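The plotting formula follows directly from the boundary equation \(\beta^\top x + b = 0\) written out in the two feature coordinates:

```latex
\beta_1 x + \beta_2 y + b = 0
\;\Longrightarrow\;
y = -\frac{\beta_1}{\beta_2}\,x \;-\; \frac{b}{\beta_2}
```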
where

beta = [1.4068; 2.1334]
b = -10.3460
The function call is essentially the same as above; only the BoxConstraint parameter is changed (here, to 100).
This gives

beta = [4.6826; 13.0917]
b = -53.1399
The resulting decision boundary is plotted below:
Comparing the two results, we can clearly see that when C is very large, the objective places only a small relative weight on achieving a large margin and instead strives for higher training accuracy; the resulting decision boundary, however, generalizes less well.
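This trade-off is visible in the standard soft-margin objective (fitcsvm's BoxConstraint corresponds to C here):

```latex
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b)\ \ge\ 1 - \xi_i,\ \ \xi_i \ge 0
```

A large C penalizes the slack variables \(\xi_i\) heavily, pushing toward zero training error; a small C tolerates margin violations in exchange for a wider margin.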
Note: plotting the new support vectors
We find that the support vectors did not change between the two runs.
Next, we use a linear SVM to classify each of the four training sets and evaluate each model on the test set.
Define the function SVM:
function [model] = SVM(path)
% Read a training set in LIBSVM format
[trainlabels, trainfeatures] = libsvmread(path);
[m1, n1] = size(trainfeatures);
% Pad the features out to the full 2500-word dictionary width,
% since a small training set may not mention every word
x = zeros(m1, 2500);
x(:, 1:n1) = full(trainfeatures);
model = fitcsvm(x, trainlabels, 'KernelFunction', 'linear', 'BoxConstraint', 1);
% (optional) training accuracy: mean(predict(model, x) == trainlabels)
end
Define the evaluation function on the test set:
function [] = evaluation(model)
% Read the test set and pad it to the same 2500-column width
[testlabels, testfeatures] = libsvmread('email_test.txt');
[m_test, n_test] = size(testfeatures);
test_x = zeros(m_test, 2500);
test_x(:, 1:n_test) = full(testfeatures);
label = predict(model, test_x);
cnt = sum(label == testlabels);   % number of correctly classified samples
disp(cnt);
accuracy = cnt / m_test;
disp(accuracy)
end
Define the path names and run the test evaluation:
clc,clear;
train50='email_train-50.txt';
train100='email_train-100.txt';
train400='email_train-400.txt';
train='email_train-all.txt';
model=SVM(train50);
evaluation(model);
model=SVM(train100);
evaluation(model);
model=SVM(train400);
evaluation(model);
model=SVM(train);
evaluation(model);
With training-set sizes of 50, 100, 400, and all, the results on the 260-sample test set are:

train-50:  196 correct, accuracy 0.7538
train-100: 230 correct, accuracy 0.8846
train-400: 255 correct, accuracy 0.9808
train-all: 256 correct, accuracy 0.9846
We can see that the larger the training set, the better the classification on the test set and the higher the accuracy.
Appendix: program source code
SVM1_two_features
clc,clear;
[trainlabels, trainfeatures] = libsvmread('twofeature.txt');
pos = find(trainlabels == 1); neg = find(trainlabels == -1);
scatter(trainfeatures(pos,1), trainfeatures(pos,2), 36, 'b', 'filled'); hold on
scatter(trainfeatures(neg,1), trainfeatures(neg,2), 36, 'g', 'filled');
legend({'pos','neg'});
title('dataset');
xlabel('xfeature');
ylabel('yfeature');
% Convert the sparse feature matrix to a full 51x2 matrix
x = full(trainfeatures);
model = fitcsvm(x,trainlabels,'KernelFunction','linear','BoxConstraint',1);
sv=model.SupportVectors;
figure;
scatter(trainfeatures(pos,1), trainfeatures(pos,2), 36, 'b', 'filled'); hold on
scatter(trainfeatures(neg,1), trainfeatures(neg,2), 36, 'g', 'filled');
title('dataset');
xlabel('xfeature');
ylabel('yfeature');
plot(sv(:,1), sv(:,2), 'ro', 'MarkerSize', 10);
legend({'pos','neg','support vectors'});
x_plot = linspace(0, 4.5, 200);
la = model.SupportVectorLabels;
% w can also be recovered from the dual solution:
% alpha = model.Alpha;
% W = alpha .* sv .* la;   % 12x1 .* 12x2 .* 12x1
% w = sum(W);
beta = model.Beta;
b = model.Bias;
y_plot = -(beta(1)/beta(2))*x_plot - b/beta(2);
figure;
scatter(trainfeatures(pos,1), trainfeatures(pos,2), 36, 'b', 'filled'); hold on
scatter(trainfeatures(neg,1), trainfeatures(neg,2), 36, 'g', 'filled');
title('dataset');
xlabel('xfeature');
ylabel('yfeature');
pre_pos = find(la == 1); pre_neg = find(la == -1);
plot(sv(pre_neg,1), sv(pre_neg,2), 'ro', 'MarkerSize', 10);
plot(sv(pre_pos,1), sv(pre_pos,2), 'ko', 'MarkerSize', 10);
plot(x_plot, y_plot);
legend({'pos','neg','support vectors','support vectors','Decision boundary C=1'});
model1 = fitcsvm(x, trainlabels, 'KernelFunction', ...
    'linear', 'BoxConstraint', 100);
la = model1.SupportVectorLabels;
sv = model1.SupportVectors;
pre_pos = find(la == 1); pre_neg = find(la == -1);
plot(sv(pre_neg,1), sv(pre_neg,2), 'go', 'MarkerSize', 20);
plot(sv(pre_pos,1), sv(pre_pos,2), 'bo', 'MarkerSize', 20);
beta = model1.Beta;
b = model1.Bias;
y_plot1 = -(beta(1)/beta(2))*x_plot - b/beta(2);
plot(x_plot, y_plot1);
legend({'pos','neg','support vectors','support vectors','Decision boundary C=1','Decision boundary C=100'});
SVM2_text_classification
clc,clear;
train50='email_train-50.txt';
train100='email_train-100.txt';
train400='email_train-400.txt';
train='email_train-all.txt';
model=SVM(train50);
evaluation(model);
model=SVM(train100);
evaluation(model);
model=SVM(train400);
evaluation(model);
model=SVM(train);
evaluation(model);
function [model] = SVM(path)
% Read a training set in LIBSVM format
[trainlabels, trainfeatures] = libsvmread(path);
[m1, n1] = size(trainfeatures);
% Pad the features out to the full 2500-word dictionary width
x = zeros(m1, 2500);
x(:, 1:n1) = full(trainfeatures);
model = fitcsvm(x, trainlabels, 'KernelFunction', 'linear', 'BoxConstraint', 1);
% (optional) training accuracy: mean(predict(model, x) == trainlabels)
end
function [] = evaluation(model)
% Read the test set and pad it to the same 2500-column width
[testlabels, testfeatures] = libsvmread('email_test.txt');
[m_test, n_test] = size(testfeatures);
test_x = zeros(m_test, 2500);
test_x(:, 1:n_test) = full(testfeatures);
label = predict(model, test_x);
cnt = sum(label == testlabels);   % number of correctly classified samples
disp(cnt);
accuracy = cnt / m_test;
disp(accuracy)
end