A spam classifier implemented with naive Bayes. The solution code is as follows:
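% ---------- Training ----------
numTrainDocs = 700;
numTokens = 2500;

% Each row of the features file is (document index, token index, count).
M = dlmread('F:\machine\ex6DataPrepared\train-features.txt', ' ');
spmatrix = sparse(M(:,1), M(:,2), M(:,3), numTrainDocs, numTokens);
train_matrix = full(spmatrix);
y = dlmread('F:\machine\ex6DataPrepared\train-labels.txt', ' ');

spam = find(y == 1);                 % row indices of spam documents
nonspam = find(y == 0);              % row indices of non-spam documents
p_y = length(spam) / numTrainDocs;   % prior probability of spam

% Count how often each token occurs in spam and in non-spam documents.
xofspam = zeros(numTokens, 1);
xofnonspam = zeros(numTokens, 1);
for i = 1:numTokens
    xofspam(i,1) = sum(train_matrix(spam,i));
    xofnonspam(i,1) = sum(train_matrix(nonspam,i));
end

% Laplace-smoothed token probabilities for each class.
word = sum(train_matrix, 2);         % total number of words in each document
fi_y1 = (xofspam + 1) ./ (sum(word(spam)) + numTokens);
fi_y0 = (xofnonspam + 1) ./ (sum(word(nonspam)) + numTokens);

% ---------- Testing ----------
numTestDocs = 260;
M = dlmread('F:\machine\ex6DataPrepared\test-features.txt', ' ');
test_spmatrix = sparse(M(:,1), M(:,2), M(:,3), numTestDocs, numTokens);
test_matrix = full(test_spmatrix);

% Compare the log-posteriors of the two classes.
% The prior terms cancel when both classes are equally frequent.
a = test_matrix * log(fi_y1) + log(p_y);
b = test_matrix * log(fi_y0) + log(1 - p_y);
test_result = a > b;

test_labels = dlmread('F:\machine\ex6DataPrepared\test-labels.txt', ' ');
% Number of misclassified test documents (no trailing semicolon, so it is displayed).
numErrors = nnz(test_result ~= test_labels)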
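For reference, the quantities the code estimates can be written out as below. This is only a sketch in my own notation, not a quotation from the exercise text: I assume $c_k^{(i)}$ is the count of token $k$ in document $i$ (`train_matrix(i,k)`), $n_i$ is the total word count of document $i$ (`word(i)`), $|V|$ is the vocabulary size (`numTokens`), $m$ is `numTrainDocs`, and $\phi_y$ is the spam prior (`p_y`).

```latex
\[
\phi_{k \mid y=1}
  = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, c_k^{(i)} + 1}
         {\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i + |V|},
\qquad
\phi_{k \mid y=0}
  = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, c_k^{(i)} + 1}
         {\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n_i + |V|}
\]
\[
\text{predict spam}
\iff
\sum_{k=1}^{|V|} c_k \log \phi_{k \mid y=1} + \log \phi_y
\;>\;
\sum_{k=1}^{|V|} c_k \log \phi_{k \mid y=0} + \log (1 - \phi_y)
\]
```

Here $c_k$ without a superscript is the count of token $k$ in the test document. Note that the denominator sums word counts, not document counts, and that the smoothing term added to it is $|V|$; when the two classes are equally frequent, the two prior terms cancel.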
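Two misunderstandings of the formula cost me a whole evening of debugging. My lack of fluency with MATLAB also made the code redundant: where a single matrix operation or function call would have done the job, I naively wrote a for loop.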
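As a rough illustration of the "single matrix operation" idea, here is a minimal sketch of how the per-token counts might be computed without the loop. The toy data is hypothetical and made up purely for illustration (4 documents, 3 tokens); the variable names follow the code above.

```matlab
% Toy data (hypothetical, for illustration only): 4 documents, 3 tokens.
train_matrix = [2 0 1;
                1 0 0;
                0 3 1;
                0 1 2];
y = [1; 1; 0; 0];                   % 1 = spam, 0 = non-spam
numTokens = size(train_matrix, 2);

% Logical masks select rows directly, so find() is not needed.
spam    = (y == 1);
nonspam = (y == 0);

% Summing along dimension 1 gives one count per token in a single call,
% replacing the for loop over tokens. The transpose yields column vectors.
xofspam    = sum(train_matrix(spam, :), 1)';
xofnonspam = sum(train_matrix(nonspam, :), 1)';

% Laplace-smoothed token probabilities. sum(xofspam) is the total number
% of words in all spam documents, the same value as sum(word(spam)).
fi_y1 = (xofspam + 1) ./ (sum(xofspam) + numTokens);
fi_y0 = (xofnonspam + 1) ./ (sum(xofnonspam) + numTokens);
```

Because the total word count of a class equals the sum of its per-token counts, the separate `word` vector is no longer needed. Each of `fi_y1` and `fi_y0` should sum to 1, which makes a quick sanity check.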