
Testing Mahout 0.9 Patched with 1329-3.patch on Hadoop 2.2.0 Using Bayesian Text Classification

Author: 星哥玩云
Published: 2022-07-03 11:17:55
Included in the column: 开源部署

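Introduction

This post follows the previous article, "Patching Mahout 0.9 to Support Hadoop 2.2.0" (http://www.linuxidc.com/Linux/2014-09/106286.htm).

With the patched Mahout 0.9 compiled successfully, Bayesian text classification is used here to test Mahout 0.9's compatibility with Hadoop 2.2.0.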

Step 1: Upload the 20news files to HDFS

yarn@singletest:~/Mahout/mahout-distribution-0.7$ hadoop fs -ls /workspace/mahout/week4/data/20news

Found 2 items

drwxr-xr-x   - yarn supergroup          0 2014-09-04 21:52 /workspace/mahout/week4/data/20news/20news-bydate-test

drwxr-xr-x   - yarn supergroup          0 2014-09-04 21:57 /workspace/mahout/week4/data/20news/20news-bydate-train

Step 2: Create sequence files from the data

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seqdirectory -i /workspace/mahout/week4/data/20news -o /workspace/mahout/week4/data/20news_seq

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_seq

Found 1 items

-rw-r--r--   1 yarn supergroup   37064977 2014-09-04 22:12 /workspace/mahout/week4/data/20news_seq/chunk-0

Step 3: Convert the sequence files to vectors

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seq2sparse -i /workspace/mahout/week4/data/20news_seq/ -o /workspace/mahout/week4/data/20news_vectors -lnorm -nv -wt tfidf

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_vectors

Found 7 items

drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/df-count

-rw-r--r--   1 yarn supergroup    1937084 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/dictionary.file-0

-rw-r--r--   1 yarn supergroup    1890053 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/frequency.file-0

drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:19 /workspace/mahout/week4/data/20news_vectors/tf-vectors

drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:21 /workspace/mahout/week4/data/20news_vectors/tfidf-vectors

drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/tokenized-documents

drwxr-xr-x   - yarn supergroup          0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/wordcount
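
As a rough illustration of what a TF-IDF weighting such as the one selected by `-wt tfidf` computes, the Python sketch below applies the textbook formula tf × ln(N / df) to a toy corpus. The corpus and the function name `tfidf` are hypothetical, and Mahout's own weighting may differ in smoothing and normalization details; this is not Mahout code.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["hadoop", "mahout", "mahout"],
    ["hadoop", "yarn"],
    ["hadoop", "bayes"],
]
vectors = tfidf(docs)
print(vectors[0])
```

A term that occurs in every document ("hadoop" here) gets weight 0, while a term concentrated in one document ("mahout") gets a high weight, which is the property that makes these vectors useful as classifier input.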


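Step 4: Split the vectors into a training set and a test set

Parameters:

  • -tr: output path for the training set
  • -te: output path for the test set
  • -rp: the percentage of the whole data set that is held out as test data; the command below sets it to 20%.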

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout split -i /workspace/mahout/week4/data/20news_vectors/tfidf-vectors -tr /workspace/mahout/week4/data/train-vectors -te /workspace/mahout/week4/data/test-vectors -rp 20 -ow -seq -xm sequential
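
The idea behind a percentage-based random split can be sketched in a few lines of Python. This is a generic illustration, not Mahout's implementation; the function name `split_by_percentage`, the fixed seed, and the stand-in document names are assumptions made for the example.

```python
import random

def split_by_percentage(items, test_pct, seed=42):
    """Randomly hold out test_pct percent of items as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    n_test = round(len(items) * test_pct / 100)
    test_idx = set(rng.sample(range(len(items)), n_test))
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test

# Hypothetical stand-ins for the TF-IDF vectors.
vectors = [f"doc-{i}" for i in range(1000)]
train, test = split_by_percentage(vectors, 20)
print(len(train), len(test))
```

The two outputs are disjoint and together cover the input, so no document used for training is later used for testing.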

Step 5: Train the model

yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout trainnb -i /workspace/mahout/week4/data/train-vectors -el -o /workspace/mahout/week4/nbmodel -li /workspace/mahout/week4/labindex -ow -c

Inspect the generated label index:

yarn@singletest:~$ hadoop fs -text /workspace/mahout/week4/labindex

20news-bydate-test      0

20news-bydate-train     1
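
The index holds only two labels, and they match the two top-level directories uploaded in the first step rather than the individual newsgroups. One plausible explanation, stated here as an assumption and not confirmed by the source: with `-el` the label is taken from the first path component of each sequence-file key, and because the sequence files were built from the parent directory, the keys start with `/20news-bydate-test/` or `/20news-bydate-train/`. The Python sketch below mimics that behaviour; the key format, the document ids, and the helper `label_from_key` are hypothetical.

```python
def label_from_key(key, depth=1):
    """Return the path component at position `depth` of a '/'-separated key."""
    return key.split("/")[depth]

# Hypothetical keys, shaped the way the sequence-file keys are assumed to look.
keys = [
    "/20news-bydate-test/sci.space/0001",
    "/20news-bydate-train/sci.space/0002",
    "/20news-bydate-train/rec.autos/0003",
]

# Assign consecutive integer ids to the sorted distinct labels.
labels = sorted({label_from_key(k) for k in keys})
label_index = {label: i for i, label in enumerate(labels)}
print(label_index)
```

Under the same assumption, the newsgroup name sits one level deeper in the key (`depth=2` in this sketch), so the classifier in this run separates the test directory from the train directory, not newsgroup topics.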

Inspect the trained model:

yarn@singletest:~$ hadoop fs -ls /workspace/mahout/week4/nbmodel

Found 1 items

-rw-r--r--   1 yarn supergroup    2437874 2014-09-05 23:09 /workspace/mahout/week4/nbmodel/naiveBayesModel.bin
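
To give a feel for what a naive Bayes model stores, here is a minimal multinomial naive Bayes sketch in Python with Laplace smoothing. It is a plain textbook version, not the complementary variant selected by `-c`, and not Mahout's implementation; the function names and the toy samples are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, alpha=1.0):
    """samples: list of (label, tokens). Returns {label: (log_prior, log_likelihoods)}."""
    label_counts = Counter(label for label, _ in samples)
    term_counts = defaultdict(Counter)
    for label, tokens in samples:
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    model = {}
    for label, n in label_counts.items():
        # Laplace smoothing: add alpha to every term count in the vocabulary.
        denom = sum(term_counts[label].values()) + alpha * len(vocab)
        log_like = {t: math.log((term_counts[label][t] + alpha) / denom) for t in vocab}
        model[label] = (math.log(n / len(samples)), log_like)
    return model

def classify(model, tokens):
    """Pick the label with the highest log posterior; unseen terms are ignored."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(model, key=score)

# Hypothetical toy training data.
samples = [
    ("space", ["orbit", "launch", "orbit"]),
    ("autos", ["engine", "wheel"]),
    ("space", ["launch", "moon"]),
]
model = train_nb(samples)
print(classify(model, ["orbit", "moon"]), classify(model, ["engine"]))
```

The trained model is just per-label log priors and per-term log likelihoods, which is why a single compact file is enough to hold it.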

Step 6: Test

yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout testnb -i /workspace/mahout/week4/data/test-vectors -m /workspace/mahout/week4/nbmodel -l /workspace/mahout/week4/labindex -ow -o /workspace/mahout/week4/20news-test-result -c

Note: the input path given after -i is the test set produced by the split in Step 4.

Test results:

14/09/05 23:18:09 INFO test.TestNaiveBayesDriver: Complementary Results:

=======================================================

Summary

-------------------------------------------------------

Correctly Classified Instances          :       2887       74.9675%

Incorrectly Classified Instances        :        964       25.0325%

Total Classified Instances              :       3851

=======================================================

Confusion Matrix

-------------------------------------------------------

a       b       <--Classified as

1131    413      |  1544        a     = 20news-bydate-test

551     1756     |  2307        b     = 20news-bydate-train

=======================================================

Statistics

-------------------------------------------------------

Kappa                                        0.486

Accuracy                                   74.9675%

Reliability                                49.7892%

Reliability (standard deviation)            0.4314

14/09/05 23:18:09 INFO driver.MahoutDriver: Program took 17504 ms (Minutes: 0.29173333333333334)
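
The accuracy and kappa figures in the report can be checked by hand from the confusion matrix. The Python sketch below uses the standard definitions (accuracy as the diagonal share, Cohen's kappa as agreement corrected for chance); it assumes rows are true classes and columns are predicted classes, and the function name `accuracy_and_kappa` is hypothetical.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of items of true class i classified as class j."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Confusion matrix copied from the test output above.
matrix = [[1131, 413], [551, 1756]]
acc, kappa = accuracy_and_kappa(matrix)
print(f"accuracy = {acc:.4%}, kappa = {kappa:.3f}")
```

With these inputs the sketch gives 2887 correct out of 3851, an accuracy of 74.9675%, and a kappa of 0.486, matching the values in the report.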
