
Using OpenNLP for Document Classification

Author: code4it | Originally published 2018-04-06 on the WeChat public account 码匠的流水账

This article looks at how to use OpenNLP for document classification.

DoccatModel

To classify documents you need a maximum entropy model (Maximum Entropy Model), which in OpenNLP is represented by DoccatModel.

Code (Java):
import java.io.IOException;
import java.util.Set;
import java.util.SortedMap;

import org.junit.Assert;
import org.junit.Test;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizer;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.TrainingParameters;

public class DocumentCategorizerMETest {

    @Test
    public void testSimpleTraining() throws IOException {

        // hand-written training samples: tokens a/b/c belong to category "1", x/y/z to category "0"
        ObjectStream<DocumentSample> samples = ObjectStreamUtils.createObjectStream(
                new DocumentSample("1", new String[]{"a", "b", "c"}),
                new DocumentSample("1", new String[]{"a", "b", "c", "1", "2"}),
                new DocumentSample("1", new String[]{"a", "b", "c", "3", "4"}),
                new DocumentSample("0", new String[]{"x", "y", "z"}),
                new DocumentSample("0", new String[]{"x", "y", "z", "5", "6"}),
                new DocumentSample("0", new String[]{"x", "y", "z", "7", "8"}));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 100);
        params.put(TrainingParameters.CUTOFF_PARAM, 0);

        // train the maximum entropy model (DoccatModel)
        DoccatModel model = DocumentCategorizerME.train("x-unspecified", samples,
                params, new DoccatFactory());

        DocumentCategorizer doccat = new DocumentCategorizerME(model);

        // categorize returns one probability per category; getBestCategory picks the highest
        double[] aProbs = doccat.categorize(new String[]{"a"});
        Assert.assertEquals("1", doccat.getBestCategory(aProbs));

        double[] bProbs = doccat.categorize(new String[]{"x"});
        Assert.assertEquals("0", doccat.getBestCategory(bProbs));

        // test to make sure sorted map's last key is cat 1 because it has the highest score.
        SortedMap<Double, Set<String>> sortedScoreMap = doccat.sortedScoreMap(new String[]{"a"});
        Set<String> cat = sortedScoreMap.get(sortedScoreMap.lastKey());
        Assert.assertEquals(1, cat.size());
    }
}

Here, to keep the test simple, the training texts are hand-written DocumentSample instances. The categorize method returns an array of probabilities (one per category), and getBestCategory uses those probabilities to return the best-matching category.

The output of the training run is as follows:

Indexing events with TwoPass using cutoff of 0

    Computing event counts...  done. 6 events
    Indexing...  done.
Sorting and merging events... done. Reduced 6 events to 6.
Done indexing in 0.13 s.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 6
        Number of Outcomes: 2
      Number of Predicates: 14
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-4.1588830833596715    0.5
  2:  ... loglikelihood=-2.6351991759048894    1.0
  3:  ... loglikelihood=-1.9518912133474995    1.0
  4:  ... loglikelihood=-1.5599038834410852    1.0
  5:  ... loglikelihood=-1.3039748361952568    1.0
  6:  ... loglikelihood=-1.1229511041438864    1.0
  7:  ... loglikelihood=-0.9877356230661396    1.0
  8:  ... loglikelihood=-0.8826624290652341    1.0
  9:  ... loglikelihood=-0.7985244514476817    1.0
 10:  ... loglikelihood=-0.729543972551105    1.0
//...
 95:  ... loglikelihood=-0.0933856684859806    1.0
 96:  ... loglikelihood=-0.09245907503183291    1.0
 97:  ... loglikelihood=-0.09155090064000486    1.0
 98:  ... loglikelihood=-0.09066059844628399    1.0
 99:  ... loglikelihood=-0.08978764309881068    1.0
100:  ... loglikelihood=-0.08893152970793908    1.0
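Besides getBestCategory, the DocumentCategorizer interface also exposes the per-category scores directly (the test above already uses sortedScoreMap). Below is a minimal sketch, assuming the doccat instance trained in the test above is still in scope, that uses the interface's scoreMap, getNumberOfCategories and getCategory methods to print every category together with its probability.

Code (Java):

// a minimal sketch, assuming the doccat instance trained above is in scope
String[] tokens = new String[]{"a", "b"};

// scoreMap returns a Map of category -> probability for the given tokens
java.util.Map<String, Double> scores = doccat.scoreMap(tokens);
scores.forEach((category, prob) -> System.out.println(category + " -> " + prob));

// alternatively, walk the raw probability array by category index
double[] probs = doccat.categorize(tokens);
for (int i = 0; i < doccat.getNumberOfCategories(); i++) {
    System.out.println(doccat.getCategory(i) + " -> " + probs[i]);
}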

Summary

OpenNLP's categorize method expects the input to be tokenized beforehand, which is not very convenient when called on its own; it does make sense for a pipeline-based design, though, where tokenization and similar steps happen earlier in the pipeline. This article only used the official test source code as an introduction; readers can download a Chinese text-classification training set, train a model with it, and then classify Chinese text, as sketched below.
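As a rough illustration of that pipeline idea, the sketch below first trains a model from a plain-text training file in the doccat format (one sample per line: the category, a space, then the already-tokenized text) and then tokenizes a raw string before categorizing it. The file name train-zh.txt, the model file doccat-zh.bin and the language code "zh" are assumptions for illustration; SimpleTokenizer is only a placeholder, and for Chinese you would normally plug a dedicated segmenter into that step.

Code (Java):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class FileTrainingSketch {

    public static void main(String[] args) throws IOException {
        // assumed training file in doccat format: "<category> <token> <token> ...", one sample per line
        InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("train-zh.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lineStream);

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 100);
        params.put(TrainingParameters.CUTOFF_PARAM, 0);

        DoccatModel model = DocumentCategorizerME.train("zh", samples, params, new DoccatFactory());

        // persist the model so it can be reloaded later via new DoccatModel(InputStream)
        try (OutputStream modelOut = new FileOutputStream("doccat-zh.bin")) {
            model.serialize(modelOut);
        }

        // categorize expects pre-tokenized input, so tokenize first (the "pipeline" step);
        // SimpleTokenizer is a placeholder here, a real Chinese segmenter would replace it
        DocumentCategorizerME doccat = new DocumentCategorizerME(model);
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize("some raw text to classify");
        double[] probs = doccat.categorize(tokens);
        System.out.println("best category: " + doccat.getBestCategory(probs));
    }
}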

doc

  • Document Categorizer API