前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >使用opennlp进行词性标注

使用opennlp进行词性标注

作者头像
code4it
发布2018-09-17 16:06:27
8670
发布2018-09-17 16:06:27
举报
文章被收录于专栏:码匠的流水账码匠的流水账

本文主要研究下如何使用opennlp进行词性标注

POS Tagging

词性(Part of Speech, POS),标注是对一个词汇或一段文字进行描述的过程。这个描述被称为一个标注。

目前流行的中文词性标签有两大类:北大词性标注集和宾州词性标注集。现代汉语的词可以分为两类12种词性:一类是实词:名词、动词、形容词、数词、量词和代词;另一类是虚词:副词、介词、连词、助词、叹词和拟声词。

这块的技术大多数使用HMM(隐马尔科夫模型)+ Viterbi算法,最大熵算法(Maximum Entropy)。

OpenNLP里头可以使用POSTaggerME类来执行基本的标注,以及ChunkerME类来执行分块。

POSTaggerME

代码语言:javascript
复制
    public static POSModel trainPOSModel(ModelType type) throws IOException {
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, type.toString());
        params.put(TrainingParameters.ITERATIONS_PARAM, 100);
        params.put(TrainingParameters.CUTOFF_PARAM, 5);

        return POSTaggerME.train("eng", createSampleStream(), params,
                new POSTaggerFactory());
    }

    private static ObjectStream<POSSample> createSampleStream() throws IOException {
        InputStreamFactory in = new ResourceAsStreamFactory(POSTaggerMETest.class,
                "postag/AnnotatedSentences.txt");

        return new WordTagSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));
    }

    @Test
    public void testPOSTagger() throws IOException {
        POSModel posModel = trainPOSModel(ModelType.MAXENT);

        POSTagger tagger = new POSTaggerME(posModel);

        String[] tags = tagger.tag(new String[] {
                "The",
                "driver",
                "got",
                "badly",
                "injured",
                "."});

        Assert.assertEquals(6, tags.length);
        Assert.assertEquals("DT", tags[0]);
        Assert.assertEquals("NN", tags[1]);
        Assert.assertEquals("VBD", tags[2]);
        Assert.assertEquals("RB", tags[3]);
        Assert.assertEquals("VBN", tags[4]);
        Assert.assertEquals(".", tags[5]);
    }

这里首先进行模型训练,其中训练文本样式如下:

代码语言:javascript
复制
Last_JJ September_NNP ,_, I_PRP tried_VBD to_TO find_VB out_RP the_DT address_NN of_IN an_DT old_JJ school_NN friend_NN whom_WP I_PRP had_VBD not_RB seen_VBN for_IN 15_CD years_NNS ._.
I_PRP just_RB knew_VBD his_PRP$ name_NN ,_, Alan_NNP McKennedy_NNP ,_, and_CC I_PRP 'd_MD heard_VBD the_DT rumour_NN that_IN he_PRP 'd_MD moved_VBD to_TO Scotland_NNP ,_, the_DT country_NN of_IN his_PRP$ ancestors_NNS ._.
So_IN I_PRP called_VBD Julie_NNP ,_, a_DT friend_NN who's_WDT still_RB in_IN contact_NN with_IN him_PRP ._.
She_PRP told_VBD me_PRP that_IN he_PRP lived_VBD in_IN 23213_CD Edinburgh_NNP ,_, Worcesterstreet_NNP 12_CD ._.
I_PRP wrote_VBD him_PRP a_DT letter_NN right_RB away_RB and_CC he_PRP answered_VBD soon_RB ,_, sounding_VBG very_RB happy_JJ and_CC delighted_JJ ._.

标注说明:

  • DT(Determiner)
  • NN (Noun, singular or mass)
  • VBD (Verb, past tense)
  • RB (Adverb)
  • VBN (Verb, past participle)

ChunkerME

代码语言:javascript
复制
    private Chunker chunker;

    private static String[] toks1 = { "Rockwell", "said", "the", "agreement", "calls", "for",
            "it", "to", "supply", "200", "additional", "so-called", "shipsets",
            "for", "the", "planes", "." };

    private static String[] tags1 = { "NNP", "VBD", "DT", "NN", "VBZ", "IN", "PRP", "TO", "VB",
            "CD", "JJ", "JJ", "NNS", "IN", "DT", "NNS", "." };

    private static String[] expect1 = { "B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-SBAR",
            "B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP",
            "I-NP", "O" };

    @Before
    public void startup() throws IOException {
        ResourceAsStreamFactory in = new ResourceAsStreamFactory(getClass(),
                "chunker/test.txt");

        ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 70);
        params.put(TrainingParameters.CUTOFF_PARAM, 1);

        ChunkerModel chunkerModel = ChunkerME.train("eng", sampleStream, params, new ChunkerFactory());

        this.chunker = new ChunkerME(chunkerModel);
    }

    @Test
    public void testChunkAsArray() throws Exception {

        String[] preds = chunker.chunk(toks1, tags1);

        Assert.assertArrayEquals(expect1, preds);
    }

这里同样也进行了模型训练,其训练文本样式如下:

代码语言:javascript
复制
Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP
's POS B-NP
Tulsa NNP I-NP
unit NN I-NP
said VBD B-VP
it PRP B-NP
signed VBD B-VP
a DT B-NP
tentative JJ I-NP
agreement NN I-NP
extending VBG B-VP
its PRP$ B-NP
contract NN I-NP
with IN B-PP
Boeing NNP B-NP
Co. NNP I-NP
to TO B-VP
provide VB I-VP
structural JJ B-NP
parts NNS I-NP
for IN B-PP
Boeing NNP B-NP
's POS B-NP
747 CD I-NP
jetliners NNS I-NP

标注说明:

  • \B 标注开始
  • \I 标注的中间
  • \E 标注的结束
  • NP 名词块
  • VB 动词块

小结

本文初步展示了如何使用opennlp进行词性标注,模型训练是个比较重要的一个方面,可以通过特定训练提高特定领域文本的标注准确性。

doc

  • NLP汉语自然语言处理原理与实践
  • Penn Part of Speech Tags
本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2018-04-05,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 码匠的流水账 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • POS Tagging
  • POSTaggerME
  • ChunkerME
  • 小结
  • doc
相关产品与服务
NLP 服务
NLP 服务(Natural Language Process,NLP)深度整合了腾讯内部的 NLP 技术,提供多项智能文本处理和文本生成能力,包括词法分析、相似词召回、词相似度、句子相似度、文本润色、句子纠错、文本补全、句子生成等。满足各行业的文本智能需求。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档