How can I split text into sentences using the Stanford parser?
Stack Overflow user
Asked on 2012-02-29 10:19:54
12 answers · 35.9K views · 0 following · 28 votes

How can I split a text or paragraph into sentences using the Stanford parser?

Is there any method to extract sentences, like getSentencesFromString() as provided for Ruby?


12 Answers

Stack Overflow user

Accepted answer

Posted on 2012-02-29 11:39:17

You can check out the DocumentPreprocessor class. Below is a short code snippet. I think there may be other ways to do what you want, too.

Code language: java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.process.DocumentPreprocessor;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

for (List<HasWord> sentence : dp) {
   // SentenceUtils (formerly Sentence) joins the tokens back into a string
   String sentenceString = SentenceUtils.listToString(sentence);
   sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
   System.out.println(sentence);
}
Votes: 31

Stack Overflow user

Posted on 2013-06-12 09:18:41

I know there is already an accepted answer... but typically you would just grab the SentenceAnnotations from an annotated document.

Code language: java
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);
  }
}

Source: http://nlp.stanford.edu/software/corenlp.shtml

If you are only looking for sentences, you can drop the later steps such as "parse" and "dcoref" from the pipeline initialization, which will save you some load and processing time. Rock on. ~K
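As a minimal sketch of that suggestion: a pipeline configured with only the annotators needed for sentence splitting would use just "tokenize" and "ssplit" (these are the standard CoreNLP annotator names; the rest of the pipeline setup is the same as in the snippet above).

```java
import java.util.Properties;

// Only the annotators needed for sentence splitting:
// "tokenize" breaks the text into tokens, "ssplit" groups them into sentences.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit");
// Pass props to new StanfordCoreNLP(props) exactly as in the snippet above.
System.out.println(props.getProperty("annotators")); // tokenize, ssplit
```

Leaving out "pos", "lemma", "ner", "parse", and "dcoref" avoids loading their models entirely.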

Votes: 24

Stack Overflow user

Posted on 2015-06-19 06:09:03

There are a couple of problems with the accepted answer. First, the tokenizer transforms some characters, such as the character “, into the two-character sequence ``. Second, joining the tokenized text back together with whitespace does not return the same result as before. Therefore, the example from the accepted answer transforms the input text in non-trivial ways.
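The round-trip failure can be seen without CoreNLP at all. Here is a self-contained sketch using a hand-written token list of the kind a PTB-style tokenizer would emit (the exact token strings here are an assumption based on PTB quote normalization, not tokenizer output):

```java
import java.util.List;

// Tokens a PTB-style tokenizer might produce for: “Does it work for questions?”
// (curly quotes normalized to `` and '', punctuation split off as its own token)
List<String> tokens = List.of("``", "Does", "it", "work", "for", "questions", "?", "''");

// Joining with spaces does not reproduce the original string.
String rejoined = String.join(" ", tokens);
System.out.println(rejoined); // `` Does it work for questions ? ''
```

Neither the original quote characters nor the original spacing survive the join.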

However, the CoreLabel class used by the tokenizer keeps track of the source characters each token maps to, so if you have the original string, reconstructing the proper sentence strings is easy.

Approach 1 below shows the accepted answer's approach; approach 2 shows my approach, which overcomes these problems.

Code language: java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;
import edu.stanford.nlp.util.StringUtils;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";

List<String> sentenceList;

/* ** APPROACH 1 (BAD!) ** */
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
    sentenceList.add(Sentence.listToString(sentence));
}
System.out.println(StringUtils.join(sentenceList, " _ "));

/* ** APPROACH 2 ** */
//// Tokenize
List<CoreLabel> tokens = new ArrayList<CoreLabel>();
PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(new StringReader(paragraph), new CoreLabelTokenFactory(), "");
while (tokenizer.hasNext()) {
    tokens.add(tokenizer.next());
}
//// Split sentences from tokens
List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);
//// Join back together
int end;
int start = 0;
sentenceList = new ArrayList<String>();
for (List<CoreLabel> sentence : sentences) {
    end = sentence.get(sentence.size() - 1).endPosition();
    sentenceList.add(paragraph.substring(start, end).trim());
    start = end;
}
System.out.println(StringUtils.join(sentenceList, " _ "));

This outputs the following:

Code language: java
My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
My 1st sentence. _ “Does it work for questions?” _ My third sentence.
Votes: 17
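The substring arithmetic at the heart of approach 2 can be exercised on its own. In this minimal sketch, hand-computed character offsets stand in for what CoreLabel.endPosition() would report for the last token of each sentence:

```java
import java.util.ArrayList;
import java.util.List;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
// End offset of each sentence's last token (hand-computed here; in approach 2
// these come from CoreLabel.endPosition() on the last token of each sentence).
int[] sentenceEnds = {16, 46, 65};

List<String> sentences = new ArrayList<>();
int start = 0;
for (int end : sentenceEnds) {
    // Slice the original string, so quotes and spacing survive untouched.
    sentences.add(paragraph.substring(start, end).trim());
    start = end;
}
System.out.println(String.join(" _ ", sentences));
// My 1st sentence. _ “Does it work for questions?” _ My third sentence.
```

Because the sentences are cut directly out of the original string, no tokenizer normalization ever touches the output.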
Original content provided by Stack Overflow; translation supported by Tencent Cloud's translation engine.
Original link:

https://stackoverflow.com/questions/9492707
