How do I split a text or paragraph into sentences using the Stanford parser?
Is there a method for extracting sentences, something like the getSentencesFromString() provided for Ruby?
Posted on 2012-02-29 11:39:17
You can take a look at the DocumentPreprocessor class. Below is a short code snippet; I think there may be other ways to do what you want as well.
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.process.DocumentPreprocessor;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
    // SentenceUtils, not Sentence, in recent CoreNLP versions
    String sentenceString = SentenceUtils.listToString(sentence);
    sentenceList.add(sentenceString);
}
for (String sentence : sentenceList) {
    System.out.println(sentence);
}

Posted on 2013-06-12 09:18:41
I know there is already an accepted answer... but usually you would just grab the SentenceAnnotations from an annotated document.
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }
}

Source - http://nlp.stanford.edu/software/corenlp.shtml (halfway down)
If you are only looking for sentences, you can drop the later steps such as "parse" and "dcoref" from the pipeline initialization, which will save you some load and processing time. Rock and roll. ~K
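For reference, here is a minimal sketch of that stripped-down pipeline (it reuses the text variable and imports from the snippet above, and assumes the tokenize and ssplit annotators are all you need when you only care about sentence boundaries):

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit"); // no pos/lemma/ner/parse/dcoref
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(text);
pipeline.annotate(document);

// SentencesAnnotation is set by the ssplit annotator; each CoreMap is one sentence
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    // TextAnnotation on a sentence-level CoreMap holds that sentence's text
    System.out.println(sentence.get(TextAnnotation.class));
}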
Posted on 2015-06-19 06:09:03
There are a couple of issues with the accepted answer. First, the tokenizer transforms some characters, for example turning the quote character “ into the two characters ``. Second, joining the tokenized text back together with whitespace does not return the same result as before. Therefore, the example from the accepted answer transforms the input text in non-trivial ways.
However, the CoreLabel objects the tokenizer produces keep track of the source characters they map to, so it is easy to reconstruct the proper string if you have the original.
Approach 1 below shows the accepted answer's approach, and Approach 2 shows my approach, which overcomes these issues.
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;
import edu.stanford.nlp.util.StringUtils;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
List<String> sentenceList;

/* ** APPROACH 1 (BAD!) ** */
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
    sentenceList.add(Sentence.listToString(sentence));
}
System.out.println(StringUtils.join(sentenceList, " _ "));

/* ** APPROACH 2 ** */
//// Tokenize
List<CoreLabel> tokens = new ArrayList<CoreLabel>();
PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(new StringReader(paragraph), new CoreLabelTokenFactory(), "");
while (tokenizer.hasNext()) {
    tokens.add(tokenizer.next());
}

//// Split sentences from tokens
List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);

//// Join back together
int end;
int start = 0;
sentenceList = new ArrayList<String>();
for (List<CoreLabel> sentence : sentences) {
    // endPosition() is the character offset just past the last token in the original string
    end = sentence.get(sentence.size() - 1).endPosition();
    sentenceList.add(paragraph.substring(start, end).trim());
    start = end;
}
System.out.println(StringUtils.join(sentenceList, " _ "));

This will output the following:
My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
My 1st sentence. _ “Does it work for questions?” _ My third sentence.

Source - https://stackoverflow.com/questions/9492707
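A related sketch (not from the answers above): if you would rather use the CoreNLP pipeline than drive PTBTokenizer by hand, the sentence-level CoreMaps also carry character offsets into the original string via CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation (both nested in edu.stanford.nlp.ling.CoreAnnotations), so the same substring-from-the-original trick should work there too. Assuming a tokenize/ssplit pipeline as in the earlier answer:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(paragraph);
pipeline.annotate(document);

for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
    // the offsets index into the original paragraph, so the substring keeps the original characters
    int begin = sentence.get(CharacterOffsetBeginAnnotation.class);
    int end = sentence.get(CharacterOffsetEndAnnotation.class);
    System.out.println(paragraph.substring(begin, end));
}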