The Natural Language Toolkit (<a href="http://www.nltk.org/" rel="noreferrer">nltk.org</a>) has what you need. <a href="http://mailman.uib.no/public/corpora/2007-October/005426.html" rel="noreferrer">This group posting</a> indicates this does it:

<pre><code>import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))
</code></pre>

(I haven't tried it!)

Instead of using regex to split the text into sentences, you can also use the NLTK library.

<pre><code>&gt;&gt;&gt; from nltk import tokenize
&gt;&gt;&gt; p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

&gt;&gt;&gt; tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
</code></pre>

ref: <a href="https://stackoverflow.com/a/9474645/2877052">https://stackoverflow.com/a/9474645/2877052</a>

You can try using <a href="https://spacy.io/" rel="noreferrer">Spacy</a> instead of regex. I use it and it does the job.

<pre><code>import spacy
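nlp = spacy.load('en_core_web_sm')  # spaCy 3 removed the 'en' shortcut; load a model by its full name

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())  # Span.string was removed in spaCy 3; use .text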
</code></pre>

Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example '.' vs. '."'

<pre><code>abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    # Peel sentences off the end of the paragraph until no boundary is left
    while end &gt; -1:
        end = find_sentence_end(paragraph)
        if end &gt; -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    # Drop terminators that are really the end of an abbreviation
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) &gt; len(paragraph) or (sum(pe) &lt; len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
</code></pre>

I used Karl's find_all function from this entry:
<a href="https://stackoverflow.com/questions/4664850/find-all-occurrences-of-a-substring-in-python">Find all occurrences of a substring in Python</a>

For simple cases (where sentences are terminated normally), this should work:

<pre><code>import re
with open('somefile.txt') as f:
    text = f.read()
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
</code></pre>

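The regex matches optional spaces, one terminator (<code>.</code>, <code>?</code> or <code>!</code>), any closing quotes or brackets, and then optional trailing spaces. Because the trailing spaces are optional, a period inside a token such as re.split is also treated as a sentence boundary; change the final <code> *</code> to <code> +</code> to require at least one space after the terminator.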

Obviously, not the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in <code>sentences</code> starts with a capital letter?)
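A minimal sketch of the variant that requires at least one space after the terminator, which also shows the abbreviation problem in practice. The name split_simple and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as above, but ending in " +" instead of " *", so a terminator
# only counts as a boundary when at least one space follows it.
SPLIT_RE = re.compile(r' *[.?!][\'")\]]* +')


def split_simple(text):
    """Split text at sentence terminators, dropping empty pieces."""
    return [piece for piece in SPLIT_RE.split(text) if piece]


# The period inside "re.split" is not followed by a space, so it is kept.
print(split_simple("Call re.split first. Then check the result! Done?"))

# Abbreviations still break the split: "Dr." is followed by a space.
print(split_simple("Dr. Adams is here. Please come in."))
```

Note that re.split discards the matched terminators, so only the final sentence (which has no trailing space) keeps its punctuation.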

You can also use NLTK's sentence tokenization function:

<pre><code>from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes. Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)
</code></pre>

Using <a href="https://spacy.io/" rel="nofollow noreferrer">spacy</a>:
<pre><code>import spacy

nlp = spacy.load('en_core_web_sm')
text = &quot;How are you today? I hope you have a great day&quot;
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # Span.string was removed in spaCy 3; use .text
</code></pre>

You could make a new tokenizer for Russian (and some other languages) using this function:
<pre class="lang-py prettyprint-override"><code>def russianTokenizer(text):
    result = text
    # Surround each punctuation mark with spaces so it becomes its own token
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\&quot;', ' \&quot; ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ')
    # Collapse runs of spaces (repeated to handle longer runs)
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result
</code></pre>
and then call it in this way:
<pre class="lang-py prettyprint-override"><code>text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
</code></pre>

If NLTK's sent_tokenize is not an option (e.g. it needs a lot of RAM on long text) and regex doesn't work properly across languages, <a href="https://github.com/mediacloud/sentence-splitter" rel="nofollow noreferrer">sentence splitter</a> might be worth a try.

No doubt NLTK is the most suitable tool for this purpose, but getting started with it is quite painful (once it is installed, you reap the rewards).

So here is some simple re-based code, available at <a href="http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html" rel="nofollow">http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html</a>

<pre><code># split up a paragraph into sentences
# using regular expressions
import re


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

# output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
#   (plus one empty line, from the empty string after the final '?')
</code></pre>
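Splitting on the terminators alone throws the punctuation away. A small variation, sketched here under the assumption that each sentence ends with a terminator followed by whitespace, uses a lookbehind so the terminator stays attached to its sentence. The name split_keep_terminators is hypothetical and not from the linked post.

```python
import re


def split_keep_terminators(paragraph):
    """Split after '.', '!' or '?' when whitespace follows, keeping the terminator."""
    # (?<=[.!?]) is a zero-width lookbehind, so only the whitespace is consumed.
    pieces = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [piece for piece in pieces if piece]


p = "This is a sentence.  This is an excited sentence! And do you think this is a question?"
for s in split_keep_terminators(p):
    print(s)
```

Like the original, this still breaks on abbreviations such as "Dr." because they also end in a period followed by a space.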

Also, be wary of additional top-level domains that aren't included in some of the answers above.
For example, .info, .biz, .ru and .online will trip up some sentence parsers but aren't included above.
Here's some info on the frequency of top-level domains: <a href="https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/" rel="nofollow noreferrer">https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/</a>
That could be addressed by editing the code above to read:
<pre><code>alphabets = &quot;([A-Za-z])&quot;
prefixes = &quot;(Mr|St|Mrs|Ms|Dr)[.]&quot;
suffixes = &quot;(Inc|Ltd|Jr|Sr|Co)&quot;
starters = r&quot;(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)&quot;
acronyms = &quot;([A-Z][.][A-Z][.](?:[A-Z][.])?)&quot;
websites = &quot;[.](com|net|org|io|gov|ai|edu|co[.]uk|ru|info|biz|online)&quot;
</code></pre>
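The answer these patterns belong to is not reproduced here, so the following is only a sketch of one way such a list can be used: mask the dots of a domain ending with a placeholder before splitting, then restore them. The placeholder string, the added `\b` word boundary, the naive splitter and the function name are all assumptions made for illustration.

```python
import re

# Domain endings from the list above; \b (an addition) stops ".com" from
# matching inside a longer word such as ".community".
WEBSITES = re.compile(r"[.](com|net|org|io|gov|ai|edu|co[.]uk|ru|info|biz|online)\b")
PLACEHOLDER = "<prd>"  # hypothetical marker; must not contain '.', '!' or '?'


def split_protecting_domains(text):
    """Naive split on . ! ? that leaves dots inside domain endings untouched."""
    # Replace every dot inside a matched domain ending (including "co.uk").
    masked = WEBSITES.sub(lambda m: m.group(0).replace(".", PLACEHOLDER), text)
    pieces = re.split(r"[.!?]", masked)
    return [p.strip().replace(PLACEHOLDER, ".") for p in pieces if p.strip()]


print(split_protecting_domains("Visit example.info today. The mirror is bbc.co.uk now!"))
```

Any ending missing from the list (the point of this answer) would be split at its dot, which is why the list needs to cover the domains that occur in your text.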

Might as well throw this in, since this is the first post that showed up when searching for how to split text into groups of n sentences.
This works with a variable split length, which sets how many sentences are joined together into each chunk.
<pre><code>import nltk
# nltk.download('punkt')  # run once to fetch the Punkt sentence model
from more_itertools import windowed

split_length = 3  # e.g. 3 sentences per chunk

elements = nltk.tokenize.sent_tokenize(text)  # text: the input string
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    # windowed() pads the last window with None, so drop empty entries
    txt = &quot; &quot;.join([t for t in seg if t])
    if len(txt) &gt; 0:
        text_splits.append(txt)
</code></pre>
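If more_itertools is not available, the grouping step can be sketched with plain list slicing. This assumes the sentences have already been produced by some splitter; chunk_sentences is a hypothetical helper name.

```python
def chunk_sentences(sentences, split_length):
    """Join consecutive sentences into chunks of up to split_length sentences."""
    chunks = []
    for start in range(0, len(sentences), split_length):
        group = sentences[start:start + split_length]
        chunks.append(" ".join(group))
    return chunks


sentences = ["One.", "Two.", "Three.", "Four.", "Five."]
print(chunk_sentences(sentences, 2))
```

Slicing never pads the last group, so no filtering of empty entries is needed; the final chunk is simply shorter.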

Using <a href="https://stanfordnlp.github.io/stanza/" rel="nofollow noreferrer">Stanza</a>, a natural language processing library that works for many human languages:
<pre><code>import stanza

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize')

doc = nlp(t_en)  # t_en: the English text to split
for sentence in doc.sentences:
    print(sentence.text)
</code></pre>

I had to read subtitle files and split them into sentences. After pre-processing (removing the time information etc. from the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below split it neatly into sentences. I was probably lucky that the sentences always ended (correctly) with a space. Try this first, and if it runs into exceptions, add more checks.

<pre><code>import re

# Very approximate way to split the text into sentences: break after "?", "." and "!"
fullFile = re.sub(r"([!?.]) ", r"\1&lt;BRK&gt;", fullFile)
sentences = fullFile.split("&lt;BRK&gt;")
with open("./sentences.out", "w") as sentFile:
    for line in sentences:
        sentFile.write(line + "\n")
</code></pre>

Oh well. I now realize that since my content was Spanish, I did not have to deal with cases like "Mr. Smith". Still, if someone wants a quick and dirty parser...
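To make the trade-off concrete, here is the same marker technique wrapped in a function and run on two short strings. The function name and the sample sentences are illustrative assumptions.

```python
import re


def split_on_marker(text, marker="<BRK>"):
    """Insert a marker after '.', '!' or '?' followed by a space, then split on it."""
    marked = re.sub(r"([!?.]) ", lambda m: m.group(1) + marker, text)
    return marked.split(marker)


# Works when every terminator really ends a sentence.
print(split_on_marker("Hola. ¿Cómo estás? Nos vemos mañana."))

# Breaks on abbreviations, as noted above.
print(split_on_marker("Mr. Smith called. He left."))
```

Unlike re.split on the terminators, the marker approach keeps the punctuation attached to each sentence.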

I hope this helps with Latin-script, Chinese, and Arabic text:

<pre><code>import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2&lt;pad&gt;", myFile.read())
    lines = [line.strip() for line in lines.split("&lt;pad&gt;") if line.strip()]
</code></pre>
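A short sketch of how this pattern behaves on an in-memory string. The pattern below is trimmed to a few of the terminators for readability, and the function name and sample text are assumptions made for the example.

```python
import re

# Trimmed copy of the pattern above: a character that is neither a digit nor
# '+', followed by one or more terminators.
PUNCTUATION = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…)+")


def split_multilingual(text):
    """Mark each terminator run with <pad>, then split on the marker."""
    marked = PUNCTUATION.sub(r"\1\2<pad>", text)
    return [part.strip() for part in marked.split("<pad>") if part.strip()]


# The dot in "3.14" follows a digit, so it is not treated as a boundary.
print(split_multilingual("你好。今天天气很好！3.14 is pi."))

# A repeated group only captures its last repetition, so "..." becomes ".".
print(split_multilingual("Wait... what?"))
```

The second call shows a side effect worth knowing about: runs of terminators are collapsed to their last character in the output.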

Using spaCy:
<pre><code>import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is first. This is second. This is third.')
for sentence in doc.sents:
    print(sentence)
</code></pre>
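If you want to get a sentence by index, note that <code>doc.sents</code> is a generator, so indexing it directly fails:
<pre><code># doesn't work: a generator cannot be indexed
doc.sents[0]
</code></pre>
Instead, use:
<pre><code>list(doc.sents)[0]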
</code></pre>
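The same behaviour can be reproduced without spaCy. In this sketch, fake_sents is a hypothetical stand-in for doc.sents that simply yields strings; it shows why indexing fails and two ways to pick an item.

```python
from itertools import islice


def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "This is first."
    yield "This is second."
    yield "This is third."


try:
    fake_sents()[0]
except TypeError as err:
    # Generators do not support indexing
    print("indexing failed:", err)

# Materialize everything, then index.
first = list(fake_sents())[0]

# Or advance lazily to the item at index 1 without building the whole list.
second = next(islice(fake_sents(), 1, None))

print(first, second)
```

Building the list is simplest for short documents; islice avoids holding every sentence in memory when only one is needed.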

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

<pre><code>re.compile('(\. |^|!|\?)([A-Z][^;↑\.&lt;&gt;@\^&amp;/\[\]]*(\.|!|\?) )',re.M)
</code></pre>

How can I split a text into sentences?
