NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer, then a word tokenizer that runs on each sentence. It does a fine job out of the box.
>>> nltk.word_tokenize("(Dr. Edwards is my friend.)")
['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')']
I would like to use the same algorithm, except have it return tuples of offsets into the original string instead of string tokens.
By offsets I mean pairs of values that can serve as indices into the original string. For example, here I would have
>>> s = "(Dr. Edwards is my friend.)"
>>> s.token_spans()
[(0, 1), (1, 4), (5, 12), (13, 15), (16, 18), (19, 25), (25, 26), (26, 27)]
because s[0:1] is "(", s[1:4] is "Dr." and so on.
Is there an NLTK call that does this, or do I have to write my own offset arithmetic?
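If you did have to do it by hand, one common approach is to tokenize first and then align each token back to the original string with a left-to-right scan. This is a minimal sketch (the function name `token_spans` is hypothetical); it assumes every token appears verbatim in the text, which holds for this example but not for tokenizers that rewrite tokens (e.g. mapping `"` to ```` ``` ````):

```python
def token_spans(text, tokens):
    """Align tokens back to (start, end) offsets in the original string.

    Scans left to right, so each token is located after the end of the
    previous one.  Assumes every token occurs verbatim in `text`.
    """
    spans = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # first occurrence at or after pos
        end = start + len(tok)
        spans.append((start, end))
        pos = end  # continue searching after this token
    return spans

s = "(Dr. Edwards is my friend.)"
tokens = ['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')']
print(token_spans(s, tokens))
# [(0, 1), (1, 4), (5, 12), (13, 15), (16, 18), (19, 25), (25, 26), (26, 27)]
```

The left-to-right scan is what keeps the final `'.'` from matching the period inside `'Dr.'`: by the time it is looked up, `pos` is already past that earlier occurrence.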
Posted on 2020-08-29 14:09:39
As of NLTK 3.5, TreebankWordTokenizer supports the span_tokenize() function, so there is no longer any need to write your own offset arithmetic:
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected = [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23),
... (24, 26), (27, 30), (31, 32), (32, 36), (36, 37), (37, 38),
... (40, 46), (47, 48), (48, 51), (51, 52), (53, 55), (56, 59),
... (60, 62), (63, 68), (69, 70), (70, 76), (76, 77), (77, 78)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')',
... 'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True
https://stackoverflow.com/questions/28678318