Q: How do I build a translation corpus for Python's NLTK?

Stack Overflow user

Asked 2018-08-10 03:07:00

Answers 1 | Views 1K | Followers 0 | Votes 4

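I have been using Python's NLTK for general language parsing, and recently I wanted to create a corpus specifically for translation. I can't make sense of the corpus options and structures that NLTK offers for translation.

There is plenty of material on how to read or use corpus resources, but I can't find any details on creating a translation-style corpus. Browsing the corpus reference, I can see that there are many styles and types, but I can't find any examples or documentation specific to translation corpora.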

corpus

python

python-3.x

nltk

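1 Answer

Stack Overflow user

Accepted answer

Posted 2018-09-03 20:18:04

For translation-style datasets, NLTK can read a corpus of word-aligned sentences using AlignedCorpusReader. The files must have the following format: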

first source sentence
first target sentence 
first alignment
second source sentence
second target sentence
second alignment

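That is, tokens are assumed to be separated by whitespace, and each sentence starts on its own line. For example, suppose you have a directory structure like this: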

reader.py
data/en-es.txt
data/en-pt.txt

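where the contents of the files are as follows (the leading "# ..." line only names the file and is not part of its contents):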

# en-es.txt
This is an example
Esto es un ejemplo
0-0 1-1 2-2 3-3

and

# en-pt.txt
This is an example
Esto é um exemplo
0-0 1-1 2-2 3-3
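Since the question is about creating such a corpus, here is a minimal sketch of how the files above could be generated from Python data rather than typed by hand. The helper name `write_aligned_corpus` and its parameters are hypothetical (not part of NLTK); it only uses the standard library and assumes the three-lines-per-sentence-pair layout described above.

```python
from pathlib import Path


def write_aligned_corpus(path, triples, encoding="utf-8"):
    """Write (source, target, alignment) triples as consecutive three-line blocks.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    lines = []
    for triple in triples:
        for field in triple:
            field = field.strip()
            # Each field must fit on one line, or the three-line blocks shift.
            if "\n" in field:
                raise ValueError(f"field contains a newline: {field!r}")
            lines.append(field)
    path = Path(path)
    # Create the corpus directory (e.g. data/) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding=encoding)
    return path
```

Calling `write_aligned_corpus('data/en-es.txt', [('This is an example', 'Esto es un ejemplo', '0-0 1-1 2-2 3-3')])` should produce a file with the same three lines as the en-es.txt example.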

You can load this toy example with the following script:

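# reader.py
from nltk.corpus.reader.aligned import AlignedCorpusReader

# fileids is a regex; the third positional parameter is sep, not an extension,
# so the .txt filter goes in the pattern.
reader = AlignedCorpusReader('./data', r'.*\.txt', encoding='utf-8')

for sentence in reader.aligned_sents():
    print(sentence.words)      # source-language tokens
    print(sentence.mots)       # target-language tokens
    print(sentence.alignment)  # word alignment, e.g. 0-0 1-1 2-2 3-3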

Output

['This', 'is', 'an', 'example']
['Esto', 'es', 'un', 'ejemplo']
0-0 1-1 2-2 3-3
['This', 'is', 'an', 'example']
['Esto', 'é', 'um', 'exemplo']
0-0 1-1 2-2 3-3
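To make the alignment line easier to interpret, the sketch below pairs up the aligned words. It assumes each `i-j` item links the zero-based position `i` in the first (source) sentence to position `j` in the second (target) sentence, which is consistent with the toy data above. The function names `parse_alignment` and `aligned_word_pairs` are hypothetical and use only the standard library.

```python
def parse_alignment(alignment):
    """Turn a string such as '0-0 1-1' into [(0, 0), (1, 1)]."""
    pairs = []
    for item in alignment.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def aligned_word_pairs(words, mots, alignment):
    """Return a (source word, target word) tuple for each alignment link."""
    return [(words[i], mots[j]) for i, j in parse_alignment(alignment)]
```

With the reader output above, something like `aligned_word_pairs(sentence.words, sentence.mots, str(sentence.alignment))` would list which source word maps to which target word.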

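The AlignedCorpusReader(...) call creates a reader over the files in the './data' directory and declares their encoding as 'utf-8'. Its second argument, fileids, is a regular expression that selects which files to read. Note that the third positional parameter is sep, not a file extension, so to read only files ending in .txt the extension belongs in the fileids pattern (for example r'.*\.txt'). The other parameters of AlignedCorpusReader are word_tokenizer and sent_tokenizer, which default to WhitespaceTokenizer() and RegexpTokenizer('\n', gaps=True) respectively.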
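As a rough illustration of what those two tokenizers do, the snippet below applies them by hand to one three-line block. This is only a demonstration of the tokenizers on their own, not a description of how the reader calls them internally; the variable names are illustrative.

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

word_tokenizer = WhitespaceTokenizer()
# gaps=True means the pattern matches the separators, so this splits on newlines.
sent_tokenizer = RegexpTokenizer('\n', gaps=True)

block = "This is an example\nEsto es un ejemplo\n0-0 1-1 2-2 3-3"

# One string per line of the block: source, target, alignment.
lines = sent_tokenizer.tokenize(block)
# Whitespace-separated tokens for each of those lines.
tokens = [word_tokenizer.tokenize(line) for line in lines]
```

If your corpus needs different tokenization, the same two parameters can be passed to AlignedCorpusReader with other tokenizer objects.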

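More information can be found in the documentation (1 and 2).

Votes 6

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link: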

https://stackoverflow.com/questions/51774192
