I trained a tokenizer with the Hugging Face tokenizers library and saved the model, as shown below:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace, ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = ByteLevelDecoder()
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(files=["/content/drive/MyDrive/Work/NLP/bert_practice/data/doc.txt"], trainer=trainer)
tokenizer.model.save('/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer')
['/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/vocab.json',
'/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/merges.txt']
It works well:
tokenizer.encode("东风日产2021款劲客正式上市").tokens
['东风日产', '2021款', '劲客', '正式上市']
But when I load the model with transformers' BertTokenizer, as shown below:
from transformers import BertTokenizer
tokenizer = BertTokenizer(
vocab_file="/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/vocab.json",
#merges_file="/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/merges.txt",
)
it always predicts '[UNK]', like this:
tokenizer.tokenize("奥迪A5有着年轻时尚的外观,动力强、操控也很棒")
['[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]',
 '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]']
Can anyone help with this? Any suggestions for solving it would be very helpful.
Posted on 2021-09-06 14:52:45
You are trying to load a BPE-based tokenizer into BertTokenizer, but BertTokenizer does not use BPE; it uses a WordPiece tokenizer, so the two are incompatible. See this page in the Hugging Face documentation: https://huggingface.co/transformers/tokenizer_summary.html#wordpiece
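If the goal is a tokenizer that BertTokenizer can load, a minimal sketch (reusing the paths from the question; this code is not part of the original answer) is to train a WordPiece model instead, for example with the BertWordPieceTokenizer helper from the tokenizers library. Note that BertTokenizer expects a plain-text vocab.txt with one token per line, not the vocab.json that the BPE model writes:

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer

# Train a WordPiece tokenizer, the algorithm BERT actually uses.
wp_tokenizer = BertWordPieceTokenizer()
wp_tokenizer.train(files=["/content/drive/MyDrive/Work/NLP/bert_practice/data/doc.txt"], vocab_size=25000)

# save_model() writes a vocab.txt in the format BertTokenizer expects.
wp_tokenizer.save_model("/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer")

bert_tokenizer = BertTokenizer(vocab_file="/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/vocab.txt")
print(bert_tokenizer.tokenize("奥迪A5有着年轻时尚的外观,动力强、操控也很棒"))

Alternatively, if you want to keep the trained BPE tokenizer and simply use it from transformers, you can wrap it in PreTrainedTokenizerFast, which accepts any tokenizers-library tokenizer regardless of algorithm (again a sketch, not from the original answer):

from transformers import PreTrainedTokenizerFast

# Save the full pipeline (model + pre-tokenizer + decoder) as a single JSON file,
# instead of only vocab.json/merges.txt.
tokenizer.save("/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/tokenizer.json")

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="/content/drive/MyDrive/Work/NLP/bert_practice/data/tokenizer/tokenizer.json",
    unk_token="[UNK]",
)
print(hf_tokenizer.tokenize("东风日产2021款劲客正式上市"))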
https://stackoverflow.com/questions/69072624