Q: How do I encode multiple sentences using transformers.BertTokenizer?

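Stack Overflow user

Asked 2020-07-01 03:32:24

2 answers · 13.4K views · 0 followers · 13 votes

I want to create a mini-batch by encoding multiple sentences with transformers.BertTokenizer. It seems to work for a single sentence only. How can I make it work for several sentences?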

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# tokenize a single sentence seems working
tokenizer.encode('this is the first sentence')
>>> [2023, 2003, 1996, 2034, 6251]

# tokenize two sentences
tokenizer.encode(['this is the first sentence', 'another sentence'])
>>> [100, 100] # expecting 7 tokens

word-embedding

huggingface-transformers

huggingface-tokenizers

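2 Answers

Stack Overflow user

Accepted answer

Posted 2020-07-02 02:56:04

transformers >= 4.0.0

Use the tokenizer's __call__ method. It returns a dictionary containing input_ids, token_type_ids, and attention_mask, each as a list with one entry per input sentence: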

tokenizer(['this is the first sentence', 'another setence'])

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

transformers < 4.0.0

Use tokenizer.batch_encode_plus (documentation). It returns a dictionary containing input_ids, token_type_ids, and attention_mask, each as a list with one entry per input sentence:

tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'])

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

The following applies to both __call__ and batch_encode_plus:

If you only want input_ids, set return_token_type_ids and return_attention_mask to False:

tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'], return_token_type_ids=False, return_attention_mask=False)

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]]}
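The lists returned above have different lengths (7 and 5 IDs), so they cannot be stacked into a rectangular mini-batch as they are. In transformers 3.0 and later, passing padding=True to the tokenizer call (optionally with return_tensors='pt') should take care of this. The plain-Python sketch below only illustrates the idea and is not the library's implementation: the helper name pad_batch is hypothetical, and the pad ID 0 is an assumption based on the bert-base-uncased vocabulary.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: [PAD] has ID 0 in the bert-base-uncased vocabulary


def pad_batch(input_ids: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in input_ids)
    padded, masks = [], []
    for ids in input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)        # real IDs, then padding
        masks.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return {"input_ids": padded, "attention_mask": masks}


# IDs copied from the output above
batch = pad_batch([
    [101, 2023, 2003, 1996, 2034, 6251, 102],
    [101, 2178, 2275, 10127, 102],
])
for row, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(row, mask)
```

After padding, every row has the same length, and the attention mask tells the model which positions to ignore.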

19 votes

Stack Overflow user

Posted 2022-07-19 07:42:19

What you did is almost correct. You can pass the sentences to the tokenizer as a list.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
two_sentences = ['this is the first sentence', 'another sentence']


tokenized_sentences = tokenizer(two_sentences)

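The last line is what makes the difference: it calls the tokenizer directly on the list instead of using tokenizer.encode.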
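As far as I understand, encode treats a list of strings as a sequence of already-split tokens rather than as a batch of sentences. Each whole sentence is then looked up as a single token, misses the vocabulary, and maps to the unknown token [UNK] (ID 100 in bert-base-uncased), which would explain the [100, 100] in the question. The toy lookup below sketches that effect; TOY_VOCAB and tokens_to_ids are hypothetical stand-ins, not library code.

```python
UNK_ID = 100  # assumption: [UNK] has ID 100 in bert-base-uncased

# Hypothetical miniature vocabulary; the IDs are taken from the outputs in this thread
TOY_VOCAB = {
    "this": 2023, "is": 2003, "the": 1996,
    "first": 2034, "sentence": 6251, "another": 2178,
}


def tokens_to_ids(tokens):
    """Look up each token as-is; anything not in the vocabulary becomes UNK_ID."""
    return [TOY_VOCAB.get(token, UNK_ID) for token in tokens]


# A list of words is looked up word by word
word_ids = tokens_to_ids("this is the first sentence".split())

# A list of whole sentences is looked up sentence by sentence, so nothing matches
sentence_ids = tokens_to_ids(["this is the first sentence", "another sentence"])

print(word_ids)
print(sentence_ids)
```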

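tokenized_sentences is a dict containing the following information:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 6251, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]]}

The list of sentences produces a list of tokenized sentences, stored under the input_ids key.

'this is the first sentence' = 101, 2023, 2003, 1996, 2034, 6251, 102 and 'another sentence' = 101, 2178, 6251, 102.

101 is the start token ([CLS]). 102 is the end token ([SEP]).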
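To connect this with the question: the asker expected 7 IDs in total (5 + 2 word pieces), while the tokenizer call adds two special tokens to each sentence. Here is a small sketch, assuming 101 and 102 are the start and end markers described above; the helper strip_special is hypothetical. With the library itself, passing add_special_tokens=False to the tokenizer call should leave the markers out in the first place.

```python
CLS_ID, SEP_ID = 101, 102  # start and end markers, as described above


def strip_special(ids):
    """Drop a leading CLS_ID and a trailing SEP_ID if they are present."""
    start = 1 if ids and ids[0] == CLS_ID else 0
    end = len(ids) - 1 if len(ids) > start and ids[-1] == SEP_ID else len(ids)
    return ids[start:end]


# IDs for the two sentences, as listed above
input_ids = [
    [101, 2023, 2003, 1996, 2034, 6251, 102],
    [101, 2178, 6251, 102],
]
content_ids = [strip_special(ids) for ids in input_ids]

print(content_ids)
print(sum(len(ids) for ids in content_ids))  # total number of non-special IDs
```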

0 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/62669261
