The RoBERTa tokenizer in huggingface-transformers describes RoBERTa's tokenization scheme as:
- single sequence: ``<s> X </s>``
- pair of sequences: ``<s> A </s></s> B </s>``

I am curious why the pair encoding is not ``<s> A </s><s> B </s>``.

Building on this, if I wanted to manually encode more than two sequences, should I encode them as ``<s> A </s></s> B </s></s> C </s>`` or as ``<s> A </s><s> B </s><s> C </s>``?
Posted on 2020-04-28 16:07:03
As with many similar questions, the best answer here is probably "because that is how it was pre-trained".

The main benefit of models in the transformer family is the extensive pre-training they come with. Unless you are willing to replicate the weeks/months-long pre-training phase, I think it is best to accept this property as a given.

Relatedly, this also means that your proposed approach of feeding in more than two sentences at once will likely not work; see this related question. Since RoBERTa was never trained on inputs with more than two sentences, it is unlikely to work well without a very large pre-training dataset of your own.

For more implementation-specific details, I think you should also ask on the huggingface issue tracker itself; this sounds like a promising feature that others might be interested in working on or using. But keep in mind that the token limit stays the same, and 512 tokens is not much for three or more sentences...
https://stackoverflow.com/questions/61465223