
About BertTokenizer

西西嘛呦 · 2021-06-17

Concrete examples

First, encode a single pre-tokenized sequence with encode_plus:
from transformers import BertTokenizer
import os

tokens = ['我', '爱', '北', '京', '天', '安', '门']

# Build the tokenizer directly from a local vocab file. Note that this is the
# cased English vocab, so most Chinese characters will map to [UNK] (id 100).
tokenizer = BertTokenizer(os.path.join('/content/drive/MyDrive/simpleNLP/model_hub/bert-base-case', 'vocab.txt'))
encode_dict = tokenizer.encode_plus(text=tokens,
                  max_length=256,
                  pad_to_max_length=True,    # deprecated, see the FutureWarning below
                  is_pretokenized=True,      # the input is already split into tokens
                  return_token_type_ids=True,
                  return_attention_mask=True)
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(' '.join(tokens))
print(encode_dict['input_ids'])

Output:

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
[CLS] 我 爱 北 京 天 安 门 [SEP]
[101, 100, 100, 993, 984, 1010, 1016, 100, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py:2079: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
  FutureWarning,
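
The ids equal to 100 in input_ids are [UNK]: the cased English vocab loaded above does not contain most of these Chinese characters (a Chinese vocab such as the one shipped with bert-base-chinese would avoid this). The FutureWarning can also be avoided on newer transformers releases; a minimal sketch of the equivalent call, assuming transformers 4.x, where pad_to_max_length became padding='max_length' and is_pretokenized became is_split_into_words:

from transformers import BertTokenizer
import os

tokens = ['我', '爱', '北', '京', '天', '安', '门']

tokenizer = BertTokenizer(os.path.join('/content/drive/MyDrive/simpleNLP/model_hub/bert-base-case', 'vocab.txt'))

# padding='max_length' replaces the deprecated pad_to_max_length=True,
# truncation=True silences the truncation warning, and
# is_split_into_words replaces the old is_pretokenized flag.
encode_dict = tokenizer(tokens,
                        is_split_into_words=True,
                        max_length=256,
                        padding='max_length',
                        truncation=True,
                        return_token_type_ids=True,
                        return_attention_mask=True)
print(encode_dict['input_ids'][:10])  # same ids as above, without the warnings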
Next, a sentence-pair example: passing both text and text_pair encodes the two segments into a single input:
from transformers import BertTokenizer
import os

tokenizer = BertTokenizer(os.path.join('/content/drive/MyDrive/simpleNLP/model_hub/bert-base-case', 'vocab.txt'))
tokens_a = '我 爱 北 京 天 安 门'.split(' ')
tokens_b = '我 爱 打 英 雄 联 盟 啊 啊'.split(' ')

# Encode the pair as [CLS] tokens_a [SEP] tokens_b [SEP]; if the result
# exceeds max_length, only the second sequence is truncated.
encode_dict = tokenizer.encode_plus(text=tokens_a,
                  text_pair=tokens_b,
                  max_length=20,
                  pad_to_max_length=True,             # deprecated, see the FutureWarning below
                  truncation_strategy='only_second',  # cut tokens_b first when too long
                  is_pretokenized=True,
                  return_token_type_ids=True,
                  return_attention_mask=True)
tokens = ' '.join(['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]'])
token_ids = encode_dict['input_ids']
attention_masks = encode_dict['attention_mask']
token_type_ids = encode_dict['token_type_ids']

print(tokens)
print(token_ids)
print(attention_masks)
print(token_type_ids)

Output:

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
[CLS] 我 爱 北 京 天 安 门 [SEP] 我 爱 打 英 雄 联 盟 啊 啊 [SEP]
[101, 100, 100, 993, 984, 1010, 1016, 100, 102, 100, 100, 100, 100, 100, 100, 100, 100, 100, 102, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py:2079: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
  FutureWarning,
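
Here attention_mask is 1 for real tokens and 0 for padding, while token_type_ids is 0 for the first segment ([CLS] tokens_a [SEP]) and 1 for the second (tokens_b [SEP]); padded positions are 0 in both. As above, the deprecated arguments have modern replacements; a minimal sketch of the equivalent pair call, assuming transformers 4.x:

from transformers import BertTokenizer
import os

tokenizer = BertTokenizer(os.path.join('/content/drive/MyDrive/simpleNLP/model_hub/bert-base-case', 'vocab.txt'))
tokens_a = '我 爱 北 京 天 安 门'.split(' ')
tokens_b = '我 爱 打 英 雄 联 盟 啊 啊'.split(' ')

# truncation='only_second' replaces truncation_strategy='only_second':
# if the pair is longer than max_length, only tokens_b is cut.
encode_dict = tokenizer(tokens_a,
                        tokens_b,
                        is_split_into_words=True,
                        max_length=20,
                        padding='max_length',
                        truncation='only_second',
                        return_token_type_ids=True,
                        return_attention_mask=True)
print(encode_dict['token_type_ids'])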