I am trying to follow this tutorial to fine-tune BERT for a NER task with my own dataset: https://www.philschmid.de/huggingface-transformers-keras-tf. Below is my shortened code, together with the error caused by its last line. I am new to all of this, thanks in advance for your help!
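(The shortened snippet below does not show its imports; judging from what it uses, they would be roughly the following. This is my reconstruction, not part of the original post:)
import ast
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorForTokenClassification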
# load dataset,
df_converters = {'tokens': ast.literal_eval, 'labels': ast.literal_eval}
train_df = pd.read_csv("train_df_pretokenization.csv", converters=df_converters)
train_df = train_df.head(10)
# model and pretrained tokenizer
model_ckpt = "indobenchmark/indobert-base-p2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# tokenization, and align labels
def tokenize_and_align_labels(batch):
    tag2int = {'B-POI':0, 'B-STR':1, 'E-POI':2, 'E-STR':3, 'I-POI':4,
               'I-STR':5, 'S-POI':6, 'S-STR':7, 'O':8}
    #tokenized_inputs = tokenizer(batch['tokens'], is_split_into_words=True, truncation=True, padding=True)
    tokenized_inputs = tokenizer(batch['tokens'], is_split_into_words=True, truncation=True)
    labels = []
    for idx, label in enumerate(batch['labels']):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)  # maps each sub-token back to its word position
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # special tokens: ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(tag2int[label[word_idx]])
            else:
                label_ids.append(tag2int[label[word_idx]])  # continuation sub-tokens get the same word-level tag
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs['tags'] = labels
    return tokenized_inputs

def encode_dataset(ds):
    return ds.map(tokenize_and_align_labels, batched=True, batch_size=10, remove_columns=['labels', 'tokens', 'index'])
train_ds = Dataset.from_pandas(train_df)
train_ds_encoded = encode_dataset(train_ds)
# prepare model input
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
tf_train_dataset = train_ds_encoded.to_tf_dataset(
    columns=['input_ids', 'token_type_ids', 'attention_mask', 'tags'],
    shuffle=False,
    batch_size=5,
    collate_fn=data_collator
)
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
I thought the data collator was supposed to take care of the padding, given the requested batch size, so I don't understand why feeding it sequences of different lengths causes this error. The tutorial itself actually runs fine without specifying padding or truncation. My code does run if I add padding=True to the tokenizer call inside the function (the commented-out line), but I don't think that is the right place to add the padding.
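For reference, here is a minimal sketch of the behaviour being assumed, with toy pre-tokenized examples and dummy 'O' labels of my own (not from the asker's dataset): DataCollatorForTokenClassification pads the model inputs and the label column to the longest sequence in each batch, using -100 as the label pad value. As far as I know it only applies this label padding to a feature named 'label' or 'labels'.
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p2")
collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")

# two pre-tokenized examples of different lengths (toy data, 'O' tag = 8 everywhere)
short = tokenizer(["jalan", "sudirman"], is_split_into_words=True, truncation=True)
longer = tokenizer(["hotel", "indonesia", "kempinski", "jakarta"], is_split_into_words=True, truncation=True)

features = [
    {**short, "labels": [-100] + [8] * (len(short["input_ids"]) - 2) + [-100]},
    {**longer, "labels": [-100] + [8] * (len(longer["input_ids"]) - 2) + [-100]},
]

batch = collator(features)  # pads input_ids, attention_mask and labels per batch
print(batch["input_ids"].shape, batch["labels"].shape)  # both padded to the same (batch, longest_len) shape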
Posted on 2022-07-15 22:19:35
I think it is good that an error is raised here, because it states explicitly what has to be done; truncation, being a potential loss of information, should be opted into explicitly, even when a maximum length and padding are defined.
tokenized_inputs = tokenizer(batch['tokens'],
                             is_split_into_words=True,
                             return_tensors="tf",
                             truncation=True,
                             padding='max_length',
                             max_length=10)
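As a quick sanity check of what this changes (the two word lists below are toy examples of mine, using the tokenizer loaded earlier in the question, and I leave out return_tensors so the output stays plain Python lists): with padding='max_length', every encoded example comes out at exactly max_length tokens, so building same-length batch tensors can no longer fail.
enc = tokenizer([["jalan", "sudirman"],
                 ["hotel", "indonesia", "kempinski", "jakarta"]],
                is_split_into_words=True,
                truncation=True,
                padding='max_length',
                max_length=10)
print([len(ids) for ids in enc["input_ids"]])  # [10, 10]
One thing to keep in mind (my observation, not the answerer's): max_length=10 also truncates anything longer than ten tokens, so in practice it would need to be set comfortably above the longest sequence in the data.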
https://stackoverflow.com/questions/72999740