Q: Training TFBertForSequenceClassification with custom X and Y data

Stack Overflow user

Asked 2020-02-29 17:49:46

3 answers, 1.6K views, 0 followers, 9 votes

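I am working on a text classification problem and am trying to train my model on TFBertForSequenceClassification from the huggingface-transformers library.

I followed the example on their GitHub page and can run the sample code on the provided sample data using tensorflow_datasets.load('glue/mrpc'). However, I cannot find an example of how to load my own custom data and pass it to model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7).

How can I define my own X, tokenize it, and prepare train_dataset from X and Y, where X is my input text and Y is the classification category of the given X?

Sample training dataframe: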

    text    category_index
0   Assorted Print Joggers - Pack of 2 ,/ Gray Pri...   0
1   "Buckle" ( Matt ) for 35 mm Width Belt  0
2   (Gagam 07) Barcelona Football Jersey Home 17 1...   2
3   (Pack of 3 Pair) Flocklined Reusable Rubber Ha...   1
4   (Summer special Offer)Firststep new born baby ...   0

nlp

pytorch

tensorflow2.0

huggingface-transformers

bert-language-model

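3 Answers

Stack Overflow user

Answered 2020-08-07 12:30:41

There are not many good examples of using HuggingFace Transformers with custom dataset files.

Let's import the required libraries first: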

import numpy as np
import pandas as pd

import sklearn.model_selection as ms
import sklearn.preprocessing as p

import tensorflow as tf
import transformers as trfs

And define the constants we need:

# Max length of encoded string(including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64 

# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased' 

# Batch size for fitting:
BATCH_SIZE = 16 

# Number of epochs:
EPOCHS=5

Now it's time to read the dataset:

df = pd.read_csv('data.csv')

Then define the model for sequence classification from the pretrained BERT:

def create_model(max_sequence, model_name, num_labels):
    bert_model = trfs.TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    
    # Input for the token ids (the words from the dataset after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')

    # attention_mask is a binary mask that tells BERT which tokens to attend to.
    # The encoder pads sequences shorter than MAX_SEQUENCE_LENGTH with 0 tokens,
    # and the mask marks which positions hold real tokens and which hold padding:
    attention_mask = tf.keras.layers.Input((max_sequence,), dtype=tf.int32, name='attention_mask')
    
    # Use the previous inputs as BERT inputs; element [0] holds the classification logits:
    output = bert_model([input_ids, attention_mask])[0]

    # We can also add dropout as a regularization technique:
    #output = tf.keras.layers.Dropout(rate=0.15)(output)

    # Extra softmax layer with one unit per class. The BERT head above already
    # returns num_labels logits, so this layer only maps them to probabilities:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model

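Now we need to instantiate the model with the function defined above and compile it:

# The label column in the question's dataframe is category_index:
model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.category_index.nunique())

opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
# The labels are integer class ids, so use the sparse variant of the loss:
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])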
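
The category_index column holds integer class ids, so the choice of Keras loss matters: categorical_crossentropy expects one-hot targets, while sparse_categorical_crossentropy takes the integer ids directly. A minimal plain-Python sketch of the two forms (the helper names and the probability values are illustrative, not Keras APIs or real model output) suggests that they compute the same quantity when the targets are encoded consistently:

```python
import math

def one_hot(index, num_classes):
    """Return a one-hot list for an integer class id."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def categorical_ce(one_hot_target, probs):
    """Cross-entropy for a one-hot target: -sum(t * log(p))."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

def sparse_categorical_ce(class_id, probs):
    """Cross-entropy for an integer target: -log(p[class_id])."""
    return -math.log(probs[class_id])

probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for 3 classes
label = 2                # integer label, as stored in category_index

loss_sparse = sparse_categorical_ce(label, probs)
loss_onehot = categorical_ce(one_hot(label, 3), probs)
```

Passing integer labels to the one-hot form without converting them first is the usual source of shape errors at fit time.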

Create a function for tokenization (converting text into tokens):

def batch_encode(X, tokenizer):
    return tokenizer.batch_encode_plus(
        list(X),  # accepts a NumPy array or a list of strings
        max_length=MAX_SEQUENCE_LENGTH,  # set the length of the sequences
        add_special_tokens=True,  # add [CLS] and [SEP] tokens
        return_attention_mask=True,
        return_token_type_ids=False,  # not needed for this type of ML task
        padding='max_length',  # add 0 pad tokens to sequences shorter than max_length (older releases used pad_to_max_length=True)
        truncation=True,  # cut sequences longer than max_length
        return_tensors='tf'
    )
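
To make the padding and the attention mask concrete, here is a minimal plain-Python sketch of what the tokenizer produces for one sequence. The helper pad_and_mask is hypothetical and only mimics the idea; the real tokenizer also handles truncation more carefully (for example, it keeps the trailing [SEP] token).

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token ids to max_length and build the matching mask."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)              # 1 = real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding

# Illustrative ids: 101 is [CLS] and 102 is [SEP] in the bert-base-uncased vocabulary.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_length=8)
```

With max_length=8 the four real tokens are followed by four pad ids, and the mask is 1 for the former and 0 for the latter.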

Load the tokenizer:

tokenizer = trfs.BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

Split the data into training and validation parts:

X_train, X_val, y_train, y_val = ms.train_test_split(df.text.values, df.category_index.values, test_size=0.2)

Encode both sets:

X_train = batch_encode(X_train, tokenizer)
X_val = batch_encode(X_val, tokenizer)

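Finally, we can fit the model on the training set and validate it on the validation set after each epoch:

# Pass the encoded tensors in the same order as the model inputs:
model.fit(
    x=[X_train['input_ids'], X_train['attention_mask']],
    y=y_train,
    validation_data=([X_val['input_ids'], X_val['attention_mask']], y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
)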
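
The fit call in the question passes steps_per_epoch=115 and validation_steps=7. For a batched tf.data pipeline these are typically the number of batches per pass. The sketch below assumes that GLUE MRPC has 3,668 training and 408 validation examples and that the example script batched them with sizes 32 and 64; these figures are assumptions, not taken from the thread.

```python
import math

def steps_for(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

# Assumed MRPC split sizes and batch sizes (see the note above).
train_steps = steps_for(3668, 32)
valid_steps = steps_for(408, 64)
```

When in-memory tensors and batch_size are passed to fit, as in this answer, Keras derives the number of steps itself and neither argument is needed.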

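Votes: 3

Stack Overflow user

Answered 2020-03-22 01:00:06

You need to convert your input data into the tf.data format with the expected schema, so that you can first create the features and then train the classification model.

If you look at the GLUE datasets that ship with tensorflow_datasets, you will see that the data has a specific schema: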

dataset_ops.get_legacy_output_classes(data['train'])

{'idx': tensorflow.python.framework.ops.Tensor,
 'label': tensorflow.python.framework.ops.Tensor,
 'sentence': tensorflow.python.framework.ops.Tensor}

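This schema is required if you want to use convert_examples_to_features to prepare the data for the model.

Converting the data is not as simple as it is with pandas, for example, and it depends heavily on the structure of your input data.

A step-by-step walkthrough of such a conversion is linked in the original answer. It can be done with tf.data.Dataset.from_generator.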
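
As a rough illustration of the generator half of that approach, the sketch below yields one dictionary per row with the idx, sentence and label keys from the schema above. The function name and the second sample row are made up for illustration; the first row's text and labels echo the question's dataframe.

```python
def glue_style_examples(texts, labels):
    """Yield one dict per row using the idx/sentence/label keys of the schema."""
    for idx, (text, label) in enumerate(zip(texts, labels)):
        yield {"idx": idx, "sentence": str(text), "label": int(label)}

# Sample rows (the second text is hypothetical).
texts = ['"Buckle" ( Matt ) for 35 mm Width Belt', "Barcelona Football Jersey"]
labels = [0, 2]

examples = list(glue_style_examples(texts, labels))
```

A generator like this can then be handed to tf.data.Dataset.from_generator together with matching output types (integer, string, integer); the exact arguments depend on the TensorFlow version in use.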

Votes: 0

Stack Overflow user

Answered 2021-06-29 08:06:11

Extending the answer from konstantin_doncov.

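Config file

When instantiating a model, you need to set the model initialization parameters that are defined in the Transformers configuration file. The base class is PretrainedConfig.

PretrainedConfig

Base class for all configuration classes. Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.

Each subclass has its own parameters. For example, BERT pretrained models have BertConfig.

BertConfig

This is the configuration class to store the configuration of a BertModel or a TFBertModel. It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the BERT bert-base-uncased architecture.

For instance, the num_labels parameter comes from PretrainedConfig:

num_labels (int, optional): Number of labels to use in the last layer added to the model, typically for a classification task.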

TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

The configuration file for the bert-base-uncased model is published at Huggingface model - bert-base-uncased - config.json.

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

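Fine-tuning (transfer learning)

Huggingface provides several examples of fine-tuning on your own custom datasets, for instance using the sequence classification capability of BERT for text classification.

Fine-tuning with custom datasets

This tutorial will take you through several examples of using Transformers models with your own datasets.

Fine-tuning a pretrained model

How to fine-tune a pretrained model from the Transformers library. In TensorFlow, models can be directly trained using Keras and the fit method.

However, the examples in the documentation are overviews and lack detail.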

Fine-tuning with native PyTorch/TensorFlow

import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
# train_dataset is batched here, so batch_size is not passed to fit() again:
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)

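The complete code is available on GitHub.

HuggingFace Text classification examples

This folder contains some scripts showing examples of text classification with the Hugging Face Transformers library.

run_text_classification.py is the example of text classification fine-tuning for TensorFlow.

However, it is neither simple nor straightforward, because it is meant to be generic and all-purpose. As a result, there is no good example for people to get started with, which is why questions like this one get asked.

Classification layers

Articles on transfer learning (fine-tuning) explain adding classification layers on top of the pretrained base model, and the answer above does the same.

output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

However, the huggingface example in the documentation does not add any classification layers.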

import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
# train_dataset is batched here, so batch_size is not passed to fit() again:
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)

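This is because TFBertForSequenceClassification already includes those layers.

Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks

The base DistilBERT model without any specific head on top (as opposed to other classes, such as TFDistilBertForSequenceClassification, that do have an added classification head).

If you print the Keras model summary, for instance of TFDistilBertForSequenceClassification, it shows the Dense and Dropout layers added on top of the base BERT model.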

Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  1538      
_________________________________________________________________
dropout_59 (Dropout)         multiple                  0         
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
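
The parameter counts in this summary can be checked by hand: a Dense layer has inputs * units weights plus units biases. The sketch below assumes DistilBERT's hidden size of 768 and the default of 2 labels; the helper name is illustrative.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

hidden_size = 768  # DistilBERT hidden dimension (assumed from the base config)

pre_classifier = dense_params(hidden_size, hidden_size)  # 768 -> 768
classifier = dense_params(hidden_size, 2)                # 768 -> 2 labels
total = 66_362_880 + pre_classifier + classifier         # base model + head
```

The two Dense results match the pre_classifier and classifier rows, and the sum matches the reported total, so the head on top of the base model consists of exactly these two layers plus a parameter-free Dropout.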

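Freezing base model parameters

There are a few discussions, e.g. Fine Tune BERT Models, but Huggingface's approach is apparently not to freeze the base model parameters, as the Non-trainable params: 0 line of the Keras model summary shows.

To freeze the base distilbert layer:

for _layer in model.layers: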
    if _layer.name == 'distilbert':
        print(f"Freezing model layer {_layer.name}")
        _layer.trainable = False
    print(_layer.name)
    print(_layer.trainable)
---
Freezing model layer distilbert
distilbert
False      <----------------
pre_classifier
True
classifier
True
dropout_99
True
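
To see what freezing would do to the counts in the summary, here is a small sketch that uses plain stand-in objects instead of Keras layers. LayerInfo and both helpers are hypothetical; the parameter counts are copied from the summary above.

```python
from dataclasses import dataclass

@dataclass
class LayerInfo:
    """Hypothetical stand-in for a Keras layer: a name, a size and a flag."""
    name: str
    params: int
    trainable: bool = True

layers = [
    LayerInfo("distilbert", 66_362_880),
    LayerInfo("pre_classifier", 590_592),
    LayerInfo("classifier", 1_538),
    LayerInfo("dropout", 0),
]

def freeze(layers, name):
    """Mark the layer with the given name as not trainable."""
    for layer in layers:
        if layer.name == name:
            layer.trainable = False

def count_params(layers):
    """Return (trainable, non_trainable) parameter totals."""
    trainable = sum(l.params for l in layers if l.trainable)
    frozen = sum(l.params for l in layers if not l.trainable)
    return trainable, frozen

freeze(layers, "distilbert")
trainable, non_trainable = count_params(layers)
```

Only the classification head stays trainable after the freeze. With a real Keras model, a change to trainable generally takes effect for training only once the model is compiled again.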

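Resources

Kaggle is another resource worth looking into. Search for the keywords "huggingface" and "BERT" and you will find working code published for the competitions.

Votes: 0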

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60463829
