Can the "Successfully Slimmed-Down" ALBERT Replace BERT?

By Shisan | QbitAI (WeChat public account: QbitAI)

80% fewer parameters than BERT, yet better performance.

That is ALBERT, the "successfully slimmed-down BERT" that Google proposed last year.

The model drew a great deal of attention as soon as it was released, and comparisons between the two quickly became a hot topic.

Recently, Naman Bansal raised a question:

Should ALBERT be used to replace BERT?

Whether it can, a head-to-head comparison will tell.

BERT vs. ALBERT

BERT is the better known of the two models.

Proposed by Google in 2018, it was trained on an enormous corpus containing 3.3 billion words.

The model's key innovation lies in the pretraining stage: it uses two objectives, Masked LM and Next Sentence Prediction, to capture word-level and sentence-level representations, respectively.
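
As a rough illustration of the Masked LM idea, here is a minimal sketch of the input-corruption step (a simplification: BERT's actual masking also replaces some selected positions with random tokens or leaves them unchanged, and operates on WordPiece tokens):

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Randomly hide roughly 15% of the tokens; the model is trained to recover the originals.
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)       # this position is scored against the original token
        else:
            corrupted.append(tok)
            labels.append(None)      # this position is not scored
    return corrupted, labels

print(mask_tokens("the pad thai at this place was delicious".split()))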

BERT's arrival fundamentally changed the relationship between pretrained word representations and downstream NLP tasks.

A year later, Google proposed ALBERT, short for "A Lite BERT". Its backbone is similar to BERT's: it still uses a Transformer encoder, and the activation function is still GELU.

Its biggest success is that it uses 80% fewer parameters than BERT while achieving even better results.

Its improvements over BERT mainly include factorized embedding parameterization, cross-layer parameter sharing, a sentence-order prediction (SOP) loss for inter-sentence coherence, and the removal of dropout.
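
To get an intuition for the factorized embedding parameterization, here is a quick back-of-the-envelope count (the sizes below are the commonly quoted BERT-base / ALBERT-base values, used here as assumptions for the sketch, not numbers from this post):

V, H, E = 30000, 768, 128                 # vocab size, hidden size, ALBERT's smaller embedding size
bert_embedding_params = V * H             # BERT maps the vocabulary directly to the hidden size
albert_embedding_params = V * E + E * H   # ALBERT: vocab -> E, then a small projection E -> H
print(bert_embedding_params, albert_embedding_params)   # roughly 23.0M vs 3.9M embedding parameters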

The figure below compares the performance of BERT and ALBERT on the SQuAD and RACE datasets.

As the comparison shows, ALBERT achieves strong results.

How to pretrain ALBERT on a custom corpus?

To get a better feel for ALBERT, we will now implement it on a custom corpus.

The dataset is a restaurant-review dataset, and the goal is to use the ALBERT model to recognize dish names.

Step 1: Download the dataset and prepare the files

#Downloading all files and data

!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/dish_name_train.csv
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/dish_name_val.csv
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/restaurant_review.txt
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/data_toy/restaurant_review_nopunct.txt
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/models_toy/albert_config.json
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/model_checkpoint/finetune_checkpoint
!wget https://github.com/LydiaXiaohongLi/Albert_Finetune_with_Pretrain_on_Custom_Corpus/raw/master/model_checkpoint/pretrain_checkpoint

#Creating files and setting up ALBERT

!pip install sentencepiece
!git clone https://github.com/google-research/ALBERT
# Note: create_pretraining_data.py and the tokenizer used later expect a WordPiece vocab file;
# this walkthrough assumes vocab.txt is already present in the working directory.
!python ./ALBERT/create_pretraining_data.py --input_file "restaurant_review.txt" --output_file "restaurant_review_train" --vocab_file "vocab.txt" --max_seq_length=64
!pip install transformers
!pip install tfrecord
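
With the files in place, a quick peek at the fine-tuning data doesn't hurt. A minimal sketch; the dish_name and review column names are the ones used by the fine-tuning code later in this post:

import pandas as pd

df_peek = pd.read_csv("dish_name_train.csv")
print(df_peek[['dish_name', 'review']].head())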

Step 2: Use transformers and define the layers

#Defining Layers for ALBERT

from transformers.modeling_albert import AlbertModel, AlbertPreTrainedModel
from transformers.configuration_albert import AlbertConfig
import torch
import torch.nn as nn
class AlbertSequenceOrderHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, 2)
        self.bias = nn.Parameter(torch.zeros(2))

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        prediction_scores = hidden_states + self.bias

        return prediction_scores

from torch.nn import CrossEntropyLoss
from transformers.modeling_bert import ACT2FN
class AlbertForPretrain(AlbertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)

        self.albert = AlbertModel(config)       

        # For Masked LM
        # The original huggingface implementation created new output weights via a dense layer.
        # However, the original ALBERT ties the decoder weights to the input token embeddings,
        # which is what the weight assignment below does.
        self.predictions_dense = nn.Linear(config.hidden_size, config.embedding_size)
        self.predictions_activation = ACT2FN[config.hidden_act]
        self.predictions_LayerNorm = nn.LayerNorm(config.embedding_size)
        self.predictions_bias = nn.Parameter(torch.zeros(config.vocab_size)) 
        self.predictions_decoder = nn.Linear(config.embedding_size, config.vocab_size)

        self.predictions_decoder.weight = self.albert.embeddings.word_embeddings.weight

        # For sequence order prediction
        self.seq_relationship = AlbertSequenceOrderHead(config)


    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        masked_lm_labels=None,
        seq_relationship_labels=None,
    ):

        outputs = self.albert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        loss_fct = CrossEntropyLoss()

        sequence_output = outputs[0]

        sequence_output = self.predictions_dense(sequence_output)
        sequence_output = self.predictions_activation(sequence_output)
        sequence_output = self.predictions_LayerNorm(sequence_output)
        prediction_scores = self.predictions_decoder(sequence_output)


        if masked_lm_labels is not None:
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size)
                                      , masked_lm_labels.view(-1))

        pooled_output = outputs[1]
        seq_relationship_scores = self.seq_relationship(pooled_output)
        if seq_relationship_labels is not None:  
            seq_relationship_loss = loss_fct(seq_relationship_scores.view(-1, 2), seq_relationship_labels.view(-1))

        # Note: this toy implementation assumes both masked_lm_labels and
        # seq_relationship_labels are always provided, so both losses are defined.
        loss = masked_lm_loss + seq_relationship_loss

        return loss

Step 3: Use the LAMB optimizer and fine-tune ALBERT

#Using LAMB optimizer
#LAMB -  "https://github.com/cybertronai/pytorch-lamb"

import torch
from torch.optim import Optimizer
class Lamb(Optimizer):
    r"""Implements Lamb algorithm.
    It has been proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_.
    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        adam (bool, optional): always use trust ratio = 1, which turns this into
            Adam. Useful for comparison purposes.
    .. _Large Batch Optimization for Deep Learning: Training BERT in 76 minutes:
        https://arxiv.org/abs/1904.00962
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
                 weight_decay=0, adam=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay)
        self.adam = adam
        super(Lamb, self).__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError('Lamb does not support sparse gradients, consider SparseAdam instead.')

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                # Decay the first and second moment running average coefficient
                # m_t
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                # v_t
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

                # Paper v3 does not use debiasing.
                # bias_correction1 = 1 - beta1 ** state['step']
                # bias_correction2 = 1 - beta2 ** state['step']
                # Apply bias to lr to avoid broadcast.
                step_size = group['lr'] # * math.sqrt(bias_correction2) / bias_correction1

                weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10)

                adam_step = exp_avg / exp_avg_sq.sqrt().add(group['eps'])
                if group['weight_decay'] != 0:
                    adam_step.add_(group['weight_decay'], p.data)

                adam_norm = adam_step.pow(2).sum().sqrt()
                if weight_norm == 0 or adam_norm == 0:
                    trust_ratio = 1
                else:
                    trust_ratio = weight_norm / adam_norm
                state['weight_norm'] = weight_norm
                state['adam_norm'] = adam_norm
                state['trust_ratio'] = trust_ratio
                if self.adam:
                    trust_ratio = 1

                p.data.add_(-step_size * trust_ratio, adam_step)

        return loss

import time
import torch.nn as nn
import torch
from tfrecord.torch.dataset import TFRecordDataset
import numpy as np
import os

LEARNING_RATE = 0.001
EPOCH = 40
BATCH_SIZE = 2
MAX_GRAD_NORM = 1.0

print(f"--- Resume/Start training ---")   
feat_map = {"input_ids": "int", 
           "input_mask": "int",
           "segment_ids": "int",
           "next_sentence_labels": "int",
           "masked_lm_positions": "int",
           "masked_lm_ids": "int"}
pretrain_file = 'restaurant_review_train'

# Create albert pretrain model
config = AlbertConfig.from_json_file("albert_config.json")
albert_pretrain = AlbertForPretrain(config)
# Create optimizer
optimizer = Lamb([{"params": [p for n, p in list(albert_pretrain.named_parameters())]}], lr=LEARNING_RATE)
albert_pretrain.train()
dataset = TFRecordDataset(pretrain_file, index_path = None, description=feat_map)
loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE)

tmp_loss = 0
start_time = time.time()

if os.path.isfile('pretrain_checkpoint'):
    print(f"--- Load from checkpoint ---")
    checkpoint = torch.load("pretrain_checkpoint")
    albert_pretrain.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    loss = checkpoint['loss']
    losses = checkpoint['losses']

else:
    epoch = -1
    losses = []
for e in range(epoch+1, EPOCH):
    for batch in loader:
        b_input_ids = batch['input_ids'].long() 
        b_token_type_ids = batch['segment_ids'].long() 
        b_seq_relationship_labels = batch['next_sentence_labels'].long()

        # Convert the data from the decoded format produced by Google's ALBERT
        # create_pretraining_data.py script into the format required by
        # Hugging Face's PyTorch implementation of ALBERT
        mask_rows = np.nonzero(batch['masked_lm_positions'].numpy())[0]
        mask_cols = batch['masked_lm_positions'].numpy()[batch['masked_lm_positions'].numpy()!=0]
        b_attention_mask = np.zeros((BATCH_SIZE,64),dtype=np.int64)
        b_attention_mask[mask_rows,mask_cols] = 1
        b_masked_lm_labels = np.zeros((BATCH_SIZE,64),dtype=np.int64) - 100
        b_masked_lm_labels[mask_rows,mask_cols] = batch['masked_lm_ids'].numpy()[batch['masked_lm_positions'].numpy()!=0]     
        b_attention_mask=torch.tensor(b_attention_mask).long()
        b_masked_lm_labels=torch.tensor(b_masked_lm_labels).long()


        loss = albert_pretrain(input_ids = b_input_ids
                              , attention_mask = b_attention_mask
                              , token_type_ids = b_token_type_ids
                              , masked_lm_labels = b_masked_lm_labels 
                              , seq_relationship_labels = b_seq_relationship_labels)

        # clears old gradients
        optimizer.zero_grad()
        # backward pass
        loss.backward()
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=albert_pretrain.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()

        tmp_loss += loss.detach().item()

    # print metrics and save to checkpoint every epoch
    print(f"Epoch: {e}")
    print(f"Train loss: {(tmp_loss/20)}")
    print(f"Train Time: {(time.time()-start_time)/60} mins")  
    losses.append(tmp_loss/20)

    tmp_loss = 0
    start_time = time.time()

    torch.save({'model_state_dict': albert_pretrain.state_dict(),'optimizer_state_dict': optimizer.state_dict(),
               'epoch': e, 'loss': loss,'losses': losses}
           , 'pretrain_checkpoint')
from matplotlib import pyplot as plot
plot.plot(losses)

#Fine tuning ALBERT

# At the time of writing, Hugging Face didn't provide a class for
# AlbertForTokenClassification, hence we write our own definition below
from transformers.modeling_albert import AlbertModel, AlbertPreTrainedModel
from transformers.configuration_albert import AlbertConfig
from transformers.tokenization_bert import BertTokenizer
import torch.nn as nn
from torch.nn import CrossEntropyLoss
class AlbertForTokenClassification(AlbertPreTrainedModel):

    def __init__(self, albert, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.albert = albert
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
    ):

        outputs = self.albert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        sequence_output = outputs[0]

        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        return logits

import numpy as np
def label_sent(name_tokens, sent_tokens):
    label = []
    i = 0
    if len(name_tokens)>len(sent_tokens):
        label = np.zeros(len(sent_tokens))
    else:
        while i<len(sent_tokens):
            found_match = False
            if name_tokens[0] == sent_tokens[i]:       
                found_match = True
                for j in range(len(name_tokens)-1):
                    if ((i+j+1)>=len(sent_tokens)):
                        return label
                    if name_tokens[j+1] != sent_tokens[i+j+1]:
                        found_match = False
                if found_match:
                    label.extend(list(np.ones(len(name_tokens)).astype(int)))
                    i = i + len(name_tokens)
                else: 
                    label.extend([0])
                    i = i+ 1
            else:
                label.extend([0])
                i=i+1
    return label
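
# Quick sanity check of label_sent on toy tokens (an illustrative example, not taken from
# the dataset): 1 marks a dish-name token position, 0 marks everything else.
print(label_sent(['pad', 'thai'], ['the', 'pad', 'thai', 'was', 'great']))
# expected: [0, 1, 1, 0, 0]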

import pandas as pd
import glob
import os

tokenizer = BertTokenizer(vocab_file="vocab.txt")

df_data_train = pd.read_csv("dish_name_train.csv")
df_data_train['name_tokens'] = df_data_train['dish_name'].apply(tokenizer.tokenize)
df_data_train['review_tokens'] = df_data_train.review.apply(tokenizer.tokenize)
df_data_train['review_label'] = df_data_train.apply(lambda row: label_sent(row['name_tokens'], row['review_tokens']), axis=1)

df_data_val = pd.read_csv("dish_name_val.csv")
df_data_val = df_data_val.dropna().reset_index()
df_data_val['name_tokens'] = df_data_val['dish_name'].apply(tokenizer.tokenize)
df_data_val['review_tokens'] = df_data_val.review.apply(tokenizer.tokenize)
df_data_val['review_label'] = df_data_val.apply(lambda row: label_sent(row['name_tokens'], row['review_tokens']), axis=1)

MAX_LEN = 64
BATCH_SIZE = 1
from keras.preprocessing.sequence import pad_sequences
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

tr_inputs = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in df_data_train['review_tokens']],maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
tr_tags = pad_sequences(df_data_train['review_label'],maxlen=MAX_LEN, padding="post",dtype="long", truncating="post")
# create the mask to ignore the padded elements in the sequences.
tr_masks = [[float(i>0) for i in ii] for ii in tr_inputs]
tr_inputs = torch.tensor(tr_inputs)
tr_tags = torch.tensor(tr_tags)
tr_masks = torch.tensor(tr_masks)
train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)


val_inputs = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in df_data_val['review_tokens']],maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
val_tags = pad_sequences(df_data_val['review_label'],maxlen=MAX_LEN, padding="post",dtype="long", truncating="post")
# create the mask to ignore the padded elements in the sequences.
val_masks = [[float(i>0) for i in ii] for ii in val_inputs]
val_inputs = torch.tensor(val_inputs)
val_tags = torch.tensor(val_tags)
val_masks = torch.tensor(val_masks)
val_data = TensorDataset(val_inputs, val_masks, val_tags)
val_sampler = RandomSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=BATCH_SIZE)

model_tokenclassification = AlbertForTokenClassification(albert_pretrain.albert, config)
from torch.optim import Adam
LEARNING_RATE = 0.0000003
FULL_FINETUNING = True
if FULL_FINETUNING:
    param_optimizer = list(model_tokenclassification.named_parameters())
    no_decay = ['bias', 'gamma', 'beta']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0}
    ]
else:
    param_optimizer = list(model_tokenclassification.classifier.named_parameters()) 
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = Adam(optimizer_grouped_parameters, lr=LEARNING_RATE)

Step 4: Train the model on the custom corpus

#Training the model

# from torch.utils.tensorboard import SummaryWriter
import time
import os.path
import torch.nn as nn
import torch
EPOCH = 800
MAX_GRAD_NORM = 1.0

start_time = time.time()
tr_loss, tr_acc, nb_tr_steps = 0, 0, 0
eval_loss, eval_acc, nb_eval_steps = 0, 0, 0

if os.path.isfile('finetune_checkpoint'):
    print(f"--- Load from checkpoint ---")
    checkpoint = torch.load("finetune_checkpoint")
    model_tokenclassification.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    train_losses = checkpoint['train_losses']
    train_accs = checkpoint['train_accs']
    eval_losses = checkpoint['eval_losses']
    eval_accs = checkpoint['eval_accs']

else:
    epoch = -1
    train_losses,train_accs,eval_losses,eval_accs = [],[],[],[]

print(f"--- Resume/Start training ---")    
for e in range(epoch+1, EPOCH): 

    # TRAIN loop
    model_tokenclassification.train()

    for batch in train_dataloader:
        # add batch to gpu
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        # forward pass
        b_outputs = model_tokenclassification(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        ce_loss_fct = CrossEntropyLoss()
        # Only keep active parts of the loss
        b_active_loss = b_input_mask.view(-1) == 1
        b_active_logits = b_outputs.view(-1, config.num_labels)[b_active_loss]
        b_active_labels = b_labels.view(-1)[b_active_loss]

        loss = ce_loss_fct(b_active_logits, b_active_labels)
        acc = torch.mean((torch.max(b_active_logits.detach(),1)[1] == b_active_labels.detach()).float())

        model_tokenclassification.zero_grad()
        # backward pass
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        tr_acc += acc
        nb_tr_steps += 1
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model_tokenclassification.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()


    # VALIDATION on validation set
    model_tokenclassification.eval()
    for batch in val_dataloader:
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():

            b_outputs = model_tokenclassification(b_input_ids, token_type_ids=None,
                         attention_mask=b_input_mask, labels=b_labels)

            loss_fct = CrossEntropyLoss()
            # Only keep active parts of the loss
            b_active_loss = b_input_mask.view(-1) == 1
            b_active_logits = b_outputs.view(-1, config.num_labels)[b_active_loss]
            b_active_labels = b_labels.view(-1)[b_active_loss]
            loss = loss_fct(b_active_logits, b_active_labels)
            acc = np.mean(np.argmax(b_active_logits.detach().cpu().numpy(), axis=1).flatten() == b_active_labels.detach().cpu().numpy().flatten())

        eval_loss += loss.mean().item()
        eval_acc += acc
        nb_eval_steps += 1    

    if e % 10 ==0:

        print(f"Epoch: {e}")
        print(f"Train loss: {(tr_loss/nb_tr_steps)}")
        print(f"Train acc: {(tr_acc/nb_tr_steps)}")
        print(f"Train Time: {(time.time()-start_time)/60} mins")  

        print(f"Validation loss: {eval_loss/nb_eval_steps}")
        print(f"Validation Accuracy: {(eval_acc/nb_eval_steps)}") 

        train_losses.append(tr_loss/nb_tr_steps)
        train_accs.append(tr_acc/nb_tr_steps)
        eval_losses.append(eval_loss/nb_eval_steps)
        eval_accs.append(eval_acc/nb_eval_steps)


        tr_loss, tr_acc, nb_tr_steps = 0, 0, 0 
        eval_loss, eval_acc, nb_eval_steps = 0, 0, 0 
        start_time = time.time() 

        torch.save({'model_state_dict': model_tokenclassification.state_dict(),'optimizer_state_dict': optimizer.state_dict(),
           'epoch': e, 'train_losses': train_losses,'train_accs': train_accs, 'eval_losses':eval_losses,'eval_accs':eval_accs}
       , 'finetune_checkpoint')

plot.plot(train_losses)
plot.plot(train_accs)
plot.plot(eval_losses)
plot.plot(eval_accs)
plot.legend(labels = ['train_loss','train_accuracy','validation_loss','validation_accuracy'])

Step 5: Prediction

#Prediction

def predict(texts):
    tokenized_texts = [tokenizer.tokenize(txt) for txt in texts]
    input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                              maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
    attention_mask = [[float(i>0) for i in ii] for ii in input_ids]

    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)

    dataset = TensorDataset(input_ids, attention_mask)
    datasampler = SequentialSampler(dataset)
    dataloader = DataLoader(dataset, sampler=datasampler, batch_size=BATCH_SIZE) 

    predicted_labels = []

    for batch in dataloader:
        batch = tuple(t for t in batch)
        b_input_ids, b_input_mask = batch

        with torch.no_grad():
            logits = model_tokenclassification(b_input_ids, token_type_ids=None,
                           attention_mask=b_input_mask)

            predicted_labels.append(np.multiply(np.argmax(logits.detach().cpu().numpy(),axis=2), b_input_mask.detach().cpu().numpy()))
    # np.concatenate(predicted_labels), to flatten list of arrays of batch_size * max_len into list of arrays of max_len
    return np.concatenate(predicted_labels).astype(int), tokenized_texts

def get_dish_candidate_names(predicted_label, tokenized_text):
    name_lists = []
    if len(np.where(predicted_label>0)[0])>0:
        name_idx_combined = np.where(predicted_label>0)[0]
        name_idxs = np.split(name_idx_combined, np.where(np.diff(name_idx_combined) != 1)[0]+1)
        name_lists.append([" ".join(np.take(tokenized_text,name_idx)) for name_idx in name_idxs])
        # Deduplicate if there are repeated names in name_lists
        name_lists = np.unique(name_lists)
        return name_lists
    else:
        return None

texts = df_data_val.review.values
predicted_labels, _ = predict(texts)
df_data_val['predicted_review_label'] = list(predicted_labels)
df_data_val['predicted_name']=df_data_val.apply(lambda row: get_dish_candidate_names(row.predicted_review_label, row.review_tokens)
                                                , axis=1)

texts = df_data_train.review.values
predicted_labels, _ = predict(texts)
df_data_train['predicted_review_label'] = list(predicted_labels)
df_data_train['predicted_name']=df_data_train.apply(lambda row: get_dish_candidate_names(row.predicted_review_label, row.review_tokens)
                                                , axis=1)

df_data_val  # inspect the validation set with the predicted dish names added

Results

As you can see, the model successfully extracts dish names from the restaurant reviews.

Model face-off

As the hands-on example above shows, ALBERT may be "lite", but its results are quite good.

So, with fewer parameters and good results, can it simply replace BERT?

Let's look more closely at the performance comparison between the two; "speedup" here refers to training time.

Because less data has to be moved around, throughput goes up in distributed training, so ALBERT trains faster. At inference time, however, it still requires the same Transformer computation as BERT.

So, to sum up:

  • Given the same training time, ALBERT performs better than BERT.
  • Given the same inference time, neither ALBERT base nor ALBERT large matches BERT.

In addition, Naman Bansal argues that, because of ALBERT's architecture, it is somewhat more computationally expensive to run than BERT.

So it remains a case of not being able to have it both ways: for ALBERT to fully surpass and replace BERT, further research and refinement are still needed.

Links

Blog post: https://medium.com/@namanbansal9909/should-we-shift-from-bert-to-albert-e6fbb7779d3e
