Fine-tuning GPT-2 on a large text to generate domain-specific text

Stack Overflow user

Asked 2020-09-16 01:38:21

1 answer · 3.6K views · 0 followers · 2 votes

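I am trying to train GPT-2 on a very large text so that it can generate text from a specific domain.

I am working with TensorFlow 2.

For example, say I have all of the Harry Potter books :)

I want to train GPT-2 on them so that I can later generate text from the Harry Potter domain.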

from tensorflow.keras.utils import get_file
from transformers import GPT2Tokenizer, TFGPT2Model

text = '...'
# Length of text: 474429 characters
# 84 unique characters

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = TFGPT2Model.from_pretrained('gpt2-medium')

encoded_input = tokenizer(text, return_tensors='tf') # ERROR
output = model(encoded_input)

input_ids = tokenizer.encode('severus snape', return_tensors='tf')
greedy_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

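Error: Token indices sequence length is longer than the specified maximum sequence length for this model (149887 > 1024). Running this sequence through the model will result in indexing errors.

So what should I do?

How can I give the model a large new text to train on?

Edit:

When I split the text and concatenate the encoded chunks, the tokenizer works but the model does not: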

import tensorflow as tf
from textwrap import wrap
text_batches = wrap(text, 1000)

encoded_input = None

for tb in text_batches:
    current = tokenizer(tb, return_tensors='tf')
  
    if encoded_input is None:
        encoded_input = current
    else:
        encoded_input['input_ids']      = tf.concat([encoded_input['input_ids'], current['input_ids']], axis=-1)
        encoded_input['attention_mask'] = tf.concat([encoded_input['attention_mask'], current['attention_mask']], axis=1)

output = model(encoded_input) # ERROR

Error: InvalidArgumentError: indices[0,1024] = 1024 is not in [0, 1024) [Op:ResourceGather]

What am I missing?

deep-learning

nlp

huggingface-transformers

tensorflow

keras

1 Answer

Stack Overflow user

Answered 2020-09-16 02:05:23

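Your problem is not related to training on a different domain. Rather, you are providing a text (apparently 149887 tokens long) that exceeds the maximum sequence length the model supports (1024 tokens). You have three options:

Manually truncate the input string to the maximum number of tokens.

Set the max_length parameter when calling the tokenizer, e.g. tokenizer(text, max_length=1024, ...). Make sure to read through all the available options of the Tokenizer class.

Reconsider why you need a single text string of 149K tokens. Is this the whole body of text? Should you be using sentences instead?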
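Truncation keeps only the first 1024 tokens and discards the rest, so for fine-tuning on a whole corpus it may be more useful to split the token ids into consecutive blocks that each fit the model's limit. The following is a minimal sketch of both ideas in plain Python, not the answerer's code: `truncate` and `split_into_blocks` are hypothetical helper names, and the integer list stands in for real ids, which with transformers one might obtain from `tokenizer.encode(text)`. Concatenating the chunks back into one sequence, as in the question's edit, recreates an over-long input, which is why position index 1024 falls outside the model's 1024 positions.

```python
from typing import List

MAX_POSITIONS = 1024  # GPT-2's maximum sequence length, as reported in the error


def truncate(token_ids: List[int], max_length: int = MAX_POSITIONS) -> List[int]:
    """Keep only the first max_length ids; everything after them is dropped."""
    return token_ids[:max_length]


def split_into_blocks(token_ids: List[int],
                      block_size: int = MAX_POSITIONS) -> List[List[int]]:
    """Split a flat list of ids into consecutive blocks of at most block_size ids."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids), block_size)]


# Stand-in for the 149887 token ids from the question (assumption: real ids
# would come from the tokenizer, not from range()).
token_ids = list(range(149887))

truncated = truncate(token_ids)
blocks = split_into_blocks(token_ids)

print(len(truncated))                              # size of the truncated input
print(len(blocks), len(blocks[0]), len(blocks[-1]))  # block count, first and last block sizes
```

Each block could then be fed to the model as a separate example rather than concatenated into one sequence. Whether to split on a fixed token count or on sentence boundaries, as the third option suggests, depends on the data.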

2 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/63911955
