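I am trying to replicate in PyTorch the Google research paper on word sense disambiguation (WSD) with neural models.
Before training on large datasets, I want to check that the model can overfit a tiny training set, and I am running into problems with that.
Using this training set:
The film was also intended to be the first in a trilogy.
with this model definition: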
import torch
import torch.nn as nn
from torch import autograd

class WordGuesser(nn.Module):
    def __init__(self, hidden_dim, context_dim, embedding_dim, vocabulary_dim, batch_dim, window_dim):
        super(WordGuesser, self).__init__()
        self.hidden_dim = hidden_dim
        self.batch_dim = batch_dim
        self.window_dim = window_dim
        self.word_embeddings = nn.Embedding(vocabulary_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        #self.extract_context = nn.Linear((2 * window_dim + 1) * hidden_dim, context_dim)
        self.extract_context = nn.Linear(hidden_dim, context_dim)
        self.predict = nn.Linear(context_dim, vocabulary_dim)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # (h_0, c_0), each of shape (num_layers, batch, hidden_dim)
        return (autograd.Variable(torch.zeros(1, self.batch_dim, self.hidden_dim).cuda()),
                autograd.Variable(torch.zeros(1, self.batch_dim, self.hidden_dim).cuda()))

    def forward(self, sentence, hidden):
        embeddings = self.word_embeddings(sentence)
        # nn.LSTM expects (seq_len, batch, features)
        out, self.hidden = self.lstm(embeddings.permute(1, 0, 2), hidden)
        lstm_out = out[-1]  # output at the last time step
        context = self.extract_context(lstm_out)
        prediction = self.predict(context)
        return prediction, context
and this training routine:
import numpy as np
import torch.optim as optim
# `data` is the asker's own helper module (not shown in the post)

num_epoch = 100
hidden_units = 512
embedding_dim = 256
context_dim = 256

def mytrain():
    lines = open('training/overfit.txt').readlines()
    sentences = data.split_to_sentences(lines)  # uses spaCy to detect sentences from each line
    word2idx = dict()  # dictionary is built from the training set
    idx2word = dict()
    i = 0
    for s in sentences:
        for t in s.split(' '):
            if t in word2idx:
                continue
            word2idx[t] = i
            idx2word[i] = t
            i += 1
    word2idx['$'] = i  # the token to guess the missing word in a sentence
    idx2word[i] = '$'

    X = list()
    Y = list()
    for sentence in sentences:
        sentence = sentence.split(' ')
        for i in range(len(sentence)):
            newsentence = list(sentence)
            newsentence[i] = '$'
            if not sentence[i] in word2idx:
                continue
            indices = [word2idx[w] for w in newsentence]
            label = word2idx[sentence[i]]
            X.append(indices)
            Y.append(label)

    model = WordGuesser(hidden_units, context_dim, embedding_dim, len(word2idx), len(X), len(X[0]))
    model.train()
    model.cuda()
    input = torch.LongTensor(X).cuda()
    output = torch.LongTensor(Y).cuda()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    model.hidden = model.init_hidden()

    for epoch in range(num_epoch):
        model.hidden = model.init_hidden()
        model.zero_grad()
        input_tensor = autograd.Variable(input)
        target_tensor = autograd.Variable(output)
        predictions, context = model(input_tensor, model.hidden)
        for i, prediction in enumerate(predictions):
            sorted_val = sorted(enumerate(np.array(prediction.data)), key=lambda x : x[1], reverse=True)
            print([(idx2word[x[0]], x[1]) for x in sorted_val[:5]], idx2word[Y[i]])
        loss = criterion(predictions, target_tensor)
        loss.backward()
        optimizer.step()
        print(epoch, loss.data[0])

    torch.save(model, "train2.pt")
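During training, the model seems to overfit after the 21st epoch, as the following scores show (top 5 words from the prediction; the last word on each line is the label for that sentence):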
[('The', 11.362326), ('film', 11.356865), ('also', 7.5573149), ('to', 5.3518314), ('intended', 4.3520432)] The
[('film', 11.073805), ('The', 10.451499), ('was', 7.5498624), ('was', 4.9684553), ('be', 4.0730805)] film
[('was', 11.232123), ('also', 9.9741745), ('the', 6.0156212), ('be', 4.9949703), ('The', 4.5516477)] was
[('also', 9.6998224), ('was', 9.6202812), ('The', 6.345758), ('film', 4.9122157), ('be', 2.6727715)] also
[('intended', 18.344809), ('to', 16.410078), ('film', 10.147289), ('The', 9.8423424), ('$', 9.6181822)] intended
[('to', 12.442947), ('intended', 10.900065), ('film', 8.2598763), ('The', 8.0493736), ('$', 4.4901967)] to
[('be', 12.189278), ('was', 7.7172523), ('was', 7.5415096), ('the', 5.2521734), ('The', 4.1723843)] be
[('the', 15.59604), ('be', 9.3750105), ('first', 8.9820032), ('was', 8.6859236), ('also', 5.0665498)] the
[('first', 10.191225), ('the', 5.1829329), ('in', 3.6020348), ('be', 3.4108081), ('a', 1.5569853)] first
[('in', 14.731103), ('first', 9.3131113), ('a', 5.982264), ('trilogy', 4.2928643), ('be', 0.49548936)] in
[('a', 14.357709), ('in', 8.3088198), ('trilogy', 6.3918238), ('first', 6.2178354), ('intended', 0.95656234)] a
[('trilogy', 14.351434), ('a', 4.5073452), ('in', 4.2348137), ('$', 3.7552347), ('intended', 3.5101018)] trilogy
[('.', 18.152126), ('$', 12.028764), ('to', 9.6003456), ('intended', 8.1202478), ('The', 4.9225812)] .
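When I run another Python script that loads the model and queries the following sentences (printing the scores with the same code as during training; the word after each sentence is the expected answer for `$`):
The film was also intended to $ the first in a trilogy. be
The film $ also intended to be the first in a trilogy. was
$ film was also intended to be the first in a trilogy. The
I get these scores: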
[('film', 24.066889), ('$', 20.107487), ('was', 16.855488), ('a', 12.969441), ('in', 8.1248817)] be
[('film', 24.089062), ('$', 20.116539), ('in', 16.891994), ('a', 12.982826), ('in', 8.1167336)] was
[('film', 23.993624), ('$', 20.108011), ('in', 16.891005), ('a', 12.960193), ('in', 8.1577587)] The
I have also tried turning off training mode (set by model.train()) by calling model.eval(), as well as calling topk on the LSTM scores, but the results are not satisfying.
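For reference, the top-5 printout in the training loop sorts every score in Python, while `torch.topk` returns the same ranking directly from the tensor. The snippet below is a minimal sketch of that call; the `scores` tensor and the small `idx2word` mapping are made-up values for illustration, not numbers from the run above.

```python
import torch

# Hypothetical vocabulary and scores, for illustration only.
idx2word = {0: "The", 1: "film", 2: "was", 3: "also", 4: "$"}
scores = torch.tensor([0.5, 2.0, -1.0, 3.5, 1.2])

def top_words(scores, idx2word, k=3):
    """Return the k highest-scoring (word, score) pairs, best first."""
    values, indices = torch.topk(scores, k)
    return [(idx2word[i], v) for i, v in zip(indices.tolist(), values.tolist())]

top3 = top_words(scores, idx2word)
print(top3)
```

`torch.topk` only changes how the ranking is read out, so it would not be expected to change which words score highest.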
Posted on 2018-04-08 15:24:05
I solved the problem by saving only the model's state_dict() with torch.save(), and then loading it back in the evaluation phase with model.load_state_dict().
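A minimal sketch of that save and load round trip is shown below. `TinyGuesser` is a hypothetical, much smaller stand-in for `WordGuesser`, it runs on the CPU, and an in-memory buffer replaces the file on disk; in the real scripts the buffer would be a path such as the one passed to `torch.save` in the question.

```python
import io
import torch
import torch.nn as nn

class TinyGuesser(nn.Module):
    """Hypothetical stand-in for WordGuesser, kept small for illustration."""
    def __init__(self, vocab_size=14, embedding_dim=8, hidden_dim=16):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.predict = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sentence):
        # sentence: (batch, seq_len); nn.LSTM expects (seq_len, batch, features)
        embeddings = self.word_embeddings(sentence).permute(1, 0, 2)
        out, _ = self.lstm(embeddings)
        return self.predict(out[-1])

trained = TinyGuesser()

# Training script: save only the parameters, not the pickled module object.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Evaluation script: rebuild the same architecture, then load the weights.
restored = TinyGuesser()
restored.load_state_dict(torch.load(buffer))

batch = torch.randint(0, 14, (3, 13))  # 3 sentences of 13 token ids
with torch.no_grad():
    same = torch.equal(trained.eval()(batch), restored.eval()(batch))
print(same)
```

The evaluation script has to construct the module with the same layer sizes as the training script, since `load_state_dict()` only fills in parameter values.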
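In addition, I wrapped the sentence-query loop in an outer loop that acts as a warm-up (a tip I picked up from another post). On the last pass of that loop I call model.eval() and print the scores, which turn out to be correct.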
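The final query pass could look roughly like the sketch below. It is an illustration under assumptions, not the code from the post: `query_masked` is a hypothetical helper, the sentence is tokenized with the period as its own token (as the printed scores suggest), and the model is an untrained stand-in, so the guesses it prints are meaningless. In practice the restored `WordGuesser` would take its place.

```python
import torch
import torch.nn as nn

tokens = "The film was also intended to be the first in a trilogy .".split(" ")

# Vocabulary built the same way as in the question, plus the '$' mask token.
word2idx = {}
for token in tokens:
    word2idx.setdefault(token, len(word2idx))
word2idx["$"] = len(word2idx)
idx2word = {i: w for w, i in word2idx.items()}

# Untrained stand-in scorer (hypothetical): token ids in, vocabulary scores out.
model = nn.Sequential(
    nn.Embedding(len(word2idx), 8),
    nn.Flatten(),
    nn.Linear(8 * len(tokens), len(word2idx)),
)

def query_masked(model, tokens, position, k=5):
    """Mask one token with '$' and return the model's k best guesses for it."""
    masked = list(tokens)
    masked[position] = "$"
    ids = torch.tensor([[word2idx[w] for w in masked]])  # shape (1, seq_len)
    model.eval()           # evaluation mode: disables training-only behaviour
    with torch.no_grad():  # no gradients are needed for a query
        scores = model(ids)[0]
    values, indices = torch.topk(scores, k)
    return [(idx2word[i], v) for i, v in zip(indices.tolist(), values.tolist())]

guesses = query_masked(model, tokens, tokens.index("be"))
print(guesses, "be")
```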
https://stackoverflow.com/questions/49707613