Word2Vec is a popular word-embedding technique that maps words to dense vector representations via a neural network, capturing semantic relationships between them. PyTorch, as a deep learning framework, provides flexible tools for implementing it.
Below is a simple PyTorch Word2Vec implementation using the Skip-gram model with negative sampling:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class Word2VecDataset(Dataset):
    def __init__(self, text, word_to_idx, word_freqs, C, K):
        # text: list of tokens; C: context window size; K: negatives per positive word
        self.text_encoded = torch.tensor([word_to_idx[t] for t in text]).long()
        self.word_freqs = torch.tensor(word_freqs)
        self.C = C
        self.K = K

    def __len__(self):
        return len(self.text_encoded)

    def __getitem__(self, idx):
        center_word = self.text_encoded[idx]
        pos_indices = list(range(idx - self.C, idx)) + list(range(idx + 1, idx + self.C + 1))
        # Wrap around so windows at the start/end of the corpus stay in range
        pos_indices = [i % len(self.text_encoded) for i in pos_indices]
        pos_words = self.text_encoded[pos_indices]
        # Draw K negatives per positive word, weighted by word frequency
        neg_words = torch.multinomial(self.word_freqs, self.K * pos_words.shape[0], replacement=True)
        return center_word, pos_words, neg_words
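The negative samples above come from torch.multinomial, which draws indices in proportion to the given weights. A quick standalone check of that behavior (the frequency values here are made up for illustration; the original word2vec paper additionally raises unigram frequencies to the 0.75 power before sampling):

```python
import torch

torch.manual_seed(0)
# Hypothetical unigram frequencies for a 4-word vocabulary,
# smoothed with the conventional 0.75 exponent
word_freqs = torch.tensor([0.1, 0.2, 0.3, 0.4]) ** 0.75
samples = torch.multinomial(word_freqs, 1000, replacement=True)
counts = torch.bincount(samples, minlength=4)
print(counts)  # higher-frequency words are drawn more often
```

Because sampling is weighted rather than uniform, frequent words dominate the negative pool, which is exactly the behavior the dataset class relies on.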
class Word2VecModel(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, emb_size)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, emb_size)  # context-word vectors

    def forward(self, center_words, pos_words, neg_words):
        # Negative-sampling loss: raise scores of true context words,
        # lower scores of the sampled negatives
        center = self.in_embed(center_words).unsqueeze(2)               # [batch, emb, 1]
        pos_score = torch.bmm(self.out_embed(pos_words), center).squeeze(2)
        neg_score = torch.bmm(self.out_embed(neg_words), -center).squeeze(2)
        log_pos = nn.functional.logsigmoid(pos_score).sum(1)
        log_neg = nn.functional.logsigmoid(neg_score).sum(1)
        return -(log_pos + log_neg).mean()
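To make the objective concrete, here is the skip-gram negative-sampling loss for a single (center, positives, negatives) triple, written out with plain tensor ops; the embedding size and sample counts are arbitrary illustration values, and the random vectors stand in for untrained embeddings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb_size, n_pos, n_neg = 8, 4, 20
center = torch.randn(emb_size)       # embedding of the center word
pos = torch.randn(n_pos, emb_size)   # embeddings of true context words
neg = torch.randn(n_neg, emb_size)   # embeddings of sampled negatives

# Maximize log sigmoid(u_pos . v) for positives and log sigmoid(-u_neg . v) for negatives
loss = -(F.logsigmoid(pos @ center).sum() + F.logsigmoid(-(neg @ center)).sum())
print(loss.item())  # a positive scalar to minimize
```

Minimizing this pushes each positive dot product up and each negative dot product down, which is what the batched bmm version inside the model computes.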
# Example training procedure
text = "我 喜欢 吃 苹果 苹果 是 我 的 最爱".split()  # toy tokenized corpus
vocab = sorted(set(text))
word_to_idx = {w: i for i, w in enumerate(vocab)}
word_freqs = [text.count(w) / len(text) for w in vocab]

dataset = Word2VecDataset(text, word_to_idx, word_freqs, C=2, K=5)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
model = Word2VecModel(vocab_size=len(vocab), emb_size=100)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    for center_word, pos_words, neg_words in dataloader:
        optimizer.zero_grad()
        loss = model(center_word, pos_words, neg_words)
        loss.backward()
        optimizer.step()
With the steps above, you can implement a simple Word2Vec model in PyTorch to learn vector representations of words.
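After training, the learned vectors live in the model's in_embed table. A common pattern is to pull out the weight matrix and rank words by cosine similarity; the sketch below uses a freshly initialized embedding in place of model.in_embed, so the similarities are meaningless until actual training has run:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab = ["我", "喜欢", "吃", "苹果", "是", "的", "最爱"]
embed = nn.Embedding(len(vocab), 100)  # stands in for a trained model.in_embed

vectors = embed.weight.detach()                          # [vocab_size, 100]
query = vectors[vocab.index("苹果")]
sims = F.cosine_similarity(query.unsqueeze(0), vectors)  # [vocab_size]
ranked = [vocab[i] for i in sims.argsort(descending=True)]
print(ranked[0])  # the query word is always its own nearest neighbor
```

With a trained model, the words ranked just after the query would be its semantic neighbors in the corpus.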