# word2vec: Model Ideas and Code Implementation

word2vec comes in two models, CBOW and Skip-Gram. This post covers the Skip-Gram algorithm and its implementation.

What does Skip-Gram actually do?
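In short, Skip-Gram learns word vectors by predicting the context words that surround each center word. A minimal sketch of how (center, context) training pairs are extracted from text; the function name, sentence, and window size here are illustrative assumptions, not part of the original code:

```python
# Sketch: extract (center, context) training pairs with window size C.
def make_skipgram_pairs(sentence, C):
    pairs = []
    for i, center in enumerate(sentence):
        # context = up to C words on each side of the center word
        context = sentence[max(0, i - C):i] + sentence[i + 1:i + 1 + C]
        pairs.append((center, context))
    return pairs

pairs = make_skipgram_pairs(["the", "quick", "brown", "fox"], 1)
# the center word "quick" keeps the context ["the", "brown"]
```

The model is then trained so that, given the center word's vector, the probability of each of its context words is as high as possible.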

The Skip-Gram algorithm is as follows. First, a test harness that builds a dummy dataset and runs a gradient check:

```python
import random
import numpy as np

def test_word2vec():
    dataset = type('dummy', (), {})()    # create a dynamic object, then attach attributes to it

    def dummySampleTokenIdx():           # generate one integer in [0, 4]
        return random.randint(0, 4)

    def getRandomContext(C):             # e.g. getRandomContext(3) = ('d', ['d', 'd', 'd', 'e', 'a', 'd'])
        tokens = ["a", "b", "c", "d", "e"]
        return tokens[random.randint(0, 4)], \
            [tokens[random.randint(0, 4)] for i in range(2 * C)]

    dataset.sampleTokenIdx = dummySampleTokenIdx    # attach the two methods to dataset
    dataset.getRandomContext = getRandomContext

    random.seed(31415)
    np.random.seed(9265)                 # can be called again to re-seed the generator

    # in this test the word-vector matrix is randomly generated,
    # but in real training this matrix holds the trained vectors
    dummy_vectors = normalizeRows(np.random.randn(10, 3))   # matrix of shape (10, 3)
    dummy_tokens = dict([("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)])

    print("==== Gradient check for skip-gram ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(skipgram, dummy_tokens, vec, dataset, 5),
                    dummy_vectors)       # vec is dummy_vectors

    print("\n=== Results ===")
    print(skipgram("c", 3, ["a", "b", "e", "d", "b", "c"], dummy_tokens,
                   dummy_vectors[:5, :], dummy_vectors[5:, :], dataset))

if __name__ == "__main__":
    test_word2vec()
```
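The test above relies on three helpers that are defined elsewhere in the assignment code. A minimal sketch of how `softmax`, `normalizeRows`, and `gradcheck_naive` could look; the step size `h` and tolerance are assumptions of mine, not taken from the original:

```python
import numpy as np

def softmax(x):
    # subtract the row max for numerical stability, then normalize
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def normalizeRows(x):
    # scale each row to unit L2 norm
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def gradcheck_naive(f, x, h=1e-4):
    # compare the analytic gradient with a centered numerical difference
    cost, grad = f(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph, _ = f(x)
        x[ix] = old - h
        fxmh, _ = f(x)
        x[ix] = old                      # restore the perturbed entry
        numeric = (fxph - fxmh) / (2 * h)
        assert abs(numeric - grad[ix]) <= 1e-5 * max(1.0, abs(numeric), abs(grad[ix]))
        it.iternext()
    print("Gradient check passed!")
```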

```python
def word2vec_sgd_wrapper(word2vecModel, tokens, wordVectors, dataset, C,
                         word2vecCostAndGradient=softmaxCostAndGradient):
    batchsize = 50
    cost = 0.0
    grad = np.zeros(wordVectors.shape)
    N = wordVectors.shape[0]
    inputVectors = wordVectors[:N // 2, :]
    outputVectors = wordVectors[N // 2:, :]

    for i in range(batchsize):                              # run word2vecModel 50 times
        C1 = random.randint(1, C)
        centerword, context = dataset.getRandomContext(C1)  # pick one word, then generate a context for it

        if word2vecModel == skipgram:
            denom = 1
        else:
            denom = 1

        c, gin, gout = word2vecModel(centerword, C1, context, tokens,
                                     inputVectors, outputVectors, dataset,
                                     word2vecCostAndGradient)
        cost += c / batchsize / denom                       # average over the batch
        grad[:N // 2, :] += gin / batchsize / denom
        grad[N // 2:, :] += gout / batchsize / denom

    return cost, grad
```

```python
def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    """ Skip-gram model in word2vec """

    currentI = tokens[currentWord]          # index of this center word in the vocabulary
    predicted = inputVectors[currentI, :]   # vector representation of the center word

    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)
    for cwd in contextWords:                # contextWords has length 2C
        idx = tokens[cwd]
        cc, gp, gg = word2vecCostAndGradient(predicted, idx, outputVectors, dataset)
        cost += cc                          # total cost/gradient is the sum over all context words
        gradIn[currentI, :] += gp
        gradOut += gg

    return cost, gradIn, gradOut
```

```python
def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    """ Softmax cost function for word2vec models """

    probabilities = softmax(predicted.dot(outputVectors.T))
    cost = -np.log(probabilities[target])

    delta = probabilities.copy()
    delta[target] -= 1                      # delta = y_hat - y

    N = delta.shape[0]                      # delta.shape = (5,)
    D = predicted.shape[0]                  # predicted.shape = (3,)
    grad = delta.reshape((N, 1)) * predicted.reshape((1, D))
    gradPred = (delta.reshape((1, N)).dot(outputVectors)).flatten()

    return cost, gradPred, grad
```

`grad = delta.reshape((N, 1)) * predicted.reshape((1, D))` is the outer product `(y_hat - y) v_c^T`: the gradient of the cost with respect to the output vector matrix `U` (here `outputVectors`).

`gradPred = (delta.reshape((1, N)).dot(outputVectors)).flatten()` is `U^T (y_hat - y)`: the gradient of the cost with respect to the center-word vector `v_c` (here `predicted`).
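As a quick sanity check on these two formulas, the reshape-based products can be compared numerically against `np.outer` and a plain matrix-vector product; the shapes (5,) and (3,) match the comments in the code above, while the seed and target index are arbitrary choices of mine:

```python
import numpy as np

def softmax(x):
    # stand-in softmax for a 1-D input
    e = np.exp(x - np.max(x))
    return e / e.sum()

np.random.seed(0)
predicted = np.random.randn(3)           # center-word vector v_c, shape (3,)
outputVectors = np.random.randn(5, 3)    # output matrix U, shape (5, 3)
target = 2

delta = softmax(predicted.dot(outputVectors.T))
delta[target] -= 1                       # delta = y_hat - y

grad = delta.reshape((5, 1)) * predicted.reshape((1, 3))       # (y_hat - y) v_c^T
gradPred = delta.reshape((1, 5)).dot(outputVectors).flatten()  # U^T (y_hat - y)

# the reshaped products agree with the equivalent direct formulations
assert np.allclose(grad, np.outer(delta, predicted))
assert np.allclose(gradPred, outputVectors.T.dot(delta))
```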

OK, that completes Skip-Gram combined with the softmax gradient. The next time you see a few terse lines of algorithm description, you should be able to write out the full code yourself. Next time I will cover fitting the word2vec model's parameters with SGD. I had originally planned to jump straight into a hands-on sentiment-analysis project, but word2vec is worth a post of its own: this algorithm is the core of those applications. Most applied projects are classification problems, and the word vectors that word2vec trains are the key raw material for that classification.
