# Natural Language Processing with Deep Learning: Word Embeddings (Vectorizing Words)

```
import numpy as np

samples = ['The cat jump over the dog', 'The dog ate my homework']

# First, place every word into a hash table, mapping it to an integer index
token_index = {}
for sample in samples:
    # Split the sentence into individual words
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# Maximum number of words to encode per sentence
max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
        print("{0} -> {1}".format(word, results[i, j]))
```

```
from keras.preprocessing.text import Tokenizer

samples = ['The cat jump over the dog', 'The dog ate my homework']
# Only keep the 1000 most frequent words
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)
# Convert each sentence into a sequence of word indices
sequences = tokenizer.texts_to_sequences(samples)
print(sequences)
one_hot_vecs = tokenizer.texts_to_matrix(samples, mode='binary')

word_index = tokenizer.word_index
print("There are %s distinct words in total" % len(word_index))
```

```
[[1, 3, 4, 5, 1, 2], [1, 2, 6, 7, 8]]
```

one_hot_vecs holds two vectors of 1000 elements each. In the first vector, elements 1 through 5 are 1 and all the rest are 0 (the first sentence uses word indices 1, 2, 3, 4 and 5); in the second vector, elements 1, 2, 6, 7 and 8 are 1 and all the rest are 0.
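This multi-hot encoding, which `texts_to_matrix(samples, mode='binary')` produces, can be sketched in plain NumPy. This is not Keras's internal implementation, just a minimal re-creation of the same idea: one row per sample, with a 1 at every word index that occurs in that sample.

```python
import numpy as np

def multi_hot(sequences, num_words):
    # One row per sample; position j is 1 if word index j occurs in the sample
    results = np.zeros((len(sequences), num_words))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.
    return results

sequences = [[1, 3, 4, 5, 1, 2], [1, 2, 6, 7, 8]]
vecs = multi_hot(sequences, 1000)
print(np.nonzero(vecs[0])[0])  # -> [1 2 3 4 5]
print(np.nonzero(vecs[1])[0])  # -> [1 2 6 7 8]
```

Note that the encoding discards word order and repetition: "the" appears twice in the first sentence, but position 1 is simply set to 1.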

```
from keras.layers import Embedding
# Embedding takes two arguments: the total vocabulary size and the dimension of each word vector
embedding_layer = Embedding(1000, 64)
```
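Conceptually, an Embedding layer is nothing more than a trainable lookup table: a weight matrix with one 64-dimensional row per word, indexed by the integer word index. A minimal NumPy sketch of the lookup (with randomly initialized weights standing in for the trained ones):

```python
import numpy as np

vocab_size, embed_dim = 1000, 64
rng = np.random.default_rng(0)
# The embedding layer's weights: one embed_dim-dimensional row per word index
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# Looking up a batch of integer word sequences is just row indexing
batch = np.array([[1, 3, 4, 5, 1, 2]])  # shape (1, 6): one sentence, 6 words
vectors = embedding_matrix[batch]       # shape (1, 6, 64)
print(vectors.shape)
```

During training, Keras adjusts the rows of this matrix by backpropagation, so that words used in similar contexts end up with similar vectors; positions 0 and 4 of the sentence above hold the same word index, so they look up the identical row.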

```
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

# The original text does not show the data-loading step; the standard IMDB
# setup is assumed here: keep the 10000 most frequent words and pad or cut
# every review to 20 word indices
max_features = 10000
maxlen = 20
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
# Add an Embedding layer to the network to convert each word index into a vector
model.add(Embedding(max_features, 8, input_length=maxlen))
# Flatten the (maxlen, 8) output into a single vector per review
model.add(Flatten())
# On top, a layer with a single neuron maps the Embedding output onto two classes
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
```

```
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

# Plot the model's accuracy on the training and validation data
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.show()
plt.figure()

# Plot the model's loss on the training and validation data
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
```
