Official account: 尤而小屋 | Author: Peter | Editor: Peter
This is part of an ongoing series of study notes distilling the highlights of the book Deep Learning with Python, shared for learning purposes only.
This is the second article: building a Keras model to solve a binary classification problem, using the IMDB dataset that ships with Keras.
Environment: Python 3.9.13 + Keras 2.12.0 + TensorFlow 2.12.0
In 1:
import pandas as pd
import numpy as np
import tensorflow as tf
from keras.datasets import imdb  # built-in dataset
from keras import models
from keras import layers
from keras import optimizers  # optimizers
from tensorflow.keras.utils import to_categorical  # one-hot encoding helper
# from tensorflow.keras import optimizers
# change 1
# from tensorflow.python.keras.optimizers import rmsprop_v2
The IMDB dataset used here is a sentiment-analysis dataset built from movie reviews on the Internet Movie Database (IMDB). It contains 50,000 highly polarized reviews, split evenly into 25,000 reviews for training and 25,000 for testing, and each split is balanced between positive and negative reviews.
The reviews have already been preprocessed: each review is a sequence of integers, where every integer stands for a word in a dictionary, and each review carries a binary label (0 = negative, 1 = positive). This makes the dataset a classic benchmark for binary text classification.
The IMDB dataset ships with the Keras library:
In 2:
from keras.datasets import imdb
In 3:
# keep only the 10,000 most frequently occurring words in the training data
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
In 4:
train_data[:2]
Out4:
array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95])],
dtype=object)
Both label arrays contain binary 0/1 labels: 0 stands for negative (neg) and 1 for positive (pos).
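A quick sanity check of the class balance in each split (a minimal sketch using the numpy already imported above; it simply counts how many 0s and 1s each label array contains):

print(np.bincount(train_labels))  # counts of negative / positive labels in the training set
print(np.bincount(test_labels))   # counts in the test set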
In 5:
train_labels[:3]
Out5:
array([1, 0, 0], dtype=int64)
In 6:
test_labels[:3]
Out6:
array([0, 1, 1], dtype=int64)
Since only the top 10,000 words are kept, no word index exceeds 9999:
In 7:
max([max(sequence) for sequence in train_data])
Out7:
9999
Converting between words and indices:
In 8:
word_index = imdb.get_word_index()  # word -> index mapping
reverse_word_index = dict((value, key) for (key, value) in word_index.items())  # invert it: index -> word
reverse_word_index
# result
{34701: 'fawn',
52006: 'tsukino',
52007: 'nunnery',
16816: 'sonja',
63951: 'vani',
1408: 'woods',
16115: 'spiders',
2345: 'hanging',
2289: 'woody',
52008: 'trawling',
52009: "hold's",
11307: 'comically',
40830: 'localized'
.......
}
Decode a review back into English words (the indices are offset by 3 because 0, 1 and 2 are reserved for "padding", "start of sequence" and "unknown"):
In 9:
decoded_review = ' '.join([reverse_word_index.get(i-3, "?") for i in train_data[0]])
decoded_review
Out9:
"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
Encode the integer sequences into a binary (multi-hot) matrix:
In 10:
import numpy as np

def vectorize_sequences(seq, dim=10000):
    """
    seq: list of integer sequences
    dim: dimensionality of the output vectors (10000)
    """
    results = np.zeros((len(seq), dim))  # all-zero matrix of shape (len(seq), dim)
    for i, s in enumerate(seq):
        results[i, s] = 1.  # set the positions listed in s to 1; everything else stays 0
    return results

X_train = vectorize_sequences(train_data)
X_test = vectorize_sequences(test_data)
In 11:
X_train[0]
Out11:
array([0., 1., 1., ..., 0., 0., 0.])
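A tiny toy example (a sketch reusing the vectorize_sequences defined above) makes the multi-hot encoding easier to see:

toy = vectorize_sequences([[1, 3], [0, 2, 2]], dim=5)
print(toy)
# row 0 has 1s at positions 1 and 3 -> [0. 1. 0. 1. 0.]
# row 1 has 1s at positions 0 and 2 -> [1. 0. 1. 0. 0.]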
In 12:
y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")
With the training and test sets prepared, the data can now be fed into a neural network:
In 13:
from keras import models
from keras import layers
In 14:
X_train.shape
Out14:
(25000, 10000)
Why does deep learning need activation functions?
Without a non-linear activation, a stack of Dense layers is only a chain of linear (affine) transformations, which collapses into a single linear transformation and therefore cannot learn non-linear decision boundaries. Activation functions such as ReLU and sigmoid introduce this non-linearity, help gradients propagate during training, and constrain the range of the outputs (sigmoid, for example, squashes values into (0, 1), which is exactly what a binary classifier's output needs).
In short, activation functions play an essential role in deep learning: they introduce non-linearity, support gradient flow, and control the output range, all of which improve the network's performance.
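To make the non-linearity point concrete, here is a minimal numpy sketch (random weights, purely for illustration) showing that two stacked linear layers without an activation collapse into one linear layer, while inserting a ReLU breaks that equivalence:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))                      # a toy batch of 4 samples
W1, W2 = rng.normal(size=(10, 16)), rng.normal(size=(16, 1))

linear_stack = (x @ W1) @ W2                      # two linear layers, no activation
single_layer = x @ (W1 @ W2)                      # one equivalent linear layer
print(np.allclose(linear_stack, single_layer))    # True: stacking adds no expressive power

relu_stack = np.maximum(x @ W1, 0) @ W2           # same layers with ReLU in between
print(np.allclose(relu_stack, single_layer))      # False: the non-linearity changes the function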
In 15:
model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
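The same three-layer network can also be written with the Keras Functional API; a sketch of the equivalent definition (func_model is only an illustrative name and is not used later):

from keras import Input, Model

inputs = Input(shape=(X_train.shape[1],))
x = layers.Dense(16, activation="relu")(inputs)
x = layers.Dense(16, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
func_model = Model(inputs, outputs)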
Compile the network after configuring the optimizer and the loss function:
In 16:
# Option 1
model.compile(optimizer='rmsprop',           # optimizer
              loss='binary_crossentropy',    # binary cross-entropy loss
              metrics=['accuracy']           # evaluation metric
              )
In 17:
# Option 2: slightly modified from the book
model.compile(
    # original code in the book:
    # optimizer=optimizers.RMSprop(lr=0.001),
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),  # use the tf.keras optimizer; lr is now learning_rate
    loss='binary_crossentropy',   # binary cross-entropy loss
    metrics=['accuracy']          # use the full metric name
)
In 18:
# hold out a validation set; the rest becomes the actual training set
x_val = X_train[:10000]            # first 10,000 samples -> validation set
partial_x_train = X_train[10000:]  # remaining samples -> actual training set
y_val = y_train[:10000]
partial_y_train = y_train[10000:]  # actual training labels
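As an alternative to slicing the validation set by hand, fit() can hold out a fraction of the training data automatically via its validation_split argument (note that it takes the last fraction of the array, without shuffling first). A sketch, shown only for comparison and not run in this article:

# let fit() carve off the last 40% of X_train as validation data
# history_alt = model.fit(X_train, y_train,
#                         epochs=20, batch_size=512,
#                         validation_split=0.4)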
In 19:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val)
                    )
Epoch 1/20
30/30 [==============================] - 1s 23ms/step - loss: 0.5108 - accuracy: 0.7746 - val_loss: 0.3802 - val_accuracy: 0.8653
Epoch 2/20
30/30 [==============================] - 0s 13ms/step - loss: 0.3102 - accuracy: 0.8959 - val_loss: 0.3066 - val_accuracy: 0.8850
Epoch 3/20
30/30 [==============================] - 0s 14ms/step - loss: 0.2343 - accuracy: 0.9200 - val_loss: 0.2997 - val_accuracy: 0.8815
Epoch 4/20
30/30 [==============================] - 0s 14ms/step - loss: 0.1912 - accuracy: 0.9371 - val_loss: 0.2921 - val_accuracy: 0.8828
......
Epoch 19/20
30/30 [==============================] - 0s 13ms/step - loss: 0.0164 - accuracy: 0.9975 - val_loss: 0.5394 - val_accuracy: 0.8723
Epoch 20/20
30/30 [==============================] - 0s 11ms/step - loss: 0.0165 - accuracy: 0.9971 - val_loss: 0.5563 - val_accuracy: 0.8714
About the History object:
In 20:
his_dict = history.history  # a plain Python dict
In 21:
his_dict.keys()
Out21:
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
In 22:
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 160016
dense_1 (Dense) (None, 16) 272
dense_2 (Dense) (None, 1) 17
=================================================================
Total params: 160,305
Trainable params: 160,305
Non-trainable params: 0
_________________________________________________________________
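The parameter counts in the summary follow directly from the layer sizes: a Dense layer with n inputs and m units has n * m weights plus m biases. A small sketch of the arithmetic:

# Dense layer parameters = inputs * units + units (bias)
print(10000 * 16 + 16)  # first Dense layer  -> 160016
print(16 * 16 + 16)     # second Dense layer -> 272
print(16 * 1 + 1)       # output layer       -> 17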
In 23:
model.evaluate(X_test, y_test)
782/782 [==============================] - 1s 880us/step - loss: 0.6017 - accuracy: 0.8582
Out23:
[0.601686954498291, 0.8581600189208984]
In 24:
import matplotlib.pyplot as plt
loss = his_dict["loss"]
val_loss = his_dict["val_loss"]
acc = his_dict["accuracy"]
val_acc = his_dict["val_accuracy"]
In 25:
epochs = range(1, len(loss) + 1)  # x-axis values
In 26:
# 1. training / validation loss
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.title("Training and Validation Loss")
plt.show()
# 2. training / validation accuracy
plt.clf()  # clear the previous figure
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.xlabel("Epochs")
plt.ylabel("Acc")
plt.legend()
plt.title("Training and Validation Acc")
plt.show()
As training proceeds, the loss on the training set keeps decreasing and the accuracy keeps increasing, but the validation metrics do not follow the same trend.
In other words, the model performs very well on the training set and noticeably worse on the validation set: it is overfitting.
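Besides simply retraining for fewer epochs (which is what the next step does), a common remedy is Keras's EarlyStopping callback, which halts training once the validation loss stops improving. A minimal sketch, not run in this article:

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",       # watch the validation loss
                           patience=2,               # stop after 2 epochs without improvement
                           restore_best_weights=True)

# model.fit(partial_x_train, partial_y_train,
#           epochs=20, batch_size=512,
#           validation_data=(x_val, y_val),
#           callbacks=[early_stop])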
Retrain a model from scratch, this time for only 4 epochs (epochs=4):
In 28:
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)))  # in the book this is written as input_shape=(10000,)
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))

# compile the model
model.compile(optimizer='rmsprop',           # optimizer
              loss='binary_crossentropy',    # binary cross-entropy loss
              metrics=['accuracy']           # evaluation metric
              )

# train
history = model.fit(X_train,  # this time train on the full training set
                    y_train,
                    epochs=4,
                    batch_size=512,
                    validation_data=(x_val, y_val)
                    )
Epoch 1/4
49/49 [==============================] - 1s 16ms/step - loss: 0.4807 - accuracy: 0.8072 - val_loss: 0.3134 - val_accuracy: 0.9003
Epoch 2/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2772 - accuracy: 0.9021 - val_loss: 0.2212 - val_accuracy: 0.9259
Epoch 3/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2173 - accuracy: 0.9200 - val_loss: 0.1930 - val_accuracy: 0.9283
Epoch 4/4
49/49 [==============================] - 0s 10ms/step - loss: 0.1840 - accuracy: 0.9324 - val_loss: 0.1472 - val_accuracy: 0.9544
Predictions with the final model:
In 29:
results = model.predict(X_test)
results
782/782 [==============================] - 1s 790us/step
Out29:
array([[0.19428788],
[0.9998849 ],
[0.8095433 ],
...,
[0.1104579 ],
[0.07548532],
[0.65479356]], dtype=float32)
The network is very confident about some samples, for example a probability of 0.998 (effectively class 1) or 0.1 (effectively class 0); for others it is much less certain.
In 30:
results.flatten()  # flatten the 2-D array into 1-D
Out30:
array([0.19428788, 0.9998849 , 0.8095433 , ..., 0.1104579 , 0.07548532,
0.65479356], dtype=float32)
Use np.round to convert the probabilities directly into 0/1 class labels:
In 31:
y_predict = np.round(results.flatten())
y_predict
Out31:
array([0., 1., 1., ..., 0., 0., 1.], dtype=float32)
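An explicit threshold comparison gives the same result as np.round here; a sketch (y_predict_alt is only an illustrative name):

y_predict_alt = (results.flatten() > 0.5).astype("float32")  # 1 if the probability exceeds 0.5, else 0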
In 32:
y_test
Out32:
array([0., 1., 1., ..., 0., 0., 0.], dtype=float32)
In 33:
from sklearn.metrics import classification_report, confusion_matrix, r2_score, recall_score
In 34:
confusion_matrix(y_predict, y_test)  # confusion matrix
Out34:
array([[11169, 1512],
[ 1331, 10988]], dtype=int64)
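The diagonal of the confusion matrix counts the correctly classified samples, so the overall accuracy can be recovered directly from it; a small sketch:

cm = confusion_matrix(y_predict, y_test)
print(np.trace(cm) / cm.sum())  # correct predictions / all predictions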
In 35:
print(classification_report(y_predict, y_test))
              precision    recall  f1-score   support

         0.0       0.89      0.88      0.89     12681
         1.0       0.88      0.89      0.89     12319

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000
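The recall_score imported above can also be called on its own; a sketch (sklearn expects the true labels first, then the predictions):

print(recall_score(y_test, y_predict))  # recall for the positive class (label 1)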
In 36:
import seaborn as sns
sns.heatmap(confusion_matrix(y_predict, y_test),  # confusion matrix
            annot=True,          # show the counts in each cell
            # cmap=plt.cm.Blues,
            fmt='.0f'            # number format
            )
plt.show()
Original statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission. For infringement concerns, contact cloudcommunity@tencent.com for removal.