
[Kaggle Competition] Iterative Model Training

By 嵌入式视觉 · originally published 2018-12-14


In computer vision work, once data preparation is done and the model has been designed and defined, we can start iteratively training it, tuning hyperparameters (which takes both theory and plenty of experience) to optimize the two usual metrics, loss and accuracy. After training completes, the whole run is typically assessed with a loss curve and an accuracy curve.

Before training, we split the data into a training set and a validation set: train the model on the training set and evaluate it on the validation set. Once the best parameters have been found, run the model one final time on the test set, save the predictions as a CSV file, and submit it to Kaggle to see the score and guide later improvements.
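As an aside, here is a minimal sketch of writing such a submission file; it is an illustration added here, not code from the original post. The id,label header matches the Dogs vs. Cats Redux competition format, and image_ids and probs below are placeholders standing in for real model predictions.

import csv

# Hypothetical submission writer: ids and probabilities are placeholders
image_ids = range(1, 12501)
probs = [0.5] * 12500                      # replace with real predicted dog-probabilities
with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'label'])       # header expected by the competition
    for i, p in zip(image_ids, probs):
        writer.writerow([i, p])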

There are three common ways to split the dataset (a small sketch of the second follows the list):

  1. simple holdout validation;
  2. K-fold cross-validation;
  3. iterated K-fold validation with shuffling.
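For illustration, here is a minimal NumPy sketch of generating K-fold train/validation indices. The helper name k_fold_indices is my own; the original post does not include splitting code.

import numpy as np

def k_fold_indices(n_samples, k, shuffle=True, seed=0):
    # Yield (train_idx, val_idx) index pairs for K-fold cross-validation
    indices = np.arange(n_samples)
    if shuffle:
        np.random.RandomState(seed).shuffle(indices)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Example: 5-fold split over the 25000 cat/dog training images
for fold, (tr, va) in enumerate(k_fold_indices(25000, k=5)):
    print('fold %d: %d train, %d val' % (fold, len(tr), len(va)))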

With these training methods and caveats in mind, we can write the TensorFlow program that iteratively trains the model and saves the final result. This first requires learning about TensorFlow model persistence, that is, how to save and restore a model.

TensorFlow Model Persistence

This section shows how to write a TensorFlow program that persists a trained model and then restores the model from the persisted files. TensorFlow provides the tf.train.Saver class for saving and restoring a neural network model.

Saving a Model

The following program shows how to save a model:

import tensorflow as tf

# Path where the model will be saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'
# Declare two variables and compute their sum
v1 = tf.Variable(tf.constant(1.0, shape=[1]), name="v1")
v2 = tf.Variable(tf.constant(3.0, shape=[1]), name="v2")
result = v1 + v2
# Declare a tf.train.Saver for saving the model
saver = tf.train.Saver()
with tf.Session() as sess:
    # Initialize all variables
    sess.run(tf.global_variables_initializer())
    # Save the model to the specified path
    saver.save(sess, model_path)

Running this program produces four files in the model save directory. This is because TensorFlow stores the structure of the computation graph and the values of its parameters separately:

  • model.ckpt.meta holds the structure of the computation graph
  • model.ckpt.data-00000-of-00001 holds the value of every variable in the graph
  • checkpoint lists all model files in the directory, so the right model can be located when restoring
  • model.ckpt.index is an index from variable names to their locations in the data file (we do not use it directly here)
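To check what was actually saved, the checkpoint can be inspected with tf.train.NewCheckpointReader. A small sketch, assuming TensorFlow 1.x and the save path above:

import tensorflow as tf

# Open the checkpoint and list every saved variable with its shape and value
reader = tf.train.NewCheckpointReader('C:/Users/Administrator/logs/model.ckpt')
for name, shape in reader.get_variable_to_shape_map().items():
    print(name, shape, reader.get_tensor(name))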

Loading a Model

There are two common ways to load a model:

  1. define all of the operations of the TensorFlow computation graph again in the loading program;
  2. load the persisted graph directly, without redefining its operations.

Example code for the first method:

import tensorflow as tf

# Path where the model was saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'
# Declare the variables and define the graph exactly as in the saving code
v1 = tf.Variable(tf.constant(1.0, shape=[1]), name="v1")
v2 = tf.Variable(tf.constant(3.0, shape=[1]), name="v2")
result = v1 + v2
saver = tf.train.Saver()
with tf.Session() as sess:
    # Load the saved model and compute the sum from the restored variable values
    saver.restore(sess, model_path)
    print(sess.run(result))

Example code for the second method:

import tensorflow as tf

# Path where the model was saved
model_path = 'C:/Users/Administrator/logs/model.ckpt'
# Load the persisted graph structure from the .meta file
saver = tf.train.import_meta_graph('C:/Users/Administrator/logs/model.ckpt.meta')
with tf.Session() as sess:
    saver.restore(sess, model_path)
    # "add:0" is the name of the tensor produced by result = v1 + v2
    print(sess.run(tf.get_default_graph().get_tensor_by_name("add:0")))

Both methods print the same output:

           INFO:tensorflow:Restoring parameters from C:/Users/Administrator/logs/model.ckpt
           [ 4.]
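As an extra note not covered above, tf.train.Saver can also restore a subset of variables, or map names stored in the checkpoint onto differently named variables in the current graph, by passing it a dictionary. A minimal sketch against the checkpoint saved earlier:

import tensorflow as tf

# Variables whose graph names differ from those in the checkpoint
u1 = tf.Variable(tf.constant(0.0, shape=[1]), name="other-v1")
u2 = tf.Variable(tf.constant(0.0, shape=[1]), name="other-v2")
# The dictionary maps names in the checkpoint to variables in the current graph
saver = tf.train.Saver({"v1": u1, "v2": u2})
with tf.Session() as sess:
    saver.restore(sess, 'C:/Users/Administrator/logs/model.ckpt')
    print(sess.run(u1 + u2))   # [ 4.]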

Implementing Iterative Training

The full program is as follows:

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os
import time
# Import the model definition and data preparation modules
import model
import input_data

# ---------------------------Network hyperparameters-------------------------------------------
N_CLASSES  = 2                     # number of output classes
IMG_W = 227                        # image width
IMG_H = 227                        # image height
IMG_C = 3                          # image channels
BATCH_SIZE = 10                    # training batch size
MAX_STEP = 20000                   # maximum number of training steps
CAPACITY = 2000                    # capacity of the input queue
LEARNING_RETE = 0.0001             # learning rate

# Local paths of the training set and of the model/log directory
train_dir = "F:/Software/Python_Project/Classification-cat-dog/train/"
logs_train_dir = "F:/Software/Python_Project/Classification-cat-dog/logs/"
# Corresponding paths on the cloud server
# train_dir = '/data/Dogs-Cats-Redux-Kernels-Edition/train/'
# logs_train_dir = '/data/Dogs-Cats-Redux-Kernels-Edition/logs/'

# ---------------------------Training function-------------------------------------------
def run_training():
    # Get the lists of training file names and corresponding labels
    file_list, label_list = input_data.get_files(train_dir)
    # Produce one batch of images and labels
    train_batch, train_label_batch = input_data.get_batch(file_list,
                                                          label_list,
                                                          IMG_W,
                                                          IMG_H,
                                                          BATCH_SIZE,
                                                          CAPACITY)

    regularizer = tf.contrib.layers.l2_regularizer(0.0001)
    # Network output for the training batch; the queue tensors feed the graph directly
    train_logits = model.inference(train_batch, True, BATCH_SIZE, regularizer, N_CLASSES)
    train_loss = model.losses(train_logits, train_label_batch)    # loss on the training batch
    train_op = model.trainning(train_loss, LEARNING_RETE)         # update the weights from the loss and learning rate
    train_acc = model.evaluation(train_logits, train_label_batch) # accuracy on the training batch

    # Input/output placeholders (labels are not one-hot encoded); defined for reference
    # but NOT fed here, since the file queue feeds the graph (see the notes below)
    x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
    y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')
    
#     # Network output for the training batch; logits is a batch_size*2 array
#     logits = model.inference(x, True, BATCH_SIZE, regularizer, N_CLASSES)

#     # (Small trick) multiply logits by 1 and give the result a name, so the output
#     # tensor can be fetched by name when the model is loaded later
#     b = tf.constant(value=1, dtype=tf.float32)
#     logits_eval = tf.multiply(logits, b, name='logits_eval')

#     # Cross-entropy as the loss measuring the gap between predictions and labels
#     cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y_)
#     # Average cross-entropy over all examples in the current batch
#     loss = tf.reduce_mean(cross_entropy, name='loss')
#     # Optimize the loss with tf.train.AdamOptimizer
#     train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

#     # Accuracy of the model on one batch of data
#     correct_prediction = tf.equal(tf.cast(tf.argmax(logits, 1), tf.int32), y_)
#     acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
#     print(loss.shape, acc.shape)

    with tf.Session() as sess:
        tra_loss = []                                  # history of loss values for plotting
        # Initialize the TensorFlow persistence class
        saver = tf.train.Saver()
        sess.run(tf.global_variables_initializer())    # initialize all variables
        coord = tf.train.Coordinator()                 # thread coordinator
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)    # start the queue runners
        summary_op = tf.summary.merge_all()            # merged summary op
        # Write the training summaries to logs_train_dir
        train_writer = tf.summary.FileWriter(logs_train_dir, sess.graph)
        try:                                           # start training
            for step in np.arange(MAX_STEP):
                if coord.should_stop():
                    break
                # Run one training step; the file queue feeds the graph directly,
                # so no feed_dict is needed (see the notes on file queues below).
                # Fetch the numeric values loss and acc; printing the graph tensors
                # train_loss/train_acc instead would raise a TypeError.
                _, loss, acc = sess.run([train_op, train_loss, train_acc])
                tra_loss.append(loss)
                if step % 100 == 0:                    # print loss and accuracy every 100 steps
                    print('Step %d, train loss = %.2f, train accuracy = %.2f%%' % (step, loss, acc * 100.0))
                    summary_str = sess.run(summary_op)
                    train_writer.add_summary(summary_str, step)
                if step % 2000 == 0 or (step + 1) == MAX_STEP:     # save the model periodically
                    checkpoint_path = os.path.join(logs_train_dir, 'model.ckpt')
                    saver.save(sess, checkpoint_path, global_step=step)
                    print('Model saved')
            print('Training finished!')
        except tf.errors.OutOfRangeError:              # exception handling
            print('Done training -- epoch limit reached.')
        finally:
            # Stop all threads
            coord.request_stop()
            coord.join(threads)
        # Plot the loss curve over training
        plt.plot(tra_loss)
        plt.xlabel('Iter')
        plt.ylabel('loss')
        plt.title('lr=%f,ti=%d,bs=%d' % (LEARNING_RETE, MAX_STEP, BATCH_SIZE))
        plt.tight_layout()
        plt.savefig('cat_and_dog_alexnet.jpg', dpi=200)

#-------------------------------Program entry point---------------------------------------
if __name__ == "__main__":
    run_training()

Output

Limited by my laptop's performance, the model had not finished training when I wrote this post, so only part of the log is shown here; the final output and the analysis of the loss and accuracy curves will be added tomorrow.

There are 12500 cats
There are 12500 dogs
Step 0, train loss = 113810.02, train accuracy = 50%
Step 100, train loss = 20647.10, train accuracy = 40%
Step 200, train loss = 16054.08, train accuracy = 50%
Step 300, train loss = 7717.75, train accuracy = 50%
Step 400, train loss = 5881.07, train accuracy = 50%
Step 500, train loss = 2879.47, train accuracy = 70%
Step 600, train loss = 338.30, train accuracy = 70%
Step 700, train loss = 1178.86, train accuracy = 50%
Step 800, train loss = 287.65, train accuracy = 50%
Step 900, train loss = 245.80, train accuracy = 50%
Step 1000, train loss = 20.37, train accuracy = 50%
Step 1100, train loss = 49.53, train accuracy = 60%
Step 1200, train loss = 11.61, train accuracy = 60%
Step 1300, train loss = 1.78, train accuracy = 70%
Step 1400, train loss = 10.86, train accuracy = 30%
Step 1500, train loss = 2.33, train accuracy = 30%
Step 1600, train loss = 26.34, train accuracy = 40%
Step 1700, train loss = 43.71, train accuracy = 50%
Step 1800, train loss = 14.57, train accuracy = 60%
Step 1900, train loss = 23.90, train accuracy = 30%
Step 2000, train loss = 1.50, train accuracy = 50%
Step 2100, train loss = 3.84, train accuracy = 50%
Step 2200, train loss = 1.06, train accuracy = 60%
Step 2300, train loss = 1.90, train accuracy = 50%
Step 2400, train loss = 8.90, train accuracy = 50%
Step 2500, train loss = 4.88, train accuracy = 40%
Step 2600, train loss = 1.83, train accuracy = 70%
Step 2700, train loss = 3.73, train accuracy = 40%
Step 2800, train loss = 40.79, train accuracy = 40%
Step 2900, train loss = 57.23, train accuracy = 40%
Step 3000, train loss = 1.04, train accuracy = 80%
Step 3100, train loss = 1.16, train accuracy = 50%
Step 3200, train loss = 2.04, train accuracy = 50%
Step 3300, train loss = 49.13, train accuracy = 50%
Step 3400, train loss = 1.67, train accuracy = 70%
Step 3500, train loss = 2.48, train accuracy = 40%
Step 3600, train loss = 2.01, train accuracy = 50%
Step 3700, train loss = 2.04, train accuracy = 60%
Step 3800, train loss = 0.62, train accuracy = 60%
Step 3900, train loss = 3.46, train accuracy = 40%
Step 4000, train loss = 1.15, train accuracy = 50%
Step 4100, train loss = 2.64, train accuracy = 40%
Step 4200, train loss = 1.08, train accuracy = 60%
Step 4300, train loss = 5.22, train accuracy = 60%
Step 4400, train loss = 7.35, train accuracy = 50%
Step 4500, train loss = 0.60, train accuracy = 90%
Step 4600, train loss = 1.60, train accuracy = 80%
Step 4700, train loss = 1.02, train accuracy = 50%
Step 4800, train loss = 1.46, train accuracy = 60%
Step 4900, train loss = 1.33, train accuracy = 40%
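Since a full run takes many hours, one option is to resume from the most recent checkpoint rather than start over. The following is a sketch added for completeness, not code from the original post: tf.train.latest_checkpoint finds the newest model.ckpt-* file in logs_train_dir, and the dummy variable below only keeps the snippet self-contained; restoring a real checkpoint requires rebuilding the same graph as in run_training() first.

import tensorflow as tf

logs_train_dir = "F:/Software/Python_Project/Classification-cat-dog/logs/"

# Rebuild the graph first; this dummy variable stands in for the real network
v = tf.Variable(tf.constant(0.0, shape=[1]), name="v")
saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(logs_train_dir)   # e.g. .../model.ckpt-18000
    if ckpt is not None:
        saver.restore(sess, ckpt)    # continue training from the saved parameters
        print('Restored from %s' % ckpt)
    else:
        sess.run(tf.global_variables_initializer())
        print('No checkpoint found, training from scratch')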

Notes on Using Input File Queues

For feeding training data to the network, I have used two approaches: shuffling and batching the data with NumPy and then passing it in through placeholders, and using TensorFlow's input file queues (tf.train.shuffle_batch) to feed tensor data to the network. Both approaches work.

However, over the last two days I found a real pitfall in TensorFlow: when you use a file queue for input, the tensors produced by tf.train.batch must be fed straight into the network. You cannot route them through placeholders, or you get an error like:

TypeError: must be real number, not Tensor

or possibly:

InvalidArgumentError: You must feed a value for placeholder tensor 'x_' with dtype float and shape [10,227,227,3] [[Node: x_ = Placeholder[dtype=DT_FLOAT, shape=[10,227,227,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

I have not dug deeply into the cause yet, though the error messages point in a likely direction: the TypeError appears when a graph Tensor (rather than a fetched numeric value) is passed to % string formatting, and the InvalidArgumentError appears when an op that depends on a placeholder is run without feeding it. In any case, this took me two days of debugging to discover, and I had not seen anyone mention it before. In case the description above is unclear, the code (only the key parts) makes it concrete:

Correct code:

# Get the lists of training file names and corresponding labels
file_list, label_list = input_data.get_files(train_dir)
# Produce one batch of images and labels
train_batch, train_label_batch = input_data.get_batch(file_list,
                                                      label_list,
                                                      IMG_W,
                                                      IMG_H,
                                                      BATCH_SIZE,
                                                      CAPACITY)

regularizer = tf.contrib.layers.l2_regularizer(0.0001)
# Network output for the training batch: the queue tensors feed the graph directly
train_logits = model.inference(train_batch, True, BATCH_SIZE, regularizer, N_CLASSES)
train_loss = model.losses(train_logits, train_label_batch)    # loss on the training batch
train_op = model.trainning(train_loss, LEARNING_RETE)         # update the weights from the loss and learning rate
train_acc = model.evaluation(train_logits, train_label_batch) # accuracy on the training batch

# Input/output placeholders (labels not one-hot encoded); defined but not used for the queue input
x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')

Incorrect code:

# Get the lists of training file names and corresponding labels
file_list, label_list = input_data.get_files(train_dir)
# Produce one batch of images and labels
train_batch, train_label_batch = input_data.get_batch(file_list,
                                                      label_list,
                                                      IMG_W,
                                                      IMG_H,
                                                      BATCH_SIZE,
                                                      CAPACITY)
# Input/output placeholders (labels not one-hot encoded)
x_train = tf.placeholder(tf.float32, shape=[BATCH_SIZE, IMG_W, IMG_H, IMG_C], name='x_')
y_train_ = tf.placeholder(tf.int32, shape=[BATCH_SIZE, ], name='y_')

regularizer = tf.contrib.layers.l2_regularizer(0.0001)
# Network output for the training batch: here the graph is built on the placeholders,
# which then have to be fed -- this is what breaks with file-queue input
train_logits = model.inference(x_train, True, BATCH_SIZE, regularizer, N_CLASSES)
train_loss = model.losses(train_logits, y_train_)    # loss on the training batch
train_op = model.trainning(train_loss, LEARNING_RETE)# update the weights from the loss and learning rate
train_acc = model.evaluation(train_logits, y_train_) # accuracy on the training batch
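For comparison, the working pattern can be reproduced without my model and input_data modules. The following self-contained sketch on dummy data assumes TensorFlow 1.x; the dense layer and the random arrays are illustrative stand-ins for the real network and dataset:

import numpy as np
import tensorflow as tf

# Dummy dataset: 100 "images" of shape [4] with binary labels
images = np.random.rand(100, 4).astype(np.float32)
labels = np.random.randint(0, 2, size=100).astype(np.int32)

# Build the input queue and batch tensors
image, label = tf.train.slice_input_producer([images, labels], num_epochs=1)
image_batch, label_batch = tf.train.batch([image, label], batch_size=10)

# Feed the batch tensors straight into the graph: no placeholders
logits = tf.layers.dense(image_batch, 2)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                   labels=label_batch))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)

with tf.Session() as sess:
    # num_epochs uses a local variable, so initialize local variables too
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            _, loss_val = sess.run([train_op, loss])   # fetch values, not tensors
            print('loss = %.4f' % loss_val)
    except tf.errors.OutOfRangeError:
        print('Done -- epoch limit reached.')
    finally:
        coord.request_stop()
        coord.join(threads)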

Finally, although I have drawn some lessons from this experience, I hope someone digs into the underlying cause and mechanism; that would give a much deeper understanding of TensorFlow.
