A classic multimodal data-processing task, automatic image captioning (narrating an image in natural language) has long been a cutting-edge technique, drawing on both computer vision and natural language processing.
Fortunately, Google has open-sourced this functionality on top of TensorFlow: https://github.com/tensorflow/models/tree/master/im2txt#generating-captions
The project's English introduction reads:
The Show and Tell model is a deep neural network that learns how to describe the content of images.
Its architecture is introduced (in English) as follows:
The Show and Tell model is an example of an encoder-decoder neural network. It works by first "encoding" an image into a fixed-length vector representation, and then "decoding" the representation into a natural language description.
The image encoder is a deep convolutional neural network. This type of network is widely used for image tasks and is currently state-of-the-art for object recognition and detection. Our particular choice of network is the Inception v3 image recognition model pretrained on the ILSVRC-2012-CLS image classification dataset.
The decoder is a long short-term memory (LSTM) network. This type of network is commonly used for sequence modeling tasks such as language modeling and machine translation. In the Show and Tell model, the LSTM network is trained as a language model conditioned on the image encoding.
Words in the captions are represented with an embedding model. Each word in the vocabulary is associated with a fixed-length vector representation that is learned during training.
The following diagram illustrates the model architecture.
In short, the architecture combines Inception v3 with an LSTM: the encoded image vector and the word embeddings of the caption are fed into one model. (The full model diagram is on the GitHub page linked above; it is a classic design.)
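The conditioning described above can be sketched in a few lines. This is not the real im2txt code, only an illustration of the data flow; the dimensions are assumptions (Inception v3 pools to a 2048-d feature, and im2txt uses a 512-d embedding), and the weights here are random stand-ins for learned parameters.

```python
import numpy as np

# Illustrative dimensions (assumed, not taken from the im2txt source).
feature_dim, embed_dim, vocab_size = 2048, 512, 12000
rng = np.random.default_rng(0)

image_feature = rng.standard_normal(feature_dim)              # CNN encoder output
W_img = 0.01 * rng.standard_normal((feature_dim, embed_dim))  # image -> embedding projection
E_word = 0.01 * rng.standard_normal((vocab_size, embed_dim))  # learned word embedding table

# Step 0: the projected image vector is fed to the LSTM first, so every
# later word prediction is conditioned on the image encoding.
x0 = image_feature @ W_img

# Steps t >= 1: the embedding of the previous caption word is the LSTM input.
x1 = E_word[42]

# Both inputs live in the same embedding space the LSTM consumes.
assert x0.shape == x1.shape == (embed_dim,)
```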
II. Experiments
To run the experiment I used a pretrained model. However, since this article's experiments run on TensorFlow 1.0+, a couple of pitfalls have to be fixed first:
(1) Clean up word_counts.txt: entries like b'str' must be converted to str, i.e. the bytes-literal quoting around each word has to be stripped.
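This cleanup can be done with a short script. The sketch below assumes the file holds one `b'word' count` pair per line; the function names and output path are placeholders.

```python
import re

def clean_vocab_line(line):
    # Strip the Python bytes-literal wrapper,
    # e.g.  b'surfboard' 317  ->  surfboard 317
    return re.sub(r"^b['\"](.*)['\"]", r"\1", line)

def clean_vocab_file(src, dst):
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(clean_vocab_line(line))

# Usage (paths are placeholders):
# clean_vocab_file("word_counts.txt", "word_counts_clean.txt")
```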
(2) Rename the variables in the pretrained checkpoint: the variable names saved by the older version no longer match the TF 1.0 graph, so they have to be rewritten. In the code, a helper function renames the variables and re-saves the model:
```python
import tensorflow as tf

# The variable names changed between TF versions, so rewrite them in the checkpoint.
def RenameCkpt():
    vars_to_rename = {
        "lstm/BasicLSTMCell/Linear/Matrix": "lstm/basic_lstm_cell/weights",
        "lstm/BasicLSTMCell/Linear/Bias": "lstm/basic_lstm_cell/biases",
    }
    new_checkpoint_vars = {}
    reader = tf.train.NewCheckpointReader(FLAGS.checkpoint_path)
    for old_name in reader.get_variable_to_shape_map():
        # Keep every variable, renaming only the ones in the map above.
        new_name = vars_to_rename.get(old_name, old_name)
        new_checkpoint_vars[new_name] = tf.Variable(reader.get_tensor(old_name))
    init = tf.global_variables_initializer()
    saver = tf.train.Saver(new_checkpoint_vars)
    with tf.Session() as sess:
        sess.run(init)
        saver.save(sess, "/home/ndscbigdata/work/change/tf/gan/im2txt/ckpt/newmodel.ckpt-2000000")
        print("checkpoint file rename successful...")
```
The experiment itself:
(1) Set a few parameters by hand:
```python
FLAGS.checkpoint_path = "/home/ndscbigdata/work/change/tf/gan/im2txt/ckpt/newmodel.ckpt-2000000"
FLAGS.vocab_file = "./data/volab.txt"
FLAGS.input_files = "./data/COCO_val2014_000000224477.jpg,./data/ep271.jpg,./data/dog.jpg"
```
(2) Test images and results:
Captions for image COCO_val2014_000000224477.jpg:
  0) a man riding a wave on top of a surfboard . (p=0.035672)
  1) a person riding a surf board on a wave (p=0.016238)
  2) a man on a surfboard riding a wave . (p=0.010146)
Captions for image ep271.jpg:
  0) a woman is standing next to a horse . (p=0.000759)
  1) a woman is standing next to a horse (p=0.000647)
  2) a woman is standing next to a brown horse . (p=0.000384)
Captions for image dog.jpg:
  0) a dog is eating a slice of pizza . (p=0.000138)
  1) a dog is eating a slice of pizza on a plate . (p=0.000047)
  2) a dog is sitting at a table with a pizza on it . (p=0.000039)
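A note on the numbers: the value after each caption is the whole-sentence probability from beam search, i.e. the product of per-word probabilities (the exponential of the summed log-probabilities), which is why longer captions tend to score lower even when they read well. A toy illustration with made-up log-probabilities:

```python
import math

# Made-up per-word log-probabilities for a 6-word caption (illustrative only).
word_logprobs = [-0.4, -0.6, -0.3, -0.5, -0.8, -0.73]

# Beam search ranks captions by the sum of log-probs; the reported
# "probability" is its exponential -- a product of factors below 1,
# so whole-caption scores like 0.0357 or 0.000138 are expectedly small.
caption_prob = math.exp(sum(word_logprobs))
assert 0 < caption_prob < 1
```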
Note: the last picture is one of Google's classic demo images, and the results here are quite satisfying.
Unfortunately my hardware is too weak to retrain; otherwise, training with Inception v4 should give even better results. Generating Chinese captions is another direction worth exploring.
The modified source code will be published on my GitHub; you are welcome to download it: https://github.com/ndscigdata