Image Captioning (1)

小飞侠xp

Published 2018-10-11 15:37:54

CNN-RNN model

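First, the image is passed through a CNN, using a pre-trained network such as VGG-16 or ResNet. At the end of such a network sits a softmax classifier that outputs class scores. We do not want to classify the image, though; we need a set of features that represents its spatial content. To obtain these features, remove the fully connected layer used for classification and look at an earlier layer that extracts spatial information from the image.

We now use the CNN as a feature extractor that compresses the large amount of information in the original image into a smaller representation. This CNN is usually called the encoder. It encodes the content of the image into a smaller feature vector, which is then processed and used as the initial input to the RNN that follows.

There are several ways to connect the CNN output to the RNN, but in all of them the feature vector extracted by the CNN needs some processing before it can serve as input to the first RNN cell. Sometimes an additional fully connected or linear layer is applied to the CNN output before it is fed to the RNN. This is similar to transfer learning: the CNN has already been pre-trained, and adding an untrained linear layer at its end lets us adjust only that layer while training the whole model to generate captions. The processed feature vector is then used as input to the RNN, whose job is to decode it into natural language. This part is usually called the decoder.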

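Image captioning model

We will build a neural network architecture that automatically generates captions from images. We will use the MS COCO dataset.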

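LSTM inputs/outputs

We pass all inputs to the LSTM as a sequence, which looks like this: 1. first, the feature vector extracted from the image; 2. then a word, the next word, and so on.

Embedding dimension

Because the LSTM looks at its inputs in order, every input in the sequence must have the same size. The embedded image feature vector and each embedded word therefore all have size embed_size.

Sequence input

An LSTM looks at its inputs in order. In PyTorch, there are two ways to do this:

The first is to step through the sequence one input at a time: the image, the start word, the next word, the next word, and so on (until the end of the sequence/batch).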

for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

The second approach is to give the LSTM the entire sequence and have it produce a set of outputs and the final hidden state:

# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state

# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
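The two approaches can be compared directly. The following is a minimal sketch, assuming a toy LSTM with hypothetical sizes (input_size = hidden_size = 3, sequence length 5) and a zero initial state; none of these values come from the original notebook. It feeds the same sequence step by step and all at once, then checks that both produce matching outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration.
input_size, hidden_size, seq_len = 3, 3, 5
lstm = nn.LSTM(input_size, hidden_size)

# A sequence of seq_len inputs, each of shape (1, input_size).
inputs = [torch.randn(1, input_size) for _ in range(seq_len)]
h0 = torch.zeros(1, 1, hidden_size)
c0 = torch.zeros(1, 1, hidden_size)

# Approach 1: feed one element at a time, carrying the hidden state forward.
hidden = (h0, c0)
step_outputs = []
for x in inputs:
    out, hidden = lstm(x.view(1, 1, -1), hidden)
    step_outputs.append(out)
stepwise = torch.cat(step_outputs)  # shape (seq_len, 1, hidden_size)

# Approach 2: feed the whole sequence in a single call.
whole = torch.cat(inputs).view(seq_len, 1, -1)
all_at_once, last_hidden = lstm(whole, (h0, c0))

print(stepwise.shape, all_at_once.shape)
print(torch.allclose(stepwise, all_at_once, atol=1e-5))
```

Passing the whole sequence at once is usually preferable during training, since a single call lets PyTorch process the sequence without a Python loop.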

Keeping the workspace active

from workspace_utils import active_session

with active_session():
    # do long-running work here
    pass

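The COCO dataset

The Microsoft Common Objects in COntext (MS COCO) dataset is a large-scale dataset for scene understanding. It is commonly used to train and benchmark object detection, segmentation, and captioning algorithms.

You can find more information about the dataset on its website or in the research paper.

Initializing the COCO API

The GPU needs to be enabled.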

import os
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO

# initialize COCO API for instance annotations
dataDir = '/opt/cocoapi'
dataType = 'val2014'
instances_annFile = os.path.join(dataDir, 'annotations/instances_{}.json'.format(dataType))
coco = COCO(instances_annFile)

# initialize COCO API for caption annotations
captions_annFile = os.path.join(dataDir, 'annotations/captions_{}.json'.format(dataType))
coco_caps = COCO(captions_annFile)

# get annotation ids (each annotation stores the id of its image)
ids = list(coco.anns.keys())

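Plotting a sample image: next, we pick a random image from the dataset and plot it along with its five corresponding captions. Each time you run the code cell below, a different image is selected.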

import numpy as np
import skimage.io as io
import matplotlib.pyplot as plt
%matplotlib inline

# pick a random image and obtain the corresponding URL
ann_id = np.random.choice(ids)
img_id = coco.anns[ann_id]['image_id']
img = coco.loadImgs(img_id)[0]
url = img['coco_url']

# print URL and visualize corresponding image
print(url)
I = io.imread(url)
plt.axis('off')
plt.imshow(I)
plt.show()

# load and display captions
annIds = coco_caps.getAnnIds(imgIds=img['id'])
anns = coco_caps.loadAnns(annIds)
coco_caps.showAnns(anns)

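Exploring the data loader

Use the get_loader function in data_loader.py to initialize the data loader. It takes the following arguments:

transform - the image transform specifies how images are pre-processed and converted to PyTorch tensors before being used as input to the CNN encoder.
mode - either 'train' (loads the training data in batches) or 'test' (for the test data). The data loader is described separately for training mode and test mode. While following the instructions in this notebook, set mode='train' so that the data loader is in training mode.
batch_size - the size of a batch. When training your model, this is the number of image-caption pairs used to update the model weights at each training step.
vocab_threshold - the total number of times a word must appear in the training captions before it is included in the vocabulary. Words that appear fewer than vocab_threshold times in the training captions are treated as unknown words.
vocab_from_file - a Boolean that decides whether to load the vocabulary from a file.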

import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
!pip install nltk
import nltk
nltk.download('punkt')
from data_loader import get_loader
from torchvision import transforms

# Define a transform to pre-process the training images.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Set the minimum word count threshold.
vocab_threshold = 5

# Specify the batch size.
batch_size = 10

# Obtain the data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=False)

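When you run the code cell above, the data loader is stored in the variable data_loader. You can access the corresponding dataset as data_loader.dataset. This dataset is an instance of the CoCoDataset class in data_loader.py. If you are unfamiliar with data loaders and datasets, have a look at the PyTorch data loading tutorial.

Understanding the `__getitem__` method

The __getitem__ method of the CoCoDataset class determines how an image-caption pair is pre-processed before being merged into a batch. When the data loader is in training mode, the method first obtains the file name (path) of the training image and its corresponding caption (caption).

Image Pre-Processing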

# Convert image to tensor and pre-process using transform
image = Image.open(os.path.join(self.img_folder, path)).convert('RGB')
image = self.transform(image)

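After the image at path has been loaded from the training folder, it is pre-processed with the same transform (transform_train) that was supplied when instantiating the data loader.

Caption Pre-Processing

To generate captions, we aim to build a model that predicts the next token of a sentence from the previous tokens. We therefore convert the caption associated with each image into a list of tokenized words and then into a PyTorch tensor that can be used to train the network. To understand in more detail how COCO captions are pre-processed, we first need to look at the vocab instance variable of the CoCoDataset class. The code snippet below is taken from the __init__ method of the CoCoDataset class: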

def __init__(self, transform, mode, batch_size, vocab_threshold, vocab_file, start_word, 
        end_word, unk_word, annotations_file, vocab_from_file, img_folder):
        ...
        self.vocab = Vocabulary(vocab_threshold, vocab_file, start_word,
            end_word, unk_word, annotations_file, vocab_from_file)
        ...

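From the snippet above, you can see that data_loader.dataset.vocab is an instance of the Vocabulary class from vocabulary.py. We then use this instance to pre-process the COCO captions (from the __getitem__ method of the CoCoDataset class):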

# Convert caption to tensor of word ids.
tokens = nltk.tokenize.word_tokenize(str(caption).lower())   # line 1
caption = []                                                 # line 2
caption.append(self.vocab(self.vocab.start_word))            # line 3
caption.extend([self.vocab(token) for token in tokens])      # line 4
caption.append(self.vocab(self.vocab.end_word))              # line 5
caption = torch.Tensor(caption).long()                       # line 6

This code converts any string-valued caption to a list of integers and then to a PyTorch tensor. To see how it works, we apply it to the sample caption in the next code cell.

sample_caption = 'A person doing a trick on a rail while riding a skateboard.'

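In line 1 of the code snippet, every letter in the caption is converted to lowercase, and the nltk.tokenize.word_tokenize function is used to obtain a list of string-valued tokens. Run the next code cell to see its effect on sample_caption.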

import nltk

sample_tokens = nltk.tokenize.word_tokenize(str(sample_caption).lower())
print(sample_tokens)

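In line 2 and line 3, we initialize an empty list and append an integer to mark the start of a caption. The paper we recommend reading uses a special start word (and a special end word, which we examine below) to mark the beginning (and end) of a caption. This special start word ("<start>") is decided when the data loader is instantiated and is passed as a parameter (start_word). You are required to keep this parameter at its default value (start_word="<start>").

As you will see below, the integer 0 is always used to mark the start of a caption.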

sample_caption = []

start_word = data_loader.dataset.vocab.start_word
print('Special start word:', start_word)
sample_caption.append(data_loader.dataset.vocab(start_word))
print(sample_caption)

In line 4, we continue the list by adding the integer corresponding to each token in the caption.

sample_caption.extend([data_loader.dataset.vocab(token) for token in sample_tokens])
print(sample_caption)

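In line 5, we append a final integer to mark the end of the caption.

Like the special start word above, the special end word ("<end>") is decided when the data loader is instantiated and is passed as a parameter (end_word). You are required to keep this parameter at its default value (end_word="<end>").

As you will see below, the integer 1 is always used to mark the end of a caption.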

end_word = data_loader.dataset.vocab.end_word
print('Special end word:', end_word)

sample_caption.append(data_loader.dataset.vocab(end_word))
print(sample_caption)

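Finally, in line 6, we convert the list of integers to a PyTorch tensor and cast it to the long type. You can read more about the different types of PyTorch tensors in the PyTorch documentation.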

import torch

sample_caption = torch.Tensor(sample_caption).long()
print(sample_caption)

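In summary, every caption is converted to a list of tokens, with special start and end tokens marking the beginning and end of the sentence, as shown here: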

[<start>, 'a', 'person', 'doing', 'a', 'trick', 'while', 'riding', 'a', 'skateboard', '.', <end>]

This list of tokens is then turned into a list of integers, where every distinct word in the vocabulary has its own associated integer value:

[0, 3, 98, 754, 3, 396, 207, 139, 3, 753, 18, 1]

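Finally, this list is converted to a PyTorch tensor. All captions in the COCO dataset are pre-processed using the same steps as in lines 1-6 above. To convert a token to its corresponding integer, we call data_loader.dataset.vocab as a function. You can see exactly how this call works in the __call__ method of the Vocabulary class in vocabulary.py.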

def __call__(self, word):
    if word not in self.word2idx:
        return self.word2idx[self.unk_word]
    return self.word2idx[word]

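The word2idx instance variable is a Python dictionary indexed by string-valued keys, which are mostly tokens obtained from the training captions. For each key, the corresponding value is the integer that the token is mapped to in the pre-processing step. Use the code cell below to view a subset of this dictionary.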

# Preview the word2idx dictionary.
dict(list(data_loader.dataset.vocab.word2idx.items())[:10])

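The word2idx dictionary is created by looping over the captions in the training dataset. If a token appears at least vocab_threshold times in the training set, it is added as a key and assigned a unique integer. You can optionally change the vocab_threshold argument when instantiating the data loader. Note that, in general, smaller values of vocab_threshold yield more tokens in the vocabulary.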

# Modify the minimum word count threshold.
vocab_threshold = 4

# Obtain the data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=False)
# Print the total number of keys in the word2idx dictionary.
print('Total number of tokens in vocabulary:', len(data_loader.dataset.vocab))
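The effect of the threshold can be reproduced on a toy corpus. This is a minimal sketch, not the Vocabulary class from vocabulary.py: build_word2idx and encode are hypothetical helpers, the three captions are made up, and plain whitespace splitting stands in for nltk.tokenize.word_tokenize. It keeps the special tokens at indices 0, 1 and 2 as described in this article.

```python
from collections import Counter

def build_word2idx(captions, vocab_threshold,
                   start_word="<start>", end_word="<end>", unk_word="<unk>"):
    """Map every token seen at least vocab_threshold times to a unique integer."""
    # Special tokens take the first three indices: 0, 1 and 2.
    word2idx = {start_word: 0, end_word: 1, unk_word: 2}
    # Whitespace splitting is a stand-in for a real tokenizer.
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    for word, count in counts.items():
        if count >= vocab_threshold:
            word2idx[word] = len(word2idx)
    return word2idx

def encode(caption, word2idx):
    """Convert a caption to integer ids, wrapped in start and end markers."""
    unk = word2idx["<unk>"]
    ids = [word2idx["<start>"]]
    ids += [word2idx.get(tok, unk) for tok in caption.lower().split()]
    ids.append(word2idx["<end>"])
    return ids

# Made-up captions for illustration only.
toy_captions = [
    "a dog runs on the grass",
    "a dog sleeps",
    "a cat sleeps on the sofa",
]

strict = build_word2idx(toy_captions, vocab_threshold=2)
loose = build_word2idx(toy_captions, vocab_threshold=1)

print(len(strict), len(loose))        # 8 12
print(encode("a dog jumps", strict))  # [0, 3, 4, 2, 1]
```

Lowering the threshold from 2 to 1 grows the toy vocabulary from 8 to 12 entries, and the unseen word "jumps" falls back to the unknown-word index 2.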

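The word2idx dictionary also contains some special keys. You are already familiar with the special start word ("<start>") and the special end word ("<end>"). There is one more special token, which corresponds to unknown words ("<unk>"). Every token that does not appear in the word2idx dictionary is treated as an unknown word. In the pre-processing step, any unknown token is mapped to the integer 2.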

unk_word = data_loader.dataset.vocab.unk_word
print('Special unknown word:', unk_word)

print('All unknown words are mapped to this integer:', data_loader.dataset.vocab(unk_word))

print(data_loader.dataset.vocab('jfkafejw'))
print(data_loader.dataset.vocab('ieowoqjf'))

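The last thing to mention is the vocab_from_file argument supplied when creating a data loader. When a new data loader is created, the vocabulary (data_loader.dataset.vocab) is saved as a pickle file named vocab.pkl in the project folder. If you are still tuning the value of vocab_threshold, you must set vocab_from_file=False for your changes to take effect. Once you are happy with the value chosen for vocab_threshold, run the data loader one more time with that vocab_threshold to save the new vocabulary to the file. You can then set vocab_from_file=True to load the vocabulary from the file and speed up instantiation of the data loader. Note that building the vocabulary from scratch is the most time-consuming part of instantiating the data loader, so we strongly recommend setting vocab_from_file=True as soon as possible.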

# Obtain the data loader (from file). Note that it runs much faster than before!
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_from_file=True)
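The caching behaviour described above can be sketched with the standard pickle module. The helper load_or_build_vocab and the slow_build function below are hypothetical and only mimic the idea; the actual logic lives in vocabulary.py and may differ.

```python
import os
import pickle
import tempfile

def load_or_build_vocab(vocab_file, vocab_from_file, build_fn):
    """Hypothetical helper: load a cached vocabulary, or rebuild and save it."""
    if vocab_from_file and os.path.exists(vocab_file):
        with open(vocab_file, "rb") as f:
            return pickle.load(f)
    vocab = build_fn()  # the slow step: scans every training caption
    with open(vocab_file, "wb") as f:
        pickle.dump(vocab, f)
    return vocab

build_calls = []

def slow_build():
    """Stand-in for building word2idx from the full set of captions."""
    build_calls.append(1)
    return {"<start>": 0, "<end>": 1, "<unk>": 2, "a": 3}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.pkl")
    # First run: build from scratch and write the pickle file.
    first = load_or_build_vocab(path, vocab_from_file=False, build_fn=slow_build)
    # Second run: read the pickle file, skipping the expensive build.
    second = load_or_build_vocab(path, vocab_from_file=True, build_fn=slow_build)

print(first == second, len(build_calls))  # True 1
```

The second call returns the same dictionary without invoking the build function again, which is why loading from the file is so much faster.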

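Using the data loader to obtain batches

Caption lengths in the dataset vary widely, as you can see by examining the Python list data_loader.dataset.caption_lengths. This list has one entry per training caption, where the value stores the length of the corresponding caption.

In the code cell below, we use this list to print the total number of captions in the training data for each length. As you will see, most captions have length 10, and very short and very long captions are rare.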

from collections import Counter

# Tally the total number of training captions with each length.
counter = Counter(data_loader.dataset.caption_lengths)
lengths = sorted(counter.items(), key=lambda pair: pair[1], reverse=True)
for value, count in lengths:
    print('value: %2d --- count: %5d' % (value, count))

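To generate batches of training data, we first sample a caption length, where the probability of drawing any given length is proportional to the number of captions with that length in the dataset. We then retrieve a batch of batch_size image-caption pairs in which every caption has the sampled length. This approach to assembling batches matches the procedure in this paper and has been shown to be computationally efficient without degrading performance.

Run the code cell below to generate a batch. The get_train_indices method of the CoCoDataset class first samples a caption length and then samples batch_size indices corresponding to training data points whose captions have that length. These indices are stored in indices. They are supplied to the data loader, which uses them to retrieve the corresponding data points. The pre-processed images and captions of the batch are stored in images and captions.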

import numpy as np
import torch.utils.data as data

# Randomly sample a caption length, and sample indices with that length.
indices = data_loader.dataset.get_train_indices()
print('sampled indices:', indices)

# Create and assign a batch sampler to retrieve a batch with the sampled indices.
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
data_loader.batch_sampler.sampler = new_sampler
    
# Obtain the batch.
images, captions = next(iter(data_loader))
    
print('images.shape:', images.shape)
print('captions.shape:', captions.shape)

# (Optional) Uncomment the lines of code below to print the pre-processed images and captions.
# print('images:', images)
# print('captions:', captions)
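The length-based sampling can be sketched in a few lines of plain Python. The function sample_train_indices and the list of lengths below are hypothetical and only illustrate the idea; the real get_train_indices method may be implemented differently. Picking a random element of caption_lengths selects each length with probability proportional to how many captions have it.

```python
import random

def sample_train_indices(caption_lengths, batch_size, rng=random):
    """Hypothetical sketch: sample a length, then indices of captions with that length."""
    # Choosing a random entry picks each length in proportion to its frequency.
    sel_length = rng.choice(caption_lengths)
    # All data points whose caption has the sampled length.
    candidates = [i for i, n in enumerate(caption_lengths) if n == sel_length]
    # Sample with replacement, so a length with few captions still fills a batch.
    return [rng.choice(candidates) for _ in range(batch_size)]

# Made-up caption lengths: length 10 is the most common.
caption_lengths = [10, 10, 10, 10, 11, 11, 12, 10, 11, 9]

rng = random.Random(0)
indices = sample_train_indices(caption_lengths, batch_size=4, rng=rng)
batch_lengths = {caption_lengths[i] for i in indices}

print('sampled indices:', indices)
print('caption lengths in batch:', batch_lengths)
```

Because every caption in a batch has the same length, the captions can be stacked into one tensor without padding.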

Using the CNN encoder

Run the code cell below to import EncoderCNN and DecoderRNN from model.py.

# Watch for any changes in model.py, and re-load it automatically.
%load_ext autoreload
%autoreload 2

# Import EncoderCNN and DecoderRNN. 
from model import EncoderCNN, DecoderRNN

In the next code cell, we define a device that you will use to move PyTorch tensors to the GPU (if CUDA is available). Run this code cell before continuing.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

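Run the code cell below to instantiate the CNN encoder in encoder. The pre-processed images from the batch in Step 2 of this notebook are then passed through the encoder, and the output is stored in features.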

# Specify the dimensionality of the image embedding.
embed_size = 256

#-#-#-# Do NOT modify the code below this line. #-#-#-#

# Initialize the encoder. (Optional: Add additional arguments if necessary.)
encoder = EncoderCNN(embed_size)

# Move the encoder to GPU if CUDA is available.
encoder.to(device)
    
# Move last batch of images (from Step 2) to GPU if CUDA is available.   
images = images.to(device)

# Pass the images through the encoder.
features = encoder(images)

print('type(features):', type(features))
print('features.shape:', features.shape)

# Check that your encoder satisfies some requirements of the project! :D
assert type(features)==torch.Tensor, "Encoder output needs to be a PyTorch Tensor." 
assert (features.shape[0]==batch_size) & (features.shape[1]==embed_size), "The shape of the encoder output is incorrect."

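The encoder uses the pre-trained ResNet-50 architecture (with the final fully connected layer removed) to extract features from a batch of pre-processed images. The output is then flattened to a vector and passed through a Linear layer, which transforms the feature vector to the same size as the word embeddings.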
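The flatten-then-project pattern can be illustrated without downloading any weights. The sketch below is not the EncoderCNN from model.py: TinyEncoder is a hypothetical class, and a single small convolution stands in for the pre-trained ResNet-50 backbone. Only the structure (frozen backbone, flatten, trainable Linear layer) follows the description above.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical encoder: frozen backbone + trainable linear embedding."""

    def __init__(self, embed_size):
        super().__init__()
        # Stand-in backbone (NOT ResNet-50), ending in global average pooling.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Freeze the backbone, as one would with a pre-trained network.
        for param in self.backbone.parameters():
            param.requires_grad_(False)
        # The only trainable part: maps features to the embedding size.
        self.embed = nn.Linear(16, embed_size)

    def forward(self, images):
        features = self.backbone(images)                 # (batch, 16, 1, 1)
        features = features.view(features.size(0), -1)   # (batch, 16)
        return self.embed(features)                      # (batch, embed_size)

encoder = TinyEncoder(embed_size=256)
images = torch.randn(10, 3, 224, 224)  # a random batch in place of real images
features = encoder(images)

print('features.shape:', features.shape)
```

With a batch of 10 images and embed_size=256, the output has shape [10, 256], which is the shape the assertions in the cell above check for.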

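Implementing the RNN decoder

Write the __init__ and forward methods of the DecoderRNN class in model.py. The decoder will be an instance of the DecoderRNN class and must accept the following inputs:

the PyTorch tensor features containing the embedded image features (output in Step 3, when the last batch of images from Step 2 was passed through the encoder)
a PyTorch tensor corresponding to the last batch of captions (captions) from Step 2.

outputs should be a PyTorch tensor of size [batch_size, captions.shape[1], vocab_size]. The output is designed so that outputs[i,j,k] contains the model's predicted score indicating how likely it is that the j-th token in the i-th caption of the batch is the k-th token in the vocabulary.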

# Specify the number of features in the hidden state of the RNN decoder.
hidden_size = 512

#-#-#-# Do NOT modify the code below this line. #-#-#-#

# Store the size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the decoder.
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move the decoder to GPU if CUDA is available.
decoder.to(device)
    
# Move last batch of captions (from Step 2) to GPU if CUDA is available 
captions = captions.to(device)

# Pass the encoder output and captions through the decoder.
outputs = decoder(features, captions)

print('type(outputs):', type(outputs))
print('outputs.shape:', outputs.shape)

# Check that your decoder satisfies some requirements of the project! :D
assert type(outputs)==torch.Tensor, "Decoder output needs to be a PyTorch Tensor."
assert (outputs.shape[0]==batch_size) & (outputs.shape[1]==captions.shape[1]) & (outputs.shape[2]==vocab_size), "The shape of the decoder output is incorrect."
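One common way to meet the decoder's shape requirement is sketched below. SketchDecoder is a hypothetical class and not necessarily how DecoderRNN in model.py is written: it embeds the caption tokens to embed_size, drops the last token, prepends the image feature vector as the first step of the sequence, and maps every LSTM hidden state to vocabulary scores. The sizes and random inputs are made up for illustration.

```python
import torch
import torch.nn as nn

class SketchDecoder(nn.Module):
    """Hypothetical decoder: image feature first, then embedded caption words."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the last token so image + words has length captions.shape[1].
        words = self.word_embed(captions[:, :-1])       # (batch, T-1, embed_size)
        # The image feature has the same size as a word embedding.
        inputs = torch.cat([features.unsqueeze(1), words], dim=1)  # (batch, T, embed_size)
        hiddens, _ = self.lstm(inputs)                  # (batch, T, hidden_size)
        return self.fc(hiddens)                         # (batch, T, vocab_size)

# Made-up sizes and random inputs in place of real encoder output and captions.
batch_size, caption_len, vocab_size = 10, 12, 50
embed_size, hidden_size = 256, 512

decoder = SketchDecoder(embed_size, hidden_size, vocab_size)
features = torch.randn(batch_size, embed_size)
captions = torch.randint(0, vocab_size, (batch_size, caption_len))
outputs = decoder(features, captions)

print('outputs.shape:', outputs.shape)
```

Dropping the final token keeps the output length equal to captions.shape[1], so each output step can be compared against the caption token at the same position.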

Originally published: 2018.10.08
