BLIP

Last updated: 2024-08-13 09:55:21

This guide describes how to run the official demo of the BLIP model with the OpenVINO inference framework on TencentOS Server 3, launched via Docker.

Prerequisites

Make sure you have followed the CLIP document and completed every step that precedes running the model, so that the full OpenVINO environment is already in place.

Running the Model

Preparing the Model Environment

1. You should now be in the /opt/intel/openvino_2024.2.0.15519/ directory. Create a demo folder there to hold the model code.
mkdir -p demo/BLIP
cd demo/BLIP
2. Switch pip to the Tsinghua mirror (hosted in mainland China) to speed up downloads.
# switch pip to the Tsinghua mirror
# set it as the default so it stays in effect permanently
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
3. Install the packages required to run the BLIP model. Most of them are already present in the image; this step re-checks the more important ones so that the following steps do not fail (an optional import check is shown after the command):
pip install --extra-index-url https://download.pytorch.org/whl/cpu "torch>=2.1.0" torchvision "transformers>=4.26.0" "gradio>=4.19" "openvino>=2023.3.0" "datasets>=2.14.6" "tqdm" "matplotlib>=3.4"
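The following optional check confirms that the key packages can be imported. It is only a quick sanity check; the exact versions printed depend on your image:
python3 -c "import torch, transformers, openvino; print(torch.__version__, transformers.__version__)"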
The BLIP backbone used in this guide is blip-vqa-base. Given an image, two tasks can be performed (see the short sketch after this list):
Image captioning (Image Caption).
Visual question answering (Visual Question Answering).
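Both tasks are exposed by the OVBlipModel wrapper defined in blip_model.py below. The following is a minimal sketch of how they map onto its two methods; it assumes ov_model and inputs have already been built as in the blip.py script later in this guide:
# image captioning: only the preprocessed image is needed
caption_ids = ov_model.generate_caption(inputs["pixel_values"], max_length=20)
# visual question answering: the tokenized question is passed as well
answer_ids = ov_model.generate_answer(**inputs, max_length=20)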

Switching the Download Source for Model Weights

Since models cannot be downloaded from the Hugging Face website from mainland China, first switch the download endpoint to HF-Mirror, a mirror site hosted in mainland China.
Note:
If you added -e HF_ENDPOINT="https://hf-mirror.com" to docker run, this step can be skipped.
# effective for the current session only; it is lost once you exit and stop the container, so re-run this command every time you restart and re-enter the container
export HF_ENDPOINT="https://hf-mirror.com"
Caution:
Using echo 'export HF_ENDPOINT="https://hf-mirror.com"' >> ~/.bashrc here will still cause downloads to fail; do not use it.
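To confirm the mirror endpoint is active in the current shell before any weights are downloaded, you can check the variable (huggingface_hub picks the endpoint up from HF_ENDPOINT):
echo $HF_ENDPOINT
# expected output: https://hf-mirror.com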

Creating the Model Code

1. In the demo/BLIP folder, create a file named blip.py and put the following code into it:
import time
import os
import requests
from PIL import Image
from pathlib import Path
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
import openvino as ov
from functools import partial
from blip_model import text_decoder_forward
from blip_model import OVBlipModel
# get model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
# setup test input: download image
sample_path = Path("data/demo.jpg")
if os.path.exists(sample_path):
    print("sample exists.")
else:
    print("download sample.")
    sample_path.parent.mkdir(parents=True, exist_ok=True)
    r = requests.get("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg")
    with sample_path.open("wb") as f:
        f.write(r.content)
# read image, prepare question
raw_image = Image.open(sample_path).convert("RGB")
question = "how many dogs are in the picture?"
# preprocess input data
inputs = processor(raw_image, question, return_tensors="pt")
# Question Answering
start = time.perf_counter()
out = model.generate(**inputs)
end = time.perf_counter() - start
# postprocess result
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"question: {question} Answer: {answer}")
print(f"Processing time: {end:.4f} s\\n")
# use OpenVINO to run the BLIP model
# blip vision model to OpenVINO format
VISION_MODEL_OV = Path("blip_vision_model.xml")
vision_model = model.vision_model
vision_model.eval()
# check that the model works and save its outputs for reuse as text encoder input
with torch.no_grad():
    vision_outputs = vision_model(inputs["pixel_values"])
# if the OpenVINO model does not exist, convert it to IR
if not VISION_MODEL_OV.exists():
    # export the PyTorch model to ov.Model
    with torch.no_grad():
        ov_vision_model = ov.convert_model(vision_model, example_input=inputs["pixel_values"])
    # save the model on disk for later reuse
    ov.save_model(ov_vision_model, VISION_MODEL_OV)
    print(f"Vision model successfully converted and saved to {VISION_MODEL_OV}")
else:
    print(f"Vision model will be loaded from {VISION_MODEL_OV}")
# blip text encoder to OpenVINO format
TEXT_ENCODER_OV = Path("blip_text_encoder.xml")
text_encoder = model.text_encoder
text_encoder.eval()
# if the OpenVINO model does not exist, convert it to IR
if not TEXT_ENCODER_OV.exists():
    # prepare example inputs
    image_embeds = vision_outputs[0]
    image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long)
    input_dict = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "encoder_hidden_states": image_embeds,
        "encoder_attention_mask": image_attention_mask,
    }
    # export the PyTorch model
    with torch.no_grad():
        ov_text_encoder = ov.convert_model(text_encoder, example_input=input_dict)
    # save the model on disk for later reuse
    ov.save_model(ov_text_encoder, TEXT_ENCODER_OV)
    print(f"Text encoder successfully converted and saved to {TEXT_ENCODER_OV}")
else:
    print(f"Text encoder will be loaded from {TEXT_ENCODER_OV}")
# blip text decoder to OpenVINO format
TEXT_DECODER_OV = Path("blip_text_decoder_with_past.xml")
text_decoder = model.text_decoder
text_decoder.eval()
# prepare example inputs
input_ids = torch.tensor([[30522]]) # begin of sequence token id
attention_mask = torch.tensor([[1]]) # attention mask for input_ids
encoder_hidden_states = torch.rand((1, 10, 768)) # encoder last hidden state from text_encoder
encoder_attention_mask = torch.ones((1, 10), dtype=torch.long) # attention mask for encoder hidden states
input_dict = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "encoder_hidden_states": encoder_hidden_states,
    "encoder_attention_mask": encoder_attention_mask,
}
text_decoder_outs = text_decoder(**input_dict)
# extend input dictionary with hidden states from previous step
input_dict["past_key_values"] = text_decoder_outs["past_key_values"]
text_decoder.config.torchscript = True
if not TEXT_DECODER_OV.exists():
    # export the PyTorch model
    with torch.no_grad():
        ov_text_decoder = ov.convert_model(text_decoder, example_input=input_dict)
    # save the model on disk for later reuse
    ov.save_model(ov_text_decoder, TEXT_DECODER_OV)
    print(f"Text decoder successfully converted and saved to {TEXT_DECODER_OV}")
else:
    print(f"Text decoder will be loaded from {TEXT_DECODER_OV}")
# create OpenVINO Core object instance
core = ov.Core()
# check running device
device = "CPU"
# load models on device
ov_vision_model = core.compile_model(VISION_MODEL_OV, device)
ov_text_encoder = core.compile_model(TEXT_ENCODER_OV, device)
ov_text_decoder_with_past = core.compile_model(TEXT_DECODER_OV, device)
text_decoder.forward = partial(text_decoder_forward, ov_text_decoder_with_past=ov_text_decoder_with_past)
ov_model = OVBlipModel(model.config, model.decoder_start_token_id, ov_vision_model, ov_text_encoder, text_decoder)
# Image Captioning
out = ov_model.generate_caption(inputs["pixel_values"], max_length=20)
caption = processor.decode(out[0], skip_special_tokens=True)
print(f"caption: {caption}")
# Question Answering
start = time.perf_counter()
out = ov_model.generate_answer(**inputs, max_length=20)
end = time.perf_counter() - start
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"question: {question} Answer: {answer}")
print(f"Processing time: {end:.4f} s")
2. Create blip_model.py and put the following code into it:
import torch
import numpy as np
import openvino as ov
from typing import List, Dict
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
def init_past_inputs(model_inputs: List):
    """
    Helper function for initialization of past inputs on the first inference step
    Parameters:
      model_inputs (List): list of model inputs
    Returns:
      pkv (List[ov.Tensor]): list of filled past key values
    """
    pkv = []
    for input_tensor in model_inputs[4:]:
        partial_shape = input_tensor.partial_shape
        partial_shape[0] = 1
        partial_shape[2] = 0
        pkv.append(ov.Tensor(ov.Type.f32, partial_shape.get_shape()))
    return pkv
def postprocess_text_decoder_outputs(output: Dict):
    """
    Helper function for rearranging model outputs and wrapping them into CausalLMOutputWithCrossAttentions
    Parameters:
      output (Dict): dictionary with model output
    Returns:
      wrapped_outputs (CausalLMOutputWithCrossAttentions): outputs wrapped in the CausalLMOutputWithCrossAttentions format
    """
    logits = torch.from_numpy(output[0])
    past_kv = list(output.values())[1:]
    return CausalLMOutputWithCrossAttentions(
        loss=None,
        logits=logits,
        past_key_values=past_kv,
        hidden_states=None,
        attentions=None,
        cross_attentions=None,
    )
def text_decoder_forward(
    ov_text_decoder_with_past: ov.CompiledModel,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    past_key_values: List[ov.Tensor],
    encoder_hidden_states: torch.Tensor,
    encoder_attention_mask: torch.Tensor,
    **kwargs
):
    """
    Inference function for text_decoder in one generation step
    Parameters:
      input_ids (torch.Tensor): input token ids
      attention_mask (torch.Tensor): attention mask for input token ids
      past_key_values (List[ov.Tensor]): list of cached decoder hidden states from the previous step
      encoder_hidden_states (torch.Tensor): encoder (vision or text) hidden states
      encoder_attention_mask (torch.Tensor): attention mask for encoder hidden states
    Returns:
      model outputs (CausalLMOutputWithCrossAttentions): model prediction wrapped in the CausalLMOutputWithCrossAttentions class, including predicted logits and hidden states for caching
    """
    inputs = [input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask]
    if past_key_values is None:
        inputs.extend(init_past_inputs(ov_text_decoder_with_past.inputs))
    else:
        inputs.extend(past_key_values)
    outputs = ov_text_decoder_with_past(inputs)
    return postprocess_text_decoder_outputs(outputs)
class OVBlipModel:
    """
    Model class for running inference of the BLIP model with OpenVINO
    """

    def __init__(
        self,
        config,
        decoder_start_token_id: int,
        vision_model,
        text_encoder,
        text_decoder,
    ):
        """
        Initialization class parameters
        """
        self.vision_model = vision_model
        self.vision_model_out = vision_model.output(0)
        self.text_encoder = text_encoder
        self.text_encoder_out = text_encoder.output(0)
        self.text_decoder = text_decoder
        self.config = config
        self.decoder_start_token_id = decoder_start_token_id
        self.decoder_input_ids = config.text_config.bos_token_id
    def generate_answer(self, pixel_values: torch.Tensor, input_ids: torch.Tensor, attention_mask: torch.Tensor, **generate_kwargs):
        """
        Visual Question Answering prediction
        Parameters:
          pixel_values (torch.Tensor): preprocessed image pixel values
          input_ids (torch.Tensor): question token ids after tokenization
          attention_mask (torch.Tensor): attention mask for question tokens
        Returns:
          generation output (torch.Tensor): tensor which represents the sequence of generated answer token ids
        """
        image_embed = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out]
        image_attention_mask = np.ones(image_embed.shape[:-1], dtype=int)
        if isinstance(input_ids, list):
            input_ids = torch.LongTensor(input_ids)
        question_embeds = self.text_encoder(
            [
                input_ids.detach().numpy(),
                attention_mask.detach().numpy(),
                image_embed,
                image_attention_mask,
            ]
        )[self.text_encoder_out]
        question_attention_mask = np.ones(question_embeds.shape[:-1], dtype=int)
        bos_ids = np.full((question_embeds.shape[0], 1), fill_value=self.decoder_start_token_id)
        outputs = self.text_decoder.generate(
            input_ids=torch.from_numpy(bos_ids),
            eos_token_id=self.config.text_config.sep_token_id,
            pad_token_id=self.config.text_config.pad_token_id,
            encoder_hidden_states=torch.from_numpy(question_embeds),
            encoder_attention_mask=torch.from_numpy(question_attention_mask),
            **generate_kwargs,
        )
        return outputs
    def generate_caption(self, pixel_values: torch.Tensor, input_ids: torch.Tensor = None, attention_mask: torch.Tensor = None, **generate_kwargs):
        """
        Image Captioning prediction
        Parameters:
          pixel_values (torch.Tensor): preprocessed image pixel values
          input_ids (torch.Tensor, *optional*, None): pregenerated caption token ids after tokenization; if provided, caption generation continues the provided text
          attention_mask (torch.Tensor): attention mask for caption tokens, used only if input_ids is provided
        Returns:
          generation output (torch.Tensor): tensor which represents the sequence of generated caption token ids
        """
        batch_size = pixel_values.shape[0]
        image_embeds = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out]
        image_attention_mask = torch.ones(image_embeds.shape[:-1], dtype=torch.long)
        if isinstance(input_ids, list):
            input_ids = torch.LongTensor(input_ids)
        elif input_ids is None:
            input_ids = torch.LongTensor(
                [
                    [
                        self.config.text_config.bos_token_id,
                        self.config.text_config.eos_token_id,
                    ]
                ]
            ).repeat(batch_size, 1)
        input_ids[:, 0] = self.config.text_config.bos_token_id
        attention_mask = attention_mask[:, :-1] if attention_mask is not None else None
        outputs = self.text_decoder.generate(
            input_ids=input_ids[:, :-1],
            eos_token_id=self.config.text_config.sep_token_id,
            pad_token_id=self.config.text_config.pad_token_id,
            attention_mask=attention_mask,
            encoder_hidden_states=torch.from_numpy(image_embeds),
            encoder_attention_mask=image_attention_mask,
            **generate_kwargs,
        )
        return outputs
The code first runs the model with PyTorch for the visual question answering task, then converts the model to OpenVINO format and performs both image captioning and visual question answering.
Run the model:
python3 blip.py
The script first downloads the model and then the test image, which is saved to data/demo.jpg; the image is shown below:

3. The script then performs visual question answering with PyTorch and measures the run time. The output looks like the following (for reference):
question: how many dogs are in the picture? Answer: 1
Processing time: 0.2532 s
4. Next, the code converts the model to OpenVINO format and performs image captioning and visual question answering. The output looks like the following (for reference):
caption: dog is sitting on beach
question: how many dogs are in the picture? Answer: 1
Processing time: 0.1179 s
The first line is the image captioning result, and the second and third lines are the visual question answering result. Visual question answering inference with OpenVINO is significantly faster than inference with plain PyTorch.
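Single-shot timings like the ones above can fluctuate from run to run. If you want a steadier comparison, you can average several invocations; the following is a minimal sketch that reuses ov_model and inputs from blip.py, with an arbitrary run count:
import time

def avg_latency(fn, runs=10):
    # average wall-clock latency of fn() over several runs
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

print(f"OpenVINO VQA: {avg_latency(lambda: ov_model.generate_answer(**inputs, max_length=20)):.4f} s")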

Notes

Note:
Since OpenCloudOS is the open-source edition of TencentOS Server, all of the operations in this document should, in principle, also apply to OpenCloudOS.
