This guide describes how to run the official demo of the BLIP model with the OpenVINO inference framework on TencentOS Server 3, launched inside a Docker container.
Prerequisites
Running the model
Preparing the model environment
1. You should now be in the /opt/intel/openvino_2024.2.0.15519/ directory. Create a demo folder there to hold the model code:
mkdir -p demo/BLIP
cd demo/BLIP
2. Switch pip to the Tsinghua mirror in mainland China to speed up downloads:
# switch pip to the Tsinghua mirror
# set it as the default so it stays in effect permanently
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
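To confirm that the mirror is now the default index, the pip configuration can be listed (an optional sanity check, not part of the original steps):
pip config list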
3. Install the packages required to run the BLIP model (most of them already exist in the image; this step re-checks the important ones so that nothing fails later):
pip install --extra-index-url https://download.pytorch.org/whl/cpu "torch>=2.1.0" torchvision "transformers>=4.26.0" "gradio>=4.19" "openvino>=2023.3.0" "datasets>=2.14.6" "tqdm" "matplotlib>=3.4"
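Optionally (not part of the original guide), a one-line import check confirms that the key packages are usable before continuing:
python3 -c "import torch, torchvision, transformers, openvino; print(torch.__version__, transformers.__version__, openvino.__version__)"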
The BLIP backbone used in this guide is blip-vqa-base. For a given image, two tasks can be performed: image caption generation (Image Captioning) and visual question answering (Visual Question Answering).
Switching the model weight download source
Note:
If the container was started with docker run including the option -e HF_ENDPOINT="https://hf-mirror.com", this step can be skipped.
# effective only for the current session; it is lost after you exit the container and the container is stopped, so run this command again each time you restart and re-enter the container
export HF_ENDPOINT="https://hf-mirror.com"
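For reference, setting the mirror at container start time would look roughly like the following docker run invocation; the image name below is a placeholder and the remaining options should match however the container was originally created:
docker run -it -e HF_ENDPOINT="https://hf-mirror.com" <your-openvino-image> /bin/bash
Inside the container, the variable can be checked with:
echo $HF_ENDPOINT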
Caution:
Using the command echo 'export HF_ENDPOINT="https://hf-mirror.com"' >> ~/.bashrc will still lead to download failures; do not use it.
Creating the model code
1. In the demo/BLIP folder, create a file named blip.py and add the following code:
import time
import os
import requests
from PIL import Image
from pathlib import Path
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
import openvino as ov
from functools import partial
from blip_model import text_decoder_forward
from blip_model import OVBlipModel

# get model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# setup test input: download image
sample_path = Path("data/demo.jpg")
if os.path.exists(sample_path):
    print("sample exists.")
else:
    print("download sample.")
    sample_path.parent.mkdir(parents=True, exist_ok=True)
    r = requests.get("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg")
    with sample_path.open("wb") as f:
        f.write(r.content)

# read image, prepare question
raw_image = Image.open(sample_path).convert("RGB")
question = "how many dogs are in the picture?"

# preprocess input data
inputs = processor(raw_image, question, return_tensors="pt")

# Question Answering with PyTorch
start = time.perf_counter()
out = model.generate(**inputs)
end = time.perf_counter() - start

# postprocess result
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"question: {question} Answer: {answer}")
print(f"Processing time: {end:.4f} s\n")

# use OpenVINO to run the BLIP model
# blip vision model to OpenVINO format
VISION_MODEL_OV = Path("blip_vision_model.xml")
vision_model = model.vision_model
vision_model.eval()

# check that the model works and save its outputs for reuse as text encoder input
with torch.no_grad():
    vision_outputs = vision_model(inputs["pixel_values"])

# if the OpenVINO model does not exist, convert it to IR
if not VISION_MODEL_OV.exists():
    # export the PyTorch model to ov.Model
    with torch.no_grad():
        ov_vision_model = ov.convert_model(vision_model, example_input=inputs["pixel_values"])
    # save the model on disk for next usages
    ov.save_model(ov_vision_model, VISION_MODEL_OV)
    print(f"Vision model successfully converted and saved to {VISION_MODEL_OV}")
else:
    print(f"Vision model will be loaded from {VISION_MODEL_OV}")

# blip text encoder to OpenVINO format
TEXT_ENCODER_OV = Path("blip_text_encoder.xml")
text_encoder = model.text_encoder
text_encoder.eval()

# if the OpenVINO model does not exist, convert it to IR
if not TEXT_ENCODER_OV.exists():
    # prepare example inputs
    image_embeds = vision_outputs[0]
    image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long)
    input_dict = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "encoder_hidden_states": image_embeds,
        "encoder_attention_mask": image_attention_mask,
    }
    # export the PyTorch model
    with torch.no_grad():
        ov_text_encoder = ov.convert_model(text_encoder, example_input=input_dict)
    # save the model on disk for next usages
    ov.save_model(ov_text_encoder, TEXT_ENCODER_OV)
    print(f"Text encoder successfully converted and saved to {TEXT_ENCODER_OV}")
else:
    print(f"Text encoder will be loaded from {TEXT_ENCODER_OV}")

# blip text decoder to OpenVINO format
TEXT_DECODER_OV = Path("blip_text_decoder_with_past.xml")
text_decoder = model.text_decoder
text_decoder.eval()

# prepare example inputs
input_ids = torch.tensor([[30522]])  # begin of sequence token id
attention_mask = torch.tensor([[1]])  # attention mask for input_ids
encoder_hidden_states = torch.rand((1, 10, 768))  # encoder last hidden state from text_encoder
encoder_attention_mask = torch.ones((1, 10), dtype=torch.long)  # attention mask for encoder hidden states

input_dict = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "encoder_hidden_states": encoder_hidden_states,
    "encoder_attention_mask": encoder_attention_mask,
}
text_decoder_outs = text_decoder(**input_dict)
# extend the input dictionary with hidden states from the previous step
input_dict["past_key_values"] = text_decoder_outs["past_key_values"]

text_decoder.config.torchscript = True
if not TEXT_DECODER_OV.exists():
    # export the PyTorch model
    with torch.no_grad():
        ov_text_decoder = ov.convert_model(text_decoder, example_input=input_dict)
    # save the model on disk for next usages
    ov.save_model(ov_text_decoder, TEXT_DECODER_OV)
    print(f"Text decoder successfully converted and saved to {TEXT_DECODER_OV}")
else:
    print(f"Text decoder will be loaded from {TEXT_DECODER_OV}")

# create an OpenVINO Core object instance
core = ov.Core()

# select the running device
device = "CPU"

# load the models on the device
ov_vision_model = core.compile_model(VISION_MODEL_OV, device)
ov_text_encoder = core.compile_model(TEXT_ENCODER_OV, device)
ov_text_decoder_with_past = core.compile_model(TEXT_DECODER_OV, device)

text_decoder.forward = partial(text_decoder_forward, ov_text_decoder_with_past=ov_text_decoder_with_past)
ov_model = OVBlipModel(model.config, model.decoder_start_token_id, ov_vision_model, ov_text_encoder, text_decoder)

# Image Captioning
out = ov_model.generate_caption(inputs["pixel_values"], max_length=20)
caption = processor.decode(out[0], skip_special_tokens=True)
print(f"caption: {caption}")

# Question Answering with OpenVINO
start = time.perf_counter()
out = ov_model.generate_answer(**inputs, max_length=20)
end = time.perf_counter() - start
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"question: {question} Answer: {answer}")
print(f"Processing time: {end:.4f} s")
2. Create blip_model.py and add the following code:
import torch
import numpy as np
import openvino as ov
from typing import List, Dict
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions


def init_past_inputs(model_inputs: List):
    """
    Helper function for initialization of past inputs on the first inference step
    Parameters:
        model_inputs (List): list of model inputs
    Returns:
        pkv (List[ov.Tensor]): list of filled past key values
    """
    pkv = []
    for input_tensor in model_inputs[4:]:
        partial_shape = input_tensor.partial_shape
        partial_shape[0] = 1
        partial_shape[2] = 0
        pkv.append(ov.Tensor(ov.Type.f32, partial_shape.get_shape()))
    return pkv


def postprocess_text_decoder_outputs(output: Dict):
    """
    Helper function for rearranging model outputs and wrapping them to CausalLMOutputWithCrossAttentions
    Parameters:
        output (Dict): dictionary with model output
    Returns:
        wrapped_outputs (CausalLMOutputWithCrossAttentions): outputs wrapped to CausalLMOutputWithCrossAttentions format
    """
    logits = torch.from_numpy(output[0])
    past_kv = list(output.values())[1:]
    return CausalLMOutputWithCrossAttentions(
        loss=None,
        logits=logits,
        past_key_values=past_kv,
        hidden_states=None,
        attentions=None,
        cross_attentions=None,
    )


def text_decoder_forward(
    ov_text_decoder_with_past: ov.CompiledModel,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    past_key_values: List[ov.Tensor],
    encoder_hidden_states: torch.Tensor,
    encoder_attention_mask: torch.Tensor,
    **kwargs,
):
    """
    Inference function for text_decoder in one generation step
    Parameters:
        input_ids (torch.Tensor): input token ids
        attention_mask (torch.Tensor): attention mask for input token ids
        past_key_values (List[ov.Tensor]): list of cached decoder hidden states from the previous step
        encoder_hidden_states (torch.Tensor): encoder (vision or text) hidden states
        encoder_attention_mask (torch.Tensor): attention mask for encoder hidden states
    Returns:
        model outputs (CausalLMOutputWithCrossAttentions): model prediction wrapped to CausalLMOutputWithCrossAttentions class, including predicted logits and hidden states for caching
    """
    inputs = [input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask]
    if past_key_values is None:
        inputs.extend(init_past_inputs(ov_text_decoder_with_past.inputs))
    else:
        inputs.extend(past_key_values)
    outputs = ov_text_decoder_with_past(inputs)
    return postprocess_text_decoder_outputs(outputs)


class OVBlipModel:
    """
    Model class for inference of the BLIP model with OpenVINO
    """

    def __init__(
        self,
        config,
        decoder_start_token_id: int,
        vision_model,
        text_encoder,
        text_decoder,
    ):
        """
        Initialization class parameters
        """
        self.vision_model = vision_model
        self.vision_model_out = vision_model.output(0)
        self.text_encoder = text_encoder
        self.text_encoder_out = text_encoder.output(0)
        self.text_decoder = text_decoder
        self.config = config
        self.decoder_start_token_id = decoder_start_token_id
        self.decoder_input_ids = config.text_config.bos_token_id

    def generate_answer(self, pixel_values: torch.Tensor, input_ids: torch.Tensor, attention_mask: torch.Tensor, **generate_kwargs):
        """
        Visual Question Answering prediction
        Parameters:
            pixel_values (torch.Tensor): preprocessed image pixel values
            input_ids (torch.Tensor): question token ids after tokenization
            attention_mask (torch.Tensor): attention mask for question tokens
        Returns:
            generation output (torch.Tensor): tensor which represents the sequence of generated answer token ids
        """
        image_embed = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out]
        image_attention_mask = np.ones(image_embed.shape[:-1], dtype=int)
        if isinstance(input_ids, list):
            input_ids = torch.LongTensor(input_ids)
        question_embeds = self.text_encoder(
            [
                input_ids.detach().numpy(),
                attention_mask.detach().numpy(),
                image_embed,
                image_attention_mask,
            ]
        )[self.text_encoder_out]
        question_attention_mask = np.ones(question_embeds.shape[:-1], dtype=int)
        bos_ids = np.full((question_embeds.shape[0], 1), fill_value=self.decoder_start_token_id)
        outputs = self.text_decoder.generate(
            input_ids=torch.from_numpy(bos_ids),
            eos_token_id=self.config.text_config.sep_token_id,
            pad_token_id=self.config.text_config.pad_token_id,
            encoder_hidden_states=torch.from_numpy(question_embeds),
            encoder_attention_mask=torch.from_numpy(question_attention_mask),
            **generate_kwargs,
        )
        return outputs

    def generate_caption(self, pixel_values: torch.Tensor, input_ids: torch.Tensor = None, attention_mask: torch.Tensor = None, **generate_kwargs):
        """
        Image Captioning prediction
        Parameters:
            pixel_values (torch.Tensor): preprocessed image pixel values
            input_ids (torch.Tensor, *optional*, None): pregenerated caption token ids after tokenization; if provided, caption generation continues the provided text
            attention_mask (torch.Tensor): attention mask for caption tokens, used only if input_ids is provided
        Returns:
            generation output (torch.Tensor): tensor which represents the sequence of generated caption token ids
        """
        batch_size = pixel_values.shape[0]
        image_embeds = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out]
        image_attention_mask = torch.ones(image_embeds.shape[:-1], dtype=torch.long)
        if isinstance(input_ids, list):
            input_ids = torch.LongTensor(input_ids)
        elif input_ids is None:
            input_ids = torch.LongTensor(
                [
                    [
                        self.config.text_config.bos_token_id,
                        self.config.text_config.eos_token_id,
                    ]
                ]
            ).repeat(batch_size, 1)
        input_ids[:, 0] = self.config.text_config.bos_token_id
        attention_mask = attention_mask[:, :-1] if attention_mask is not None else None
        outputs = self.text_decoder.generate(
            input_ids=input_ids[:, :-1],
            eos_token_id=self.config.text_config.sep_token_id,
            pad_token_id=self.config.text_config.pad_token_id,
            attention_mask=attention_mask,
            encoder_hidden_states=torch.from_numpy(image_embeds),
            encoder_attention_mask=image_attention_mask,
            **generate_kwargs,
        )
        return outputs
The code first runs the model with PyTorch to perform visual question answering, then converts the model to OpenVINO format and performs both image caption generation and visual question answering.
Run the model:
python3 blip.py
The model is downloaded first, followed by the test image, which is saved as data/demo.jpg and shown below:

3. The script then performs visual question answering with PyTorch and measures the run time. The result looks like the following (for reference):
question: how many dogs are in the picture? Answer: 1
Processing time: 0.2532 s
4. Next, the code converts the model to OpenVINO format and performs image caption generation and visual question answering. The results look like the following (for reference):
caption: dog is sitting on beach
question: how many dogs are in the picture? Answer: 1
Processing time: 0.1179 s
The first line is the image captioning result, and the second and third lines are the visual question answering result. As the timings show, visual question answering inference with OpenVINO is noticeably faster than with plain PyTorch.
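After the script finishes, the converted IR files are saved next to blip.py; listing them is a quick way to confirm the conversion succeeded. If the benchmark_app tool shipped with OpenVINO is available in the container, it can additionally profile one of the converted models on CPU. Both commands are optional checks, not part of the original guide:
ls -lh blip_vision_model.xml blip_text_encoder.xml blip_text_decoder_with_past.xml
benchmark_app -m blip_vision_model.xml -d CPU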
Notes
Note:
Since OpenCloudOS is the open-source edition of TencentOS Server, all of the operations described above should in principle also work on OpenCloudOS.
References