BLIP

Last updated: 2024-08-13 09:55:21

This guide describes how to run the official demo of the BLIP model with the OpenVINO inference framework on TencentOS Server 3, launched via Docker.

Prerequisites

Make sure you have followed the CLIP document and completed every step that precedes running the model, so that the full OpenVINO environment is already in place.

Running the Model

Preparing the Model Environment

1. You should now be in the /opt/intel/openvino_2024.2.0.15519/ directory. Create a demo folder there to hold the model code.
mkdir -p demo/BLIP
cd demo/BLIP
2. Switch pip to the Tsinghua mirror (hosted in mainland China) to speed up downloads.
# switch pip to the Tsinghua mirror
# set it as the default so it stays in effect permanently
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
3. Install the packages required to run the BLIP model. Most of them are already present in the image; this step re-checks the more important ones so that the following steps do not fail (an optional import check is shown after the command):
pip install --extra-index-url https://download.pytorch.org/whl/cpu "torch>=2.1.0" torchvision "transformers>=4.26.0" "gradio>=4.19" "openvino>=2023.3.0" "datasets>=2.14.6" "tqdm" "matplotlib>=3.4"
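The following optional check confirms that the key packages can be imported. It is only a quick sanity check; the exact versions printed depend on your image:
python3 -c "import torch, transformers, openvino; print(torch.__version__, transformers.__version__)"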
The BLIP backbone used in this guide is blip-vqa-base. Given an image, two tasks can be performed (see the short sketch after this list):
Image captioning (Image Caption).
Visual question answering (Visual Question Answering).
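Both tasks are exposed by the OVBlipModel wrapper defined in blip_model.py below. The following is a minimal sketch of how they map onto its two methods; it assumes ov_model and inputs have already been built as in the blip.py script later in this guide:
# image captioning: only the preprocessed image is needed
caption_ids = ov_model.generate_caption(inputs["pixel_values"], max_length=20)
# visual question answering: the tokenized question is passed as well
answer_ids = ov_model.generate_answer(**inputs, max_length=20)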

Switching the Download Source for Model Weights

Since models cannot be downloaded from the Hugging Face website from mainland China, first switch the download endpoint to HF-Mirror, a mirror site hosted in mainland China.
Note:
If you added -e HF_ENDPOINT="https://hf-mirror.com" to docker run, this step can be skipped.
# effective for the current session only; it is lost once you exit and stop the container, so re-run this command every time you restart and re-enter the container
export HF_ENDPOINT="https://hf-mirror.com"
Caution:
Using echo 'export HF_ENDPOINT="https://hf-mirror.com"' >> ~/.bashrc here will still cause downloads to fail; do not use it.
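To confirm the mirror endpoint is active in the current shell before any weights are downloaded, you can check the variable (huggingface_hub picks the endpoint up from HF_ENDPOINT):
echo $HF_ENDPOINT
# expected output: https://hf-mirror.com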

Creating the Model Code

1. In the demo/BLIP folder, create a file named blip.py and put the following code into it:
import time
import os
import requests
from PIL import Image
from pathlib import Path
from transformers import BlipProcessor, BlipForQuestionAnswering
import torch
import openvino as ov
from functools import partial
from blip_model import text_decoder_forward
from blip_model import OVBlipModel
# get model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
# setup test input: download image
sample_path = Path("data/demo.jpg")
if os.path.exists(sample_path):
    print("sample exists.")
else:
    print("download sample.")
    sample_path.parent.mkdir(parents=True, exist_ok=True)
    r = requests.get("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg")
    with sample_path.open("wb") as f:
        f.write(r.content)
# read image, prepare question
raw_image = Image.open(sample_path).convert("RGB")
question = "how many dogs are in the picture?"
# preprocess input data
inputs = processor(raw_image, question, return_tensors="pt")
# Question Answering
start = time.perf_counter()
out = model.generate(**inputs)
end = time.perf_counter() - start
# postprocess result
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"question: {question} Answer: {answer}")
print(f"Processing time: {end:.4f} s\\n")
# use OpenVINO to run the BLIP model
# blip vision model to OpenVINO format
VISION_MODEL_OV = Path("blip_vision_model.xml")
vision_model = model.vision_model
vision_model.eval()
# check that the model works and save its outputs for reuse as text encoder input
with torch.no_grad():
    vision_outputs = vision_model(inputs["pixel_values"])
# if the OpenVINO model does not exist, convert it to IR
if not VISION_MODEL_OV.exists():
    # export the PyTorch model to ov.Model
    with torch.no_grad():
        ov_vision_model = ov.convert_model(vision_model, example_input=inputs["pixel_values"])
    # save the model on disk for later reuse
    ov.save_model(ov_vision_model, VISION_MODEL_OV)
    print(f"Vision model successfully converted and saved to {VISION_MODEL_OV}")
else:
    print(f"Vision model will be loaded from {VISION_MODEL_OV}")
# blip text encoder to OpenVINO format
TEXT_ENCODER_OV = Path("blip_text_encoder.xml")
text_encoder = model.text_encoder
text_encoder.eval()
# if the OpenVINO model does not exist, convert it to IR
if not TEXT_ENCODER_OV.exists():
    # prepare example inputs
    image_embeds = vision_outputs[0]
    image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long)
    input_dict = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "encoder_hidden_states": image_embeds,
        "encoder_attention_mask": image_attention_mask,
    }
    # export the PyTorch model
    with torch.no_grad():
        ov_text_encoder = ov.convert_model(text_encoder, example_input=input_dict)
    # save the model on disk for later reuse
    ov.save_model(ov_text_encoder, TEXT_ENCODER_OV)
    print(f"Text encoder successfully converted and saved to {TEXT_ENCODER_OV}")
else:
    print(f"Text encoder will be loaded from {TEXT_ENCODER_OV}")
# blip text decoder to OpenVINO format
TEXT_DECODER_OV = Path("blip_text_decoder_with_past.xml")
text_decoder = model.text_decoder
text_decoder.eval()
# prepare example inputs
input_ids = torch.tensor([[30522]]) # begin of sequence token id
attention_mask = torch.tensor([[1]]) # attention mask for input_ids
encoder_hidden_states = torch.rand((1, 10, 768)) # encoder last hidden state from text_encoder
encoder_attention_mask = torch.ones((1, 10), dtype=torch.long) # attention mask for encoder hidden states
input_dict = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "encoder_hidden_states": encoder_hidden_states,
    "encoder_attention_mask": encoder_attention_mask,
}
text_decoder_outs = text_decoder(**input_dict)
# extend input dictionary with hidden states from previous step
input_dict["past_key_values"] = text_decoder_outs["past_key_values"]
text_decoder.config.torchscript = True
if not TEXT_DECODER_OV.exists():
    # export the PyTorch model
    with torch.no_grad():
        ov_text_decoder = ov.convert_model(text_decoder, example_input=input_dict)
    # save the model on disk for later reuse
    ov.save_model(ov_text_decoder, TEXT_DECODER_OV)
    print(f"Text decoder successfully converted and saved to {TEXT_DECODER_OV}")
else:
    print(f"Text decoder will be loaded from {TEXT_DECODER_OV}")
# create OpenVINO Core object instance
core = ov.Core()
# check running device
device = "CPU"
# load models on device
ov_vision_model = core.compile_model(VISION_MODEL_OV, device)
ov_text_encoder = core.compile_model(TEXT_ENCODER_OV, device)
ov_text_decoder_with_past = core.compile_model(TEXT_DECODER_OV, device)
text_decoder.forward = partial(text_decoder_forward, ov_text_decoder_with_past=ov_text_decoder_with_past)
ov_model = OVBlipModel(model.config, model.decoder_start_token_id, ov_vision_model, ov_text_encoder, text_decoder)
# Image Captioning
out = ov_model.generate_caption(inputs["pixel_values"], max_length=20)
caption = processor.decode(out[0], skip_special_tokens=True)
print(f"caption: {caption}")
# Question Answering
start = time.perf_counter()
out = ov_model.generate_answer(**inputs, max_length=20)
end = time.perf_counter() - start
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"question: {question} Answer: {answer}")
print(f"Processing time: {end:.4f} s")
2. Create blip_model.py and put the following code into it:
import torch
import numpy as np
import openvino as ov
from typing import List, Dict
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
def init_past_inputs(model_inputs: List):
    """
    Helper function for initialization of past inputs on the first inference step
    Parameters:
      model_inputs (List): list of model inputs
    Returns:
      pkv (List[ov.Tensor]): list of filled past key values
    """
    pkv = []
    for input_tensor in model_inputs[4:]:
        partial_shape = input_tensor.partial_shape
        partial_shape[0] = 1
        partial_shape[2] = 0
        pkv.append(ov.Tensor(ov.Type.f32, partial_shape.get_shape()))
    return pkv
def postprocess_text_decoder_outputs(output: Dict):
    """
    Helper function for rearranging model outputs and wrapping them into CausalLMOutputWithCrossAttentions
    Parameters:
      output (Dict): dictionary with model output
    Returns:
      wrapped_outputs (CausalLMOutputWithCrossAttentions): outputs wrapped in the CausalLMOutputWithCrossAttentions format
    """
    logits = torch.from_numpy(output[0])
    past_kv = list(output.values())[1:]
    return CausalLMOutputWithCrossAttentions(
        loss=None,
        logits=logits,
        past_key_values=past_kv,
        hidden_states=None,
        attentions=None,
        cross_attentions=None,
    )
def text_decoder_forward(
    ov_text_decoder_with_past: ov.CompiledModel,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    past_key_values: List[ov.Tensor],
    encoder_hidden_states: torch.Tensor,
    encoder_attention_mask: torch.Tensor,
    **kwargs
):
    """
    Inference function for text_decoder in one generation step
    Parameters:
      input_ids (torch.Tensor): input token ids
      attention_mask (torch.Tensor): attention mask for input token ids
      past_key_values (List[ov.Tensor]): list of cached decoder hidden states from the previous step
      encoder_hidden_states (torch.Tensor): encoder (vision or text) hidden states
      encoder_attention_mask (torch.Tensor): attention mask for encoder hidden states
    Returns:
      model outputs (CausalLMOutputWithCrossAttentions): model prediction wrapped in the CausalLMOutputWithCrossAttentions class, including predicted logits and hidden states for caching
    """
    inputs = [input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask]
    if past_key_values is None:
        inputs.extend(init_past_inputs(ov_text_decoder_with_past.inputs))
    else:
        inputs.extend(past_key_values)
    outputs = ov_text_decoder_with_past(inputs)
    return postprocess_text_decoder_outputs(outputs)
class OVBlipModel:
    """
    Model class for running inference of the BLIP model with OpenVINO
    """

    def __init__(
        self,
        config,
        decoder_start_token_id: int,
        vision_model,
        text_encoder,
        text_decoder,
    ):
        """
        Initialization class parameters
        """
        self.vision_model = vision_model
        self.vision_model_out = vision_model.output(0)
        self.text_encoder = text_encoder
        self.text_encoder_out = text_encoder.output(0)
        self.text_decoder = text_decoder
        self.config = config
        self.decoder_start_token_id = decoder_start_token_id
        self.decoder_input_ids = config.text_config.bos_token_id
    def generate_answer(self, pixel_values: torch.Tensor, input_ids: torch.Tensor, attention_mask: torch.Tensor, **generate_kwargs):
        """
        Visual Question Answering prediction
        Parameters:
          pixel_values (torch.Tensor): preprocessed image pixel values
          input_ids (torch.Tensor): question token ids after tokenization
          attention_mask (torch.Tensor): attention mask for question tokens
        Returns:
          generation output (torch.Tensor): tensor which represents the sequence of generated answer token ids
        """
        image_embed = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out]
        image_attention_mask = np.ones(image_embed.shape[:-1], dtype=int)
        if isinstance(input_ids, list):
            input_ids = torch.LongTensor(input_ids)
        question_embeds = self.text_encoder(
            [
                input_ids.detach().numpy(),
                attention_mask.detach().numpy(),
                image_embed,
                image_attention_mask,
            ]
        )[self.text_encoder_out]
        question_attention_mask = np.ones(question_embeds.shape[:-1], dtype=int)
        bos_ids = np.full((question_embeds.shape[0], 1), fill_value=self.decoder_start_token_id)
        outputs = self.text_decoder.generate(
            input_ids=torch.from_numpy(bos_ids),
            eos_token_id=self.config.text_config.sep_token_id,
            pad_token_id=self.config.text_config.pad_token_id,
            encoder_hidden_states=torch.from_numpy(question_embeds),
            encoder_attention_mask=torch.from_numpy(question_attention_mask),
            **generate_kwargs,
        )
        return outputs
    def generate_caption(self, pixel_values: torch.Tensor, input_ids: torch.Tensor = None, attention_mask: torch.Tensor = None, **generate_kwargs):
        """
        Image Captioning prediction
        Parameters:
          pixel_values (torch.Tensor): preprocessed image pixel values
          input_ids (torch.Tensor, *optional*, None): pregenerated caption token ids after tokenization; if provided, caption generation continues the provided text
          attention_mask (torch.Tensor): attention mask for caption tokens, used only if input_ids is provided
        Returns:
          generation output (torch.Tensor): tensor which represents the sequence of generated caption token ids
        """
        batch_size = pixel_values.shape[0]
        image_embeds = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out]
        image_attention_mask = torch.ones(image_embeds.shape[:-1], dtype=torch.long)
        if isinstance(input_ids, list):
            input_ids = torch.LongTensor(input_ids)
        elif input_ids is None:
            input_ids = torch.LongTensor(
                [
                    [
                        self.config.text_config.bos_token_id,
                        self.config.text_config.eos_token_id,
                    ]
                ]
            ).repeat(batch_size, 1)
        input_ids[:, 0] = self.config.text_config.bos_token_id
        attention_mask = attention_mask[:, :-1] if attention_mask is not None else None
        outputs = self.text_decoder.generate(
            input_ids=input_ids[:, :-1],
            eos_token_id=self.config.text_config.sep_token_id,
            pad_token_id=self.config.text_config.pad_token_id,
            attention_mask=attention_mask,
            encoder_hidden_states=torch.from_numpy(image_embeds),
            encoder_attention_mask=image_attention_mask,
            **generate_kwargs,
        )
        return outputs
The code first runs the model with PyTorch for the visual question answering task, then converts the model to OpenVINO format and performs both image captioning and visual question answering.
Run the model:
python3 blip.py
The script first downloads the model and then the test image, which is saved to data/demo.jpg; the image is shown below:

3. The script then performs visual question answering with PyTorch and measures the run time. The output looks like the following (for reference):
question: how many dogs are in the picture? Answer: 1
Processing time: 0.2532 s
4. Next, the code converts the model to OpenVINO format and performs image captioning and visual question answering. The output looks like the following (for reference):
caption: dog is sitting on beach
question: how many dogs are in the picture? Answer: 1
Processing time: 0.1179 s
The first line is the image captioning result, and the second and third lines are the visual question answering result. Visual question answering inference with OpenVINO is significantly faster than inference with plain PyTorch.
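Single-shot timings like the ones above can fluctuate from run to run. If you want a steadier comparison, you can average several invocations; the following is a minimal sketch that reuses ov_model and inputs from blip.py, with an arbitrary run count:
import time

def avg_latency(fn, runs=10):
    # average wall-clock latency of fn() over several runs
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

print(f"OpenVINO VQA: {avg_latency(lambda: ov_model.generate_answer(**inputs, max_length=20)):.4f} s")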

Notes

Note:
Since OpenCloudOS is the open-source edition of TencentOS Server, all of the operations in this document should, in principle, also apply to OpenCloudOS.
