TACO Kit Computing Acceleration Suite: Quantization

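This document introduces the basic concepts of model quantization and walks through the complete workflow for deploying a quantized model with TACO-LLM.
Quantization Overview
Model quantization generally means converting floating-point weights, which take continuous values (typically fp32 or fp16) or a very large number of discrete values, into a finite set of discrete values (typically int8 or int4). The process causes a slight loss of inference accuracy, but it offers the following advantages:
Smaller model size
Lower memory usage
Faster inference on devices that support low-precision arithmetic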
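As a rough illustration of the size reduction, the sketch below estimates weight storage at different bit widths. The 7-billion-parameter count is an assumed example, and the estimate covers weights only: it ignores quantization scales, zero points, activations, and the KV cache.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (weights only)."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    num_params = 7_000_000_000  # assumed example size, not taken from this document
    for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{name}: {weight_memory_gib(num_params, bits):.2f} GiB")
```

Halving the bit width halves the weight storage, which is why int4 weights take roughly a quarter of the space of fp16 weights.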
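1. Quantization Bit Width
The bit widths most commonly used in industry today are 4 bits and 8 bits. Quantization below 4 bits is referred to as low-bit quantization.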
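2. Quantization Targets
Weights: Weight quantization is the most common form. It reduces model size and storage footprint.
Activations: Quantizing activations can greatly reduce memory usage and, combined with weight quantization, makes full use of the device's compute capability.
KV cache: KV cache memory grows linearly with the length of the generated sequence. Quantizing the KV cache saves GPU memory, which allows larger batch sizes.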
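To make the linear growth of the KV cache concrete, here is a small sketch of the usual size formula. The model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical example chosen for round numbers, not a TACO-LLM default.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Size of the key and value caches for one batch (the factor 2 covers K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

fp16_cache = kv_cache_bytes(**shape, seq_len=4096, batch_size=8, bytes_per_elem=2)
int8_cache = kv_cache_bytes(**shape, seq_len=4096, batch_size=8, bytes_per_elem=1)
print(f"16-bit KV cache: {fp16_cache / 2**30:.1f} GiB")
print(f" 8-bit KV cache: {int8_cache / 2**30:.1f} GiB")
```

With the same memory budget, an 8-bit cache leaves room for about twice the batch size or twice the sequence length of a 16-bit cache.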
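Quantization can also be applied at different granularities, such as per-tensor or per-group. For activations, there is a further distinction between dynamic and static quantization.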
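The effect of granularity can be seen in a small sketch that compares one shared scale (per-tensor) with one scale per group of weights (per-group). The weight values and helper names are made up for illustration. The first four weights are much smaller than the last four, so a single scale serves them poorly.

```python
def fake_quantize(values, bits=4):
    """Symmetric linear quantization with one shared scale, followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_group_fake_quantize(values, group_size, bits=4):
    """Apply fake_quantize independently to each consecutive group of values."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize(values[start:start + group_size], bits))
    return out


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


weights = [0.01, -0.02, 0.03, -0.015, 2.0, -1.5, 1.0, 0.5]  # made-up values
per_tensor_error = mean_abs_error(weights, fake_quantize(weights))
per_group_error = mean_abs_error(weights, per_group_fake_quantize(weights, group_size=4))
print(f"per-tensor error: {per_tensor_error:.4f}")
print(f"per-group error:  {per_group_error:.4f}")
```

Per-group quantization gives the small weights their own scale, so their values are no longer rounded to zero. The cost is one extra scale stored per group.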
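3. Quantization Schemes
Linear quantization: Maps the floating-point range uniformly onto the integer range using a fixed step size. It is simple to implement, hardware-friendly, and well suited to data with a relatively uniform distribution.
Non-linear quantization: Quantizes non-uniformly according to the actual distribution of the data, using finer quantization steps in densely populated regions. It is more complex to implement and compute, but in theory achieves better accuracy.
In production inference, linear quantization is usually chosen because non-linear quantization is computationally expensive.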
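A minimal sketch of linear (affine) quantization follows: the float range is mapped uniformly onto unsigned 8-bit integers using a fixed step size (`scale`) and an integer offset (`zero_point`). The function names and sample values are illustrative, not part of any TACO-LLM API.

```python
def affine_quantize(values, bits=8):
    """Map floats uniformly onto unsigned integers in [0, 2**bits - 1]."""
    qmin, qmax = 0, 2 ** bits - 1
    lo = min(min(values), 0.0)  # include 0.0 so that zero is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0  # the fixed step size
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def affine_dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(qi - zero_point) * scale for qi in q]


values = [-1.0, -0.5, 0.0, 0.25, 2.0]  # made-up sample data
q, scale, zero_point = affine_quantize(values)
restored = affine_dequantize(q, scale, zero_point)
print("codes:   ", q)
print("restored:", [round(r, 4) for r in restored])
```

Each restored value differs from the original by at most about half a step, which is the rounding error inherent in a uniform grid.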
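4. Quantization Methods
Quantization-aware training (QAT): Simulates the effect of quantization during training and uses backpropagation to compensate for the quantization error, so that the model adapts to the loss that quantization introduces.
Post-training quantization (PTQ): After training is complete, uses a small amount of calibration data to determine the quantization parameters and quantizes the model directly, with no retraining.
In production inference, PTQ is more widely used. Its main advantages are simplicity and efficiency, though it may introduce some accuracy loss.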
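The calibration step of PTQ can be sketched in its simplest form: a scale is derived once from calibration data and then reused for every later input (static quantization). The data and function names are made up for illustration. Production methods such as GPTQ or AWQ use considerably more elaborate procedures.

```python
def calibrate_scale(calibration_batches, bits=8):
    """Derive one scale from the largest magnitude seen in the calibration data."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for batch in calibration_batches for x in batch)
    return amax / qmax


def quantize(values, scale, bits=8):
    """Quantize with a fixed, precomputed scale; out-of-range values are clipped."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Made-up calibration activations, for illustration only.
calibration_batches = [[0.5, -1.0, 0.2], [2.54, 0.1, -0.7]]
scale = calibrate_scale(calibration_batches)

# At inference time the stored scale is reused; 5.0 exceeds the calibrated range and is clipped.
print(quantize([1.0, -0.3, 5.0], scale))
```

Dynamic quantization would instead recompute the scale from each input at run time, which avoids clipping unseen outliers at the cost of extra work per request.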
Quantization Support in TACO-LLM
The following list shows which quantization schemes TACO-LLM supports on each hardware platform:
GPTQ: supported on Volta, Turing, Ampere, Ada, Hopper, and Intel CPU.
AWQ: supported on Turing, Ampere, Ada, Hopper, and Intel CPU.
Marlin: supported on Ampere, Ada, and Hopper.
FP8: supported on Ada and Hopper.
Bitsandbytes: supported on Turing, Ampere, Ada, and Hopper.
INT8 (W8A8): supported on Turing, Ampere, Ada, and Hopper.
AQLM: supported on Volta, Turing, Ampere, Ada, and Hopper.
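For scripting, the support list above can be transcribed into a lookup table. This is only a convenience sketch; `SUPPORT` and `schemes_for` are hypothetical names and not part of TACO-LLM.

```python
# Support matrix transcribed from the list above.
SUPPORT = {
    "GPTQ": ["Volta", "Turing", "Ampere", "Ada", "Hopper", "Intel CPU"],
    "AWQ": ["Turing", "Ampere", "Ada", "Hopper", "Intel CPU"],
    "Marlin": ["Ampere", "Ada", "Hopper"],
    "FP8": ["Ada", "Hopper"],
    "Bitsandbytes": ["Turing", "Ampere", "Ada", "Hopper"],
    "INT8(W8A8)": ["Turing", "Ampere", "Ada", "Hopper"],
    "AQLM": ["Volta", "Turing", "Ampere", "Ada", "Hopper"],
}


def schemes_for(hardware: str) -> list[str]:
    """Return the quantization schemes listed as supported on the given platform."""
    return [scheme for scheme, platforms in SUPPORT.items() if hardware in platforms]


if __name__ == "__main__":
    for hw in ["Volta", "Ampere", "Hopper", "Intel CPU"]:
        print(hw, "->", schemes_for(hw))
```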
TACO-LLM Quick Start
Run taco_llm serve -h to view the full set of online-mode parameters for TACO-LLM. The quantization parameter is described as follows:
--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,experts_int8,qqq,neuron_quant,None}, -q {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,experts_int8,qqq,neuron_quant,None}
                        Method used to quantize the weights. If None, we first check the `quantization_config` attribute in the model config file. If that is None, we assume the model weights are not quantized and use `dtype` to determine the
                        data type of the weights.
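1. GPTQ-Marlin (AWQ):
First quantize the fp16 model weights with AutoGPTQ (or AutoAWQ). No additional arguments are needed when starting TACO-LLM: the server automatically reads the quantization parameters from the config file to load the model. When conditions allow, TACO-LLM uses the Marlin kernel by default. Pass --quantization gptq to force the GPTQ kernel.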
2. Bitsandbytes:
Add the --quantization bitsandbytes argument at startup. The server automatically reads the quantization parameters from the config file to load the model.
3. FP8 (W8A8):
TACO-LLM uses dynamic quantization to convert BF16/FP16 to FP8 and does not require an additional calibration dataset. All linear modules except lm_head are quantized per-tensor.
from taco_llm import LLM

# Placeholder: replace with the path to your BF16/FP16 model.
model_path = "/path/to/model"
model = LLM(model_path, quantization="fp8")
result = model.generate("Tell me about computer science.")
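To show numerically what dynamic per-tensor FP8 quantization involves, here is a pure-Python sketch that simulates rounding to the E4M3 format (4 exponent bits, 3 mantissa bits, largest finite value 448). It is an illustration under stated assumptions: saturating at ±448 and using E4M3 specifically are choices made here, and the source does not describe TACO-LLM's actual kernels.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the E4M3 format


def round_to_fp8_e4m3(x: float) -> float:
    """Round to the nearest E4M3-representable value, saturating at +/-448 (assumed)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    if mag < 2.0 ** -6:            # subnormal range: fixed step
        step = 2.0 ** -9
    else:                          # normal range: 3 mantissa bits, 8 steps per power of two
        _, exp = math.frexp(mag)   # mag = m * 2**exp with 0.5 <= m < 1
        step = 2.0 ** (exp - 4)
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def dynamic_fp8_quantize(tensor):
    """Per-tensor dynamic quantization: the scale is computed from the tensor at run time."""
    amax = max(abs(v) for v in tensor)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_fp8_e4m3(v / scale) for v in tensor], scale


activations = [0.02, -1.3, 3.7, 0.5]  # made-up values
q, scale = dynamic_fp8_quantize(activations)
restored = [v * scale for v in q]
print([round(r, 4) for r in restored])
```

Because the scale is derived from the tensor itself, no calibration dataset is needed, which matches the description above.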
GPTQ Quantization in Practice (W4A16):
This section uses TinyLlama-1.1B-Chat-v1.0 as an example to walk through the entire quantization process.
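Model Quantization Workflow
1. First, install a GPTQ quantization tool and download the TinyLlama-1.1B model weights. The script in step 2 imports the gptqmodel package, so that is the package installed here.
pip install gptqmodel datasets transformers
2. Next, run the following script to perform the whole quantization process (the required calibration dataset is downloaded automatically). For calibration, prefer a domain-specific dataset that matches the model's use case. If none is available, use the model's pretraining or fine-tuning dataset.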
import logging

import torch
from datasets import load_dataset
# This script follows the early gptqmodel API (from_pretrained / save_quantized /
# from_quantized); method names may differ in newer releases.
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

pretrained_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quantized_model_id = "TinyLlama-1.1B-Chat-v1.0-4bit-128g"


def get_wikitext2(tokenizer, nsamples, seqlen):
    # Keep only samples with at least `seqlen` characters, then tokenize the first `nsamples`.
    traindata = load_dataset("wikitext", "wikitext-2-raw-v1", split="train").filter(
        lambda x: len(x["text"]) >= seqlen)

    return [tokenizer(example["text"]) for example in traindata.select(range(nsamples))]


@torch.no_grad()
def calculate_avg_ppl(model, tokenizer):
    from gptqmodel.utils import Perplexity

    ppl = Perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_path="wikitext",
        dataset_name="wikitext-2-raw-v1",
        split="train",
        text_column="text",
    )

    # One perplexity value per evaluated chunk (renamed from `all`, which shadows the builtin).
    ppl_values = ppl.calculate(n_ctx=512, n_batch=512)

    # Average perplexity
    return sum(ppl_values) / len(ppl_values)


def main():
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_id, use_fast=True)

    traindataset = get_wikitext2(tokenizer, nsamples=256, seqlen=1024)

    quantize_config = QuantizeConfig(
        bits=4,  # quantize the model to 4-bit
        group_size=128,  # 128 is the recommended value
        desc_act=False,  # disable activation reordering
    )

    # Load the unquantized model; it is always loaded onto the CPU first.
    model = GPTQModel.from_pretrained(pretrained_model_id, quantize_config)

    # Quantize the model. The calibration dataset should be a list of dicts whose only
    # keys are "input_ids" and "attention_mask".
    model.quantize(traindataset)

    # Save the quantized model
    model.save_quantized(quantized_model_id)

    # Save the quantized model using safetensors
    model.save_quantized(quantized_model_id, use_safetensors=True)

    # Load the quantized model; currently only CPU or a single GPU is supported.
    model = GPTQModel.from_quantized(quantized_model_id, device="cuda:0")

    # Run inference with model.generate
    print(tokenizer.decode(model.generate(**tokenizer("test is", return_tensors="pt").to("cuda:0"))[0]))

    print(f"Quantized Model {quantized_model_id} avg PPL is {calculate_avg_ppl(model, tokenizer)}")


if __name__ == "__main__":
    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        level=logging.INFO,
        datefmt="%Y-%m-%d %H:%M:%S",
    )

    main()

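The following describes how to choose the quantization parameters:
--bits: Bit width for weight quantization. Choose according to your needs: 4 saves the most GPU memory but has a larger impact on accuracy. To balance memory and accuracy, 8 is recommended, at which point there is essentially no accuracy loss.
--group_size: Size of each quantization group. Smaller groups give higher accuracy but increase inference cost. 128 is recommended.
--desc_act: Whether to use activation reordering. Enabling it improves quantization accuracy but increases inference cost. False is recommended.
--nsamples: Number of samples in the calibration dataset. Too many samples increase quantization time and also alter the weight distribution. 256 is recommended.
--seqlen: Length of each calibration sample. Excessive length increases quantization time and also alters the weight distribution. 2048 is recommended for 7B models, and 4096 for models of 70B and above.
bits = [4, 8]
group_size = [64, 128]
nsamples = [256, 512]
seqlen = [2048, 4096]
desc_act = [True, False]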
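If you want to compare several settings, the candidate values can be expanded into a grid of configurations. The sketch below also estimates the storage cost of `group_size`, assuming one fp16 scale and one `bits`-wide zero point per group. That layout is an assumption modeled on common GPTQ checkpoints and is not stated in this document.

```python
from itertools import product

bits = [4, 8]
group_size = [64, 128]
nsamples = [256, 512]
seqlen = [2048, 4096]
desc_act = [True, False]

# Every combination of the candidate values.
grid = [
    dict(bits=b, group_size=g, nsamples=n, seqlen=s, desc_act=d)
    for b, g, n, s, d in product(bits, group_size, nsamples, seqlen, desc_act)
]


def approx_bits_per_weight(bits: int, group_size: int) -> float:
    """Assumes one fp16 scale and one `bits`-wide zero point stored per group."""
    return bits + (16 + bits) / group_size


print(f"{len(grid)} configurations")
for b, g in product(bits, group_size):
    print(f"bits={b}, group_size={g}: ~{approx_bits_per_weight(b, g):.3f} bits per weight")
```

Under this assumption a smaller group size adds a small storage overhead, which is one part of the extra cost mentioned for group_size.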