计算加速套件 TACO Kit Auto Prefix Cach

在多轮对话、系统提示以及其他具有大量共同前缀的应用场景中，自动前缀缓存功能能够存储之前的前缀键值缓存，加快后续具有相同前缀请求的预填充阶段，从而降低首次响应时间，优化处理能力。
原理
LLM 推理计算主要分为两个过程：Prefill 阶段（Prompt 计算）和 Decode 阶段。这两个阶段的计算特性存在不同，Prefill 阶段是计算受限的，而 Decode 阶段是访存受限的。为了避免重复计算，Prefill 阶段主要作用就是给 Decode 阶段准备 KV Cache。但这些 KV Cache 通常只是为单条推理请求服务的，当请求结束，对应的 KV-Cache 就会清除。那很自然的一种想法就是，KV Cache 能不能跨请求复用？在某些 LLM 业务场景下，多次请求的 Prompt 可能会共享同一个前缀（Prefix），比如少量样本学习，多轮对话等。在这些情况下，很多请求 Prompt 的前缀的 KV Cache 计算的结果是相同的，可以被缓存起来，给之后的请求复用。TACO-LLM 的 Auto Prefix Cache 技术可以针对这种场景进行优化，使得具有相同 Prompt 前缀的 KV-Cache 可以跨请求复用，降低计算开销，提升推理计算性能。
﻿
具体详情请参见：Fast and Expressive LLM Inference with RadixAttention and SGLang。
启动选项
--enable-prefix-caching                                                                                      
                        Enables automatic prefix caching.
在 server 启动指令中添加该指令即可开启 Auto Prefix Caching 功能。
Prefix Cache Offload
显卡的显存有限，除装载模型权重和运行激活空间以外，留给 prefix kv cache 的 blocks 数量是固定而有限的。当不同的 prefix 请求较多，随着请求的不断输入，之前的 prefix cache 就会被驱逐，之后有相同 prefix 的请求将无法命中 prefix cache。这导致需要重新执行 prefill 流程，从而无法获得加速效果。
TACO LLM 提供 prefix cache offload 功能，在显存 prefix cache 被驱逐时 offload 到 cpu 内存上，命中时 load 回 gpu 中，进而加速之前 prefix cache 被驱逐而得不到加速的请求。
Offload 选项
  --enable-prefix-cache-offload
                        Enables prefix cache offloading
  --cpu-prefill-memory-utilization CPU_PREFILL_MEMORY_UTILIZATION    
                        the memory is used for prefill cache, which can range
                        from 0 to 1. If unspecified, will use the default
                        value of 0.3. 
  --apc-offload-not-lazy
                        If set, lazy launch of layer 2~n-1 will be disabled.
  --apc-offload-min-access-threshold APC_OFFLOAD_MIN_ACCESS_THRESHOLD
                        Min threshold for evict offloading. Default 1.
  --apc-offload-enable-hit-cnt
                        Enable hit count in APC.
  --apc-offload-gpu-evictor-limit APC_OFFLOAD_GPU_EVICTOR_LIMIT
                        The free table size limited in gpu evictor. -1 default
--enable-prefix-cache-offload
在开启了 APC 的基础上打开该开关，即可启用 prefix cache offload 功能。
--cpu-prefill-memory-utilization
用于 kv cache 的 cpu 内存比例，默认0.3，按卡均分。
Note：该比例按机器资源内存（psutil.virtual_memory.total）进行计算，与 lookahead cache 的 --cpu-decoding-memory-utilization（默认0.15）类似。请预留足够内存或配置相应比例值。
--apc-offload-not-lazy
是否关闭 offload 按层延迟启动，仅作调试用途。
  --apc-offload-min-access-threshold
一个 block 会被 offload 的最小使用阈值，默认为1，即所有 block 都会被 offload。增加此值，被多次使用的 block 才会被 offload。
--apc-offload-enable-hit-cnt
打开 prefix cache offload 的命中率 log，每100个 block 打印一次。
--apc-offload-gpu-evictor-limit
gpu evictor 的 free table 大小，默认为-1不生效，设置具体值限制 gpu上prefix cache 容量，仅作调试用途。
场景
Auto Prefix Caching 适用于 common prefix 较多，prefill 计算占比较大的场景，例如 多轮对话，system prompt，代码补全等等。此类场景打开 --enable-prefix-caching 即可加速命中 gpu prefix cache 的请求的 prefill 阶段。
GPU 容纳 prefix cache 的空间是有限的，当 common prefix 的累积量比较多，相同 common prefix 的请求间相隔比较远的场景下，之前保留的 prefix cache已经被驱逐而无法获得收益。
比如 GPU 的 block 数为100，有11个不同 prefix 的请求[Q1, ..., Q11]，每个对应 prefix block 数为10，则Q11结束后Q1的 prefix cache 会被驱逐。即使Q12跟Q1的 prefix 一致，也无法获得加速。
此时 prefix cache offload 通过额外的内存空间保留被驱逐的 prefix cache，进而加速Q12。
可开启 --apc-offload-enable-hit-cnt 参数，通过 log cpu 判断是否有 offload 的 prefix cache 得到命中。该 log 统计所有 allocate的block 的 hit 情况。
[HIT CNT] total: 177800, gpu: 21944 (12.34%) cpu: 66166 (37.21%) not hit: 89690 (50.44%)
﻿
Auto Prefix Caching

本页目录：

原理

启动选项

Prefix Cache Offload

Offload 选项

场景