LLM
LLM constructor parameters
class taco_llm.LLM(
    model: str,
    tokenizer: Optional[str] = None,
    tokenizer_mode: str = "auto",
    skip_tokenizer_init: bool = False,
    trust_remote_code: bool = False,
    tensor_parallel_size: int = 1,
    dtype: str = "auto",
    quantization: Optional[str] = None,
    revision: Optional[str] = None,
    tokenizer_revision: Optional[str] = None,
    seed: int = 0,
    gpu_memory_utilization: float = 0.9,
    swap_space: float = 4,
    cpu_offload_gb: float = 0,
    enforce_eager: Optional[bool] = None,
    max_context_len_to_capture: Optional[int] = None,
    max_seq_len_to_capture: int = 8192,
    disable_custom_all_reduce: bool = False,
    disable_async_output_proc: bool = False,
    **kwargs,
)
"""This class includes a tokenizer, a language model (possibly distributed
across multiple GPUs), and GPU memory space allocated for intermediate
states (aka KV cache). Given a batch of prompts and sampling parameters,
this class generates texts from the model, using an intelligent batching
mechanism and efficient memory management.

Args:
    model: The name or path of a HuggingFace Transformers model.
    tokenizer: The name or path of a HuggingFace Transformers tokenizer.
    tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer
        if available, and "slow" will always use the slow tokenizer.
    skip_tokenizer_init: If true, skip initialization of tokenizer and
        detokenizer. Expect valid prompt_token_ids and None for prompt
        from the input.
    trust_remote_code: Trust remote code (e.g., from HuggingFace) when
        downloading the model and tokenizer.
    tensor_parallel_size: The number of GPUs to use for distributed
        execution with tensor parallelism.
    dtype: The data type for the model weights and activations. Currently,
        we support `float32`, `float16`, and `bfloat16`. If `auto`, we use
        the `torch_dtype` attribute specified in the model config file.
        However, if the `torch_dtype` in the config is `float32`, we will
        use `float16` instead.
    quantization: The method used to quantize the model weights. Currently,
        we support "awq", "gptq", and "fp8" (experimental). If None, we
        first check the `quantization_config` attribute in the model config
        file. If that is None, we assume the model weights are not
        quantized and use `dtype` to determine the data type of the weights.
    revision: The specific model version to use. It can be a branch name,
        a tag name, or a commit id.
    tokenizer_revision: The specific tokenizer version to use. It can be a
        branch name, a tag name, or a commit id.
    seed: The seed to initialize the random number generator for sampling.
    gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to
        reserve for the model weights, activations, and KV cache. Higher
        values will increase the KV cache size and thus improve the model's
        throughput. However, if the value is too high, it may cause
        out-of-memory (OOM) errors.
    swap_space: The size (GiB) of CPU memory per GPU to use as swap space.
        This can be used for temporarily storing the states of the requests
        when their `best_of` sampling parameters are larger than 1. If all
        requests will have `best_of=1`, you can safely set this to 0.
        Otherwise, too small values may cause out-of-memory (OOM) errors.
    cpu_offload_gb: The size (GiB) of CPU memory to use for offloading
        the model weights. This virtually increases the GPU memory space
        you can use to hold the model weights, at the cost of CPU-GPU data
        transfer for every forward pass.
    enforce_eager: Whether to enforce eager execution. If True, we will
        disable CUDA graph and always execute the model in eager mode.
        If False, we will use CUDA graph and eager execution in hybrid.
    max_context_len_to_capture: Maximum context len covered by CUDA graphs.
        When a sequence has context length larger than this, we fall back
        to eager mode (DEPRECATED. Use `max_seq_len_to_capture` instead).
    max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
        When a sequence has context length larger than this, we fall back
        to eager mode.
    disable_custom_all_reduce: See ParallelConfig.
    **kwargs: Arguments for :class:`taco_llm.EngineArgs`.

Note:
    This class is intended to be used for offline inference. For online
    serving, use the :class:`taco_llm.AsyncLLMEngine` class instead.
"""
TACO-LLM supports both offline and online modes, and the two modes share the same parameter configuration. Therefore, in addition to the parameters listed above, you can also set any parameter supported by TACO-LLM's online mode. For the complete parameter reference, see the Online Mode API chapter.
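As a quick illustration, the following is a minimal construction sketch based on the signature above. The model name is a placeholder; replace it with a local path or HuggingFace model ID of your choice, and adjust the other arguments to your hardware.

from taco_llm import LLM

# Placeholder model ID -- substitute your own model path or HuggingFace repo.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    tensor_parallel_size=1,        # number of GPUs for tensor parallelism
    dtype="auto",                  # follow torch_dtype from the model config
    gpu_memory_utilization=0.9,    # fraction of GPU memory for weights + KV cache
    trust_remote_code=True,        # needed for models that ship custom code
)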
chat interface
def chat(
    self,
    messages: List[ChatCompletionMessageParam],
    sampling_params: Optional[Union[SamplingParams,
                                    List[SamplingParams]]] = None,
    use_tqdm: bool = True,
    lora_request: Optional[LoRARequest] = None,
    chat_template: Optional[str] = None,
    add_generation_prompt: bool = True,
) -> List[RequestOutput]:
    """Generate responses for a chat conversation.

    The chat conversation is converted into a text prompt using the
    tokenizer and calls the :meth:`generate` method to generate the
    responses.

    Multi-modal inputs can be passed in the same way you would pass them
    to the OpenAI API.

    Args:
        messages: A single conversation represented as a list of messages.
            Each message is a dictionary with 'role' and 'content' keys.
        sampling_params: The sampling parameters for text generation.
            If None, we use the default sampling parameters. When it
            is a single value, it is applied to every prompt. When it
            is a list, the list must have the same length as the
            prompts and it is paired one by one with the prompt.
        use_tqdm: Whether to use tqdm to display the progress bar.
        lora_request: LoRA request to use for generation, if any.
        chat_template: The template to use for structuring the chat.
            If not provided, the model's default chat template will be used.
        add_generation_prompt: If True, adds a generation template
            to each message.

    Returns:
        A list of ``RequestOutput`` objects containing the generated
        responses in the same order as the input messages.
    """
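A minimal usage sketch of chat is shown below, reusing the llm instance constructed earlier. The messages and sampling values are illustrative, and the SamplingParams import path is an assumption that mirrors vLLM's convention.

from taco_llm import SamplingParams

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce TACO-LLM in one sentence."},
]

# The conversation is rendered with the model's chat template, then generated.
outputs = llm.chat(
    messages,
    sampling_params=SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)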
generate interface
def generate(
    self,
    prompts: Union[Union[PromptInputs, Sequence[PromptInputs]],
                   Optional[Union[str, List[str]]]] = None,
    sampling_params: Optional[Union[SamplingParams,
                                    Sequence[SamplingParams]]] = None,
    prompt_token_ids: Optional[Union[List[int], List[List[int]]]] = None,
    use_tqdm: bool = True,
    lora_request: Optional[Union[List[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
    guided_options_request: Optional[Union[LLMGuidedOptions,
                                           GuidedDecodingRequest]] = None,
) -> List[RequestOutput]:
    """Generates the completions for the input prompts.

    This class automatically batches the given prompts, considering
    the memory constraint. For the best performance, put all of your
    prompts into a single list and pass it to this method.

    Args:
        inputs: A list of inputs to generate completions for.
        sampling_params: The sampling parameters for text generation. If
            None, we use the default sampling parameters.
            When it is a single value, it is applied to every prompt.
            When it is a list, the list must have the same length as the
            prompts and it is paired one by one with the prompt.
        use_tqdm: Whether to use tqdm to display the progress bar.
        lora_request: LoRA request to use for generation, if any.
        prompt_adapter_request: Prompt Adapter request to use for
            generation, if any.

    Returns:
        A list of ``RequestOutput`` objects containing the
        generated completions in the same order as the input prompts.

    Note:
        Using ``prompts`` and ``prompt_token_ids`` as keyword parameters is
        considered legacy and may be deprecated in the future. You should
        instead pass them via the ``inputs`` parameter.
    """
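The sketch below shows batched completion with generate, again reusing the llm instance and SamplingParams import from the earlier examples; the prompts and sampling values are illustrative only. Passing all prompts in a single call follows the docstring's advice and lets the engine batch them for best throughput.

prompts = [
    "The capital of France is",
    "Write a one-line haiku about GPUs:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# All prompts are submitted together so the engine can batch them efficiently.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)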