vLLM Source Code Walkthrough (4): Loading LLM Model Weights and Initializing the KV Cache

weixin_42479327 · Published 2024-09-08 13:36:16

7 Model Initialization

[Figure: vLLM architecture diagram. The image comes from a Bilibili video whose original I can no longer find.]

Let's first look at how an LLM is wired into vLLM.

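from transformers import AutoTokenizer
from vllm import LLM

# model_path: local path to the downloaded Hugging Face model files
llm = LLM(model=model_path,
          dtype='half',
          enable_prefix_caching=False,
          # Alternative spelling of the same dtype:
          # dtype='float16',
          # Shard the model across n GPUs instead of running n full copies:
          # tensor_parallel_size=1,
          # Cap GPU memory usage at 70%:
          # gpu_memory_utilization=0.7,
          )
tokenizer = AutoTokenizer.from_pretrained(model_path)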

This is the entry point of the model; model_path points to the downloaded Hugging Face model files.

class LLM:
	...
    def __init__(
            self,
			...
    ) -> None:
		...
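        # Map the user-facing arguments onto EngineArgs fields without changing them, so they are easier to manage later
        engine_args = EngineArgs(...)
        # Build the LLMEngine instance from the configured engine args
        self.llm_engine = LLMEngine.from_engine_args(engine_args, usage_context=UsageContext.LLM_CLASS)
        # Counter for globally unique ids: each prompt counts as one request (a batch may hold several prompts) and gets its own id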
        self.request_counter = Counter()

As you can see, the engine is created through from_engine_args. Let's keep going:

    def from_engine_args(
            cls,
            engine_args: EngineArgs,
            usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
            stat_loggers: Optional[Dict[str, StatLoggerBase]] = None,
    ) -> "LLMEngine":
        """Creates an LLM engine from the engine arguments."""
        # Create the engine configs.
        engine_config = engine_args.create_engine_config()
        executor_class = cls._get_executor_cls(engine_config)
        # Create the LLM engine.
        engine = cls(
                **engine_config.to_dict(),
                executor_class=executor_class,
                log_stats=not engine_args.disable_log_stats,
                usage_context=usage_context,
                stat_loggers=stat_loggers,
        )

        return engine

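The arguments passed to from_engine_args were shown in a screenshot in the original post (image not available here).
The cls(…) call resolves to the following code: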

class LLMEngine:
	...
    def __init__(
            self,
			...
    ) -> None:
		...
        if not self.model_config.skip_tokenizer_init:
            self.tokenizer = self._init_tokenizer()
            self.detokenizer = Detokenizer(self.tokenizer)
        else:
            self.tokenizer = None
            self.detokenizer = None
		...
        self.model_executor = executor_class(
                model_config=model_config,
                cache_config=cache_config,
                parallel_config=parallel_config,
                scheduler_config=scheduler_config,
                device_config=device_config,
                lora_config=lora_config,
                multimodal_config=multimodal_config,
                speculative_config=speculative_config,
                load_config=load_config,
                prompt_adapter_config=prompt_adapter_config,
        )

        if not self.model_config.embedding_mode:
            self._initialize_kv_caches()
		...
        # pipeline_parallel_size: number of pipeline-parallel stages; the available physical blocks are split evenly among them
        # One Scheduler is kept per stage, so self.scheduler is a list of schedulers
        self.scheduler = [
            Scheduler(
                    scheduler_config, cache_config, lora_config,
                    parallel_config.pipeline_parallel_size
            ) for _ in range(parallel_config.pipeline_parallel_size)
        ]
		...

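So vLLM initializes four main modules at startup: tokenizer, model_executor (which converts the Hugging Face transformers model into vLLM's own model), self._initialize_kv_caches (KV block initialization), and scheduler. These are also clearly visible in the architecture diagram at the start of this chapter.

The tokenizer is fairly simple and is skipped here; the scheduler was covered in the second article of this series.

Let's look at what model_executor and _initialize_kv_caches actually do. These two pieces are the core code you will touch later when manually adding a new model to vLLM (model_executor) or tuning vLLM inference performance (_initialize_kv_caches).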

7.1 model_executor

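executor_class inherits from the base class ExecutorBase. Several executors are available (cpu_executor, gpu_executor, tpu_executor, …); which one is used depends on the current device type or on an explicitly specified executor. We use gpu_executor as the example; the others work in much the same way.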

class GPUExecutor(ExecutorBase):
    uses_ray: bool = False

    def _init_executor(self) -> None:
        """Initialize the worker and load the model.
        """
        assert self.parallel_config.world_size == 1, (
            "GPUExecutor only supports single GPU.")

        self.driver_worker = self._create_worker()
        self.driver_worker.init_device()
        self.driver_worker.load_model()
	...
    def execute_model(
            self, execute_model_req: ExecuteModelRequest
    ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
        output = self.driver_worker.execute_model(execute_model_req)
        return output
	...

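self.driver_worker is an instance of Worker (vllm/worker/worker.py). Each GPU maintains its own Worker instance, which manages the KV cache and runs the model on that GPU. In distributed inference, each worker is assigned a part of the model (different heads are computed in parallel and the results are then combined).

self.driver_worker.load_model() is the method that loads the model, but you have to follow several layers of delegation before reaching the code that actually initializes it: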

vllm/model_executor/model_loader/loader.py class DefaultModelLoader

    def load_model(self, *, model_config: ModelConfig,
                   device_config: DeviceConfig,
                   lora_config: Optional[LoRAConfig],
                   multimodal_config: Optional[MultiModalConfig],
                   parallel_config: ParallelConfig,
                   scheduler_config: SchedulerConfig,
                   cache_config: CacheConfig) -> nn.Module:
        target_device = torch.device(device_config.device)
        with set_default_torch_dtype(model_config.dtype):
            with target_device:
                model = _initialize_model(model_config, self.load_config,
                                          lora_config, multimodal_config,
                                          cache_config, scheduler_config)
            model.load_weights(
                # Iterate over the weights in the model.safetensors checkpoint file
                self._get_weights_iterator(model_config.model,
                                           model_config.revision,
                                           fall_back_to_pt=getattr(
                                               model,
                                               "fall_back_to_pt_during_load",
                                               True)), )

            for _, module in model.named_modules():
                quant_method = getattr(module, "quant_method", None)
                if quant_method is not None:
                    # When quant methods need to process weights after loading
                    # (for repacking, quantizing, etc), they expect parameters
                    # to be on the global target device. This scope is for the
                    # case where cpu offloading is used, where we will move the
                    # parameters onto device for processing and back off after.
                    with device_loading_context(module, target_device):
                        quant_method.process_weights_after_loading(module)
        return model.eval()

Let's walk through the two main functions involved:

vllm/model_executor/model_loader/loader.py

def _initialize_model(
        model_config: ModelConfig,
        load_config: LoadConfig,
        lora_config: Optional[LoRAConfig],
        multimodal_config: Optional[MultiModalConfig],
        cache_config: CacheConfig,
        scheduler_config: Optional[SchedulerConfig] = None) -> nn.Module:
    """Initialize a model with the given configurations."""
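    # Read config['architectures'] from the config shipped with the HF model to resolve the model class
    model_class = get_model_architecture(model_config)[0]
    # Fetch the quantization config; it is not used in this walkthrough (the quant_config argument below is commented out)
    quant_config = _get_quantization_config(model_config, load_config)
    # Instantiate the model from vllm/model_executor/models/llama.py (this is vLLM's reworked version of the architecture)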
    return model_class(config=model_config.hf_config,
                       # cache_config=cache_config,
                       # quant_config=quant_config,
                       **_get_model_initialization_kwargs(
                           model_class, lora_config, multimodal_config,
                           scheduler_config))

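_initialize_model reads the model name from the HF model's config,
then uses that name to load vLLM's reworked implementation of the model architecture.

We use llama as the example to show how the HF weights are loaded: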

vllm/model_executor/models/llama.py

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Name mapping between the vLLM and HF implementations of the model
        stacked_params_mapping = [
            # (param_name, shard_name, shard_id)
            # i.e. (vLLM name, HF name, shard_id)
            (".qkv_proj", ".q_proj", "q"),
            (".qkv_proj", ".k_proj", "k"),
            (".qkv_proj", ".v_proj", "v"),
            (".gate_up_proj", ".gate_proj", 0),
            (".gate_up_proj", ".up_proj", 1),
        ]
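        # Parameter names and tensors of vLLM's own llama implementation (at this point the values should still be randomly initialized)
        params_dict = dict(self.named_parameters())
        # Iterate over the name and weight of every parameter in the HF checkpoint
        for name, loaded_weight in weights:
            ...
            # (vLLM name, HF name, shard_id)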
            for (param_name, weight_name, shard_id) in stacked_params_mapping:
                if weight_name not in name:
                    continue
                # Replace the HF layer name with the corresponding vLLM layer name
                name = name.replace(weight_name, param_name)
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") and name not in params_dict:
                    continue

                if is_pp_missing_parameter(name, self):
                    continue
                # Look up the parameter of vLLM's llama model
                param = params_dict[name]
                weight_loader = param.weight_loader
                # Copy the HF weight into the matching vLLM parameter, which completes the mapping for this weight
                weight_loader(param, loaded_weight, shard_id)

                break
            else:
				...

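The load_weights method of vLLM's llama shown above (from what I have seen, load_weights is nearly identical across all decoder-only models) maps the differing parameter names between the vLLM model and the HF model, then assigns the HF weights to the vLLM model's parameters, matched by name. That completes the model conversion.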
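The renaming step can be tried in isolation. The sketch below reuses the mapping table from the excerpt above; map_hf_name is a hypothetical helper written for this illustration (it is not a vLLM function), and it only reproduces the string substitution, not the actual weight copy.

```python
from typing import List, Optional, Tuple, Union

ShardId = Union[str, int]

# Same table as in the load_weights excerpt: (vLLM name, HF name, shard id).
STACKED_PARAMS_MAPPING: List[Tuple[str, str, ShardId]] = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def map_hf_name(hf_name: str) -> Tuple[str, Optional[ShardId]]:
    """Hypothetical helper: return (vllm_param_name, shard_id).

    shard_id is None when the weight is not part of a fused parameter.
    """
    for param_name, weight_name, shard_id in STACKED_PARAMS_MAPPING:
        if weight_name in hf_name:
            return hf_name.replace(weight_name, param_name), shard_id
    return hf_name, None


for hf_name in (
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
):
    print(hf_name, "->", map_hf_name(hf_name))
```

Fused parameters come back with a shard id telling the loader which slice to fill, while names such as o_proj pass through unchanged.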

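Note: because a model contains different kinds of layers, weight_loader (vllm/model_executor/layers/linear.py) also comes in several variants, spread across different classes.

Taking the QKV conversion as an example, here is how weight_loader transforms the weights (the source is fairly involved, so only the logic is described):
In llama3.1, q, k and v are computed separately, roughly like this: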

        self.q = nn.Linear(dim, dim_q, bias=False)
        self.k = nn.Linear(dim, dim_kv, bias=False)
        self.v = nn.Linear(dim, dim_kv, bias=False)

vLLM, on the other hand, merges them into one projection, roughly like this:

        self.qkv = nn.Linear(dim, dim_q + 2 * dim_kv, bias=False)
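To make the merge concrete, here is a minimal sketch of how three separate projection weights could be copied into one fused weight. It uses plain Python lists instead of tensors, ignores tensor parallelism, and load_qkv_shard is a hypothetical stand-in for the real weight_loader, so treat it as an illustration of the row-offset idea only. It relies on the fact that an nn.Linear weight has shape [out_features, in_features], so q, k and v stack along the rows.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows = output features, like nn.Linear.weight


def make_matrix(rows: int, cols: int, fill: float) -> Matrix:
    """Build a rows x cols matrix filled with one value."""
    return [[fill] * cols for _ in range(rows)]


def load_qkv_shard(fused: Matrix, shard: Matrix, shard_id: str,
                   dim_q: int, dim_kv: int) -> None:
    """Hypothetical loader: copy one projection into its row range."""
    offsets: Dict[str, int] = {"q": 0, "k": dim_q, "v": dim_q + dim_kv}
    sizes: Dict[str, int] = {"q": dim_q, "k": dim_kv, "v": dim_kv}
    assert len(shard) == sizes[shard_id], "shard has unexpected row count"
    start = offsets[shard_id]
    for i, row in enumerate(shard):
        fused[start + i] = list(row)


# Toy sizes chosen for illustration only.
dim, dim_q, dim_kv = 4, 4, 2
fused_qkv = make_matrix(dim_q + 2 * dim_kv, dim, 0.0)
load_qkv_shard(fused_qkv, make_matrix(dim_q, dim, 1.0), "q", dim_q, dim_kv)
load_qkv_shard(fused_qkv, make_matrix(dim_kv, dim, 2.0), "k", dim_q, dim_kv)
load_qkv_shard(fused_qkv, make_matrix(dim_kv, dim, 3.0), "v", dim_q, dim_kv)

# First column of each row shows which shard ended up where.
print([row[0] for row in fused_qkv])
```

The shard_id from the mapping table is what selects the row offset, which is why each HF weight can be loaded independently and in any order.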

From this walkthrough we can see that a new model vLLM does not yet support can still be used in vLLM by manually adapting this model-loading code.

7.2 _initialize_kv_caches

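Its job is to work out the total number of blocks and how many of them are available.
In transformers, a regular k/v tensor has shape [batch_size, num_heads, len_k, head_dim] (during decoding, the newly computed k/v has len_k=1).
In vLLM, kv_cache_shape=[2, num_blocks, block_size, num_kv_heads, head_size].

The number of values stored in one block is (the 2 stands for K and V, which always come in pairs): 2 * block_size * num_kv_heads * head_size * num_layers.
Put differently, each token has both a K and a V, and each block holds the K/V of block_size tokens. One token's K/V takes 2 * num_kv_heads * head_size * num_layers * dtype_size bytes, so one block takes block_size * 2 * num_kv_heads * head_size * num_layers * dtype_size bytes in total.
num_layers is the number of layers in the model; every token must keep the K/V computed at every layer, otherwise the KV cache is incomplete.

Each K/V value takes dtype_size bytes (2 if the tensor dtype is float16, 4 if it is float32).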
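The formula is easy to check by hand. The following sketch simply evaluates it; the function name and the sample numbers are assumptions chosen for illustration, not values read from vLLM.

```python
def cache_block_size_bytes(block_size: int, num_kv_heads: int,
                           head_size: int, num_layers: int,
                           dtype_size: int) -> int:
    """Bytes needed by one KV-cache block, following the formula above."""
    key_elems = block_size * num_kv_heads * head_size  # keys of one layer
    per_layer_elems = 2 * key_elems                    # keys + values
    return per_layer_elems * num_layers * dtype_size


# Illustrative configuration: 16 tokens per block, 8 KV heads of size 128,
# 32 layers, float16 (2 bytes per value).
size = cache_block_size_bytes(block_size=16, num_kv_heads=8, head_size=128,
                              num_layers=32, dtype_size=2)
print(size, "bytes =", size / 2**20, "MiB")
```

With these assumed numbers one block costs 2 MiB, so every additional GiB reserved for the cache buys 512 more blocks.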

The size of one block is computed in the cache engine, and the engine method below drives the whole process:

vllm/worker/cache_engine.py
vllm/engine/llm_engine.py

    def _initialize_kv_caches(self) -> None:
        """Initialize the KV cache in the worker(s).

        The workers will determine the number of blocks in both the GPU cache
        and the swap CPU cache.
        """
        num_gpu_blocks, num_cpu_blocks = self.model_executor.determine_num_available_blocks()
		...
        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks

        self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)

The purpose of _initialize_kv_caches is to compute the number of GPU/CPU blocks and then initialize those blocks.

The block counts are computed by self.model_executor.determine_num_available_blocks():

vllm/worker/worker.py

    def determine_num_available_blocks(self) -> Tuple[int, int]:
        # Profile the memory usage of the model and get the maximum number of
        # cache blocks that can be allocated with the remaining free memory.
        torch.cuda.empty_cache()

        # Execute a forward pass with dummy inputs to profile the memory usage
        # of the model.
        # Build dummy inputs with the maximum allowed number of seqs and tokens, then run the model without a KV cache
        self.model_runner.profile_run()

        # Calculate the number of blocks that can be allocated with the
        # profiled peak memory.
        torch.cuda.synchronize()
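        # Read the free and total GPU memory; the memory taken by the profile run has not been released yet
        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
        # peak_memory is the GPU memory the model currently occupies (weights plus the profile run)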
        peak_memory = self.init_gpu_memory - free_gpu_memory
		...
        # GPU memory taken by a single block
        cache_block_size = self.get_cache_block_size_bytes()
        # Total number of GPU blocks that fit in the remaining budget
        num_gpu_blocks = int(
            (total_gpu_memory * self.cache_config.gpu_memory_utilization - peak_memory) // cache_block_size)
        # Number of CPU blocks: no profiling is needed here because the swap space has a fixed size
        num_cpu_blocks = int(self.cache_config.swap_space_bytes // cache_block_size)
        num_gpu_blocks = max(num_gpu_blocks, 0)
        num_cpu_blocks = max(num_cpu_blocks, 0)
        if self.model_runner.lora_manager:
            self.model_runner.remove_all_loras()
        gc.collect()
        torch.cuda.empty_cache()
        return num_gpu_blocks, num_cpu_blocks

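self.model_runner.profile_run() builds dummy data, runs one inference pass without a KV cache, and lets us record the GPU memory usage at that point.
The profile_run flow is as follows (there is too much code to paste here; it is not hard, so read the source if you want the details):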

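Building the dummy data
When the LLMEngine is initialized, two important parameters are supplied (in the current version both are managed by the budget):
max_num_seqs: the maximum number of seqs that can be processed in one inference step
max_num_batched_tokens: the maximum number of tokens that can be processed in one inference step

Both values are set by the caller; if they are not, the system picks defaults. So how is the data built from these two values?
Assume that during inference each seq handles max_num_batched_tokens // max_num_seqs tokens on average, and that the leftover tokens are handed out one each, starting from the first seq.
For example, with max_num_batched_tokens=10 and max_num_seqs=3, we get 3 seqs with lengths 4, 3 and 3.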
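The length calculation can be sketched in a few lines. dummy_seq_lens is a hypothetical helper, and the rule for the leftover tokens (one extra token per sequence, starting from the first) is my reading of profile_run rather than a quote from it; for the example above it gives the same 4, 3, 3 split.

```python
from typing import List


def dummy_seq_lens(max_num_batched_tokens: int, max_num_seqs: int) -> List[int]:
    """Split the token budget into max_num_seqs dummy sequence lengths."""
    base, remainder = divmod(max_num_batched_tokens, max_num_seqs)
    # The first `remainder` sequences each get one extra token.
    return [base + (1 if i < remainder else 0) for i in range(max_num_seqs)]


lens = dummy_seq_lens(max_num_batched_tokens=10, max_num_seqs=3)
print(lens, "total tokens:", sum(lens))
```

Whatever the split, the lengths always sum to max_num_batched_tokens, so the profile run exercises the worst case the scheduler can produce in one step.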

Running one inference pass on this dummy data tells us how much GPU memory the model uses.
(free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info())

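Working out how much GPU memory to give the KV cache:

Memory for the KV cache = total GPU memory * gpu_memory_utilization - memory used by one inference pass without a KV cache (the model itself plus the intermediate data produced during inference)

See the detailed comments in the code above.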
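As a worked example of that arithmetic, the sketch below mirrors the two integer divisions in determine_num_available_blocks. The function name and every number (24 GiB card, 16 GiB peak usage, 4 GiB swap space, 2 MiB blocks) are assumptions picked for illustration.

```python
from typing import Tuple

GiB = 1024 ** 3
MiB = 1024 ** 2


def num_available_blocks(total_gpu_memory: int, gpu_memory_utilization: float,
                         peak_memory: int, swap_space_bytes: int,
                         cache_block_size: int) -> Tuple[int, int]:
    """Return (num_gpu_blocks, num_cpu_blocks), both clamped at zero."""
    budget = total_gpu_memory * gpu_memory_utilization - peak_memory
    num_gpu_blocks = int(budget // cache_block_size)
    num_cpu_blocks = int(swap_space_bytes // cache_block_size)
    return max(num_gpu_blocks, 0), max(num_cpu_blocks, 0)


gpu_blocks, cpu_blocks = num_available_blocks(
    total_gpu_memory=24 * GiB,
    gpu_memory_utilization=0.9,
    peak_memory=16 * GiB,
    swap_space_bytes=4 * GiB,
    cache_block_size=2 * MiB,
)
print(gpu_blocks, cpu_blocks)
```

If the profiled peak already exceeds the utilization budget, the GPU block count clamps to zero, which is why lowering gpu_memory_utilization too far leaves no room for the cache.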

Allocating the KV cache
With the number of available blocks known, initialize_cache can now set up the KV cache used during vLLM inference.

vllm/worker/worker.py

    def _init_cache_engine(self):
        assert self.cache_config.num_gpu_blocks is not None
        self.cache_engine = [
            CacheEngine(self.cache_config, self.model_config,
                        self.parallel_config, self.device_config)
            for _ in range(self.parallel_config.pipeline_parallel_size)
        ]
        self.gpu_cache = [
            self.cache_engine[ve].gpu_cache
            for ve in range(self.parallel_config.pipeline_parallel_size)
        ]

The KV cache is ultimately initialized in CacheEngine's __init__() function. With all these nested layers, vLLM's architecture keeps getting more complex.

vllm/worker/cache_engine.py

    def _allocate_kv_cache(
        self,
        num_blocks: int,
        device: str,
    ) -> List[torch.Tensor]:
        """Allocates KV cache on the specified device."""
        # The shape depends on the attention backend, e.g. [2, num_blocks, block_size, num_kv_heads, head_size]
        kv_cache_shape = self.attn_backend.get_kv_cache_shape(
            num_blocks, self.block_size, self.num_kv_heads, self.head_size)
        
        pin_memory = is_pin_memory_available() if device == "cpu" else False
        kv_cache: List[torch.Tensor] = []
        # One tensor per layer: a token's complete KV cache consists of its K/V from every layer
        for _ in range(self.num_attention_layers):
            # null block in CpuGpuBlockAllocator requires at least that
            # block to be zeroed-out.
            # We zero-out everything for simplicity.
            kv_cache.append(
                torch.zeros(kv_cache_shape,
                            dtype=self.dtype,
                            pin_memory=pin_memory,
                            device=device))
        return kv_cache