1. Downloading Models from Hugging Face

Models are downloaded from Hugging Face. If the download speed is too slow, you can instead download them from ModelScope, which hosts many of the same models, or use a Hugging Face mirror.
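
If you go the ModelScope route, a minimal download sketch is shown below. This is an assumption-level example: it assumes the modelscope package is installed (pip install modelscope) and that the repository id on ModelScope is qwen/Qwen2-0.5B.

# Hedged sketch: download Qwen2-0.5B from ModelScope instead of Hugging Face.
from modelscope import snapshot_download

# Downloads the model files into the local cache and returns the directory path.
model_dir = snapshot_download("qwen/Qwen2-0.5B")
print(model_dir)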

Using Hugging Face's download commands (a Hugging Face account must be registered first):

Step 1: install git-lfs

curl https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs

Step 2: download the Qwen2-0.5B model

git lfs clone https://huggingface.co/Qwen/Qwen2-0.5B
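
As an alternative to git-lfs, the same repository can be fetched with the huggingface_hub Python library. A minimal sketch, assuming huggingface_hub is installed and using an example local directory:

# Hedged sketch: download the same repository with huggingface_hub instead of git-lfs.
from huggingface_hub import snapshot_download

# local_dir is an arbitrary example path; adjust it to your setup.
snapshot_download(repo_id="Qwen/Qwen2-0.5B", local_dir="models/Qwen2-0.5B")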

The downloaded model directory contains the following files (a small inspection sketch follows the list):

config.json  # Model configuration file containing the model's parameter settings, e.g. number of layers, hidden size, number of attention heads
generation_config.json   # Configuration related to text generation
merges.txt   # BPE merge rules produced when the tokenizer was trained
model.safetensors    # Model weights
tokenizer.json    # Tokenizer that maps tokens to numbers
tokenizer_config.json   # Tokenizer configuration, e.g. tokenizer type, vocabulary size, maximum sequence length, special tokens
vocab.json    # Vocabulary
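
To check the values mentioned above without loading the model weights, here is a minimal inspection sketch, assuming the files were downloaded to models/Qwen2-0.5B:

# Hedged sketch: read config.json via AutoConfig and print a few architecture fields.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("models/Qwen2-0.5B")
print(config.num_hidden_layers)    # number of decoder layers
print(config.hidden_size)          # hidden size
print(config.num_attention_heads)  # number of attention heads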

2. Model Inference with the Hugging Face Transformers Library

The Hugging Face Transformers library can be used for both training and inference.
The vLLM library is used only for inference.

This section uses a single A100-80G GPU for the inference experiment.

Note: using Qwen2 models requires updating the transformers library to a recent version.
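
A quick way to check which transformers version is installed (treat the exact minimum version as an assumption to confirm on the model card; roughly 4.37 or newer should support Qwen2):

# Hedged sketch: verify the installed transformers version before loading Qwen2.
import transformers

print(transformers.__version__)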

Code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the pretrained model from a local directory
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_path = "models/Qwen2-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
# Setting device_map="auto" would automatically spread the model across all available GPUs
print(f"model: {model}")

# Load the tokenizer
# The tokenizer splits a sentence into smaller pieces of text (tokens) and assigns each token a number called an input id, because the model only understands numbers.
# Each model has its own tokenizer vocabulary, so it is important to use the same tokenizer the model was trained with; otherwise the text will be misinterpreted.
tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=True, padding_side='left')
# add_eos_token=True: optional; appends an end-of-sequence token, which helps the model recognize where the sequence ends.
# padding_side='left': optional; specifies on which side padding is applied so that all sequences in a batch have the same length.

# Model input
input_text = "介绍一下悉尼这座城市。"  # "Please introduce the city of Sydney."

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt").to(device)
# return_tensors="pt": specifies the format of the returned sequences; "pt" stands for PyTorch tensors, i.e. the tokenizer returns PyTorch objects rather than TensorFlow ones

# Generate the text response
# max_new_tokens: the model generates at most 200 new tokens
outputs = model.generate(input_ids["input_ids"], max_new_tokens=200)
print(f"type(outputs) = {type(outputs)}")   # <class 'torch.Tensor'>
print(f"outputs.shape = {outputs.shape}")   # e.g. torch.Size([1, 95]); the length varies between runs and is the prompt length plus at most 200 new tokens

# Decode the output tokens back into text
decoded_outputs = tokenizer.decode(outputs[0])
print(f"decoded_outputs: {decoded_outputs}")

The model's text output is as follows:

decoded_outputs: 介绍一下悉尼这座城市。 悉尼这座城市位于澳大利亚东南部,是澳大利亚最大的城市之一。它是一个现代化的城市,拥有许多现代化的建筑和设施,如购物中心、博馆、剧院和音乐厅等。悉尼的气候宜人,四季分明,夏季炎热,冬季寒冷,适合旅游和度假。此外,悉尼还有许多著名的景点,如悉尼歌剧院、悉尼塔、悉尼海港大桥等,这些景点吸引来自世界各地的游客。<|endoftext|>

Qwen2-0.5B model architecture:

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): Linear(in_features=896, out_features=151936, bias=False)
)
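
The printed module tree can be cross-checked against the advertised 0.5B parameter count with a short sketch (reusing the model object loaded above):

# Hedged sketch: count the parameters of the loaded model.
total = sum(p.numel() for p in model.parameters())
embed = model.model.embed_tokens.weight.numel()  # 151936 x 896 token-embedding matrix
print(f"total parameters: {total/1e6:.1f}M")
print(f"embedding parameters: {embed/1e6:.1f}M")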

Reference: Hugging Face Transformers 萌新完全指南 (a complete beginner's guide to Hugging Face Transformers)

3. The Components of a Prompt: system, user, assistant

This section uses four A100-80G GPUs for the inference experiment with the Qwen2-72B-Instruct model; the code comes from the official Hugging Face model card: https://huggingface.co/Qwen/Qwen2-72B-Instruct

from transformers import AutoModelForCausalLM, AutoTokenizer
import os

# Use the first four GPUs
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
device = "cuda"

model_path="models/Qwen2-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,    # return the formatted prompt as a string instead of token ids
    add_generation_prompt=True   # append the assistant start tag so the model generates the reply
)
print("text: ",text)
"""
text: 
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
"""

model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("response: ",response)

"""
response (first run):

A large language model refers to a type of artificial intelligence system that has been trained on vast amounts of text data to understand and generate human-like language. These models use deep learning algorithms to analyze patterns in language and learn how words, phrases, and sentences are used in context.

The size of these models is typically measured by the number of parameters they have, which can range from millions to billions. The larger the model, the more data it has been trained on, and the better it can understand complex language structures and nuances.

Large language models have a wide range of applications, including natural language processing tasks such as language translation, text summarization, question answering, and chatbot development. They are also used in content generation, such as writing news articles, creating product descriptions, and generating social media posts.

However, large language models are not perfect and can sometimes generate inappropriate or biased responses due to the biases present in the training data. Therefore, it is important to carefully curate the training data and continuously monitor and refine the model's performance.


response (second run):

A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data to generate human-like language. These models use deep learning algorithms, specifically neural networks, to analyze and understand the patterns and structures in natural language.

The "large" in large language model refers to the size of the model's neural network, which can have billions of parameters or weights. This allows the model to capture complex relationships between words and phrases, making it capable of generating coherent and contextually appropriate responses to a wide range of inputs.

Large language models can be used for various tasks such as language translation, text summarization, question answering, and chatbot development. They can also be fine-tuned for specific domains or industries by training them on additional specialized datasets.

However, large language models are not without their limitations. They can sometimes generate biased or inappropriate content due to the biases present in the training data. Additionally, they require significant computational resources to train and operate, making them expensive to develop and maintain. Despite these challenges, large language models continue to advance rapidly and hold great promise for revolutionizing how we interact with computers and machines.
"""

Two points to note:

  • text: in the rendered text, system marks the system prompt, user marks the user prompt, and assistant marks where the model's output will go, which is why the assistant turn has only the start tag <|im_start|> and no end tag <|im_end|>
  • response: even though the input prompt is identical, the responses from different runs are not exactly the same; this reflects the diversity of the model's generation. The output is not necessarily produced by always picking the highest-probability token (which would make the result deterministic). For common sampling strategies, see "Common LLM inference sampling strategies: Top-k, Top-p, Temperature, Beam Search". A minimal sketch contrasting greedy decoding and sampling follows this list.
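
The sketch below makes the determinism point concrete by contrasting greedy decoding with sampling. It reuses the model, tokenizer, and model_inputs objects from the code above; the flag values are arbitrary examples.

# Hedged sketch: greedy decoding vs. sampling with transformers' generate().
# Greedy decoding always picks the highest-probability token, so repeated runs
# with the same prompt produce the same output.
greedy_ids = model.generate(model_inputs.input_ids, max_new_tokens=64, do_sample=False)

# Sampling draws from the probability distribution, so repeated runs can differ;
# temperature and top_p reshape that distribution.
sampled_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))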

4. Model Inference with vLLM

vLLM (virtual LLM) is an open-source framework for accelerating LLM inference. It uses PagedAttention to manage the tensors cached by the attention mechanism efficiently, achieving roughly 14-24x the throughput of Hugging Face Transformers. Its core features include optimized memory management, continuous batching, optimized CUDA kernels, and support for distributed inference. The PagedAttention memory-management technique stores the attention keys and values in non-contiguous GPU memory blocks, which reduces memory fragmentation and improves GPU memory utilization.

Official GitHub repository: https://github.com/vllm-project/vllm

A simple multi-GPU inference example for Qwen2-72B-Instruct:

from vllm import LLM, SamplingParams

model_path="/etc/ssd1/limining/models/Qwen2-72B-Instruct"
llm = LLM(model=model_path, trust_remote_code=False, tensor_parallel_size=8)
# trust_remote_code=False: do not run custom modeling code shipped with the model repository
# tensor_parallel_size=8: run tensor-parallel inference on 8 GPUs (otherwise it defaults to a single GPU)

# Chinese prompts may fail to display correctly because of tmux rendering issues
prompts = [
    "Please introduce the city of Sydney.",
    "Explain natural language processing.",
    "What is the capital of China?",
]

# Configure the model's output sampling
# temperature: controls the randomness of the generated text; higher values make the output more random, lower values make it favor the highest-probability tokens.
# top_p=0.95: nucleus (top-p) sampling; when generating the next token, only the smallest set of tokens whose cumulative probability reaches 95% is considered.
sampling_params = SamplingParams(temperature=0.9, top_p=0.95)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

"""
Prompt: 'Please introduce the city of Sydney.', Generated text: ' Sydney is the largest city in Australia and the capital of New South Wales. It'
Prompt: 'Explain natural language processing.', Generated text: ' Natural Language Processing (NLP) is a field of study that deals with the'
Prompt: 'What is the capital of China?', Generated text: ' The capital of China is Beijing. Would you like to know more about Beijing or'
"""

Besides temperature, SamplingParams also provides top_k and top_p, corresponding to top-k sampling and top-p (nucleus) sampling. They limit the number of candidate tokens the model considers at each generation step, improving generation efficiency and quality.

  • Top-k sampling: the model only considers the k tokens with the highest probabilities; this reduces computation but may sacrifice some diversity.
  • Top-p sampling: candidate tokens are selected by cumulative probability, taking the highest-probability tokens until their cumulative probability exceeds the threshold p. This keeps diversity while reducing computation; top_p=0.95 means that at each step only the smallest set of tokens whose cumulative probability reaches 95% is considered. A SamplingParams sketch follows this list.
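
The sketch below shows these knobs side by side. The parameter names are standard vLLM SamplingParams arguments, the concrete values and max_tokens are arbitrary examples, and llm is the object created earlier.

# Hedged sketch: two SamplingParams configurations for the same llm object.
from vllm import SamplingParams

# Near-greedy decoding: temperature=0 makes the output essentially deterministic.
greedy_params = SamplingParams(temperature=0.0, max_tokens=128)

# Restricted sampling: only the 50 most likely tokens (top_k) that also fall inside
# the 95% cumulative-probability nucleus (top_p) are candidates at each step.
diverse_params = SamplingParams(temperature=0.9, top_k=50, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Please introduce the city of Sydney."], diverse_params)
print(outputs[0].outputs[0].text)

Setting max_tokens explicitly also avoids the truncated completions seen in the output above, which stop early because SamplingParams uses a small default max_tokens.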

Pitfalls of vLLM multi-GPU inference

To run vLLM on specific GPUs (for example two cards), simply add os.environ['CUDA_VISIBLE_DEVICES'] = '6,7' at the very top of the script and set tensor_parallel_size=2.

Note that tensor_parallel_size=2 must be set explicitly; otherwise vLLM defaults to using only the first GPU listed in os.environ['CUDA_VISIBLE_DEVICES'].

A simple demo that pins the job to specific GPUs:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '6,7'  # set before vLLM initializes CUDA, as noted above
from vllm import LLM, SamplingParams

model_path = "models/Qwen2-0.5B"
llm = LLM(model=model_path, trust_remote_code=False, tensor_parallel_size=2)
# tensor_parallel_size: enables multi-GPU inference; otherwise only the first GPU in os.environ['CUDA_VISIBLE_DEVICES'] is used

prompts = [
    "Please introduce the city of Sydney.",
    "Explain natural language processing.",
    "What are the advantages of GPU training over CPU?"
]
sampling_params = SamplingParams(temperature=0.1, top_p=0.95)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n Answer: {generated_text!r}")