Getting Started with ChatGLM2-6B

Original article

Author: 码之有理

Last modified: 2024-03-13 12:31:04

Published in the column: AI技术探索和应用 (AI Technology Exploration and Application)

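ChatGLM2-6B performs well on Chinese text and is a further optimized successor to ChatGLM-6B. It can be deployed locally for experimentation.

Model Download and Debugging

Download the source code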

git clone https://github.com/THUDM/ChatGLM2-6B
cd ChatGLM2-6B
pip install -r requirements.txt

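Download the model

Official download address for users in mainland China: https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/

Method 1: download the model automatically from Hugging Face (the download itself works fine, but even with a quantized model the GPU may run out of memory at startup; launching through the Web Demo is recommended)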

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda')
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
# Example output: 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
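The second value returned by model.chat carries the conversation state, so a multi-turn dialogue only needs to pass it back in on the next call. A minimal sketch, assuming a model and tokenizer loaded as in the snippet above; run_dialogue is a hypothetical helper name and is not part of the repository:

```python
from typing import Any, List, Tuple


def run_dialogue(model: Any, tokenizer: Any, prompts: List[str]) -> List[Tuple[str, str]]:
    """Send prompts one after another, feeding each turn's history into the next.

    `model` is assumed to expose chat(tokenizer, prompt, history=...) returning
    (response, history), as in the snippet above. run_dialogue itself is a
    hypothetical helper for illustration.
    """
    history: list = []
    transcript: List[Tuple[str, str]] = []
    for prompt in prompts:
        # Reuse the history returned by the previous turn
        response, history = model.chat(tokenizer, prompt, history=history)
        transcript.append((prompt, response))
    return transcript
```

For example, run_dialogue(model, tokenizer, ["你好", "晚上睡不着应该怎么办"]) would return one (prompt, response) pair per turn.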

Method 2: to download the model from the Hugging Face Hub, first install Git LFS (yum install git-lfs), then run

git clone https://huggingface.co/THUDM/chatglm2-6b

Method 3: if downloading the checkpoint from the Hugging Face Hub is slow, you can download only the model implementation

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b

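Then download the model parameter files manually (for example from the Tsinghua Cloud address listed earlier) and place them in the local chatglm2-6b directory, replacing the placeholder files there.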
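Once the files are on disk, from_pretrained accepts the local directory in place of the Hub id. A minimal sketch, assuming the checkpoint lives in ./chatglm2-6b; resolve_model_source and load_local_or_hub are hypothetical helper names, and the config.json check is only an illustrative heuristic:

```python
from pathlib import Path

HUB_ID = "THUDM/chatglm2-6b"


def resolve_model_source(local_dir: str, hub_id: str = HUB_ID) -> str:
    """Return local_dir if it looks like a downloaded checkpoint, else the Hub id.

    Checking for config.json is an illustrative heuristic, not a full validation.
    """
    path = Path(local_dir)
    if path.is_dir() and (path / "config.json").is_file():
        return str(path)
    return hub_id


def load_local_or_hub(local_dir: str = "./chatglm2-6b"):
    """Load tokenizer and model from the resolved source (needs transformers and a GPU)."""
    from transformers import AutoModel, AutoTokenizer  # lazy import: heavy dependency

    source = resolve_model_source(local_dir)
    tokenizer = AutoTokenizer.from_pretrained(source, trust_remote_code=True)
    model = AutoModel.from_pretrained(source, trust_remote_code=True, device="cuda")
    return tokenizer, model.eval()
```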

Gradio Web Demo

python web_demo.py

Streamlit Web Demo

streamlit run web_demo2.py

Command-line Demo

python cli_demo.py

API Service

pip install fastapi uvicorn
python api.py

Test client

curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'
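The same request can be sent from Python using only the standard library. A minimal sketch that mirrors the curl command; build_chat_request and chat_once are hypothetical helper names, and the reply is assumed to be a JSON object whose exact fields depend on api.py:

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000"  # same address as the curl example


def build_chat_request(prompt: str, history: list, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request that the curl command sends."""
    body = json.dumps({"prompt": prompt, "history": history}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_once(prompt: str, history: list, url: str = API_URL) -> dict:
    """Send one request and return the decoded JSON reply.

    Assumption: the service answers with a JSON object.
    """
    with urllib.request.urlopen(build_chat_request(prompt, history, url), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, chat_once("你好", []) sends the same payload as the curl example.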

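OpenAI-style Streaming API Service

Three yield statements in openai_api.py need to be replaced as shown here; otherwise requests fail with pydantic-related errors. If the code you pulled already contains this change, skip this step.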

yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))

# replace with

yield "{}".format(chunk.model_dump_json(exclude_unset=True))

python openai_api.py
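The replacement is needed because pydantic v2 serializes models with model_dump_json(), while the original line uses the v1-style json() call. If the same file has to run under either major version, one option is a small wrapper. This is a sketch only; chunk_to_json is a hypothetical helper and not part of openai_api.py:

```python
from typing import Any


def chunk_to_json(chunk: Any) -> str:
    """Serialize a pydantic model under either major version.

    pydantic v2 models provide model_dump_json(); v1 models only provide json().
    chunk_to_json is a hypothetical helper for illustration.
    """
    if hasattr(chunk, "model_dump_json"):
        # pydantic v2 path, same call as the replacement line
        return chunk.model_dump_json(exclude_unset=True)
    # pydantic v1 path, same call as the original line
    return chunk.json(exclude_unset=True, ensure_ascii=False)
```

Each of the three statements would then read yield "{}".format(chunk_to_json(chunk)).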

Test client

# pip install "openai<1.0"  (this snippet uses the pre-1.0 openai.ChatCompletion interface)
import openai
if __name__ == "__main__":
    openai.api_base = "http://localhost:8000/v1"
    openai.api_key = "none"
    for chunk in openai.ChatCompletion.create(
        model="chatglm2-6b",
        messages=[
            {"role": "user", "content": "你好"}
        ],
        stream=True
    ):
        if hasattr(chunk.choices[0].delta, "content"):
            print(chunk.choices[0].delta.content, end="", flush=True)
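A client that does not use the openai package can read the stream directly. A minimal sketch, assuming the server emits Server-Sent Events lines of the form "data: {json}" and ends with "data: [DONE]", as OpenAI-style endpoints commonly do; that wire format is an assumption about openai_api.py, and delta_text is a hypothetical helper:

```python
import json
from typing import Optional


def delta_text(sse_line: str) -> Optional[str]:
    """Extract the incremental text from one SSE line, or None if there is none."""
    line = sse_line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines, comments, other SSE fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # assumed end-of-stream marker
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    # The first chunk may carry only a role and no content
    return (choices[0].get("delta") or {}).get("content")
```

Feeding every received line through delta_text and printing the non-None results reproduces the loop in the openai-based client.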

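Model Quantization

By default the model is loaded in FP16 precision, and running the code above needs roughly 13 GB of GPU memory. If your GPU memory is limited, you can try loading a quantized model instead: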

model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).cuda()
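The 13 GB figure can be sanity-checked with simple arithmetic: weights alone take (number of parameters) × (bits per parameter) / 8 bytes. A rough sketch, assuming about 6 billion parameters as the "6B" in the model name suggests; weight_memory_gb is a hypothetical helper:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: about 6 billion parameters, as the "6B" in the model name suggests.
NUM_PARAMS = 6e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

This prints 12.0 GB, 6.0 GB and 3.0 GB. These are weights-only lower bounds; real usage is higher because activations, the attention cache and framework overhead also need memory.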

Quantization level	Minimum GPU memory to encode a length-2048 input	Minimum GPU memory to generate a length-8192 output
FP16 / BF16	13.1 GB	12.8 GB
INT8	8.2 GB	8.1 GB
INT4	5.5 GB	5.1 GB
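The table can be turned into a simple lookup for choosing a quantization level from the memory available. A minimal sketch using the encoding column; pick_quantization is a hypothetical helper, and the thresholds are copied from the table rather than measured:

```python
from typing import Optional

# (level, minimum GB to encode a length-2048 input), copied from the table
MIN_MEMORY_GB = [("FP16", 13.1), ("INT8", 8.2), ("INT4", 5.5)]


def pick_quantization(free_gpu_gb: float) -> Optional[str]:
    """Return the highest-precision level whose table minimum fits, else None."""
    for level, needed in MIN_MEMORY_GB:  # ordered from highest to lowest precision
        if free_gpu_gb >= needed:
            return level
    return None
```

A card with 10 GB free would get "INT8" by this rule, and anything under 5.5 GB returns None, which points toward CPU inference.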

Quantization can also be done with chatglm.cpp.

GitHub: https://github.com/li-plus/chatglm.cpp

It supports streaming output.

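Model Deployment

CPU deployment

If you have no GPU, you can run inference on the CPU, although it will be slower. This needs roughly 32 GB of RAM; if you have less, you can use the quantized model chatglm2-6b-int4 instead. Load the model as follows: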

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()
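The choice between the full and the quantized checkpoint can be made from the amount of RAM available. A minimal sketch using the 32 GB figure quoted above as the threshold; choose_cpu_checkpoint and load_on_cpu are hypothetical helpers, and calling .float() on the INT4 checkpoint as well is an assumption:

```python
FULL_MODEL = "THUDM/chatglm2-6b"
INT4_MODEL = "THUDM/chatglm2-6b-int4"
FULL_MODEL_RAM_GB = 32  # approximate requirement quoted in the text


def choose_cpu_checkpoint(available_ram_gb: float) -> str:
    """Pick the full model when RAM allows, otherwise the INT4 checkpoint."""
    return FULL_MODEL if available_ram_gb >= FULL_MODEL_RAM_GB else INT4_MODEL


def load_on_cpu(available_ram_gb: float):
    """Load the chosen checkpoint for CPU inference (needs transformers and torch)."""
    from transformers import AutoModel, AutoTokenizer  # lazy import: heavy dependency

    model_id = choose_cpu_checkpoint(available_ram_gb)
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    # Assumption: .float() is applied to both checkpoints for CPU inference
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).float()
    return tokenizer, model.eval()
```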

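Multi-GPU deployment

If you have several GPUs but none of them has enough memory to hold the whole model, you can split the model across them. First install accelerate with pip install accelerate, then load the model as follows: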

from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)

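This deploys the model across two GPUs for inference. Change num_gpus to the number of GPUs you want to use. The model is split evenly by default; you can also pass a device_map argument to specify the placement yourself.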
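A device_map is a dictionary from module names to GPU indices. The sketch below builds an even split over the transformer layers only. The module prefix transformer.encoder.layers is an assumption about the model's internal naming and should be confirmed with model.named_modules(); even_device_map is a hypothetical helper, not the repository's own implementation:

```python
from typing import Dict


def even_device_map(num_layers: int, num_gpus: int,
                    layer_prefix: str = "transformer.encoder.layers") -> Dict[str, int]:
    """Assign consecutive transformer layers to GPUs in near-equal contiguous groups.

    layer_prefix is an assumed module name. Non-layer modules (embeddings, final
    norm, output head) are not covered here and would also need entries in a
    complete device_map.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Layer i goes to GPU floor(i * num_gpus / num_layers)
    return {
        f"{layer_prefix}.{i}": i * num_gpus // num_layers
        for i in range(num_layers)
    }
```

For instance, with an illustrative count of 28 layers and 2 GPUs, layers 0 to 13 map to GPU 0 and layers 14 to 27 to GPU 1. A hand-written map is mainly useful when the GPUs differ in free memory and an even split is not appropriate.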

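Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to request removal.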
