
A New Private, Offline Chat Experience! The llama-gpt Chatbot: Fast, Secure, Powered by Llama 2, with Code Llama Support!

Original article by 汀丶人工智能 | Published 2023-10-11 | Column: NLP/KG


A self-hosted, offline, ChatGPT-like chatbot, powered by Llama 2. 100% private, with no data leaving your device.

Demo

https://github.com/getumbrel/llama-gpt/assets/10330103/5d1a76b8-ed03-4a51-90bd-12ebfaf1e6cd

1. Supported Models

Currently, LlamaGPT supports the following models. Support for running custom models is on the roadmap.

| Model name | Model size | Model download size | Memory required |
|---|---|---|---|
| Nous Hermes Llama 2 7B Chat (GGML q4_0) | 7B | 3.79GB | 6.29GB |
| Nous Hermes Llama 2 13B Chat (GGML q4_0) | 13B | 7.32GB | 9.82GB |
| Nous Hermes Llama 2 70B Chat (GGML q4_0) | 70B | 38.87GB | 41.37GB |
| Code Llama 7B Chat (GGUF Q4_K_M) | 7B | 4.24GB | 6.74GB |
| Code Llama 13B Chat (GGUF Q4_K_M) | 13B | 8.06GB | 10.56GB |
| Phind Code Llama 34B Chat (GGUF Q4_K_M) | 34B | 20.22GB | 22.72GB |

1.1 Installing LlamaGPT on umbrelOS

Running LlamaGPT on an umbrelOS home server takes one click: simply install it from the Umbrel App Store.

1.2 Installing LlamaGPT on an M1/M2 Mac

Make sure you have Docker and Xcode installed.
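To confirm both prerequisites are in place, you can run a quick check in Terminal:

```bash
# Verify Docker is installed (start Docker Desktop if the daemon isn't running)
docker --version

# Verify Xcode / Command Line Tools are installed; prints the active developer directory
xcode-select -p
```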

Then, clone this repo and cd into it:

```bash
git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt
```

Run LlamaGPT with the following command:

```bash
./run-mac.sh --model 7b
```

You can access LlamaGPT at http://localhost:3000.

To run 13B or 70B chat models, replace 7b with 13b or 70b respectively. To run 7B, 13B or 34B Code Llama models, replace 7b with code-7b, code-13b or code-34b respectively.
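For example, to launch the 13B Code Llama model instead of the default chat model:

```bash
./run-mac.sh --model code-13b
```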

To stop LlamaGPT, press Ctrl + C in Terminal.

1.3 Installing with Docker

You can run LlamaGPT on any x86 or arm64 system. Make sure you have Docker installed.
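To confirm Docker is available and check which architecture you're on:

```bash
docker --version
uname -m   # x86_64 on x86 systems, aarch64/arm64 on ARM systems
```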

Then, clone this repo and cd into it:

```bash
git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt
```

Run LlamaGPT with the following command:

```bash
./run.sh --model 7b
```

Or if you have an Nvidia GPU, you can run LlamaGPT with CUDA support using the --with-cuda flag, like:

```bash
./run.sh --model 7b --with-cuda
```

You can access LlamaGPT at http://localhost:3000.

To run 13B or 70B chat models, replace 7b with 13b or 70b respectively. To run Code Llama 7B, 13B or 34B models, replace 7b with code-7b, code-13b or code-34b respectively.
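For example, to run the 34B Code Llama model with CUDA acceleration, combining the two flags described above:

```bash
./run.sh --model code-34b --with-cuda
```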

To stop LlamaGPT, press Ctrl + C in Terminal.

Note: On the first run, it may take a while for the model to be downloaded to the `/models` directory. You may also see lots of output like this for a few minutes, which is normal:

```
llama-gpt-llama-gpt-ui-1 | INFO wait Host llama-gpt-api-13b:8000 not yet available...
```

After the model has been automatically downloaded and loaded, and the API server is running, you'll see output like:

```
llama-gpt-ui_1 | ready - started server on 0.0.0.0:3000, url: http://localhost:3000
```

You can then access LlamaGPT at http://localhost:3000.


1.4 Installing on Kubernetes

First, make sure you have a running Kubernetes cluster and kubectl is configured to interact with it.
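You can verify that kubectl can reach your cluster with:

```bash
kubectl cluster-info
kubectl get nodes
```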

Then, clone this repo and cd into it.

To deploy to Kubernetes, first create a namespace:

```bash
kubectl create ns llama
```

Then apply the manifests under the `/deploy/kubernetes` directory with:

```bash
kubectl apply -k deploy/kubernetes/. -n llama
```

Expose your service however you would normally do that.
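For quick local testing, one option is `kubectl port-forward`. Note that the service name below is an assumption (the manifests in the repo define the actual names), so list the services first to confirm:

```bash
# List the services the manifests created in the llama namespace
kubectl get svc -n llama

# Forward the UI to localhost:3000; replace "llama-gpt-ui" with the
# actual service name from the listing above (the name here is an assumption).
kubectl port-forward svc/llama-gpt-ui 3000:3000 -n llama
```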

2. OpenAI-Compatible API

Thanks to llama-cpp-python, a drop-in replacement for OpenAI API is available at http://localhost:3001. Open http://localhost:3001/docs to see the API documentation.
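As a minimal sketch, you can exercise the chat completions endpoint with curl. The `/v1/chat/completions` path follows the OpenAI API convention; whether this single-model server requires a `model` field is an assumption worth checking against http://localhost:3001/docs:

```bash
curl http://localhost:3001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "How does the universe expand?"}
        ]
      }'
```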

Benchmarks

We've tested LlamaGPT models on the following hardware with the default system prompt and the user prompt "How does the universe expand?" at temperature 0, to guarantee deterministic results. Generation speed is averaged over the first 10 generations.

Feel free to add your own benchmarks to this table by opening a pull request.

2.1 Nous Hermes Llama 2 7B Chat (GGML q4_0)

| Device | Generation speed |
|---|---|
| M1 Max MacBook Pro (64GB RAM) | 54 tokens/sec |
| GCP c2-standard-16 vCPU (64 GB RAM) | 16.7 tokens/sec |
| Ryzen 5700G 4.4GHz 4c (16 GB RAM) | 11.50 tokens/sec |
| GCP c2-standard-4 vCPU (16 GB RAM) | 4.3 tokens/sec |
| Umbrel Home (16GB RAM) | 2.7 tokens/sec |
| Raspberry Pi 4 (8GB RAM) | 0.9 tokens/sec |

2.2 Nous Hermes Llama 2 13B Chat (GGML q4_0)

| Device | Generation speed |
|---|---|
| M1 Max MacBook Pro (64GB RAM) | 20 tokens/sec |
| GCP c2-standard-16 vCPU (64 GB RAM) | 8.6 tokens/sec |
| GCP c2-standard-4 vCPU (16 GB RAM) | 2.2 tokens/sec |
| Umbrel Home (16GB RAM) | 1.5 tokens/sec |

2.3 Nous Hermes Llama 2 70B Chat (GGML q4_0)

| Device | Generation speed |
|---|---|
| M1 Max MacBook Pro (64GB RAM) | 4.8 tokens/sec |
| GCP e2-standard-16 vCPU (64 GB RAM) | 1.75 tokens/sec |
| GCP c2-standard-16 vCPU (64 GB RAM) | 1.62 tokens/sec |

2.4 Code Llama 7B Chat (GGUF Q4_K_M)

| Device | Generation speed |
|---|---|
| M1 Max MacBook Pro (64GB RAM) | 41 tokens/sec |

2.5 Code Llama 13B Chat (GGUF Q4_K_M)

| Device | Generation speed |
|---|---|
| M1 Max MacBook Pro (64GB RAM) | 25 tokens/sec |

2.6 Phind Code Llama 34B Chat (GGUF Q4_K_M)

| Device | Generation speed |
|---|---|
| M1 Max MacBook Pro (64GB RAM) | 10.26 tokens/sec |

Original statement: This article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission. For infringement concerns, please contact cloudcommunity@tencent.com for removal.
