Setting up Meta's Llama 2 with llama.cpp

Zuck our beloved • November 25, 2023

Introduction

Meta's Llama family of large language models has become increasingly popular for text generation. Today it is easy to run these models locally thanks to projects like llama.cpp, a C/C++ port of Llama inference that supports, among other things, 4-bit integer quantization.


Preparing

First, figure out which model size you want to use, as described here. I'll use the 7B parameter model.
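
Roughly speaking, the quantized model file plus its context needs to fit in your available RAM (or VRAM), so it helps to know how much memory your machine has. A minimal check (the exact command depends on your OS):

sh
# Linux: total and available memory
free -h
# macOS: total physical memory in bytes
sysctl hw.memsize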

Ideally we want to skip the quantization and conversion steps, so I will use the pre-quantized GGUF models from TheBloke's HuggingFace repository.

There are plenty of build options to play around with for your system in llama.cpp's README, such as Metal support for macOS users and CUDA/cuBLAS support for Nvidia users.

Setting up

Clone the repository using git and cd into it:

sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Next, run the build command; again, check out the README for more build options:

sh
# M1/M2 Mac users:
# LLAMA_METAL=1 make
make
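
For the Nvidia GPUs mentioned earlier, the README (at the time of writing) documents a cuBLAS-accelerated build. A hedged sketch, assuming the CUDA toolkit is already installed:

sh
# Nvidia users: build with cuBLAS GPU acceleration
make LLAMA_CUBLAS=1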

Now, download your preferred model from HuggingFace into the models/ directory of the llama.cpp repo (GGML is no longer supported, so use GGUF files only). Note that other models based on Llama are also supported.

sh
REPO_ID="TheBloke/Llama-2-7B-Chat-GGUF"
FILE="llama-2-7b-chat.Q3_K_L.gguf"
curl -L "https://huggingface.co/${REPO_ID}/resolve/main/${FILE}" -o models/${FILE}
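
Alternatively, if you have the huggingface_hub CLI installed (pip install -U huggingface_hub), you can fetch the file with it. A hedged sketch, assuming a reasonably recent huggingface_hub version:

sh
# Download the same GGUF file via the Hugging Face CLI
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q3_K_L.gguf \
  --local-dir models --local-dir-use-symlinks False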

Congrats! 🎊

Using interactive mode

Use the main binary you built earlier to run the model interactively:

sh
./main -m ./models/${FILE} \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  -t 8

Tweak the sampling settings and the thread count (-t) to suit your hardware and preferences.
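
If you just want a one-shot answer instead of an interactive session, the same binary also accepts a prompt directly. A minimal sketch:

sh
# Non-interactive run: -p supplies the prompt, -n caps the number of generated tokens
./main -m ./models/llama-2-7b-chat.Q3_K_L.gguf \
  -p "Building a website can be done in 10 simple steps:" \
  -n 128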

Using the server API

You can do lots of things with this, like building Discord chatbots and other applications, or providing a simple OpenAI-compatible endpoint for others to use.

sh
./server -m ./models/llama-2-7b-chat.Q3_K_L.gguf \
  --ctx_size 2048 \
  -t 8
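
Before wiring up a client, you can sanity-check the running server from the command line. A hedged sketch against the server's native completion endpoint (default port 8080):

sh
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 64}'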

Now you can use the OpenAI-compatible endpoint with the official OpenAI client libraries:

python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)

print(completion.choices[0].message)