Update (Nov 11 2023)

I noticed that the default API behavior changed as of Nov 11 2023. The OpenAI API now tries to use the same port as the raw API, so it is advisable to enable only one of them. I also found that oobabooga's OpenAI API supports more options, such as specifying templates, so the project appears to be moving toward the OpenAI API approach.

How to get the weights

TheBloke is a good source for fine-tuned and quantized versions of many open-source models (as of 10/31/2023). To get your hands on the original LLaMA 2 weights as released by Meta, you should sign up for official access through this link. Once you link your Hugging Face account, this access also lets you use Meta research's Hugging Face repos.

Downloading & Loading the model

Downloading with WebUI

  1. Go to the model tab.
  2. Specify the model name (and/or branch) in the download section.
  3. Reload the models list, select the newest model and load it.
  • The loader is usually automatically detected by the dashboard. However, you may also manually switch to a different one.

Downloading with terminal

You can also clone the model weight repository into the models/ directory under the WebUI repository's root folder.

# --single-branch: you don't want to download every branch!
# --branch: optional, only if you are not using the main branch
# {MODEL_REPO_URL}: url to repo
# models/{MODEL_NAME}: path to download the repo to; this is the name WebUI will use for the model
git clone \
  --single-branch \
  --branch {BRANCH_NAME} \
  {MODEL_REPO_URL} \
  models/{MODEL_NAME}
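
If you would rather script the download, the same clone can be assembled in Python. This is only a sketch: the helper names build_clone_command and clone_model are my own, and it assumes git is installed and that the script runs from the WebUI root folder.

```python
import subprocess


def build_clone_command(repo_url, model_name, branch=None):
    """Return the git argument list for cloning one branch into models/."""
    command = ["git", "clone", "--single-branch"]
    if branch is not None:
        # Only needed when the weights live outside the default branch.
        command += ["--branch", branch]
    # WebUI refers to the model by the folder name under models/.
    command += [repo_url, f"models/{model_name}"]
    return command


def clone_model(repo_url, model_name, branch=None):
    """Run the clone from the WebUI root folder; raises if git fails."""
    subprocess.run(build_clone_command(repo_url, model_name, branch), check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual repository or folder names.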

Once the model is downloaded, you can start the server with the model pre-configured as well as the API extensions so that you do not have to manually configure using the browser.

# --model: name of model path excluding "models/"
# --api: turns on raw API
# --extensions openai: turns on OpenAI API
python server.py \
  --model {MODEL_NAME} \
  --api \
  --extensions openai
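
Loading a large model can take a while, so a client script may want to wait until the API port accepts connections before sending requests. The following is a rough sketch using only the standard library; the helper names are my own, and the host and port you pass in depend on your setup (the raw API section below mentions port 5000 as the default).

```python
import socket
import time


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, max_wait=60.0, interval=1.0):
    """Poll until the port accepts connections or max_wait seconds pass."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if is_port_open(host, port):
            return True
        time.sleep(interval)
    return False
```

An open port only shows that the server process is listening; it does not guarantee that the model has finished loading.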

Enabling the API Extension in WebUI

  1. Go to the session tab.
  2. Select the desired extensions (API, OpenAI API, etc.).

Using the API

WebUI offers two API styles. The first is the raw API, which has a more complicated structure but allows more parameters to be tweaked. The other is the OpenAI-style API, which follows the OpenAI API conventions, so client code can easily be switched over to OpenAI models.

Raw API (deprecated)

The raw API is listening on port 5000 by default. The history parameter can be a bit confusing. There are two parts of the history parameter: internal and visible. The internal history is the history that the model sees, while the visible history is the history that the user sees.

I have found that using the same history for both works fine for my purposes. Each history is a list of [user_input, model_output] pairs. Your latest input is sent through the user_input parameter, and the system message can be set with context_instruct.

import html

import requests

from chat_api import DEFAULT_CHAT_PARAMS  # local module, not shown here

instruction = 'Your job is to play the assigned role and give responses to your best ability.\n'
# Each history entry is a [user_input, model_output] pair.
chat_history = [
    [
        'You are a helpful assistant. You will answer questions I ask you. Reply with Yes if you understand.',
        'Yes, I understand',
    ],
]
params = dict(
    user_input = 'What color is the sky?',
    history = dict(
        internal = chat_history,  # what the model sees
        visible = chat_history,  # what the user sees
    ),
    context_instruct = instruction,
)
response = requests.post('http://localhost:5000/api/v1/chat', json=params)
result = response.json()['results'][0]['history']
# The newest pair is last; index 1 of the pair is the model output.
output = html.unescape(result['visible'][-1][1])
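
If you call the raw API from several places, it may help to wrap the payload construction and the response parsing in two small functions. This is a sketch based only on the request and response shapes used above; the function names are my own and the real API accepts many more parameters.

```python
import html


def build_raw_chat_payload(turns, user_input, instruction):
    """Build the JSON body for the raw chat endpoint.

    turns is a list of (user_input, model_output) pairs from earlier in
    the conversation.
    """
    return {
        "user_input": user_input,
        "history": {
            # The same turns are sent as both internal and visible history.
            "internal": [list(pair) for pair in turns],
            "visible": [list(pair) for pair in turns],
        },
        "context_instruct": instruction,
    }


def extract_last_reply(response_json):
    """Return the latest model output from a raw chat response body."""
    visible = response_json["results"][0]["history"]["visible"]
    # The visible history is HTML-escaped, so unescape it for plain text.
    return html.unescape(visible[-1][1])
```

The payload can then be passed as the json argument of requests.post, and the parsed response body handed to extract_last_reply.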

OpenAI style API

The OpenAI-style API follows OpenAI's API conventions. One thing to note is that you need to set dummy values for the OPENAI_API_KEY and OPENAI_API_BASE environment variables. More info on the OpenAI API can be found here.
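
The two environment variables can also be set from Python before any client code runs. A minimal sketch follows; the helper name configure_openai_env is my own, and the address in the last line is a hypothetical placeholder that you should replace with wherever your server actually listens.

```python
import os


def configure_openai_env(api_base, api_key="dummy"):
    """Set the two environment variables named in the text."""
    os.environ["OPENAI_API_KEY"] = api_key
    os.environ["OPENAI_API_BASE"] = api_base


# Hypothetical address: replace it with wherever your WebUI server listens.
configure_openai_env("http://localhost:5000/v1")
```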

import requests

# conf is a dict holding the endpoint 'url' and an 'api_key', defined elsewhere
headers = {
    "Content-Type": "application/json;charset=UTF-8",
    "Authorization": f"Bearer {conf['api_key']}"
}

data = {
    "messages": [
        {
            "role": "system", # system prompt
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": "Tell me a funny joke."
        }, {
            "role": "assistant",
            "content": "Why did the chicken cross the road? To get to the other side!"
        }, {
            "role": "user",
            "content": "That was not very funny. Tell me another one."
        }
    ],
    "instruction_template": "Orca Mini"
}

response = requests.post(
    url = conf['url'],
    headers = headers,
    json = data,
)
if response.status_code == 200:
    model_output = response.json()['choices'][0]['message']['content']
else:
    raise Exception(f"Response returned with code {response.status_code}, message: {response.content.decode()}")
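
For longer conversations, writing the messages list by hand gets tedious. Here is a sketch of a helper that builds the request body from earlier turns, plus one that pulls the reply out of the response. The function names are my own; the field names are the ones used in the example above.

```python
def build_chat_request(system_prompt, turns, user_input, instruction_template=None):
    """Build an OpenAI-style chat body from earlier (user, assistant) turns."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The newest user message always goes last.
    messages.append({"role": "user", "content": user_input})
    body = {"messages": messages}
    if instruction_template is not None:
        # WebUI-specific option from the example above.
        body["instruction_template"] = instruction_template
    return body


def extract_reply(response_json):
    """Return the assistant text from an OpenAI-style chat response body."""
    return response_json["choices"][0]["message"]["content"]
```

Leaving instruction_template out keeps the body compatible with clients that only know the standard fields.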


Training & Saving LoRA

Training with WebUI

  1. Go to the training tab.
  2. Choose a LoRA to copy the weight shapes from.
  3. Set dataset to train on (refer to the tutorial link for formatting guide).

Training with transformers

Here is the full code I used to fine-tune the 4-bit 70B version of LLaMA 2. It can also be found in the example repository, along with the template files I used.

# Import dependencies
import json

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)

from peft import LoraConfig
from trl import SFTTrainer
from auto_gptq import exllama_set_max_input_length

from datasets import Dataset

import pandas as pd
import yaml
from sklearn.model_selection import train_test_split

# Load model
model_name = "TheBloke/Llama-2-70B-chat-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # device_map="auto",
    device_map={"": 0},  # place the whole model on GPU 0
)
model = exllama_set_max_input_length(model, 8192) # Need to set when using LLaMa models
model.enable_input_require_grads() # Need to enable when training!

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

print('successfully loaded model and tokenizer')

lora_config = LoraConfig(
    # target_modules=["q_proj", "k_proj"],
    # remaining LoRA settings are not shown here
)

training_arguments = TrainingArguments(
    # training settings are not shown here
)

# Load and wrap the dataset to fine-tune on

template = yaml.safe_load(open('./templates/college_confidential.yaml'))
df = pd.read_csv('./data/college_confidential/dataset.csv')

def wrap_task(alt_a, alt_b, text, label):
    # Build one training prompt: system instruction, user task, expected response
    prompt = '### System:\n'
    prompt += template['instruction']
    prompt += '\n\n'
    prompt += '### User:\n'
    prompt += template['task'].replace(
        '{alternative_a}', alt_a
    ).replace(
        '{alternative_b}', alt_b
    ).replace(
        'text', text
    )
    prompt += '\n\n'
    prompt += '### Response:\n'
    prompt += template['label'][label]
    return prompt

input_prompts = [
    wrap_task(alt_a, alt_b, text, label)
    for alt_a, alt_b, text, label in df[['alternative_a', 'alternative_b','text','label']].values
]

X_train, X_test, y_train, y_test = train_test_split(
    input_prompts, df['label'].values,  # assumed inputs: prompts as X, labels as y
    test_size = .2,
    random_state = 0,
)

dataset = Dataset.from_dict(dict(
    text = X_train
))

with open('./data/college_confidential/train.json', 'w') as f:
    json.dump(X_train, f)
with open('./data/college_confidential/test.json', 'w') as f:
    json.dump(X_test, f)

print('loaded dataset')

trainer = SFTTrainer(
    model = model,
    # remaining trainer arguments are not shown here
)
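
To make the prompt layout built by wrap_task more concrete, here is a small standalone sketch that renders one example. The template dictionary, its placeholder names, and the sample values are all hypothetical stand-ins for my YAML template, which is not reproduced in this post.

```python
# Hypothetical stand-in for the YAML template; the real file is not shown.
template = {
    "instruction": "Decide which alternative the text supports.",
    "task": "A: {alternative_a}\nB: {alternative_b}\nText: {text}",
    "label": {0: "A", 1: "B"},
}


def render_prompt(alt_a, alt_b, text, label):
    """Fill the task placeholders and join the three prompt sections."""
    task = (
        template["task"]
        .replace("{alternative_a}", alt_a)
        .replace("{alternative_b}", alt_b)
        .replace("{text}", text)
    )
    parts = [
        "### System:\n" + template["instruction"],
        "### User:\n" + task,
        "### Response:\n" + template["label"][label],
    ]
    # Sections are separated by a blank line, as in wrap_task.
    return "\n\n".join(parts)


prompt = render_prompt("Stanford", "MIT", "I loved the campus at MIT.", 1)
print(prompt)
```

Each training example ends with the expected response, so the model learns to produce the label after the "### Response:" marker.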


Loading LoRA

Loading with WebUI

Once your model is trained, you can load the LoRA on top of your base model by selecting it from the model tab of the WebUI. If your LoRA was not trained through the WebUI, make sure that the folder output by the .save_pretrained method is placed under the loras/ directory of the WebUI directory so that it can be detected.
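
Before starting the server, you can check that the LoRA folder landed in the right place. The sketch below lists the folders under loras/ that contain adapter_config.json, a file that PEFT's save_pretrained writes. It is only a sanity check on the folder layout, not WebUI's own detection logic, and the function name is my own.

```python
from pathlib import Path


def list_lora_folders(webui_root):
    """Return names of folders under loras/ that contain adapter_config.json."""
    loras_dir = Path(webui_root) / "loras"
    if not loras_dir.is_dir():
        return []
    return sorted(
        folder.name
        for folder in loras_dir.iterdir()
        if folder.is_dir() and (folder / "adapter_config.json").is_file()
    )
```

If your LoRA's folder name is missing from the returned list, the adapter files were probably saved one level too deep or in a different directory.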

Loading with Terminal

You can use the --lora parameter to add the LoRA to the base model. An example command looks like the following:

python server.py \
  --model {MODEL_NAME} \
  --lora {LORA_NAME} \
  --api \
  --extensions openai
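
If you launch the server from a script, the command can be assembled from its parts. This is a sketch limited to the flags used in this post; the function name and the model and LoRA names in the usage line are hypothetical.

```python
import shlex


def build_server_command(model_name, lora_name=None, api=True, extensions=("openai",)):
    """Return the argument list for starting WebUI with the given options."""
    command = ["python", "server.py", "--model", model_name]
    if lora_name is not None:
        command += ["--lora", lora_name]
    if api:
        command.append("--api")
    if extensions:
        command += ["--extensions", *extensions]
    return command


# Hypothetical names; prints the command as a single shell-safe line.
print(shlex.join(build_server_command("my-model", lora_name="my-lora")))
```

The returned list can be passed directly to subprocess.run when the script is executed from the WebUI root folder.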