aLoRA Support #15327

Open · gabe-l-hart wants to merge 13 commits into master from gabe-l-hart/alora-support
Conversation

@gabe-l-hart
Collaborator

gabe-l-hart commented Aug 14, 2025

DRAFT STATUS

This PR started in draft as a proof-of-concept while we discussed the best path forward.

The implementation is now robust enough to be ready for full review. The changes were a bit more involved than I had originally hoped based on Georgi's comment, but they are all contained to tools/server, except for the changes needed to support the new GGUF field.

Description

Closes #15212
Supports #15213

This PR adds support for Activated LoRA (aLoRA) in llama-server and in the GGUF representation of a LoRA adapter. The primary benefit of aLoRA is the ability to hot-swap adapters without needing to clear the cache. This enables a much more efficient multi-adapter workflow where individual adapters provide "add-on" features to a base model and can be applied mid-flow without redoing the prefill work.

Current Changes

  • Add adapter.alora.invocation_tokens GGUF KV
    • Support parsing adapter.alora.invocation_tokens from "alora_invocation_tokens" in convert_lora_to_gguf.py
    • Support reading adapter.alora.invocation_tokens when loading an adapter
  • Add alora_invocation_tokens to llama_lora_adapter struct
  • Add C-style APIs to llama.h to support getting the invocation tokens from a const llama_lora_adapter *
  • Add support in the server to conditionally skip clearing the cache when a request arrives with an adapter change, under the following conditions (a rough sketch of this check follows the list):
    • The current cache was populated without any adapters
    • The enabled new adapters are all aloras
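
In pseudocode, the cache-retention check described above boils down to something like the following Python sketch (all names here are hypothetical; the actual change lives in the C++ server code):

def should_clear_cache(cached_adapters: dict[int, float],
                       requested_adapters: dict[int, float],
                       alora_ids: set[int]) -> bool:
    """Return True if the KV cache must be cleared before handling the request."""
    if requested_adapters == cached_adapters:
        # No adapter change at all
        return False
    # The cache can only be kept if it was populated without any adapters active...
    cache_is_clean = all(scale == 0.0 for scale in cached_adapters.values())
    # ...and every newly enabled adapter is an alora
    newly_enabled = {i for i, scale in requested_adapters.items() if scale != 0.0}
    only_aloras = newly_enabled.issubset(alora_ids)
    return not (cache_is_clean and only_aloras)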

TODO

  • The correct way to apply an alora is to identify the invocation tokens within an input request and only use the adapter for the tokens starting with the invocation sequence. This may require a much deeper intervention to support adapter scaling on a per-token basis rather than on a per-computation basis.

Testing

I'm testing this using the following models and adapters:

Conversion

# Convert base model
convert_hf_to_gguf.py ~/models/granite-3.2-8b-instruct/

# Convert alora
python convert_lora_to_gguf.py ~/models/granite-3.2-8b-alora-uncertainty/ --base ~/models/granite-3.2-8b-instruct/ --verbose
# NOTE! Look for "DEBUG:lora-to-gguf:GGUF KV: adapter.alora.invocation_tokens = [6989, 24933, 49153]"

# Convert lora
python convert_lora_to_gguf.py ~/models/granite-3.2-8b-lora-uncertainty/ --base ~/models/granite-3.2-8b-instruct/ --verbose
# NOTE! You should not see log about adapter.alora.invocation_tokens

Execution

# Boot with both adapters (0: alora, 1: lora)
# NOTE: Disabling reasoning budget is critical for these adapters!
./bin/llama-server \
  -m ~/models/granite-3.2-8b-instruct/granite-3.2-8B-instruct-F16.gguf \
  --lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf \
  --lora ~/models/granite-3.2-8b-lora-uncertainty/granite-3.2-8B-uncertainty-F16-LoRA.gguf \
  --port 8081 \
  --jinja \
  --reasoning-budget 0

Sniff test

This script simply verifies that the two adapters can be toggled and that the cache is cleared appropriately. The example inputs are trivial, so the timings are not particularly valuable.

server-req.py
import json
import time

from transformers import AutoTokenizer
import requests

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.2-8b-instruct")

url = "http://localhost:8081"

messages = [
    {
      "role": "document A",
      "content": "The first document"
    },
    {
        "role": "document B",
        "content": "The second document"
    },
    {
        "role": "user",
        "content": "Which document is first?"
    },
]

adapter_message = {
    "role": "certainty",
    "content": ""
}

# Run base messages
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/chat/completions", json={
    "model": "unused",
    "messages": messages,
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
assistant_resp = resp.json()["choices"][0]["message"]
print(f"ASSISTANT RESPONSE ({end-start}s):")
print(assistant_resp["content"])

# Create the serialized version as a string so we can append the right prompt
messages.append(assistant_resp)
raw_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
uq_prompt = raw_prompt + "<|start_of_role|>certainty<|end_of_role|>"

# Run with the adapter and the prompt for UQ with the alora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 1.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ aLoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))

# Run with the adapter and the prompt for UQ with the lora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 1.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ LoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))

Response

----
ASSISTANT RESPONSE (0.39293885231018066s):
The first document is document A.
----
UQ RESPONSE w/ aLoRA (0.18337106704711914s)
85
>>
{
  "completion_tokens": 3,
  "prompt_tokens": 95,
  "total_tokens": 98
}
{
  "prompt_n": 6,
  "prompt_ms": 78.906,
  "prompt_per_token_ms": 13.151000000000002,
  "prompt_per_second": 76.03984487871644,
  "predicted_n": 3,
  "predicted_ms": 102.247,
  "predicted_per_token_ms": 34.08233333333333,
  "predicted_per_second": 29.340714152982482
}
----
UQ RESPONSE w/ LoRA (0.3489878177642822s)
85%
>>
{
  "completion_tokens": 4,
  "prompt_tokens": 95,
  "total_tokens": 99
}
{
  "prompt_n": 95,
  "prompt_ms": 193.069,
  "prompt_per_token_ms": 2.0323052631578946,
  "prompt_per_second": 492.05206428789705,
  "predicted_n": 4,
  "predicted_ms": 153.721,
  "predicted_per_token_ms": 38.43025,
  "predicted_per_second": 26.021168220347253
}

…ation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
This is the preferred method in PEFT which is the source of ground truth

https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
This does not yet do the part to identify the invocation tokens and only
apply the lora adapter afterwards, but it does seem to produce correct
results if the invocation tokens are the beginning of the uncached input.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Collaborator Author

gabe-l-hart commented Aug 14, 2025

One interesting update: For the specific adapters I'm using to test here, the invocation_tokens look like the beginning of a turn with the role certainty. I had originally been attempting to append this to the chat using the /chat/completions endpoint and appending {"role": "certainty", "content": ""}. This, however, resulted in the template expanding to <|start_of_role|>certainty<|end_of_role|>None<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|> which is not correct for these adapters.

I've updated my sniff test script above to use client-side template expansion and the raw /completions endpoint for the UQ requests. This is not ideal since it means that this style of adapter would require careful orchestration on the client side to use.

NOTE: This is a property of these adapters and not of aLoRA in general. Theoretically, an adapter could be trained to invoke on the full <|start_of_role|>certainty<|end_of_role|>None<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|> or a similar "last turn plus agent prompt."

@CISC
Collaborator

CISC commented Aug 14, 2025

One interesting update: For the specific adapters I'm using to test here, the invocation_tokens look like the beginning of a turn with the role certainty. I had originally been attempting to append this to the chat using the /chat/completions endpoint and appending {"role": "certainty", "content": ""}. This, however, resulted in the template expanding to <|start_of_role|>certainty<|end_of_role|>None<|end_of_turn|>\n<|start_of_role|>assistant<|end_of_role|> which is not correct for these adapters.

Add the following to your request to remove the assistant generation prompt:

"add_generation_prompt": false,

@gabe-l-hart
Collaborator Author

Add the following to your request to remove the assistant generation prompt:

Ah, yep, that will definitely help, but it won't eliminate the None<|end_of_text|> portion. Talking with @kgreenewald, it sounds like the team will be moving to a training pattern for these adapters that will be more friendly to the chat template going forward.

@CISC
Collaborator

CISC commented Aug 14, 2025

Add the following to your request to remove the assistant generation prompt:

Ah, yep, that will definitely help, but it won't eliminate the None<|end_of_text|> portion. Talking with @kgreenewald, it sounds like the team will be moving to a training pattern for these adapters that will be more friendly to the chat template going forward.

Ah, didn't notice that, I suppose that's just because the template doesn't properly handle unknown roles?

@gabe-l-hart
Collaborator Author

Yeah, the real issue is that it was trained to act like the generation prompt, so the activation sequence is intentionally an incomplete turn, but with a different role.

This currently limits each slot to a single enabled alora. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling, and it is not really a
well-studied case for alora since it's unclear whether one alora can reuse
cache from previous prefill computed with a different alora.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Collaborator Author

Update

I've now added support for correctly applying the alora only to the tokens starting with the invocation sequence. The changes look like the following:

  1. When an alora is requested, search the prompt tokens backwards for the invocation sequence
    • If no invocation sequence is found, simply disable the alora
    • NOTE (to self): Would this have any impact on subsequent calls? I don't think so, since the slot loras are always initialized from the server loras on each task
  2. When processing a slot, only pull tokens from cache up to the token before the start of the invocation sequence
    • NOTE (to self): We may need to allow for the case where we do want to pull cached tokens for the invocation sequence if they came from the same alora. This would be a strange use though, since it would require the user to send a request with the alora enabled but with no un-cached alora invocation strings, since the last occurrence is always what gets found.
  3. Once cached tokens have been filled, identify tokens that fall between the end of the cached tokens (slot.n_past) and the start of the invocation sequence. These should be prefilled without the alora, so the alora is temporarily disabled and the batch filling breaks at the token before the invocation start. The alora is then re-enabled with the correct scale so that the next task can finish prefill from the invocation start with the adapter enabled. (A rough sketch of steps 1-3 follows this list.)
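
The following is a rough Python sketch of steps 1-3 (hypothetical helper names; the real implementation lives in the C++ server code):

def find_last_subsequence(tokens: list[int], needle: list[int]) -> int:
    """Start index of the last occurrence of `needle` in `tokens`, or -1 if absent."""
    for start in range(len(tokens) - len(needle), -1, -1):
        if tokens[start:start + len(needle)] == needle:
            return start
    return -1

def plan_alora_prefill(prompt_tokens: list[int], invocation_tokens: list[int], n_cached: int) -> dict:
    """Decide how much of the prompt may reuse cache and where the alora kicks in."""
    inv_start = find_last_subsequence(prompt_tokens, invocation_tokens)
    if inv_start < 0:
        # Step 1: no invocation sequence found, so simply disable the alora
        return {"use_alora": False, "n_from_cache": n_cached}
    # Step 2: never reuse cached tokens at or past the invocation start
    n_from_cache = min(n_cached, inv_start)
    return {
        "use_alora": True,
        "n_from_cache": n_from_cache,
        # Step 3: tokens in [n_from_cache, inv_start) are prefilled with the alora
        # temporarily scaled to 0.0; prefill from inv_start onward runs with the
        # alora enabled at its requested scale.
        "plain_prefill_range": (n_from_cache, inv_start),
        "alora_prefill_from": inv_start,
    }

For example, with invocation_tokens = [6989, 24933, 49153] (the UQ alora above) and a prompt that ends with that sequence, only the tokens before the invocation start are eligible for cache reuse.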

Testing

I've got a few tweaks to my test script that allow it to exercise these conditions:

uq-req.py
import json
import time

from transformers import AutoTokenizer
import requests

tokenizer = AutoTokenizer.from_pretrained("/Users/ghart/models/granite-3.2-8b-instruct")

url = "http://localhost:8081"

documents = [
    {"text": "My name is Gabe"},
    {"text": "I work for IBM"}
]
messages = [{"role": "user", "content": "Who does Gabe work for?"}]

adapter_message = {
    "role": "certainty",
    "content": ""
}

# Run base messages
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/chat/completions", json={
    "model": "unused",
    "messages": messages,
    "chat_template_kwargs": {
        "documents": documents,
    },
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
assistant_resp = resp.json()["choices"][0]["message"]
print(f"ASSISTANT RESPONSE ({end-start}s):")
print(assistant_resp["content"])

# UNCOMMENT this to extend the assistant's response so that it isn't cached
"""
assistant_resp["content"] = assistant_resp["content"] + "\nRespect my authority!"
"""

# Create the serialized version as a string so we can append the right prompt
messages.append(assistant_resp)
raw_prompt = tokenizer.apply_chat_template(messages, documents=documents, tokenize=False)
uq_prompt = raw_prompt + "<|start_of_role|>certainty<|end_of_role|>"

# Run with both adapters disabled
# UNCOMMENT this to exercise the case where the invocation string itself has
# been cached without the adapter
"""
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/out adapters ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))
"""

# Run with the adapter and the prompt for UQ with the alora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "max_tokens": 100,
    "lora": [
        # alora
        {"id": 0, "scale": 1.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ aLoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))

# Run with the adapter and the prompt for UQ with the lora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "max_tokens": 100,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 1.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ LoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))

Don't use cached invocation sequence from base model

This simulates the case where the user ran the invocation sequence through the base model without the adapter and those tokens are cached (uncomment starting at line 57)

----
ASSISTANT RESPONSE (0.3047969341278076s):
IBM
----
UQ RESPONSE w/out adapters (0.1103658676147461s)
high
>>
{
  "completion_tokens": 2,
  "prompt_tokens": 140,
  "total_tokens": 142
}
{
  "prompt_n": 6,
  "prompt_ms": 60.143,
  "prompt_per_token_ms": 10.023833333333334,
  "prompt_per_second": 99.7622333438638,
  "predicted_n": 2,
  "predicted_ms": 47.758,
  "predicted_per_token_ms": 23.879,
  "predicted_per_second": 41.877800577913646
}
----
UQ RESPONSE w/ aLoRA (0.2115638256072998s)
87%
>>
{
  "completion_tokens": 4,
  "prompt_tokens": 140,
  "total_tokens": 144
}
{
  "prompt_n": 3,
  "prompt_ms": 56.051,
  "prompt_per_token_ms": 18.683666666666667,
  "prompt_per_second": 53.52268469786445,
  "predicted_n": 4,
  "predicted_ms": 153.106,
  "predicted_per_token_ms": 38.2765,
  "predicted_per_second": 26.125690697947828
}
----
UQ RESPONSE w/ LoRA (2.164383888244629s)
85%

Based on the information provided, Gabe works for IBM. The document states "I work for IBM" and the name associated with this statement is Gabe.
>>
{
  "completion_tokens": 38,
  "prompt_tokens": 140,
  "total_tokens": 178
}
{
  "prompt_n": 140,
  "prompt_ms": 283.203,
  "prompt_per_token_ms": 2.022878571428571,
  "prompt_per_second": 494.3450457798823,
  "predicted_n": 38,
  "predicted_ms": 1877.568,
  "predicted_per_token_ms": 49.409684210526315,
  "predicted_per_second": 20.238947404301733
}

Don't use adapter for uncached tokens before invocation sequence

This simulates the case where, for some reason, there are additional tokens not pulled from cache that come before the invocation sequence (uncomment line 45)

----
ASSISTANT RESPONSE (0.30304694175720215s):
IBM
----
UQ RESPONSE w/ aLoRA (0.39620471000671387s)
87.5%
>>
{
  "completion_tokens": 6,
  "prompt_tokens": 146,
  "total_tokens": 152
}
{
  "prompt_n": 12,
  "prompt_ms": 136.993,
  "prompt_per_token_ms": 11.416083333333333,
  "prompt_per_second": 87.59571656945977,
  "predicted_n": 6,
  "predicted_ms": 256.634,
  "predicted_per_token_ms": 42.772333333333336,
  "predicted_per_second": 23.379598961945803
}
----
UQ RESPONSE w/ LoRA (1.1565330028533936s)
85%

Based on the information provided, Gabe works for IBM.
>>
{
  "completion_tokens": 18,
  "prompt_tokens": 146,
  "total_tokens": 164
}
{
  "prompt_n": 146,
  "prompt_ms": 287.421,
  "prompt_per_token_ms": 1.9686369863013697,
  "prompt_per_second": 507.9656670876519,
  "predicted_n": 18,
  "predicted_ms": 866.559,
  "predicted_per_token_ms": 48.14216666666667,
  "predicted_per_second": 20.771811267322825
}

Too much python 🤦

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Collaborator Author

I've now extended this to test with multiple aloras in the same conversation. Here's the setup:

Adapters

Adapters are converted using convert_lora_to_gguf.py.

Boot with adapters

./bin/llama-server \
  -m ~/models/granite-3.2-8b-instruct/granite-3.2-8B-instruct-F16.gguf \
  --lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf \
  --lora ~/models/granite-3.2-8b-alora-rag-answerability-prediction/granite-3.2-8B-alora-rag-answerability-prediction-F16-LoRA.gguf \
  --port 8081 \
  --jinja \
  --reasoning-budget 0

Test Script

(sorry, it requires my personal logging framework just 'cuz 😉... pip install alchemy-logging)

alora-chat.py
#!/usr/bin/env python3
"""
This is a simple implementation of an interactive chat that leverages several
aLoRA adapters during the flow
"""

# Standard
import argparse
import os

# First Party
import alog

# Third Party
import requests

log = alog.use_channel("MAIN")


def make_document(i: int, doc: str) -> dict:
    """Make a document dict from the given doc as either text or a path"""
    log.info("Adding document: %s", doc)
    if os.path.exists(doc):
        with open(doc, "r") as handle:
            return {"text": handle.read(), "doc_id": i, "title": doc}
    return{"text": doc, "doc_id": i}


def make_lora_req(adapter_ids: list[int], loras: list[int]) -> list[dict]:
    return [
        {"id": i, "scale": 1.0 if i in loras else 0.0}
        for i in adapter_ids
    ]


def make_chat_req(messages: list[dict], documents: list[dict], adapter_ids: list[int], loras: list[int]) -> dict:
    return {
        "messages": messages,
        "chat_template_kwargs": {
            "documents": documents,
        },
        "temperature": 0.0,
        "lora": make_lora_req(adapter_ids, loras),
    }


def make_completion_req(prompt: str, documents: list[dict], adapter_ids: list[int], loras: list[int], **kwargs) -> dict:
    kwargs.update({
        "prompt": prompt,
        "chat_template_kwargs": {
            "documents": documents,
        },
        "temperature": 0.0,
        "lora": make_lora_req(adapter_ids, loras),
    })
    return kwargs


def run_main_loop(host: str, documents: list[dict], uq_id: int, ans_id: int, adapter_ids: list[int]):
    """Run the main loop with questions"""
    help_cmd = "/?"
    doc_cmd = "/doc"
    reset_cmd = "/reset"
    quit_cmd = "/quit"
    doc_pfx = f"{doc_cmd} "

    def print_help():
        print("Commands:")
        print(f"{help_cmd}: Print help")
        print(f"{doc_cmd}: Add a document")
        print(f"{reset_cmd}: Reset the chat history")
        print(f"{quit_cmd}: Quit")

    messages = []
    print_help()
    while True:
        inp = input("?> ").strip()
        if inp == quit_cmd:
            break
        if not inp:
            continue
        if inp == help_cmd:
            print_help()
            continue
        if inp == reset_cmd:
            messages.clear()
            continue
        if inp.startswith(doc_pfx):
            doc = inp[len(doc_pfx):].lstrip()
            documents.append(make_document(len(documents), doc))
            continue

        # Apply the chat template with the user query
        user_message = {"role": "user", "content": inp}
        resp = requests.post(f"{host}/apply-template", json=make_chat_req(messages + [user_message], documents, adapter_ids, []))
        resp.raise_for_status()
        formatted_prompt = resp.json()["prompt"]
        log.debug4("Formatted prompt: %s", formatted_prompt)

        # Run the Answerability query
        ans_prompt = formatted_prompt + "<|end_of_text|>\n<|start_of_role|>answerability<|end_of_role|>"
        resp = requests.post(f"{host}/v1/completions", json=make_completion_req(ans_prompt, documents, adapter_ids, [ans_id], max_tokens=3))
        resp.raise_for_status()
        js = resp.json()
        answerability = js["choices"][0]["text"]
        log.debug("Answerability: %s", answerability)
        log.debug2("Usage: %s", js["usage"])
        log.debug2("Timings: %s", js["timings"])
        answerable = not answerability.split()[0].lower().startswith("unanswerable")
        if answerable:
            print(">> The question is answerable!")
        else:
            print(">> I'm sorry, but that question isn't answerable with the given context")
            if input("?> Do you want to try anyway [yN]? ").strip().lower() not in ["y", "yes"]:
                continue
        messages.append(user_message)

        # If not unanswerable, run the question and get the assistant's response
        resp = requests.post(f"{host}/v1/chat/completions", json=make_chat_req(messages, documents, adapter_ids, []))
        resp.raise_for_status()
        js = resp.json()
        assistant_msg = js["choices"][0]["message"]
        answer = assistant_msg["content"]
        messages.append(assistant_msg)
        print(f"ASSISTANT: {answer}")

        # Get the uncertainty
        formatted_prompt = requests.post(f"{host}/apply-template", json=make_chat_req(messages, documents, adapter_ids, [])).json()["prompt"]
        uq_prompt = formatted_prompt + "<|end_of_text|>\n<|start_of_role|>certainty<|end_of_role|>"
        resp = requests.post(f"{host}/v1/completions", json=make_completion_req(uq_prompt, documents, adapter_ids, [uq_id], max_tokens=5))
        resp.raise_for_status()
        js = resp.json()
        uq = js["choices"][0]["text"]
        print(f">> CERTAINTY: {uq}")
        log.debug2("Usage: %s", js["usage"])
        log.debug2("Timings: %s", js["timings"])

        print()


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    # Logging
    parser.add_argument("--log-level", "-l", default=os.getenv("LOG_LEVEL", "info"))
    parser.add_argument("--log-filters", "-lf", default=os.getenv("LOG_FILTERS", "urllib3.connectionpool:info"))
    parser.add_argument("--log-json", "-lj", action="store_true", default=os.getenv("LOG_JSON", "").lower() == "true")
    # Models
    parser.add_argument("--alora-uq", "-u", type=int, default=None, help="Adapter ID for the UQ adapter")
    parser.add_argument("--alora-answerability", "-a", type=int, default=None, help="Adapter ID for the Answerability adapter")
    # Server
    parser.add_argument("--host", "-s", default="http://localhost:8081", help="Host where llama-server is running")
    # Docs
    parser.add_argument("--document", "-d", nargs="+", help="document (text or path) to add as context")

    # Configure logging
    args = parser.parse_args()
    alog.configure(
        default_level=args.log_level,
        filters=args.log_filters,
        formatter="json" if args.log_json else "pretty",
        thread_id=True,
    )

    # Make sure llama-server is up!
    resp = requests.get(f"{args.host}/health")
    resp.raise_for_status()
    log.info("llama-server is up at %s", args.host)

    # Get the loaded adapters
    resp = requests.get(f"{args.host}/lora-adapters")
    adapters = resp.json()
    adapter_ids = [entry["id"] for entry in adapters]

    # Figure out which adapter is which
    uq_id = args.alora_uq
    if uq_id is None:
        candidates = [entry for entry in adapters if "uncertainty" in entry["path"]]
        assert len(candidates) == 1, "Couldn't auto-deduce UQ adapter ID"
        uq_id = candidates[0]["id"]
    ans_id = args.alora_answerability
    if ans_id is None:
        candidates = [entry for entry in adapters if "answerability" in entry["path"]]
        assert len(candidates) == 1, "Couldn't auto-deduce Answerability adapter ID"
        ans_id = candidates[0]["id"]
    log.info("UQ aLoRA ID: %d, Answerability aLoRA ID: %d", uq_id, ans_id)

    # Load documents
    documents = []
    for i, doc in enumerate(args.document or []):
        documents.append(make_document(i, doc))

    # Start the prompt loop
    log.info("Starting main loop")
    run_main_loop(args.host, documents, uq_id, ans_id, adapter_ids)

if __name__ == "__main__":
    main()

Example Output

(llama.cpp) ghart@Mac [llama.cpp gabe-l-hart/alora-support ?~]$ python alora-chat.py -d "My name is Gabe" "I work for IBM" 
2025-08-18T21:20:46.381939 [MAIN :INFO:8299700416] llama-server is up at http://localhost:8081
2025-08-18T21:20:46.383507 [MAIN :INFO:8299700416] UQ aLoRA ID: 0, Answerability aLoRA ID: 1
2025-08-18T21:20:46.383543 [MAIN :INFO:8299700416] Adding document: My name is Gabe
2025-08-18T21:20:46.383571 [MAIN :INFO:8299700416] Adding document: I work for IBM
2025-08-18T21:20:46.383591 [MAIN :INFO:8299700416] Starting main loop
Commands:
/?: Print help
/doc: Add a document
/reset: Reset the chat history
/quit: Quit
?> Where does Gabe work?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? 
?> What company does Gabe work for?
>> The question is answerable!
ASSISTANT: IBM
>> CERTAINTY: 88%

?> How about Bob? Who does he work for?                                                           
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: I am sorry, the question is unanswerable from the provided document.
>> CERTAINTY: 60.75

?> /doc Bob works for Widgets Inc.
2025-08-18T21:22:20.766356 [MAIN :INFO:8299700416] Adding document: Bob works for Widgets Inc.
?> Try again. Where does Bob work?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: Bob works for Widgets Inc.
>> CERTAINTY: 75.85

?> Alright, time for something different. Write a haiku about python logging frameworks
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: Python logs with ease,
Structured, colored, or plain,
Logging, a breeze.
>> CERTAINTY: 40.55

?>

(NOTE: It's clear from my experiments that these adapters are not particularly robust, but that's a property of these specific adapters, which are being continuously refined!)

@gabe-l-hart gabe-l-hart marked this pull request as ready for review August 18, 2025 21:29
@gabe-l-hart gabe-l-hart requested a review from ngxson as a code owner August 18, 2025 21:29
@gabe-l-hart
Collaborator Author

I realized that my local adapter_config.json files have the updates from "invocation_string" to "alora_invocation_tokens". These changes will eventually be pushed up to the hosted adapters. For compatibility, I'm going to add the automated tokenization in the python conversion layer.
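
For illustration, a sketch of what that compatibility path could look like in convert_lora_to_gguf.py (this is an assumption about the approach, not the actual code; it assumes the base model's tokenizer is available at conversion time):

from transformers import AutoTokenizer

def get_alora_invocation_tokens(adapter_config: dict, base_model_path: str) -> list[int]:
    """Prefer the new field; fall back to tokenizing the legacy invocation_string."""
    if "alora_invocation_tokens" in adapter_config:
        return adapter_config["alora_invocation_tokens"]
    invocation_string = adapter_config.get("invocation_string")
    if not invocation_string:
        return []
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    # add_special_tokens=False so no BOS/EOS sneaks into the invocation sequence
    return tokenizer.encode(invocation_string, add_special_tokens=False)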

…er_config.json

While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string", so this enables backwards compatibility and allows
testing now (before the PEFT PR changes have percolated everywhere).

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Collaborator Author

The other dependency for this PR is #15404. The functionality is not linked at all, but the above chat script will fail when trying to perform the chat template expansion without the fix there.

@gabe-l-hart
Collaborator Author

gabe-l-hart commented Aug 18, 2025

One additional note: These adapters seem to still work well when attached to a quantized model, so using them doesn't require giving up the speed/footprint benefits of quantization.

./bin/llama-server -m ~/models/granite-3.2-8b-instruct/ggml-model-Q4_K_M.gguf --lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf --lora ~/models/granite-3.2-8b-alora-rag-answerability-prediction/granite-3.2-8B-alora-rag-answerability-prediction-F16-LoRA.gguf --port 8081 --jinja --reasoning-budget 0
(llama.cpp) ghart@Mac [llama.cpp gabe-l-hart/alora-support ?~]$ python alora-chat.py -d "My name is Gabe" "I work for IBM" 
2025-08-18T22:08:48.310437 [MAIN :INFO:8299700416] llama-server is up at http://localhost:8081
2025-08-18T22:08:48.311779 [MAIN :INFO:8299700416] UQ aLoRA ID: 0, Answerability aLoRA ID: 1
2025-08-18T22:08:48.311808 [MAIN :INFO:8299700416] Adding document: My name is Gabe
2025-08-18T22:08:48.311832 [MAIN :INFO:8299700416] Adding document: I work for IBM
2025-08-18T22:08:48.311851 [MAIN :INFO:8299700416] Starting main loop
Commands:
/?: Print help
/doc: Add a document
/reset: Reset the chat history
/quit: Quit
?> What company does Gabe work for?
>> The question is answerable!
ASSISTANT: IBM
>> CERTAINTY: 87%

?> How about Bob? Who does Bob work for?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: I am sorry, the information about Bob's employer is not available in the provided document.
>> CERTAINTY: 60.64

?> /doc Bob works for Widgets Inc
2025-08-18T22:10:43.037856 [MAIN :INFO:8299700416] Adding document: Bob works for Widgets Inc
?> /doc Bob's favorite ice cream is Mint Chocolate Chip
2025-08-18T22:10:58.262979 [MAIN :INFO:8299700416] Adding document: Bob's favorite ice cream is Mint Chocolate Chip
?> Try again. Can you tell me who Bob works for now?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: Apologies for any confusion, but the document provided does not contain information about Bob's employer.
>> CERTAINTY: 30.64

?> What company does Bob work for?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: I'm sorry, the information about Bob's employer is not available in the provided document.
>> CERTAINTY: 40.00

?> Who works for Widgets Inc?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: Bob works for Widgets Inc, as per the information in the provided document.
>> CERTAINTY: 50.55

?> What's Bob's favorite Ice Cream?
>> The question is answerable!
ASSISTANT: Bob's favorite ice cream is Mint Chocolate Chip, according to the information I have.
>> CERTAINTY: 65.55

?> cool, what about Gabe? 
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: I'm sorry, the document does not provide information about Gabe's ice cream preference.
>> CERTAINTY: 40.55

EDIT: I also tried with MXFP4_MOE and the results seem to be closer to F16 than to Q4_K_M.

@gabe-l-hart
Collaborator Author

Also important to test will be concurrent requests to the same alora. It's possible that these could end up in the same slot, and due to the logic for running pre-invocation tokens without the adapter, they could pollute a single batch.

Labels: examples, python (python script changes), server

Successfully merging this pull request may close these issues:

Feature Request: Support for Activated LoRA