aLoRA Support #15327

Open · gabe-l-hart wants to merge 13 commits into master from gabe-l-hart/alora-support
Conversation

@gabe-l-hart
Collaborator

gabe-l-hart commented Aug 14, 2025

DRAFT STATUS

This PR started in draft as a proof-of-concept while we discussed the best path forward.

The implementation is now robust enough to be ready for full review. The changes were a bit more involved than I had originally hoped based on Georgi's comment, but they are all contained to tools/server, except for the changes needed to support the new GGUF field.

Description

Closes #15212
Supports #15213

This PR adds support for Activated LoRA (aLoRA) in llama-server and in the GGUF representation of a LoRA adapter. The primary benefit of aLoRA is the ability to hot-swap adapters without needing to clear the cache. This enables a much more efficient multi-adapter workflow where individual adapters provide "add-on" features to a base model and can be applied mid-flow without redoing the prefill work.

Current Changes

  • Add adapter.alora.invocation_tokens GGUF KV
    • Support parsing adapter.alora.invocation_tokens from "alora_invocation_tokens" in convert_lora_to_gguf.py
    • Support reading adapter.alora.invocation_tokens when loading an adapter
  • Add alora_invocation_tokens to llama_lora_adapter struct
  • Add C-style APIs to llama.h to support getting the invocation tokens from a const llama_lora_adapter *
  • Add support in the server to conditionally skip clearing the cache when a request arrives with an adapter change, under the following conditions (a rough sketch of this check follows the list):
    • The current cache was populated without any adapters
    • The enabled new adapters are all aloras
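
In pseudocode, the cache-retention check described above boils down to something like the following Python sketch (all names here are hypothetical; the actual change lives in the C++ server code):

def should_clear_cache(cached_adapters: dict[int, float],
                       requested_adapters: dict[int, float],
                       alora_ids: set[int]) -> bool:
    """Return True if the KV cache must be cleared before handling the request."""
    if requested_adapters == cached_adapters:
        # No adapter change at all
        return False
    # The cache can only be kept if it was populated without any adapters active...
    cache_is_clean = all(scale == 0.0 for scale in cached_adapters.values())
    # ...and every newly enabled adapter is an alora
    newly_enabled = {i for i, scale in requested_adapters.items() if scale != 0.0}
    only_aloras = newly_enabled.issubset(alora_ids)
    return not (cache_is_clean and only_aloras)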

TODO

  • The correct way to apply an alora is to identify the invocation tokens within an input request and only use the adapter for the tokens starting with the invocation sequence. This may require a much deeper intervention to support adapter scaling on a per-token basis rather than on a per-computation basis.

Testing

I'm testing this using the following models and adapters:

Conversion

# Convert base model
convert_hf_to_gguf.py ~/models/granite-3.2-8b-instruct/

# Convert alora
python convert_lora_to_gguf.py ~/models/granite-3.2-8b-alora-uncertainty/ --base ~/models/granite-3.2-8b-instruct/ --verbose
# NOTE! Look for "DEBUG:lora-to-gguf:GGUF KV: adapter.alora.invocation_tokens = [6989, 24933, 49153]"

# Convert lora
python convert_lora_to_gguf.py ~/models/granite-3.2-8b-lora-uncertainty/ --base ~/models/granite-3.2-8b-instruct/ --verbose
# NOTE! You should not see log about adapter.alora.invocation_tokens

Execution

# Boot with both adapters (0: alora, 1: lora)
# NOTE: Disabling reasoning budget is critical for these adapters!
./bin/llama-server \
  -m ~/models/granite-3.2-8b-instruct/granite-3.2-8B-instruct-F16.gguf \
  --lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf \
  --lora ~/models/granite-3.2-8b-lora-uncertainty/granite-3.2-8B-uncertainty-F16-LoRA.gguf \
  --port 8081 \
  --jinja \
  --reasoning-budget 0

Sniff test

This script simply verifies that the two adapters can be toggled and that the cache is cleared appropriately. The example inputs are trivial, so the timings are not particularly valuable.

server-req.py
import json
import time

from transformers import AutoTokenizer
import requests

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.2-8b-instruct")

url = "http://localhost:8081"

messages = [
    {
      "role": "document A",
      "content": "The first document"
    },
    {
        "role": "document B",
        "content": "The second document"
    },
    {
        "role": "user",
        "content": "Which document is first?"
    },
]

adapter_message = {
    "role": "certainty",
    "content": ""
}

# Run base messages
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/chat/completions", json={
    "model": "unused",
    "messages": messages,
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
assistant_resp = resp.json()["choices"][0]["message"]
print(f"ASSISTANT RESPONSE ({end-start}s):")
print(assistant_resp["content"])

# Create the serialized version as a string so we can append the right prompt
messages.append(assistant_resp)
raw_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
uq_prompt = raw_prompt + "<|start_of_role|>certainty<|end_of_role|>"

# Run with the adapter and the prompt for UQ with the alora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 1.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ aLoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))

# Run with the adapter and the prompt for UQ with the lora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 1.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ LoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))

Response

----
ASSISTANT RESPONSE (0.39293885231018066s):
The first document is document A.
----
UQ RESPONSE w/ aLoRA (0.18337106704711914s)
85
>>
{
  "completion_tokens": 3,
  "prompt_tokens": 95,
  "total_tokens": 98
}
{
  "prompt_n": 6,
  "prompt_ms": 78.906,
  "prompt_per_token_ms": 13.151000000000002,
  "prompt_per_second": 76.03984487871644,
  "predicted_n": 3,
  "predicted_ms": 102.247,
  "predicted_per_token_ms": 34.08233333333333,
  "predicted_per_second": 29.340714152982482
}
----
UQ RESPONSE w/ LoRA (0.3489878177642822s)
85%
>>
{
  "completion_tokens": 4,
  "prompt_tokens": 95,
  "total_tokens": 99
}
{
  "prompt_n": 95,
  "prompt_ms": 193.069,
  "prompt_per_token_ms": 2.0323052631578946,
  "prompt_per_second": 492.05206428789705,
  "predicted_n": 4,
  "predicted_ms": 153.721,
  "predicted_per_token_ms": 38.43025,
  "predicted_per_second": 26.021168220347253
}

…ation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
This is the preferred method in PEFT which is the source of ground truth

https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
This does not yet do the part to identify the invocation tokens and only
apply the lora adapter afterwards, but it does seem to produce correct
results if the invocation tokens are the beginning of the uncached input.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Collaborator Author

gabe-l-hart commented Aug 14, 2025

One interesting update: For the specific adapters I'm using to test here, the invocation_tokens look like the beginning of a turn with the role certainty. I had originally been attempting to append this to the chat using the /chat/completions endpoint and appending {"role": "certainty", "content": ""}. This, however, resulted in the template expanding to <|start_of_role|>certainty<|end_of_role|>None<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|> which is not correct for these adapters.

I've updated my sniff test script above to use client-side template expansion and the raw /completions endpoint for the UQ requests. This is not ideal since it means that this style of adapter would require careful orchestration on the client side to use.

NOTE: This is a property of these adapters and not of aLoRA in general. Theoretically, an adapter could be trained to invoke on the full <|start_of_role|>certainty<|end_of_role|>None<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|> or a similar "last turn plus agent prompt."

@CISC
Collaborator

CISC commented Aug 14, 2025

One interesting update: For the specific adapters I'm using to test here, the invocation_tokens look like the beginning of a turn with the role certainty. I had originally been attempting to append this to the chat using the /chat/completions endpoint and appending {"role": "certainty", "content": ""}. This, however, resulted in the template expanding to <|start_of_role|>certainty<|end_of_role|>None<|end_of_turn|>\n<|start_of_role|>assistant<|end_of_role|> which is not correct for these adapters.

Add the following to your request to remove the assistant generation prompt:

"add_generation_prompt": false,

@gabe-l-hart
Collaborator Author

Add the following to your request to remove the assistant generation prompt:

Ah, yep, that will definitely help, but it won't eliminate the None<|end_of_text|> portion. Talking with @kgreenewald, it sounds like the team will be moving to a training pattern for these adapters that will be more friendly to the chat template going forward.

@CISC
Collaborator

CISC commented Aug 14, 2025

Add the following to your request to remove the assistant generation prompt:

Ah, yep, that will definitely help, but it won't eliminate the None<|end_of_text|> portion. Talking with @kgreenewald, it sounds like the team will be moving to a training pattern for these adapters that will be more friendly to the chat template going forward.

Ah, didn't notice that, I suppose that's just because the template doesn't properly handle unknown roles?

@gabe-l-hart
Collaborator Author

Yeah, the real issue is that it was trained to act like the generation prompt, so the activation sequence is intentionally an incomplete turn, but with a different role.

This currently limits each slot to a single enabled alora. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling, and it is not really a
well-studied case for alora since it's unclear whether one alora can reuse
cache from previous prefill computed with a different alora.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Collaborator Author

Update

I've now added support for correctly applying the alora only to the tokens starting with the invocation sequence. The changes look like the following:

  1. When an alora is requested, search the prompt tokens backwards for the invocation sequence
    • If no invocation sequence is found, simply disable the alora
    • NOTE (to self): Would this have any impact on subsequent calls? I don't think so, since the slot loras are always initialized from the server loras on each task
  2. When processing a slot, only pull tokens from cache up to the token before the start of the invocation sequence
    • NOTE (to self): We may need to allow for the case where we do want to pull cached tokens for the invocation sequence if they came from the same alora. This would be a strange use though, since it would require the user to send a request with the alora enabled but with no un-cached alora invocation strings, since the last occurrence is always what gets found.
  3. Once cached tokens have been filled, identify tokens that fall between the end of the cached tokens (slot.n_past) and the start of the invocation sequence. These should be prefilled without the alora, so the alora is temporarily disabled and the batch filling breaks at the token before the invocation start. The alora is then re-enabled with the correct scale so that the next task can finish prefill from the invocation start with the adapter enabled. (A rough sketch of steps 1-3 follows this list.)
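
The following is a rough Python sketch of steps 1-3 (hypothetical helper names; the real implementation lives in the C++ server code):

def find_last_subsequence(tokens: list[int], needle: list[int]) -> int:
    """Start index of the last occurrence of `needle` in `tokens`, or -1 if absent."""
    for start in range(len(tokens) - len(needle), -1, -1):
        if tokens[start:start + len(needle)] == needle:
            return start
    return -1

def plan_alora_prefill(prompt_tokens: list[int], invocation_tokens: list[int], n_cached: int) -> dict:
    """Decide how much of the prompt may reuse cache and where the alora kicks in."""
    inv_start = find_last_subsequence(prompt_tokens, invocation_tokens)
    if inv_start < 0:
        # Step 1: no invocation sequence found, so simply disable the alora
        return {"use_alora": False, "n_from_cache": n_cached}
    # Step 2: never reuse cached tokens at or past the invocation start
    n_from_cache = min(n_cached, inv_start)
    return {
        "use_alora": True,
        "n_from_cache": n_from_cache,
        # Step 3: tokens in [n_from_cache, inv_start) are prefilled with the alora
        # temporarily scaled to 0.0; prefill from inv_start onward runs with the
        # alora enabled at its requested scale.
        "plain_prefill_range": (n_from_cache, inv_start),
        "alora_prefill_from": inv_start,
    }

For example, with invocation_tokens = [6989, 24933, 49153] (the UQ alora above) and a prompt that ends with that sequence, only the tokens before the invocation start are eligible for cache reuse.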

Testing

I've got a few tweaks to my test script that allow it to exercise these conditions:

uq-req.py
import json
import time

from transformers import AutoTokenizer
import requests

tokenizer = AutoTokenizer.from_pretrained("/Users/ghart/models/granite-3.2-8b-instruct")

url = "http://localhost:8081"

documents = [
    {"text": "My name is Gabe"},
    {"text": "I work for IBM"}
]
messages = [{"role": "user", "content": "Who does Gabe work for?"}]

adapter_message = {
    "role": "certainty",
    "content": ""
}

# Run base messages
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/chat/completions", json={
    "model": "unused",
    "messages": messages,
    "chat_template_kwargs": {
        "documents": documents,
    },
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
assistant_resp = resp.json()["choices"][0]["message"]
print(f"ASSISTANT RESPONSE ({end-start}s):")
print(assistant_resp["content"])

# UNCOMMENT this to extend the assistant's response so that it isn't cached
"""
assistant_resp["content"] = assistant_resp["content"] + "\nRespect my authority!"
"""

# Create the serialized version as a string so we can append the right prompt
messages.append(assistant_resp)
raw_prompt = tokenizer.apply_chat_template(messages, documents=documents, tokenize=False)
uq_prompt = raw_prompt + "<|start_of_role|>certainty<|end_of_role|>"

# Run with both adapters disabled
# UNCOMMENT this to exercise the case where the invocation string itself has
# been cached without the adapter
"""
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/out adapters ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))
"""

# Run with the adapter and the prompt for UQ with the alora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "max_tokens": 100,
    "lora": [
        # alora
        {"id": 0, "scale": 1.0},
        # lora
        {"id": 1, "scale": 0.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ aLoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))

# Run with the adapter and the prompt for UQ with the lora enabled
print("----")
start = time.time()
resp = requests.post(f"{url}/v1/completions", json={
    "model": "unused",
    "prompt": uq_prompt,
    "temperature": 0.0,
    "max_tokens": 100,
    "lora": [
        # alora
        {"id": 0, "scale": 0.0},
        # lora
        {"id": 1, "scale": 1.0},
    ],
})
end = time.time()
js = resp.json()
uq_resp = js["choices"][0]["text"]
print(f"UQ RESPONSE w/ LoRA ({end-start}s)")
print(uq_resp)
print(">>")
print(json.dumps(js["usage"], indent=2))
print(json.dumps(js["timings"], indent=2))

Don't use cached invocation sequence from base model

This simulates the case where the user ran the invocation sequence through the base model without the adapter and those tokens are cached (uncomment starting at line 57)

----
ASSISTANT RESPONSE (0.3047969341278076s):
IBM
----
UQ RESPONSE w/out adapters (0.1103658676147461s)
high
>>
{
  "completion_tokens": 2,
  "prompt_tokens": 140,
  "total_tokens": 142
}
{
  "prompt_n": 6,
  "prompt_ms": 60.143,
  "prompt_per_token_ms": 10.023833333333334,
  "prompt_per_second": 99.7622333438638,
  "predicted_n": 2,
  "predicted_ms": 47.758,
  "predicted_per_token_ms": 23.879,
  "predicted_per_second": 41.877800577913646
}
----
UQ RESPONSE w/ aLoRA (0.2115638256072998s)
87%
>>
{
  "completion_tokens": 4,
  "prompt_tokens": 140,
  "total_tokens": 144
}
{
  "prompt_n": 3,
  "prompt_ms": 56.051,
  "prompt_per_token_ms": 18.683666666666667,
  "prompt_per_second": 53.52268469786445,
  "predicted_n": 4,
  "predicted_ms": 153.106,
  "predicted_per_token_ms": 38.2765,
  "predicted_per_second": 26.125690697947828
}
----
UQ RESPONSE w/ LoRA (2.164383888244629s)
85%

Based on the information provided, Gabe works for IBM. The document states "I work for IBM" and the name associated with this statement is Gabe.
>>
{
  "completion_tokens": 38,
  "prompt_tokens": 140,
  "total_tokens": 178
}
{
  "prompt_n": 140,
  "prompt_ms": 283.203,
  "prompt_per_token_ms": 2.022878571428571,
  "prompt_per_second": 494.3450457798823,
  "predicted_n": 38,
  "predicted_ms": 1877.568,
  "predicted_per_token_ms": 49.409684210526315,
  "predicted_per_second": 20.238947404301733
}

Don't use adapter for uncached tokens before invocation sequence

This simulates the case where, for some reason, there are additional tokens not pulled from cache that come before the invocation sequence (uncomment line 45)

----
ASSISTANT RESPONSE (0.30304694175720215s):
IBM
----
UQ RESPONSE w/ aLoRA (0.39620471000671387s)
87.5%
>>
{
  "completion_tokens": 6,
  "prompt_tokens": 146,
  "total_tokens": 152
}
{
  "prompt_n": 12,
  "prompt_ms": 136.993,
  "prompt_per_token_ms": 11.416083333333333,
  "prompt_per_second": 87.59571656945977,
  "predicted_n": 6,
  "predicted_ms": 256.634,
  "predicted_per_token_ms": 42.772333333333336,
  "predicted_per_second": 23.379598961945803
}
----
UQ RESPONSE w/ LoRA (1.1565330028533936s)
85%

Based on the information provided, Gabe works for IBM.
>>
{
  "completion_tokens": 18,
  "prompt_tokens": 146,
  "total_tokens": 164
}
{
  "prompt_n": 146,
  "prompt_ms": 287.421,
  "prompt_per_token_ms": 1.9686369863013697,
  "prompt_per_second": 507.9656670876519,
  "predicted_n": 18,
  "predicted_ms": 866.559,
  "predicted_per_token_ms": 48.14216666666667,
  "predicted_per_second": 20.771811267322825
}

Too much python 🤦

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Collaborator Author

I've now extended this to test with multiple aloras in the same conversation. Here's the setup:

Adapters

Adapters are converted using convert_lora_to_gguf.py.

Boot with adapters

./bin/llama-server \
  -m ~/models/granite-3.2-8b-instruct/granite-3.2-8B-instruct-F16.gguf \
  --lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf \
  --lora ~/models/granite-3.2-8b-alora-rag-answerability-prediction/granite-3.2-8B-alora-rag-answerability-prediction-F16-LoRA.gguf \
  --port 8081 \
  --jinja \
  --reasoning-budget 0

Test Script

(sorry, it requires my personal logging framework just 'cuz 😉... pip install alchemy-logging)

alora-chat.py
#!/usr/bin/env python3
"""
This is a simple implementation of an interactive chat that leverages several
aLoRA adapters during the flow
"""

# Standard
import argparse
import os

# First Party
import alog

# Third Party
import requests

log = alog.use_channel("MAIN")


def make_document(i: int, doc: str) -> dict:
    """Make a document dict from the given doc as either text or a path"""
    log.info("Adding document: %s", doc)
    if os.path.exists(doc):
        with open(doc, "r") as handle:
            return {"text": handle.read(), "doc_id": i, "title": doc}
    return{"text": doc, "doc_id": i}


def make_lora_req(adapter_ids: list[int], loras: list[int]) -> list[dict]:
    return [
        {"id": i, "scale": 1.0 if i in loras else 0.0}
        for i in adapter_ids
    ]


def make_chat_req(messages: list[dict], documents: list[dict], adapter_ids: list[int], loras: list[int]) -> dict:
    return {
        "messages": messages,
        "chat_template_kwargs": {
            "documents": documents,
        },
        "temperature": 0.0,
        "lora": make_lora_req(adapter_ids, loras),
    }


def make_completion_req(prompt: str, documents: list[dict], adapter_ids: list[int], loras: list[int], **kwargs) -> dict:
    kwargs.update({
        "prompt": prompt,
        "chat_template_kwargs": {
            "documents": documents,
        },
        "temperature": 0.0,
        "lora": make_lora_req(adapter_ids, loras),
    })
    return kwargs


def run_main_loop(host: str, documents: list[dict], uq_id: int, ans_id: int, adapter_ids: list[int]):
    """Run the main loop with questions"""
    help_cmd = "/?"
    doc_cmd = "/doc"
    reset_cmd = "/reset"
    quit_cmd = "/quit"
    doc_pfx = f"{doc_cmd} "

    def print_help():
        print("Commands:")
        print(f"{help_cmd}: Print help")
        print(f"{doc_cmd}: Add a document")
        print(f"{reset_cmd}: Reset the chat history")
        print(f"{quit_cmd}: Quit")

    messages = []
    print_help()
    while True:
        inp = input("?> ").strip()
        if inp == quit_cmd:
            break
        if not inp:
            continue
        if inp == help_cmd:
            print_help()
            continue
        if inp == reset_cmd:
            messages.clear()
            continue
        if inp.startswith(doc_pfx):
            doc = inp[len(doc_pfx):].lstrip()
            documents.append(make_document(len(documents), doc))
            continue

        # Apply the chat template with the user query
        user_message = {"role": "user", "content": inp}
        resp = requests.post(f"{host}/apply-template", json=make_chat_req(messages + [user_message], documents, adapter_ids, []))
        resp.raise_for_status()
        formatted_prompt = resp.json()["prompt"]
        log.debug4("Formatted prompt: %s", formatted_prompt)

        # Run the Answerability query
        ans_prompt = formatted_prompt + "<|end_of_text|>\n<|start_of_role|>answerability<|end_of_role|>"
        resp = requests.post(f"{host}/v1/completions", json=make_completion_req(ans_prompt, documents, adapter_ids, [ans_id], max_tokens=3))
        resp.raise_for_status()
        js = resp.json()
        answerability = js["choices"][0]["text"]
        log.debug("Answerability: %s", answerability)
        log.debug2("Usage: %s", js["usage"])
        log.debug2("Timings: %s", js["timings"])
        answerable = not answerability.split()[0].lower().startswith("unanswerable")
        if answerable:
            print(">> The question is answerable!")
        else:
            print(">> I'm sorry, but that question isn't answerable with the given context")
            if input("?> Do you want to try anyway [yN]? ").strip().lower() not in ["y", "yes"]:
                continue
        messages.append(user_message)

        # If not unanswerable, run the question and get the assistant's response
        resp = requests.post(f"{host}/v1/chat/completions", json=make_chat_req(messages, documents, adapter_ids, []))
        resp.raise_for_status()
        js = resp.json()
        assistant_msg = js["choices"][0]["message"]
        answer = assistant_msg["content"]
        messages.append(assistant_msg)
        print(f"ASSISTANT: {answer}")

        # Get the uncertainty
        formatted_prompt = requests.post(f"{host}/apply-template", json=make_chat_req(messages, documents, adapter_ids, [])).json()["prompt"]
        uq_prompt = formatted_prompt + "<|end_of_text|>\n<|start_of_role|>certainty<|end_of_role|>"
        resp = requests.post(f"{host}/v1/completions", json=make_completion_req(uq_prompt, documents, adapter_ids, [uq_id], max_tokens=5))
        resp.raise_for_status()
        js = resp.json()
        uq = js["choices"][0]["text"]
        print(f">> CERTAINTY: {uq}")
        log.debug2("Usage: %s", js["usage"])
        log.debug2("Timings: %s", js["timings"])

        print()


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    # Logging
    parser.add_argument("--log-level", "-l", default=os.getenv("LOG_LEVEL", "info"))
    parser.add_argument("--log-filters", "-lf", default=os.getenv("LOG_FILTERS", "urllib3.connectionpool:info"))
    parser.add_argument("--log-json", "-lj", action="store_true", default=os.getenv("LOG_JSON", "").lower() == "true")
    # Models
    parser.add_argument("--alora-uq", "-u", type=int, default=None, help="Adapter ID for the UQ adapter")
    parser.add_argument("--alora-answerability", "-a", type=int, default=None, help="Adapter ID for the Answerability adapter")
    # Server
    parser.add_argument("--host", "-s", default="http://localhost:8081", help="Host where llama-server is running")
    # Docs
    parser.add_argument("--document", "-d", nargs="+", help="document (text or path) to add as context")

    # Configure logging
    args = parser.parse_args()
    alog.configure(
        default_level=args.log_level,
        filters=args.log_filters,
        formatter="json" if args.log_json else "pretty",
        thread_id=True,
    )

    # Make sure llama-server is up!
    resp = requests.get(f"{args.host}/health")
    resp.raise_for_status()
    log.info("llama-server is up at %s", args.host)

    # Get the loaded adapters
    resp = requests.get(f"{args.host}/lora-adapters")
    adapters = resp.json()
    adapter_ids = [entry["id"] for entry in adapters]

    # Figure out which adapter is which
    uq_id = args.alora_uq
    if uq_id is None:
        candidates = [entry for entry in adapters if "uncertainty" in entry["path"]]
        assert len(candidates) == 1, "Couldn't auto-deduce UQ adapter ID"
        uq_id = candidates[0]["id"]
    ans_id = args.alora_answerability
    if ans_id is None:
        candidates = [entry for entry in adapters if "answerability" in entry["path"]]
        assert len(candidates) == 1, "Couldn't auto-deduce Answerability adapter ID"
        ans_id = candidates[0]["id"]
    log.info("UQ aLoRA ID: %d, Answerability aLoRA ID: %d", uq_id, ans_id)

    # Load documents
    documents = []
    for i, doc in enumerate(args.document or []):
        documents.append(make_document(i, doc))

    # Start the prompt loop
    log.info("Starting main loop")
    run_main_loop(args.host, documents, uq_id, ans_id, adapter_ids)

if __name__ == "__main__":
    main()

Example Output

(llama.cpp) ghart@Mac [llama.cpp gabe-l-hart/alora-support ?~]$ python alora-chat.py -d "My name is Gabe" "I work for IBM" 
2025-08-18T21:20:46.381939 [MAIN :INFO:8299700416] llama-server is up at http://localhost:8081
2025-08-18T21:20:46.383507 [MAIN :INFO:8299700416] UQ aLoRA ID: 0, Answerability aLoRA ID: 1
2025-08-18T21:20:46.383543 [MAIN :INFO:8299700416] Adding document: My name is Gabe
2025-08-18T21:20:46.383571 [MAIN :INFO:8299700416] Adding document: I work for IBM
2025-08-18T21:20:46.383591 [MAIN :INFO:8299700416] Starting main loop
Commands:
/?: Print help
/doc: Add a document
/reset: Reset the chat history
/quit: Quit
?> Where does Gabe work?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? 
?> What company does Gabe work for?
>> The question is answerable!
ASSISTANT: IBM
>> CERTAINTY: 88%

?> How about Bob? Who does he work for?                                                           
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: I am sorry, the question is unanswerable from the provided document.
>> CERTAINTY: 60.75

?> /doc Bob works for Widgets Inc.
2025-08-18T21:22:20.766356 [MAIN :INFO:8299700416] Adding document: Bob works for Widgets Inc.
?> Try again. Where does Bob work?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: Bob works for Widgets Inc.
>> CERTAINTY: 75.85

?> Alright, time for something different. Write a haiku about python logging frameworks
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: Python logs with ease,
Structured, colored, or plain,
Logging, a breeze.
>> CERTAINTY: 40.55

?>

(NOTE: It's clear from my experiments that these adapters are not particularly robust, but that's a property of these specific adapters, which are being continuously refined!)

@gabe-l-hart gabe-l-hart marked this pull request as ready for review August 18, 2025 21:29
@gabe-l-hart gabe-l-hart requested a review from ngxson as a code owner August 18, 2025 21:29
@gabe-l-hart
Collaborator Author

I realized that my local adapter_config.json files have the updates from "invocation_string" to "alora_invocation_tokens". These changes will eventually be pushed up to the hosted adapters. For compatibility, I'm going to add the automated tokenization in the python conversion layer.
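
For illustration, a sketch of what that compatibility path could look like in convert_lora_to_gguf.py (this is an assumption about the approach, not the actual code; it assumes the base model's tokenizer is available at conversion time):

from transformers import AutoTokenizer

def get_alora_invocation_tokens(adapter_config: dict, base_model_path: str) -> list[int]:
    """Prefer the new field; fall back to tokenizing the legacy invocation_string."""
    if "alora_invocation_tokens" in adapter_config:
        return adapter_config["alora_invocation_tokens"]
    invocation_string = adapter_config.get("invocation_string")
    if not invocation_string:
        return []
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    # add_special_tokens=False so no BOS/EOS sneaks into the invocation sequence
    return tokenizer.encode(invocation_string, add_special_tokens=False)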

…er_config.json

While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string", so this enables backwards compatibility and allows
testing now (before the PEFT PR changes have percolated everywhere).

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart
Collaborator Author

The other dependency for this PR is #15404. The functionality is not linked at all, but the above chat script will fail when trying to perform the chat template expansion without the fix there.

@gabe-l-hart
Collaborator Author

gabe-l-hart commented Aug 18, 2025

One additional note: These adapters seem to still work well when attached to a quantized model, so using them doesn't require giving up the speed/footprint benefits of quantization.

./bin/llama-server -m ~/models/granite-3.2-8b-instruct/ggml-model-Q4_K_M.gguf --lora ~/models/granite-3.2-8b-alora-uncertainty/granite-3.2-8B-alora-uncertainty-F16-LoRA.gguf --lora ~/models/granite-3.2-8b-alora-rag-answerability-prediction/granite-3.2-8B-alora-rag-answerability-prediction-F16-LoRA.gguf --port 8081 --jinja --reasoning-budget 0
(llama.cpp) ghart@Mac [llama.cpp gabe-l-hart/alora-support ?~]$ python alora-chat.py -d "My name is Gabe" "I work for IBM" 
2025-08-18T22:08:48.310437 [MAIN :INFO:8299700416] llama-server is up at http://localhost:8081
2025-08-18T22:08:48.311779 [MAIN :INFO:8299700416] UQ aLoRA ID: 0, Answerability aLoRA ID: 1
2025-08-18T22:08:48.311808 [MAIN :INFO:8299700416] Adding document: My name is Gabe
2025-08-18T22:08:48.311832 [MAIN :INFO:8299700416] Adding document: I work for IBM
2025-08-18T22:08:48.311851 [MAIN :INFO:8299700416] Starting main loop
Commands:
/?: Print help
/doc: Add a document
/reset: Reset the chat history
/quit: Quit
?> What company does Gabe work for?
>> The question is answerable!
ASSISTANT: IBM
>> CERTAINTY: 87%

?> How about Bob? Who does Bob work for?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: I am sorry, the information about Bob's employer is not available in the provided document.
>> CERTAINTY: 60.64

?> /doc Bob works for Widgets Inc
2025-08-18T22:10:43.037856 [MAIN :INFO:8299700416] Adding document: Bob works for Widgets Inc
?> /doc Bob's favorite ice cream is Mint Chocolate Chip
2025-08-18T22:10:58.262979 [MAIN :INFO:8299700416] Adding document: Bob's favorite ice cream is Mint Chocolate Chip
?> Try again. Can you tell me who Bob works for now?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: Apologies for any confusion, but the document provided does not contain information about Bob's employer.
>> CERTAINTY: 30.64

?> What company does Bob work for?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: I'm sorry, the information about Bob's employer is not available in the provided document.
>> CERTAINTY: 40.00

?> Who works for Widgets Inc?
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: Bob works for Widgets Inc, as per the information in the provided document.
>> CERTAINTY: 50.55

?> What's Bob's favorite Ice Cream?
>> The question is answerable!
ASSISTANT: Bob's favorite ice cream is Mint Chocolate Chip, according to the information I have.
>> CERTAINTY: 65.55

?> cool, what about Gabe? 
>> I'm sorry, but that question isn't answerable with the given context
?> Do you want to try anyway [yN]? y
ASSISTANT: I'm sorry, the document does not provide information about Gabe's ice cream preference.
>> CERTAINTY: 40.55

EDIT: I also tried with MXFP4_MOE and the results seem to be closer to F16 than to Q4_K_M.

@gabe-l-hart
Collaborator Author

Also important to test will be concurrent requests to the same alora. It's possible that these could end up in the same slot, and due to the logic for running pre-invocation tokens without the adapter, they could pollute a single batch.

Labels: examples, python (python script changes), server

Successfully merging this pull request may close these issues:

Feature Request: Support for Activated LoRA