Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
It would be highly beneficial if llama.cpp supported an option in its inference APIs (e.g., `llama_tokenize`, `llama_eval`, etc.) to return the log probability or probability distribution of each token generated during inference.
This feature is particularly useful for tasks such as:
- Confidence estimation in generated outputs
- Building applications involving uncertainty modeling
- Language model calibration studies
- Advanced prompting workflows that rely on token-level analysis
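For clarity, the per-token quantity being requested is the log-softmax of that token's logit over the vocabulary, which is what APIs such as OpenAI's expose as `logprobs`:

$$
\log p(t \mid \text{context}) = z_t - \log \sum_{j=1}^{|V|} \exp(z_j)
$$

where $z$ is the logit vector llama.cpp already computes for the current position and $|V|$ is the vocabulary size.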
Motivation
Currently, users have to resort to workarounds, such as manually wrapping the raw logits or patching the source to extract token probabilities (a minimal example is sketched after this list). These workarounds:
- Increase maintenance overhead
- Reduce usability for research and production use cases
- Deter new contributors and developers from integrating llama.cpp
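Below is a minimal sketch of such a workaround, using the existing C API (`llama_get_logits_ith`, `llama_get_model`, `llama_n_vocab`) together with a log-softmax over the vocabulary; exact function names vary between llama.cpp versions, and the model/batch/decode setup is assumed to exist elsewhere:

```cpp
#include <algorithm>
#include <cmath>

#include "llama.h"

// Sketch only: after llama_decode() has produced logits for output position `i`,
// compute the log probability of `token` at that position.
static float token_logprob(llama_context * ctx, int32_t i, llama_token token) {
    const float * logits  = llama_get_logits_ith(ctx, i);
    const int32_t n_vocab = llama_n_vocab(llama_get_model(ctx));

    // numerically stable log-softmax: log p(token) = z_token - (max + log Σ exp(z - max))
    float max_logit = logits[0];
    for (int32_t v = 1; v < n_vocab; ++v) {
        max_logit = std::max(max_logit, logits[v]);
    }
    double sum_exp = 0.0;
    for (int32_t v = 0; v < n_vocab; ++v) {
        sum_exp += std::exp(logits[v] - max_logit);
    }
    return logits[token] - (max_logit + (float) std::log(sum_exp));
}
```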
Possible Implementation
- Add an optional flag or method in the inference API to return per-token log probabilities (similar to `logprobs` in the OpenAI API).
- Ensure the output can be toggled off so there is no performance penalty when it is not needed (an illustrative sketch follows this list).
- Update the documentation with example usage.
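As a purely illustrative, self-contained sketch of what such an opt-in, OpenAI-style top-k logprobs result could look like (every name below is hypothetical and not part of llama.cpp's API), the helper converts one position's logits into sorted log probabilities, the idea being that this work only runs when the caller has opted in:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// HYPOTHETICAL illustration only -- this type and helper are not part of llama.cpp.
struct token_logprob {
    int32_t token;
    float   logprob;
};

// Given the raw logits for one output position, return the top-k tokens with
// their log probabilities (log-softmax), mirroring the OpenAI `logprobs` field.
// Intended to run only when the caller explicitly opted in, so the default
// sampling path pays no extra cost.
std::vector<token_logprob> top_logprobs(const float * logits, int32_t n_vocab, int32_t k) {
    // numerically stable log-softmax normalizer
    const float max_logit = *std::max_element(logits, logits + n_vocab);
    double sum_exp = 0.0;
    for (int32_t v = 0; v < n_vocab; ++v) {
        sum_exp += std::exp(logits[v] - max_logit);
    }
    const float log_z = max_logit + (float) std::log(sum_exp);

    std::vector<token_logprob> all(n_vocab);
    for (int32_t v = 0; v < n_vocab; ++v) {
        all[v] = { v, logits[v] - log_z };
    }

    k = std::min(k, n_vocab);
    std::partial_sort(all.begin(), all.begin() + k, all.end(),
                      [](const token_logprob & a, const token_logprob & b) { return a.logprob > b.logprob; });
    all.resize(k);
    return all;
}
```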
This feature has been widely adopted in APIs like OpenAI’s and HuggingFace Transformers, and its inclusion would increase llama.cpp’s utility in academic and production settings.
If this sounds like a good fit for the project, I’d be happy to help explore or prototype the implementation!