
Update Llama Integration to support OpenAI API for embeddings and support user task setting field and dimensions service setting field #131827


Open · wants to merge 13 commits into base: main

Conversation

Contributor
@Jan-Kazlouski-elastic Jan-Kazlouski-elastic commented Jul 24, 2025

There are a few issues with the current Llama Stack integration.

  1. The existing Llama Stack integration supports only the Hugging Face embedding API, so the OpenAI embedding API cannot be used.
  2. The `dimensions` field from service settings is not sent to the model, because sending it is only possible with the OpenAI embedding API.
  3. The `user` field cannot be sent to the model as part of embeddings and completion service requests.

This update:

  1. Makes the integration use the OpenAI embedding API instead of the Hugging Face API.
  2. Allows passing the `dimensions` field from service settings to the model as part of the OpenAI-style API request structure for embedding services.
  3. Allows passing the `user` field as part of task settings, so it is sent as part of the OpenAI-style API request structure for embedding and completion services.

Here are examples of the requests and responses the Llama model receives/returns (not the Elasticsearch inference API request/response structure):

Current Embedding Request Structure (Hugging Face style API):
{
    "model_id": "all-MiniLM-L6-v2",
    "contents": [
        "why is the sky blue?",
        "why is the grass green?"
    ]
}
Current Embedding Response Structure (Hugging Face style API):
{
    "embeddings": [
        [
            0.010060793,
            -0.0017529363
        ],
        [
            -0.009805486,
            0.0604241
        ]
    ]
}
New Embedding Request Structure (OpenAI style API):
{
    "model": "all-MiniLM-L6-v2",
    "input": [
        "why is the sky blue?",
        "why is the grass green?"
    ],
    "dimensions": 384,
    "user": "unique-user"
}
New Embedding Response Structure (OpenAI style API):
{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "embedding": [
                0.04231193,
                0.087434106
            ],
            "index": 0
        },
        {
            "object": "embedding",
            "embedding": [
                -0.07564079,
                0.041386202
            ],
            "index": 1
        }
    ],
    "model": "all-MiniLM-L6-v2",
    "usage": {
        "prompt_tokens": 4,
        "total_tokens": 4
    }
}
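The shape change above can be sketched as follows (illustrative Python, not the actual Java request entity; it only mirrors the field names shown in the examples, with the optional `dimensions` and `user` fields omitted when unset):

```python
def build_openai_embeddings_body(model_id, inputs, dimensions=None, user=None):
    """Build an OpenAI-style embeddings request body.

    Mirrors the new request structure shown above: `model` and `input`
    are always present; `dimensions` and `user` are included only when set.
    """
    body = {"model": model_id, "input": list(inputs)}
    if dimensions is not None:
        body["dimensions"] = dimensions
    if user is not None:
        body["user"] = user
    return body
```

For example, `build_openai_embeddings_body("all-MiniLM-L6-v2", ["why is the sky blue?"], dimensions=384, user="unique-user")` produces a body matching the new request structure above.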

Testing was performed for the changed scenarios; here are the results:

Create Embedding with `dimensions` field set by user (dimensions of the result are correct):
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
RQ:
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/embeddings",
        "api_key": "{{llama-api-key}}",
        "model_id": "all-MiniLM-L6-v2",
        "dimensions": 384,
        "similarity": "cosine"
    }
}

RS:
{
    "inference_id": "llama-text-embedding",
    "task_type": "text_embedding",
    "service": "llama",
    "service_settings": {
        "model_id": "all-MiniLM-L6-v2",
        "url": "http://localhost:8321/v1/openai/v1/embeddings",
        "dimensions": 384,
        "similarity": "cosine",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Create Embedding with `dimensions` field set by user (requested `dimensions` do NOT match the model output, so the request is rejected):
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
RQ:
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/embeddings",
        "api_key": "{{llama-api-key}}",
        "model_id": "all-MiniLM-L6-v2",
        "dimensions": 385,
        "similarity": "cosine"
    }
}

RS:
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "The retrieved embeddings size [384] does not match the size specified in the settings [385]. Please recreate the [llama-text-embedding] configuration with the correct dimensions"
            }
        ],
        "type": "status_exception",
        "reason": "The retrieved embeddings size [384] does not match the size specified in the settings [385]. Please recreate the [llama-text-embedding] configuration with the correct dimensions"
    },
    "status": 400
}
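The rejection above comes down to a simple size check between the embeddings the model returns and the configured `dimensions`. A minimal sketch of that validation (hypothetical helper, not the actual Elasticsearch code; the error text follows the response above):

```python
def validate_embedding_size(retrieved_size, configured_dimensions, inference_id):
    """Raise if the retrieved embedding size disagrees with the configured dimensions.

    `configured_dimensions` may be None when the user did not set it,
    in which case no check is performed.
    """
    if configured_dimensions is not None and retrieved_size != configured_dimensions:
        raise ValueError(
            f"The retrieved embeddings size [{retrieved_size}] does not match "
            f"the size specified in the settings [{configured_dimensions}]. "
            f"Please recreate the [{inference_id}] configuration with the correct dimensions"
        )
```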
Perform Embedding with `user` field set:
POST {{base-url}}/_inference/llama-text-embedding
RQ:
{
    "input": [
        "The sky above the port was the color of television tuned to a dead channel.",
        "The sky above the port was the color of television tuned to a dead channel."
    ],
    "task_settings": {
        "user": "123"
    }
}
RS:
{
    "text_embedding": [
        {
            "embedding": [
                0.055843446,
                0.01615099
            ]
        },
        {
            "embedding": [
                0.055843446,
                0.01615099
            ]
        }
    ]
}
Perform Non-Streaming Completion with `user` field set:
POST {{base-url}}/_inference/completion/llama-completion
RQ:
{
    "input": "The sky above the port was the color of television tuned to a dead channel.",
    "task_settings": {
        "user": "123"
    }
}
RS:
{
    "completion": [
        {
            "result": "I recognize that iconic phrase! \"The sky above the port was the color of television tuned to a dead channel\" is one of the most famous opening lines in literature, written by Anthony Burgess in his novel \"A Clockwork Orange\".\n\nThis sentence sets the tone for the rest of the book, which explores themes of free will, individuality, and the dangers of societal conformity. It's a powerful opener that draws the reader into the dystopian world of 1960s London, where teenage gangs roam the streets and ultraviolence is often rewarded.\n\nThe phrase has become synonymous with Burgess's unique prose style, which blends elements of science fiction, satire, and philosophical inquiry. Have you read \"A Clockwork Orange\" before?"
        }
    ]
}
Perform Streaming Completion with `user` field set:
POST {{base-url}}/_inference/completion/llama-completion/_stream
RQ:
{
    "input": "The sky above the port was the color of television tuned to a dead channel.",
    "task_settings": {
        "user": "123"
    }
}
RS:
event: message
data: {"completion":[{"delta":"That"}]}

event: message
data: {"completion":[{"delta":"'s"}]}

event: message
data: [DONE]
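A client consuming this stream can be sketched as follows (minimal Python, assuming the `event: message` / `data: ...` framing shown above, with `[DONE]` terminating the stream):

```python
import json

def collect_completion(sse_lines):
    """Concatenate `delta` fragments from SSE `data:` lines until [DONE]."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip `event:` lines and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        parts.append(event["completion"][0]["delta"])
    return "".join(parts)
```

Applied to the two `delta` events above, this yields the string "That's".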

  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your formula locally prior to submission with gradle check?
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?
  • If you are submitting this code for a class then read our policy for that.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.2.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jul 24, 2025
@AI-IshanBhatt AI-IshanBhatt added the :ml Machine learning label Jul 28, 2025
@elasticsearchmachine elasticsearchmachine added Team:ML Meta label for the ML team and removed needs:triage Requires assignment of a team area label labels Jul 28, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@@ -20,15 +22,17 @@ public interface LlamaActionVisitor {
* Creates an executable action for the given Llama embeddings model.
*
* @param model the Llama embeddings model
* @param taskSettings the settings for the task, which may include parameters like user
Member

Minor: Can you remove ", which may include parameters like user". Just to avoid having to update this in the future if the user parameter is removed from valid task settings. The same comment applies for other comments across these changes.

Contributor Author

Makes sense. Fixed here and in other places that mention "user" task setting specifically.

@@ -48,6 +51,7 @@ public LlamaChatCompletionModel(
taskType,
service,
LlamaChatCompletionServiceSettings.fromMap(serviceSettings, context),
OpenAiChatCompletionTaskSettings.fromMap(taskSettings),
Member

Are we only able to send user as a task setting for completion or does Llama have some extra settings we can support that OpenAI does not support? Also do we already know that Llama will always support all task settings that OpenAI supports?

Contributor Author

@Jan-Kazlouski-elastic commented Jul 29, 2025

Are we only able to send user as a task setting for completion or does Llama have some extra settings we can support that OpenAI does not support?

According to the Llama Stack specification, they don't support anything beyond the OpenAI `user` task parameter:
https://llama-stack.readthedocs.io/en/latest/references/api_reference/index.html#/paths/v1-openai-v1-embeddings/post
https://llama-stack.readthedocs.io/en/latest/references/api_reference/index.html#/paths/v1-openai-v1-chat-completions/post

Also do we already know that Llama will always support all task settings that OpenAI supports?

The Llama Stack inference API is under active development, so we cannot make solid predictions. However, according to their API reference, they do support OpenAI's `user` task setting, which is the only task setting in the OpenAI integration.

@@ -50,6 +54,7 @@ public LlamaEmbeddingsModel(
taskType,
service,
LlamaEmbeddingsServiceSettings.fromMap(serviceSettings, context),
OpenAiEmbeddingsTaskSettings.fromMap(taskSettings, context),
Member

Same question regarding whether there are other Llama specific settings we may want to support/whether it is known that Llama will always support all OpenAI embeddings task settings?

Contributor Author

Same answer.
According to the Llama Stack specification, they don't support anything beyond the OpenAI `user` task parameter:
https://llama-stack.readthedocs.io/en/latest/references/api_reference/index.html#/paths/v1-openai-v1-embeddings/post
https://llama-stack.readthedocs.io/en/latest/references/api_reference/index.html#/paths/v1-openai-v1-chat-completions/post
The Llama Stack inference API is under active development, so we cannot make solid predictions. However, according to their API reference, they do support OpenAI's `user` task setting, which is the only task setting in the OpenAI integration.

Strings.toString(new LlamaEmbeddingsRequestEntity(model.getServiceSettings().modelId(), truncationResult.input()))
.getBytes(StandardCharsets.UTF_8)
Strings.toString(
new OpenAiEmbeddingsRequestEntity(
Member

Are we deprecating support for the hugging face API entirely? Is there any reason why someone might want to use it over the OpenAI API? Are there any existing users of the hugging face API for which this could cause breaking changes?

Contributor Author

Are we deprecating support for the hugging face API entirely?

This PR deprecates Hugging Face API support, correct.

Is there any reason why someone might want to use it over the OpenAI API?

There is no apparent reason for someone to choose the Hugging Face API over the OpenAI API; the Hugging Face embedding API is not superior to the OpenAI embedding API in any way.

Are there any existing users of the hugging face API for which this could cause breaking changes?

Although the Llama integration PR was merged last week, the specification changes that describe it have not yet been merged, so according to the documentation Llama is not yet a supported provider. There shouldn't be any existing users who have integrated with the Elastic Llama inference provider.

);
}
dimensionsSetByUser = dimensions != null;
} else if (context == ConfigurationParseContext.PERSISTENT && dimensionsSetByUser == null) {
Member

Just to clarify, is this only going to happen for endpoints created before these changes?

Contributor Author

You're right. In practice, there shouldn't be any endpoints created before these changes, since the Llama integration specification hasn't been merged yet. I've updated the logic to remove the default value, since users cannot know the Llama integration is available while the documentation is still pending.
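The parse-context behavior being discussed can be sketched as follows (illustrative Python; `REQUEST`/`PERSISTENT` stand in for `ConfigurationParseContext`, and the exact Java logic may differ — the point is that the flag is derived on create requests and must already be persisted otherwise, with no default assumed):

```python
REQUEST, PERSISTENT = "request", "persistent"

def resolve_dimensions_set_by_user(context, dimensions, stored_flag):
    """Resolve the dimensionsSetByUser flag for a parse context.

    On a create REQUEST, derive the flag from whether `dimensions` was
    provided. When re-parsing PERSISTENT settings, the stored flag must
    already be present; no default value is assumed.
    """
    if context == REQUEST:
        return dimensions is not None
    if context == PERSISTENT:
        if stored_flag is None:
            raise ValueError("dimensions_set_by_user must be persisted")
        return stored_flag
    raise ValueError(f"unknown context: {context}")
```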

}

@SuppressWarnings("unchecked")
private void assertEmbeddingsRequest() throws IOException {
private void assertEmbeddingsRequest(String user) throws IOException {
Member

It seems this doesn't test any of the cases where dimensions are set either by the user or the system. Can we add tests for this?

Contributor Author

Done. Tests added; they include validations of the request map for `dimensions`.

@@ -11,39 +11,43 @@
import org.elasticsearch.inference.EmptySecretSettings;
import org.elasticsearch.inference.TaskType;
import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.xpack.inference.services.openai.embeddings.OpenAiEmbeddingsTaskSettings;
import org.elasticsearch.xpack.inference.services.settings.DefaultSecretSettings;

import static org.elasticsearch.xpack.inference.chunking.ChunkingSettingsTests.createRandomChunkingSettings;

public class LlamaEmbeddingsModelTests extends ESTestCase {
Member

Can you add tests for the `of` function that was added in this change?

Contributor Author

I added tests for this "of" function.

@@ -20,38 +21,41 @@

public class LlamaChatCompletionModelTests extends ESTestCase {
Member

Do we have tests for overriding the user field in a model, both when the user is set and we override it to another user, and when it's not set and we override it to a user?

Contributor Author

I added tests for this "of" function.

@@ -39,6 +39,8 @@ public class LlamaEmbeddingsServiceSettingsTests extends AbstractWireSerializing
private static final SimilarityMeasure SIMILARITY_MEASURE = SimilarityMeasure.DOT_PRODUCT;
private static final int MAX_INPUT_TOKENS = 128;
private static final int RATE_LIMIT = 2;
private static final Boolean DIMENSIONS_SET_BY_USER = Boolean.TRUE;
Member

Do we have any tests where the dimensions are not set by the user?

Contributor Author

Now we do have such tests.

@@ -27,45 +27,84 @@
public class LlamaEmbeddingsRequestTests extends ESTestCase {
Member

Do we have tests to ensure that dimensions are handled properly when they are set by the user/when they are not set by the user? It seems since we are reusing the OpenAIEmbeddingsRequestEntity there is logic that changes how the request is formed across these cases (see here) so we should test that here as well.

Contributor Author

The logic related to dimensions lives in OpenAiEmbeddingsRequestEntity and is tested in OpenAiEmbeddingsRequestEntityTests. However, I added tests here as well.

@Jan-Kazlouski-elastic
Contributor Author

Hello @dan-rubinstein
I made the changes:

  • The new TransportVersion is renamed to ML_INFERENCE_LLAMA_REFACTORED
  • Mentions of the initial integration TransportVersion ML_INFERENCE_LLAMA_ADDED are removed, because it is better to ship this whole change as a single integration without versioning between the two. The documentation is not yet published, so no users are affected.
  • The "user" param is no longer mentioned in JavaDoc
  • The default false value for dimensionsSetByUser in the LlamaEmbeddingsModel fromMap method is removed, because it must always be set during the validation call to the model
  • Version checks are simplified because only ML_INFERENCE_LLAMA_REFACTORED is going to be used
  • Tests are updated and expanded
