
Update Llama Integration to support OpenAI API for embeddings and support user task setting field and dimensions service setting field #131827


Open · wants to merge 13 commits into base: main

Conversation

Contributor
@Jan-Kazlouski-elastic Jan-Kazlouski-elastic commented Jul 24, 2025

There are a few issues with the current Llama Stack integration.

  1. The existing Llama Stack integration supports only the Hugging Face embedding API, so the OpenAI embedding API cannot be used.
  2. The `dimensions` field from service settings is not sent to the model, because sending it is only possible with the OpenAI embedding API.
  3. The `user` field cannot be sent to the model as part of embeddings and completion service requests.

This update:

  1. Makes the integration use the OpenAI embedding API instead of the Hugging Face API.
  2. Allows passing the `dimensions` field from service settings to the model as part of the OpenAI-style API request structure for embedding services.
  3. Allows passing the `user` field as part of task settings, so it is sent as part of the OpenAI-style API request structure for embedding and completion services.

Here are examples of the requests and responses the Llama model receives/returns (not the Elasticsearch inference API request/response structure):

Current Embedding Request Structure (Hugging Face style API):
{
    "model_id": "all-MiniLM-L6-v2",
    "contents": [
        "why is the sky blue?",
        "why is the grass green?"
    ]
}
Current Embedding Response Structure (Hugging Face style API):
{
    "embeddings": [
        [
            0.010060793,
            -0.0017529363
        ],
        [
            -0.009805486,
            0.0604241
        ]
    ]
}
New Embedding Request Structure (OpenAI style API):
{
    "model": "all-MiniLM-L6-v2",
    "input": [
        "why is the sky blue?",
        "why is the grass green?"
    ],
    "dimensions": 384,
    "user": "unique-user"
}
New Embedding Response Structure (OpenAI style API):
{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "embedding": [
                0.04231193,
                0.087434106
            ],
            "index": 0
        },
        {
            "object": "embedding",
            "embedding": [
                -0.07564079,
                0.041386202
            ],
            "index": 1
        }
    ],
    "model": "all-MiniLM-L6-v2",
    "usage": {
        "prompt_tokens": 4,
        "total_tokens": 4
    }
}
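The shape change above can be sketched as follows (illustrative Python, not the actual Java request entity; it only mirrors the field names shown in the examples, with the optional `dimensions` and `user` fields omitted when unset):

```python
def build_openai_embeddings_body(model_id, inputs, dimensions=None, user=None):
    """Build an OpenAI-style embeddings request body.

    Mirrors the new request structure shown above: `model` and `input`
    are always present; `dimensions` and `user` are included only when set.
    """
    body = {"model": model_id, "input": list(inputs)}
    if dimensions is not None:
        body["dimensions"] = dimensions
    if user is not None:
        body["user"] = user
    return body
```

For example, `build_openai_embeddings_body("all-MiniLM-L6-v2", ["why is the sky blue?"], dimensions=384, user="unique-user")` produces a body matching the new request structure above.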

Testing was performed for the changed scenarios; here are the results:

Create Embedding with `dimensions` field set by user (dimensions of the result are correct):
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
RQ:
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/embeddings",
        "api_key": "{{llama-api-key}}",
        "model_id": "all-MiniLM-L6-v2",
        "dimensions": 384,
        "similarity": "cosine"
    }
}

RS:
{
    "inference_id": "llama-text-embedding",
    "task_type": "text_embedding",
    "service": "llama",
    "service_settings": {
        "model_id": "all-MiniLM-L6-v2",
        "url": "http://localhost:8321/v1/openai/v1/embeddings",
        "dimensions": 384,
        "similarity": "cosine",
        "rate_limit": {
            "requests_per_minute": 3000
        }
    },
    "chunking_settings": {
        "strategy": "sentence",
        "max_chunk_size": 250,
        "sentence_overlap": 1
    }
}
Create Embedding with `dimensions` field set by user (requested `dimensions` do NOT match the model output, so the request is rejected):
PUT {{base-url}}/_inference/text_embedding/llama-text-embedding
RQ:
{
    "service": "llama",
    "service_settings": {
        "url": "http://localhost:8321/v1/openai/v1/embeddings",
        "api_key": "{{llama-api-key}}",
        "model_id": "all-MiniLM-L6-v2",
        "dimensions": 385,
        "similarity": "cosine"
    }
}

RS:
{
    "error": {
        "root_cause": [
            {
                "type": "status_exception",
                "reason": "The retrieved embeddings size [384] does not match the size specified in the settings [385]. Please recreate the [llama-text-embedding] configuration with the correct dimensions"
            }
        ],
        "type": "status_exception",
        "reason": "The retrieved embeddings size [384] does not match the size specified in the settings [385]. Please recreate the [llama-text-embedding] configuration with the correct dimensions"
    },
    "status": 400
}
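The rejection above comes down to a simple size check between the embeddings the model returns and the configured `dimensions`. A minimal sketch of that validation (hypothetical helper, not the actual Elasticsearch code; the error text follows the response above):

```python
def validate_embedding_size(retrieved_size, configured_dimensions, inference_id):
    """Raise if the retrieved embedding size disagrees with the configured dimensions.

    `configured_dimensions` may be None when the user did not set it,
    in which case no check is performed.
    """
    if configured_dimensions is not None and retrieved_size != configured_dimensions:
        raise ValueError(
            f"The retrieved embeddings size [{retrieved_size}] does not match "
            f"the size specified in the settings [{configured_dimensions}]. "
            f"Please recreate the [{inference_id}] configuration with the correct dimensions"
        )
```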
Perform Embedding with `user` field set:
POST {{base-url}}/_inference/llama-text-embedding
RQ:
{
    "input": [
        "The sky above the port was the color of television tuned to a dead channel.",
        "The sky above the port was the color of television tuned to a dead channel."
    ],
    "task_settings": {
        "user": "123"
    }
}
RS:
{
    "text_embedding": [
        {
            "embedding": [
                0.055843446,
                0.01615099
            ]
        },
        {
            "embedding": [
                0.055843446,
                0.01615099
            ]
        }
    ]
}
Perform Non-Streaming Completion with `user` field set:
POST {{base-url}}/_inference/completion/llama-completion
RQ:
{
    "input": "The sky above the port was the color of television tuned to a dead channel.",
    "task_settings": {
        "user": "123"
    }
}
RS:
{
    "completion": [
        {
            "result": "I recognize that iconic phrase! \"The sky above the port was the color of television tuned to a dead channel\" is one of the most famous opening lines in literature, written by Anthony Burgess in his novel \"A Clockwork Orange\".\n\nThis sentence sets the tone for the rest of the book, which explores themes of free will, individuality, and the dangers of societal conformity. It's a powerful opener that draws the reader into the dystopian world of 1960s London, where teenage gangs roam the streets and ultraviolence is often rewarded.\n\nThe phrase has become synonymous with Burgess's unique prose style, which blends elements of science fiction, satire, and philosophical inquiry. Have you read \"A Clockwork Orange\" before?"
        }
    ]
}
Perform Streaming Completion with `user` field set:
POST {{base-url}}/_inference/completion/llama-completion/_stream
RQ:
{
    "input": "The sky above the port was the color of television tuned to a dead channel.",
    "task_settings": {
        "user": "123"
    }
}
RS:
event: message
data: {"completion":[{"delta":"That"}]}

event: message
data: {"completion":[{"delta":"'s"}]}

event: message
data: [DONE]
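A client consuming this stream can be sketched as follows (minimal Python, assuming the `event: message` / `data: ...` framing shown above, with `[DONE]` terminating the stream):

```python
import json

def collect_completion(sse_lines):
    """Concatenate `delta` fragments from SSE `data:` lines until [DONE]."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip `event:` lines and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        parts.append(event["completion"][0]["delta"])
    return "".join(parts)
```

Applied to the two `delta` events above, this yields the string "That's".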

  • Have you signed the contributor license agreement?
  • Have you followed the contributor guidelines?
  • If submitting code, have you built your formula locally prior to submission with gradle check?
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed.
  • If submitting code, have you checked that your submission is for an OS and architecture that we support?
  • If you are submitting this code for a class then read our policy for that.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.2.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jul 24, 2025
@AI-IshanBhatt AI-IshanBhatt added the :ml Machine learning label Jul 28, 2025
@elasticsearchmachine elasticsearchmachine added Team:ML Meta label for the ML team and removed needs:triage Requires assignment of a team area label labels Jul 28, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@@ -20,15 +22,17 @@ public interface LlamaActionVisitor {
* Creates an executable action for the given Llama embeddings model.
*
* @param model the Llama embeddings model
* @param taskSettings the settings for the task, which may include parameters like user
Member

Minor: Can you remove ", which may include parameters like user". Just to avoid having to update this in the future if the user parameter is removed from valid task settings. The same comment applies for other comments across these changes.

Contributor Author

Makes sense. Fixed here and in other places that mention "user" task setting specifically.

@@ -48,6 +51,7 @@ public LlamaChatCompletionModel(
taskType,
service,
LlamaChatCompletionServiceSettings.fromMap(serviceSettings, context),
OpenAiChatCompletionTaskSettings.fromMap(taskSettings),
Member

Are we only able to send user as a task setting for completion or does Llama have some extra settings we can support that OpenAI does not support? Also do we already know that Llama will always support all task settings that OpenAI supports?

Contributor Author

@Jan-Kazlouski-elastic commented Jul 29, 2025

Are we only able to send user as a task setting for completion or does Llama have some extra settings we can support that OpenAI does not support?

According to the Llama Stack specification, they don't support anything beyond the OpenAI `user` task parameter:
https://llama-stack.readthedocs.io/en/latest/references/api_reference/index.html#/paths/v1-openai-v1-embeddings/post
https://llama-stack.readthedocs.io/en/latest/references/api_reference/index.html#/paths/v1-openai-v1-chat-completions/post

Also do we already know that Llama will always support all task settings that OpenAI supports?

The Llama Stack inference API is under active development, so we cannot make solid predictions. However, according to their API reference, they do support OpenAI's `user` task setting, which is the only task setting in the OpenAI integration.

@@ -50,6 +54,7 @@ public LlamaEmbeddingsModel(
taskType,
service,
LlamaEmbeddingsServiceSettings.fromMap(serviceSettings, context),
OpenAiEmbeddingsTaskSettings.fromMap(taskSettings, context),
Member

Same question regarding whether there are other Llama specific settings we may want to support/whether it is known that Llama will always support all OpenAI embeddings task settings?

Contributor Author

Same answer.
According to the Llama Stack specification, they don't support anything beyond the OpenAI `user` task parameter:
https://llama-stack.readthedocs.io/en/latest/references/api_reference/index.html#/paths/v1-openai-v1-embeddings/post
https://llama-stack.readthedocs.io/en/latest/references/api_reference/index.html#/paths/v1-openai-v1-chat-completions/post
The Llama Stack inference API is under active development, so we cannot make solid predictions. However, according to their API reference, they do support OpenAI's `user` task setting, which is the only task setting in the OpenAI integration.

Strings.toString(new LlamaEmbeddingsRequestEntity(model.getServiceSettings().modelId(), truncationResult.input()))
.getBytes(StandardCharsets.UTF_8)
Strings.toString(
new OpenAiEmbeddingsRequestEntity(
Member

Are we deprecating support for the hugging face API entirely? Is there any reason why someone might want to use it over the OpenAI API? Are there any existing users of the hugging face API for which this could cause breaking changes?

Contributor Author

Are we deprecating support for the hugging face API entirely?

This PR deprecates Hugging Face API support, correct.

Is there any reason why someone might want to use it over the OpenAI API?

There is no apparent reason for someone to choose the Hugging Face API over the OpenAI API; the Hugging Face embedding API is not superior to the OpenAI embedding API in any way.

Are there any existing users of the hugging face API for which this could cause breaking changes?

Although the Llama integration PR was merged last week, the specification changes that describe it have not yet been merged, so according to the documentation Llama is not yet a supported provider. There shouldn't be any existing users who have integrated with the Elastic Llama inference provider.

);
}
dimensionsSetByUser = dimensions != null;
} else if (context == ConfigurationParseContext.PERSISTENT && dimensionsSetByUser == null) {
Member

Just to clarify, is this only going to happen for endpoints created before these changes?

Contributor Author

You're right. In practice, there shouldn't be any endpoints created before these changes, since the Llama integration specification hasn't been merged yet. I've updated the logic to remove the default value, since users cannot know the Llama integration is available while the documentation is still pending.
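The parse-context behavior being discussed can be sketched as follows (illustrative Python; `REQUEST`/`PERSISTENT` stand in for `ConfigurationParseContext`, and the exact Java logic may differ — the point is that the flag is derived on create requests and must already be persisted otherwise, with no default assumed):

```python
REQUEST, PERSISTENT = "request", "persistent"

def resolve_dimensions_set_by_user(context, dimensions, stored_flag):
    """Resolve the dimensionsSetByUser flag for a parse context.

    On a create REQUEST, derive the flag from whether `dimensions` was
    provided. When re-parsing PERSISTENT settings, the stored flag must
    already be present; no default value is assumed.
    """
    if context == REQUEST:
        return dimensions is not None
    if context == PERSISTENT:
        if stored_flag is None:
            raise ValueError("dimensions_set_by_user must be persisted")
        return stored_flag
    raise ValueError(f"unknown context: {context}")
```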

}

@SuppressWarnings("unchecked")
private void assertEmbeddingsRequest() throws IOException {
private void assertEmbeddingsRequest(String user) throws IOException {
Member

It seems this doesn't test any of the cases where dimensions are set either by the user or the system. Can we add tests for this?

Contributor Author

Done. Tests added; they include validations of the request map for `dimensions`.

@@ -11,39 +11,43 @@
import org.elasticsearch.inference.EmptySecretSettings;
import org.elasticsearch.inference.TaskType;
import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.xpack.inference.services.openai.embeddings.OpenAiEmbeddingsTaskSettings;
import org.elasticsearch.xpack.inference.services.settings.DefaultSecretSettings;

import static org.elasticsearch.xpack.inference.chunking.ChunkingSettingsTests.createRandomChunkingSettings;

public class LlamaEmbeddingsModelTests extends ESTestCase {
Member

Can you add tests for the `of` function that was added in this change?

Contributor Author

I added tests for this "of" function.

@@ -20,38 +21,41 @@

public class LlamaChatCompletionModelTests extends ESTestCase {
Member

Do we have tests for overriding the user field in a model, both when the user is set and we override it to another user, and when it's not set and we override it to a user?

Contributor Author

I added tests for this "of" function.

@@ -39,6 +39,8 @@ public class LlamaEmbeddingsServiceSettingsTests extends AbstractWireSerializing
private static final SimilarityMeasure SIMILARITY_MEASURE = SimilarityMeasure.DOT_PRODUCT;
private static final int MAX_INPUT_TOKENS = 128;
private static final int RATE_LIMIT = 2;
private static final Boolean DIMENSIONS_SET_BY_USER = Boolean.TRUE;
Member

Do we have any tests where the dimensions are not set by the user?

Contributor Author

Now we do have such tests.

@@ -27,45 +27,84 @@
public class LlamaEmbeddingsRequestTests extends ESTestCase {
Member

Do we have tests to ensure that dimensions are handled properly when they are set by the user/when they are not set by the user? It seems since we are reusing the OpenAIEmbeddingsRequestEntity there is logic that changes how the request is formed across these cases (see here) so we should test that here as well.

Contributor Author

The logic related to dimensions lives in OpenAiEmbeddingsRequestEntity and is tested in OpenAiEmbeddingsRequestEntityTests. However, I added tests here as well.

@Jan-Kazlouski-elastic
Contributor Author

Hello @dan-rubinstein
I made the changes:

  • The new TransportVersion is renamed to ML_INFERENCE_LLAMA_REFACTORED
  • Mentions of the initial integration TransportVersion ML_INFERENCE_LLAMA_ADDED are removed, because it is better to ship this whole change as a single integration without versioning between the two. The documentation is not yet published, so no users are affected.
  • The "user" param is no longer mentioned in JavaDoc
  • The default false value for dimensionsSetByUser in the LlamaEmbeddingsModel fromMap method is removed, because it must always be set during the validation call to the model
  • Version checks are simplified because only ML_INFERENCE_LLAMA_REFACTORED is going to be used
  • Tests are updated and expanded
