[BUG] model deployment DOES NOT fail when there is exception "Failed to retrieve model" due to TransportService "discovery node must not be null" #3582
Comments
@nathaliellenaa could you please try to reproduce the issue on your end? @maxlepikhin if you can give @nathaliellenaa a more detailed step-by-step process to reproduce the issue, that would be helpful.
Few questions:
More detailed steps to reproduce, for example in minikube (setting up minikube locally is outside the scope of these steps):
Inspecting the code to see why the model task didn't fail when that exception happened would be the first step toward fixing the issue. It must fail.
Thank you for providing the detailed steps @maxlepikhin. I was able to set up minikube locally, and I will follow the steps and see if I can reproduce the issue on my end.
@nathaliellenaa note that this happens while the cluster reports "green" status, but only shortly after it first becomes "green". The workaround on our side was to give the cluster more time (1-2 minutes) after it first becomes "green" before deploying the model.
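The grace-period workaround described above can be sketched roughly as follows. This is only an illustration, not code from the project: `wait_for_stable_green` and its parameters are hypothetical names, and `get_status` is assumed to be a caller-supplied function that returns the cluster health status string (for example, the `status` field of `GET /_cluster/health`). The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable, Optional


def wait_for_stable_green(
    get_status: Callable[[], str],
    settle_seconds: float = 90.0,
    timeout_seconds: float = 600.0,
    poll_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the status has been "green" continuously for
    settle_seconds, or False if timeout_seconds elapses first.

    get_status is a hypothetical callable supplied by the caller.
    """
    deadline = clock() + timeout_seconds
    green_since: Optional[float] = None
    while True:
        now = clock()
        if get_status() == "green":
            if green_since is None:
                green_since = now  # first green observation of this streak
            if now - green_since >= settle_seconds:
                return True
        else:
            green_since = None  # streak broken, start over
        if now >= deadline:
            return False
        sleep(poll_seconds)
```

A caller would run this before issuing the deploy request and proceed only if it returns `True`. The 90-second default simply falls inside the 1-2 minute window mentioned above and is not a tuned value.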
Hi @maxlepikhin. I'm trying to set up the cluster, but I can't see the health status. Here are the steps that I followed:
The health status is not showing up in the OpenSearchCluster resource status. I've completed the initial setup and the cluster appears to be running, but I can't see the health information. Did I miss something during the setup?
@nathaliellenaa you can try "kubectl get opensearchcluster opensearch-cluster -o yaml" or use curl to probe the health endpoint. Possibly it's a difference between OSes.
@maxlepikhin I was able to create the OpenSearch cluster through the OpenSearch operator and run the register and deploy APIs, but I couldn't replicate the error you encountered.
This is the log of my cluster. It shows that the cluster became GREEN at 22:36:51 and the model deployed successfully at 22:36:52, which is shortly after it became GREEN, as you mentioned here. I also tried to run this process several times and still couldn't reproduce the error.
What is the bug?
When using the OpenSearch operator, the following bug appears, possibly when the cluster is not yet fully initialized. The cluster shows "green" status when the model is being deployed. There is no recovery from this error: the model task stays in the RUNNING state indefinitely after this exception. The bug must be in how the transport service is set up, but also in the MLModelManager not changing the model task to FAILED after this exception.
How can one reproduce the bug?
What is the expected behavior?
At the minimum, the model task must fail in case of such exceptions. Ideally, the transport service is initialized correctly so it does not fail.
What is your host/environment?
OS: Ubuntu 24.04
Version: 2.19.0
Do you have any screenshots?
N/A
Do you have any additional context?
This is critical for us. The workaround is poor: after observing the ML task in the RUNNING state for a certain time, we can abandon it and retry.
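For reference, the abandon-and-retry workaround can be sketched along these lines. It is a minimal illustration under stated assumptions, not the plugin's behavior: `deploy_with_retry`, `start_deploy`, and `get_task_state` are hypothetical names for caller-supplied functions that start a deployment (returning a task ID) and read the task's state. The `COMPLETED` state name is assumed; only RUNNING and FAILED are mentioned in this issue.

```python
import time
from typing import Callable


def deploy_with_retry(
    start_deploy: Callable[[], str],
    get_task_state: Callable[[str], str],
    max_attempts: int = 3,
    task_timeout_seconds: float = 120.0,
    poll_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Start a deployment and poll its task; if the task fails or is still
    not completed when task_timeout_seconds elapses, abandon it and retry.

    Returns the ID of the task that completed. Both callables are
    hypothetical and supplied by the caller.
    """
    for _attempt in range(max_attempts):
        task_id = start_deploy()
        deadline = clock() + task_timeout_seconds
        while clock() < deadline:
            state = get_task_state(task_id)
            if state == "COMPLETED":  # assumed name of the success state
                return task_id
            if state == "FAILED":
                break  # no point waiting; go straight to the next attempt
            sleep(poll_seconds)
        # Reaching here means the task failed or stayed unfinished
        # (e.g. stuck in RUNNING) past the deadline; it is abandoned.
    raise RuntimeError(f"deployment did not complete after {max_attempts} attempts")
```

The per-attempt timeout is what stands in for the missing FAILED transition: a task that never leaves RUNNING is treated the same as one that failed.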