
feat(proxy): make the proxy resilient on mlmd failure #700

Conversation

@Al-Pragliola Al-Pragliola commented Jan 14, 2025

Description

This PR improves the resiliency of the model registry proxy. Previously, if MLMD was down when the proxy started, the proxy exited with an error. With this change, the proxy starts anyway and attaches a dynamic router that responds to every request with a 503 status until MLMD is up and running; once MLMD becomes available, the router switches over to the real one.

There's also a time limit of ~5 minutes. If mlmd is not up and running within this timeframe, the proxy will still exit with an error.
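
To illustrate the mechanism, here is a minimal Go sketch of the idea (not the actual code in cmd/proxy.go): the proxy serves a placeholder handler that returns 503 until MLMD is reachable, then atomically swaps in the real router; if MLMD never comes up before the deadline, the process exits. mlmdReady and realRouter are hypothetical placeholders for the gRPC health check and the real REST router.

package main

import (
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

// dynamicRouter lets the serving handler be swapped at runtime.
type dynamicRouter struct {
    handler atomic.Value // stores an http.HandlerFunc
}

func (d *dynamicRouter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    d.handler.Load().(http.Handler).ServeHTTP(w, r)
}

// mlmdReady and realRouter are hypothetical stand-ins for the actual gRPC
// health check against MLMD and the real model-registry REST router.
func mlmdReady() bool { return false }

func realRouter() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("model registry API"))
    }
}

func main() {
    router := &dynamicRouter{}

    // Until MLMD is reachable, every request gets a 503.
    router.handler.Store(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        http.Error(w, "MLMD server is down or unavailable. Please check that the database is reachable and try again later.", http.StatusServiceUnavailable)
    }))

    // In the background, poll MLMD and swap in the real router once it is up.
    // Give up (and exit) if it is still unreachable after ~5 minutes.
    go func() {
        deadline := time.Now().Add(5 * time.Minute)
        for !mlmdReady() {
            if time.Now().After(deadline) {
                log.Fatal("MLMD did not become available in time")
            }
            time.Sleep(5 * time.Second)
        }
        router.handler.Store(realRouter())
    }()

    log.Fatal(http.ListenAndServe(":8080", router))
}

Because the switch happens through an atomically stored handler, requests that arrive mid-swap still hit a valid handler.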

Automated E2E testing will follow in another PR, using the strategy described in #194 (comment).

How Has This Been Tested?

Testing scenarios:

MR UP AND RUNNING - DB GOES DOWN

TIME 0

  • MR is up and running
  • MLMD is up and running
  • DB is up and running

TIME 1

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 0}}'
  • DB goes down

TIME 2

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS   AGE
model-registry-deployment-cb6987594-psbj2   2/2     Running   0          11m

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: application/json; charset=UTF-8
< Vary: Origin
< Date: Tue, 14 Jan 2025 17:42:05 GMT
< Content-Length: 102
<
{"code":"","message":"rpc error: code = Internal desc = mysql_real_connect failed: errno: , error: "}
* Connection #0 to host localhost left intact
  • MR is up and running
  • MLMD is up and running
  • DB is down

TIME 3

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS   AGE
model-registry-db-7c4bb9f76f-lkmmb          1/1     Running   0          8s
model-registry-deployment-cb6987594-psbj2   2/2     Running   0          21m

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Vary: Origin
< Date: Tue, 14 Jan 2025 17:49:59 GMT
< Content-Length: 54
<
{"items":[],"nextPageToken":"","pageSize":0,"size":0}
* Connection #0 to host localhost left intact

MR STARTING UP WHILE DB IS DOWN

TIME 0

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 0}}'
kubectl patch deployment -n kubeflow model-registry-deployment --patch '{"spec": {"replicas": 0}}'

kubectl get pod -n kubeflow

No resources found in kubeflow namespace.
  • MR is down
  • MLMD is down
  • DB is down

TIME 1

kubectl patch deployment -n kubeflow model-registry-deployment --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS             RESTARTS      AGE
model-registry-deployment-cb6987594-gkrf8   1/2     CrashLoopBackOff   1 (20s ago)   21s

kubectl describe pod model-registry-deployment-cb6987594-gkrf8

....
Warning  BackOff    3s (x8 over 40s)   kubelet            Back-off restarting failed container grpc-container in pod model-registry-deployment-cb6987594-gkrf8_kubeflow(1bdd5e06-5939-4dd9-b1c9-4e8e68190245)

kubectl logs model-registry-deployment-cb6987594-gkrf8 -c grpc-container

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0114 17:59:16.438417     1 mysql_metadata_source.cc:174] MySQL database was not initialized. Please ensure your MySQL server is running. Also, this error might be caused by starting from MySQL 8.0, mysql_native_password used by MLMD is not supported as a default for authentication plugin. Please follow <https://dev.mysql.com/blog-archive/upgrading-to-mysql-8-0-default-authentication-plugin-considerations/>to fix this issue.
F0114 17:59:16.438586     1 metadata_store_server_main.cc:555] Check failed: absl::OkStatus() == status (OK vs. INTERNAL: mysql_real_connect failed: errno: , error:  [mysql-error-info='']) MetadataStore cannot be created with the given connection config.
*** Check failure stack trace: ***

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 14 Jan 2025 18:02:39 GMT
< Content-Length: 101
<
MLMD server is down or unavailable. Please check that the database is reachable and try again later.
* Connection #0 to host localhost left intact
  • MR is up and running
  • MLMD is down
  • DB is down

TIME 2

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS             RESTARTS      AGE
model-registry-db-7c4bb9f76f-qwp5m          1/1     Running            0             8s
model-registry-deployment-cb6987594-gkrf8   1/2     CrashLoopBackOff   1 (18s ago)   19s
  • MR is up and running
  • MLMD is restarting
  • DB is up and running

TIME 3

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS      AGE
model-registry-db-7c4bb9f76f-qwp5m          1/1     Running   0             1m
model-registry-deployment-cb6987594-gkrf8   2/2     Running   2 (38s ago)   2m
  • MR is up and running
  • MLMD is up and running
  • DB is up and running

Merge criteria:

  • All the commits have been signed-off (To pass the DCO check)
  • The commits have meaningful messages; the author will squash them after approval or, in the case of a manual merge, will ask for a squash merge.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • Code changes follow the kubeflow contribution guidelines.

@Al-Pragliola Al-Pragliola marked this pull request as ready for review January 15, 2025 21:42
@Al-Pragliola (Contributor Author)

/cc @tarilabs

@google-oss-prow google-oss-prow bot requested a review from tarilabs January 15, 2025 21:44
@tarilabs (Member)

love this @Al-Pragliola ❤️ thanks a lot !!

@Al-Pragliola (Contributor Author)

/cc @pboyd

google-oss-prow bot:

@Al-Pragliola: GitHub didn't allow me to request PR reviews from the following users: pboyd.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @pboyd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pboyd (Contributor) left a comment

One nit, but /lgtm.

cmd/proxy.go Outdated

err := http.ListenAndServe(fmt.Sprintf("%s:%d", cfg.Hostname, cfg.Port), router)
if err != nil {
errChan <- err
@pboyd (Contributor):

Since the other goroutine closes this channel, this send might be a problem (admittedly a rare one, but it could become an issue someday if that goroutine ever panics early).

You could perhaps add the first goroutine to the WaitGroup and close errCh in the parent. Or, since it looks like ListenAndServe errors were fatal before, maybe just make them fatal again? What do you think?

@Al-Pragliola (Contributor Author):

Great catch @pboyd, I think we can just revert to making it a Fatal error like before.
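
For reference, a minimal sketch of what that revert could look like (an assumption based on the discussion above, not the actual diff; the config type and standard-library logger are placeholders for whatever cmd/proxy.go really uses):

package main

import (
    "fmt"
    "log"
    "net/http"
)

// config is a hypothetical stand-in for the proxy's real configuration type.
type config struct {
    Hostname string
    Port     int
}

func main() {
    cfg := config{Hostname: "0.0.0.0", Port: 8080}
    router := http.NewServeMux()

    // Instead of sending the error on errChan (which another goroutine closes,
    // so a late send could panic), treat a ListenAndServe failure as fatal
    // again, as it was before this PR.
    if err := http.ListenAndServe(fmt.Sprintf("%s:%d", cfg.Hostname, cfg.Port), router); err != nil {
        log.Fatalf("failed to serve on %s:%d: %v", cfg.Hostname, cfg.Port, err)
    }
}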

@Al-Pragliola Al-Pragliola requested a review from pboyd January 28, 2025 20:20
@pboyd (Contributor) left a comment

Looks good to me @Al-Pragliola.

/lgtm
/approve

google-oss-prow bot:

@pboyd: changing LGTM is restricted to collaborators

In response to this:

Looks good to me @Al-Pragliola.

/lgtm
/approve


@tarilabs (Member) left a comment

thank you @Al-Pragliola

/approve

google-oss-prow bot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pboyd, tarilabs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tarilabs (Member)

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jan 30, 2025
@google-oss-prow google-oss-prow bot merged commit 5507207 into kubeflow:main Jan 30, 2025
17 checks passed
@Al-Pragliola Al-Pragliola deleted the al-pragliola-fix-panic-no-grpc-connection branch January 30, 2025 17:17