
feat(proxy): make the proxy resilient on mlmd failure #700

Conversation

@Al-Pragliola Al-Pragliola commented Jan 14, 2025

Description

This PR improves the resiliency of the model registry proxy. Previously, if MLMD was down when the proxy started, the proxy exited with an error. With this change, the proxy starts anyway and attaches a dynamic router that responds to every request with a 503 status until MLMD is up and running; once MLMD becomes available, the router switches over to the real one.

There's also a time limit of ~5 minutes. If mlmd is not up and running within this timeframe, the proxy will still exit with an error.
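
To illustrate the mechanism, here is a minimal Go sketch of the idea (not the actual code in cmd/proxy.go): the proxy serves a placeholder handler that returns 503 until MLMD is reachable, then atomically swaps in the real router; if MLMD never comes up before the deadline, the process exits. mlmdReady and realRouter are hypothetical placeholders for the gRPC health check and the real REST router.

package main

import (
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

// dynamicRouter lets the serving handler be swapped at runtime.
type dynamicRouter struct {
    handler atomic.Value // stores an http.HandlerFunc
}

func (d *dynamicRouter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    d.handler.Load().(http.Handler).ServeHTTP(w, r)
}

// mlmdReady and realRouter are hypothetical stand-ins for the actual gRPC
// health check against MLMD and the real model-registry REST router.
func mlmdReady() bool { return false }

func realRouter() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("model registry API"))
    }
}

func main() {
    router := &dynamicRouter{}

    // Until MLMD is reachable, every request gets a 503.
    router.handler.Store(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        http.Error(w, "MLMD server is down or unavailable. Please check that the database is reachable and try again later.", http.StatusServiceUnavailable)
    }))

    // In the background, poll MLMD and swap in the real router once it is up.
    // Give up (and exit) if it is still unreachable after ~5 minutes.
    go func() {
        deadline := time.Now().Add(5 * time.Minute)
        for !mlmdReady() {
            if time.Now().After(deadline) {
                log.Fatal("MLMD did not become available in time")
            }
            time.Sleep(5 * time.Second)
        }
        router.handler.Store(realRouter())
    }()

    log.Fatal(http.ListenAndServe(":8080", router))
}

Because the switch happens through an atomically stored handler, requests that arrive mid-swap still hit a valid handler.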

Automated E2E testing will follow in another PR, using the strategy described in #194 (comment).

How Has This Been Tested?

Testing scenarios:

MR UP AND RUNNING - DB GOES DOWN

TIME 0

  • MR is up and running
  • MLMD is up and running
  • DB is up and running

TIME 1

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 0}}'
  • DB goes down

TIME 2

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS   AGE
model-registry-deployment-cb6987594-psbj2   2/2     Running   0          11m

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: application/json; charset=UTF-8
< Vary: Origin
< Date: Tue, 14 Jan 2025 17:42:05 GMT
< Content-Length: 102
<
{"code":"","message":"rpc error: code = Internal desc = mysql_real_connect failed: errno: , error: "}
* Connection #0 to host localhost left intact
  • MR is up and running
  • MLMD is up and running
  • DB is down

TIME 3

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS   AGE
model-registry-db-7c4bb9f76f-lkmmb          1/1     Running   0          8s
model-registry-deployment-cb6987594-psbj2   2/2     Running   0          21m

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Vary: Origin
< Date: Tue, 14 Jan 2025 17:49:59 GMT
< Content-Length: 54
<
{"items":[],"nextPageToken":"","pageSize":0,"size":0}
* Connection #0 to host localhost left intact

MR STARTING UP WHILE DB IS DOWN

TIME 0

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 0}}'
kubectl patch deployment -n kubeflow model-registry-deployment --patch '{"spec": {"replicas": 0}}'

kubectl get pod -n kubeflow

No resources found in kubeflow namespace.
  • MR is down
  • MLMD is down
  • DB is down

TIME 1

kubectl patch deployment -n kubeflow model-registry-deployment --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS             RESTARTS      AGE
model-registry-deployment-cb6987594-gkrf8   1/2     CrashLoopBackOff   1 (20s ago)   21s

kubectl describe pod model-registry-deployment-cb6987594-gkrf8

....
Warning  BackOff    3s (x8 over 40s)   kubelet            Back-off restarting failed container grpc-container in pod model-registry-deployment-cb6987594-gkrf8_kubeflow(1bdd5e06-5939-4dd9-b1c9-4e8e68190245)

kubectl logs model-registry-deployment-cb6987594-gkrf8 -c grpc-container

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0114 17:59:16.438417     1 mysql_metadata_source.cc:174] MySQL database was not initialized. Please ensure your MySQL server is running. Also, this error might be caused by starting from MySQL 8.0, mysql_native_password used by MLMD is not supported as a default for authentication plugin. Please follow <https://dev.mysql.com/blog-archive/upgrading-to-mysql-8-0-default-authentication-plugin-considerations/>to fix this issue.
F0114 17:59:16.438586     1 metadata_store_server_main.cc:555] Check failed: absl::OkStatus() == status (OK vs. INTERNAL: mysql_real_connect failed: errno: , error:  [mysql-error-info='']) MetadataStore cannot be created with the given connection config.
*** Check failure stack trace: ***

curl -v "localhost:8080/api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC"

* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /api/model_registry/v1alpha3/registered_models/1/versions?sortOrder=DESC HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 14 Jan 2025 18:02:39 GMT
< Content-Length: 101
<
MLMD server is down or unavailable. Please check that the database is reachable and try again later.
* Connection #0 to host localhost left intact
  • MR is up and running
  • MLMD is down
  • DB is down

TIME 2

kubectl patch deployment -n kubeflow model-registry-db --patch '{"spec": {"replicas": 1}}'

kubectl get pod -n kubeflow

NAME                                        READY   STATUS             RESTARTS      AGE
model-registry-db-7c4bb9f76f-qwp5m          1/1     Running            0             8s
model-registry-deployment-cb6987594-gkrf8   1/2     CrashLoopBackOff   1 (18s ago)   19s
  • MR is up and running
  • MLMD is restarting
  • DB is up and running

TIME 3

kubectl get pod -n kubeflow

NAME                                        READY   STATUS    RESTARTS      AGE
model-registry-db-7c4bb9f76f-qwp5m          1/1     Running   0             1m
model-registry-deployment-cb6987594-gkrf8   2/2     Running   2 (38s ago)   2m
  • MR is up and running
  • MLMD is up and running
  • DB is up and running

Merge criteria:

  • All the commits have been signed-off (To pass the DCO check)
  • The commits have meaningful messages; the author will squash them after approval or, in the case of a manual merge, will ask for a squash merge.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • Code changes follow the kubeflow contribution guidelines.

@Al-Pragliola Al-Pragliola marked this pull request as ready for review January 15, 2025 21:42
@Al-Pragliola (Contributor Author)

/cc @tarilabs

@google-oss-prow google-oss-prow bot requested a review from tarilabs January 15, 2025 21:44
@tarilabs (Member)

love this @Al-Pragliola ❤️ thanks a lot !!

@Al-Pragliola (Contributor Author)

/cc @pboyd

google-oss-prow bot:

@Al-Pragliola: GitHub didn't allow me to request PR reviews from the following users: pboyd.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @pboyd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pboyd (Contributor) left a comment

One nit, but /lgtm.

cmd/proxy.go Outdated

err := http.ListenAndServe(fmt.Sprintf("%s:%d", cfg.Hostname, cfg.Port), router)
if err != nil {
errChan <- err
@pboyd (Contributor):

Since the other goroutine closes this channel, this send might be a problem (admittedly a rare one, but it could become an issue someday if that goroutine ever panics early).

You could perhaps add the first goroutine to the WaitGroup and close errCh in the parent. Or, since it looks like ListenAndServe errors were fatal before, maybe just make them fatal again? What do you think?

@Al-Pragliola (Contributor Author):

Great catch @pboyd, I think we can just revert to making it a Fatal error like before.
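
For reference, a minimal sketch of what that revert could look like (an assumption based on the discussion above, not the actual diff; the config type and standard-library logger are placeholders for whatever cmd/proxy.go really uses):

package main

import (
    "fmt"
    "log"
    "net/http"
)

// config is a hypothetical stand-in for the proxy's real configuration type.
type config struct {
    Hostname string
    Port     int
}

func main() {
    cfg := config{Hostname: "0.0.0.0", Port: 8080}
    router := http.NewServeMux()

    // Instead of sending the error on errChan (which another goroutine closes,
    // so a late send could panic), treat a ListenAndServe failure as fatal
    // again, as it was before this PR.
    if err := http.ListenAndServe(fmt.Sprintf("%s:%d", cfg.Hostname, cfg.Port), router); err != nil {
        log.Fatalf("failed to serve on %s:%d: %v", cfg.Hostname, cfg.Port, err)
    }
}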

@Al-Pragliola Al-Pragliola requested a review from pboyd January 28, 2025 20:20
@pboyd (Contributor) left a comment

Looks good to me @Al-Pragliola.

/lgtm
/approve

google-oss-prow bot:

@pboyd: changing LGTM is restricted to collaborators

In response to this:

Looks good to me @Al-Pragliola.

/lgtm
/approve


@tarilabs (Member) left a comment

thank you @Al-Pragliola

/approve

google-oss-prow bot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pboyd, tarilabs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tarilabs (Member)

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jan 30, 2025
@google-oss-prow google-oss-prow bot merged commit 5507207 into kubeflow:main Jan 30, 2025
17 checks passed
@Al-Pragliola Al-Pragliola deleted the al-pragliola-fix-panic-no-grpc-connection branch January 30, 2025 17:17