
Queue Proxy health checks incompatible with non-HTTP/2 applications #15432

Open
braunsonm opened this issue Jul 31, 2024 · 16 comments · May be fixed by #15436
Labels: area/networking, kind/bug, lifecycle/frozen
Milestone: v1.18
Comments

@braunsonm

braunsonm commented Jul 31, 2024

/area networking

What version of Knative?

1.15.0

Expected Behavior

Legacy applications may have undefined behavior when HTTP/2 upgrade requests are made. Knative should gracefully handle those errors and downgrade the health check attempt to HTTP/1 or HTTP/1.1.

Actual Behavior

Applications which do not support HTTP/2 will not handle the upgrade request properly. In our case, a legacy application returns a 500 when an OPTIONS request is sent to upgrade the connection. Knative fails the entire health check because of this, even though the same check over HTTP/1 or HTTP/1.1 returns a 200.

Steps to Reproduce the Problem

  1. Create an application which does not support HTTP/2 or returns a 500 on the OPTIONS request (a minimal sketch follows the list below)
  2. Notice that Knative will start failing the health checks and the pod will be killed
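
For step 1, a minimal reproducer could look like the following sketch. This is hypothetical and not from the original report: a plain HTTP/1 handler that answers GET probes with 200 but rejects the OPTIONS upgrade request with 500.

package main

import "net/http"

func main() {
	// Hypothetical reproducer: answer GET probes with 200, but reject the
	// OPTIONS request (used for the h2c upgrade attempt) with 500.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", nil)
}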

Additional Context

Nothing in the Kubernetes spec requires an application to support HTTP/2, or to expect an OPTIONS call to its health/liveness probes. Only GET is part of the contract, and the Queue Proxy does not follow it.

I believe the logic is flawed in the queue proxy's HTTP probes here.

return maxProto, fmt.Errorf("HTTP probe did not respond Ready, got status code: %d", res.StatusCode)

When an error occurs during the upgrade, maxProto should be set to 1 and Knative should stop trying to make HTTP/2 requests. Currently, because of this line, HTTP/2 will be retried indefinitely and HTTP/1 will never be attempted.
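
To illustrate the suggested behavior, here is a small self-contained Go sketch. doProbe is a hypothetical stand-in for the per-attempt probe in pkg/queue/health/probe.go, and the maxProto handling shows the proposed fallback, not the actual queue-proxy implementation.

package main

import (
	"errors"
	"fmt"
	"net/http"
)

// doProbe is a hypothetical stand-in for the queue-proxy's per-attempt probe.
// It simulates a legacy app that 500s on the h2c upgrade (OPTIONS) but
// answers a plain HTTP/1 GET with 200.
func doProbe(maxProto int) (int, error) {
	if maxProto == 0 { // 0 = undecided: the probe attempts the HTTP/2 upgrade
		return http.StatusInternalServerError, errors.New("app rejected the OPTIONS upgrade")
	}
	return http.StatusOK, nil // plain HTTP/1 probe succeeds
}

func main() {
	maxProto := 0 // undecided; the first attempt tries the h2c upgrade
	for attempt := 1; attempt <= 2; attempt++ {
		status, err := doProbe(maxProto)
		if err == nil && status == http.StatusOK {
			fmt.Printf("attempt %d: ready (maxProto=%d)\n", attempt, maxProto)
			return
		}
		// Proposed behavior: on a failed upgrade, pin maxProto to 1 so the
		// next attempt uses HTTP/1 instead of retrying HTTP/2 indefinitely.
		fmt.Printf("attempt %d: not ready (status=%d, err=%v), falling back to HTTP/1\n", attempt, status, err)
		maxProto = 1
	}
	fmt.Println("probe never became ready")
}

With that fallback, a legacy app like the one in the reproducer would fail the first (upgrade) attempt but pass on the second, HTTP/1 attempt.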

@braunsonm braunsonm added the kind/bug Categorizes issue or PR as related to a bug. label Jul 31, 2024
@braunsonm braunsonm changed the title Queue Proxy does not gracefully handle applications which do not support HTTP/2 Queue Proxy health checks incompatible with anything but HTTP/2 Jul 31, 2024
@braunsonm braunsonm changed the title Queue Proxy health checks incompatible with anything but HTTP/2 Queue Proxy health checks incompatible with non-HTTP/2 applications Jul 31, 2024
@dprotaso
Member

I'm confused: what's making HTTP/2 requests? Knative health checks are HTTP/1.

@braunsonm
Author

braunsonm commented Jul 31, 2024

@dprotaso I can see requests being made from the queue-proxy to the user-container and attempting to upgrade to HTTP/2 during the readiness probes.

The code I linked above is, I believe, the logic the queue-proxy uses to perform the HTTP/2 upgrade for these probes. This happens when the feature gate for auto-detecting HTTP/2 is set to true.

@dprotaso
Member

Oh interesting, I didn't realize this was added. The h2c upgrade is deprecated: https://datatracker.ietf.org/doc/html/rfc9113#section-3.1

We should probably just always do HTTP/1 unless the user has specified h2c, or we change the detection to use h2c prior knowledge.
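
For reference, here is a minimal sketch (not Knative code) of what an h2c "prior knowledge" client looks like in Go, using golang.org/x/net/http2; the probe URL is a placeholder.

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	// h2c with prior knowledge: speak HTTP/2 over plain TCP from the first
	// byte, without the deprecated Upgrade handshake.
	client := &http.Client{
		Transport: &http2.Transport{
			AllowHTTP: true, // permit "http://" URLs
			// Dial plain TCP; the TLS config is ignored since no TLS is used.
			DialTLSContext: func(ctx context.Context, network, addr string, _ *tls.Config) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, network, addr)
			},
		},
	}

	resp, err := client.Get("http://127.0.0.1:8080/healthz") // placeholder target
	if err != nil {
		fmt.Println("h2c prior-knowledge request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode, "proto:", resp.Proto)
}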

@dprotaso
Member

You don't have an example app where this breaks?

@braunsonm
Author

braunsonm commented Jul 31, 2024

I agree that probes should have always been HTTP/1 to match what would be expected from Kubernetes. But if you want to keep this so you can tell whether an app supports HTTP/2 or not, then I would suggest at least failing gracefully if the HTTP/2 check fails (falling back to HTTP/1).

Unfortunately I don't have a sample that I could share, but I think it should be reproducible with any app that throws a 500 whenever an OPTIONS request is made (i.e., the upgrade request).

@skonto
Contributor

skonto commented Aug 1, 2024

Hi @braunsonm, thanks for reporting this.

This happens when the feature gate for auto-detecting HTTP/2 is set to true.

Would it work if you turned this off for now, or is this something that fails in other scenarios?

@braunsonm
Author

Would it work if you turned this off for now, or is this something that fails in other scenarios?

It does work if it is set to false, but that does mean other applications deployed on Knative can no longer benefit from HTTP/2, which is unfortunate.

@skonto
Contributor

skonto commented Aug 1, 2024

but that does mean other applications deployed on Knative can no longer benefit from HTTP/2, which is unfortunate.

That autodetect feature was never completed. So if the app is using HTTP/2, do you mean that the QP is not going to use it with autodetect off? What do you mean by apps on Knative can't benefit from HTTP/2? Could you elaborate?

@braunsonm
Author

What do you mean by apps on Knative can't benefit from HTTP/2? Could you elaborate?

I was under the impression that the HTTP/2 autodetection feature was required for HTTP/2 to be used between the activator and ksvcs. Is that not true?

@skonto
Contributor

skonto commented Aug 1, 2024

This has to do with the probes here. We do support HTTP/2 without setting that auto-detect property, which, by the way, was never finished as a feature (check our gRPC tests, for example). Also see here for what happens when you turn it on: https://github.com/knative/serving/blob/main/pkg/queue/readiness/probe.go#L233-L242.
We only try the upgrade if maxProto = 0, see https://github.com/knative/serving/blob/main/pkg/queue/health/probe.go#L163.
cc @dprotaso in case he has more to add on the background of this feature.

@dprotaso
Member

dprotaso commented Aug 1, 2024

Right now, supporting HTTP/2 requires people to set the containerPort name to h2c.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: grpc-ping
  namespace: default
spec:
  template:
    spec:
      containers:
      - image: docker.io/{username}/grpc-ping-go
        ports:
          - name: h2c
            containerPort: 8080

The feature has an issue here: #4283. The idea is to detect the protocol without the labelling.

@braunsonm
Author

I see. We use func, which doesn't support naming the port, so that's why the autodetection was going to be required for us.

@skonto
Contributor

skonto commented Oct 24, 2024

@braunsonm is this something functions could help with instead? Do you mind opening an issue there too?

@braunsonm
Author

No, it is not. @skonto, this is broken in functions because of the flawed implementation in serving.

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2025
@dprotaso
Member

/lifecycle frozen

Seems like there are some Go changes in 1.24 (due out in February) that might make this simpler:

https://tip.golang.org/doc/go1.24#nethttppkgnethttp
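
For context, a minimal sketch assuming the Go 1.24 net/http Protocols API described in those release notes; the probe URL is a placeholder and this is not queue-proxy code.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Go 1.24 adds a Protocols field to http.Transport (and http.Server).
	// Enabling only UnencryptedHTTP2 should make the client use h2c with
	// prior knowledge for http:// targets, with no Upgrade dance needed.
	protos := new(http.Protocols)
	protos.SetUnencryptedHTTP2(true)

	client := &http.Client{
		Transport: &http.Transport{Protocols: protos},
	}

	resp, err := client.Get("http://127.0.0.1:8080/healthz") // placeholder target
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode, "proto:", resp.Proto)
}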

@knative-prow knative-prow bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 23, 2025
@dprotaso dprotaso added this to the v1.18 milestone Jan 23, 2025