fix(Helm)!: Refresh helm #880

Merged
merged 23 commits into from
Jul 1, 2025

Conversation

DiamondJoseph
Contributor

@DiamondJoseph DiamondJoseph commented Mar 28, 2025

Re-applies blueapi-specific configuration over the result of a fresh `helm create blueapi`, recovering configuration that had previously been removed because it was thought unnecessary.

Breaking changes to the helm chart:

  • Ingress form changes:
    • `create` is now `enabled`
    • `host` is now:

      ```yaml
      hosts:
        - host: foo.diamond.ac.uk
          paths:
            - path: /
              pathType: Prefix
      ```

Changes worth considering:

  • Can define arbitrary `volumes` and `volumeMounts`
  • Liveness/Readiness probes use the new `/healthz` endpoint
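As a sketch of the first point, arbitrary volumes might be wired through values.yaml like this (the volume name and mount path below are illustrative examples, not taken from the chart):

```yaml
# values.yaml (sketch): keys follow standard `helm create` conventions;
# the volume name and mountPath are hypothetical examples.
volumes:
  - name: scratch
    emptyDir: {}

volumeMounts:
  - name: scratch
    mountPath: /scratch
```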


codecov bot commented Mar 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.44%. Comparing base (3f6fd8c) to head (a275362).
Report is 2 commits behind head on main.

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main     #880   +/-   ##
=======================================
  Coverage   94.44%   94.44%
=======================================
  Files          41       41
  Lines        2537     2537
=======================================
  Hits         2396     2396
  Misses        141      141
```


@callumforrester
Contributor

Feels like this should really go in behind #871

@DiamondJoseph DiamondJoseph force-pushed the refresh-helm branch 6 times, most recently from 114042e to 6e4908b Compare April 1, 2025 12:23
```yaml
OTLP_EXPORT_ENABLED: "true"
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: {{ .Values.tracing.otlp.protocol | default "http/protobuf" }}
OTEL_EXPORTER_OTLP_ENDPOINT: {{ required "OTLP export enabled but server address not set" .Values.tracing.otlp.server.host }}:{{ .Values.tracing.otlp.server.port | default 4318 }}
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: {{ default "http/protobuf" .Values.tracing.otlp.protocol }}
```
Contributor Author

`.Values.tracing.otlp` must exist for us to have reached here, but `protocol` could be None, so use the default.

@DiamondJoseph DiamondJoseph mentioned this pull request Apr 2, 2025
Contributor

@callumforrester callumforrester left a comment

I'll be honest, I'm not very happy about merging this. The majority of helm changes we have merged have included bugs/regressions; we should have done something like #871 a long time ago. Most of the time we have made small, incremental changes to the helm chart so we could feel more confident, but even those broke. This, meanwhile, is a big change to a lot of high-risk areas. #871 is a good start but is not a comprehensive test suite. I think to merge this I will need:

  • More tests
  • An analysis of breaking changes
  • A migration guide to go into the changelog
  • Any areas of uncertainty we can explore on the test rigs

Contributor

Is there any point in adding HPA to an inherently stateful service?

Contributor Author

Not much, but I could see a use case if we suddenly decide to start running a bunch of simulated device beamlines, scaling out as they become busy (if our readiness probe started reflecting whether the pod had an active task, perhaps). At the end of the day, it comes for free with `helm create`, and if we remove it, it's one more thing to remove every time we refresh from `helm create` again. I don't think it harms anyone to have it available but turned off.

```
@@ -16,44 +12,6 @@ jobs:
with:
# Need this to get version number from last tag
fetch-depth: 0
- name: Validate SemVer2 version compliance
```
Contributor

I think we may have already discussed the pros and cons of getting rid of this, but I don't remember; I think it should stay.

Contributor Author

It is no longer required, as the blueapi release tag is no longer the version of the Helm chart, and therefore does not need to be a semantic version. If you want to enforce that the blueapi container release is a semver it can be restored, but this was added for issues publishing the Helm chart.

Comment on lines 77 to 130

```yaml
livenessProbe:
  {{- toYaml . | nindent 12 }}
{{- end }}
{{- with .Values.readinessProbe }}
readinessProbe:
  {{- toYaml . | nindent 12 }}
{{- end }}
{{- with .Values.resources }}
```
Contributor

Should: I think these were removed because they caused problems; are you trying to fix that with #884?

Contributor Author

Yes, that's the purpose of 884.

Contributor Author

#884 has now gone in, so the probes now hit `/healthz`, which returns a 200 status code.
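For illustration, the probe wiring against the new endpoint might look like this in values.yaml (a sketch; the port name `http` and the absence of tuning values are assumptions, not the chart's actual defaults):

```yaml
# Sketch only: probes hit the new /healthz endpoint on the named port.
livenessProbe:
  httpGet:
    path: /healthz
    port: http
readinessProbe:
  httpGet:
    path: /healthz
    port: http
```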

@DiamondJoseph DiamondJoseph mentioned this pull request Apr 2, 2025
@DiamondJoseph
Contributor Author

Differences by document with ingress/tracing/initContainer enabled:

  • main config: unchanged
  • otel-config: unchanged
  • init container config (screenshot):
    • No longer mounts `enabled` incorrectly in the initContainer applicationConfig. See also potential fix in comments in #887
    • ConfigMap and yaml file names adjusted
  • release-name-blueapi-service (screenshot):
    • Gains labels
    • Mapping of service to container port is now done via the name of the port, meaning the service definition in values.yaml refers to the external port and not the container port, as is the default behaviour.
  • release-name-blueapi-ingress (screenshot):
    • Gains labels
    • `secretName` is explicitly null rather than implicitly, allowing it to be overridden when e.g. not using the Diamond ingress controller
  • test container (screenshot):
    • Names adjusted
    • Uses the version from `appVersion` rather than the 0.1.0 container (in case of adding additional tests for cli functions that do not exist in 0.1.0)
  • statefulset (screenshot)
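The service port change described above can be sketched as follows (a hypothetical fragment; the port name `http` and the values key are assumptions, not the chart's actual contents):

```yaml
# values.yaml (sketch): only the external port is configured here.
service:
  port: 80

# templates/service.yaml (sketch): targetPort refers to the *name* of
# the container port, so the container port can change independently
# of the externally exposed Service port.
ports:
  - port: {{ .Values.service.port }}
    targetPort: http
    protocol: TCP
    name: http
```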

```python
    """)
)
group[name] = manifest
if manifest is not None:
```
Contributor Author

New handling of the init config means it may now be `None` instead of an empty ConfigMap

@DiamondJoseph
Contributor Author

Added Keith and Zoheb for extra points of view, but I won't merge until Callum is happy.

@DiamondJoseph DiamondJoseph changed the title Refresh helm fix (Helm): Refresh helm Apr 23, 2025
@DiamondJoseph DiamondJoseph changed the title fix (Helm): Refresh helm fix(Helm): Refresh helm Apr 23, 2025
Contributor

@ZohebShaikh ZohebShaikh left a comment

The PR looks good overall! One suggestion that I believe would be helpful for developers using this Helm chart is to include a table in the documentation listing all the parameters, along with their names, descriptions, and default values. Something similar to what's done here

@DiamondJoseph DiamondJoseph changed the title fix(Helm): Refresh helm fix(Helm)!: Refresh helm Jun 26, 2025
```markdown
| worker.stomp | object | `{"auth":{"password":"guest","username":"guest"},"enabled":false,"url":"http://rabbitmq:61613/"}` | Message bus configuration for returning status to GDA/forwarding documents downstream |

----------------------------------------------
Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2)
```
Contributor

Is there a way to not have generated code in the repo, or a way to ensure it stays up to date without someone having to remember to update it?

Contributor Author

The linting will fail if the output should have changed, as it's run as a pre-commit hook
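For reference, a helm-docs pre-commit hook is typically wired up like this (a sketch assuming the upstream norwoodj/helm-docs hook; the revision and chart path shown are illustrative, not this repo's actual config):

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: https://github.com/norwoodj/helm-docs
    rev: v1.14.2
    hooks:
      - id: helm-docs
        args:
          # Hypothetical path: point helm-docs at the chart directory
          - --chart-search-root=helm
```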

@DiamondJoseph DiamondJoseph dismissed callumforrester’s stale review June 27, 2025 12:00

He's too busy with the vertical slice sprints to be interested in the likes of us

@ZohebShaikh
Contributor

ZohebShaikh commented Jun 30, 2025

There were some things about the readiness and liveness probes that you are using:

As the pod will take time to start up due to scratch, I was just going through the docs to see whether the probes start only after the initContainer is done booting up.

The other thing was that we could have a cookie created on the fly and passed to the healthz probe as a header, so that on the fastapi side we only accept requests that carry that header, for the probable case of healthz getting DDoS'ed and making blueapi go into a boot-loop.

I also didn't see anything about the pod being restarted if the liveness probe fails.

@DiamondJoseph
Contributor Author

As the pod will take time to start up due to scratch, I was just going through the docs to see whether the probes start only after the initContainer is done booting up.

The probes are defined at the container level, so they do not attempt to run until the initContainer has run to completion. We could have a startupProbe that is slightly more permissive in case we are slow to start up, but with the default resources and nothing in scratch, startup takes ~4s. The probes run every 10s and need to fail 3 times in a row for the pod to die, so our startup is fast enough not to fail too many; whether they're permissive enough not to die on a filesystem blip I'm not sure. We do probably want a startupProbe for the potential 10s to connect to devices, but I don't know if we start serving the API prior to the subprocess being ready?

The other thing was that we could have a cookie created on the fly and passed to the healthz probe as a header, so that on the fastapi side we only accept requests that carry that header, for the probable case of healthz getting DDoS'ed and making blueapi go into a boot-loop.

It's a possibility; I think if we're going to add that it can come after 1.0.0, since it's non-breaking. If the REST API is being hammered enough to freeze and kill the subprocess running a scan, someone is doing something wrong or malicious. After the document store we can consider re-architecting to extract the subprocess; then if the API goes down the scan is unaffected.

I also didn't see anything about the pod being restarted if the liveness probe fails.

If the liveness probe fails `failureThreshold` times (default 3), the pod is killed and subject to its `restartPolicy`.

@ZohebShaikh
Contributor

For the hammering I had a malicious user in mind. If an issue is created for the cookie we can work on it at a later stage...
But I think having a startupProbe with reasonable values would be good, to let the devices start up.
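A startupProbe along those lines might look like the following sketch (all thresholds are illustrative, not agreed values):

```yaml
# Sketch only: gives the worker breathing room to connect to devices.
startupProbe:
  httpGet:
    path: /healthz
    port: http
  # Allow up to 12 x 5s = 60s before the liveness/readiness
  # probes take over.
  periodSeconds: 5
  failureThreshold: 12
```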

Contributor

@ZohebShaikh ZohebShaikh left a comment

It looks good overall; a few minor change requests.

As this is all helm changes, can we see it working with all the configuration before merge? Just for more confidence in the helm changes.

@ZohebShaikh ZohebShaikh self-requested a review July 1, 2025 09:03
@DiamondJoseph DiamondJoseph merged commit 67a645a into main Jul 1, 2025
18 checks passed
@DiamondJoseph DiamondJoseph deleted the refresh-helm branch July 1, 2025 15:35
4 participants