fix(Helm)!: Refresh helm #880
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##             main     #880   +/-   ##
=======================================
  Coverage   94.44%   94.44%
=======================================
  Files          41       41
  Lines        2537     2537
=======================================
  Hits         2396     2396
  Misses        141      141
```
Force-pushed from c3bd350 to d4c1295
Feels like this should really go in behind #871
Force-pushed from 114042e to 6e4908b
```yaml
OTLP_EXPORT_ENABLED: "true"
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: {{ .Values.tracing.otlp.protocol | default "http/protobuf" }}
OTEL_EXPORTER_OTLP_ENDPOINT: {{ required "OTLP export enabled but server address not set" .Values.tracing.otlp.server.host }}:{{ .Values.tracing.otlp.server.port | default 4318 }}
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: {{ default "http/protobuf" .Values.tracing.otlp.protocol }}
```
`.Values.tracing.otlp` must exist for us to have reached here, but `protocol` could be None, so use the default.
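For context, Helm's `default` can be written either as a pipeline or as a plain function call; both return the fallback whenever the value is null or unset, so either spelling covers the "protocol could be None" case. A sketch restating the two forms from the diff:

```yaml
# Pipeline form: the value is piped into default
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: {{ .Values.tracing.otlp.protocol | default "http/protobuf" }}
# Function form: default FALLBACK VALUE; equivalent result
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: {{ default "http/protobuf" .Values.tracing.otlp.protocol }}
```

Both render `http/protobuf` when `tracing.otlp.protocol` is omitted or explicitly null in values.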
I'll be honest, I'm not very happy about merging this. The majority of helm changes we have merged have included bugs/regressions, we should have done something like #871 a long time ago. Most of the time we have done small, incremental changes to the helm chart so we could feel more confident, but even those broke. This, meanwhile, is a big change to a lot of high-risk areas. #871 is a good start but is not a comprehensive test suite. I think to merge this I will need:
- More tests
- An analysis of breaking changes
- A migration guide to go into the changelog
- Any areas of uncertainty we can explore on the test rigs
helm/blueapi/templates/hpa.yaml (Outdated)
Is there any point in adding HPA to an inherently stateful service?
Not much, but I could see a use case if we suddenly decide to start running a bunch of simulated device beamlines, scaling out as they become busy (if our readiness probe started reflecting whether the pod had an active task, perhaps). At the end of the day, it comes for free with `helm create`, and if we remove it, it's one more thing to remove every time we refresh with `helm create` again. I don't think it harms anyone to have it available but turned off.
.github/workflows/_container.yml
Outdated
```diff
@@ -16,44 +12,6 @@ jobs:
   with:
     # Need this to get version number from last tag
     fetch-depth: 0
-  - name: Validate SemVer2 version compliance
```
I think we may have already discussed the pros and cons of getting rid of this, but I don't remember the outcome; I think it should stay.
It is no longer required: the blueapi release tag is no longer the version of the Helm chart, and therefore does not need to be a semantic version. If you want to enforce that the blueapi container release is a semver, it can be restored, but it was originally added because of issues publishing the Helm chart.
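For reference, a minimal shell sketch of the spirit of such a SemVer2 compliance check; the regex, function name, and example version are illustrative, not the removed workflow step:

```shell
# Simplified SemVer2 pattern: MAJOR.MINOR.PATCH with optional
# -prerelease and +build metadata, no leading zeros in the core.
SEMVER_RE='^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$'

is_semver() {
  echo "$1" | grep -Eq "$SEMVER_RE"
}

# Example: a plain release version passes the check
is_semver "1.2.3" && echo "1.2.3 is valid SemVer2"
```

Note that a tag like `v1.2.3` fails this check because of the leading `v`, which is exactly the kind of mismatch that caused publishing issues.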
```yaml
livenessProbe:
  {{- toYaml . | nindent 12 }}
{{- end }}
{{- with .Values.readinessProbe }}
readinessProbe:
  {{- toYaml . | nindent 12 }}
{{- end }}
{{- with .Values.resources }}
```
Should: I think these were removed because they caused problems; are you trying to fix that with #884?
Yes, that's the purpose of #884.
#884 has now gone in, so the probes now get `/healthz`, which returns a 200 code.
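Wiring the probes to that endpoint could look like the following sketch (the path comes from the discussion; the port name and omission of timing fields are assumptions):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: http
readinessProbe:
  httpGet:
    path: /healthz
    port: http
```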
Differences by document with ingress/tracing/initContainer enabled: main config unchanged.
tests/unit_tests/test_helm_chart.py (Outdated)
""") | ||
) | ||
group[name] = manifest | ||
if manifest is not None: |
New handling of init config means it may now be None instead of an empty configMap
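A Python sketch of the guard described here, assuming the test groups parsed manifests by name after rendering the chart (the function and variable names are illustrative, not the actual test helpers):

```python
def group_manifests(rendered):
    """Group rendered manifests by name, skipping empty documents.

    `helm template` can emit empty YAML documents (e.g. when an optional
    init ConfigMap is disabled); YAML parsers return None for those, so
    guard before storing the manifest.
    """
    group = {}
    for name, manifest in rendered:
        if manifest is not None:
            group[name] = manifest
    return group
```

With this guard, a disabled init ConfigMap simply never appears in `group`, rather than appearing as a `None` entry that later assertions would trip over.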
Added Keith and Zoheb for extra points of view, but I won't merge until Callum is happy.
The PR looks good overall! One suggestion that I believe would be helpful for developers using this Helm chart is to include a table in the documentation listing all the parameters, along with their names, descriptions, and default values, similar to what's done here.
Force-pushed from 5e5bfe1 to cecf616
| worker.stomp | object | `{"auth":{"password":"guest","username":"guest"},"enabled":false,"url":"http://rabbitmq:61613/"}` | Message bus configuration for returning status to GDA/forwarding documents downstream |

Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2)
Is there a way to not have generated code in the repo or a way to ensure it stays up to date without someone having to remember to update it?
The linting will fail if the generated output is out of date, as helm-docs is run as a pre-commit hook.
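A sketch of the pre-commit wiring this refers to, assuming the upstream helm-docs hook is used (the `rev` matches the version noted in the generated docs; the exact configuration in this repo may differ):

```yaml
# .pre-commit-config.yaml fragment: regenerates the chart README and
# fails the run if the committed output was stale.
repos:
  - repo: https://github.com/norwoodj/helm-docs
    rev: v1.14.2
    hooks:
      - id: helm-docs
```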
Force-pushed from 6d66e17 to 95cc1c2
He's too busy with the vertical slice sprints to be interested in the likes of us
There were some things about the readiness and liveness probes that you are using. As the pod will take time to start up due to scratch, I was going through the docs to see if the probes start only after the initContainer is done booting up. The other thing: we could have a cookie created on the fly and passed to the `/healthz` probe as a header, so that on the FastAPI side we only accept requests carrying that header, for the case of `/healthz` getting DDoS'ed and sending blueapi into a boot loop. I also didn't see anything about the liveness probe restarting the pod if it fails.
The probes are defined at the container level, so they do not attempt to run until the initContainer has run to completion. We could have a startupProbe that is slightly more permissive in case we are slow to start up, but with the default resources and nothing in scratch, startup takes ~4s. The probes run every 10s and need to fail 3 in a row before the pod dies, so our startup is fast enough not to fail too many; whether they're permissive enough to survive a filesystem blip I'm not sure. We probably do want a startupProbe for the potential 10s to connect to devices, but I don't know if we start serving the API before the subprocess is ready?
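A startupProbe along those lines might look like the following sketch (path, port, and timing values are all assumptions, chosen to tolerate a slow device-connection startup):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: http
  periodSeconds: 5
  failureThreshold: 12   # tolerate up to ~60s of startup before killing the pod
```

While a startupProbe is failing, the liveness and readiness probes are suspended, so a slow start can't trip the stricter steady-state checks.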
It's a possibility. I think if we're going to add that it can wait until after 1.0.0, since it's non-breaking. If the REST API is being hammered hard enough to freeze and kill the subprocess running a scan, someone is doing something wrong or malicious. After the document store we can consider re-architecting to extract the subprocess; then if the API goes down the scan is unaffected.
If the liveness probe fails `failureThreshold` times (default 3), the pod is killed and subject to its `restartPolicy`.
For the hammering I had a malicious user in mind. If an issue is created for the cookie we can work on it at a later stage...
It looks good overall, a few minor change requests.
As this is all Helm changes, can we see it working with all the configuration before merge? Just for more confidence in the Helm changes.
Re-applies blueapi-specific configuration over the result of a fresh `helm create blueapi`, recovering configuration where we had previously removed things thought not to be necessary.

Breaking changes to the helm chart:

- `create` is now `enabled`
- `host` is now:

```yaml
hosts:
  - paths:
      - pathType: Prefix
```

Changes worth considering:

- `/healthz` endpoints