
Support Lazy Loading of Pod Templates #17701

Merged: 11 commits, Feb 19, 2025

Conversation

@GWphua (Contributor) commented Feb 6, 2025

Description

There may be circumstances where we want to change the config of the task pod, such as changing the CPU requests/limits of default task pods by editing the base YAML pod template file. In my case, I am editing something that looks like this example ConfigMap to make changes directly to the pod templates.

However, since the code only updates the HashMap<String, PodTemplate> templates during Overlord initialization, the changes are not immediately reflected.

This PR aims to solve this problem and make Druid more adaptable in Kubernetes environments. Instead of internally keeping track of the deserialized PodTemplates read from all template files, we now run the deserialization process whenever a new task pod is to be created. Note that this PR does not support reading from newly created pod template files (e.g., if we create a newPodTemplate.yaml, we still need to let the Overlord run template initialization for the new file to take effect).

The original fail-fast behavior for invalid pod templates is retained by validating all defined pod templates during initialization.
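For illustration, here is a minimal sketch of the approach in Java. This is not the PR's actual code: the class name, the loadPodTemplate helper, and the use of fabric8's Serialization utility are assumptions made for the example.

```java
import io.fabric8.kubernetes.api.model.PodTemplate;
import io.fabric8.kubernetes.client.utils.Serialization;

import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class LazyPodTemplateRegistry
{
  // Suppliers re-read the template file on every get(), so edits to the
  // file (e.g. via a mounted ConfigMap) take effect on the next task launch.
  private final Map<String, Supplier<PodTemplate>> podTemplates = new HashMap<>();

  public LazyPodTemplateRegistry(Map<String, Path> templateFiles)
  {
    for (Map.Entry<String, Path> entry : templateFiles.entrySet()) {
      Path file = entry.getValue();
      // Fail fast: deserialize once during initialization so an invalid
      // template is caught at Overlord startup rather than at task launch.
      loadPodTemplate(file);
      podTemplates.put(entry.getKey(), () -> loadPodTemplate(file));
    }
  }

  public PodTemplate getTemplate(String name)
  {
    Supplier<PodTemplate> supplier = podTemplates.get(name);
    if (supplier == null) {
      // Newly created template files are not picked up here; they still
      // require an Overlord restart, as noted in the description above.
      throw new IllegalArgumentException("No such pod template: " + name);
    }
    return supplier.get(); // re-reads and re-deserializes the file
  }

  private static PodTemplate loadPodTemplate(Path file)
  {
    try (InputStream in = Files.newInputStream(file)) {
      // Assumes fabric8's YAML/JSON deserialization; throws on malformed input.
      return Serialization.unmarshal(in, PodTemplate.class);
    }
    catch (IOException e) {
      throw new UncheckedIOException("Failed to read pod template: " + file, e);
    }
  }
}
```

The trade-off of this design is that every task launch pays a file read and deserialization cost, which is what the Performance Considerations section below quantifies.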

Performance Considerations

Given that an ingestion task will create n task pods, and we have k pod templates defined, we will read the pod template files n + k times instead of just k times (e.g., 100 task pods with 5 templates means 105 reads rather than 5).

Release note

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@FrankChen021 (Member) left a comment

I like this feature and it looks good to me

@FrankChen021 (Member) commented

I think we need a one- or two-sentence description of this feature in the docs, so that people don't need to dive into the code or do some testing to know whether this is supported or not.

@GWphua could you update it?

@FrankChen021 (Member) left a comment

minor suggestion on doc

```diff
-private HashMap<String, PodTemplate> podTemplates;
-private Supplier<KubernetesTaskRunnerDynamicConfig> dynamicConfigRef;
+private final Supplier<KubernetesTaskRunnerDynamicConfig> dynamicConfigRef;
+private HashMap<String, Supplier<PodTemplate>> podTemplates;
```
A Member commented on the diff above:

Add a one-line comment to explain why the Supplier is defined here.
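For illustration, the kind of comment being requested might look like this (a hypothetical sketch; the wording actually merged in the PR may differ):

```java
import io.fabric8.kubernetes.api.model.PodTemplate;

import java.util.HashMap;
import java.util.function.Supplier;

class KubernetesTaskRunnerSketch
{
  // Hold Suppliers rather than materialized PodTemplates: each get()
  // re-reads and re-deserializes the template file, so file edits take
  // effect on the next task launch without restarting the Overlord.
  private HashMap<String, Supplier<PodTemplate>> podTemplates;
}
```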

@FrankChen021 merged commit fd45182 into apache:master on Feb 19, 2025
75 checks passed
@cryptoe (Contributor) commented Feb 19, 2025

Earlier, the pod template was deserialized and kept in memory only once, no? Now, for each task invocation, the supplier gets called, which reads the contents from the file, no? Is that correct? @GWphua?

I am kind of on the fence about this. It's easier to debug when stuff is immutable. Earlier I just used to check whether the template name was the same or not, but now templates with the same name can be different.

@GWphua (Contributor, Author) commented Feb 20, 2025

> Earlier, the pod template was deserialized and kept in memory only once, no? Now, for each task invocation, the supplier gets called, which reads the contents from the file, no? Is that correct? @GWphua?
>
> I am kind of on the fence about this. It's easier to debug when stuff is immutable. Earlier I just used to check whether the template name was the same or not, but now templates with the same name can be different.

Hi @cryptoe, you are right about how the logic works. Regarding ease of debugging, the case you describe is indeed possible: if the user and the cluster both have a podTemplateFile.yaml but their contents differ, there may be some confusion.

However, we can still debug by going into the pod and running cat /path/to/podTemplateFile.yaml to inspect the template file. The main motivation behind this PR is to let users easily reconfigure task pod resources without restarting the Overlord.

I also suggest taking a look at "Example 2: Using a ConfigMap to upload the Pod Template file", which makes it easy to maintain a single source of truth and identify issues more easily.

GWphua added a commit to GWphua/druid that referenced this pull request Feb 20, 2025
* Lazy loading for PodTemplate to allow changing template files without restarting.

* Revert accidental changes to inspectionProfiles/Druid.xml

* Checkstyle

* Add another example for pod template configuration, and for lazy loading of pod templates

* Add tls port to example

* Add unit test for lazy pod template loading

* Fix spell-checks

* Allow k8s-jobs.md to dynamically take in Druid Version

* Update docs/development/extensions-core/k8s-jobs.md

Co-authored-by: Frank Chen <[email protected]>

* Update docs/development/extensions-core/k8s-jobs.md

Co-authored-by: Frank Chen <[email protected]>

* Add description for why Supplier is used

---------

Co-authored-by: Frank Chen <[email protected]>