Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

neon-cluster-operator: wrap-up issues #1901

Closed
jefflill opened this issue Apr 11, 2024 · 2 comments
Closed

neon-cluster-operator: wrap-up issues #1901

jefflill opened this issue Apr 11, 2024 · 2 comments
Assignees

Comments

@jefflill
Copy link
Collaborator

jefflill commented Apr 11, 2024

I've verified that cluster stabilization works and I've coded the neon-cluster-operator job but couldn't test that due to: #1900

@jefflill jefflill self-assigned this Apr 11, 2024
@jefflill
Copy link
Collaborator Author

jefflill commented Apr 17, 2024

  • The scheduling problem with the terminated pod GC job was due to a mistyped CRON schedule, which I've fixed.

The other thing that's a bit odd with the current design is that the job classes all reference CronJob as the base class to pick up the Name, Group, and Type properties as well as the AddToSchedulerAsync() and DeleteFromSchedulerAsync() methods. The current design has us create job instances that are never actually executed but instead are used for managing job schedules.

This seems like a bit of anti-pattern and in fact, I added TerminatedPodGcDelayMilliseconds and TerminatedPodGcThresholdMinutes properties to the TerminatedPodGcJob and set those on the global job, thinking that instance was going to be executed. After playing with Quartz a bit, I see that Quartz constructs a new job instance every time a job is executed, so TerminatedPodGcDelayMilliseconds/TerminatedPodGcThresholdMinutes will never be set the way I coded things.

  • I'm going to refactor this by removing the CronJob base class and relocating the scheduling code as well as passing the TerminatedPodGcDelayMilliseconds/TerminatedPodGcThresholdMinutes properties in the parameter dictionary.

  • All of the jobs run immediately first and then start running on the CRON schedule. This doesn't really make sense. For example, we don't want to run the terminated pod GC job just because neon-cluster-operator was rescheduled; we have the CRON schedules for a reason. Other jobs like certificate renewal run once a week so it really doesn't make sense to execute this just because the operator restarted.

    Hopefully, cluster setup isn't depending on the cluster operator promptly executing jobs because that's pretty fragile and this change will probably break things. We'll address any problems like this by having cluster setup do any configuration explicitly.

  • Random CRON schedule field multiple executions: we use our custom "R" field characters for some CRON fields to randomize when jobs execute to avoid the potential of having a large number of clusters doing things like transmitting telemetry pings at exactly the same time. This works OK, but it's possible for jobs to be re-run when leadership changes. Here's the scenario:

    1. We have a job scheduled with "R R 0 ? * * " which will fire some time between 12:00am and 1:00am
    2. Let's say the first operator resolves this to "0 0 0 ? * *" (12:00am) and executes the job
    3. Then shortly after the job completes, the first operator loses leadership to a second operator
    4. The second operator re-resolves the date to "0 15 0 ? * *" (12:15am)
    5. Since 12:15am is still in the future, the second operator instance will re-run the job

    I think the way to address this is by only resolving these random fields once and then storing the resolved CRON schedule in V1NeonClusterJobs.status and using these status CRON values for when subsequent operator instances schedule jobs.

  • Some jobs are not updating their status in V1NeonClusterJobs:

    • TerminatedPodGcJob
    • MinWorkerNodeVcpuJob
  • neon-cluster-operator needs RBAC access to namespaces and pods for the TerminatedPodGcJob.

  • Implement the LinuxSecurityPatch job.

@jefflill jefflill changed the title Quartz doesn't appear to be working as expected neon-cluster-operator: wrap-up issues Apr 18, 2024
@jefflill
Copy link
Collaborator Author

DONE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant