[after GA] Commitments syncer and change-api share mutex lock #640
Conversation
📝 Walkthrough
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant CRMutex
    participant Syncer
    participant K8s
    Client->>API: POST /change-commitments
    API->>CRMutex: Lock()
    activate API
    API->>API: Process request
    API->>K8s: Update CR state
    API->>CRMutex: Unlock()
    deactivate API
    API-->>Client: Response
    rect rgba(173, 216, 230, 0.5)
        Note over Syncer,K8s: Concurrent sync operation
        Syncer->>CRMutex: Lock()
        activate Syncer
        Syncer->>K8s: Check flavor readiness
        Syncer->>K8s: Apply CR states
        Syncer->>K8s: Delete obsolete CRs
        Syncer->>CRMutex: Unlock()
        deactivate Syncer
    end
```
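The coordination in the diagram can be sketched in Go. The component and method names below are illustrative stand-ins for the actual cortex types, showing only the core idea: both components hold a pointer to the same mutex, so CR mutations from the API and the syncer are serialized.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative stand-ins for the API and syncer; both share one mutex.
type API struct {
	crMutex *sync.Mutex
	state   map[string]string
}

type Syncer struct {
	crMutex *sync.Mutex
	state   map[string]string
}

func (a *API) HandleChangeCommitments(id string) {
	a.crMutex.Lock()
	defer a.crMutex.Unlock()
	a.state[id] = "updated" // CR state change under the shared lock
}

func (s *Syncer) SyncReservations() {
	s.crMutex.Lock()
	defer s.crMutex.Unlock()
	s.state["synced"] = "true" // apply/delete CRs under the same lock
}

// runShared wires one mutex into both components, exercises them
// concurrently, and returns the number of keys written.
func runShared() int {
	shared := &sync.Mutex{}
	state := map[string]string{}
	api := &API{crMutex: shared, state: state}
	syncer := &Syncer{crMutex: shared, state: state}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); api.HandleChangeCommitments("cr-1") }()
	go func() { defer wg.Done(); syncer.SyncReservations() }()
	wg.Wait()
	return len(state)
}

func main() {
	fmt.Println(runShared()) // 2
}
```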
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Test Coverage Report 📊: 68.3%
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/scheduling/reservations/commitments/api_change_commitments.go (1)
**67-104**: ⚠️ Potential issue | 🟠 Major

Move the shared lock acquisition to after request validation to avoid blocking the syncer on input I/O.

The lock is acquired before the method check, body decode, and dry-run validation. This allows a slow or stalled client to monopolize `api.crMutex` while blocked on input, which also stalls `syncer.SyncReservations()` and every other change-commitments request even though no CR mutation has started yet.

The lock should only be held during the actual `processCommitmentChanges()` call where CR mutations occur (via `manager.ApplyCommitmentState()`), not during HTTP validation and I/O operations.

Suggested move:

```diff
-	// Serialize all change-commitments requests (shared with syncer)
-	api.crMutex.Lock()
-	defer api.crMutex.Unlock()
-
 	ctx := reservations.WithGlobalRequestID(context.Background(), "committed-resource-"+requestID)
 	logger := LoggerFromContext(ctx).WithValues("component", "api", "endpoint", "/commitments/v1/change-commitments")
@@ -100,6 +96,10 @@ func (api *HTTPAPI) HandleChangeCommitments(w http.ResponseWriter, r *http.Req
 		return
 	}
 
+	// Serialize actual CR mutations only after request validation succeeded.
+	api.crMutex.Lock()
+	defer api.crMutex.Unlock()
+
 	// Process commitment changes
 	// For now, we'll implement a simplified path that checks capacity for immediate start CRs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/api_change_commitments.go` around lines 67 - 104, The handler currently locks api.crMutex before HTTP validation and body decoding, which can block syncer and other requests; move the api.crMutex.Lock()/defer Unlock() out of the start of the handler and only acquire the lock immediately before calling processCommitmentChanges() (the section that calls manager.ApplyCommitmentState()), keeping the defer Unlock() to release after that call; preserve all existing request validation (method check, json decode, dry-run handling, logger, api.recordMetrics, response writes) outside the locked region so input I/O does not hold api.crMutex.
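The validate-first, lock-late ordering the comment asks for can be sketched as follows. The type, field, and endpoint names here are hypothetical stand-ins, not the actual cortex handler; only the ordering of validation versus locking is the point.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

type changeRequest struct {
	ProjectID string `json:"project_id"`
}

type api struct {
	crMutex sync.Mutex
	state   map[string]string
}

func (a *api) handleChangeCommitments(w http.ResponseWriter, r *http.Request) {
	// Method check and body decode happen BEFORE taking the lock, so a
	// slow or stalled client cannot hold up the syncer on input I/O.
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var req changeRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	// Serialize only the actual CR mutation.
	a.crMutex.Lock()
	defer a.crMutex.Unlock()
	a.state[req.ProjectID] = "committed"
	io.WriteString(w, "ok")
}

// demo issues one valid request and reports the status code and stored state.
func demo() (int, string) {
	a := &api{state: map[string]string{}}
	srv := httptest.NewServer(http.HandlerFunc(a.handleChangeCommitments))
	defer srv.Close()

	resp, err := http.Post(srv.URL, "application/json", strings.NewReader(`{"project_id":"p1"}`))
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	return resp.StatusCode, a.state["p1"]
}

func main() {
	code, state := demo()
	fmt.Println(code, state) // 200 committed
}
```

An invalid request (wrong method, malformed body) returns early without ever touching the mutex, which is exactly the property the review comment is after.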
🧹 Nitpick comments (1)
internal/scheduling/reservations/commitments/api_change_commitments_test.go (1)
**1001-1004**: Add one contention test for the shared `CRMutex`.

`NewAPIWithConfig(..., nil, nil)` still exercises a private mutex, so this suite will not catch a regression where `HTTPAPI` and `Syncer` stop sharing the same lock. A single integration test that runs both against one `CRMutex` and asserts one blocks until the other completes would cover the PR's main contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/api_change_commitments_test.go` around lines 1001 - 1004, Add an integration test that verifies HTTPAPI and Syncer share the same CRMutex by constructing a single CRMutex instance and passing it into NewAPIWithConfig (instead of nil) and into the Syncer constructor, then run an operation on the HTTPAPI that acquires the mutex and blocks (e.g., handler goroutine that holds lock and signals when held), start the Syncer operation in another goroutine and assert it does not proceed until the first releases (use channels/timeouts to detect blocking), finally release the first and assert the Syncer completes; reference NewAPIWithConfig, HTTPAPI, Syncer, and CRMutex when locating where to inject the test.
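A channel-based contention check along the lines the comment suggests could look like this. The `CRMutex` here mirrors the nil-safe wrapper under discussion, but the surrounding harness (goroutines standing in for the API handler and the syncer) is an illustrative sketch, not the suite's actual helpers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// CRMutex mirrors the nil-safe wrapper under test.
type CRMutex struct{ mu sync.Mutex }

func (m *CRMutex) Lock() {
	if m != nil {
		m.mu.Lock()
	}
}

func (m *CRMutex) Unlock() {
	if m != nil {
		m.mu.Unlock()
	}
}

// contention verifies that a second component blocks until the first
// releases the shared mutex, returning true if ordering was enforced.
func contention(shared *CRMutex) bool {
	held := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	// First component (e.g. the API handler): hold the lock until told.
	go func() {
		shared.Lock()
		close(held)
		<-release
		shared.Unlock()
	}()
	<-held

	// Second component (e.g. the syncer): must block on Lock().
	go func() {
		shared.Lock()
		shared.Unlock()
		close(done)
	}()

	select {
	case <-done:
		return false // proceeded while the lock was held: not serialized
	case <-time.After(50 * time.Millisecond):
		// Expected: still blocked. Release and wait for completion.
	}
	close(release)
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println(contention(&CRMutex{})) // true
}
```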
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/reservations/commitments/api.go`:
- Around line 24-33: The CRMutex type and its Lock/Unlock methods only use an
in-memory sync.Mutex and therefore do not provide cross-pod serialization;
replace or augment CRMutex to use a Kubernetes Lease-based distributed lock (or
another cluster-wide coordination primitive) inside the CRMutex methods so
Lock/Unlock block across pods, ensure the existing nil-receiver semantics remain
safe, and update the CRMutex construction logic in main.go (where crMutex is
created) to initialize the Kubernetes client/Lease lock instance and pass it
into CRMutex so API and syncer pods contend on the same Lease.
---
Outside diff comments:
In `@internal/scheduling/reservations/commitments/api_change_commitments.go`:
- Around line 67-104: The handler currently locks api.crMutex before HTTP
validation and body decoding, which can block syncer and other requests; move
the api.crMutex.Lock()/defer Unlock() out of the start of the handler and only
acquire the lock immediately before calling processCommitmentChanges() (the
section that calls manager.ApplyCommitmentState()), keeping the defer Unlock()
to release after that call; preserve all existing request validation (method
check, json decode, dry-run handling, logger, api.recordMetrics, response
writes) outside the locked region so input I/O does not hold api.crMutex.
---
Nitpick comments:
In `@internal/scheduling/reservations/commitments/api_change_commitments_test.go`:
- Around line 1001-1004: Add an integration test that verifies HTTPAPI and
Syncer share the same CRMutex by constructing a single CRMutex instance and
passing it into NewAPIWithConfig (instead of nil) and into the Syncer
constructor, then run an operation on the HTTPAPI that acquires the mutex and
blocks (e.g., handler goroutine that holds lock and signals when held), start
the Syncer operation in another goroutine and assert it does not proceed until
the first releases (use channels/timeouts to detect blocking), finally release
the first and assert the Syncer completes; reference NewAPIWithConfig, HTTPAPI,
Syncer, and CRMutex when locating where to inject the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c47f7efa-ce44-47c8-8fe4-2e2c47914516
📒 Files selected for processing (7)
- cmd/main.go
- internal/scheduling/reservations/commitments/api.go
- internal/scheduling/reservations/commitments/api_change_commitments.go
- internal/scheduling/reservations/commitments/api_change_commitments_test.go
- internal/scheduling/reservations/commitments/api_report_usage_test.go
- internal/scheduling/reservations/commitments/syncer.go
- internal/scheduling/reservations/commitments/syncer_test.go
```go
// CRMutex serializes CR state changes between the syncer and change-commitments API.
// This ensures that the syncer's Limes state snapshot is applied atomically without
// interference from concurrent change-commitments API calls. The Lock and Unlock
// methods are no-ops if the receiver is nil, allowing safe use when either component
// is disabled.
// TODO: If the syncer and API are moved to separate pods, replace with a K8s
// distributed lock (e.g., Lease-based coordination).
type CRMutex struct {
	mu sync.Mutex
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Find and inspect the CRMutex definition
find . -name "api.go" -path "*/commitments/*" | head -5
```

Repository: cobaltcore-dev/cortex
Length of output: 118

🏁 Script executed:

```shell
# Check main.go for mutex creation
find . -name "main.go" -type f | head -10
```

Length of output: 165

🏁 Script executed:

```shell
# Look at the directory structure to understand the project layout
ls -la internal/scheduling/reservations/commitments/ 2>/dev/null || echo "Path may differ"
git ls-files | grep -E "(main|api|commitments)" | head -20
```

Length of output: 3527

🏁 Script executed:

```shell
# Read the api.go file to see CRMutex definition
cat -n internal/scheduling/reservations/commitments/api.go | head -50
```

Length of output: 1892

🏁 Script executed:

```shell
# Read main.go to see how CRMutex is initialized
cat -n cmd/main.go | head -100
```

Length of output: 4468

🏁 Script executed:

```shell
# Search for where CRMutex is used
rg "CRMutex" --type go -B 2 -A 2
```

Length of output: 6169

🏁 Script executed:

```shell
# Search for deployment configuration or documentation about multi-replica setup
find . -name "*.yaml" -o -name "*.yml" -o -name "Dockerfile" -o -name "README*" | grep -E "(deploy|helm|kube|docker|README)" | head -20
```

Length of output: 955

🏁 Script executed:

```shell
# Look for environment/deployment configurations
git ls-files | grep -E "(helm|deploy|kube|config)" | head -20
```

Length of output: 887

🏁 Script executed:

```shell
# Check if there are multiple instances expected to run
rg "replicas|replica|multi-pod|multi-instance" --type go -i | head -20
```

Length of output: 2820

🏁 Script executed:

```shell
# Look at the full main.go to understand the complete context of crMutex usage
sed -n '1,300p' cmd/main.go | tail -100
```

Length of output: 4633

🏁 Script executed:

```shell
# Check if leader election is actually being used and how it impacts the syncer/API
rg "LeaderElection" --type go -B 2 -A 2 | head -40
```

Length of output: 2152

🏁 Script executed:

```shell
# Look at how the syncer and API are registered with the manager
sed -n '280,400p' cmd/main.go
```

Length of output: 5630

🏁 Script executed:

```shell
# Check helm values to understand the deployment architecture
cat helm/bundles/cortex-cinder/values.yaml | grep -A 20 "replicas\|deployment" | head -40
```

Length of output: 688

🏁 Script executed:

```shell
# Check if the syncer and API are actually in separate pods or the same process
rg "NewSyncer|syncer.*crMutex" --type go -B 2 -A 5
```

Length of output: 5176

🏁 Script executed:

```shell
# Look for where the syncer is initialized in main.go to confirm co-location
sed -n '380,500p' cmd/main.go | grep -A 20 -B 5 -i "syncer"
```

Length of output: 47

🏁 Script executed:

```shell
# Search for any split deployment or separate syncer pod configuration
rg "syncer|commitments" helm/bundles/cortex-cinder/values.yaml -A 3 -B 3
```

Length of output: 47

🏁 Script executed:

```shell
# Check the complete syncer initialization and confirm both syncer and API are conditional
sed -n '300,450p' cmd/main.go | grep -A 30 "commitments-sync-task"
```

Length of output: 47

🏁 Script executed:

```shell
# Verify if there's any cross-pod coordination or distributed locking currently in place
rg "Lease|distributed|coordination|lock" --type go internal/scheduling/reservations/commitments/ -i
```

Length of output: 2653

🏁 Script executed:

```shell
# Check if leader election is actually enabled in practice and if multi-replica scenarios are supported
cat helm/bundles/cortex-cinder/values.yaml | head -50
```

Length of output: 1283

🏁 Script executed:

```shell
# Verify the actual syncer initialization block in main.go
rg "commitments-sync-task|commitments-api" cmd/main.go -B 2 -A 15
```

Length of output: 2330

🏁 Script executed:

```shell
# Check if there's any documentation about deployment models or whether split deployments are planned
find . -name "*.md" -type f | xargs grep -l "syncer\|commitments\|deployment" | head -5
```

Length of output: 230

🏁 Script executed:

```shell
# Check the full syncer initialization to confirm both components are in the same process currently
sed -n '430,470p' cmd/main.go
```

Length of output: 1948

🏁 Script executed:

```shell
# Look for any task.Runner or separate process initialization for syncer
rg "SetupWithManager.*syncer\|commitments.*SetupWithManager" --type go -B 3 -A 3
```

Length of output: 47

🏁 Script executed:

```shell
# Check the deployment documentation to understand the expected deployment model
cat docs/reservations/committed-resource-reservations.md | head -100
```

Length of output: 5148

🏁 Script executed:

```shell
# Look for any existing distributed coordination mechanisms or TODO comments about multi-replica
rg "TODO|FIXME|XXX" internal/scheduling/reservations/commitments/ --type go | grep -i "replica\|pod\|distributed"
```

Length of output: 194

🏁 Script executed:

```shell
# Confirm the actual scenario: are syncer and API always co-located?
rg "nova-pipeline-controllers.*commitments-sync-task\|commitments-sync-task.*nova-pipeline-controllers" helm/ --type yaml
```

Length of output: 47

🏁 Script executed:

```shell
# Confirm whether the syncer and API are conditionally enabled or always together
rg "nova-pipeline-controllers" helm/bundles/cortex-cinder/values.yaml -A 5 -B 5
```

Length of output: 47

🏁 Script executed:

```shell
# Check if there are other configurations that could enable them separately
git ls-files | xargs grep -l "commitments-sync-task\|nova-pipeline-controllers" | grep -E "(yaml|helm)" | head -5
```

Length of output: 101

🏁 Script executed:

```shell
# Check the nova deployment configuration to see how syncer and API are enabled
cat helm/bundles/cortex-nova/values.yaml | grep -A 20 -B 5 "commitments-sync-task\|enabledControllers\|enabledTasks"
```

Length of output: 2273

🏁 Script executed:

```shell
# Verify the helm deployment splits - check if cortex-nova-knowledge and cortex-nova-scheduling are separate
cat helm/bundles/cortex-nova/Chart.yaml
```

Length of output: 1119

🏁 Script executed:

```shell
# Confirm the dependencies and deployment structure
git ls-files helm/bundles/cortex-nova/ | grep -E "values|deployment|templates"
```

Length of output: 815
CRMutex provides no cross-pod synchronization in current deployment.
The syncer and API run in separate Kubernetes deployments: cortex-nova-knowledge (syncer) and cortex-nova-scheduling (API). Each pod independently creates its own crMutex instance in main.go. In multi-replica scenarios, a request to the API pod can race concurrent writes from the syncer running in a different pod, violating the atomicity guarantee claimed in the docstring. The current in-memory mutex only serializes within a single pod. A distributed coordination mechanism (e.g., Kubernetes Lease-based lock) is required to properly serialize CR state changes across pods.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/commitments/api.go` around lines 24 - 33,
The CRMutex type and its Lock/Unlock methods only use an in-memory sync.Mutex
and therefore do not provide cross-pod serialization; replace or augment CRMutex
to use a Kubernetes Lease-based distributed lock (or another cluster-wide
coordination primitive) inside the CRMutex methods so Lock/Unlock block across
pods, ensure the existing nil-receiver semantics remain safe, and update the
CRMutex construction logic in main.go (where crMutex is created) to initialize
the Kubernetes client/Lease lock instance and pass it into CRMutex so API and
syncer pods contend on the same Lease.
I will do this properly with a lease-based lock... after GA.
```go
	// Mutex to serialize change-commitments requests
	changeMutex sync.Mutex
	// Shared mutex to serialize CR state changes with the syncer
	crMutex *CRMutex
```
The idiomatic way in Go is as follows. You can use `sync.Mutex` as a field in your type struct.

```go
type HTTPAPI struct {
	mutex sync.Mutex
}
```

And when you're initializing the struct, you can use the mutex directly. No nil check needed.

```go
api := HTTPAPI{}
api.mutex.Lock() // This works without explicit initialization
```
api.mutex.Lock() // This should work without explicit initializationThere was a problem hiding this comment.
This because you are embedding mutex here, and initializing the surrounding struct will initialize the embedded struct as well.
There was a problem hiding this comment.
There was a problem hiding this comment.
The mutex wrapper shouldn't be necessary.
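The zero-value point can be demonstrated end to end. `HTTPAPI` below is an illustrative stand-in for the reviewer's suggestion, not the actual cortex type:

```go
package main

import (
	"fmt"
	"sync"
)

// HTTPAPI uses a plain sync.Mutex field: no constructor, no nil checks.
type HTTPAPI struct {
	mutex sync.Mutex
	count int
}

func (a *HTTPAPI) increment() {
	a.mutex.Lock() // usable immediately on a zero-valued HTTPAPI
	defer a.mutex.Unlock()
	a.count++
}

// hammer runs n concurrent increments and returns the final count; any
// lost update would indicate the mutex was not serializing access.
func hammer(n int) int {
	api := &HTTPAPI{} // no explicit mutex initialization
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); api.increment() }()
	}
	wg.Wait()
	return api.count
}

func main() {
	fmt.Println(hammer(100)) // 100
}
```

The caveat from the review discussion still applies: a field mutex only drops the nil checks; it does not provide the cross-component sharing that `*CRMutex` exists for, since the API and syncer must contend on one lock instance.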
To implement lease locking, we can use this as reference: https://github.com/ironcore-dev/network-operator/blob/main/internal/resourcelock/resourcelock.go |
No description provided.