
Conversation

@sknat sknat commented Sep 2, 2025

This patch splits services into two components:

  • a watcher that runs the informer fetching services and
    endpoints from the k8s API;
  • a handler that takes care of programming VPP with the NAT
    rules, within the context of the felix server's single goroutine.

The intent is to move away from a model with multiple servers
replicating state and communicating over a pubsub. That model is
prone to race conditions and deadlocks, and brings little benefit,
since scale and asynchronicity are not a constraint on nodes
running a relatively small number of pods (~100, the Kubernetes
default).
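As an illustration, here is a minimal sketch of the watcher/handler split described above, assuming hypothetical names (`ServiceEvent`, `watchServices`, `felixServerLoop`) that do not match the actual types in the agent:

```go
package main

import "fmt"

// ServiceEvent is a hypothetical event type emitted by the services watcher.
type ServiceEvent struct {
	Kind string // e.g. "service-added", "endpoints-updated"
	Name string // namespaced service name
}

// watchServices stands in for the informer-based watcher: it only turns k8s API
// updates into events and never programs VPP itself.
func watchServices(events chan<- ServiceEvent) {
	events <- ServiceEvent{Kind: "service-added", Name: "default/kubernetes"}
}

// felixServerLoop is the single goroutine that owns all VPP programming. Since
// it is the only consumer, the NAT handler it calls needs no locks.
func felixServerLoop(events <-chan ServiceEvent) {
	for evt := range events {
		// The real handler would program NAT rules in VPP here.
		fmt.Printf("programming VPP NAT for %s %s\n", evt.Kind, evt.Name)
	}
}

func main() {
	events := make(chan ServiceEvent, 16)
	go func() {
		watchServices(events)
		close(events) // toy example only; the real watcher runs for the agent's lifetime
	}()
	felixServerLoop(events)
}
```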

sknat added 5 commits August 26, 2025 15:16
This patch changes the way we persist data on disk when
running Calico/VPP. Instead of using struc and a binary format,
we transition to JSON files. Size should not be an issue as the
number of pods per node is typically low (~100). This will make
troubleshooting easier and errors clearer when parsing fails.

We can thus remove the /bin/debug troubleshooting utility, which
was only needed because the binary data format was not human
readable.

Doing this, we address an issue where PBL indexes were reused
upon dataplane restart, as they were stored in a list. We now
use a map keyed by container IP to retain that mapping.

We also split the configuration from the runtime spec in
LocalPodSpec and add a step to clear it when the corresponding
VRFs are not found in VPP.

Finally, we address an issue where uRPF was not properly set up
for IPv6.

Signed-off-by: Nathan Skrzypczak <[email protected]>
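
As a rough illustration of the JSON-based persistence, here is a sketch assuming a hypothetical `PodState` type, helper names and file layout; the real LocalPodSpec carries many more fields:

```go
package storage // hypothetical package name for the sketch

import (
	"encoding/json"
	"os"
)

// PodState is a stand-in for the persisted pod spec. PBL indexes live in a map
// keyed by container IP, so they are not re-assigned positionally after a
// dataplane restart.
type PodState struct {
	InterfaceName string            `json:"interfaceName"`
	PblIndexes    map[string]uint32 `json:"pblIndexes"` // containerIP -> PBL index
}

// SavePodStates writes the whole pod map as indented JSON, easy to inspect by hand.
func SavePodStates(path string, pods map[string]PodState) error {
	data, err := json.MarshalIndent(pods, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

// LoadPodStates reads the map back; a JSON decode error is explicit and human
// readable, unlike a binary decode failure.
func LoadPodStates(path string) (map[string]PodState, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	pods := make(map[string]PodState)
	return pods, json.Unmarshal(data, &pods)
}
```
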
This patch splits the felix server into two pieces:
- a felix watcher placed under `agent/watchers/felix`
- a felix server placed under `agent/felix`

The former is only responsible for watching and submitting
events to a single event queue. The latter receives the events
in a single goroutine and programs VPP from that single thread.

The intent is to move away from a model with multiple servers
replicating state and communicating over a pubsub. That model is
prone to race conditions and deadlocks, and brings little benefit,
since scale and asynchronicity are not a constraint on nodes
running a relatively small number of pods (~100, the Kubernetes
default).

Signed-off-by: Nathan Skrzypczak <[email protected]>
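
A minimal sketch of the single event queue consumed by the felix server, with hypothetical event kinds and package name; the real agent defines its own event types under `agent/felix`:

```go
package felix // hypothetical package name for the sketch

import "log"

// EventKind enumerates the kinds of events the single queue can carry.
type EventKind int

const (
	FelixPolicyUpdate EventKind = iota
	PodAdded
	ServiceChanged
)

// Event is what every watcher submits to the queue; Payload would carry the
// actual update in the real agent.
type Event struct {
	Kind    EventKind
	Payload interface{}
}

// Run drains the queue in one goroutine, so the handlers it dispatches to
// never race with one another and no pubsub between servers is needed.
func Run(events <-chan Event) {
	for evt := range events {
		switch evt.Kind {
		case FelixPolicyUpdate:
			log.Println("applying felix policy update to VPP")
		case PodAdded:
			log.Println("programming pod interface in VPP")
		case ServiceChanged:
			log.Println("updating NAT rules in VPP")
		}
	}
}
```
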
This patch splits the CNI watcher and handlers into two pieces.
Handling is done in the main 'felix' goroutine, while the
watcher / gRPC server lives under watchers/ and does not store
or access agent state.

The intent is to move away from a model with multiple servers
replicating state and communicating over a pubsub. That model is
prone to race conditions and deadlocks, and brings little benefit,
since scale and asynchronicity are not a constraint on nodes
running a relatively small number of pods (~100, the Kubernetes
default).

Signed-off-by: Nathan Skrzypczak <[email protected]>
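
A minimal sketch of a stateless CNI front end forwarding requests to the felix goroutine, using hypothetical types (`CniRequest`, `ServeCni`, `HandleCni`) rather than the actual gRPC-generated ones:

```go
package cniproxy // hypothetical package name for the sketch

// CniRequest is a hypothetical wrapper around a CNI ADD/DEL request.
type CniRequest struct {
	Action string     // "ADD" or "DEL"
	PodID  string
	Reply  chan error // the felix goroutine answers on this channel
}

// ServeCni sits behind the gRPC endpoint: it holds no agent state, it only
// enqueues the request and blocks until the felix goroutine has programmed VPP.
func ServeCni(requests chan<- CniRequest, action, podID string) error {
	req := CniRequest{Action: action, PodID: podID, Reply: make(chan error, 1)}
	requests <- req
	return <-req.Reply
}

// HandleCni runs inside the single felix goroutine and owns the pod cache,
// which therefore needs no locking.
func HandleCni(requests <-chan CniRequest) {
	pods := make(map[string]bool)
	for req := range requests {
		switch req.Action {
		case "ADD":
			pods[req.PodID] = true // the real handler would program the pod interface in VPP
		case "DEL":
			delete(pods, req.PodID)
		}
		req.Reply <- nil
	}
}
```
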
This patch moves the connectivity handlers into the main felix
loop to allow lockless access to the cache.

The intent is to move away from a model with multiple servers
replicating state and communicating over a pubsub. That model is
prone to race conditions and deadlocks, and brings little benefit,
since scale and asynchronicity are not a constraint on nodes
running a relatively small number of pods (~100, the Kubernetes
default).

Signed-off-by: Nathan Skrzypczak <[email protected]>
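
A minimal sketch of lockless cache access once connectivity handling runs in the felix loop; `Cache` and `nodeRoute` are hypothetical names:

```go
package connectivity // hypothetical package name for the sketch

// nodeRoute is a hypothetical entry in the connectivity cache.
type nodeRoute struct {
	NextHop string
}

// Cache needs no sync.Mutex: it is created, read and updated exclusively from
// the single felix goroutine that also programs VPP.
type Cache struct {
	routes map[string]nodeRoute // destination prefix -> route
}

// Update mutates the cache and could program VPP in the same call, with the
// cache guaranteed consistent since no other goroutine can touch it.
func (c *Cache) Update(prefix string, r nodeRoute) {
	if c.routes == nil {
		c.routes = make(map[string]nodeRoute)
	}
	c.routes[prefix] = r
}
```
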
This patch splits services into two components:
- a watcher that runs the informer fetching services and
  endpoints from the k8s API;
- a handler that takes care of programming VPP with the NAT
  rules, within the context of the felix server's single goroutine.

The intent is to move away from a model with multiple servers
replicating state and communicating over a pubsub. That model is
prone to race conditions and deadlocks, and brings little benefit,
since scale and asynchronicity are not a constraint on nodes
running a relatively small number of pods (~100, the Kubernetes
default).

Signed-off-by: Nathan Skrzypczak <[email protected]>
@sknat sknat self-assigned this Sep 2, 2025