@arghosh93
Contributor

This PR stops adding Egress IPs to the public load balancer
backend pool in any Azure cluster, regardless of whether an
OutBoundRule is present.

As a consequence, there is no outbound connectivity except to the
infrastructure subnet, even if there is no OutBoundRule.

However, this change is required to address the following problems:

- If an infra node is used as an egressNode, the health check also
succeeds for the egress IP once it is added to the public load
balancer, so the LB treats it as a legitimate ingress router backend.

- Adding egress IPs to the backend pool limits the number of egress IPs
that can be created on a cluster, due to an Azure-specific limitation.

This PR also lets the controller remove any egress IP that was
previously added to the public load balancer backend pool.

@openshift-ci-robot added labels Sep 26, 2025:
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-57447, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@arghosh93
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Sep 26, 2025
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-57447, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

@openshift-ci bot requested a review from huiran0826 September 26, 2025 11:19
@arghosh93 changed the title from "OCPBUGS-57447: Refrain from adding Egress IP to public LB backend pool" to "OCPBUGS-57447,OCPBUGS-45056: Refrain from adding Egress IP to public LB backend pool" Sep 26, 2025
@openshift-ci-robot added the jira/invalid-bug label and removed the jira/valid-bug label Sep 26, 2025
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-57447, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-45056, which is invalid:

  • expected the bug to target either version "4.21." or "openshift-4.21.", but it targets "4.20.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@arghosh93
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Sep 26, 2025
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-57447, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @huiran0826

This pull request references Jira Issue OCPBUGS-45056, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

Member

@arkadeepsen left a comment

Do we have any tests which can be used to verify the changes made in this PR will correctly solve the issue?

@arghosh93
Contributor Author

> Do we have any tests which can be used to verify the changes made in this PR will correctly solve the issue?

We lack sufficient knowledge of the different cloud providers' APIs to fake them. That is the main reason we do not have enough unit tests.

@pperiyasamy
Member

@arghosh93 Does this PR introduce any limitations on pod egress traffic? From my understanding, if we skip adding the EgressIP to the load balancer backend pools, the egress traffic will be restricted to the infra subnet. Is that correct?

@arghosh93
Contributor Author

> @arghosh93 Does this PR introduce any limitations on pod egress traffic? From my understanding, if we skip adding the EgressIP to the load balancer backend pools, the egress traffic will be restricted to the infra subnet. Is that correct?

Yes @pperiyasamy, that is correct. The plan is to document this limitation along with a suggestion to use a NAT gateway instead of a general-purpose public load balancer. I am also going to notify the support team members so that everyone is aware.

@pperiyasamy
Member

pperiyasamy commented Oct 16, 2025

> > @arghosh93 Does this PR introduce any limitations on pod egress traffic? From my understanding, if we skip adding the EgressIP to the load balancer backend pools, the egress traffic will be restricted to the infra subnet. Is that correct?
>
> Yes @pperiyasamy, that is correct. The plan is to document this limitation along with a suggestion to use a NAT gateway instead of a general-purpose public load balancer. I am also going to notify the support team members so that everyone is aware.

Thanks @arghosh93. If everyone agrees with this, I'm fine with it. One comment on the sync function.

@pperiyasamy
Member

/retest-required

@openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Oct 30, 2025

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/cloudprovider/azure.go (1)

310-381: Well-structured cleanup implementation.

The SyncLBBackend method correctly implements one-time upgrade cleanup:

✅ Gracefully handles deleted nodes (logs warning, continues processing)
✅ Uses anonymous function with defer to avoid lock accumulation across loop iterations
✅ Idempotent updates (only modifies when LoadBalancerBackendAddressPools is present)
✅ Proper per-node locking prevents concurrent modifications
✅ Comprehensive error wrapping for Azure API calls

Optional: Consider adding observability for successful cleanups.

For better visibility during upgrades, consider logging when backend pools are actually removed:

🔎 View suggested enhancement
 			if loadBalancerBackendPoolModified {
 				networkInterface.Properties.IPConfigurations = ipConfigurations
+				klog.Infof("Removing Egress IP %s from load balancer backend pool for node %s", ipc, node.Name)
 				poller, err := a.createOrUpdate(networkInterface)
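
The cleanup pattern praised above (per-node locking inside an anonymous function, and an update call only when something changed) can be sketched roughly as follows. This is not the PR's code: `ipConfig`, `nic`, `nodeLocks`, and `syncBackendPools` are hypothetical stand-ins for the Azure SDK types and for SyncLBBackend, and the `update` callback stands in for the Azure API call.

```go
package main

import (
	"fmt"
	"sync"
)

// ipConfig and nic are hypothetical stand-ins for Azure NIC types.
type ipConfig struct {
	IP           string
	Primary      bool
	BackendPools []string
}

type nic struct {
	Node      string
	IPConfigs []ipConfig
}

// nodeLocks hands out one mutex per node name.
type nodeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (n *nodeLocks) get(node string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks == nil {
		n.locks = map[string]*sync.Mutex{}
	}
	if _, ok := n.locks[node]; !ok {
		n.locks[node] = &sync.Mutex{}
	}
	return n.locks[node]
}

// syncBackendPools strips backend pools from every non-primary IP
// configuration that holds an egress IP, and calls update only for NICs
// that actually changed.
func syncBackendPools(nics []*nic, egressIPs map[string]bool, locks *nodeLocks, update func(*nic) error) error {
	for _, n := range nics {
		// The closure scopes the deferred Unlock to one iteration, so locks
		// are released per node instead of piling up until the loop returns.
		err := func() error {
			l := locks.get(n.Node)
			l.Lock()
			defer l.Unlock()

			modified := false
			for i := range n.IPConfigs {
				c := &n.IPConfigs[i]
				if !c.Primary && egressIPs[c.IP] && len(c.BackendPools) > 0 {
					c.BackendPools = nil
					modified = true
				}
			}
			if !modified {
				return nil // already clean: no API call is made
			}
			if err := update(n); err != nil {
				return fmt.Errorf("failed to update NIC of node %s: %w", n.Node, err)
			}
			return nil
		}()
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	nics := []*nic{{Node: "worker-0", IPConfigs: []ipConfig{
		{IP: "10.0.0.4", Primary: true, BackendPools: []string{"pool"}},
		{IP: "10.0.0.10", BackendPools: []string{"pool"}},
	}}}
	calls := 0
	update := func(*nic) error { calls++; return nil }
	egress := map[string]bool{"10.0.0.10": true}
	locks := &nodeLocks{}
	_ = syncBackendPools(nics, egress, locks, update)
	_ = syncBackendPools(nics, egress, locks, update) // second run is a no-op
	fmt.Println("update calls:", calls)               // prints "update calls: 1"
}
```

Running the sync twice issues a single update, which is the idempotency property the review calls out; the primary IP configuration keeps its backend pool.
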
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between b8dcddb and 8116e47.

📒 Files selected for processing (9)
  • cmd/cloud-network-config-controller/main.go (1 hunks)
  • pkg/cloudprovider/azure.go (8 hunks)
  • pkg/cloudprovider/cloudprovider.go (1 hunks)
  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go (4 hunks)
  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller_test.go (1 hunks)
  • pkg/controller/configmap/configmap_controller.go (1 hunks)
  • pkg/controller/controller.go (2 hunks)
  • pkg/controller/node/node_controller.go (2 hunks)
  • pkg/controller/secret/secret_controller.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/cloudprovider/cloudprovider.go
  • pkg/controller/node/node_controller.go
  • cmd/cloud-network-config-controller/main.go
  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go
  • pkg/controller/secret/secret_controller.go
  • pkg/controller/controller.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller_test.go
  • pkg/cloudprovider/azure.go
  • pkg/controller/configmap/configmap_controller.go
🧬 Code graph analysis (2)
pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller_test.go (1)
pkg/cloudprovider/cloudprovider.go (1)
  • CloudProviderConfig (88-98)
pkg/cloudprovider/azure.go (1)
pkg/cloudprivateipconfig/cloudprivateipconfig.go (1)
  • NameToIP (28-39)
🔇 Additional comments (6)
pkg/controller/configmap/configmap_controller.go (1)

106-108: LGTM: Appropriate no-op implementation.

The no-op InitialSync() is correct for ConfigMapController, which only watches for configuration changes and triggers shutdown. No startup initialization is needed.

pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller_test.go (1)

100-100: LGTM: Correct test setup update.

The empty CloudProviderConfig{} is appropriate for these tests, which use a FakeCloudProvider and don't exercise platform-specific behavior.

pkg/cloudprovider/azure.go (4)

16-16: LGTM!

The import additions correctly support the new SyncLBBackend functionality. All imported packages are utilized in the implementation.

Also applies to: 23-24, 26-27, 29-29, 32-32


121-170: LGTM!

The removal of backend pool client initialization is correct and aligns with the PR objective to stop managing Egress IPs in Azure load balancer backend pools.


172-230: LGTM!

The changes correctly implement the new behavior:

  • Enhanced error messages with proper context wrapping improve debuggability
  • Simplified IP configuration omits backend pool assignment as intended
  • The warning about limited outbound connectivity (lines 223-224) is important for operators and aligns with the documented limitations in the PR
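
As a rough illustration of the behavior described in this comment, the sketch below builds a secondary IP configuration that carries no backend pool and logs the connectivity warning. The `ipConfiguration` type, the `newEgressIPConfiguration` helper, and the naming scheme are assumptions made for the example; they are not the Azure SDK types or the code in azure.go.

```go
package main

import (
	"fmt"
	"log"
)

// ipConfiguration is a simplified, hypothetical stand-in for an Azure NIC IP
// configuration; it is not the Azure SDK type.
type ipConfiguration struct {
	Name             string
	PrivateIPAddress string
	Primary          bool
	BackendPoolIDs   []string // load balancer backend pools the address belongs to
}

// newEgressIPConfiguration builds a secondary IP configuration for an egress
// IP. BackendPoolIDs is deliberately left empty, so the address is never
// registered as a public load balancer backend.
func newEgressIPConfiguration(nodeName, ip string) ipConfiguration {
	log.Printf("egress IP %s on node %s is not added to any LB backend pool; "+
		"outbound connectivity is limited to the infrastructure subnet", ip, nodeName)
	return ipConfiguration{
		Name:             fmt.Sprintf("%s_%s", nodeName, ip),
		PrivateIPAddress: ip,
		Primary:          false,
	}
}

func main() {
	cfg := newEgressIPConfiguration("worker-0", "10.0.128.10")
	fmt.Printf("%s primary=%t backendPools=%d\n", cfg.Name, cfg.Primary, len(cfg.BackendPoolIDs))
}
```
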

232-273: LGTM!

The error message improvements provide better context for debugging Azure API interactions. The absence of backend pool cleanup logic is correct since Egress IPs are no longer added to backend pools.

@arghosh93
Contributor Author

/retest-required

Contributor

@kyrtapz left a comment

The last iteration looks great!
I have just a few more comments.
One thing that came to mind is that we will have to document that users won't be able to add the CPIC addresses to the backend pools on their own, because cncc would remove them on restart.


@coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8116e47 and b32c33e.

📒 Files selected for processing (9)
  • cmd/cloud-network-config-controller/main.go
  • pkg/cloudprovider/azure.go
  • pkg/cloudprovider/cloudprovider.go
  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go
  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller_test.go
  • pkg/controller/configmap/configmap_controller.go
  • pkg/controller/controller.go
  • pkg/controller/node/node_controller.go
  • pkg/controller/secret/secret_controller.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/cloudprovider/cloudprovider.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller_test.go
  • pkg/controller/secret/secret_controller.go
  • pkg/controller/node/node_controller.go
  • cmd/cloud-network-config-controller/main.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/controller/configmap/configmap_controller.go
  • pkg/controller/controller.go
  • pkg/cloudprovider/azure.go
  • pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go
🧬 Code graph analysis (1)
pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go (3)
pkg/cloudprovider/cloudprovider.go (1)
  • CloudProviderConfig (88-98)
pkg/controller/controller.go (1)
  • CloudNetworkConfigController (48-68)
pkg/cloudprovider/azure.go (2)
  • PlatformTypeAzure (38-38)
  • Azure (49-60)
🔇 Additional comments (9)
pkg/controller/configmap/configmap_controller.go (1)

106-108: LGTM!

The no-op InitialSync() implementation correctly satisfies the interface requirement. ConfigMapController doesn't need startup cleanup, so returning nil is appropriate.

pkg/cloudprovider/azure.go (3)

22-34: LGTM!

The new imports are required for the SyncLBBackend function and isCloudPrivateIPConfigAssigned helper.


225-226: LGTM!

The warning log clearly communicates the operational impact of not adding egress IPs to the LB backend pool. This aligns with the PR objective and helps operators understand the connectivity constraints.


683-690: LGTM!

The helper correctly identifies assigned CloudPrivateIPConfigs by checking the Assigned condition status, which is consistent with the condition handling elsewhere in the codebase.
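
A minimal sketch of such a condition check is shown below. The `condition` struct is a local stand-in with the same shape as a Kubernetes status condition, and `isAssigned` is a hypothetical name; the actual helper in azure.go may differ.

```go
package main

import "fmt"

// condition mirrors the shape of a Kubernetes status condition; it is a local
// stand-in, not metav1.Condition.
type condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// isAssigned reports whether the "Assigned" condition is present with
// status "True". A missing condition counts as not assigned.
func isAssigned(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Assigned" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	fmt.Println(isAssigned([]condition{{Type: "Assigned", Status: "True"}}))  // true
	fmt.Println(isAssigned([]condition{{Type: "Assigned", Status: "False"}})) // false
	fmt.Println(isAssigned(nil))                                              // false
}
```
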

pkg/controller/controller.go (2)

42-45: LGTM!

The InitialSync() interface method is well-documented and correctly positioned in the startup sequence. The comment clearly explains the intended use case.


104-108: LGTM!

Calling InitialSync() after cache sync but before worker startup is the correct placement for one-time cleanup operations that need full cluster state visibility. Failing startup on error is appropriate since incomplete cleanup could leave the cluster in an inconsistent state.
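
The startup ordering described here can be sketched as follows, under the assumption of a trimmed-down controller interface. The `controller` interface, `run`, and the two example controllers are hypothetical; only the ordering (cache sync, then InitialSync, then workers) and the fail-on-error behavior come from the review.

```go
package main

import (
	"errors"
	"fmt"
)

// controller is a hypothetical, trimmed-down version of the interface the
// review describes: InitialSync runs once, SyncHandler runs per work item.
type controller interface {
	InitialSync() error
	SyncHandler(key string) error
}

// run mirrors the ordering discussed above: wait for caches, run the one-time
// sync, and only then start workers. An InitialSync error aborts startup.
func run(c controller, waitForCaches func() bool, startWorkers func()) error {
	if !waitForCaches() {
		return errors.New("timed out waiting for caches to sync")
	}
	if err := c.InitialSync(); err != nil {
		return fmt.Errorf("initial sync failed: %w", err)
	}
	startWorkers()
	return nil
}

// noopController is the trivial case for controllers that need no startup
// cleanup, such as the ConfigMap controller mentioned in the review.
type noopController struct{}

func (noopController) InitialSync() error       { return nil }
func (noopController) SyncHandler(string) error { return nil }

// failingController simulates a cleanup that cannot complete.
type failingController struct{ noopController }

func (failingController) InitialSync() error { return errors.New("cleanup failed") }

func main() {
	started := false
	cachesReady := func() bool { return true }
	start := func() { started = true }

	err := run(failingController{}, cachesReady, start)
	fmt.Println(err, started) // workers never start when the initial sync fails

	err = run(noopController{}, cachesReady, start)
	fmt.Println(err, started)
}
```
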

pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go (3)

59-61: LGTM!

The optional initialSyncHook field provides a clean extension point for platform-specific cleanup without coupling the controller to cloud provider implementations.


80-86: LGTM!

The Azure-specific hook wiring is appropriately guarded by the platform type check. The type assertion is safe since PlatformTypeAzure guarantees the client is an *Azure instance. Capturing the listers in the closure ensures the hook has access to current cluster state during InitialSync().


131-140: LGTM!

The InitialSync() implementation correctly delegates to the optional hook and provides appropriate logging. The nil check ensures non-Azure platforms safely skip cleanup.
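
The optional-hook wiring reviewed in the three comments above could look roughly like this. Everything here is illustrative: `cloudClient`, `azureClient`, `otherClient`, `cpicController`, and the literal platform string "Azure" are assumptions, not the controller's real types or the value of PlatformTypeAzure.

```go
package main

import "fmt"

// cloudClient, azureClient and otherClient are hypothetical stand-ins for the
// cloud provider client types.
type cloudClient interface{ Platform() string }

type azureClient struct{ cleaned []string }

func (a *azureClient) Platform() string { return "Azure" }

// syncLBBackend records which IPs it was asked to clean up.
func (a *azureClient) syncLBBackend(assignedIPs []string) error {
	a.cleaned = append(a.cleaned, assignedIPs...)
	return nil
}

type otherClient struct{}

func (otherClient) Platform() string { return "AWS" }

type cpicController struct {
	initialSyncHook func() error // nil on platforms without startup cleanup
}

// newController wires the hook only on Azure. The closure captures
// listAssigned, so the hook reads cluster state when InitialSync runs,
// not when the controller is constructed.
func newController(platform string, client cloudClient, listAssigned func() []string) *cpicController {
	c := &cpicController{}
	if platform == "Azure" {
		if az, ok := client.(*azureClient); ok {
			c.initialSyncHook = func() error { return az.syncLBBackend(listAssigned()) }
		}
	}
	return c
}

// InitialSync delegates to the hook when one is set and is a no-op otherwise.
func (c *cpicController) InitialSync() error {
	if c.initialSyncHook == nil {
		return nil
	}
	return c.initialSyncHook()
}

func main() {
	az := &azureClient{}
	c := newController("Azure", az, func() []string { return []string{"10.0.0.10"} })
	fmt.Println(c.InitialSync(), az.cleaned)

	other := newController("AWS", otherClient{}, func() []string { return nil })
	fmt.Println(other.InitialSync(), other.initialSyncHook == nil)
}
```

The sketch uses a checked type assertion rather than relying on the platform check alone; that is a defensive choice for the example, not a claim about how the PR does it.
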

The consensus is to not add egress IPs to the public load balancer
backend pool, regardless of the presence of an OutBoundRule.
During upgrade, this PR lets the controller remove any egress IP
previously added to the public load balancer backend pool.

Signed-off-by: Arnab Ghosh <[email protected]>
@arghosh93
Contributor Author

/retest-required

@kyrtapz
Contributor

kyrtapz commented Dec 22, 2025

/lgtm
@arghosh93 please feel free to remove the hold once CI is green.

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Dec 22, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arghosh93, arkadeepsen, kyrtapz, pperiyasamy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [kyrtapz,pperiyasamy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@arghosh93
Contributor Author

/retest-required

@arghosh93
Contributor Author

/hold cancel

@openshift-ci bot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Dec 23, 2025
@arghosh93
Contributor Author

/retest-required

@arghosh93
Contributor Author

/retest-requied

@arghosh93
Contributor Author

/retest-required

@yingwang-0320

/verified by pre-merge testing

@openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) Dec 25, 2025
@openshift-ci-robot

@yingwang-0320: This PR has been marked as verified by pre-merge testing.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD dcdf24f and 2 for PR HEAD ba4e72b in total

@arghosh93
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Dec 26, 2025

@arghosh93: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-ovn-serial-e2e-only | c2b5065 | link | false | /test e2e-openstack-ovn-serial-e2e-only |
| ci/prow/e2e-aws-ovn-serial | c2b5065 | link | false | /test e2e-aws-ovn-serial |
| ci/prow/security | ba4e72b | link | false | /test security |
| ci/prow/okd-scos-images | ba4e72b | link | true | /test okd-scos-images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
