Redundant VPC routers stuck in BACKUP; cannot add default route, interface remains down

### problem

We have a new CloudStack 4.20.0.0 environment with redundant VPC routers, but they never transition to MASTER. Instead:

1. Each VR tries to bring up the public interface (eth1) and add a default route, e.g.:

```
ip route add default via x.x.x.x table Table_eth1 proto static
```

…but this fails with exit code 2 (“Nexthop has invalid gateway”). We believe it fails because eth1 is still in the DOWN state at that point, so the gateway is not reachable through any connected route.

2. The VR script then tears eth1 down, inserts a “throw x.x.x.0/27” route in Table_eth1, and marks the router as BACKUP or FAULT.

3. Keepalived never starts because the script believes routing is broken. Thus no VRRP negotiation occurs, and no router becomes MASTER.

We can manually bring eth1 up (`ip link set eth1 up`) and add a default route to the main or custom table, and it works fine. However, CloudStack’s scripts immediately revert the interface to DOWN again and keep the router in BACKUP.
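To confirm the same state on another VR, the check can be sketched roughly as follows. This is a minimal diagnostic sketch and not part of CloudStack: the helper names `link_is_admin_up` and `diagnose_public_nic` are hypothetical, and `eth1` / `Table_eth1` are simply the names from our setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of CloudStack): succeeds if one line of
# `ip -o link show` output lists UP among the interface flags.
link_is_admin_up() {
  flags=${1#*"<"}        # drop everything up to the first '<'
  flags=${flags%%">"*}   # drop everything from the first '>' onward
  case ",$flags," in
    *,UP,*) return 0 ;;
  esac
  return 1
}

# Hypothetical wrapper: prints link state, policy rules and the
# per-interface routing table for the given NIC (default: eth1).
diagnose_public_nic() {
  iface=${1:-eth1}
  line=$(ip -o link show dev "$iface") || return 1
  if link_is_admin_up "$line"; then
    echo "$iface: administratively UP"
  else
    echo "$iface: DOWN, so a 'default via' route on it is likely to be rejected"
  fi
  ip rule show
  ip route show table "Table_$iface"
}
```

Running `diagnose_public_nic eth1` on the VR right after a failed configuration pass should show whether the link was down when the route was attempted.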

**Key details:**

- VR logs show repeated attempts to configure the default route via x.x.x.x inside Table_eth1, followed by `throw x.x.x.0/27`.
- Even if we remove the throw route, the script tries to add a route while eth1 is still down, fails, and resets to BACKUP.
- Because of this cycle, we never see `/etc/keepalived/keepalived.conf` generated or keepalived started.
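The details above can be checked with a small filter over the routing table output. This is only a sketch: `summarize_table` is a hypothetical name, and it assumes the usual `ip route show` line format where the first word is either `default`, `throw`, or a prefix.

```shell
#!/bin/sh
# Hypothetical helper: reads `ip route show table ...` output on stdin and
# reports whether a default route and/or a throw route is present.
summarize_table() {
  has_default=no
  has_throw=no
  while read -r first rest; do
    case "$first" in
      default) has_default=yes ;;
      throw)   has_throw=yes ;;
    esac
  done
  echo "default=$has_default throw=$has_throw"
}
```

On the VR this could be used as `ip route show table Table_eth1 | summarize_table`, followed by `[ -f /etc/keepalived/keepalived.conf ] || echo "keepalived.conf missing"`. In our case we would expect `default=no throw=yes` and a missing config file.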

### versions

- Apache CloudStack: 4.20.0.0
- System VM template: Debian GNU/Linux 12
- Hypervisor: KVM
- Networking: Advanced networking with VLAN trunking, rp_filter disabled

We modified the systemvm template to add a static route that our setup needs. To do this, we added `/etc/network/if-up.d/91-add-route`:

```
#!/bin/sh
#
# /etc/network/if-up.d/91-add-route
#
# This script is automatically invoked by ifup each time
# an interface is brought up. The environment variable $IFACE
# contains the interface name (e.g., eth0, ens3, etc.).

[ "$IFACE" = "lo" ] && exit 0

# Gather *all* IPv4 addresses (CIDR format) on this interface
IP_CIDR_LIST=$(ip -o -4 addr show dev "$IFACE" | awk '{print $4}')
[ -z "$IP_CIDR_LIST" ] && exit 0  # no IPv4 addresses on $IFACE, so exit

# Loop through each IPv4 address on this interface
for IP_CIDR in $IP_CIDR_LIST
do
  # Extract the actual IP address (without /mask)
  IP_ADDR=$(echo "$IP_CIDR" | cut -d '/' -f 1)

  # Check if IP is in x.x.x.x/27
  if echo "$IP_ADDR" | grep -Eq '^-redacted-$'; then
    echo "Interface $IFACE has IP $IP_ADDR in x.x.x.x/27; adding route..."
    ip route add x.x.x.x/27 dev "$IFACE" scope link src "$IP_ADDR" 2>/dev/null || true

    # Once we've added the route for the first matching IP, we're done.
    exit 0
  fi
done

exit 0
```

We do not believe this is related to the issue.
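Since the address pattern in the hook is redacted, here is a rough sketch of an equivalent subnet membership test done arithmetically instead of with a regex. The function names `ip_to_int` and `in_cidr` are hypothetical, and 192.0.2.0/27 is a documentation range used as a stand-in for our real subnet.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted-quad IPv4 address to a 32-bit integer.
# Assumes four plain decimal octets without leading zeros.
ip_to_int() {
  echo "$1" | {
    IFS=. read -r a b c d
    echo $(( ($a << 24) + ($b << 16) + ($c << 8) + $d ))
  }
}

# Hypothetical helper: in_cidr IP NETWORK PREFIXLEN
# Succeeds if IP falls inside NETWORK/PREFIXLEN.
in_cidr() {
  ip_int=$(ip_to_int "$1")
  net_int=$(ip_to_int "$2")
  mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
  [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}
```

In the hook, the `grep -Eq` test could then be replaced by something like `if in_cidr "$IP_ADDR" 192.0.2.0 27; then ...`, with the real network substituted.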

### The steps to reproduce the bug

1. Install or upgrade to CloudStack 4.20.0.0 with advanced networking.
2. Create a VPC offering that uses redundant VR.
3. Deploy a VPC that picks up two VRs.
4. Observe in /var/log/cloud.log (and the VR’s cloud.log) that each router fails to add its default route via x.x.x.x, then tears down eth1 and remains BACKUP/FAULT indefinitely.

### What to do about it?

Ideally, the VR script should:

1. Ensure eth1 is brought up before adding the default route in the policy routing table (Table_eth1).
2. Avoid placing a “throw” route for x.x.x.0/27 on the router that’s intended to be MASTER.
3. Generate and start keepalived once the router is designated MASTER (or “PRIMARY” per the cmdline), so it can finalize the interface config instead of reverting to BACKUP.
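The ordering asked for in item 1 could look roughly like the sketch below. This is not CloudStack's actual code, only an illustration of the sequence we would expect. The function name `add_default_route_when_up` and the `IP` override variable are hypothetical, and the sketch assumes the public address is already configured on the interface.

```shell
#!/bin/sh
# Hypothetical sketch of the requested ordering; not CloudStack's code.
# IP can be overridden (e.g. IP=echo) to print the commands instead of
# running them.
add_default_route_when_up() {
  iface=$1
  gw=$2
  table=$3
  ip_cmd=${IP:-ip}
  # 1. Bring the link up first, so that (assuming the address is already
  #    configured) the connected route to the gateway exists.
  "$ip_cmd" link set dev "$iface" up || return 1
  # 2. Only then install the default route in the per-interface table.
  "$ip_cmd" route replace default via "$gw" dev "$iface" table "$table" proto static
}
```

For a dry run, `IP=echo add_default_route_when_up eth1 192.0.2.1 Table_eth1` prints the two commands in order. The gateway 192.0.2.1 is a placeholder from the documentation range.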

If you need more logs or specifics, we can provide full VR logs and examples of the failing `ip route` commands. Let us know if you have any questions or potential workarounds. Thanks!

Redundant VPC routers stuck in BACKUP; cannot add default route, interface remains down #10281
