Adding proposal for reliable state sync transport #107
I see a fundamental issue with trying to implement a reliable transport for real-time data. If something goes wrong in the path between peers, we cannot tell the customer VMs to pause opening connections; the transport window will fill, and all future connections will be dropped until the peer catches up (also, what happens to a live connection for which we receive a FIN?).
We are not transferring a piece of static data like a file, which must be exact to the last bit, but real-time data, which can be lost beyond the window size. The choice made here is to sacrifice future connections to keep the open ones synchronized, while significantly complicating the data plane implementation.
The basic concept is that state synchronization messages will be coalesced into larger UDP data packets, using a vendor-selected algorithm (such as Nagle’s algorithm). Sequence numbers will be inserted on all transmitted data packets, starting at 0 and incremented for each packet transmitted. Acknowledgement numbers will also be inserted on all transmitted packets. The acknowledgement number represents the number of received packets that have been consumed by the application. Acknowledgement numbers will generally piggyback on data packets that carry messages. However, in the absence of any data packets to transmit, keepalive control packets will be transmitted to convey acknowledgement information.
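A minimal sketch of the per-packet transport header this description implies (field names and widths are illustrative assumptions, not part of the proposal):

```c
#include <stdint.h>

/* Illustrative on-wire header carried in each state-sync UDP packet.
 * Widths and field names are assumptions for this sketch only.
 */
struct sync_pkt_hdr {
    uint32_t seq;       /* sequence number of this data packet, starting at 0  */
    uint32_t ack;       /* count of received packets consumed by the receiver  */
    uint16_t msg_count; /* number of coalesced state-sync messages that follow */
    uint16_t flags;     /* e.g. DATA vs. KEEPALIVE (keepalives carry ack only)  */
} __attribute__((packed));
```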
Due to the low/bounded latency of the network, a fixed window size will be used. The window size represents the maximum number of unacknowledged packets that will be buffered. The transmitter will buffer up to a window size worth of packets and then will stop accepting new messages from the application for transmission. When the application is flow controlled in this manner, it will not generate new state synchronization messages. Note: for the DASH data plane to not generate state synchronization messages, it must drop packets that cause connection state changes. Degraded connection setup and closure performance will occur during periods of state synchronization flow control. While this is an extreme circumstance that is not expected to occur in normal operation, this is a necessary behavior to protect the system from unrecoverable losses of state synchronization.
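A minimal sketch of the fixed-window back-pressure check described above, assuming hypothetical names (`TX_WINDOW`, `sync_tx_ready`) that are not part of the proposal:

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_WINDOW 256  /* fixed window: max unacknowledged packets (value is an assumption) */

struct tx_state {
    uint32_t next_seq;   /* sequence number of the next packet to send          */
    uint32_t last_acked; /* highest cumulative acknowledgement from the peer    */
};

/* Returns true if the transmitter may accept another state-sync message.
 * When false, the data plane must stop generating new state changes,
 * e.g. by dropping connection-creating packets such as SYNs.
 */
static bool sync_tx_ready(const struct tx_state *tx)
{
    return (tx->next_seq - tx->last_acked) < TX_WINDOW;
}
```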
"Note: for the DASH data plane to not generate state synchronization messages, it must drop packets that cause connection state changes." - this contradicts the requirement "If appliance receives a valid packet, it must not drop it due to flow replication delays"
The system will be engineered so that this does not happen. This is a fail-safe in the event of oversubscription of the DPU's CPS capacity or of the transport capacity for state synchronization.
Connection state synchronization messages are approximately 24B for IPv4 connections. To amortize transport overhead, multiple messages may be coalesced into a single packet. A reasonable estimate is that 60 IPv4 state update messages can be coalesced into a single 1500B packet. Since 1 million CPS (connections per second) will require 60 million state update messages per second, this equates to 1 MPPS of 1500B packets. In other words, every 1 MCPS requires 12 Gbps of bandwidth for state synchronization. Suppose a DPU has 200GE of Ethernet interfaces and processes 5 MCPS. When this DPU is paired with another DPU of the same capability, 60 Gbps of bidirectional bandwidth is required to synchronize the combined 10 MCPS of the two DPUs. Of course, DPUs capable of handling higher CPS loads will require proportionately more bandwidth for state synchronization.
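Restating the arithmetic above (and reading the 60 Gbps figure as the per-direction requirement for the 5 MCPS that each DPU contributes):

$$
\begin{aligned}
\frac{60 \times 10^{6}\ \text{msg/s}}{60\ \text{msg/packet}} &= 10^{6}\ \text{packets/s} \\
10^{6}\ \text{packets/s} \times 1500\ \text{B/packet} \times 8\ \text{bit/B} &= 12\ \text{Gbit/s per 1 MCPS} \\
5\ \text{MCPS} \times 12\ \text{Gbit/s per MCPS} &= 60\ \text{Gbit/s in each direction}
\end{aligned}
$$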
The channel for connection state synchronization will be in-band, using the same Ethernet interfaces as the main DASH data plane. The network topology between the two paired DPUs should have as few switch hops as possible. Typically, in a DASH deployment, there will be at most two switch hops between HA partners. To the extent possible, the network should be engineered to minimize dropping of state synchronization packets. To prevent dropping of state synchronization packets caused by network congestion, QoS in the DPUs and switches should be configured to provision dedicated priority buffers and queues for state synchronization packets. Network dropping of state synchronization packets should be very infrequent.
"The network topology between the two paired DPUs should have as few switch hops as possible. Typically, in a DASH deployment, there will be at most two switch hops between HA partners." - this is an assumption, and not necessarily common. In case of a permanent failure, another backup may be chosen as quickly as possible regardless of its location.
The IP-based protocol should not make any assumption about the network topology.
You make a good point about the new backup. I think the statement stands: "the network topology should have as few switch hops as possible". A preference should be given to a closer backup rather than a more remote one. The protocol can work over any topology. The "at most two" statement should probably be removed, but it is qualified with the word "typically". The configuration of window size and protocol timers should account for the maximum number of hops between peer DPUs. Ideally, priority is used for state synchronization packets.
## Operation
There are several factors that allow a purpose-built reliable UDP transport for state synchronization to be high performing, while also being simple to implement.
It is assumed that a TCP connection will be used as a control channel between the two paired DPUs. This control channel will be used for multiple purposes such as negotiation of capabilities and exchange of health information. This same control channel may also be used for configuration, opening, and closing of reliable UDP transport connections, eliminating the complexity of implementing these control functions within the transport protocol itself.
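As a purely hypothetical illustration of how the control channel could carry transport setup (none of these fields or names appear in the proposal):

```c
#include <stdint.h>

/* Hypothetical control-channel message, sent over the TCP connection,
 * to open a reliable UDP state-sync transport. Every field here is an
 * assumption for illustration only.
 */
struct sync_open_req {
    uint32_t peer_ip;        /* IPv4 address to which sync packets are sent    */
    uint16_t udp_port;       /* UDP port for the sync transport                */
    uint16_t window_size;    /* fixed window, in packets                       */
    uint32_t keepalive_usec; /* keepalive interval when no data is pending     */
    uint32_t nack_retx_limit;/* max nack re-transmits before declaring failure */
};
```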
How does the control plane know that the data channel is established?
There may be multiple possibilities. One might be to have control-plane-accessible counters that count data packets, nack packets, nack re-transmits, and keepalives. The transport also detects when it is broken (idle timeout and max nack re-transmits); these conditions can generate events to the control plane.
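A sketch of what such control-plane-visible counters might look like (the struct and field names are assumptions based on the list above):

```c
#include <stdint.h>

/* Per-transport counters a control plane could poll to confirm the data
 * channel is established and healthy; names are illustrative assumptions.
 */
struct sync_transport_stats {
    uint64_t data_pkts_tx;  /* coalesced data packets sent                    */
    uint64_t data_pkts_rx;  /* data packets received                          */
    uint64_t nack_pkts;     /* nack packets sent or received                  */
    uint64_t nack_retx;     /* re-transmissions triggered by nacks            */
    uint64_t keepalives;    /* keepalive (ack-only) packets                   */
    uint64_t idle_timeouts; /* transport-broken events: idle timeout          */
    uint64_t retx_exceeded; /* transport-broken events: max nack re-tx reached */
};
```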
"To prevent dropping of state synchronization packets caused by network congestion, QoS in the DPUs and switches should be configured to provision dedicated priority buffers and queues for state synchronization packets." - Why is this a "should"? This will steal the buffer from the switches and DPU. In case of failover, the DPU will need all the available buffer for customer packets, which will double, and no buffer for synchronization because the peer is down.
My understanding is that DPUs will be loaded at less than 100% of throughput when not in failover (perhaps 75%), but of course may be 100% loaded when failed over. Managing the DPU buffers can be a vendor implementation choice. The DPU and switches should not drop state synchronization messages due to congestion; if switches and DPUs must do congestion dropping, they should drop data packets. With QoS you should be able to dedicate a small buffer for priority packets, since they should never oversubscribe the links. It's not clear to me that in the failover situation you need every available buffer for data packets; more buffering may just mean sustained higher latency when the DPU is chronically unable to keep up with the load.
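One common way to obtain the dedicated priority treatment discussed here is to mark sync packets with a DSCP code point that the DPU and switch QoS maps to a priority queue. A minimal sketch using the standard socket API follows; the socket-based marking and the EF code point are illustrative assumptions, since the actual data plane will likely be hardware offloaded.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Mark state-sync packets so DPU/switch QoS can steer them into a
 * dedicated priority queue. DSCP 46 (EF) is an illustrative choice;
 * the actual code point would be a deployment decision.
 */
static int mark_sync_socket(int udp_fd)
{
    int tos = 46 << 2;  /* DSCP occupies the upper 6 bits of the TOS byte */
    return setsockopt(udp_fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}
```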
During the life of a typical TCP connection, state may be synchronized between paired DPUs up to six times. Long-lived connections may require additional periodic synchronization to ensure that a passive DPU will not inadvertently age out a connection while the connection is still active on the partner DPU.
What is the breakdown of the 24B synchronization message size? It is expected to differ for different use cases (e.g. SLB).
It is roughly SA (4B) + DA (4B) + SPORT (2B) + DPORT (2B) + VPORT (2B) + direction (1B) + connection state (6B = 2B of flags + 4B seq#). We came up with 21 bytes. We figure there will be at least different encodings for v4 and v6, and likely others, so there will need to be a type field and possibly a length field (for TLV). I rounded to 24B for v4; it might make sense to keep 4B alignment for messages. Note: the seq# in the message is used for sequence tracking for fast flow removal. It is not the reliable-transport sequence number.
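For illustration, the fields listed above could be laid out as a packed 24B structure like the following (the type/length fields, padding placement, and ordering are assumptions; only the rough field list and the 24B total come from the discussion above):

```c
#include <stdint.h>

/* Rough layout of a ~24B IPv4 connection-state message based on the
 * breakdown above; type/length and padding placement are assumptions.
 */
struct sync_msg_v4 {
    uint8_t  type;      /* message type, e.g. IPv4 state update (TLV)     */
    uint8_t  length;    /* message length in bytes, for TLV parsing       */
    uint8_t  direction; /* 1B direction                                   */
    uint8_t  pad;       /* keeps the message 4B-aligned                   */
    uint32_t sa;        /* 4B source address                              */
    uint32_t da;        /* 4B destination address                         */
    uint16_t sport;     /* 2B source port                                 */
    uint16_t dport;     /* 2B destination port                            */
    uint16_t vport;     /* 2B VPORT                                       */
    uint16_t flags;     /* 2B connection-state flags                      */
    uint32_t seq;       /* 4B seq# used for fast flow removal tracking,
                           not the reliable-transport sequence number     */
} __attribute__((packed));  /* total: 24 bytes */
```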
We already indirectly tell the customer VMs to pause opening new connections by limiting the VM's CPS rate. We enforce this by dropping SYN packets from the VM when the CPS rate is exceeded. The system must be engineered to limit the CPS of the VMs and to provision guaranteed bandwidth (or priority) and buffers in the network for state synchronization. The CPS limiting of VMs is designed to not oversubscribe the DPU's CPS capacity and to not oversubscribe the provisioned bandwidth for state synchronization. The purpose of reliability for state updates is to quickly recover from very infrequent drops in this engineered network. The transport capacity for state synchronization should also account for this small packet drop rate. We are not trying to synchronize state over the internet, but between two very localized DPUs.

In extremely rare cases (or due to misconfiguration) when the state synchronization transport is flow controlled, it will be necessary to limit state updates. Perhaps this can be accomplished by simply dropping SYN packets. Losing state synchronization messages for existing connections would be far more detrimental. Although we have not yet shown this publicly, we have done some internal analysis showing that dropped (and not re-transmitted) synchronization messages may lead to broken TCP connections, may allow active connections to inadvertently age out of a passive peer, or may significantly delay removal of connections that will now age out (in minutes) rather than be quickly removed when the connection is closed.

We should be able to show that with a reliable transport, each DPU can simply forward packets based on its local connection state and asynchronously send state update messages to the peer. Due to the locality of the peer DPU, the message to the peer will almost always win the race with the response (ack) packet arriving at the peer from the endpoint. Even in the very rare cases when the response (ack) is received at the peer DPU before the synchronization message, we can show that the system will still work (i.e., get into a reasonably good state).

I am not sure how the system can work if state synchronization is unreliable, except to send all packets that cause state changes to the peer first and have them returned before updating the local state and transmitting the packet to the endpoint. If packets are dropped along the way, the endpoints will re-transmit them. We can show the math, but this consumes a multiple of the bandwidth for state synchronization compared to implementing a reliable transport over an engineered network that is already mostly reliable.
The drawings are nice; however, we prefer editable .svg (see tools) for the longer term because it allows maintenance without having to archive source and image files together.
Thanks Chris. I will replace the diagrams with the correct format.