At this point we have a TCP/IP stack that is able to communicate to other hosts in the Internet. The implementation so far has been fairly straight-forward, but missing a major feature: Reliability.
Namely, our TCP does not guarantee the integrity of the data stream it presents to applications. Even establishing the connection can fail if the handshake packets are lost in transit.
Introducing reliability and control is our next main focus in creating a TCP/IP stack from scratch.
- Automatic Repeat Request
- TCP Retransmission
- Karn’s Algorithm
- Managing the RTO timer
- Requesting retransmission
- Trying it out
Automatic Repeat Request
The basis for many reliable protocols is the concept of Automatic Repeat reQuest (ARQ)1.
In ARQ, a receiver sends acknowledgments for data it has received, and the sender retransmits data it never received acknowledgements for.
As we have discussed, TCP keeps the sequence numbers of transmitted data in memory and responds with acknowledgments. The transmitted data is put into a retransmission queue and timers associated with the data are started. If no acknowledgment for the sequence of data is received before the timer runs out, a retransmission occurs.
As can be seen, TCP builds its reliability on the principles of ARQ. However, detailed implementations of ARQ are involved. Simple questions like “how long should the sender wait for an acknowledgement?” are tricky to answer, especially when maximum performance is desired. TCP extensions like Selective Acknowledgments (SACK)2 alleviate efficiency problems by acknowledging out-of-order data and avoiding unnecessary round-trips.
Retransmissions in TCP are described in the original specification3 as:
When the TCP transmits a segment containing data, it puts a copy on a retransmission queue and starts a timer; when the acknowledgment for that data is received, the segment is deleted from the queue. If the acknowledgment is not received before the timer runs out, the segment is retransmitted.
However, the original formula for calculating the retransmission timeout was deemed inadequate for different network environments. The current “standard method”4 was described by Jacobson5 and the latest codified specification can be found from RFC62986.
The basic algorithm is relatively straightforward. For a given TCP sender, state variables are defined for calculating the timeout:
srttis smoothed round-trip time for averaging the round-trip time (RTT) of a segment
rttvarholds the round-trip time variation
rtoeventually holds the retransmission timeout, e.g. in milliseconds
srtt acts as a low-pass filter for consecutive RTTs. Due to possible large variation in the RTT,
rttvar is used to detect those changes and prevent them from skewing the averaging function. Additionally, a clock granularity of
G seconds is assumed.
As described in RFC62986, the steps for computation are as follows:
- Before the first RTT measurement:
rto = 1000ms
- On the first RTT measurement R:
srtt = R rttvar = R/2 rto = srtt + max(G, 4*rttvar)
- On subsequent measurements:
alpha = 0.125 beta = 0.25 rttvar = (1 - beta) * rttvar + beta * abs(srtt - r) srtt = (1 - alpha) * srtt + alpha * r rto = srtt + max(g, 4*rttvar)
- After computing
rto, if it is less than 1 second, round it up to 1 second. A maximum amount can be provided but it has to be at least 60 seconds
The clock granularity of TCP implementations has traditionally been estimated to be fairly high, ranging from 500ms to 1 second. Modern systems like Linux, however, use a clock granularity of 1 millisecond4.
One thing to note is that the RTO is suggested to always be at least 1 second. This is to guard against spurious retransmissions, i.e. when a segment is retransmitted too soon, causing congestion in the network. In practice, many implementations go for sub-second rounding: Linux uses 200 milliseconds.
One mandatory algorithm that prevents the RTT measurement from giving false results is Karn’s Algorithm7. It simply states that the RTT samples should not be taken for retransmitted packets.
In other words, the TCP sender keeps track whether the segment it sent was a retransmission and skips the RTT routine for those acknowledgments. This makes sense, since otherwise the sender could not distinguish acknowledgements between the original and retransmitted segment.
When utilizing the timestamp TCP option, however, the RTT can be measured for every ACK segment. We will deal with the TCP Timestamp option in a separate blog post.
Managing the RTO timer
Managing the retransmission timer is relatively straightforward. RFC6298 recommends the following algorithm:
- When sending a data segment and the RTO timer is not running, activate it with the timeout value of
- When all outstanding data segments have been acknowledged, turn off the RTO timer
- When an ACK is received for new data, restart the RTO timer with the value of
And when the RTO timer expires:
- Retransmit the earliest unacknowledged segment
- Back off the RTO timer with a factor of 2, i.e. (
rto = rto * 2)
- Start the RTO timer
Additionally, when the backing off of the RTO value occurs and a subsequent measurement is successfully made, the RTO value can shrink drastically. The TCP implementation may clear
rttvar when backing off and waiting for an acknowledgment6.
A TCP is usually not only relying on the TCP sender’s timers to fix lost packets. The receiving side can also inform the sender that segments need to be retransmitted.
Duplicate acknowledgment is an algorithm where out-of-sequence segments are acknowledged, but by the sequence number of the latest in-order segment. After three duplicate acknowledgments, the TCP sender should realize that it needs to retransmit the segment that was advertised by the duplicate acknowledgments.
Furthermore, Selective Acknowledgment (SACK) is a more sophisticated version of the duplicate acknowledgment. It is a TCP option where the receiver is able to encode the received sequences into its acknowledgements. Then the sender immediately notices any lost segments and resends them. We will discuss the SACK TCP option in a later blog post.
Trying it out
Now that we’ve gone through the concepts and general algorithms, let’s see how TCP retransmissions look like on the wire.
Let’s change the firewall rules to drop packets after the connection is established and try to fetch the front page of HN:
Observing the connection traffic we see that the
HTTP GET is retransmitted with roughly a doubling interval:
Verifying the retransmission back-off and the receiver going silent is easy, but what about the scenario when retransmissions are triggered only for some segments? For optimal performance, the RTO algorithm needs to ‘bounce back’ when the connection is detected as healthy.
Let’s set the firewall rule to only block the 6th packet for 6000 bytes:
Now, if we try to send some data, our TCP has to recognize the communication blackout and when it ends. Let’s send 6009 bytes:
Let’s step through the connection phases, see when retransmissions are triggered and how the RTO value changes. Below is a modified
tcpdump output with inline comments about our TCP socket’s internal state:
Retransmissions in TCP are an essential part of a robust implementation. A TCP has to survive and be performant in changing network environments, where for example the latencies can suddenly spike or the network path is blocked for a moment.
Next time, we will take a look at TCP Congestion Control for achieving maximum performance without degrading the network’s health.
I’d be happy if you try the project out and give feedback. See Getting Started on how to use it with cURL and Firefox!