Writing your own TCP/IP stack may seem like a daunting task. Indeed, TCP has accumulated many specifications over its lifetime of more than thirty years. The core specification, however, is seemingly compact [1] - the important parts being TCP header parsing, the state machine, congestion control and retransmission timeout computation.
The most common layer 2 and layer 3 protocols, Ethernet and IP respectively, pale in comparison to TCP’s complexity. In this blog series, we will implement a minimal userspace TCP/IP stack for Linux.
The purpose of these posts and the resulting software is purely educational - to learn network and system programming at a deeper level.
- TUN/TAP devices
- Ethernet Frame Format
- Ethernet Frame Parsing
- Address Resolution Protocol
- Address Resolution Algorithm
TUN/TAP devices
To intercept low-level network traffic from the Linux kernel, we will use a Linux TAP device. In short, TUN and TAP devices are often used by networking userspace applications to manipulate L3 and L2 traffic, respectively. A popular example is tunneling, where a packet is wrapped inside the payload of another packet.
The advantage of TUN/TAP devices is that they’re easy to set up in a userspace program and they are already being used in a multitude of programs, such as OpenVPN.
As we want to build the networking stack from the layer 2 up, we need a TAP device. We instantiate it like so:
After this, the returned file descriptor fd can be used to read and write data to the virtual device’s Ethernet buffer.
IFF_NO_PI is crucial here; otherwise, we end up with unnecessary packet information prepended to the Ethernet frame. You can take a look at the kernel’s source code for the tun device driver and verify this yourself.
Ethernet Frame Format
The multitude of different Ethernet networking technologies are the backbone of connecting computers in Local Area Networks (LANs). As with all physical technology, the Ethernet standard has greatly evolved from its first version [2], published by Digital Equipment Corporation, Intel and Xerox in 1980.
The first version of Ethernet was slow by today’s standards, about 10 Mb/s, and it used half-duplex communication, meaning that you either sent or received data, but not both at the same time. This is why a Media Access Control (MAC) protocol had to be incorporated to organize the data flow. Even to this day, Carrier Sense Multiple Access with Collision Detection (CSMA/CD) is required as the MAC method when running an Ethernet interface in half-duplex mode.
The 100BASE-T Ethernet standard used twisted-pair wiring to enable full-duplex communication and higher throughput. Additionally, the simultaneous rise in popularity of Ethernet switches made CSMA/CD largely obsolete.
The different Ethernet standards are maintained by the IEEE 802.3 [3] working group.
Next, we’ll take a look at the Ethernet frame header. It can be declared as a C struct as follows:
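A minimal sketch of such a declaration, using the field names that the rest of this section discusses. The exact member types and the compile-time checks are assumptions added for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct eth_hdr {
    uint8_t  dmac[6];     /* destination MAC address */
    uint8_t  smac[6];     /* source MAC address */
    uint16_t ethertype;   /* payload type or length, in network byte order */
    uint8_t  payload[];   /* start of the payload (flexible array member) */
} __attribute__((packed));

/* The header proper is 14 octets; the flexible array member adds no size. */
static_assert(sizeof(struct eth_hdr) == 14, "unexpected eth_hdr size");
static_assert(offsetof(struct eth_hdr, ethertype) == 12,
              "unexpected ethertype offset");
```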
dmac and smac are pretty self-explanatory fields. They contain the MAC addresses of the communicating parties (destination and source, respectively).
The overloaded field, ethertype, is a 2-octet field that, depending on its value, indicates either the length or the type of the payload. Specifically, if the field’s value is greater than or equal to 1536, the field contains the type of the payload (e.g. IPv4, ARP). If the value is 1500 or less, it contains the length of the payload; the values in between are undefined.
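The type-versus-length rule can be sketched as two small predicates. The function and macro names are hypothetical; the thresholds are the ones from IEEE 802.3, where values up to 1500 are lengths and values from 1536 (0x0600) upwards are types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_MAX_LENGTH_VALUE 1500  /* largest value interpreted as a length */
#define ETH_MIN_TYPE_VALUE   1536  /* smallest value interpreted as a type */

static_assert(ETH_MIN_TYPE_VALUE == 0x0600, "1536 is 0x0600");

/* `field` is the 2-octet value already converted to host byte order. */
bool eth_field_is_type(uint16_t field)
{
    return field >= ETH_MIN_TYPE_VALUE;
}

bool eth_field_is_length(uint16_t field)
{
    return field <= ETH_MAX_LENGTH_VALUE;
}
```

For example, 0x0800 (IPv4) and 0x0806 (ARP) are both above the threshold and are therefore treated as types.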
After the type field, there is a possibility of several different tags for the Ethernet frame. These tags can be used to describe the Virtual LAN (VLAN) or the Quality of Service (QoS) type of the frame. Ethernet frame tags are excluded from our implementation, so the corresponding field also does not show up in our protocol declaration.
The payload field contains a pointer to the Ethernet frame’s payload. In our case, this will contain an ARP or IPv4 packet. If the payload is shorter than the minimum required 46 bytes (without tags), pad bytes are appended to the end of the payload to meet the requirement.
We also include the if_ether.h Linux header to provide a mapping between ethertypes and their hexadecimal values.
Lastly, the Ethernet Frame Format also includes the Frame Check Sequence field at the end, which is used with Cyclic Redundancy Check (CRC) to check the integrity of the frame. We will omit the handling of this field in our implementation.
Ethernet Frame Parsing
The attribute packed in a struct’s declaration is an implementation detail: it instructs the GNU C compiler not to optimize the struct memory layout for data alignment with padding bytes [4]. The use of this attribute stems purely from the way we are “parsing” the protocol buffer, which is just a type cast over the data buffer with the proper protocol struct:
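A sketch of that cast, assuming a packed eth_hdr struct with the fields discussed earlier. The helper names eth_hdr_from_buf and eth_type are hypothetical. Note that this style of type punning is common in network code, but it relies on the compiler tolerating it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>   /* ntohs */

struct eth_hdr {
    uint8_t  dmac[6];
    uint8_t  smac[6];
    uint16_t ethertype;   /* network byte order, exactly as on the wire */
    uint8_t  payload[];
} __attribute__((packed));

/* View the start of a raw frame buffer as an Ethernet header.
 * Nothing is copied: the returned pointer aliases `buf`. */
struct eth_hdr *eth_hdr_from_buf(uint8_t *buf)
{
    assert(buf != NULL);
    return (struct eth_hdr *)buf;
}

/* The ethertype converted to host byte order. */
uint16_t eth_type(const struct eth_hdr *hdr)
{
    return ntohs(hdr->ethertype);
}
```

Because the struct is packed, its alignment requirement is one byte, so the cast is valid for a buffer at any address.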
A portable, albeit slightly more laborious, approach would be to serialize the protocol data manually. This way, the compiler is free to add padding bytes to better conform to different processors’ data alignment requirements.
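As an illustration of the manual approach, the following sketch copies each field out of the buffer byte by byte, so the struct needs no packed attribute. The eth_info struct and the eth_parse function are hypothetical names, not part of the project:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14

/* Parsed view of a frame; the compiler may lay this out however it likes. */
struct eth_info {
    uint8_t        dmac[6];
    uint8_t        smac[6];
    uint16_t       ethertype;     /* host byte order */
    const uint8_t *payload;       /* points into the original buffer */
    size_t         payload_len;
};

/* Returns 0 on success, -1 if the buffer is too short to hold a header. */
int eth_parse(const uint8_t *buf, size_t len, struct eth_info *out)
{
    assert(buf != NULL && out != NULL);

    if (len < ETH_HDR_LEN)
        return -1;

    memcpy(out->dmac, buf, 6);
    memcpy(out->smac, buf + 6, 6);
    /* Assemble the big-endian 16-bit field without relying on host order. */
    out->ethertype = (uint16_t)((buf[12] << 8) | buf[13]);
    out->payload = buf + ETH_HDR_LEN;
    out->payload_len = len - ETH_HDR_LEN;
    return 0;
}
```

The cost is a few extra copies per frame; the benefit is that no assumption is made about struct layout or byte order.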
The overall scenario for parsing and handling incoming Ethernet frames is straightforward:
The handle_frame function just looks at the ethertype field of the Ethernet header and decides its next action based on the value.
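A sketch of such a dispatch, assuming the packed eth_hdr struct from earlier and the ETH_P_ARP and ETH_P_IP constants from if_ether.h. The signature and the frame_action return type are assumptions; a real handler would call into the ARP and IPv4 code rather than return a tag:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>        /* ntohs */
#include <linux/if_ether.h>   /* ETH_P_ARP, ETH_P_IP */

struct eth_hdr {
    uint8_t  dmac[6];
    uint8_t  smac[6];
    uint16_t ethertype;   /* network byte order */
    uint8_t  payload[];
} __attribute__((packed));

enum frame_action { FRAME_ARP, FRAME_IPV4, FRAME_DROP };

/* Decide what to do with a frame based solely on its ethertype. */
enum frame_action handle_frame(const struct eth_hdr *hdr)
{
    assert(hdr != NULL);

    switch (ntohs(hdr->ethertype)) {
    case ETH_P_ARP:
        return FRAME_ARP;    /* hand over to the ARP code */
    case ETH_P_IP:
        return FRAME_IPV4;   /* hand over to the IPv4 code */
    default:
        return FRAME_DROP;   /* unsupported payload type */
    }
}
```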
Address Resolution Protocol
The Address Resolution Protocol (ARP) is used for dynamically mapping a 48-bit Ethernet address (MAC address) to a protocol address (e.g. IPv4 address). The key here is that with ARP, a multitude of different L3 protocols can be used: not just IPv4, but also other protocols like CHAOS, which declares 16-bit protocol addresses.
The usual case is that you know the IP address of some service in your LAN, but to establish actual communications, the hardware address (MAC) also needs to be known. Hence, ARP is used to broadcast a query to the network, asking the owner of the IP address to report its hardware address.
The ARP packet format is relatively straightforward:
The ARP header (arp_hdr) contains the 2-octet hwtype, which determines the link layer type used. This is Ethernet in our case, and the actual value is 0x0001.
The 2-octet protype field indicates the protocol type. In our case, this is IPv4, which is communicated with the value 0x0800.
The hwsize and prosize fields are both 1-octet in size, and they contain the sizes of the hardware and protocol fields, respectively. In our case, these would be 6 bytes for MAC addresses, and 4 bytes for IP addresses.
The 2-octet field opcode declares the type of the ARP message. It can be ARP request (1), ARP reply (2), RARP request (3) or RARP reply (4).
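The header fields described above can be sketched as a packed C struct. The field names follow the text; the constant names and the arp_is_ethernet_ipv4 helper are hypothetical additions for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>   /* ntohs */

#define ARP_HW_ETHERNET 0x0001
#define ARP_PRO_IPV4    0x0800

enum arp_opcode {
    ARP_REQUEST  = 1,
    ARP_REPLY    = 2,
    RARP_REQUEST = 3,
    RARP_REPLY   = 4
};

struct arp_hdr {
    uint16_t hwtype;    /* link layer type, network byte order */
    uint16_t protype;   /* protocol type, network byte order */
    uint8_t  hwsize;    /* hardware address length in octets */
    uint8_t  prosize;   /* protocol address length in octets */
    uint16_t opcode;    /* message type, network byte order */
    uint8_t  data[];    /* protocol-specific payload */
} __attribute__((packed));

static_assert(sizeof(struct arp_hdr) == 8, "unexpected arp_hdr size");

/* True if the header describes IPv4-over-Ethernet address resolution. */
bool arp_is_ethernet_ipv4(const struct arp_hdr *hdr)
{
    return ntohs(hdr->hwtype) == ARP_HW_ETHERNET &&
           ntohs(hdr->protype) == ARP_PRO_IPV4 &&
           hdr->hwsize == 6 &&
           hdr->prosize == 4;
}
```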
The data field contains the actual payload of the ARP message, and in our case, this will contain IPv4-specific information:
The fields are pretty self-explanatory. smac and dmac contain the 6-byte MAC addresses of the sender and receiver, respectively. sip and dip contain the sender’s and receiver’s IP addresses, respectively.
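A sketch of this IPv4-specific payload as a packed struct. The struct name arp_ipv4 and the member types are assumptions; the field order (sender hardware, sender protocol, target hardware, target protocol) is the one defined for ARP messages:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct arp_ipv4 {
    uint8_t  smac[6];   /* sender MAC address */
    uint32_t sip;       /* sender IPv4 address, network byte order */
    uint8_t  dmac[6];   /* receiver MAC address */
    uint32_t dip;       /* receiver IPv4 address, network byte order */
} __attribute__((packed));

/* Without the packed attribute, sip would typically be aligned to offset 8
 * and the struct would no longer match the 20-octet wire layout. */
static_assert(sizeof(struct arp_ipv4) == 20, "unexpected arp_ipv4 size");
static_assert(offsetof(struct arp_ipv4, sip) == 6, "unexpected sip offset");
```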
Address Resolution Algorithm
The original specification depicts this simple algorithm for address resolution:
The translation table is used to store the results of ARP, so that hosts can just look up whether they already have the entry in their cache. This avoids spamming the network with redundant ARP requests.
The algorithm is implemented in arp.c.
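The following is not the project's arp.c, but an independent sketch of the packet reception steps from RFC 826, restricted to IPv4 over Ethernet. All names (arp_msg, arp_entry, arp_cache, arp_receive) are hypothetical, the message is assumed to be already converted to host byte order, and the fixed-size table with no expiry is a simplification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ARP_ETHERNET  0x0001
#define ARP_IPV4      0x0800
#define ARP_REQUEST   1
#define ARP_REPLY     2
#define ARP_CACHE_LEN 32

struct arp_msg {                 /* all fields in host byte order */
    uint16_t hwtype, protype, opcode;
    uint8_t  smac[6];
    uint32_t sip;
    uint8_t  dmac[6];
    uint32_t dip;
};

struct arp_entry {
    bool     used;
    uint16_t protype;
    uint32_t sip;
    uint8_t  smac[6];
};

struct arp_entry arp_cache[ARP_CACHE_LEN];   /* the translation table */

/* Refresh an existing <protype, sip> entry; returns true if one was found. */
bool arp_cache_update(const struct arp_msg *m)
{
    for (int i = 0; i < ARP_CACHE_LEN; i++) {
        struct arp_entry *e = &arp_cache[i];
        if (e->used && e->protype == m->protype && e->sip == m->sip) {
            memcpy(e->smac, m->smac, 6);
            return true;
        }
    }
    return false;
}

/* Store a new entry in the first free slot; false if the table is full. */
bool arp_cache_insert(const struct arp_msg *m)
{
    for (int i = 0; i < ARP_CACHE_LEN; i++) {
        struct arp_entry *e = &arp_cache[i];
        if (!e->used) {
            e->used = true;
            e->protype = m->protype;
            e->sip = m->sip;
            memcpy(e->smac, m->smac, 6);
            return true;
        }
    }
    return false;
}

/* Process a received message for a host that owns my_ip and my_mac.
 * Returns true if *m was rewritten into a reply that should be sent. */
bool arp_receive(struct arp_msg *m, uint32_t my_ip, const uint8_t my_mac[6])
{
    assert(m != NULL && my_mac != NULL);

    if (m->hwtype != ARP_ETHERNET || m->protype != ARP_IPV4)
        return false;                 /* unsupported hardware or protocol */

    bool merge = arp_cache_update(m); /* refresh a known sender first */

    if (m->dip != my_ip)
        return false;                 /* not addressed to us */
    if (!merge)
        arp_cache_insert(m);          /* learn the sender; a full table is ignored */

    if (m->opcode != ARP_REQUEST)     /* only now look at the opcode */
        return false;

    /* Swap sender and target, filling in our own addresses as the sender. */
    memcpy(m->dmac, m->smac, 6);
    m->dip = m->sip;
    memcpy(m->smac, my_mac, 6);
    m->sip = my_ip;
    m->opcode = ARP_REPLY;
    return true;
}
```

Note that the table is updated before the opcode is examined, so both requests and replies addressed to the host populate the cache.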
Finally, the ultimate test for an ARP implementation is to see whether it replies to ARP requests correctly:
The kernel’s networking stack recognized the ARP reply from our custom networking stack, and consequently populated its ARP cache with the entry of our virtual network device. Success!
The minimal implementation of Ethernet frame handling and ARP is relatively easy and can be done in a few lines of code. On the other hand, the reward is quite high, since you get to populate a Linux host’s ARP cache with your own make-believe Ethernet device!
The source code for the project can be found at GitHub.
In the next post, we’ll continue the implementation with ICMP echo & reply (ping) and IPv4 packet parsing.
Kudos to Xiaochen Wang, whose similar implementation proved invaluable for me in getting up to speed with C network programming and protocol handling. I find his source code [5] easy to understand, and some of my design choices were copied straight from his implementation.