Network Stack
This article is about writing a TCP/IP stack, i.e. a subsystem which uses a link layer (e.g. an Ethernet card) to process packets of protocols such as IP, ARP, TCP and UDP.
Scanning the PCI devices
The first thing to do is to scan the PCI devices installed on the machine so you can detect an Ethernet card by looking at a specific vendor ID and device ID. See the PCI page for more details.
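A minimal sketch of such a scan, assuming the legacy PCI configuration mechanism (address port 0xCF8, data port 0xCFC). The names `pci_read32_fn`, `pci_config_address` and `pci_find_device` are hypothetical. On real hardware the callback would write the address to port 0xCF8 and read the result from port 0xCFC; here it is passed in so the scan logic stays independent of port I/O. For simplicity the loop probes all eight functions of every device instead of checking the multi-function bit first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: reads one 32-bit register of PCI configuration space. */
typedef uint32_t (*pci_read32_fn)(uint32_t config_address);

/* Builds the value to write to the configuration address port. */
uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset)
{
    return 0x80000000u                       /* enable bit */
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1F) << 11)
         | ((uint32_t)(func & 0x07) << 8)
         | (uint32_t)(offset & 0xFC);        /* registers are 32-bit aligned */
}

struct pci_location { uint8_t bus, dev, func; };

/* Scans every bus/device/function for the given vendor and device ID. */
bool pci_find_device(pci_read32_fn read32, uint16_t vendor, uint16_t device,
                     struct pci_location *out)
{
    for (unsigned bus = 0; bus < 256; bus++) {
        for (unsigned dev = 0; dev < 32; dev++) {
            for (unsigned func = 0; func < 8; func++) {
                uint32_t id = read32(pci_config_address((uint8_t)bus, (uint8_t)dev,
                                                        (uint8_t)func, 0));
                /* Register 0: vendor ID in the low half, device ID in the high half.
                   An absent function reads back as 0xFFFFFFFF, which matches no
                   real vendor ID. */
                if ((id & 0xFFFF) == vendor && (id >> 16) == device) {
                    out->bus = (uint8_t)bus;
                    out->dev = (uint8_t)dev;
                    out->func = (uint8_t)func;
                    return true;
                }
            }
        }
    }
    return false;
}
```

As far as I know, the E1000 model emulated by QEMU and VirtualBox is commonly reported as vendor 0x8086, device 0x100E, but check the documentation of your emulator before relying on those values.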
Writing a driver for your NIC
Once you have located the Ethernet card(s), you will need to implement a driver for it to be able to send and receive data. If you are using an emulator, a good card to write a driver for is the Intel E1000, as it is available on a variety of emulators such as VirtualBox and is thoroughly covered on osdev.org (see Intel Ethernet i217). If you have trouble implementing the E1000 driver, you can start with the RTL8139, an older Ethernet card that is much simpler.
The first thing to get out of the Ethernet card is the machine's MAC address. This 6-byte address is needed to exchange data on the local network.
The easiest test you can do is to send an ARP broadcast on the network. You can use Wireshark both to capture an example of a valid ARP request and to verify that your own request has been received by the target host. As for receiving data, depending on how its receive filter is configured (for example in promiscuous mode), your network card can capture data sent across the local network even if it is not addressed to your machine.
Networking protocols
Once you can send and receive data through your NIC and have your machine's MAC address, you will have to implement (at least partially) several networking protocols that coexist on top of each other:
- Ethernet: this is the basic protocol that sends data to another machine on your local network using your MAC address. This is the building block for all the rest as you need to send data to the router if you want to communicate with the outside world.
- ARP (Address Resolution Protocol): translates an IPv4 address into a MAC address
- IP (Internet Protocol): this sits on top of Ethernet and is required to send data on the Internet given an IP address. The most common version is IPv4, which uses 32-bit IP addresses, but IPv6 (which uses 128-bit IP addresses) is gaining some traction. Note that IP makes a "best effort" to send a packet, but does not guarantee that it will successfully reach its destination, nor that packets will be received in the order they were sent
- ICMP (Internet Control Message Protocol): used by tools such as ping or traceroute
- UDP (User Datagram Protocol): a connectionless transmission protocol that adds the notion of source and target ports to IP. Application services can subscribe to one or more port(s) to be notified if a UDP message is sent to that port
- DHCP (Dynamic Host Configuration Protocol): lets the machine request its network configuration, such as its IP address, the IP address of the local router, the DNS server, etc.
- DNS (Domain Name System): get the IP address for a given domain name
- TCP (Transmission Control Protocol): like UDP, it adds the notion of source and destination port. TCP is however more complex as it creates its own session mechanism and makes sure that the application using it will receive the packets in order, resending packets if need be.
- SSL/TLS (optional): if you want to use a secure connection
- HTTP (HyperText Transfer Protocol): defines a request and response mechanism to transfer web pages, images and other resources.
- Telnet: a protocol to remotely access a machine using a command line shell.
A tool of choice to help you will be Wireshark, a free network sniffer and analyzer. It is a great tool to understand how the various networking protocols are encoded, as it explains in great detail what each byte of a packet corresponds to. Note that on Windows, Wireshark does not capture loopback traffic (i.e. traffic from localhost to localhost), so it may not capture network traffic between an emulator and the host machine. You can however use Rawcap to capture the network traffic into a file and use Wireshark to examine it.
The network stack
Networking protocols are organized as a stack where each layer calls the next layer. A packet sent across the network will be composed of several headers, one for each layer involved.
Consider the example of a DHCP request. This is one of the protocols you might want to implement early on as it allows your machine to find its IP address, get the local router IP address, the DNS IP address - the basic information to be able to properly communicate across the network.
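To make the payload concrete, here is a hedged sketch of what a minimal DHCPDISCOVER message could look like. The function name `dhcp_build_discover` and the constant `DHCP_MESSAGE_SIZE` are assumptions made for illustration; the field offsets follow the BOOTP/DHCP message format (236 fixed bytes, a 4-byte magic cookie, then options). A real client would usually add more options, such as a parameter request list.

```c
#include <stdint.h>
#include <string.h>

#define DHCP_MESSAGE_SIZE 300   /* payload size used in this article; may vary */

/* Fills buf (DHCP_MESSAGE_SIZE bytes) with a minimal DHCPDISCOVER message. */
void dhcp_build_discover(uint8_t *buf, const uint8_t mac[6], uint32_t xid)
{
    memset(buf, 0, DHCP_MESSAGE_SIZE);
    buf[0] = 1;                        /* op: 1 = request from a client */
    buf[1] = 1;                        /* htype: Ethernet */
    buf[2] = 6;                        /* hlen: length of a MAC address */
    buf[4] = (uint8_t)(xid >> 24);     /* transaction id, big endian */
    buf[5] = (uint8_t)(xid >> 16);
    buf[6] = (uint8_t)(xid >> 8);
    buf[7] = (uint8_t)xid;
    buf[10] = 0x80;                    /* flags: ask for a broadcast reply */
    memcpy(&buf[28], mac, 6);          /* chaddr: client hardware address */
    buf[236] = 0x63;                   /* magic cookie, start of the options */
    buf[237] = 0x82;
    buf[238] = 0x53;
    buf[239] = 0x63;
    buf[240] = 53;                     /* option 53: DHCP message type */
    buf[241] = 1;                      /* option length */
    buf[242] = 1;                      /* 1 = DISCOVER */
    buf[243] = 255;                    /* end of options */
}
```

Getting these bytes onto the wire involves every layer below DHCP.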
One way to implement this is as follows:
- The Operating System decides to send a DHCP request, so calls the DHCP layer
- The DHCP layer asks the UDP layer to create a packet whose target is IP address 255.255.255.255 (broadcast to the whole local network), port 67, and whose payload size is 300 bytes (the length may vary)
- The UDP layer asks the IP layer to create a packet of type UDP to IP address 255.255.255.255, of size 308 bytes
- The IP layer asks the Ethernet layer to create a packet of type IPv4 of length 328 bytes whose target is IP address 255.255.255.255
- The Ethernet layer creates a packet of size 342 bytes, writes the Ethernet header in the first 14 bytes, including the source address (the machine's MAC address) and the destination MAC address FF:FF:FF:FF:FF:FF (translated from the IP address 255.255.255.255), and sends it back to the IP layer
- The IP layer writes the IP header in the 20 bytes after the Ethernet header and sends it to the UDP layer
- The UDP layer writes its header in the 8 bytes after the IP header and sends it to the DHCP layer
- The DHCP layer writes its request in the 300 bytes left and sends it back to the UDP layer
- The UDP layer completes its header by writing its checksum (which encompasses the DHCP message) and sends it to the IP layer
- The IP layer sends it to the Ethernet layer
- The Ethernet layer sends the packet to the Ethernet card, that sends the message across the network
The packet actually sent across the network will look like:
Part of the packet | Size |
---|---|
Ethernet header | 14 bytes |
IPv4 header | 20 bytes |
UDP header | 8 bytes |
DHCP request (payload) | 300 bytes |
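The layout above can be expressed as offsets into one contiguous buffer, so that each layer knows where to write its header. This is only a sketch: the names `udp_frame_layout` and `udp_frame_layout_for` are hypothetical, and it assumes an IPv4 header without options.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    ETH_HEADER_SIZE  = 14,
    IPV4_HEADER_SIZE = 20,   /* assumes no IP options */
    UDP_HEADER_SIZE  = 8
};

/* Where each layer's part starts inside one frame buffer. */
struct udp_frame_layout {
    size_t ip_offset;        /* start of the IPv4 header */
    size_t udp_offset;       /* start of the UDP header */
    size_t payload_offset;   /* start of the application payload */
    size_t total_size;       /* size of the whole Ethernet frame */
};

/* Computes the layout of a UDP-over-IPv4-over-Ethernet frame. */
struct udp_frame_layout udp_frame_layout_for(size_t payload_size)
{
    struct udp_frame_layout l;
    l.ip_offset      = ETH_HEADER_SIZE;
    l.udp_offset     = l.ip_offset + IPV4_HEADER_SIZE;
    l.payload_offset = l.udp_offset + UDP_HEADER_SIZE;
    l.total_size     = l.payload_offset + payload_size;
    return l;
}
```

For the 300-byte DHCP request, this gives a 342-byte frame with the payload starting at offset 42. Reserving the space for all headers up front lets every layer write into the same buffer instead of copying the packet at each layer boundary.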
The DHCP response will have the same format as the request, and should be processed as follows:
- The Ethernet card driver will verify that the target MAC is the current machine's (or the broadcast address) and, if so, send the packet to the Ethernet layer
- The Ethernet layer will look at the Ethernet header, check the EtherType (which should be IPv4) and send the packet (stripped of its Ethernet header) to the IP layer
- The IP layer will check the IP header, verify the checksum and, because its protocol field is UDP, forward the packet (without its IP header) to the UDP layer
- The UDP layer will check the UDP header, verify the checksum and, based on the destination port, send the payload to the right service - in this example the DHCP layer (once again stripping the UDP header)
- The DHCP layer will read the DHCP message, verify that the message type is Response (i.e. it is a response from the router) and retrieve its IP address, the router's IP address and other networking configuration information
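The receive path above amounts to peeling one header at a time and dispatching on a field of each. The following sketch walks an incoming frame down to the UDP destination port. The function name is hypothetical, checksum verification is deliberately left out to keep it short, and the field offsets are those of Ethernet II, IPv4 and UDP.

```c
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IPV4 0x0800
#define IP_PROTO_UDP   17

/* Reads a 16-bit big-endian value from a byte buffer. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Returns the UDP destination port of an Ethernet frame, or -1 if the
 * frame is not IPv4/UDP or is too short. Checksums are not verified here.
 */
int udp_dest_port_of_frame(const uint8_t *frame, size_t len)
{
    /* Ethernet layer: 14-byte header, EtherType at offset 12 */
    if (len < 14 || rd16(frame + 12) != ETHERTYPE_IPV4)
        return -1;

    /* IP layer: version and header length share the first byte */
    const uint8_t *ip = frame + 14;
    size_t ip_len = len - 14;
    if (ip_len < 20 || (ip[0] >> 4) != 4)
        return -1;
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;   /* header length in bytes */
    if (ihl < 20 || ip_len < ihl + 8 || ip[9] != IP_PROTO_UDP)
        return -1;

    /* UDP layer: destination port is the second 16-bit field */
    const uint8_t *udp = ip + ihl;
    return rd16(udp + 2);
}
```

A stack would then look the returned port up in its table of registered services (68 for a DHCP client) and hand over the bytes that follow the UDP header.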
Note that networking protocols are by definition asynchronous, i.e. you send a request on the network and you need to wait for its response. In particular, you have no way of predicting when a response will arrive, if at all. And because an incoming packet is handled by an interrupt handler, it could interrupt your code at any time.
Little and big endian
By convention, any message encoded on the Internet uses big endian (the most significant byte goes first). This is something to always keep in mind for people developing on Intel and AMD processors, as x86 processors encode numbers using little endian. As a result, you will often have to convert numbers. Here are two functions that swap the byte order of 16-bit and 32-bit integers:
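```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
uint16_t switch_endian16(uint16_t nb)
{
    return (uint16_t)((nb >> 8) | (nb << 8));
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t switch_endian32(uint32_t nb)
{
    return ((nb >> 24) & 0xff) |
           ((nb << 8) & 0xff0000) |
           ((nb >> 8) & 0xff00) |
           ((nb << 24) & 0xff000000);
}
```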
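An alternative to swapping whole integers is to read and write header fields one byte at a time. The helpers below are a sketch with hypothetical names; they give the same result on little-endian and big-endian hosts and also work on fields that are not aligned in memory.

```c
#include <stdint.h>

/* Reads a 16-bit big-endian (network order) value from p. */
uint16_t net_read16(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Reads a 32-bit big-endian (network order) value from p. */
uint32_t net_read32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Writes v at p in big-endian (network) order. */
void net_write16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

/* Writes v at p in big-endian (network) order. */
void net_write32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

For example, `net_write16(frame + 12, 0x0800)` stores the IPv4 EtherType in an Ethernet header regardless of the host's byte order.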
Checksums
Several networking protocols use a checksum to verify that the message was not accidentally altered during the transport. Without a valid checksum, the packet is likely to be ignored. The checksum is a 16-bit number computed as follows:
- Split the message to checksum into 16-bit chunks
- Add those chunks
- If the message has an odd number of bytes, the last byte should be counted as the higher byte (e.g. if the last byte is 0x42 then add 0x4200)
- If the sum does not fit in a 16-bit number (i.e. is greater than 0xFFFF), strip the top 16 bits and add them to the low 16 bits. Repeat the last step until you have a 16-bit sum
- Return the bitwise inverse (one's complement) of that sum
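The steps above can be sketched as follows. The function name `inet_checksum` is an assumption; the result is returned as a host-order number that must be stored in the header in big endian. The 32-bit accumulator is enough for any single packet, but would overflow on inputs larger than roughly 128 KiB.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes the 16-bit one's-complement checksum of len bytes. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    /* Add the message as 16-bit big-endian chunks */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* Odd length: the last byte counts as the high byte of a chunk */
    if (i < len)
        sum += (uint32_t)data[i] << 8;

    /* Fold the top 16 bits into the low 16 bits until the sum fits */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

To verify a received header, run the same function over the header with its checksum field left in place: a valid header yields 0.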
The IP checksum only covers its own header. The UDP and TCP checksums are a bit more complex as they include the UDP/TCP header, the payload (i.e. anything after the UDP/TCP header) as well as a "pseudo header" composed of the source and target IP addresses, the IP type (0x11 for UDP, 0x06 for TCP) and the UDP/TCP message length (starting with the UDP/TCP header).
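A possible way to fold the pseudo header into the computation is to build it in a small temporary buffer and sum it before the UDP/TCP segment. This is a sketch with hypothetical names (`sum_words`, `transport_checksum`); it assumes IPv4 and expects the checksum field inside the segment to be zero while the checksum is computed.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds len bytes to a running sum, as 16-bit big-endian chunks. */
static uint32_t sum_words(uint32_t sum, const uint8_t *data, size_t len)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                          /* odd length: pad with a zero byte */
        sum += (uint32_t)data[i] << 8;
    return sum;
}

/*
 * segment points at the UDP or TCP header, followed by the payload.
 * protocol is 0x11 for UDP and 0x06 for TCP.
 */
uint16_t transport_checksum(const uint8_t src_ip[4], const uint8_t dst_ip[4],
                            uint8_t protocol, const uint8_t *segment,
                            uint16_t segment_len)
{
    uint8_t pseudo[12];
    for (int i = 0; i < 4; i++) {
        pseudo[i]     = src_ip[i];
        pseudo[4 + i] = dst_ip[i];
    }
    pseudo[8]  = 0;                       /* always zero */
    pseudo[9]  = protocol;
    pseudo[10] = (uint8_t)(segment_len >> 8);
    pseudo[11] = (uint8_t)segment_len;

    uint32_t sum = sum_words(0, pseudo, sizeof pseudo);
    sum = sum_words(sum, segment, segment_len);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

One UDP-specific detail: a checksum field of 0 means "no checksum" in UDP over IPv4, so a computed value of 0 is transmitted as 0xFFFF instead.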
Properly computing the checksum can be tricky, but Wireshark can help you. For this, make sure that it is verifying the checksum (an option not enabled by default) by going to Edit / Preferences / Protocols, selecting the desired protocol (e.g. UDP, TCP, IPv4) and making sure that "Validate the checksum if possible" is checked. This way, Wireshark will tell you if the checksum is valid and, if not, what its value should be.
ARP
ARP will be one of the first protocols you will need to implement. Without it, you will not be able to communicate on your local network, let alone on the Internet. Fortunately, this is a simple protocol which only requires you to implement a few functions:
- Sending requests and processing replies: your OS will need to perform a request to convert an IP address into a MAC address, something which is required to even communicate with your local router. This implies not only sending a request packet but also processing the reply when it comes so your OS can update its ARP table
- Receiving requests and sending replies: your OS will also need to honor the requests sent its way (e.g. when someone asks for its MAC address). In particular, the local router will send ARP requests to your machine on a regular basis. If your machine fails to respond, the router will consider it down and will not forward any more traffic to it
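As an illustration of the request side, the sketch below fills a buffer with a complete ARP request frame for IPv4 over Ethernet. The function name `arp_build_request` is hypothetical. The frame is 42 bytes long; the minimum Ethernet frame size is larger, and this sketch assumes the NIC or driver pads short frames.

```c
#include <stdint.h>
#include <string.h>

#define ARP_FRAME_SIZE 42   /* 14-byte Ethernet header + 28-byte ARP packet */

/* Builds a broadcast "who has target_ip?" request in frame. */
void arp_build_request(uint8_t frame[ARP_FRAME_SIZE], const uint8_t my_mac[6],
                       const uint8_t my_ip[4], const uint8_t target_ip[4])
{
    memset(frame, 0xFF, 6);               /* Ethernet destination: broadcast */
    memcpy(frame + 6, my_mac, 6);         /* Ethernet source */
    frame[12] = 0x08; frame[13] = 0x06;   /* EtherType: ARP */

    uint8_t *arp = frame + 14;
    arp[0] = 0x00; arp[1] = 0x01;         /* hardware type: Ethernet */
    arp[2] = 0x08; arp[3] = 0x00;         /* protocol type: IPv4 */
    arp[4] = 6;                           /* hardware address length */
    arp[5] = 4;                           /* protocol address length */
    arp[6] = 0x00; arp[7] = 0x01;         /* operation: 1 = request, 2 = reply */
    memcpy(arp + 8, my_mac, 6);           /* sender MAC */
    memcpy(arp + 14, my_ip, 4);           /* sender IP */
    memset(arp + 18, 0, 6);               /* target MAC: not known yet */
    memcpy(arp + 24, target_ip, 4);       /* target IP */
}
```

A reply uses the same layout with the operation set to 2, the sender fields describing the replying machine and the target fields copied from the request's sender fields.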
TCP
TCP is one of the most complex networking protocols.
First of all, it creates a virtual connection between the client and the server. To achieve this, a TCP header contains multiple flags that will be used by both sides to communicate about the status of that connection: SYN (synchronize), ACK (acknowledge), PSH (Push), FIN (finish) and others.
On top of that, TCP compensates for the fact that IP does not guarantee that packets will be received in the order they were sent, or received at all. To do so, it keeps track of the amount of data actually sent, requires each side to regularly acknowledge the data it has received, and resends packets if need be. For this, a TCP header contains a sequence number and an acknowledgement number.
In the course of a TCP connection, both sides send each other some data, split across multiple packets. One way to measure where the communication stands is to send the position (in number of bytes) in that communication. The sequence number in a TCP packet is the position of the first byte of the current packet. Likewise, the acknowledgement number indicates the position (still in bytes) of the next byte one party expects the other party to send.
When either side receives a TCP packet with a sequence number S, an acknowledgement number A and a payload of size N, the next packet it sends should have the sequence number A (i.e. it is sending the data the other party expects) and the acknowledgement number S + N, or S + N + 1 if the received packet carries a SYN or FIN flag, since each of those flags consumes one sequence number.
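The rule can be written down in a few lines. This sketch uses hypothetical names and assumes that, apart from payload bytes, only the SYN and FIN flags advance the sequence number (a bare ACK does not). Unsigned 32-bit arithmetic provides the wrap-around that TCP sequence numbers need.

```c
#include <stdint.h>

#define TCP_FLAG_FIN 0x01
#define TCP_FLAG_SYN 0x02

struct tcp_numbers {
    uint32_t seq;   /* sequence number of the next segment we send */
    uint32_t ack;   /* acknowledgement number of that segment */
};

/* Given a received segment, computes the numbers for our next segment. */
struct tcp_numbers tcp_reply_numbers(uint32_t rx_seq, uint32_t rx_ack,
                                     uint32_t rx_payload_len, uint8_t rx_flags)
{
    struct tcp_numbers n;
    n.seq = rx_ack;                       /* send what the peer expects */
    n.ack = rx_seq + rx_payload_len;      /* wraps modulo 2^32 */
    if (rx_flags & (TCP_FLAG_SYN | TCP_FLAG_FIN))
        n.ack += 1;                       /* SYN and FIN count as one byte */
    return n;
}
```

For instance, after receiving a segment with S = 1, A = 1 and 77 bytes of payload, the reply carries sequence number 1 and acknowledgement number 78.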
Establishing a connection
A TCP connection is established with the following 3-way handshake:
- The client sends a SYN request to the server (i.e. a message with the SYN flag set)
- The server responds with a SYN+ACK packet (the standard also allows it to send an ACK and a SYN separately, though that rarely happens).
- The client sends an ACK response.
The sequence number used in the SYN packets is the initial sequence number; all further packets shall use sequence numbers that are increments of the initial sequence number. The sequence number can be reset by sending a new SYN packet with a new sequence number.
The initial SYN and SYN+ACK packets "may" also contain data to be sent to the application, but this is rarely used. The TCP specification states that this data shall not be delivered to the application until the connection is established (i.e. after the final ACK response packet is received).
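Seen from the client, the handshake is a small state machine. The sketch below is a simplified illustration with hypothetical names; it ignores retransmission timers, simultaneous open and data carried in the SYN packets, and only tracks the state and the two sequence counters.

```c
#include <stdint.h>

#define TCP_FLAG_SYN 0x02
#define TCP_FLAG_RST 0x04
#define TCP_FLAG_ACK 0x10

enum tcp_client_state { TCP_CLOSED, TCP_SYN_SENT, TCP_ESTABLISHED };

struct tcp_client {
    enum tcp_client_state state;
    uint32_t snd_next;   /* next sequence number we will send */
    uint32_t rcv_next;   /* next sequence number we expect from the server */
};

/* Step 1: returns the flags of the first segment, sent with seq = initial_seq. */
uint8_t tcp_client_connect(struct tcp_client *c, uint32_t initial_seq)
{
    c->state = TCP_SYN_SENT;
    c->snd_next = initial_seq + 1;   /* the SYN consumes one sequence number */
    c->rcv_next = 0;
    return TCP_FLAG_SYN;
}

/*
 * Steps 2 and 3: handles a segment from the server while connecting.
 * Returns the flags of the segment to send back, or 0 to send nothing.
 */
uint8_t tcp_client_on_segment(struct tcp_client *c, uint8_t flags,
                              uint32_t seq, uint32_t ack)
{
    if (c->state != TCP_SYN_SENT)
        return 0;
    if (flags & TCP_FLAG_RST) {      /* connection refused */
        c->state = TCP_CLOSED;
        return 0;
    }
    if ((flags & (TCP_FLAG_SYN | TCP_FLAG_ACK)) == (TCP_FLAG_SYN | TCP_FLAG_ACK)
        && ack == c->snd_next) {
        c->rcv_next = seq + 1;       /* the server's SYN consumes one number too */
        c->state = TCP_ESTABLISHED;
        return TCP_FLAG_ACK;         /* sent with seq = snd_next, ack = rcv_next */
    }
    return 0;                        /* anything else is ignored in this sketch */
}
```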
Transmitting data
To send data, either side can send a message with the PSH and ACK flags set, with the actual data after the TCP header. The other party must send an ACK message to acknowledge that it has received the packet; otherwise, the sender will send the packet again. This is where TCP/IP implementations differ: some wait longer than others before sending an ACK.
Closing the connection
A connection is terminated as follows:
- The side that wants to close the connection sends a packet with a FIN flag
- The other side sends a FIN, ACK message
- The first side sends an ACK message
As with the packets used to establish a connection, these packets do not contain any payload - just a TCP header.
An example
Let's look at an example of a TCP communication for an HTTP GET request:
| Source -> Destination | Destination -> Source | Comment |
|---|---|---|
| Flag: SYN, seq_nb=0, ack_nb=0 | | Beginning of the TCP handshake. The client is sending byte #0 (a SYN carries no payload but still counts as one byte of the communication) and has not received any data from the server yet |
| | Flags: SYN, ACK, seq_nb=0, ack_nb=1 | |
| Flag: ACK, seq_nb=1, ack_nb=1 | | The TCP handshake is completed, the communication can start |
| Flags: PSH, ACK, seq_nb=1, ack_nb=1, len=77 | | This is the HTTP GET request sent by the client. This is the first packet with an actual payload |
| | Flag: ACK, seq_nb=1, ack_nb=78 | The server acknowledges the HTTP request: it has successfully read up to byte #77, so expects the next communication to start at byte #78 |
| | Flags: PSH, ACK, seq_nb=1, ack_nb=78, len=1009 | This is the body of the HTML |
| Flag: ACK, seq_nb=78, ack_nb=1010 | | The client acknowledges the message sent by the server: it is sending byte #78, and has received up to byte #1009, so expects the next communication to start at byte #1010 |
| Flags: FIN, ACK, seq_nb=78, ack_nb=1010 | | The client terminates the TCP connection |
| | Flags: FIN, ACK, seq_nb=1010, ack_nb=79 | |
| Flag: ACK, seq_nb=79, ack_nb=1011 | | The end of the TCP communication |
What to focus on
The shape of the stack will vary with design decisions. These may include:
- whether a packet is passed between processing layers in one buffer or is copied to a new buffer when crossing a layer boundary;
- whether inbound and outbound frames are exchanged with the link layer by a dedicated thread, entirely within an interrupt handler, or in a loop in a single-threaded environment;
- whether frames (e.g. Ethernet frames) are processed immediately or queued;
- whether you want TCP support, just UDP, or maybe only IP support. TCP is the most complex part of the stack: in the lwIP implementation, half of the code is specific to TCP.
As an example, a stack might:
- have the NIC's API provide three functions: setting up the NIC, polling for a frame and sending a frame;
- exchange inbound and outbound frames with the NIC in one thread;
- demultiplex inbound frames from a reception queue in another thread.
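The reception queue in this example could be a fixed-size ring buffer with one producer (the thread or interrupt handler talking to the NIC) and one consumer (the demultiplexing thread). The sketch below uses hypothetical names and sizes. It leaves out the synchronization a real kernel needs, such as memory barriers or briefly disabling interrupts around the index updates.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_QUEUE_SLOTS 8      /* hypothetical depth; one slot always stays free */
#define RX_FRAME_MAX   1518   /* largest standard Ethernet frame */

struct rx_queue {
    uint8_t  frames[RX_QUEUE_SLOTS][RX_FRAME_MAX];
    uint16_t lengths[RX_QUEUE_SLOTS];
    unsigned head;            /* next slot to read (consumer side) */
    unsigned tail;            /* next slot to write (producer side) */
};

/* Producer: copies a frame into the queue; returns false if it is dropped. */
bool rx_queue_push(struct rx_queue *q, const uint8_t *frame, size_t len)
{
    unsigned next = (q->tail + 1) % RX_QUEUE_SLOTS;
    if (len > RX_FRAME_MAX || next == q->head)
        return false;         /* oversized frame or queue full */
    memcpy(q->frames[q->tail], frame, len);
    q->lengths[q->tail] = (uint16_t)len;
    q->tail = next;
    return true;
}

/* Consumer: copies the oldest frame into out (RX_FRAME_MAX bytes);
   returns its length, or 0 if the queue is empty. */
size_t rx_queue_pop(struct rx_queue *q, uint8_t *out)
{
    if (q->head == q->tail)
        return 0;
    size_t len = q->lengths[q->head];
    memcpy(out, q->frames[q->head], len);
    q->head = (q->head + 1) % RX_QUEUE_SLOTS;
    return len;
}
```

Dropping frames when the queue is full is acceptable at this level: IP only promises best-effort delivery, and TCP resends what was lost.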
General considerations
- When writing a stack over Ethernet, you may want to provide support for the ARP protocol and address resolution functions.
- For the sake of modularity, the station's IP would be better stored in a nic_info struct rather than as a global variable.
- You may want to use Wireshark or another packet sniffer to inspect the communication, and netcat to dump debugging data sent from your OS once you have UDP or TCP support. Also, arping is useful when debugging ARP code. You may code a trigger which, for example, reboots your system upon receipt of an ARP who-has for a chosen IP.
- You may use a dedicated Ethernet card on one computer, connected with a crossover cable to another computer (which runs your operating system), and use static IP addresses. Other options include testing under Bochs or QEMU after implementing drivers for the network devices they provide.
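As a sketch of the nic_info idea mentioned above, the structure below groups the per-interface configuration. Its fields and the helper `nic_next_hop` are assumptions made for illustration, with IPv4 addresses kept as host-order 32-bit numbers. The helper decides whose MAC address has to be resolved with ARP to reach a given destination.

```c
#include <stdint.h>

/* Hypothetical per-interface state; addresses are in host byte order. */
struct nic_info {
    uint8_t  mac[6];
    uint32_t ip;
    uint32_t netmask;
    uint32_t gateway;   /* IP address of the local router */
};

/* Returns the IP address whose MAC must be resolved to reach dst. */
uint32_t nic_next_hop(const struct nic_info *nic, uint32_t dst)
{
    if ((dst & nic->netmask) == (nic->ip & nic->netmask))
        return dst;            /* same subnet: deliver directly */
    return nic->gateway;       /* otherwise hand the packet to the router */
}
```

Keeping this state per interface also makes it straightforward to support more than one network card later on.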
See Also
External Links
- Details on an implementation of an embedded web server, with an overview of the hardware, TCP/IP suite, TCP/IP stack and its API.
- RFC 793 - Transmission Control Protocol
- TCP/IP Illustrated - A must have book for any type of networking, great reference book
A number of TCP/IP stacks come with documentation of their implementation; it makes a good read.