curl-the-site

How curl http://site.com works, in microscopic detail

Background

This was an interview question for Infrastructure Engineer at GitHub. Even though I didn't get the position, I held onto this as it's one of my best technical pieces—a lot of deep details about Unix systems and networking, presented with a conversational tone. This is a piece for an advanced audience with a background in Unix systems.


Question


A user on an Ubuntu machine runs curl http://github.com. Describe the lifecycle of the curl process and explain what happens in the kernel, over the network, and on github.com's servers before the command completes.



Answer

First, the user's shell reads the line of input from its standard input, parses it, and searches the directories listed in its PATH environment variable to locate the curl command at /usr/bin/curl. The shell then invokes the fork() system call. The newly forked child process sets up its environment and closes file descriptors that should not be passed to curl, and then calls some variant of the exec syscall, passing "/usr/bin/curl" as the path to the executable and a pointer to an array of argument strings: by convention argv[0] is the program name, curl, followed by http://github.com.
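The locate, fork, exec, wait sequence can be sketched in a few lines of Python, whose os module wraps the same system calls. This is a minimal sketch, not how any particular shell is implemented; the function name run_like_a_shell is made up for illustration, and real shells also handle job control, redirections, and signals.

```python
import os
import shutil
import sys


def run_like_a_shell(argv):
    """Locate argv[0] on PATH, fork, exec in the child, wait in the parent."""
    path = shutil.which(argv[0])  # PATH search, roughly what the shell does
    if path is None:
        return 127  # conventional "command not found" status

    sys.stdout.flush()  # avoid duplicating buffered output in the child
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the target program.
        try:
            os.execv(path, argv)  # argv[0] is the program name by convention
        finally:
            os._exit(127)  # only reached if exec failed
    # Parent: block until the child exits, then decode its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)  # shell convention for death by signal
```

Calling run_like_a_shell(["curl", "http://github.com"]) would mirror the command in the question, returning curl's exit status to the caller.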


The kernel locates the curl executable through the VFS layer, possibly reading from the in-memory page cache or calling into a lower-level disk driver to retrieve the blocks from disk. No new process table entry is needed, since fork() already created one; exec instead replaces the child's memory image. The kernel reads the executable and arranges it in memory according to its object file format, which on Linux is ELF. Assuming no errors, the process, now carrying curl's image, stays schedulable at the default priority level.
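To make the ELF step a little more concrete, here is a small sketch that decodes the first bytes of an ELF file, the e_ident array, which is how a loader recognizes the format before doing anything else. The function names are illustrative, and the kernel's real loader goes on to read program headers and map segments, which this does not attempt.

```python
ELF_MAGIC = b"\x7fELF"


def parse_elf_ident(header):
    """Decode the magic, word size, and byte order from an ELF e_ident prefix."""
    if len(header) < 6 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}.get(header[4])                  # EI_CLASS
    byte_order = {1: "little", 2: "big"}.get(header[5])   # EI_DATA
    return {"bits": bits, "byte_order": byte_order}


def describe_executable(path):
    """Read the 16-byte e_ident array from a file on disk and decode it."""
    with open(path, "rb") as f:
        return parse_elf_ident(f.read(16))
```

On a typical 64-bit Ubuntu install, describe_executable("/usr/bin/curl") would be expected to report a 64-bit little-endian file, though that depends on the machine.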


When the curl process next receives a timeslice from the kernel, the dynamic linker and the glibc startup code run and further initialize the process in memory, loading and mapping shared libraries and updating the in-memory symbol table. glibc also reads in system locale information. After initialization, glibc calls curl's main() routine.
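Symbol resolution is easier to picture with a loose illustration. The sketch below uses Python's ctypes, which calls dlopen and dlsym, to look up a libc function by name in the running process. It is not what happens at curl's startup, only the same kind of name-to-address lookup done on demand, and it assumes a Linux system with glibc.

```python
import ctypes
import os

# dlopen(NULL): a handle covering the running program and its shared libraries.
process = ctypes.CDLL(None)

# Attribute access performs a dlsym-style lookup of the symbol "getpid".
getpid = process.getpid
getpid.restype = ctypes.c_int


def pid_via_libc():
    """Call libc's getpid() through the dynamically resolved symbol."""
    return getpid()
```

The value returned by pid_via_libc() should match os.getpid(), since both end up in the same system call.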


At last, the program the user wanted is running!


curl parses the URL that the shell passed in the argv parameter to exec. It may use the getservbyname standard library function to look up the port for http, which is 80. It will use getaddrinfo (or, in older code, gethostbyname) to resolve the name "github.com" to an IP address, which may itself involve a DNS query over the network, and then carefully packs both the IP address and port into a sockaddr structure.
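As a rough sketch of this step, the following splits a URL into host and port and resolves the host to socket addresses. The DEFAULT_PORTS table is an assumption standing in for a services lookup, and the function names are illustrative; curl's own URL parser and resolver are considerably more involved.

```python
import socket
from urllib.parse import urlsplit

# Fallback port table; an assumption of this sketch, not curl's actual data.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_target(url):
    """Return (hostname, port), falling back to the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def resolve(host, port):
    """Resolve host to a list of IPv4 (address, port) pairs for TCP."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr).
    return [sockaddr for _family, _type, _proto, _canon, sockaddr in infos]
```

For the command in the question, split_target("http://github.com") yields the host github.com and port 80, and resolve would then ask the system resolver for GitHub's current addresses.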


curl will then invoke the socket syscall to create a network socket in the PF_INET family, of type SOCK_STREAM because HTTP runs over TCP. If the call succeeds, it returns a socket descriptor to curl, which then passes that descriptor to the connect system call, along with the previously packed sockaddr structure, to establish a connection to github.com.
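The socket and connect calls map directly onto Python's socket module. A minimal sketch, assuming an IPv4 address has already been resolved; the open_tcp name and the five-second timeout are choices made for this example.

```python
import socket


def open_tcp(sockaddr, timeout=5.0):
    """Create a TCP socket and connect it to an (address, port) pair."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket(2)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)  # connect(2): the kernel runs the TCP handshake
    except OSError:
        sock.close()  # do not leak the descriptor on failure
        raise
    return sock
```

connect only returns once the kernel has completed the handshake (or failed), so everything described in the next few paragraphs happens inside that single call.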


The kernel now hands things to the network subsystem, which looks up the route to GitHub's IP and chooses the best interface on the user's machine to contact GitHub. After some housekeeping, the system eventually ends up writing a complete TCP/IP packet to the network: SYN from <user's IP>:<randomized source port>, destined for <github's IP>:80. The lower-layer network protocol, such as Ethernet or 802.11, handles delivery of this packet to the user's "next hop," likely a small WiFi router or similar. This router uses its own routing information to forward the packet upstream, and so on, until the packet is eventually delivered to GitHub's server by the destination router, likely a high-end gigabit Ethernet router in a colocation facility.
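Route lookup boils down to longest-prefix matching: among the routes whose network contains the destination, the most specific one wins. Here is a toy sketch of that rule. The routing table, interface names, and gateway address are all hypothetical, and a real kernel uses far more efficient data structures plus metrics and policy rules.

```python
import ipaddress

# Hypothetical routing table: (destination network, gateway, interface).
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1", "wlan0"),  # default route via the WiFi router
    ("192.168.1.0/24", None, "wlan0"),      # directly attached LAN
    ("10.8.0.0/16", None, "tun0"),          # e.g. a VPN tunnel
]


def pick_route(dest, routes=ROUTES):
    """Return (gateway, interface) for the most specific matching route."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for net, gateway, dev in routes:
        network = ipaddress.ip_network(net)
        if addr in network:
            matches.append((network.prefixlen, gateway, dev))
    if not matches:
        raise LookupError(f"no route to {dest}")
    # Longest prefix wins: /24 beats /0 for an address on the LAN.
    _, gateway, dev = max(matches, key=lambda m: m[0])
    return gateway, dev
```

With this table, an Internet address falls through to the default route, so the packet is handed to the WiFi router as the next hop.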


The kernel on GitHub's server recognizes that it has a process listening on port 80 (one that has called bind() and then listen()), and so continues the TCP handshake by replying with another TCP packet: SYN+ACK, destination <user's IP>:<user's source port>, source <server's IP>:80, with a randomly chosen initial TCP sequence number. Reverse the sequence described above to get the packet back to the user, but hold on to your hats, we're going to reverse it again!
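The sequence and acknowledgment numbers in the handshake follow a simple rule: each side acknowledges the other's initial sequence number plus one. A small sketch of the three segments, including the client's final ACK. The initial sequence numbers are drawn at random here for simplicity; real kernels derive them more carefully, and the dictionary layout is purely illustrative.

```python
import secrets

SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32 bits and wrap around


def three_way_handshake(client_isn=None, server_isn=None):
    """Return the three handshake segments as simple dictionaries."""
    x = secrets.randbelow(SEQ_SPACE) if client_isn is None else client_isn
    y = secrets.randbelow(SEQ_SPACE) if server_isn is None else server_isn
    return [
        {"from": "client", "flags": "SYN", "seq": x, "ack": None},
        {"from": "server", "flags": "SYN+ACK", "seq": y,
         "ack": (x + 1) % SEQ_SPACE},
        {"from": "client", "flags": "ACK", "seq": (x + 1) % SEQ_SPACE,
         "ack": (y + 1) % SEQ_SPACE},
    ]
```

For example, with a client initial sequence number of 100 and a server one of 5000, the SYN+ACK acknowledges 101 and the final ACK acknowledges 5001.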


The user's kernel replies ACK back to the server: handshake completed, let's exchange some packets! In user-space on GitHub's server, an accept() system call returns the newly established connection, which is either put into an IO event pool to be polled or handed to a forked-off child process, depending on the concurrency model of the receiving process.
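On the server side, the bind, listen, accept sequence looks roughly like this. It is a deliberately tiny sketch that handles one connection inline, standing in for whichever concurrency model the real web server uses; the function names, the loopback address, and the reply bytes are all assumptions of the example.

```python
import socket


def make_listener(host="127.0.0.1", port=0, backlog=16):
    """Create a listening TCP socket; port 0 asks the kernel for a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(backlog)  # from here on the kernel completes handshakes
    return sock


def serve_once(listener, reply=b"hello\n"):
    """Accept one established connection, send a reply, and close it."""
    conn, peer = listener.accept()  # returns a connection the kernel set up
    with conn:
        conn.sendall(reply)
    return peer
```

A production server would instead register the accepted socket with an event loop or hand it to a worker, as described above, rather than serving it inline.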


Back to the user's PC for the exciting conclusion!


After checking for errors, curl uses the write system call to send something like GET / HTTP/1.1\r\nHost: github.com\r\nAccept: */*\r\n\r\n to the socket. The kernel translates this into a packet, or sequence of packets, which does the "traverse the Internet" jig and is eventually received by the kernel on GitHub's server, being reassembled there if needed. Assuming no connection errors or timeouts, the web server process recognizes the double \r\n sequence as the end of the HTTP request headers and sends its response: HTTP/1.1 301 Moved Permanently\r\nLocation: https://github.com\r\n… and closes the connection.
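The request itself is just bytes with a fixed framing. The sketch below builds a minimal request and shows the check a server can use to tell that the header block has ended. It sends Accept: */*, the wildcard media type, and leaves out other headers a real curl adds, such as User-Agent; both function names are illustrative.

```python
def build_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Accept: */*",
        "",  # blank line that terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def request_complete(buffer):
    """True once the header block's terminating blank line has arrived."""
    return b"\r\n\r\n" in buffer
```

Because TCP is a byte stream, the server may see the request arrive in pieces, so it keeps reading and appending to a buffer until request_complete(buffer) becomes true.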


The curl process on the user's computer uses the read system call to get the response. Since the -L option was not passed, curl does not follow the redirect, and since no options were passed to cause curl to output the headers (such as -i to include headers or -v for verbose), it writes only the response body, if any, to standard output. The process then cleans itself up and gracefully exits with exit code 0.
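On the receiving end, the client splits the response into a status line, headers, and body. A minimal sketch of that parsing follows, using the 301 response quoted earlier as sample input. It ignores details such as repeated headers and chunked bodies, and parse_response_head is a name invented for this example.

```python
def parse_response_head(raw):
    """Return (status_code, reason, headers) from raw HTTP response bytes."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return int(code), reason, headers
```

A client that follows redirects would look at the status code and the location header at this point; one that does not simply prints whatever body came after the blank line.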


What happens when the user retries with the https URL? Will we ever see an SSL handshake? What about firewalls!? Find out next time on Adventures in Computerland!

