How curl http://site.com works, in microscopic detail
This was an interview question for Infrastructure Engineer at GitHub. Even though I didn't get the position, I held onto this as it's one of my best technical pieces—a lot of deep details about Unix systems and networking, presented with a conversational tone. This is a piece for an advanced audience with a background in Unix systems.
A user on an Ubuntu machine runs curl http://github.com. Describe the lifecycle of the curl process and explain what happens in the kernel, over the network, and on github.com's servers before the command completes.
First, the user's shell reads the line of input from its standard input, parses it, and searches the directories listed in its PATH environment variable to locate the curl command at /usr/bin/curl. The shell then invokes the fork() system call. The newly forked child process sets up its environment, closes file descriptors that should not be passed to curl, and then calls some variant of the exec syscall, passing "/usr/bin/curl" as the path to the executable and a pointer to an array of strings that contains only the program name and the URL argument.
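The lookup, fork, and exec steps above can be sketched in a few lines of Python. This is a rough illustration, not how any real shell is written: run_command is a hypothetical helper, shutil.which stands in for the shell's own PATH search, and a real shell does far more (job control, redirections, signal setup).

```python
import os
import shutil


def run_command(argv):
    """Locate argv[0] on PATH, then fork and exec it, roughly as a shell would."""
    # shutil.which walks the PATH directories, much like the shell's lookup.
    path = shutil.which(argv[0])
    if path is None:
        raise FileNotFoundError(f"{argv[0]}: command not found")

    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the target program.
        # argv[0] is conventionally the program name itself.
        try:
            os.execv(path, argv)
        finally:
            # Only reached if exec failed; 127 is the usual shell convention.
            os._exit(127)

    # Parent: block until the child exits, then report how it ended.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Calling run_command(["curl", "http://github.com"]) would follow the same path the shell takes here, assuming curl is installed.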
The kernel locates the curl executable through the VFS layer, possibly reading from an in-memory buffer or calling into a lower-level disk driver to retrieve the blocks from disk. Because exec replaces the image of the existing child process rather than creating a new one, the process table entry created by fork() is reused; the kernel reads the executable and arranges it in memory according to its object code format, which on Linux is ELF. Assuming no errors, the process is scheduled to run at the default priority level.
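To make "object code format" a little more concrete, here is a minimal sketch of the first check any ELF loader has to make: reading the 16-byte identification block at the start of the file. The function names are made up for this example, and the real kernel loader goes on to parse program headers and map segments, which this does not attempt.

```python
def parse_elf_ident(header: bytes) -> dict:
    """Decode the leading identification bytes (e_ident) of an ELF file."""
    # Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
    if len(header) < 16 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 4 is the word size class, byte 5 the byte order, byte 6 the version.
    ei_class = {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")
    ei_data = {1: "little-endian", 2: "big-endian"}.get(header[5], "unknown")
    return {"class": ei_class, "data": ei_data, "version": header[6]}


def read_elf_ident(path: str) -> dict:
    """Read and decode the identification bytes of the file at path."""
    with open(path, "rb") as f:
        return parse_elf_ident(f.read(16))
```

On a typical x86-64 Ubuntu machine, read_elf_ident("/usr/bin/curl") should report a 64-bit little-endian file, though that depends on the installed build.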
When the curl process next receives a timeslice from the kernel, the glibc bootstrap code runs and further initializes the process in memory, loading and mapping dynamic libraries and updating the in-memory symbol table. glibc also reads in system locale information. After initialization, control passes to curl's main() function. At last, the program the user wanted is running!
curl parses the URL provided by the shell as the argv parameter to exec. It may use the getservbyname standard library function to look up the port for http, which is 80. It will use gethostbyname (or its modern replacement, getaddrinfo) to resolve the name "github.com" to an IP address, carefully packing both the IP address and port into a sockaddr structure.
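The parse, look up, and pack sequence can be approximated as below. Treat it as a sketch under assumptions: resolve_target and pack_sockaddr_in are hypothetical helpers, the fallback to port 80 only makes sense for http, and the byte layout shown is the Linux IPv4 sockaddr_in (family in host byte order, port and address in network byte order, eight bytes of padding).

```python
import socket
import struct
from urllib.parse import urlsplit


def resolve_target(url):
    """Return (ip, port) for an http URL, using the libc-style lookups."""
    parts = urlsplit(url)
    if parts.port is not None:
        port = parts.port
    else:
        try:
            # Consults the services database (/etc/services on most systems).
            port = socket.getservbyname(parts.scheme, "tcp")
        except OSError:
            port = 80  # assumption: no services database, and the scheme is http
    ip = socket.gethostbyname(parts.hostname)
    return ip, port


def pack_sockaddr_in(ip, port):
    """Pack an IPv4 address and port the way Linux lays out sockaddr_in."""
    return (
        struct.pack("=H", socket.AF_INET)  # sin_family, host byte order
        + struct.pack("!H", port)          # sin_port, network byte order
        + socket.inet_aton(ip)             # sin_addr, 4 bytes
        + b"\x00" * 8                      # sin_zero padding
    )
```

In Python the socket module builds this structure for you from an (ip, port) tuple; the explicit packing is only there to show what "carefully packing" means at the C level.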
curl will then invoke the socket syscall to create a network socket in the AF_INET domain, of type SOCK_STREAM because HTTP is a TCP service. If the syscall succeeds, socket returns a socket descriptor to curl, which then passes that descriptor to the connect system call, along with the previously packed sockaddr structure, to establish a connection to GitHub.
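In Python the same two syscalls are thin wrappers, so the client side of the connection can be sketched directly. The open_connection name and the five-second timeout are choices made for this example, not anything curl does.

```python
import socket


def open_connection(ip, port, timeout=5.0):
    """Create a TCP socket and connect it, mirroring socket() then connect()."""
    # AF_INET selects IPv4; SOCK_STREAM selects a reliable byte stream (TCP).
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect() blocks until the kernel finishes the TCP handshake or fails.
        sock.connect((ip, port))
    except OSError:
        sock.close()
        raise
    return sock
```

Everything described in the next few paragraphs happens inside that single connect() call, from the caller's point of view.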
The kernel now hands things to the network subsystem, which looks up the route to GitHub's IP and chooses the best interface on the user's machine to contact GitHub. After some housekeeping, the system eventually ends up writing a complete TCP/IP packet, the SYN that opens the handshake, to the network: source <user's IP>:<randomized source port>, destined for <github's IP>:80. The lower-layer network protocol, such as Ethernet or 802.11, handles delivery of this packet to the user's "next hop," likely a small WiFi router or similar. This router uses its own routing information to retransmit the packet upstream, and so on, until the packet is eventually transmitted to GitHub's server by the destination router, likely a high-end gigabit Ethernet router in a colocation facility.
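For a feel of what the kernel actually writes, here is a simplified sketch of the fixed 20-byte TCP header for that first SYN. It is illustrative only: a real SYN also carries options such as the maximum segment size, the checksum is left as zero here, the IP header is omitted entirely, and the port range used is the common Linux default rather than anything guaranteed.

```python
import random
import struct

TCP_SYN = 0x02  # the SYN bit in the TCP flags byte


def pick_ephemeral_port(low=32768, high=60999):
    """Pick a source port; the range is an assumed Linux default."""
    return random.randint(low, high)


def build_syn_header(src_port, dst_port, seq):
    """Pack a minimal TCP header with only the SYN flag set."""
    data_offset = 5 << 4  # header length of 5 32-bit words, no options
    return struct.pack(
        "!HHIIBBHHH",
        src_port,     # source port
        dst_port,     # destination port (80 for HTTP)
        seq,          # initial sequence number
        0,            # acknowledgment number, unused in a SYN
        data_offset,  # data offset in the high nibble
        TCP_SYN,      # flags
        65535,        # advertised window (arbitrary for this sketch)
        0,            # checksum, normally computed by the kernel or NIC
        0,            # urgent pointer
    )
```

You would never build this by hand for a normal connection; the kernel does it on curl's behalf as part of connect().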
The kernel on GitHub's server recognizes that it has a process that has called bind() and listen() on port 80, and so continues the TCP handshake by replying with a SYN-ACK packet: destination <user's IP>:<user's source port>, source <server's IP>:80, with a randomly chosen TCP sequence number. Reverse the sequence described above to get the packet back to the user, but hold on to your hats, we're going to reverse it again!
The user's kernel replies with an ACK back to the server: handshake completed, let's exchange some packets! In user space on GitHub's server, an accept() system call returns the newly established connection, which is either put into an I/O event pool to be polled or handed to a forked-off child process, depending on the concurrency model of the receiving process.
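The server side of that exchange can be sketched the same way. The helper names are hypothetical, and the listener defaults to a kernel-chosen port on loopback because binding port 80 normally requires root; nothing here reflects how GitHub's servers are actually built.

```python
import socket


def make_listener(host="127.0.0.1", port=0, backlog=16):
    """Bind and listen; port 0 asks the kernel to choose a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    # After listen(), the kernel completes handshakes on the process's behalf
    # and queues established connections until accept() collects them.
    listener.listen(backlog)
    return listener


def accept_one(listener):
    """Take one established connection off the kernel's accept queue."""
    conn, peer = listener.accept()
    return conn, peer
```

Note that a client's connect() can succeed before the server ever calls accept(): the handshake is the kernel's job, and accept() just hands over a connection that already exists.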
Back to the user's PC for the exciting conclusion!
After checking for errors, curl uses the write system call to send something like GET / HTTP/1.1\r\nHost: github.com\r\nAccept: */*\r\n\r\n to the socket. This gets translated into a packet, or a sequence of packets, by the kernel, does the "traverse the Internet" jig, and eventually gets received by the kernel on GitHub's server, after being reassembled if needed. Assuming no connection errors or timeouts, the web server process recognizes the double \r\n sequence as the end of the HTTP request and sends its response: HTTP/1.1 301 Moved Permanently\r\nLocation: https://github.com\r\n… and closes the connection.
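The wire format is simple enough to build and parse by hand. The sketch below is deliberately minimal and uses made-up helper names: it assumes a well-formed status line with a reason phrase, ignores repeated headers, and does nothing with the body.

```python
def build_request(host, path="/"):
    """Build a bare-bones HTTP/1.1 GET request as bytes."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Accept: */*", "", ""]
    # Joining with CRLF leaves a blank line at the end, which terminates
    # the header section.
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw):
    """Split a raw response at the blank line and decode status and headers."""
    head, sep, _body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("incomplete response head")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers
```

Feeding the 301 response above through parse_response_head yields the status code 301 and a location header pointing at the https URL, which is exactly what curl would follow if it had been given -L.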
The curl process on the user's computer uses the read system call to get the response. Since no options were passed to cause curl to output the headers (such as -i to include headers or -v for verbose output), it prints only the response body, if there is one. The process then cleans itself up and gracefully exits with exit code 0.
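Because the server closes the connection, the client can simply read until end-of-file. A minimal sketch of that loop, with read_until_eof as a hypothetical helper and 4096 as an arbitrary chunk size:

```python
import socket


def read_until_eof(sock, chunk_size=4096):
    """Read from a connected socket until the peer closes its end."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:
            # An empty read means the peer has closed the connection.
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

A real client also has to handle Content-Length and chunked transfer encoding for connections that stay open, which this loop ignores.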
What happens when the user retries with the https URL? Will we ever see an SSL handshake? What about firewalls!? Find out next time on Adventures in Computerland!