curl http://site.com works, in microscopic detail
This was an interview question for an Infrastructure Engineer role at GitHub. Spoilers: I didn’t get the job, so if you’ve got an interview scheduled, maybe check somewhere else. If you are curious about the nitty-gritty details of how a shell spawns a subcommand, sets up a network socket, and sends an HTTP request, then read on, dear reader!
A user on an Ubuntu machine runs curl http://github.com. Describe the lifecycle of the curl process and explain what happens in the kernel, over the network, and on github.com's servers before the command completes.
First, the user's shell reads the line of input from its standard input, parses it, and uses its PATH environment information to locate the curl command at /usr/bin/curl. The shell then invokes the fork() system call. The newly-forked child process sets up its environment & closes file descriptors that should not be passed to curl, and then calls some variant of the exec syscall, passing "/usr/bin/curl" as the path to the executable and a pointer to a 3-element array of strings that contains: ["/usr/bin/curl", "http://github.com", NULL].
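The parse, locate, fork, and exec steps above can be sketched in a few lines. This is a rough illustration in Python rather than anything a real shell does verbatim: the helper names (parse_command, find_in_path, spawn_and_wait) are made up for this post, and Python's os module is only a thin wrapper over the same system calls.

```python
import os
import shlex


def parse_command(line):
    """Split a command line into argv-style words, roughly as a shell would."""
    return shlex.split(line)


def find_in_path(name, path=None):
    """Return the first executable called `name` on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def spawn_and_wait(executable, argv):
    """fork(), exec the program in the child, and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image. execv only returns on failure.
        try:
            os.execv(executable, argv)
        except OSError:
            os._exit(127)  # conventional "could not execute" status
    # Parent (the "shell"): block until the child terminates.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)  # shell convention for death by signal
```

With these, spawn_and_wait(find_in_path("curl"), parse_command("curl http://github.com")) would mimic what the shell does for our command, assuming curl is installed. Python's execv adds the terminating NULL to the argument array for us.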
The kernel locates the curl executable through the VFS layer, possibly reading from an in-memory buffer or calling into a lower-level disk driver to retrieve the blocks from disk. The process table entry already exists, because fork() created it; exec keeps that entry and replaces the child's memory image, reading the executable & arranging it in memory according to its object code format, which on Linux is ELF. Assuming no errors, the process is scheduled to run at the default priority level.
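As a small aside, here is how a program could recognize that a file is in ELF format at all. It is only a sketch of the very first check (the 4-byte magic number, plus the class byte that follows it), not of what the kernel's loader does afterwards; the function names are invented for illustration.

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}  # values of the EI_CLASS byte


def has_elf_magic(header):
    """Return True if the bytes start with the ELF magic number."""
    return header[:4] == ELF_MAGIC


def elf_class(header):
    """Return "32-bit" or "64-bit" for an ELF header, else None."""
    if not has_elf_magic(header) or len(header) < 5:
        return None
    return ELF_CLASSES.get(header[4])


def looks_like_elf(path):
    """Read the first four bytes of a file and check them against the magic."""
    with open(path, "rb") as f:
        return has_elf_magic(f.read(4))
```

On a typical Linux machine, looks_like_elf("/usr/bin/curl") should be true, while a shell script starting with #! would not match.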
When the curl process next receives a timeslice from the kernel, the glibc bootstrap code runs and further initializes the process in memory, loading & mapping dynamic libraries and updating the in-memory symbol table. glibc also reads in system locale information. After initialization, glibc calls curl's main() routine.
At last, the program the user wanted is running!
curl parses the URL provided by the shell in the argv parameter to exec. argv[0] is the name of the executed program, so argv[1] contains "http://github.com". curl's code may use the getservbyname standard library function to look up the well-known port for http, which is 80. It will use gethostbyname to resolve the name "github.com" to an IP address, carefully packing both the IP address and port into a sockaddr structure.
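Here is a hedged sketch of that lookup using Python's wrappers around the same library functions. The helper names and the FALLBACK_PORTS table are my own additions (the table only covers machines without a services database), and the (ip, port) tuple stands in for the C sockaddr structure. Real curl's resolver is more elaborate than this.

```python
import socket
from urllib.parse import urlsplit

# Assumption for this sketch: used only if the services database is missing.
FALLBACK_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Extract the host and TCP port a URL points at."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.hostname, parts.port  # explicit port in the URL wins
    try:
        port = socket.getservbyname(parts.scheme, "tcp")
    except OSError:
        port = FALLBACK_PORTS[parts.scheme]
    return parts.hostname, port


def resolve(host, port):
    """Resolve a host name to IPv4 and pair it with the port."""
    ip = socket.gethostbyname(host)  # may block on a DNS query
    return (ip, port)  # Python's stand-in for a packed sockaddr
```

Calling resolve(*host_and_port("http://github.com")) would perform a real DNS lookup, so the result depends on your network.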
curl will then invoke the socket syscall to create a network socket in the PF_INET family, of type SOCK_STREAM because HTTP is a TCP service. If the call succeeds, it returns a socket descriptor to curl, which then passes that descriptor to the connect system call, along with the previously packed sockaddr structure, to establish a connection to github.com.
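In Python the same two calls look roughly like this. Python spells the family AF_INET (the same value as PF_INET on Linux), and the timeout is an assumption I added so the sketch cannot hang forever; open_tcp_connection is a made-up name.

```python
import socket


def open_tcp_connection(sockaddr, timeout=5.0):
    """Create a TCP socket and connect it to an (ip, port) pair."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket(2)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)  # connect(2): the kernel starts the handshake
    except OSError:
        sock.close()  # do not leak the descriptor on failure
        raise
    return sock
```

By the time connect returns successfully, the kernel has already finished the TCP handshake described next.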
The kernel now hands things to the network subsystem, which looks up the route to GitHub's IP and chooses the best interface on the user's machine to contact GitHub. After some housekeeping, the system eventually ends up writing a complete TCP/IP packet to the network: SYN from <user's IP>:<randomized source port>, destined for <github's IP>:80. The lower layer network protocol, such as Ethernet or 802.11, handles delivery of this packet to the user's "next hop," likely a small WiFi router or similar. This router uses its own routing information to retransmit the packet upstream, and so on, until the packet is eventually transmitted to GitHub's server by the destination router, likely a high-end gigabit Ethernet router in a colocation facility.
The kernel on GitHub's server recognizes that it has a process that has called bind() and listen() on port 80, and so continues the TCP handshake by replying with another TCP packet: SYN+ACK, destination <user's IP>:<user's source port>, source <server's IP>:80, with a randomly-chosen TCP sequence number of its own. Reverse the sequence described above to get the packet back to the user, but hold on to your hats, we're going to reverse it again!
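To make the first two packets concrete, here is a toy sketch that packs bare 20-byte TCP headers. It is heavily simplified and the function names are mine: the checksum is left at zero, and real SYN packets also carry options such as the maximum segment size. The point is only to show that the reply swaps the two ports and acknowledges the client's sequence number plus one.

```python
import struct

SYN = 0x02
ACK = 0x10
HEADER_FORMAT = "!HHIIHHHH"  # network byte order, 20 bytes in total


def tcp_header(src_port, dst_port, seq, ack, flags, window=65535):
    """Pack a minimal TCP header with no options."""
    offset_and_flags = (5 << 12) | flags  # data offset: 5 words = 20 bytes
    checksum = 0  # simplification: a real stack computes this
    urgent = 0
    return struct.pack(HEADER_FORMAT, src_port, dst_port, seq, ack,
                       offset_and_flags, window, checksum, urgent)


def syn(src_port, dst_port, client_isn):
    """The client's opening packet: SYN with its initial sequence number."""
    return tcp_header(src_port, dst_port, client_isn, 0, SYN)


def syn_ack(request, server_isn):
    """The server's reply: ports swapped, acknowledging client_isn + 1."""
    src_port, dst_port, client_seq = struct.unpack("!HHI", request[:8])
    return tcp_header(dst_port, src_port, server_isn,
                      (client_seq + 1) % 2**32, SYN | ACK)
```

For example, syn_ack(syn(54321, 80, 1000), 5000) yields a header whose source port is 80, whose destination port is 54321, and whose acknowledgment number is 1001.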
The user's kernel replies ACK back to the server: handshake[1] completed, let's exchange some packets! In user-space on GitHub's server, an accept() system call returns the newly established connection, which is either put into an IO event pool to be polled or handed to a forked-off child process, depending on the concurrency model of the receiving process.
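The server side of this can be sketched as follows. It is an illustration only: I know nothing about GitHub's actual server code, the function names are invented, and the listener defaults to an ephemeral port on localhost because binding port 80 normally requires elevated privileges.

```python
import socket


def listen_on(host="127.0.0.1", port=0, backlog=16):
    """bind() and listen(): from here on the kernel answers SYNs for us."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))  # port 0 asks the kernel for any free port
    server.listen(backlog)     # backlog bounds the queue of pending connections
    return server


def accept_one(server):
    """Block until a connection has finished its handshake, then return it."""
    conn, peer = server.accept()
    return conn, peer  # peer is the client's (ip, source port)
```

A forking server would call os.fork() right after accept_one and let the child serve conn, while an event-driven one would register conn with something like the selectors module instead.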
Back to the user's PC for the exciting conclusion!
After checking for errors, curl uses the write system call to send something like GET / HTTP/1.1\r\nHost: github.com\r\nAccept: */*\r\n\r\n to the socket. This gets translated into a packet, or sequence of packets, by the kernel, does the "traverse the Internet" jig, and eventually gets received by GitHub's server's kernel, after being reassembled if needed. Assuming no connection errors or timeouts, the web server process recognizes the double \r\n sequence as the end of the HTTP request and sends its response: HTTP/1.1 301 Moved Permanently\r\nLocation: https://github.com\r\n… and closes the connection.
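The request and response are just bytes, so both ends can be sketched with plain string handling. This is a minimal illustration with made-up helper names: real curl also sends a User-Agent header, and a real parser has to cope with malformed input that this one ignores.

```python
def build_request(host, path="/"):
    """Build a bare-bones HTTP/1.1 GET request as bytes."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Accept: */*", "", ""]
    return "\r\n".join(lines).encode("ascii")


def headers_complete(buffer):
    """A blank line (double CRLF) marks the end of the header section."""
    return b"\r\n\r\n" in buffer


def parse_response_head(raw):
    """Split a raw response into status code, reason, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body
```

A server can call headers_complete on whatever it has received so far to decide whether a body-less request like this GET has fully arrived.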
The curl process on the user's computer uses the read system call to get the response and then, since no options were passed to cause curl to output the headers (such as -i to include headers or -v for verbose output), nor to follow the redirect (the -L flag), the process prints only the response body, cleans itself up, and gracefully exits with exit code 0.
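The read loop can be sketched like this. It follows the simplified story told here, where the server closes the connection and the client reads until end-of-file; real HTTP clients usually rely on Content-Length or chunked encoding instead. Both function names are invented for the example.

```python
import socket


def read_until_eof(sock, chunk_size=4096):
    """Keep calling recv() until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # an empty read means the other side closed
            break
        chunks.append(chunk)
    return b"".join(chunks)


def body_only(raw_response):
    """Drop the status line and headers, as curl does without -i or -v."""
    _, _, body = raw_response.partition(b"\r\n\r\n")
    return body
```

Writing body_only(read_until_eof(sock)) to standard output and then exiting with status 0 is roughly the last thing our curl process does.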
What happens when the user retries with the https URL? Will we ever see an SSL handshake? What about firewalls!? Proxies?? Find out next time on Adventures in Computerland!
1: This is the famous “three-way handshake” of TCP: send SYN, receive SYN+ACK, reply ACK. Here’s an article with pictures that covers the basics.