ICS 32 Winter 2022
Notes and Examples: URLs and HTTP


Background

Thus far in this course and the preceding one, you've written Python programs that read data from text files and that exchange data over a network via sockets, which are two big steps that push outward the boundaries around what we can accomplish in Python. However, there is an elephant in the room, so to speak. If we think about where most of the interesting data on the Internet resides, it's on the web. Web sites display content and allow human users to interact with web-based data, while web APIs provide a similar ability to other programs.

Both web sites and web APIs are organized around the same fundamentals we've already seen: A connection is initiated by a client connecting to a server and a protocol is followed that governs what the conversation looks like. So if we want to interact with web data — the simplest example of which is to download the content of a web page — we need to know enough about that protocol to be able to implement the conversation, and the rest isn't much different from what we've already done.

That HTTP is a common, standard protocol is good news for us: There's a pretty good chance we're going to be able to use it without having to implement all of the low-level, fiddly code we had to write when we implemented our own custom protocol before. But, nonetheless, we still need to understand what HTTP is, its basic structure, the terminology surrounding it, and so on. We can gloss over the fine-grained details, however, and still get real work done.


URLs

When we use a browser to visit a web page, all we need to do is tell the browser where we want to go and it handles the rest. The notion of "where we want to go" is encapsulated by a URL (Uniform Resource Locator), which specifies a few things:

One of the earlier code examples included a link to a short Python module called oops.py. The complete URL for that link is: http://www.ics.uci.edu/~thornton/ics32/Notes/Exceptions/oops.py. Here's what that URL means:

Given that information, a browser will know just what it needs to do:

But browsers aren't the only programs that can have conversations using HTTP. Our Python programs can do it, too. But we need to know a little bit about HTTP in order to do so effectively.
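As a small illustration of a Python program taking a URL apart into its protocol, host, and resource, here is a minimal sketch using the standard library's urllib.parse module. The variable names are my own, chosen only for this example.

```python
from urllib.parse import urlparse

url = 'http://www.ics.uci.edu/~thornton/ics32/Notes/Exceptions/oops.py'
parts = urlparse(url)

print(parts.scheme)   # the protocol named at the front of the URL
print(parts.netloc)   # the host we would connect to
print(parts.path)     # the resource we would ask that host for
print(parts.port)     # None here, because this URL doesn't name a port
```

When no port appears in the URL, it's up to the program making the connection to choose the protocol's usual one.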


Some background on HTTP

HTTP (HyperText Transfer Protocol) is the protocol with which most web traffic on the Internet is transacted. Its latest version is HTTP/2 (with an HTTP/3 on the way), though there's still a fair amount of traffic that uses the older (and more easily-understood) HTTP/1.1 for now, so we'll stick with that.

HTTP is a request-response protocol, which means that its conversations go something like this:

After that single request and response, both sides close the connection. (I should note that there are performance optimizations available that let a client specify that the connection should be kept open if, for example, the client knows that it needs not just a web page's text but also several images from the same server. For our purposes, we'll stick with a single request and response per connection.)

Python programs can make these requests and parse these responses, but that requires us to know a little bit about the format of each. HTTP requests come in a few flavors, but the most common of them is called a GET, which means that the client would like to "get" a resource (a web page, an image, etc.) from the server. (We may see other alternatives later if we find a need for them.) A GET request in HTTP/1.1 looks like this.

GET /~thornton/ics32/Notes/Exceptions/oops.py HTTP/1.1
Host: www.ics.uci.edu

The first line of a GET request begins with the word GET, is followed by the web resource you want to download (the part of the URL that follows the protocol and host), and finally is followed by HTTP/1.1, as a way to indicate what protocol we expect to be using for the conversation. Notice that there are spaces separating the word GET and the resource, and also between the resource and the HTTP/1.1. Because these spaces are part of the protocol — and because the presence of spaces elsewhere could make this more difficult for a server to handle — note that URLs are not permitted to contain spaces.

The second and subsequent lines contain what are called headers, which allow us to specify a variety of supplementary information that the server can use to figure out how to send us a response. In our case, we've included just one, a header called Host:, which specifies the name or IP address of the host we think we're connecting to; this is useful in the case that the same machine has multiple names (e.g., more than one web site being served up by the same machine), and it is required in every HTTP/1.1 request. Other headers can specify what browser (and what version) is being used (so, for example, a server can send back different output for a small screen like an iPhone's than for a larger screen like a laptop's or desktop's), request a variety of performance optimizations, or carry security-related information (such as a password or an access token that grants access to a page that might otherwise be hidden).

A blank line following the last header informs the server that there are no more headers. At that point, the request is complete.
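To make the shape of a GET request concrete, here is a minimal sketch of a function that builds one as bytes, ready to be sent through a socket. The function name build_get_request is hypothetical, not part of any library. One detail the example relies on is that each line of an HTTP/1.1 request ends with a carriage return followed by a newline ('\r\n').

```python
def build_get_request(host: str, resource: str) -> bytes:
    '''Builds a minimal HTTP/1.1 GET request for the given resource.'''
    # Spaces separate the parts of the first line, so a space inside
    # the resource would make the request ambiguous.
    if ' ' in resource:
        raise ValueError('resource may not contain spaces')

    lines = [
        f'GET {resource} HTTP/1.1',
        f'Host: {host}',
        '',   # the blank line that says "no more headers"
        '',   # makes the joined text end with a final '\r\n'
    ]

    return '\r\n'.join(lines).encode(encoding = 'ascii')


request_bytes = build_get_request(
    'www.ics.uci.edu', '/~thornton/ics32/Notes/Exceptions/oops.py')
print(request_bytes)
```

This sketch sends only the Host header; a real client would usually send several more.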

If we were to send that request, we might get back a response very much like this one, though I've left out some details for brevity.

HTTP/1.1 200 OK
Date: Wed, 30 Dec 2021 07:56:07 GMT
Server: Apache/2.4.6 (CentOS)
...
...
Content-Length: 436
Content-Type: text/plain; charset=UTF-8

# oops.py
#
# ICS 32 Winter 2022
# Code Example
...
...
if __name__ == '__main__':
    f()

The first line of the response indicates that the server agrees to have an HTTP/1.1 conversation (that's the HTTP/1.1 part), followed by what's called a status code (in this case, 200) and a reason phrase (in this case, OK). There are forty or so status codes that are defined as part of the HTTP/1.1 standard; the two most common ones are:

The first line of the response is followed by headers, just as the first line of the request is. The server determines what headers to send, and the details there are too numerous to list, but I've included a few of the more interesting ones in the example above:

After the last header is a blank line, followed by the desired content — in this case, the contents of the file oops.py that is linked from one of my code examples.
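The structure of a response (a first line, then headers, then a blank line, then the content) can be sketched as a short parsing function. This is a deliberately simplified illustration, not a complete implementation of the specification; the name parse_response and the sample bytes are made up for the example, and it assumes the whole response has already been received.

```python
def parse_response(raw: bytes) -> tuple:
    '''Splits a complete HTTP/1.1 response into its status code,
    reason phrase, headers, and content.'''
    # The first blank line separates the headers from the content.
    head, _, content = raw.partition(b'\r\n\r\n')
    lines = head.decode(encoding = 'iso-8859-1').split('\r\n')

    # First line: protocol version, status code, reason phrase.
    first_line_parts = lines[0].split(' ', 2)
    status_code = int(first_line_parts[1])
    reason = first_line_parts[2] if len(first_line_parts) > 2 else ''

    # Remaining lines: one "Name: value" header per line.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()

    return status_code, reason, headers, content


sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Length: 5\r\n'
          b'Content-Type: text/plain; charset=UTF-8\r\n'
          b'\r\n'
          b'hello')

status_code, reason, headers, content = parse_response(sample)
print(status_code, reason)
print(headers['content-type'])
print(content)
```

The header names are lowercased here because HTTP header names are not case-sensitive, which makes them easier to look up afterward.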

For those of you who are interested in the full details of HTTP/1.1, the specification for it can be found here. Don't feel obligated to read through it unless you're interested; it's not a part of the course. But if you want to get an idea of the complexity level of HTTP, and why we should be so quick to want to find a library that implements all of that complexity for us, take a quick look through it (and note that one of the main authors of the specification, Roy Fielding, was completing his Ph.D. here at UCI at the time it was written).


How is HTTPS different from HTTP?

In recent years, the use of HTTPS has become significantly more common than in years past; many web servers (including the ICS web server from which you downloaded this web page) now require the use of HTTPS, with any attempt to reach it via HTTP being "redirected" to use HTTPS instead. So, why is that such an important thing?

HTTPS is a variant of HTTP. It solves the same basic problem that HTTP does and is used in the same way — there are still requests and responses, headers, status codes, and so on — except that it does two additional things that HTTP doesn't do:

  1. The contents of what's sent between clients and servers are encrypted. This is good, especially if you're on a shared, public network, because eavesdroppers can't easily determine what you're sending and receiving. That's important for sensitive traffic such as bank transactions, but an increasing number of people are finding that privacy is farther reaching than that. Where we go online is a window into the whole of who we are, and if that's not sensitive, what is?
  2. The server sends something to the client that demonstrates that it is who it says it is, so you have some assurance that you're talking to, for example, Bank of America and not an imposter. (If you're wondering why that's important, consider the implications of connecting to a site that you think is your bank's web site, but that's actually run by a malicious imposter. If you enter your login credentials, that malicious imposter can turn around and enter those same credentials into the actual Bank of America site and impersonate you.)

The second of these is more complex than the first, but not so complex that we can't understand it. The way it works is this: The server sends a certificate that establishes (more or less) that "Organization X certifies that the sender of this certificate is really Bank of America." The "Organization X" is called a certificate authority, one of a number of businesses around the world that provide the service of verifying identities and issuing certificates to establish that those identities have been verified.

So who trusts the certificate authority? The answer is that the certificate authority also issues a certificate that says "Organization Y establishes that Organization X is really a certificate authority," where "Organization Y" is another, higher-level authority. Organization Y might also issue a certificate that establishes its identity, as verified by Organization Z. This, generally, is called a chain of trust.

The certificates in a chain of trust are difficult to forge — they each require a piece of knowledge called a private key that is kept hidden by each organization — but, of course, they don't prove anything unless you eventually end up at an organization you inherently trust already. For this reason, operating systems generally include a set of root certificates from a few well-known organizations that are widely believed to be trustworthy. If you can establish a chain of trust from a person back to someone you already trust, the theory is that you should trust that person. (The root certificates are the ones that are trusted inherently.) Applying that principle to people, suppose you have a friend named Alice who you already trust.

So, as you can see, a lot happens when you say "https://" in a URL. Behind the scenes, certificates are sent and a chain of trust is built, and only if that chain of trust leads back to a root certificate will the connection even be successful. (If you've ever wondered why "https" URLs sometimes seem a little slower than their "http" counterparts, this is why; it takes time to do all of this. In a world where we can't trust that everyone is acting in our best interest, though, the benefits outweigh the relatively minor costs.)
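Python's standard library exposes these checks through its ssl module. The short sketch below doesn't connect to anything; it only creates the default settings that the ssl module recommends for a client and shows that both behaviors described above are turned on: the server must present a certificate that can be verified, and the name in that certificate must match the host we asked for.

```python
import ssl

# The default client-side settings recommended by the ssl module.
context = ssl.create_default_context()

# The server's certificate must be verified through a chain of trust.
print(context.verify_mode == ssl.CERT_REQUIRED)

# The certificate must also belong to the host we meant to reach.
print(context.check_hostname)
```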

Root certificates in Python on macOS

(This section describes a problem that may affect those of you running macOS; you can safely skip this section if you're running Windows or Linux.)

A lot of the stuff I described above is often handled by a software library called OpenSSL. Starting in Python 3.7 (and still in its current version), Python no longer uses the version of OpenSSL bundled with your operating system; for security reasons, it uses its own version instead. It also no longer uses your operating system's root certificates; it uses its own.

On macOS, when you install Python, root certificates are often not installed alongside it. This means that Python, by default, trusts no one, which makes it impossible to use HTTPS. (Some of you who use Macs may have noticed that when you use Python to open some URLs, you can do it successfully using "http://" but not using "https://". This is why.)

It turns out that there is a way to solve the problem, though: Install the operating system's root certificates into your Python installation (i.e., tell Python to trust whoever your operating system already trusts). You can do that by browsing to /Applications/Python 3.10 on your hard drive, then double-clicking Install Certificates.command. If you haven't done that already, you'll want to do that now; without that, you'll find that you're unable to run a lot of the example programs that we write, and you'll also have trouble completing Project #3, which will require the use of HTTPS.
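If you're curious where your Python installation looks for root certificates, the ssl module can tell you. This is only a rough diagnostic sketch: what it prints varies from one machine to another, and I'm assuming (not guaranteeing) that a value of None for both is a hint that certificates haven't been installed yet.

```python
import ssl

paths = ssl.get_default_verify_paths()

# A single file containing root certificates, or None if there isn't one.
print(paths.cafile)

# A directory containing root certificates, or None if there isn't one.
print(paths.capath)
```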


The urllib.request module in the Python standard library

Unlike the protocols we've implemented in this course, which had a fairly straightforward sequence of what needed to be sent from client to server and vice versa, HTTP is anything but simple. It is used for everything from fetching a simple web page, to implementing the "guts" of the conversations happening behind the scenes while you use full-featured web sites like Gmail, to allowing non-browsers to interact with web data (e.g., programs that can send tweets via Twitter). While we could certainly implement an HTTP conversation using the techniques we've seen so far (opening a socket connection to a server's port 80, constructing and sending a GET request, parsing the response), this is a very complex task. In order to do the job right, we would need to implement the entire specification, which weighs in (when printed) at well over 100 pages.

Happily, HTTP support is so fundamental to the needs of so many programmers that many programming language libraries include it; Python is no exception. Python's library includes a number of modules that implement different parts of the HTTP specification, with the main trick being to understand which module you need in a given circumstance.

Suppose our goal is simple: We just want to download the contents of a single web page in Python, given its URL. (Note that your task in Project #3 is similar: Given the URL to information on the web that your program will need, you just want to download and use that information.) More complex interactions require more complex tools, but the interactions we've needed thus far are the simplest ones, so the simplest part of the library will suffice. That module is called urllib.request.

The urllib.request module has one function that we're interested in: urllib.request.urlopen(). Looking through its documentation reveals many more details than we need to know if we only want to download a web page using a GET request; downloading one page can be done in the Python shell by doing just this:

>>> import urllib.request
>>> request = urllib.request.Request('https://www.ics.uci.edu/~thornton/ics32/Notes/Exceptions/oops.py')
>>> response = urllib.request.urlopen(request)

What we did was a two-step process:

  1. We created an object representing the request we want to make. We could just use a string instead, but we're better off creating a Request object, because it's capable of letting us specify other things about our HTTP request — headers, content, a kind of request other than GET — that we might like to be in control of.
  2. We passed that Request object to urllib.request.urlopen, which is where we're asking to connect and obtain the information we're looking for.
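To see what a Request object lets us control, here is a brief sketch that builds two of them without sending anything. The User-Agent value is made up for the example. (One quirk worth knowing: Request stores header names with only the first letter capitalized, which is why the lookup below asks for 'User-agent'.)

```python
import urllib.request

url = 'https://www.ics.uci.edu/~thornton/ics32/Notes/Exceptions/oops.py'

# A GET request that carries one extra header (a hypothetical value).
request = urllib.request.Request(
    url, headers = {'User-Agent': 'ICS32-Example/1.0'})

print(request.get_method())              # the kind of request
print(request.host)                      # the host taken from the URL
print(request.get_header('User-agent'))  # the header we supplied

# A different kind of request, asking only for the response's headers.
head_request = urllib.request.Request(url, method = 'HEAD')
print(head_request.get_method())
```

Nothing is sent over the network until one of these objects is passed to urllib.request.urlopen().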

The urlopen() function returns an object called an HTTPResponse, which provides a few useful attributes and methods, the most important of which is the read() method, which retrieves all of the content from the response (i.e., the contents of the web page you asked for).

>>> data = response.read()
>>> response.close()
>>> data
b"# oops.py\r\n#\r\n# ICS 32 Winter 2022\r\n# Code Example\r\n#\r\n......."

There are a couple of things worth noticing here. One is that we closed the response object once we were done reading the data from it. Just like you want to close files and sockets when you're finished with them, you're going to want to close the response objects you get back from urlopen(), too. (In fact, internally, there's probably a socket that is being closed behind the scenes when you close the response.)
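One way to make sure the response is always closed, even if something goes wrong while reading it, is to use a with statement, just as you might with a file. The sketch below wraps that pattern in a function whose name, download_bytes, is my own invention. So that the example can run without a network connection, it is demonstrated with a "data:" URL, which carries its content inside the URL itself; an "https://" URL would be passed in the same way.

```python
import urllib.request


def download_bytes(url: str) -> bytes:
    '''Downloads the content at the given URL and returns it as bytes.'''
    request = urllib.request.Request(url)

    # The response is closed automatically when the with block ends.
    with urllib.request.urlopen(request) as response:
        return response.read()


data = download_bytes('data:text/plain;charset=utf-8,hello')
print(data)
```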

Also, if you look carefully at what's shown in the Python shell when we look at the value of data, you'll notice that it doesn't look quite like the other strings you've seen before. This string has a b displayed in front of the quote that begins it; as usual, when you see a little distinction like that, it probably means something. Let's take a look at data's type.

>>> type(data)
<class 'bytes'>

Interlude: What is a bytes object?

A bytes object in Python represents what it sounds like: a collection of bytes. A byte is a simple concept: it's eight binary digits, each being either 0 or 1. Ultimately, everything in your computer's memory — and everything sent from one machine to another via a computer network — is represented this way; the question is how the bytes are interpreted. If you see the byte 10001101, what does it mean? The answer depends very much on what kind of data you expect to get. In other words, the bytes don't mean anything without you knowing what the encoding is; the encoding is a mapping between bytes and their meanings.

Conceptually, we think of strings as sequences of characters. But what is each character? In truth, each character is really stored numerically, using the same binary digits as everything else. But how do we know which binary digits mean 'A', which mean '8', and so on? That's where an encoding comes into play: An encoding for strings maps characters to their binary representations (and back again). We can encode a string into its bytes, and we can decode the bytes back into a string again. The only trick is telling Python which encoding to use. The most common encoding on the Internet is one called UTF-8, which is one way that a character set called Unicode can be encoded. The details are well beyond the scope of our work here; for us, it's enough to know that UTF-8 exists and that it's a particular kind of encoding of strings as bytes.

If we have a bytes object, we can turn it into the corresponding string by calling the method decode() on it.

>>> text = data.decode(encoding = 'utf-8')
>>> type(text)
<class 'str'>

Similarly, we can take a string object and call the encode() method on it to turn it back into a bytes object.

>>> encoded = text.encode(encoding = 'utf-8')
>>> type(encoded)
<class 'bytes'>

Notice, in both cases, that we passed an encoding argument to the method. This is because it's not enough to say that we want to do the conversion; because there are lots of different conversions possible, we have to say how we want to do the conversion.
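A short experiment shows why the choice matters. The string below contains one character outside of the basic English alphabet (an accented e, written here as '\u00e9'), and two different encodings turn it into different bytes. Latin-1 is used only as a contrasting example of another encoding.

```python
text = 'caf\u00e9'   # the word "cafe" with an accented final e

utf8_bytes = text.encode(encoding = 'utf-8')
latin1_bytes = text.encode(encoding = 'latin-1')

# The same four characters, encoded two different ways.
print(utf8_bytes)     # five bytes: the accented e takes two
print(latin1_bytes)   # four bytes: the accented e takes one

# Decoding with the encoding that was used gives back the original.
print(utf8_bytes.decode(encoding = 'utf-8') == text)

# Decoding with the wrong encoding silently gives the wrong text.
garbled = utf8_bytes.decode(encoding = 'latin-1')
print(garbled == text)
```

The last result is the troubling one: Python raises no error, but the text we get back is not the text we started with.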

It will be a safe assumption, in our work, that strings are encoded as UTF-8, though this is hardly a 100% safe assumption on every project you'll ever do. (Note that this is why the HTTP response we saw earlier included a Content-Type header, which described not only that the content was text, but that its encoding was UTF-8; unless the server tells us what we got, we have no way to know how to use it.)
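If we'd rather ask than assume, the headers of a response can tell us the encoding. The headers attribute of the object returned by urlopen() is a message object from Python's email package, which has a get_content_charset() method. Here is a minimal sketch, using a hand-built set of headers so that it runs without a network connection; the function name choose_encoding is hypothetical.

```python
from email.message import Message


def choose_encoding(headers: Message, default: str = 'utf-8') -> str:
    '''Returns the encoding named in the Content-Type header, or the
    given default if the header doesn't name one.'''
    charset = headers.get_content_charset()
    return charset if charset is not None else default


# Hand-built headers standing in for the ones a server might send.
with_charset = Message()
with_charset['Content-Type'] = 'text/plain; charset=UTF-8'

without_charset = Message()
without_charset['Content-Type'] = 'text/plain'

print(choose_encoding(with_charset))
print(choose_encoding(without_charset))
```

With a real response, the call would look like choose_encoding(response.headers).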

Continuing with our previous example

Now that we understand what a bytes object is, we can decide what we want to do with it. Sometimes, we'll want to decode it, because we know it's a string (and we know what the appropriate encoding is). Other times, we'll want to write it to a file, send it to another host via a socket, or any number of other things. What you do next depends on what you want.

If our goal was to print out the text from the web page, though, we would need to decode it, because it's easier to work with strings when we print text than with bytes.

>>> text = data.decode(encoding = 'utf-8')
>>> lines = text.splitlines()
>>> lines
['# oops.py', '#', '# ICS 32 Winter 2022', '# Code Example', '#', .......]

And, at that point, we could loop over the list of lines and print each one out. Once we have a list of strings, all of the techniques we already know about will come into play.
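That last step might look like the sketch below, which uses a shortened, made-up stand-in for the downloaded text. Notice that splitlines() removes the '\r\n' at the end of each line, so nothing extra needs to be stripped before printing.

```python
# A shortened stand-in for the decoded text of the downloaded file.
text = '# oops.py\r\n#\r\n# ICS 32 Winter 2022\r\n# Code Example\r\n'

lines = text.splitlines()

for line in lines:
    print(line)
```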


The code

Below is a link to a short program that asks the user to type a URL, as well as a path on their local hard drive, then downloads the contents of that URL and saves it into a file at the specified path, using the techniques demonstrated above.
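For readers without the linked program at hand, here is a minimal sketch of how such a program could be put together. It is not the course's program, only my own arrangement of the same techniques, and the names download_to_file and main are hypothetical. The content is written in binary mode ('wb') so that the bytes are saved exactly as they were received, with no decoding required.

```python
import urllib.request
from pathlib import Path


def download_to_file(url: str, path: Path) -> int:
    '''Downloads the content at the given URL, saves it to the given
    path, and returns the number of bytes saved.'''
    request = urllib.request.Request(url)

    with urllib.request.urlopen(request) as response:
        data = response.read()

    # Binary mode writes the bytes unchanged.
    with open(path, 'wb') as output_file:
        output_file.write(data)

    return len(data)


def main() -> None:
    '''Asks the user for a URL and a path, then performs the download.'''
    url = input('URL: ').strip()
    path = Path(input('Save to: ').strip())
    byte_count = download_to_file(url, path)
    print(f'Saved {byte_count} bytes to {path}')
```

Calling main() runs the program interactively. This sketch makes no attempt to handle failures, such as a URL that can't be reached or a path that can't be written.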