ICS 32 Winter 2022, Python Background Notes: Files

Files and file systems

If the only kind of input our Python programs can accept comes from the keyboard, in response to a call to the built-in input() function, it feels awfully limiting. If we think about the programs we actually use, we quickly realize that they take input from sources other than just what we type. Another common source of input is information read from a file system.

What is a file system?

If you've ever used a personal computer — a desktop machine or a laptop, for example — there's a good chance that you've interacted with a file system, even if you've never heard the term before. A file system is software that manages how information is stored on a storage device such as a hard drive or a USB stick. There are a number of different kinds of file systems in use — sometimes more than one kind on the same operating system! — but they mostly share the same basic characteristics, while differing mainly in the fine-grained details. So if you know about those shared characteristics, you'll quickly find yourself at home using just about any file system on just about any operating system.

The basic abstraction in a file system is that of a file. A file is a container in which a sequence of bytes is stored. Each byte is effectively a sequence of eight "digits" that are either 1 or 0; each of these digits is called a bit. The bytes in each file are interpreted differently depending on what kind of file it is, and it should be noted that this is largely a matter of what the program reading the file expects that it should contain (e.g., text, an image, a song); the file system itself is mostly unconcerned with what's in each file, except for the metadata associated with the file, which keeps track of things like who owns the file, who has access to the file, and when the file was last modified. The file system manages the containers in which the bytes are stored, but cares little about the bytes inside of each file, other than to make sure that a file's contents don't change unless you ask for them to be changed.
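To make the idea of bytes and bits a little more concrete, here is a minimal sketch. The text 'Boo', the choice of the ASCII encoding, and the variable names are all assumptions made for illustration; nothing here reads an actual file.

```python
# A sketch only: 'Boo' and the ASCII encoding are illustrative choices.
text = 'Boo'

# Encoding turns the text into the bytes that would be stored in a file.
data = text.encode('ascii')

# Each byte is a number from 0 to 255 ...
byte_values = list(data)

# ... which can also be written as eight bits, each one a 0 or a 1.
bit_patterns = [format(value, '08b') for value in byte_values]

print(byte_values)    # [66, 111, 111]
print(bit_patterns)   # ['01000010', '01101111', '01101111']
```

The same three bytes could just as easily be part of an image or a song; it's the program reading them that decides to treat them as the letters B, o, and o.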

The most basic things we'll want to do with files in Python are read data from them (so we can use it as input to our program) and write data to them (so we can "save" it, or use it in another program). We'll start our story there.

Files

Interacting with files in Python requires first that we establish a sort of a connection to them. The values we store in variables are part of our program; they're part of what Python knows or calculates while the program runs. But the contents of files lie outside of our program, so we'll need some way to cross that boundary.

In Python, we do that by opening a file, which can most easily be done using the built-in function open(). If we call open() and pass it a single argument that is the path to a file — an indication of where the file is — then we'll get back a file object that we can use to interact with it.

>>> the_file = open('D:\\Examples\\data\\myfile.txt')

When we open a file, we also have to establish what we want to do with it, which requires us to specify a couple of things.

Whether we want to read the file or write to it. In other words, we may want information that's in the file to become input to our program; that's reading. We may also want information in our program to be stored in the file instead; that's writing. For the most part, you'll only ever want one or the other on a particular file at any given time.
What you expect the file to contain. For our purposes, there are two kinds of things we expect to find in a file: text or binary data. For now, we'll focus our attention on files that contain only text.

If you pass a second argument to the built-in open() function, you can specify both of these choices; the default, if you don't pass a second argument, is that you intend to read text. So we would expect the_file to be a file object from which we can read text. Let's see what we got.

>>> type(the_file)
<class '_io.TextIOWrapper'>

It's not especially important that we know exactly what a _io.TextIOWrapper is, but the name at least gives us the sense that it provides input or output capabilities (what is often abbreviated in computing as I/O) and that it deals with text. In this course, I'll just refer to these as "file objects."

I tend to prefer to be explicit about whether I intend to read from a file or write to it, so I'll generally pass the second argument. (It's a little more to type, but it makes clear that I intend to read from the file, as opposed to my simply having forgotten to say anything.) The way we say that we want to read text from the file is to pass the string literal 'r' in the second argument.

>>> the_file = open('D:\\Examples\\data\\myfile.txt', 'r')

Once you have a file object, you can read from it or write to it, which we'll return to shortly. But one important detail that we should consider first is that the notion of "opening" might make you wonder if there exists an inverse notion of "closing," as well. Indeed there is, and, in fact, it's a vital one. When you're done using a file, you're always going to want to close it, which you do by calling the close() method on the file object.

>>> the_file.close()

Once the file is closed, you'll no longer be able to use the file object, but you'll have ensured that any operating system resources involved in keeping the file open will no longer be in use, and that other programs that may want to open the file will be able to do so. We'll always close files we've opened after we're done with them.
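Here is a small sketch of what "no longer usable" looks like in practice. It assumes a scratch file created with the standard tempfile module (so that it runs without a D: drive) and uses the file object's closed attribute; the path and the variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file, created only so the sketch can run anywhere.
temp_dir = tempfile.TemporaryDirectory()
path = os.path.join(temp_dir.name, 'myfile.txt')

setup_file = open(path, 'w')
setup_file.write('Boo\n')
setup_file.close()

the_file = open(path, 'r')
first_line = the_file.readline()
the_file.close()

# The file object remembers that it has been closed.
was_closed = the_file.closed

# Trying to read from a closed file raises a ValueError.
try:
    the_file.readline()
    read_after_close_failed = False
except ValueError:
    read_after_close_failed = True

print(was_closed)                # True
print(read_after_close_failed)   # True

temp_dir.cleanup()
```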

Reading text from a file

If you open a file because you want to read text from it, then there are methods you can call to read that text. There are a number of methods available, but we'll only need a small handful of them, so let's focus on the ones we need; we'll see others later if we find a use for them.

The readline() method reads a line of text from the file. The file object has a sort of "cursor" inside of it, which keeps track of our current position in the file; each time we read a line, that cursor is moved to the beginning of the next line, so that each subsequent call to readline() gives us back the next line of text that we haven't yet seen.

>>> the_file = open('D:\\Examples\\data\\myfile.txt', 'r')
>>> the_file.readline()
"'Boo'\n"
>>> the_file.readline()
'is\n'
>>> the_file.readline()
'happy\n'
>>> the_file.readline()
'today'
>>> the_file.readline()
''
>>> the_file.readline()
''
>>> the_file.close()

The contents of the file we're reading in the example above look like this, with newlines on the end of every line except the last one.

'Boo'
is
happy
today

There are a couple of wrinkles in the example above worth noting.

Each time we called readline(), we got back a line of text with a newline on the end of it. The only time that wasn't true is when we reached the last line — which, in the file, didn't have one.
When we had reached the end of the file already, every subsequent call to readline() gave us back the empty string. This provides us a handy way to know that we've reached the end of the file.

Given those two facts, we can write a loop that prints the contents of a file.

the_file = open('D:\\Examples\\data\\myfile.txt', 'r')

while True:
    line = the_file.readline()

    if line == '':
        break
    elif line.endswith('\n'):
        line = line[:-1]

    print(line)

the_file.close()

The one trick that might seem strange there is this part: line = line[:-1]. Recall that this is slice notation, which can legally be done on strings (and returns a substring of that string). When line is a string, the expression line[:-1] gives you back a string containing everything but the last character of line. We're using this technique to eliminate the newline if it's there.
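A quick sketch shows why the loop checks endswith('\n') before slicing. The helper name without_newline is hypothetical, introduced only for this illustration.

```python
line = 'happy\n'
trimmed = line[:-1]                # 'happy'

# The last line of our file has no newline, so slicing blindly
# would chop off a character we wanted to keep.
last_line = 'today'
wrongly_trimmed = last_line[:-1]   # 'toda'


def without_newline(line):
    # Hypothetical helper: remove one trailing newline, if there is one.
    if line.endswith('\n'):
        return line[:-1]

    return line


print(without_newline('happy\n'))  # happy
print(without_newline('today'))    # today
```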

Iterating through every line of a file is so common that there are a couple of techniques that automate it. One of them is the method readlines(), which returns all of the lines of the file at once, instead of one at a time; what you get back is a list of strings.

>>> the_file = open('D:\\Examples\\data\\myfile.txt', 'r')
>>> the_file.readlines()
["'Boo'\n", 'is\n', 'happy\n', 'today']
>>> the_file.close()

The good news is that readlines() provides a single method you can call to read an entire text file in a form that's handy to use: a list of its lines. However, the bad news is that it reads the entire file into that list. If you have no need to store the entire file at once, you might instead want to process one line at a time. It turns out that file objects that read text can be iterated using a for loop, in which case there is one iteration of the loop for each line of text. As with readline() and readlines(), you'll get the newline on the end of each line. Using that technique, we could rewrite our loop that prints the contents of a file more simply this way.

the_file = open('D:\\Examples\\data\\myfile.txt', 'r')

for line in the_file:
    if line.endswith('\n'):
        line = line[:-1]

    print(line)

the_file.close()
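To see the difference between the two techniques side by side, here is a minimal sketch that counts the lines of a file both ways. It assumes a scratch file created with the standard tempfile module and filled with the same four lines as our example; the path and the variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file containing the same text as the example file.
temp_dir = tempfile.TemporaryDirectory()
path = os.path.join(temp_dir.name, 'myfile.txt')

setup_file = open(path, 'w')
setup_file.write("'Boo'\nis\nhappy\ntoday")
setup_file.close()

# Technique 1: readlines() builds a list holding every line at once.
the_file = open(path, 'r')
all_lines = the_file.readlines()
the_file.close()

# Technique 2: a for loop sees one line at a time, so we only ever
# need to remember the running count, not the lines themselves.
line_count = 0
the_file = open(path, 'r')

for line in the_file:
    line_count += 1

the_file.close()

print(len(all_lines))   # 4
print(line_count)       # 4

temp_dir.cleanup()
```

For a file this small, the difference doesn't matter; for a very large file, the second approach avoids holding all of its text in memory at the same time.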

Writing text to a file

Opening a file to write text into it is similar to how we opened it for reading; the only difference is the second argument we pass to open().

>>> output_file = open('D:\\Examples\\data\\stuff.txt', 'w')

The file object you'll get back from open() turns out to have the same type.

>>> type(output_file)
<class '_io.TextIOWrapper'>

However, it is configured differently, expecting to write text into the file instead of read from it. In fact, we can ask a file object what its mode is, which means whether it is intended to read or write, by accessing its mode attribute. (Note that mode is not a method, so we don't put parentheses after its name. It's not something we can call; it's more akin to a variable that lives inside the object.)

>>> output_file.mode
'w'

Once you've got a file object whose mode is writing, you can call the write() method to write text into it. You can pass only one argument to write(), a string, and whatever text is in that string will then be written to the file. Newlines aren't added by default, so if you want them, you'll need to include them in that string. The write() method returns the number of characters it wrote, which is why the Python shell displays a number after each call.

>>> output_file.write('Hello there\n')
12
>>> output_file.write('Boo is ')
7
>>> output_file.write('perfect today')
13
>>> output_file.close()

After writing this text and closing the file, the file's contents will be:

Hello there
Boo is perfect today
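One way to convince ourselves of this is to write the text and then read it back. The sketch below assumes a scratch file created with the standard tempfile module rather than the D: drive path used above; the path and the variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch location, so the sketch doesn't depend on a D: drive.
temp_dir = tempfile.TemporaryDirectory()
path = os.path.join(temp_dir.name, 'stuff.txt')

output_file = open(path, 'w')

# write() returns the number of characters written, newline included.
characters_written = output_file.write('Hello there\n')
output_file.write('Boo is ')
output_file.write('perfect today')
output_file.close()

# Now open the same file for reading and see what's there.
input_file = open(path, 'r')
lines = input_file.readlines()
input_file.close()

print(characters_written)   # 12
print(lines)                # ['Hello there\n', 'Boo is perfect today']

temp_dir.cleanup()
```

Notice that the second and third calls to write() ended up on the same line of the file, because neither of them wrote a newline.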

Note that when you've written text to a file, closing the file when you're done becomes more than just good hygiene; it's essential to writing a program that works. It turns out that there's more going on than meets the eye when you write to a file. Writing data into a file on, say, a hard disk involves a fair amount of overhead, so that writing a lot of it isn't much slower than writing only a tiny amount. (It takes longer for a storage device to decide where to write it than the actual writing, as it turns out.) For this reason, file objects use a technique called buffering, which means that they don't write the text immediately. Instead, they store it internally in what's called a buffer. Once in a while, when there's enough text stored in the file object to make it worth writing to the file, the text is written and the buffer is emptied. If you're writing a lot of text to a file, but writing it a little bit at a time, this can make the entire process significantly faster, because the overhead of all of the tiny writes is eliminated.

The catch is that text in the buffer reaches the file only when the buffer is flushed. (Flushing is the act of taking the text in the buffer and writing it into the file, then emptying the buffer.) One thing that causes a buffer to flush is the buffer's capacity being exceeded; that happens automatically at some point. But when you're done writing to the file, there will probably be text still in the buffer. One of the things that happens when you close a file is that the buffer is flushed. So you'll really want to be sure that you close files when you're done with them, particularly when you're writing to them; otherwise, text that was buffered but never flushed may never appear in the file, even though your program ran to completion with no apparent errors.
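File objects also have a flush() method, which asks for the buffered text to be written right away without closing the file. The sketch below is one way to watch that happen; it assumes a scratch file created with the standard tempfile module and uses the read() method, which returns all of a file's remaining text as one string. The path and the variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch location, so the sketch doesn't depend on a D: drive.
temp_dir = tempfile.TemporaryDirectory()
path = os.path.join(temp_dir.name, 'stuff.txt')

output_file = open(path, 'w')
output_file.write('Boo is happy')

# Ask for the buffered text to be written to the file now.
output_file.flush()

# A second file object, opened for reading, can now see that text.
reader = open(path, 'r')
seen_after_flush = reader.read()
reader.close()

# Closing flushes whatever is still buffered, so we still close when done.
output_file.close()

print(seen_after_flush)   # Boo is happy

temp_dir.cleanup()
```

In most programs we won't need to call flush() ourselves; closing the file when we're finished writing is enough.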