ICS 33 Fall 2024
Notes and Examples: Searching


Background

You've previously learned about more than one of Python's built-in data structures: lists, tuples, sets, and dictionaries. While they differ in the details of their implementations and the nature of the problems they solve best, they share the characteristic that we store objects in them that we may want to find again later. We say that the problem of finding an object in a data structure is known as searching, that the object we're searching for is a search key, and that an algorithm that solves such a problem is a search algorithm.

So, how do we solve the problem of searching? It turns out that there isn't a single solution to the problem, because the devil is in the details.

Among the ways that data structures differ is in how they organize their contents and what their operations cost, which shapes how we can best search them. While we'll defer a more thorough exploration of this topic to future courses such as ICS 46, we have the tools to begin that exploration now, by considering how we might search a Python list. So, let's do that before moving on to other things. Along the way, we'll develop our ability to do the kinds of asymptotic analysis that we'll do repeatedly in this course, so we can focus on understanding not only how things work, but also how well they work.


The mechanics of Python lists

Before we consider the problem of searching in a Python list, let's make sure we understand the mechanics of how — and how well — Python lists work. After all, we can't analyze the costs of an algorithm if we don't know the costs of the operations the algorithm will perform. So, if we want to analyze an algorithm that manipulates a Python list, we'll need to understand the costs of those manipulations, which means we'll need to know some things about how Python lists work behind the scenes.

We can think of a Python list as being made up of three things: the number of elements currently stored in it (its size), the number of elements it currently has room for (its capacity), and a separate, contiguous block of memory containing references to its elements.

As an example, suppose we created a list like this one in the Python shell.


>>> x = [13, 18, 11, 7]

This is what might be created behind the scenes by Python.

[Figure: a depiction of the internals of a Python list]

Notably, the list's elements are stored contiguously, which is to say that each reference to an element immediately follows the reference to the previous one. This turns out to be more than a curiosity; it's a vitally important aspect of understanding how lists work in Python, because it tells us the cost of accessing their elements. Whenever we access an element in a list, we do so using an index; in other words, we access an element based on its location in the list. If we want to access the element at index i, Python can play a simple trick to find it cheaply: because the references are all the same size and stored contiguously, the location of the reference at index i can be calculated directly, by multiplying i by the fixed size of a reference and adding that to the location where the block of references begins, which takes the same small amount of time no matter how large i or the list is.

Negative indexing doesn't change this outcome all that much, because Python can convert a negative index like -3 into the equivalent non-negative one by adding it to the list's size, which can be done in constant time, as well.
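
For instance, with the example list above, a negative index reaches exactly the same element as the equivalent non-negative one.


>>> x = [13, 18, 11, 7]
>>> x[2]
11
>>> x[-3]    # the same element as x[len(x) - 3], which is x[1]
18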

The primary consequence of this design is that we can always find an element in constant time given its index (i.e., as long as we know where to look). Sadly, we won't always be so fortunate; sometimes it's not clear where we should be looking, which is why we need a search algorithm.


Sequential (or linear) search

Suppose we have a Python list in which we've stored a collection of integers. Suppose, further, that we haven't intentionally organized them in any particular order — perhaps they've arisen as user input, or they've been read from a file that we didn't create — so that any integer might potentially be stored anywhere. And, finally, suppose that we want to determine whether the integer 37 is stored somewhere in the list and, if so, where it is. How do we do it? And, more importantly, once we come up with an algorithm to solve the problem, how can we assess its costs?

Knowing nothing whatsoever about the arrangement of integers in the list, our options are fairly limited. With anywhere we look being equally likely to be the "right place," there's no advantage in considering certain ones before others. Similarly, with the relative cost of accessing a list element being the same regardless of its index, the data structure doesn't force our hand; we can look at any elements we'd like, in any order we'd like. But, we still ought to avoid looking in places where we've already been — once we've established that the element we seek is not at index i, it would be wasteful to look there again — so there will be value in being more systematic than, say, choosing elements at random.

In this situation, the most straightforward algorithm is also the best choice. We could first look at the element at index 0. If that isn't the one we want, we could continue to index 1, then index 2, and so on. As soon as we find the element we're looking for, we can stop, knowing the search was successful. Only after we've looked at the list's last element without finding what we're looking for can we conclude that the search was unsuccessful.

That algorithm is not just straightforward; it's also common enough that it's got two well-known names: sequential search and linear search. Implementing sequential search in Python can be done with a relatively straightforward loop; the version below returns the index at which the search key was found, or None if it wasn't found at all, though this particular design choice is hardly sacrosanct.


def sequential_search(items: list, key: object) -> int | None:
    """Return the first index at which key appears in items, or None if it's absent."""
    for index in range(len(items)):
        if items[index] == key:
            return index

    return None
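
A quick shell session demonstrates both outcomes; note that the shell prints nothing for a bare None, so the unsuccessful case checks for it explicitly.


>>> sequential_search([13, 18, 11, 7], 11)
2
>>> sequential_search([13, 18, 11, 7], 37) is None
True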

Now that we understand how sequential search works, we should also consider how well it works, by applying the asymptotic analysis techniques we've learned.

Asymptotic analysis of memory

When we asymptotically analyze the memory usage of an algorithm, we're generally measuring the additional memory the algorithm itself needs, beyond what's required to store its inputs. In our case, our algorithm searches in a given list for a given search key, but we presume that both the list and the key existed already; in other words, they aren't properties of our algorithm, as much as they're properties of the problem our algorithm is meant to solve.

In fact, our algorithm only needs to store one additional piece of information: the current index at which we're searching. At each step, that index increases by 1, but there's only one of them, whether the list contains two elements or two million elements. We can safely say that the size of an integer is constant — or, at the very least, does not significantly increase as the size of the list does — so, ultimately, the amount of memory required by our algorithm is O(1); it doesn't change, no matter how many elements are in the list or how many we have to look at before we're done.

Asymptotic analysis of time

Our intuition tells us that there's a more complicated story describing the time required, though. Sometimes, we get lucky and the first element we look at is the one we want. Other times, we're unlucky and only the last element we look at — after having already looked at all the others — is the one we want. And, of course, sometimes we don't find what we're looking for at all. How do we boil all of those possibilities down to a simple essence?

One way is to rely on the fact that O-notation is an upper bound, but not necessarily a tight one. For example, we saw that the function 3n is O(n²). So, if we consider all of the functions that describe all of the possible behaviors of our sequential search — the fastest, the slowest, and everything in between — and can determine an O-notation that fits above all of them, then we'll have a single answer.

Under what circumstances does our algorithm take the most time? There are two situations in which we have to look at every element in the list: when the search key is stored in the last position we check, and when the search key isn't in the list at all.

In those cases, given a list whose size is n, the sum total of the work we're doing boils down to initializing an index, then, for each of the n elements, accessing the element at that index and comparing it to the search key.

We can categorize this work by rearranging the steps: a constant amount of setup that happens once, plus a constant amount of work that's repeated for each of the n elements.

So, in total, in this case we've spent O(1) + n * (O(1) + O(1)) = O(1) + n * O(1) = O(1) + O(n) = O(n) time. All in all, we could correctly say that sequential search takes O(n) time to run.

Still, this is an unsatisfying answer, despite the fact that it's technically true, because it hides an important reality: While sequential search can take this long to run, it doesn't have to. Sometimes, we get lucky. How can we characterize that difference? When confronted with a problem like this, we often consider (and separately analyze) three situations: the best case, the worst case, and the average case.

Best-, worst-, and average-case analysis

A worst-case analysis of the time required for sequential search is what we've already done, since we started by considering the worst situation. The worst-case time required for sequential search is O(n), for the reasons we already discussed.

But, what about the flip side? From a time perspective, what's the best thing that can happen in a sequential search of n elements? If we look at the first element and it's the one we want, we'll be done; we won't need to look at any of the others. In that case, we'll have initialized the current index to 0, accessed the element at index 0, and that will have been it; two steps, regardless of the size of the list. All in all, then, we'd say that the best-case time required for sequential search is O(1).

The problem we have now is that the best- and worst-case analyses are leading us to very different conclusions. When we're being optimistic, sequential search seems like a fantastic search algorithm — as good as it gets! When we're being pessimistic, sequential search seems much less of a win. So, what's the reality of the situation? An average-case analysis can give us a better idea. If we assume that there's an equal probability of finding the search key at any given index, then, on average, we would expect to look at n/2 elements in a list of size n. In total, we'll have spent O(1) + n/2 * (O(1) + O(1)) = O(1) + n/2 * O(1) = O(1) + O(n) = O(n) time. (Consider why it's O(n) and not O(n/2). What's the difference between n and n/2? The difference is a constant factor, 1/2, and constant factors are eliminated from O-notations.)
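
To make the worst- and average-case figures concrete, here's a small experiment (a sketch, not part of the notes) that counts how many elements a search examines, using a hypothetical sequential_search_count variant of the function above.


import random

def sequential_search_count(items: list, key: object) -> tuple[int | None, int]:
    """Like sequential_search, but also reports how many elements were examined."""
    examined = 0
    for index in range(len(items)):
        examined += 1
        if items[index] == key:
            return index, examined
    return None, examined

items = list(range(1000))
random.shuffle(items)

# Worst case: the key is absent, so all 1,000 elements are examined.
print(sequential_search_count(items, -1)[1])

# Average case: searching once for each key that's present examines a little over
# n/2 elements per search (exactly (n + 1) / 2, which is 500.5 here).
total = sum(sequential_search_count(items, key)[1] for key in range(1000))
print(total / len(items))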

That the average- and worst-case analyses have the same O-notation doesn't tell us that the average- and worst-case times are identical; it just tells us that they grow at the same rate. And, in general, it tells us that if we have a large number of elements to search, we might want to consider whether there's a significantly faster solution to the problem of searching them.


Binary search

Sequential search has the benefit of being simple and workable on any Python list, no matter how its elements are organized. In fact, it's the disorganization of the list's elements that led to the algorithm's design; when we have no way to guess where something might be, our only recourse is to look everywhere, so we might as well do that job systematically, so we never look in the same place twice.

What if we instead have a list of integers that we know is already sorted into ascending order, which is to say that the smallest of the integers is at index 0, the next-smallest is at index 1, and so on? How might we benefit from that knowledge? If we look at the element at index 100 and find that its value is 1761, then we'll know something about the other elements that we haven't yet seen: every element at an index less than 100 is no larger than 1761, and every element at an index greater than 100 is no smaller than 1761.

If our goal is searching, this is a powerful piece of knowledge indeed. If, for example, we're searching for 1301, then we'll know for sure that the element, if present in the list, will be at an index between 0 and 99 (inclusive). Consequently, we can safely skip every element at an index greater than 100; there's no point in looking at any of them, because they can't possibly be the one we want.

That same trick is one we can play repeatedly to good effect, so that we'll be able to dramatically reduce the number of elements we'll need to search. Suppose we have the following list of integers, sorted in ascending order, and that we'd like to search for the key 22.

3 9 15 17 22 27 34 36 37 41 48

We can begin by looking at the middle element. Notably, since we can ask the size of a Python list without counting the elements individually, we can quickly determine that the middle element is the one at index 5. (In the depictions that follow, the element we're currently examining is shown in brackets, and elements we've eliminated from consideration are shown as dots.)

3 9 15 17 22 [27] 34 36 37 41 48

Sadly, 27 is not the search key, but we do learn something useful: Since 22 is smaller than 27, we can be sure that it won't be in any index greater than or equal to 5, so we can eliminate all of those elements from consideration without ever looking at them, leaving us only the elements in indexes 0 through 4 (inclusive), the middle element of which is at index (0 + 4) // 2 = 2.

3 9 [15] 17 22 · · · · · ·

15 isn't the key we want, either, but since 22 is larger than 15, we can eliminate 15 and every element that comes before it, leaving us only two elements: those at indexes 3 and 4.

· · · [17] 22 · · · · · ·

After eliminating 17 from consideration similarly, we're left with only one element, which happens to be the one we want. (Had we been searching for 23 instead, we'd have made all the same decisions and, at this point, we could be confident that 23 was nowhere to be found, even though we've only looked at some of the elements in the list.)

· · · · [22] · · · · · ·

We could write that algorithm in Python as follows.


def binary_search(items: list, key: object) -> int | None:
    """Return an index at which key appears in the sorted list items, or None if it's absent."""
    first = 0
    last = len(items) - 1

    while first <= last:
        middle = (first + last) // 2

        if items[middle] == key:
            return middle
        elif items[middle] > key:
            # The key, if present, must lie to the left of the middle.
            last = middle - 1
        else:
            # The key, if present, must lie to the right of the middle.
            first = middle + 1

    return None
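
Trying it on the list from the walkthrough shows a successful search and an unsuccessful one.


>>> binary_search([3, 9, 15, 17, 22, 27, 34, 36, 37, 41, 48], 22)
4
>>> print(binary_search([3, 9, 15, 17, 22, 27, 34, 36, 37, 41, 48], 23))
None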

So, given a list of 11 elements, we were able to search for an element — even unsuccessfully — by looking at only four of them. But what is the actual proportion of elements we'll need to search, speaking more generally? And how does that proportion change as n grows?

Asymptotic analysis of time

To consider the worst-case time required to perform a binary search, let's think about how the time required changes as n grows. Each element we visit either turns out to be the one we're looking for or lets us eliminate roughly half of the remaining elements from consideration, so doubling the size of the list adds only one more element that we might have to visit.

If we consider how the number of visited elements, v, corresponds to the number of elements in the list, n, we'd find that it follows a formula similar to this one, since each additional visited element doubles the number of elements we can search.

n = 2^v

But, of course, we want to know the opposite: How many elements will we need to visit as a function of the number of elements in the list? Mathematics tells us the answer to this.

v = log₂ n

So, we know that, at worst, if we have n elements to search, we'll visit log₂ n of them. Each would take a constant amount of time to visit — checking it for equality, and possibly for ordering. In total, then, we'd say that this takes O(log n) time.
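
To see that growth concretely, here's a small experiment (a sketch, not part of the notes) using a hypothetical binary_search_count variant of the function above that also counts the elements it visits.


def binary_search_count(items: list, key: object) -> tuple[int | None, int]:
    """Like binary_search, but also reports how many elements were examined."""
    first, last, visited = 0, len(items) - 1, 0

    while first <= last:
        middle = (first + last) // 2
        visited += 1

        if items[middle] == key:
            return middle, visited
        elif items[middle] > key:
            last = middle - 1
        else:
            first = middle + 1

    return None, visited

# An unsuccessful search of a million sorted elements visits only 20 of them,
# which is roughly log₂ 1,000,000.
print(binary_search_count(list(range(1_000_000)), 1_000_001)[1])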

Asymptotic analysis of memory

Like sequential search, binary search uses a constant number of integers to track its progress: first and last to keep track of what range of elements is still under consideration, and middle to store the index of the middle at each step (to avoid recalculating it a handful of times). So, all in all, it requires only O(1) memory.


Interlude: Logarithms in asymptotic analysis

You'll find that logarithms are a recurring theme in algorithm analysis, because so many algorithms play the kinds of tricks we're playing here: splitting up a problem into equal-sized pieces, then finding an inexpensive way to disregard all but one of those pieces. You've likely seen before that mathematics defines a logarithm as the inverse of exponentiation, which is why it forms the answer to questions like "How many times can I divide this problem up into equally sized subproblems, then divide one of those subproblems similarly, and so on, until there's no problem left?"

Why don't we include the base of logarithms in O-notations?

When we use logarithms in O-notations, you'll notice that we don't generally write a base; we're apt to simply write O(log n), rather than something more specific like O(log₂ n) or O(log₁₀ n). The reason follows from the formula that is used to convert between logarithms of different bases, which is this one.

log_a n / log_b n = log_a b

The ratio, ultimately, is a constant. So, for example, the ratio of log₂ n to log₁₆ n is log₂ 16 = 4, which is a constant. Consequently, when we use logarithms in O-notations, we need not write a base, because the difference between any two bases is simply a constant factor, and constant factors are considered irrelevant in O-notations.
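
A quick check in the Python shell confirms the constant for that example.


>>> import math
>>> math.log2(16)    # the ratio of log₂ n to log₁₆ n, for any n
4.0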

How much better is logarithmic than linear?

We've seen that binary search takes O(log n) time in the worst case, while sequential search requires O(n) time instead. From the explanation of these algorithms, your intuition likely tells you that logarithmic is better, but how much better is it? The answer is much, much better indeed!

To understand why, consider some values of n and log₂ n.

log₂ n          n
     0          1
     1          2
     2          4
     3          8
     4         16
     5         32
     6         64
     7        128
     8        256
     9        512
    10      1,024
    11      2,048
    12      4,096
    13      8,192
    14     16,384
    15     32,768
    16     65,536
    17    131,072
    18    262,144
    19    524,288
    20  1,048,576
    21  2,097,152
    22  4,194,304

The second column represents the number of elements in a sorted Python list, and the first column represents (roughly) the number of elements you'd need to visit in a binary search of that list. So, for example, you could binary search a list of 1,000,000 elements by visiting approximately 20 of them. With 1,000,000,000 elements, that number only increases to about 30. The moral of this story is that logarithms grow very, very slowly as n grows, and increasingly slowly the bigger n gets. Logarithmic time, generally, is very good news.
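
If you'd like to sanity-check those figures, Python's math module makes it easy.


>>> import math
>>> math.ceil(math.log2(1_000_000))
20
>>> math.ceil(math.log2(1_000_000_000))
30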


Why not always use binary search?

Before we leave this topic, it's worth being sure we've thought about one other thing. If binary search is so much faster than sequential search, why would we ever want to use sequential search? And why will we return to the topic of searching in depth in future courses, exploring not only additional search algorithms, but also fancier data structures that enable some of them? In other words, why isn't binary search of one-dimensional lists the final word in searching? Is it that we greedily want searching to take O(1) time instead of O(log n) time? Or is there something more fundamental going on here?

The problem isn't the searching, as it turns out. If you have a large list that's sorted and your next need is to search it, binary search is a great way to do so. You'd be hard-pressed to do better than that, in the general case.

What if you have a list that isn't sorted already, though? As we'll see in future courses, the best that general-purpose sorting algorithms can do is O(n log n) time — log n times worse than linear — which is a whole lot more time than we'd be saving, unless you needed to search it many, many times.

What if you have a list that's sorted, but you'll be adding or removing its elements fairly often? If so, the cost of those modifications will be much higher when you need to keep the elements sorted. Adding an element to an unsorted list can be done by simply adding to the end, while adding an element to a sorted list requires finding the appropriate insertion point (though you could use binary search for that) and then sliding subsequent elements out of the way to make room for the new one. If there are n elements in the list, we might need to slide all n of them forward, in which case the cost of the insertion would be O(n). Removing elements from sorted lists is subject to the same kind of problem; subsequent elements need to be slid back to fill in the empty space.
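
To make that cost concrete, here's a minimal sketch of sorted insertion (it uses the standard library's bisect module, which these notes don't otherwise cover; the insert_sorted function is ours, not part of the notes).


import bisect

def insert_sorted(items: list, value: object) -> None:
    """Insert value into an already-sorted list, keeping the list sorted."""
    where = bisect.bisect_left(items, value)    # a binary search for the insertion point: O(log n)
    items.insert(where, value)                  # slides later elements over to make room: O(n)

items = [3, 9, 15, 17, 22, 27, 34, 36, 37, 41, 48]
insert_sorted(items, 20)    # items is now [3, 9, 15, 17, 20, 22, 27, 34, 36, 37, 41, 48]

The standard library's bisect.insort does the same job in one call; either way, it's the sliding step that dominates the cost.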

So, unfortunately, the cost of maintaining a sorted list can be prohibitive, unless there are few modifications relative to the number of searches that will be done. That doesn't mean that all hope is lost; it just means we'll need better tools, which we'll learn about in future courses, allowing us to make cheap modifications and perform cheap searches. All things in time.