ICS 45C Spring 2022, Notes and Examples: Separate Compilation

Background

For the most part, programs in almost any programming language are written as text in individual files. In lecture, we've seen programs written entirely in a single file called main.cpp; we call these .cpp files source files. As with any language whose programs are written in files, once C++ programs grow beyond a certain size, it becomes important to split them into separate source files of a practical size, rather than writing them as one giant source file. The mechanisms C++ provides for this demand more painstaking effort than what you might have seen in other programming languages, requiring you to arrange your code more carefully and to understand more precisely how the compiler works.

Part of the reason C++ has a clunkier mechanism for tying together the code in separate source files is historical baggage. In particular, C++ compilers compile only a single source file at a time, with no visibility into other source files, a reality motivated largely by a distant past in which available memory and processing power were orders of magnitude smaller than they are today. However, C++ compilers are also required to do static type checking, meaning that they verify that every use of a variable, function, etc., uses the correct types (e.g., only a value compatible with an integer can ever be stored in an int variable). Where it gets tricky is when the compiler needs to verify the use of something in one source file that's defined in another, a problem you'll encounter almost immediately once you write a program with multiple source files. To understand the mechanism for handling this problem, we need a slightly deeper understanding of how a C++ compiler works.

Declarations and definitions

Broadly, C++ programs are built out of two things: declarations and definitions. Recall the distinction:

Declarations allow you to tell the compiler about the existence of something, associating a name with a type. There are different kinds of declarations — variable declarations, function declarations, class declarations — and each has the job of specifying the type that a name has. Other than a few built-in types, such as int and char, no name can be used in a C++ source file until after the point where it is declared.
Definitions give full meaning to something; they are said to "fully elaborate" the meaning of a name. Like declarations, there are different kinds of definitions (variable definitions, function definitions, etc.), and each has the job of solidifying the meaning of some name. In the case of the variables we've seen so far, the declaration is also the definition, because the declaration includes the type, and the type is sufficient information for the compiler to know how much memory to allocate for it. (Note, in particular, that the distinction between the declaration and the definition of a variable has nothing to do with whether it's been assigned a value; that's separate, because variables are not required to be initialized.) In the case of a function, on the other hand, the declaration is the function's signature (i.e., its name, parameters, and return type), while the definition includes its body.
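
As a minimal sketch of the distinction, the listing below declares a function, uses it, and only afterward defines it. The names square and sumOfSquares are hypothetical, chosen just for illustration.

```cpp
// Function declaration: name, parameters, and return type, but no body.
int square(int n);

// This function can call square() even though square()'s body has not
// appeared yet; the declaration above is all the type checker needs.
int sumOfSquares(int a, int b)
{
    // For a local variable like this one, the declaration is also the
    // definition, whether or not an initial value is given.
    int total;
    total = square(a) + square(b);
    return total;
}

// Function definition: the same signature, now with a body.
int square(int n)
{
    return n * n;
}
```

If the declaration of square() were removed, the call inside sumOfSquares() would fail to compile, because the name would be used before the point where it is declared.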

Compilation and linking

Unlike in some programming languages, a C++ program typically needs to be "built" (i.e., an executable version of the program needs to be constructed) before it can be executed. The process of building a C++ program occurs, broadly, in two phases: compilation and linking.

Compilation is done one source file at a time, with the compiler starting fresh each time, having forgotten everything it knew about other source files. Part of the job of compilation is type checking, but type checking can only be done if each source file contains a declaration of every name it uses, even those that are defined in other source files. Each source file is generally compiled into a single object file, which we can think of as being "mostly" machine code, but with certain things left as placeholders — most especially, uses of things defined in other source files are left out, to be filled in later.
Linking is the process of taking all of the object files built by the compiler, along with any libraries (the C++ Standard Library or third-party libraries) required by the program, and stitching them together into a single executable program. A big part of what linkers do is resolving the placeholders left behind by the compilation process; if a.cpp calls a function defined in b.cpp, it's the linker that replaces the "unresolved" call to that function in a's object file with an actual call to b's function.

Understanding these steps leads us to three rules that we'll need to follow:

No entity (e.g., a variable, a function, etc.) in C++ can be used in a source file without a declaration for that name appearing in that source file first. (Remember that a declaration is not the same thing as a definition.)
In a particular source file in C++, some kinds of declarations are not allowed to be repeated, though others can be. (For example, function declarations can be repeated multiple times, but, as we'll see later this quarter, class declarations cannot.) As a general rule, we're best off preventing multiple declarations of all entities, just to be safe.
No entity in C++ can be defined more than once, not even in separate source files. This rule has a name in C++; it's known as the One Definition Rule, and it exists for a good reason. Suppose that a function f() is defined (i.e., its body appears) in both a.cpp and b.cpp, and that the function is declared and called in c.cpp. When the linker tries to resolve c.cpp's call to f(), it won't know whether to resolve it as a call to the version of f() defined in a.cpp or b.cpp, so it will instead trigger a link-time error and the program will fail to build.
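
The three rules can be seen together in one small sketch. The function names triple and tripleTwice are hypothetical, and everything is shown in a single source file for brevity.

```cpp
// Rule 1: a name must be declared before it is used.
int triple(int n);

// Rule 2: a function declaration may legally be repeated in the same
// source file, though we would normally avoid doing so.
int triple(int n);

// Legal because a declaration of triple() has already appeared.
int tripleTwice(int n)
{
    return triple(triple(n));
}

// Rule 3: the one and only definition of triple() in the whole program.
int triple(int n)
{
    return 3 * n;
}

// Uncommenting the line below would break the build. In this same file
// it is a compile-time redefinition error; placed in a different source
// file, it would instead surface as a link-time error.
// int triple(int n) { return n + n + n; }
```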

As you might expect, a pattern evolved in C++ for dealing with these three rules. If you follow the pattern, none of these three rules will be violated.

Source and header files

To divide our C++ programs into what you might call "modules" — separate groupings of closely-related functions, classes, etc. — we write our code in two kinds of files: source files and header files.

A source file contains the definitions of everything necessary to implement a module. If a module is a collection of functions — say, a collection of mathematical utility functions — those functions would all be defined in the module's source file. For reasons we'll see later, each module has only one source file.
A header file contains the declarations of those entities that are intended to be visible to other modules. Notably, though, header files do not contain definitions. So, for example, our module of mathematical utility functions would have a header file that contained the declarations of those utility functions (i.e., their signatures) but not their definitions (i.e., no bodies). Also, if there were functions in the module that were helpers, but were not intended to be "public" and available to other modules, we would simply leave their declarations out of the header file. Generally speaking, each module has exactly one header file (or none, if the module doesn't make anything publicly available to other modules).
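
Here is a rough sketch of how the mathematical utility module described above might be laid out. The file names (mathutils.hpp, mathutils.cpp) and the functions gcd, lcm, absoluteValue, and reducedDenominator are all assumptions made for illustration. The three files are shown as one listing so that it compiles as written; in a real program each part would live in its own file and the commented-out #include lines would be live.

```cpp
// ---- mathutils.hpp (hypothetical header: declarations only) ----
int gcd(int a, int b);
int lcm(int a, int b);

// ---- mathutils.cpp (hypothetical source file: definitions) ----
// #include "mathutils.hpp"

// A helper that other modules are not meant to call, so its
// declaration is deliberately left out of mathutils.hpp.
int absoluteValue(int n)
{
    return n < 0 ? -n : n;
}

// Greatest common divisor, computed with Euclid's algorithm.
int gcd(int a, int b)
{
    a = absoluteValue(a);
    b = absoluteValue(b);

    while (b != 0)
    {
        int remainder = a % b;
        a = b;
        b = remainder;
    }

    return a;
}

// Least common multiple, defined here as 0 when either input is 0.
int lcm(int a, int b)
{
    if (a == 0 || b == 0)
    {
        return 0;
    }

    return absoluteValue(a / gcd(a, b) * b);
}

// ---- main.cpp (hypothetical client of the module) ----
// #include "mathutils.hpp"

// Only gcd() and lcm() are visible here, because only their
// declarations appear in the header.
int reducedDenominator(int numerator, int denominator)
{
    return denominator / gcd(numerator, denominator);
}
```

Notice that the header carries only signatures, the source file carries every body, and the client depends on nothing but the header.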

Splitting our code up this way might seem like a bit of a burden, but the necessity arises from the rules we specified above.

Header files contain a set of declarations that would be needed in one source file in order to use the entities defined by another. This avoids the problem of having to copy and paste the declarations into many source files, which would require cascading changes in many places whenever one of these declarations needed to change (e.g., a parameter is added to a commonly-used function).
Header files are included in source files when needed. If a source file includes a header file, the declarations in that header file become available in the source file. Note, too, that the header file is included in a specific place in a source file, so the rules about the order in which declarations are written are important to remember.
We never write definitions in header files, so that we avoid the problem of creating multiple copies of the same definition (i.e., one in every source file that includes a header file), which can quite easily cause link-time errors. (As we'll discover later in the quarter, there are exceptions to this rule, but nothing we've seen so far in the course can be defined in a header file without causing potential link-time errors.)
Because some kinds of declarations are not allowed to be repeated in the same source file, we should prevent the same header file from ever being included more than once in the same source file. There's a relatively simple pattern that we can use to solve this problem, which we must use in every header file to be safe. We'll call this pattern the multiple inclusion prevention pattern.
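
One widely used form of such a pattern is the preprocessor "include guard," sketched below; this is an assumption about what the pattern looks like, since these notes have not spelled it out. The header name point.hpp, the macro name POINT_HPP, and the Point struct are hypothetical. To keep the sketch compilable as a single listing, the header's text is written out twice, simulating what the compiler would see if point.hpp were included twice in one source file.

```cpp
// ---- first inclusion of a hypothetical point.hpp ----
#ifndef POINT_HPP       // true the first time: POINT_HPP is not yet defined
#define POINT_HPP       // remember that this header has now been seen

struct Point
{
    double x;
    double y;
};

#endif // POINT_HPP

// ---- second inclusion of the very same text ----
#ifndef POINT_HPP       // false this time, so everything up to the
#define POINT_HPP       // matching #endif is skipped by the preprocessor

struct Point
{
    double x;
    double y;
};

#endif // POINT_HPP

// Point was defined exactly once, so this compiles without complaint.
double sumOfCoordinates(Point p)
{
    return p.x + p.y;
}
```

Without the #ifndef, #define, and #endif lines, the second copy of Point would be rejected by the compiler as a redefinition.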

Naming conventions for source and header files

In this course, when we write source files, their names will end in .cpp, while header files will have names that end in .hpp. Note that this is not the only convention in popular use in the world — unlike some programming languages, C++ compilers aren't especially finicky about file naming — but we'll use it, because it (a) establishes a clean distinction between header and source files, and (b) makes clear (with the "pp" on the end, a filesystem-compatible way of saying "++") that the code we're writing is C++ and not C.

If you prefer other naming conventions, that's fine for your own work, but you'll need to use .cpp and .hpp in this course, because the build and test tools for this course assume that you do. (As with many details about tools and techniques, you don't always get to choose what you want when you work on someone else's project; that's worth getting used to.)

Deciding on the boundaries between source files

Now that we've talked about the mechanism for splitting up a program into multiple source files, there's one more thing we need to nail down: How do we decide on the boundaries between them? What are we trying to accomplish when we split up a program, and how finely-grained does that split need to be?

One of the hallmarks of well-written software is what is sometimes called separation of concerns, the principle that you should handle separate issues in separate parts of a program, rather than munging everything together into fewer, larger, more complex functions. (This is one of the reasons that global variables are often said to be so problematic; they munge things together by their nature, spreading a part of your program's knowledge throughout the entire program.)

There are a few ground rules you should follow when you're designing a C++ program — and, in truth, these aren't that different from what you ought to be doing in just about any programming language. Header and source files are the C++ mechanism for doing something you'd want to do in any programming language: Break a large program into its component parts.

Functions should be relatively short, and each of them should have a single job. If you were to write a comment describing a function's purpose, that description shouldn't be very long, or what you're probably describing is more than one function.
Functions that are strongly related to one another should be placed in the same source file. (This is what software engineers often call high cohesion, the idea that things that are near one another belong together.)
Functions in one source file should depend on as few details as possible about how the functions in other source files are implemented. (This is what software engineers often call low coupling, the idea that we want the various parts of our program to be shielded from the details of the others, so that changing one doesn't cause cascading changes to many others.)
We need header files so that functions in one source file can be "seen" by others. If there is a function in one source file X.cpp that needs to be called by functions in other source files, then we need a header file X.hpp that declares it. Most source files have a corresponding header file, because most source files "export" at least one function to others.
Not all functions written in a source file need to be declared in a corresponding header file; only the ones that need to be called from within other source files need to be declared in a header.