ICS 45C Spring 2022
Notes and Examples: Strings
The std::string type
Like almost every programming language, C++ provides a string data type, which implements the notion of a sequence of text characters. As it turns out, C++ does not have a built-in string type; its string type is part of its standard library. Like most types in its standard library, the type is part of the std namespace, so its full name is std::string.
The std::string type plays roughly the same role in C++ that string types play in most programming languages. Strings are more complex than they seem at first glance: memory has to be managed properly, and the question "What is a character?" is trickier in practice than you might think. It's handy to have a type that shields you from the memory-management details automatically. (It should be pointed out, though, that std::string is not aware of Unicode encodings like UTF-8; it will happily store the bytes of UTF-8 text, but it treats them as individual char objects rather than as characters. That limitation won't cause us any trouble in this course, but it's worth understanding that strings are complicated beasts, especially if you want to write software that supports internationalized text, emojis, and so on.)
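As a rough illustration of the Unicode caveat, here is a minimal sketch. The function names are hypothetical, and countCodePoints() assumes its argument holds valid UTF-8. The accented letter in "café" is spelled out as the two bytes 0xC3 0xA9 so the example doesn't depend on how the source file is encoded.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of char objects in a UTF-8 encoded "café".
std::size_t cafeLengthInChars()
{
    std::string s = "caf\xC3\xA9";   // 'c', 'a', 'f', then two bytes for the accented e
    return s.length();               // counts char objects, not visible characters
}

// Hypothetical helper: counts Unicode code points, assuming s is valid UTF-8.
// Every byte of the form 10xxxxxx continues a previous character, so we
// count only the bytes that do not have that form.
std::size_t countCodePoints(const std::string& s)
{
    std::size_t count = 0;

    for (char c : s)
    {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
        {
            ++count;
        }
    }

    return count;
}
```

Under those assumptions, cafeLengthInChars() reports 5 even though a reader sees four characters, while countCodePoints("caf\xC3\xA9") reports 4.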
A std::string object represents a sequence of characters (i.e., a sequence of char objects). It provides a combination of operators and member functions that implement the basic operations you'll likely want to use given one or more std::strings. A few examples follow:
    #include <iostream>
    #include <string>

    ...

    std::string s = "Alex";
    s += " Thornton";    // += appends characters to an existing string

    std::string s2 = "Alex Thornton";

    if (s == s2)    // == compares strings for equality, i.e., the same sequence of characters
    {
        std::cout << "Yep!" << std::endl;
    }
    else
    {
        std::cout << "Nope!" << std::endl;
    }

    // The length() member function returns the number of characters in the string.
    // There is also a member function called size() that does the same thing.
    for (unsigned int i = 0; i < s.length(); i++)
    {
        // The [] notation here asks for one character in the string. Indexing is
        // zero-based, so s[0] would be the first character, s[1] would
        // be the second, and so on.
        std::cout << s[i] << std::endl;
    }
Contrasting the C++ std::string type with strings in Python and Java
Both Python and Java provide a built-in string type. While these types are similar to their counterpart in C++, they differ in a few important ways. Since a lot of you know Python or Java already, it's not a bad idea to contrast the C++ std::string type with the string type(s) you might already know.
Immutability
In both Python and Java, string objects are immutable, which means that they cannot be changed once they're created. At first glance, that might seem strange, because you might have written something like this before in a Python interpreter:
    >>> s = 'Alex'
    >>> s += ' Thornton'
    >>> s
    'Alex Thornton'
Given that example, it certainly looks like s can be changed, and that the += operator is one way to do it. However, it's important to understand what's really happening there. The use of += in Python creates a new string object, copies the characters from the old one, appends the new characters to it, then makes s refer to the new string. (The variable s isn't a string at all; it's a reference to a string.) Afterward, since we can no longer refer to the old string, it is destroyed automatically via garbage collection.
All of this is true in Java, as well, with one additional caveat, demonstrated below:
    String s = "Alex";
    String s2 = s + " Thornton";
    String s3 = "Alex Thornton";

    if (s2 == s3)
    {
        System.out.println("Yep!");
    }
    else
    {
        System.out.println("Nope!");
    }
In Java, s, s2, and s3 are what are called references. The == operator compares two references to see whether they refer to the same object; it tests for object identity, not object equality. In this case, s2 and s3 will actually be referring to different string objects, albeit objects that have the same meaning (i.e., the same sequence of characters in them). Alas, the output here will be Nope!, because s2 and s3 don't refer to the same object. The fix would be to write s2.equals(s3) instead, which is how you test for object equality in Java. This is an easy mistake to make, though, and one that compiles and runs while giving sporadically incorrect output.
By way of contrast, a std::string variable in C++ is not a reference, a pointer, or anything else; it's actually a string. Updating it actually updates the string. Behind the scenes, there is some memory management going on, which we'll see, but the effect here is that std::string objects in C++ are mutable (i.e., their sequence of characters can be changed throughout their lifetimes).
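To make the contrast concrete, here is a small sketch; the function names are hypothetical and exist only for illustration. It shows that copying a std::string produces a second, independent string, and that changes are made to the string itself rather than to a freshly created replacement.

```cpp
#include <string>

// Hypothetical helper: copies a string, changes only the copy, and returns
// the original so a caller can confirm that it was unaffected.
std::string originalAfterChangingCopy()
{
    std::string original = "Alex";
    std::string copy = original;    // a separate string with its own characters

    copy += " Thornton";            // modifies copy in place
    copy[0] = 'a';                  // so does assigning to a single character

    return original;                // still "Alex"
}

// Hypothetical helper: modifies one string in place and returns the result.
std::string changeInPlace()
{
    std::string s = "Alex";

    s[0] = 'a';                     // overwrites the first character
    s += "!";                       // appends to the same string object

    return s;                       // "alex!"
}
```

Because copy and original are separate objects, no change to one is ever visible through the other, which is different from having two Python or Java references to a single string object.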
Bounds checking
In both Python and Java, if you attempt to access a character at an index that isn't there, the attempt will fail with an exception. (Java treats out-of-range substrings the same way; Python slices are more lenient and quietly clamp out-of-range bounds to the ends of the string.) Either way, it's impossible to accidentally access memory that isn't part of the string, which eliminates a whole class of problems that might arise from mistakes you might make. Though these errors will generally be run-time errors, it's nonetheless valuable that they're errors.
In C++, things are different. Performance is at a premium, and features that affect performance are generally provided only via an opt-in mechanism (i.e., you have to decide to do things that have a cost, rather than having them done for you). As a result, bounds checking is not done by default on strings or, as a general rule, on any of the data structures that are part of the language and library. Attempts to access parts of a string that aren't there result in what is called "undefined behavior," meaning that there is no guarantee what will happen: the program will not necessarily crash and, even if it does, it will not crash in a meaningful, easily debuggable way (e.g., an exception with a traceback, like you might see in Python). For example, this code in C++ results in undefined behavior:
    std::string s = "Alex";
    std::cout << s[5] << std::endl;    // There is no index 5!
    s[5] = 'b';                        // Oh no! Where will the 'b' get stored?
Depending on your compiler, your operating system, and your C++ standard library, this code might crash or it might simply misbehave — by grabbing the character in the memory beyond the end of the string s and printing it (or changing it!), even though it's not part of the string.
This is a dangerous default, but it is the reality in C++; the onus is on us to opt into bounds checking when we can afford it from a performance perspective. One way to do that with C++ std::strings is to use the member function at() instead of the [ ] (indexing) operator:
    std::string s = "Alex";
    std::cout << s.at(5) << std::endl;
    s.at(5) = 'b';
The primary difference between at() and the [ ] operator is bounds checking: at() checks whether the index you pass to it is legitimate (i.e., within the bounds of the string) and, if not, throws an exception of type std::out_of_range. We'll talk about exceptions in more detail toward the end of the quarter, though there are a fair number of things we need to get under our belts first in order to use them effectively, so we'll postpone that conversation for now.
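For the curious, here is a minimal sketch of what opting into bounds checking can look like. The helper charAtOr() is hypothetical (it is not part of the standard library), and the try/catch syntax it uses is something we'll cover properly later.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the character at index i, or fallback when
// i is outside the bounds of s.
char charAtOr(const std::string& s, std::size_t i, char fallback)
{
    try
    {
        return s.at(i);    // at() checks i against the length of s
    }
    catch (const std::out_of_range&)
    {
        return fallback;   // reached only when at() rejected the index
    }
}
```

With this sketch, charAtOr("Alex", 1, '?') yields 'l', while charAtOr("Alex", 5, '?') yields '?' instead of wandering into memory beyond the string.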
C-style strings
If you've ever programmed in C, you may have also seen another, more primitive form of string, implemented as a pointer to an array of characters, commonly written with the type char*. (We'll learn more about pointers and arrays soon.) C actually doesn't provide a string type in its library; instead, strings are implemented as pointers to arrays of characters, with various library functions used to manage these arrays and pointers. As you might imagine, this kind of thing turns out to be quite error-prone, and the kinds of issues that occur when you make a mistake can be difficult to find and fix (and can sometimes be dangerous, from a security perspective).
We'll generally avoid C-style strings in this course altogether. Because some parts of the C++ Standard Library do make use of them, we may occasionally need them, but whenever we can avoid using C-style strings, we certainly will.
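On the occasions when we do need to cross that boundary, std::string provides ways to convert in both directions. The following is only a sketch with hypothetical function names; it relies on the convention that a C-style string marks its end with a '\0' character.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: hands a std::string to a C library function.
// c_str() returns a pointer to the string's characters followed by '\0'.
std::size_t lengthViaCStyle(const std::string& s)
{
    const char* raw = s.c_str();
    return std::strlen(raw);    // counts characters up to the first '\0'
}

// Hypothetical helper: builds a std::string from a C-style string.
// The characters are copied, so the result manages its own memory.
std::string fromCStyle(const char* raw)
{
    return std::string{raw};
}
```

Converting to std::string as early as possible, and back to a C-style string only at the point where a library demands it, keeps most of the code away from raw pointers.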