Implementing Collection Classes with an Inheritance Hierarchy

Introduction:

In this lecture we will discuss the details of the implementation of Set: one of the six standard generic data types/collection classes that we have discussed (Stack, Queue, PriorityQueue, List, Set, and Map). The implementation uses an Array data structure in a simple manner to store all the elements in the Set. There are strong similarities between this implementation of Set and the Array implementations of the other data types; also, Set is of middling complexity compared to Stack/Queue/PriorityQueue (lower) and List/Map (higher), so it is a good class to examine in detail. In Programming Assignments 2-4, you will be writing more complicated (but still similar) implementations of many of these collection classes.

Generally, the structure of all the Java code (interfaces and classes) is shown below. First, here is the structure of the interfaces. The lower ones are subinterfaces that extend the higher ones.

                        Iterable
                       /        \
      OrderedCollection          Collection        Map
       /      |       \           /      \
   Stack   Queue  PriorityQueue  Set    List

Recall the Stack, Queue, and Set interfaces supply names for these types, but do not add any methods to the OrderedCollection (for Stack and Queue) or Collection (for Set) interface they extend. The PriorityQueue interface adds just the "merge" method. The List interface adds half a dozen methods to the Collection interface it extends: methods that specify an index in their parameters. Map has no subinterfaces.

On the implementation side, here is the structure of the classes.

                 AbstractOrderedCollection
                /            |            \
    AbstractStack      AbstractQueue     AbstractPriorityQueue
          |                  |                     |
     ArrayStack         ArrayQueue         ArrayPriorityQueue

and

        AbstractCollection          AbstractMap
          /           \                  |
    AbstractSet   AbstractList           |
         |             |                 |
     ArraySet      ArrayList          ArrayMap

We will see that the AbstractOrderedCollection, AbstractCollection, and AbstractMap define many methods.
Some of these inherited methods will be overridden in the Array implementation classes (and later, in other implementation classes), to be more efficient once the data structure is actually known. In the middle, the Abstract subclasses (e.g., AbstractStack) often define just the equals method, which is different for every data type: i.e., for ordered collections, stacks can only be .equals to stacks, queues can only be .equals to queues, etc.

During the quarter we will write other implementations of these classes; each will extend the Abstract class that it implements (one of the six Abstract classes shown above) and override some of their inherited methods to be more efficient for that particular data structure.

We have already examined the "code" inside the interfaces (really just the headers of methods representing the operations that we can perform on that data type). We are now going to examine in detail the code in the AbstractCollection, AbstractSet, and ArraySet classes.

You can view all the code in the interfaces, abstract classes, and concrete classes by creating an Eclipse project that builds a path to the collections.jar library; then you can double-click on any of the .class files in that library, and examine the .java file for that class (which comes up in an editor tab). I SUGGEST THAT YOU EXAMINE THESE FILES WHILE YOU READ THIS LECTURE NOTE. Note that you cannot change the contents of these files: they are read-only in the editor.
If you want to actually experiment with these classes and change them, you can download "All Collection Files" (and NOT USE collections.jar at all).

Before we examine the interface, abstract class, and concrete class, here are some statistics for them:

    Interfaces
      Collection        :  35 lines
      Set               :   7
    Classes
      AbstractCollection: 193  heavily using iterators; most overridden in ArraySet
      AbstractSet       :  30
      ArraySet          : 264

The .size method is a simple example of a method written concretely in the AbstractCollection using an iterator, but reimplemented much more efficiently in ArraySet. In AbstractCollection we can use an iterator to iterate through all the values, counting each one, to compute the result of .size; in ArraySet we have an instance variable that stores how many array elements are occupied, so we just return this value.

Review the Collection and Set interfaces, from the previous lecture. Recall that a Set stores unique (no duplicate) elements. Common operations are adding values to a Set, checking whether a value is contained in a Set, and removing values from a Set; we can also query for the size of a Set, convert a Set to an array, and iterate over all the elements in a Set.

AbstractCollection:

Now let's focus on the generic AbstractCollection class, which implements the Iterable interface (by declaring an iterator() method, even though that method is abstract here). At the summary level, this abstract class implements many of the methods (all but add, iterator, newEmpty, shallowCopy, and equals). But the implemented methods typically iterate over elements and thus might be much slower than necessary. It is interesting that we can write so many methods by using an iterator, but most of these methods will be overridden in the concrete ArraySet class, to make them run more quickly, once we know that the data structure is an array.

The AbstractSet class will define the equals method.
The ArraySet class will define the methods add, iterator, newEmpty, and shallowCopy; by doing so, ArraySet becomes a working concrete class. If it only defined these methods (and didn't override the slower ones), it would still be a concrete class, but run more slowly than it should.

This abstract class declares the variable modCount, which our code should increment every time it modifies/mutates the data structure implementing the data type. The iterators must be able to know about such changes (in which case they will throw the ConcurrentModificationException). The declaration is

    //Used in subclasses: see their iterators
    protected transient volatile long modCount = 0;

In this abstract class, this variable is typically incremented whenever the code in this class calls remove on an iterator (right before the removal). Transient and volatile are "magic" (to us) keywords that tell the Java compiler to treat this variable very carefully (transient excludes it from default serialization; volatile makes every write to it visible to other threads). The type long is a 64-bit signed integer, so it can count far higher than a 32-bit int before overflowing.

Now, let's look at the code of the individual methods defined in the AbstractCollection class.

Commands:

Note the add method is not defined here, but is defined in the ArraySet class.

(1) addAll: This method iterates over es, calling add on every value produced by the iterator. If any call to add returns true, it will return true; if there are no calls to add, or all calls to add returned false, it will return false. Another way to code the body would be

    boolean modified = false;
    for (E e : es)
      modified = add(e) || modified; //because of short-circuit evaluation,
                                     //  modified || add(e) would NOT WORK!
    return modified;

This method does NOT need to be overridden in the ArraySet concrete class.

(2) remove: This method iterates over the Set; if the iterator produces the same value as the specified parameter o, it removes that value via the Iterator and returns true.
Note that o == null is treated as a special case in the if, checking item == null; in the else it checks o.equals(item), which could also be written item.equals(o). If it finds such an item it immediately removes it and returns true; if it fails to find such an item, it eventually returns false. This extra code complexity exists because we can write s.add(null) and it puts a null reference into our Set. If we never do this, the code will always execute the else part of the if, checking o.equals(item). Note that calling null.equals(...) throws NullPointerException because null refers to no object (so Java cannot execute the .equals method of "that" object).

This method should be overridden in the ArraySet concrete class, with a faster and more direct method.

(3) removeAll: This method iterates over es, calling remove on every value produced by the iterator. If any call to remove returned true, it will return true; if all calls to remove returned false, it will return false. See the comment about modified = ... in part (1), which applies here as well.

This method does NOT need to be overridden in the ArraySet concrete class. Note that the remove method is overridden in the ArraySet class to be faster. So, when removeAll is called on an ArraySet, it calls the faster remove method defined in ArraySet (that is how inheritance works).

(4) retainAll: This method iterates over the Set, calling remove (via the iterator) on every value that is not in the parameter collection. If remove is called one or more times, this method returns true; if remove is never called, this method returns false.

This method does NOT need to be overridden in the ArraySet concrete class.

(5) clear: This method iterates over the Set, calling remove on every value produced by the iterator, thus leaving the Set empty.

This method should be overridden in the ArraySet concrete class, with a faster and more direct method.
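To make the five commands above concrete, here is a minimal sketch of how they can be written using only add and an iterator. It is NOT the code in collections.jar: the names SketchCollection, ListBackedSketchSet, and IteratorCommandsDemo are hypothetical, the parameter types are assumptions, retainAll takes a java.util.Collection (not the course's Collection interface) so that it can call contains, and the concrete subclass stores its values in a java.util.ArrayList only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for AbstractCollection: only add and iterator are abstract.
abstract class SketchCollection<E> implements Iterable<E> {
    public abstract boolean add(E e);

    // (1) addAll: true if at least one call to add returned true.
    public boolean addAll(Iterable<? extends E> es) {
        boolean modified = false;
        for (E e : es)
            modified = add(e) || modified;  // add(e) is on the left so it always runs
        return modified;
    }

    // (2) remove: find o with the iterator; o == null is the special case.
    public boolean remove(Object o) {
        for (Iterator<E> i = iterator(); i.hasNext(); ) {
            E item = i.next();
            if (o == null ? item == null : o.equals(item)) {
                i.remove();
                return true;
            }
        }
        return false;
    }

    // (3) removeAll: built on remove, so it speeds up when a subclass overrides remove.
    public boolean removeAll(Iterable<?> es) {
        boolean modified = false;
        for (Object o : es)
            modified = remove(o) || modified;
        return modified;
    }

    // (4) retainAll: remove (via the iterator) every value that is not in es.
    public boolean retainAll(Collection<?> es) {
        boolean modified = false;
        for (Iterator<E> i = iterator(); i.hasNext(); )
            if (!es.contains(i.next())) {
                i.remove();
                modified = true;
            }
        return modified;
    }

    // (5) clear: remove every value the iterator produces.
    public void clear() {
        for (Iterator<E> i = iterator(); i.hasNext(); ) {
            i.next();
            i.remove();
        }
    }
}

// Hypothetical concrete subclass, backed by java.util.ArrayList to keep the sketch short.
class ListBackedSketchSet<E> extends SketchCollection<E> {
    private final List<E> items = new ArrayList<>();

    @Override public boolean add(E e) {
        if (items.contains(e)) return false;  // a Set stores no duplicates
        return items.add(e);
    }

    @Override public Iterator<E> iterator() { return items.iterator(); }
}

class IteratorCommandsDemo {
    public static void main(String[] args) {
        ListBackedSketchSet<String> s = new ListBackedSketchSet<>();
        System.out.println(s.addAll(Arrays.asList("a", "b", "a")));  // true
        System.out.println(s.addAll(Arrays.asList("a", "b")));       // false
        System.out.println(s.remove("a"));                           // true
        System.out.println(s.retainAll(Arrays.asList("x")));         // true
        System.out.println(s.iterator().hasNext());                  // false
    }
}
```

Notice that the subclass supplies only add and iterator, yet all five commands work on it; overriding remove in the subclass would automatically speed up removeAll as well.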
Queries:

(6) contains: This method iterates over the Set, looking for the same value as the specified parameter o; it returns true as soon as the Iterator produces such a value. Note that o == null is treated as a special case in the if, checking item == null; in the else it checks o.equals(item), which could also be written item.equals(o). It returns whether such a value was found.

This method should be overridden in the ArraySet concrete class, with a faster and more direct method.

(7) containsAll: This method iterates over es, calling contains on every value produced by the iterator. If any value is not contained in the Set, it will immediately return false (without needing to check any other values). If every value produced by the iterator is contained in the Set, it will return true.

This method does NOT need to be overridden in the ArraySet concrete class. Note that the contains method is overridden in the ArraySet class, so when containsAll is called on an ArraySet, it calls the contains method defined in ArraySet (that is how inheritance works).

(8) isEmpty: This method returns true if the size is 0.

This method does NOT need to be overridden in the ArraySet concrete class. Note that the size method is overridden in the ArraySet class, so when isEmpty is called on an ArraySet, it calls the size method defined in ArraySet (that is how inheritance works).

(9) size: This method uses an iterator to count how many values are in the Set (how many times we can call next on the iterator before hasNext becomes false) and returns this value.

This method should be overridden in the ArraySet concrete class, with a faster and more direct method.

Others:

(10) hashCode: We will skip this method for now (and discuss it later, when we discuss hashing).

(11) toString: This method uses an iterator to catenate all the values produced by the iterator (with commas in between them).
A StringBuffer is a more efficient way to catenate a large number of values (which is then converted to a String when the method returns). We could use String to make the code simpler, but it would be less efficient.

This method should be overridden in the ArraySet concrete class, with a faster and more direct method, but mostly one that includes further details about the Array implementation of this collection class.

So, it is amazing how many methods we can concretely define in this abstract class, by using an iterator; but many (not all) should be overridden by more efficient methods in the ArraySet (dealing directly with the Array used to store the Set).

Finally, at the bottom of the AbstractCollection I specify as abstract those methods in the interface Set but not defined here. The equals method will be defined in AbstractSet; the rest are defined in ArraySet.

AbstractSet:

The AbstractSet class defines just the equals method. When is a Set equal to another arbitrary object? When the other object is a Set (regardless of the data structure used to implement the Set data type), has the same number of values, and every value in one set is in the other set. Note that this code cannot appear in the AbstractCollection class, because it is specific to Sets (for example, see the equals method in the AbstractList class: it is different because in Lists, we must compare to another List, and the order of values in a List, unlike a Set, is important).

Here is a line-by-line analysis of this equals method.

Lines 10-11: If the Set .equals was called on is == to the parameter o, the Sets are the same (==), and hence .equals with no further analysis.

Lines 12-13: If o is not some kind of Set (don't worry about the type of its elements), it cannot be equal to the Set .equals was called on.
Line 14: Cast o to be some kind of Set (don't worry about the type of its elements); this cast is GUARANTEED TO WORK, because lines 12-13 would return false if this were not some kind of Set.

Lines 15-16: If the sizes of the Sets are not equal, the Sets cannot be equal.

Lines 19-27: Iterate over every object in s (the Set represented by o); if the Set .equals was called on fails to contain even one of these values, the Sets are not equal.

Line 28: If we found that o was a Set, the same size as the Set .equals was called on, and that every value in o was contained in the Set .equals was called on, then the Sets are equal.

It doesn't matter which set we iterate over/which set we call .contains on. We can write

    ...
    for (E e : this)         //this refers to the Set .equals was called on
      if (!s.contains(e))
        return false;
    ...

Note that by the time we execute this code, we know both are Sets and both are the same size.

ArraySet:

For a start, ignore the comments on lines 10-29 until after we discuss analyzing algorithms a bit. Let's jump down to where the instance variables are declared.

Lines 224-225: These define the two instance variables for an Array implementation of a Set: the variable "set" stores an E[] and objectCount stores an int count of how many elements are currently in the set. objectCount is NOT THE SAME as set.length (objectCount is always <= set.length). We will soon examine how the array size is increased; it is typically doubled in length.

Constructors:

Constructor 1: this(1) uses the second constructor (with the argument 1 matching the int parameter, initialCapacity) to construct an Array with length 1.

Constructor 2: construct an ArraySet of the specified length, throwing an IllegalArgumentException if initialCapacity is <= 0. We must start with an Array that can contain at least 1 value (otherwise doubling fails: 2*0 = 0).

Constructor 3: construct an empty ArraySet and then add all the values produced by iterating over es.
Constructor 4: construct an ArraySet that adds all the values stored in the Array es (by first allocating an Array whose length is at least 1 and at least as long as the Array whose values are being put into the Set: if some values in es are duplicates, the Set will hold fewer values than es does).

Now, onto the methods defined here. I will say for each whether it must be defined here, or is defined here to override an inherited method, to improve the speed of the operation. After we discuss all the operations, we will finally discuss the ArraySetIterator class, defined inside the ArraySet class and used by the iterator() method.

Also note that methods that change the data structure (either or both of the instance variables set/objectCount) must increment the modCount instance variable inherited from AbstractCollection. This instance variable is declared to be "protected", not "private", so that the code in subclasses can access it directly.

Methods:

(1) add: This method MUST appear in this concrete class. If the specified value is already in the Set, it immediately returns false. Otherwise it ensures the length of the array is big enough for 1 more value (we will discuss the ensureCapacity method below), puts the value in the required spot (increasing the objectCount), increments modCount (because the data structure has changed), and returns true.

Note that we start with objectCount = 0; so, the first time we execute

    set[objectCount++] = e;

objectCount gets incremented to 1 (there is now 1 object in the set Array), but its old value is used in the [], setting set[0] = e; the next time we execute

    set[objectCount++] = e;

objectCount gets incremented to 2 (there are now 2 objects in the set Array), but its old value is used in the [], setting set[1] = e;

Notice that the set Array always stores values in indices 0 through objectCount-1. That is, if objectCount is 2, the indices 0 and 1 are used.
(1a) ensureCapacity: This method is called by add; it ensures the set Array instance variable is long enough to contain minCapacity values. If not, it remembers the "old" set Array, determines the newCapacity (at least twice as big as the current set Array's length), constructs a new array that big and stores its reference into the set instance variable, and finally copies into it all the values in the old set Array. So, increasing the length of an Array is really accomplished by allocating a new, bigger Array, and then copying the needed values into it.

(2) remove: This method SHOULD appear in this concrete class, providing a speed improvement over the method inherited from AbstractCollection. It calls a private helper method, indexOf (discussed below), to compute the smallest index containing the value o (of course since sets have no duplicates, the smallest index is the unique index, if the value is present); if indexOf returns -1, o is not in the Set so remove just returns false (and does not change the data structure). Otherwise, it calls the private helper method removeAt (discussed below) with this index and returns the result returned by removeAt.

(2a) indexOf: This method uses a for loop to scan the set Array directly (instead of using an iterator), looking for o (doing a different comparison depending on whether o == null). If it finds that value, it returns the index in which it first/uniquely appears; if not it returns -1.

(2b) removeAt: This method moves the last value in the Set to index i, removing the value that was at index i from the Array. In the process it decrements objectCount (there is now one less object in the set Array). It replaces this last index with null, increments modCount (because the data structure has changed) and returns true.

Notice if there are five objects stored in the set Array, then objectCount is 5 and the objects are stored in indices 0-4 (call these values a, b, c, d, and e).
If we call removeAt(1), then the object at index 4 (e) is copied to index 1 (so b, the old value at index 1, has been removed), and index 4 is set to null. So now indices 0-3 (objectCount is decremented to 4) have the four remaining values (a, e, c, d). Note that because the order of a Set is undefined, it was fine for us to move the value at the last index to any other index; if the order were important (as in a List), we would have to shift a bunch of values to retain their relative ordering, instead of doing this faster operation.

(3) clear: This method SHOULD appear in this concrete class, providing a speed improvement over the method inherited from AbstractCollection. This method just sets objectCount to 0 and increments modCount because the data structure has changed. It would be useful to set every index containing an object to null (that would make garbage collection work better), but it is not necessary to work correctly, since an objectCount of 0 implies there is nothing useful in all the array indices. We will talk about garbage throughout the quarter.

(4) contains: This method SHOULD appear in this concrete class, providing a speed improvement over the method inherited from AbstractCollection. This method calls a private helper method, indexOf (discussed above as 2a), to compute the smallest index (of course since sets have no duplicates, the smallest index is the unique index, if the value is present) at which o is stored in the Set: -1 means o is not in the Set, any other value means it is in the Set.

(5) iterator: This method MUST appear in this concrete class. It uses the ArraySetIterator, which we will discuss below.

(6) size: This method SHOULD appear in this concrete class, providing a speed improvement over the method inherited from AbstractCollection. The instance variable objectCount directly stores the number of elements in the instance variable set, so that number represents the size.
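The array-based operations described so far (add, ensureCapacity, remove, indexOf, removeAt, clear, contains, and size) can be sketched together as follows. This is a simplified illustration, not the ArraySet source: the class name SketchArraySet is hypothetical, it does not extend AbstractSet, it has no iterator, and the valueAt and capacity accessors are assumptions added only so that the demonstration can look inside the array.

```java
// Hypothetical, simplified stand-in for ArraySet: no iterator, no inherited methods.
class SketchArraySet<E> {
    private E[] set;              // values live in indices 0 .. objectCount-1
    private int objectCount = 0;  // always <= set.length
    private long modCount = 0;    // incremented on every mutation

    @SuppressWarnings("unchecked")
    SketchArraySet(int initialCapacity) {
        if (initialCapacity <= 0)  // doubling a length of 0 would never grow the array
            throw new IllegalArgumentException("initialCapacity must be > 0");
        set = (E[]) new Object[initialCapacity];
    }

    SketchArraySet() { this(1); }

    public boolean add(E e) {
        if (contains(e)) return false;
        ensureCapacity(objectCount + 1);
        set[objectCount++] = e;    // old objectCount is the index used; then it grows by 1
        modCount++;
        return true;
    }

    // Allocate a bigger array (at least double) and copy the old values into it.
    @SuppressWarnings("unchecked")
    private void ensureCapacity(int minCapacity) {
        if (minCapacity <= set.length) return;
        E[] old = set;
        int newCapacity = Math.max(2 * old.length, minCapacity);
        set = (E[]) new Object[newCapacity];
        for (int i = 0; i < objectCount; i++)
            set[i] = old[i];
    }

    // Scan the array directly; -1 means o is not in the set.
    private int indexOf(Object o) {
        for (int i = 0; i < objectCount; i++)
            if (o == null ? set[i] == null : o.equals(set[i]))
                return i;
        return -1;
    }

    // Overwrite index i with the last value, then null out the vacated last index.
    private boolean removeAt(int i) {
        set[i] = set[--objectCount];
        set[objectCount] = null;
        modCount++;
        return true;
    }

    public boolean remove(Object o) {
        int i = indexOf(o);
        return i == -1 ? false : removeAt(i);
    }

    public boolean contains(Object o) { return indexOf(o) != -1; }

    public int size() { return objectCount; }

    public void clear() { objectCount = 0; modCount++; }

    // Hypothetical accessors, present only so the demonstration can inspect the array.
    E valueAt(int i) { return set[i]; }
    int capacity()   { return set.length; }
}

class ArrayOperationsDemo {
    public static void main(String[] args) {
        SketchArraySet<String> s = new SketchArraySet<>();
        for (String v : new String[] {"a", "b", "c", "d", "e"})
            s.add(v);
        System.out.println(s.capacity());   // 8: the length went 1, 2, 4, 8
        s.remove("b");                      // e moves from index 4 to index 1
        for (int i = 0; i < s.size(); i++)
            System.out.print(s.valueAt(i)); // aecd
        System.out.println();
    }
}
```

The demonstration reproduces the removeAt(1) example: after removing b, the array holds a, e, c, d in indices 0-3.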
(7) newEmpty: This method MUST appear in this concrete class. It just uses the first constructor to return a new, empty ArraySet.

(8) shallowCopy: This method MUST appear in this concrete class. It uses the second constructor to construct a new ArraySet (whose array length is the same as that of the Set that this method was called on). Then it copies the objectCount of the ArraySet that this method was called on to the objectCount of answer. Finally, it copies all the references from the set Array of the Set that this method was called on into the set Array of answer, and then returns the answer Set.

Notice that both Sets store references to the same values, so if we mutate a value from one Set, the other Set will refer to that mutated value. That is why this is called "shallow" copying. "Deep" copying would make a copy of every value in the original Set as well. Note that some classes, like String, are immutable: they contain no mutator methods. So, shallow copies don't cause any potential problems for these classes.

(9) toString: This method SHOULD appear in this concrete class. Mostly it is here to show information about the data structure implementing the Set, which is an array. It uses a StringBuffer to efficiently catenate the class name (ArraySet) with the objectCount instance variable and the length of the set Array instance variable, followed by every index in the set Array that is being used and the value at each.

ArraySetIterator:

The iterator method returns an object constructed from the ArraySetIterator class. The state of this object remembers the state of the iteration (which elements were seen, which was seen last and which will be seen next). Because this class is declared inside ArraySet, it can refer to all the instance variables in ArraySet: the set[], objectCount, and modCount.

Technically, when one class is declared inside another it is called a NESTED class.
Nested classes can be static or non-static; this class is non-static so it is called an INNER class. Each object constructed from an inner class will have a reference to the object of the outer class that constructs it (whichever object the iterator method is called on), which is how it accesses the information in the outer class. In ArraySetIterator, the outer class instance variables are used in hasNext (objectCount), next (set and modCount), and remove (set and modCount). Also, modCount appears in the initialization of expectedModCount in line 262.

We will first examine how ArraySetIterator objects are constructed and how their hasNext and next methods work, then we will examine the code in the remove method. The code is short, but a bit complicated.

There is no explicit constructor for ArraySetIterator, but its three instance variables (lines 261-263) are all explicitly initialized. Most important is nextIteratorIndex (initialized to 0, because 0 is the index of the next element to be iterated over). Also note that the modCount instance variable of the Set is copied into the expectedModCount instance variable of the nested class: if a mutator is called on the Set, its modCount will increment and become unequal to expectedModCount.

(1) hasNext: This method returns whether or not nextIteratorIndex is strictly less than objectCount. If it is, then there is still another element in the set Array that the iterator can produce (this index hasn't passed the last used index of the array).

(2) next: This method checks for two error conditions: (a) whether the data structure has changed during the iteration and (b) whether there are no more elements that the iterator can produce; in either case, it throws an exception. If it passes both checks, it gets the answer from the nextIteratorIndex (incrementing that index for the next time hasNext/next is called). In addition it sets the instance variable removedAlready to false.
This means that a new element was produced and returned by the next method in this iterator, and that element can be removed by the remove method in this iterator. Notice that removedAlready is initialized to true, so trying to call remove before you even call next will throw an exception. Basically, next sets this variable to false and remove sets it to true, forcing next to be called at least once between calls to remove.

(3) remove: This method checks for two error conditions: (a) whether the data structure has changed during the iteration and (b) whether the element returned by the most recent call to next has already been removed; in either case, it throws an exception. If it passes both checks, it removes the element at the index one less than nextIteratorIndex (the index of the element just returned by next). Then it decrements nextIteratorIndex because there is a new element for the iterator to produce at that index.

For example, suppose the set Array stores a, b, c, and d at indices 0-3. Now suppose nextIteratorIndex is 1: that means next has already returned a (the element in index 0). Now we call remove, to remove a from the Set. The call to removeAt removes the element at position 0 by storing d there. Thus, the set Array now stores d, b, and c at indices 0-2. The next method, if it is called again, should return d, so we need to decrement nextIteratorIndex from 1 back to 0, because a new/different element is now in the set Array at index 0.

I could have written the two lines

    removeAt(nextIteratorIndex-1);
    nextIteratorIndex--;

as the one line

    removeAt(--nextIteratorIndex);

but I thought things were complicated enough already.
Note that removeAt will increment modCount, but it is OK for an iterator to mutate the data structure and keep iterating (because the iterator is doing all the work and knows how to continue doing it), so the new modCount is stored back into expectedModCount and removedAlready is set to true, which requires another call to next before remove can be called again.

We have now taken the complete tour through all the .java files relating to the Set data type and its array implementation. Feel free to examine any/all of the 5 other data types and their array implementations. Each will have some unique code, but there will also be much similar code as well.

In Programming Assignment #2 you will write list implementations of various collection classes: each will implement its data type by using a linked list. Much of the code will mirror what is written here (converting array accesses to linked list accesses). Especially interesting is the code relating to Iterators (where hints will be given).
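As a recap of the iterator discussion, here is a minimal sketch of an inner iterator class with hasNext, next, and remove, using the variable names from this note (nextIteratorIndex, expectedModCount, removedAlready). It is an illustration under stated assumptions, not the ArraySetIterator source: the outer class SketchIterableArraySet is hypothetical and stripped down to add, size, and removeAt; it stores its values in an Object[] rather than an E[]; and the choice of NoSuchElementException and IllegalStateException follows the standard java.util.Iterator contract, since this note says only that "an exception" is thrown in those cases.

```java
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical, stripped-down outer class: just enough to support the iterator.
class SketchIterableArraySet<E> implements Iterable<E> {
    private Object[] set = new Object[1];  // assumption: Object[] instead of E[]
    private int objectCount = 0;
    private long modCount = 0;

    public boolean add(E e) {
        for (int i = 0; i < objectCount; i++)
            if (e == null ? set[i] == null : e.equals(set[i]))
                return false;
        if (objectCount == set.length)
            set = Arrays.copyOf(set, 2 * set.length);  // double the length
        set[objectCount++] = e;
        modCount++;
        return true;
    }

    public int size() { return objectCount; }

    private boolean removeAt(int i) {
        set[i] = set[--objectCount];  // last value fills the hole at index i
        set[objectCount] = null;
        modCount++;
        return true;
    }

    @Override public Iterator<E> iterator() { return new ArraySetIterator(); }

    // Inner (non-static nested) class: it can use set, objectCount, and modCount.
    private class ArraySetIterator implements Iterator<E> {
        private int     nextIteratorIndex = 0;
        private long    expectedModCount  = modCount;
        private boolean removedAlready    = true;  // so remove before next fails

        @Override public boolean hasNext() { return nextIteratorIndex < objectCount; }

        @SuppressWarnings("unchecked")
        @Override public E next() {
            if (expectedModCount != modCount)
                throw new ConcurrentModificationException();
            if (!hasNext())
                throw new NoSuchElementException();
            removedAlready = false;               // this element may now be removed
            return (E) set[nextIteratorIndex++];
        }

        @Override public void remove() {
            if (expectedModCount != modCount)
                throw new ConcurrentModificationException();
            if (removedAlready)
                throw new IllegalStateException();
            removeAt(nextIteratorIndex - 1);      // the element next just returned
            nextIteratorIndex--;                  // a different value now sits there
            expectedModCount = modCount;          // the iterator made this change itself
            removedAlready = true;
        }
    }
}

class IteratorRemoveDemo {
    public static void main(String[] args) {
        SketchIterableArraySet<String> s = new SketchIterableArraySet<>();
        for (String v : new String[] {"a", "b", "c", "d"})
            s.add(v);
        Iterator<String> i = s.iterator();
        i.next();                          // returns a
        i.remove();                        // d moves into index 0
        while (i.hasNext())
            System.out.print(i.next());    // dbc
        System.out.println();
    }
}
```

The demonstration follows the a, b, c, d example from the discussion of remove: after a is removed, the iterator goes back to index 0 and produces d, then b, then c.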