Fundamentals of Memory and Memory Management

Memory Hierarchies: Speed/Size/Cost Tradeoffs

To start, we will discuss hierarchies in memory, including the tradeoffs among their speeds, sizes, and costs. First, here are the prefixes for sizes/speeds that every computer scientist should know.

  kilo x 10^3      milli x 10^-3
  mega x 10^6      micro x 10^-6
  giga x 10^9      nano  x 10^-9
  tera x 10^12     pico  x 10^-12
  peta x 10^15     femto x 10^-15
  exa  x 10^18     atto  x 10^-18

So, 1 gigabyte is 10^9 bytes. The following might become relevant during your lifetime:

  zetta x 10^21    zepto x 10^-21
  yotta x 10^24    yocto x 10^-24

There are three standard locations for memory: on the CPU chip (also known as cache memory), in special memory chips (main memory), and external memory (disk drives, DVDs, etc.). Each is slower than the previous one, but its size is bigger and its cost/byte is lower. CPU chip memory acts as a cache for main memory, and main memory often acts as a cache for external memory (some external memory devices, like disk drives, also have their own dedicated caches). A book by Van Loan/Fan (Insight Through Computing) makes some of these numbers concrete.

  1 megabyte: a 500-page novel; 1 minute of MP3 music
  1 gigabyte: the human genome; 20 minutes of a DVD
  1 terabyte: a university library; photos of all US airline passengers (1 day)
  1 petabyte: the amount of text in a library; printing it requires 50x10^6 trees

External Memory/Disk Drives: Note that disk drives/DVDs are modeled by a spinning platter with concentric circles that each store data; there is a "read head" that can move between the concentric circles. To read a specific word of data, the read head seeks the correct circle and waits for the data to rotate under it. A typical rotation speed is 7,200 rpm. That is 120 revolutions/second, so one full rotation takes a bit over 8 milliseconds (call it roughly 10 milliseconds); this is called the "rotational delay". The read head takes about the same amount of time to move into position (called the "seek time"). Note that a processor executing 1 billion operations per second can execute 10 million instructions in 10 milliseconds, while the read head seeks and the platter rotates to the needed position.

We often use the terms "latency" and "bandwidth" when discussing memory access (and transmitting information over networks as well). Latency is the time from a request for data until the first information arrives. Bandwidth is the throughput (data rate) once the first data begins to arrive. The latency for a disk drive/DVD can be large (see the 10 millisecond numbers above), because it involves moving something physical: the read head and the platter. But once the read head/platter are in the right position, with the right data under the head, the quickly spinning platter can transfer all the data on the circle very quickly. Current data transfer rates are about 70 megabytes per second. Because external memory typically has high latency (a long time, measured in machine instructions that could execute before the first piece of data arrives) and high bandwidth (after that, lots of data can be delivered quickly), we typically use every memory request to transfer a block of memory, not a single word of memory. The expectation is that the subsequent information will be needed soon (certainly the case when reading a sequential file with a class like TypedBufferReader).
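As a quick worked check of the figures above (the 7,200 rpm rotation speed and the 1 billion operation/second processor come from the text; only the arithmetic below is new):

  \[
  \frac{7200~\text{rev/min}}{60~\text{s/min}} = 120~\text{rev/s}
  \qquad\Rightarrow\qquad
  \frac{1~\text{s}}{120~\text{rev}} \approx 8.3~\text{ms per rotation} \approx 10~\text{ms}
  \]
  \[
  10^{9}~\tfrac{\text{instructions}}{\text{s}} \times 10 \times 10^{-3}~\text{s} = 10^{7} = 10~\text{million instructions executed while waiting}
  \]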
So, when reading information from a file stored on a disk drive, instead of just reading one character at a time, many characters (tens of thousands to millions) are read and cached in main memory (or the disk drive's dedicated cache; these caches can transfer data to memory at the rate of 3 gigabytes per second - much faster than information can be read as the disk rotates). So, for many subsequent character reads, the information is retrieved directly from memory and the computer doesn't have to perform any more disk operations. Such a cache has a special name: a "buffer". When the buffer is exhausted (all its characters have been read from it), the next character read initiates another block transfer. As we saw, it might take 10 milliseconds to move the read head and wait for the data on the platter to rotate under the read head, but it takes little extra time to read tens of thousands to millions more characters.

Another way to look at this: to get one character takes 10 milliseconds, but to get 100,000 characters requires only 10 + 1 = 11 milliseconds (at 100 megabytes/sec, a bit more than the 70 quoted above as the transfer rate). So, the amortized cost of reading one character is 11/100,000 milliseconds/character, or about 0.11 microseconds/character (an overall rate of roughly 9 million characters/second). This analysis is a bit like what we did for putting N values in an ArrayQueue, which doubles its size when full. Some adds are very quick, but a few require much more time, when the array size doubles. But if you look at lots of adds, the average cost is very cheap.

Here are some current sizes and relative speeds for CPU chip, main, and external memory (these are very approximate).

  CPU Chip   ~1-10 MB    10-100 times faster than main memory
  Main       ~1-10 GB    100K-1M times faster than external memory (latency)
  External   ~1-? TB

Typically, each step up increases the size by a factor of 1,000 and decreases the speed and cost/byte by a similar factor. When we have analyzed algorithms, we have assumed that all data is in main memory. In fact, often most of the data is in the CPU chip memory, and its performance is often an important practical consideration for determining the time an algorithm will take. With effective caching, often 9 of 10 memory accesses will occur in the CPU chip memory, not the main memory. This can speed up the execution by a factor of 10-100. We will briefly discuss the interplay between CPU chip and main memory below, using CPU chip memory to cache information from main memory.

Data Access Patterns and the CPU Cache: Access patterns for data often exhibit two kinds of locality.

Temporal locality: if data is being accessed now, it is likely to be accessed soon in the future; for example, a loop index (or more generally, a cursor) is accessed frequently during the execution of a loop (its value is initialized, checked, and updated in each loop iteration).

Spatial locality: if data is being accessed now, data near it is likely to be accessed soon in the future; for example, if we are accessing an array at position i, it is likely we will access position i+1 in the near future (when scanning an array, not doing a binary search in an array). This effect is similar, but not quite as pronounced, in linked lists/trees whose nodes were allocated at a similar time (and therefore sit in memory locations that are close by).

Scanning all the values in an array, using an index variable, exhibits both temporal (the index variable) and spatial (the array elements) locality.
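The buffering described above is exactly what the standard library's buffered readers do. Below is a minimal sketch in plain Java; it uses java.io.BufferedReader rather than the TypedBufferReader class mentioned above, and the file name "data.txt" and the 64 KB buffer size are just illustrative assumptions.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  public class BufferedReadDemo {
      public static void main(String[] args) throws IOException {
          // The BufferedReader wraps the FileReader and keeps a 64 KB buffer
          // (the second constructor argument) in main memory. The first read()
          // triggers one slow disk transfer that fills the buffer; the following
          // tens of thousands of read() calls are satisfied from the buffer with
          // no disk activity, so the amortized cost per character is tiny.
          try (BufferedReader in = new BufferedReader(new FileReader("data.txt"), 1 << 16)) {
              long count = 0;
              int c;
              while ((c = in.read()) != -1) {   // read one character at a time
                  count++;                      // (process the character here)
              }
              System.out.println("characters read: " + count);
          }
      }
  }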
If we can rewrite an algorithm (or write it in machine code) so that it fits completely in CPU chip memory, it can run much faster than slightly bigger code that cannot fit in CPU chip memory. The following algorithm is implemented in hardware and is used whenever the CPU needs to access data. Here we use the term cache for the CPU chip memory. The cache starts out empty.

1) If the data is already in the cache, use it.

2) If the data is not already in the cache:
   a) Retrieve it (and other data near it: some block of memory). As with external memory, there is high latency to get the data but high bandwidth to transfer a block of data from main memory to the cache, so the cache accesses/transfers blocks of data.
   b) If the cache is not full, add the new block of data to it.
   c) If the cache is full, determine which block of data to remove, and add the new block of data to replace it.

Again, we need a policy about which old data block to remove from a filled cache. Three standard and well-studied policies are Random, First-In-First-Out (FIFO), and Least Recently Used (LRU). Any policy must be fairly simple, otherwise it could not be directly implemented in hardware (because of the speeds needed, cache replacement algorithms must be implemented in hardware). The idea is to leave in the cache the data that is expected to be accessed soon in the future. Random does not bring anything relevant into the decision, but it is simple/cheap to implement (no extra storage). FIFO seems a reasonable strategy: if something was brought in a long time ago, it is less likely to be used compared to something that was brought in more recently (it requires only a simple queue to keep track of which block to replace next). But LRU gets more to the heart of temporal locality: if something has been used recently, it is more likely to be used in the near future (regardless of when it was initially brought in, which is what FIFO monitors). While LRU is harder to implement in hardware (it uses something like a priority queue), it can be implemented there, and it is a better predictor of what to remove and what to leave in the cache. Because the time needed to locate and transfer a block of data is large (while waiting for the data to arrive, the computer could execute many instructions), choosing a replacement policy like LRU that is less efficient (takes more time to determine what block to remove) but better (determines more accurately which block won't be used in the future) is likely to provide better performance overall.

The concept of "prefetching" works for Main->CPU or External->Main memory. If as programmers we know that some data is expected to be used in the future (but is not needed yet), we can prefetch it (touch it). Then, while we are doing other things, before the data is actually needed, it will be brought from the slower to the faster memory.

The maintenance of caches is an important part of chip design. Cache design becomes even more interesting when multiple cores/processors access the same memory. For example, if 4 CPUs each cache some main memory that they share, and one CPU changes a value (in its cache), the other CPUs that are caching that same memory have to be updated (and main memory as well, eventually). When new cache mechanisms are proposed, they are often evaluated by using previously collected "memory traces" recording which memory locations actual "important" programs access, and determining how well the caching mechanism works in those cases.
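Real CPU caches implement LRU (or an approximation of it) directly in circuitry, but the policy itself is easy to model in software. The following is a minimal sketch, not a description of any actual cache hardware: it uses java.util.LinkedHashMap in access order, and the capacity of 4 "blocks" and the Integer/String contents are illustrative assumptions.

  import java.util.LinkedHashMap;
  import java.util.Map;

  // A tiny software model of the LRU replacement policy: a fixed-capacity map
  // that, when full, evicts the least recently used entry.
  public class LruCache<K, V> extends LinkedHashMap<K, V> {
      private final int capacity;

      public LruCache(int capacity) {
          // accessOrder = true: iteration order runs least recently used first,
          // and every get()/put() moves the touched entry to the "most recent" end.
          super(capacity, 0.75f, true);
          this.capacity = capacity;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
          return size() > capacity;   // evict the least recently used entry when full
      }

      public static void main(String[] args) {
          LruCache<Integer, String> cache = new LruCache<>(4);  // room for 4 "blocks"
          for (int block = 0; block < 4; block++) {
              cache.put(block, "data for block " + block);
          }
          cache.get(0);                       // block 0 is now the most recently used
          cache.put(4, "data for block 4");   // cache is full: evicts block 1, not block 0
          System.out.println(cache.keySet()); // prints [2, 3, 0, 4]
      }
  }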
Such memory traces can constitute hundreds of millions (even billions) of memory references.

Likewise, using the concept of "virtual memory", we can consider main memory to act as a cache for external memory. Using virtual memory, we can solve huge problems by just pretending that the computer's memory is as large as its disk drive's memory (terabytes, not gigabytes). Then we use main memory as a cache for external memory (just as we discussed above using a CPU chip's memory as a cache for main memory, including replacement policies). Using virtual memory we can "easily" solve problems that do not fit into a computer's main memory, but unless the data structures and the algorithms processing those data structures exhibit strong temporal/spatial locality, the run time can be enormously larger (thousands to millions of times longer). In the next two lectures we will discuss data structures and algorithms for fast searching and sorting when using huge amounts of external memory.

Gordon Bell (a famous computer designer) has written a book called "Total Recall: How the E-Memory Revolution Will Change Everything". In the book he posits that in the future, everyone can have everything that they ever see and hear (e.g., every conversation that they have), every web page that they look at, etc. stored in memory and indexed for retrieval. Here is a quote from page 9, early in the book.

  In fact, digital storage capacity is increasing faster than our ability to pull information back out. Once upon a time, you had to be extremely judicious and stingy about which pieces of data you hung on to. You had to be thrifty with your electronic pieces of information, or bits, as we call them. But starting around 2000 it became trivial and cheap to sock away tremendous piles of data. The hard part is no longer deciding what to hold on to, but how to efficiently organize it, sort it, access it, and find patterns and meaning in it. This is a primary challenge for the engineers developing the software that will fully unleash the power of Total Recall.

Basically, Moore's Law (http://www.intel.com/technology/mooreslaw/), postulated by Gordon Moore (Intel), says that the number of transistors in a given area will double about every 18 months. This typically translated into computer speed doubling as well, but not any more: it requires too much power to keep speeding up a single processor. So instead, we use the extra transistors to create more cores (CPUs) on a single chip. They all run at a "slow" speed, but if programmed correctly to work together, they can accomplish as much as a faster chip. How to coordinate cores is still a problem (some say the biggest practical problem facing computing today). External memory is still growing at a slightly faster pace than predicted by Moore's law; typically every couple of years you can buy twice the amount of external memory for about the same price (with no speed degradation, but also not a lot of speed improvement).

Stacks and Heaps: Most computer programming languages use memory in two special ways: as a stack and as a heap (NOT the same kind of heap used for efficient priority queues; here the same name is used for something very different). Main memory is really just a giant array of words. A 32-bit word stores an int or a reference; it can also be divided into four 8-bit bytes, where each byte can store a single ASCII character.
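To make the "words and bytes" point concrete, here is a small sketch in Java: an int is a 32-bit word, and bit operations can treat it as four 8-bit bytes, each holding an ASCII character. The characters chosen and the big-endian byte order (first character in the most significant byte) are just illustrative assumptions.

  public class WordBytesDemo {
      public static void main(String[] args) {
          // Pack the four ASCII characters 'J','a','v','a' into one 32-bit int.
          int word = ('J' << 24) | ('a' << 16) | ('v' << 8) | 'a';

          // Unpack the four 8-bit bytes again, most significant byte first.
          for (int i = 3; i >= 0; i--) {
              char c = (char) ((word >> (8 * i)) & 0xFF);
              System.out.print(c);          // prints: Java
          }
          System.out.println();
      }
  }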
Think of all available memory (once the program has been stored in memory) as being divided between stacks and heaps, with the stack on the left growing toward the right, and the heap on the right growing toward the left.

  Memory
  +---------+------------------------+
  | program | Stack ->       <- Heap |
  +---------+------------------------+

We have seen that stacks are used for method calls - including recursive method calls - to store parameters and local variables; we have also seen them used to evaluate arithmetic expressions. Stacks grow and shrink with no "holes": each method call increases the stack size (adds to it) by N locations (storing N parameters and local variables) and each method return decreases the stack (removes from it) by the same N locations.

We use heaps for objects constructed by "new". Heaps can have holes. For example, if we store a Set in an array, initially we allocate an array of a certain size to store the Set; later we might double the length of the Set, allocating another array whose size is twice as big as the first (coming from heap space to the left of the original array). Now the original array is garbage (there are no references from the program pointing to it, but Java's memory manager knows about it), creating a hole in the heap space (that can be reused: see the section on garbage collection below). Thus, it is more difficult for programming languages to manage (allocate and reuse garbage) heap space.

Basics of (Heap) Memory Management: Whether programmers do their own memory management (as in the C and C++ languages, where they must explicitly "dispose" of memory they no longer need) or whether an automatic garbage collector (really, a "recycler") does it for us, we can discuss various needs and strategies for recycling memory.

First, a free block of memory is a contiguous number of free memory locations. Typically each memory block allocated in the heap has a few words reserved for memory management information. A minimal amount would allow us to store the size of the free block of memory and a reference to the next free block of memory (keeping all the free blocks in a linear linked list). Initially we would have one huge block of free memory. When a block of memory is freed (either explicitly because we dispose of it, or because an automatic garbage collector finds it), we can add it to the linked list of free memory blocks. If we need to allocate a block of memory, before going to the remaining memory in the heap (or after going there and not finding enough memory), we can check whether we can reallocate a block of memory from this linked list of free memory blocks that were previously allocated but garbage collected.

We will discuss four strategies below, using the concept of "fragmentation". Memory is fragmented if there are many small free blocks (as opposed to a few large free blocks). If memory is fragmented, it is likely to take longer to search for a free block of the necessary size. Here are four policies that decide which memory block to use from the linked list of free memory blocks.

1) First-fit: Search the linked list starting at the beginning and stop at the first memory block with enough space.

2) Next-fit: Search the linked list starting wherever the last allocated block came from, and stop at the first block with enough space (if we run off the end of the list, start again at the front: i.e., treat the list as circular).
3) Best-fit: Search the entire linked list and find the smallest block with enough space (or keep the list sorted by size, or use a hash-like structure with all blocks of 1-2 words, 3-4 words, 5-8 words, 9-16 words, etc. linked together).

4) Largest: Use the largest free memory block (sometimes called "worst fit", though the name is not meant as a pejorative).

After allocating a memory block of the needed size, the remaining memory in that block goes back on the linked list of free memory blocks (with a smaller size). So, regardless of the policy, if we need 100 words of memory to allocate for an object, and we decide to use a block that stores 300 words of memory, we allocate the 100 and put the remaining 200 back into free memory.

1) First-fit: initially fast, but can create lots of small memory blocks at the front of the linked list, slowing down searching.

2) Next-fit: improves on first-fit by spreading fragmentation throughout the linked list (not always at the front).

3) Best-fit: wastes little extra space, but tends to create very small memory blocks, possibly unallocatable (because they are too small to be useful), at the front of the linked list.

4) Largest: can be fast (we can use a priority queue where the largest memory block has the highest priority), and the leftover memory it puts back in the priority queue is still large (and thus more easily allocatable in the future).

Computer scientists have created lots of models for managing recycled memory and collected lots of data (in the form of memory-use traces) to simulate and evaluate all sorts of memory recycling policies. First-fit can create lots of fragmentation at the front of the list, causing a large search time. Next-fit improves on this by spreading the fragmentation throughout the list. Best-fit wastes little space, but it can create very small (unusable) fragments. Largest works well but takes time O(log N), where N is the number of memory blocks. Note that when a block of memory is returned, it is a good idea (although it takes time) to discover whether it is adjacent to another free block, and if so, combine the two blocks into one bigger block. (A small first-fit sketch in code appears at the end of this discussion, after the note on memory leaks below.)

Garbage Collection: When programmers manage free memory themselves, their code is often prone to error (even for good/experienced programmers), creating memory leaks: memory that the program is not using (and no longer has access to) but that is also not on a free list for future use: truly garbage that cannot be recycled. Sometimes such programs must be stopped and restarted because they run out of memory. There are some mission-critical programs that do not allow the use of "new" (after setting up initial data structures) because of possible memory leaks. During the first Gulf War, a memory leak was found in an anti-missile weapon which would not function well after operating for days (it was designed to be used in a "fast European war" and was not expected to work continuously for days at a time). Until the software was fixed, the operators were instructed to shut down and restart the software every few days (of course, when to shut it down was problematic, as the system was inoperable during the minutes required for a shutdown and restart). As I said above, it was designed to operate in Europe, where antimissile batteries would go on alert for just hours at a time, so testing under those conditions failed to show any problems with running it 24/7 (as was needed during the first Gulf War).
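Returning to the free-list policies above, here is a minimal first-fit sketch in Java. It is a toy model under simplifying assumptions, not how any real allocator is organized: a real allocator stores the size/next-block bookkeeping inside the memory blocks themselves, and the class and method names here are purely illustrative.

  import java.util.Iterator;
  import java.util.LinkedList;

  // A toy model of a first-fit free list: "memory" is just a range of word
  // addresses, and each free block records its starting address and size.
  public class FirstFitFreeList {
      private static class FreeBlock {
          int start, size;
          FreeBlock(int start, int size) { this.start = start; this.size = size; }
      }

      private final LinkedList<FreeBlock> freeList = new LinkedList<>();

      public FirstFitFreeList(int heapSize) {
          freeList.add(new FreeBlock(0, heapSize));   // initially one huge free block
      }

      // First-fit: scan from the front; use the first block that is big enough.
      // The unused remainder of that block stays on the free list (now smaller).
      public int allocate(int size) {
          Iterator<FreeBlock> it = freeList.iterator();
          while (it.hasNext()) {
              FreeBlock b = it.next();
              if (b.size >= size) {
                  int address = b.start;
                  b.start += size;
                  b.size  -= size;
                  if (b.size == 0) it.remove();       // block used up exactly
                  return address;
              }
          }
          return -1;   // no block is large enough (a real system would garbage collect)
      }

      // Freeing puts the block at the front of the list; a real allocator would
      // also try to coalesce it with adjacent free blocks into one bigger block.
      public void free(int start, int size) {
          freeList.addFirst(new FreeBlock(start, size));
      }

      public static void main(String[] args) {
          FirstFitFreeList heap = new FirstFitFreeList(1000);
          int a = heap.allocate(300);   // uses words 0..299
          int b = heap.allocate(100);   // uses words 300..399
          heap.free(a, 300);            // words 0..299 are free again
          int c = heap.allocate(100);   // first fit: reuses words 0..99
          System.out.println(a + " " + b + " " + c);   // prints 0 300 0
      }
  }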
By using an automatic Garbage Collector (GC) we avoid explicit deallocation; our code calls "new" when it needs memory but never "dispose" (typically the code just makes some reference variable refer to a different object; the original object it referred to - if no other reference variables refer to it - then becomes garbage/recyclable). Such systems can find all the memory currently not used by a program and put it all on the linked list of free memory blocks. Note that languages like Lisp had automatic garbage collection as early as the 1960s. Note also that for a program that doesn't exhaust memory, automatic garbage collection (which then never runs) can be faster than manual memory management, since manual management requires doing some work every time memory is disposed of, while automatic garbage collection does no work on disposal, but possibly more work on recycling garbage when there is no more free memory in the heap.

Simple garbage collection can be accomplished by storing reference counts, although a circular structure can have each of its nodes with a reference count of one, yet no variable refers to any part of the circular structure. Mark and Sweep garbage collectors are fairly standard and simple to understand, but there are many different algorithms for this universally useful task. We will discuss them briefly.

In the Mark phase, the GC first finds all the references in a program: these are all the reference variables stored in the stack, representing references to objects in the heap (from parameters and local variables in executing methods). The GC follows these references to the objects that they refer to and marks these objects as "live" (often there is a bit in an extra word associated with each block of memory to mark whether or not it is live). From these live objects, the GC follows their reference instance variables to the objects that they refer to in the heap, and marks those objects live as well. The GC continues this process (which is like searching a graph of objects - which refer to other objects - for "reachability") until it has marked live every object that can be reached from the parameters/local variables active in the code. There are some very clever algorithms that use the extra space in these live objects to store the data structures needed to reach all the live objects, so we don't need much extra memory during garbage collection (because at that time we don't have much extra memory!).

In the Sweep phase, the GC sweeps through the heap memory and puts on the linked list of free memory blocks all those memory blocks that it encounters that are NOT marked as live. If possible, it will combine two adjacent free memory blocks into one larger one.

Finally, note that when we are storing data in an array (say for a Set) and we perform a "clear" operation, typically we set objectCount to 0 AND store null in every previously used location in the array. This ensures that objects whose references are currently stored in the array can be garbage collected (if there are no other references to them). Even though WE know that no array positions beyond objectCount-1 store useful data, the garbage collector treats every reference in the array as live. If we left those references stored in the array, the garbage collector would consider all references in the array to be live when doing the Mark phase, and not collect those objects as garbage.
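Here is a minimal sketch of such a clear method for an array-based Set. Only objectCount comes from the discussion above; the field name items, the add method, and the rest of the class are illustrative assumptions.

  import java.util.Arrays;

  public class ArraySet<E> {
      private Object[] items = new Object[16];   // the backing array
      private int objectCount = 0;               // number of slots actually in use

      public void add(E element) {
          if (objectCount == items.length) {                  // array is full: double its length
              items = Arrays.copyOf(items, 2 * items.length); // old array becomes garbage (a heap "hole")
          }
          items[objectCount++] = element;   // (a real Set would also check for duplicates)
      }

      public void clear() {
          // Null out every slot that was in use, so the objects those slots
          // referred to can be garbage collected (if nothing else refers to them).
          for (int i = 0; i < objectCount; i++) {
              items[i] = null;
          }
          objectCount = 0;   // logically the set is now empty
      }
  }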
Of course the code works correctly either way, but if we don't set those object references to null, we may eventually run out of space. Garbage collectors have problems as well: they run at unpredictable times and take an unpredictable amount of time to run (although we can determine limits on their run time). So, there are some mission-critical real-time programs that do not allow the use of "new" because of this unpredictability. For example, in real-time applications (such as software flying an airplane), we would like to ensure that garbage collection does not take place during a critical phase (like landing). So, some real-time software prohibits the use of heap memory.

There are ways around this unpredictability. We can run an "incremental" GC at the same time as the program to minimize the frequency and length of pauses in the actual code. Say, every 100 milliseconds the GC runs for a few milliseconds, doing some of its work. The result is that the program executes a few percent slower (typically not a big problem on fast CPUs), but garbage collection runs more predictably; by the time a full collection is required, much of its work has already been accomplished. In fact, with multi-core CPUs, we can always run a GC on one of the cores to minimize the impact of automatic garbage collection.

Final Words: One lecture on the material described above is not enough to get a truly intuitive feeling for the information. The course ICS 51 (Introductory Computer Organization) covers these topics in much more depth. Read about these terms on the internet as well (e.g., Wikipedia). Use the names provided here.