CPU Caches


Introduction

Originally, memory was as fast as CPUs. Over time CPUs got faster and faster. Memory got faster too, but didn't keep up with the increasing speed of CPUs, which meant that CPUs had to spend more and more time waiting for data to come from RAM (which can have a serious effect on performance).

As a work-around for the speed difference between CPUs and RAM, CPU manufacturers started including caches. The basic idea is to have a small amount of memory that the CPU can access really fast, which is used to store a copy of data in RAM (to reduce the time the CPU spends waiting to access slower RAM).

Over time the difference in speed between CPUs and RAM got even worse, so CPU manufacturers increased the size of the caches; but (due to limitations in semiconductor design) the larger a cache is the slower it is (or, the higher the latency of the cache is). To improve performance further, CPU manufacturers started combining large caches with smaller/faster caches on top.

For a current/modern CPU there can be up to 3 levels of caches - extremely fast but relatively small "level 1" (or L1) caches close to the CPU, fairly fast medium sized "level 2" (or L2) caches, then relatively large "level 3" (or L3) caches close to the system bus or RAM.

Of course the amount of RAM used in computers has grown too; and even a very large L3 cache can't cache everything. For example, a computer might have 4 GiB of RAM but only 12 MiB of L3 cache. To improve performance CPU manufacturers want to cache the RAM that is most likely to be used in future. However, it's extremely difficult to predict the future, so CPU manufacturers usually use a "least recently used" algorithm for caches, as something that was used recently is (usually) likely to be used again.

Fortunately, except for a few cases (testing for RAM faults, RAM bandwidth benchmarks) a programmer doesn't have to worry about the caches. However, caches play an important role in the overall performance of a system, and it is possible for programmers to get significant performance improvements by paying attention to cache efficiency.

In general the "least recently used" algorithm can work quite well, but there are cases where it causes poor efficiency. One example of this is called "cache pollution", where the CPU accesses a lot of data that won't be needed in the future; which causes data that is more likely to be needed soon to be pushed out of the cache to make room for the (more recently used) data that won't be needed soon. Another example is called "cache thrashing", where you repeatedly access an amount of data that is larger than the size of the cache (e.g. repeatedly accessing each item in a 10 MiB array when the CPU's cache is only 8 MiB). In this case, even though a piece of data will be used again it will be pushed out of the cache by more recent data accesses before it's used again.

There are several techniques for managing these cache efficiency problems. One technique CPU manufacturers use is to have separate caches for separate things. For example, a CPU might have an L1 cache for instructions and another L1 cache for data, so that problems with L1 data cache efficiency don't affect L1 instruction cache efficiency. This also helps with the "larger caches have worse latency" problem.

Modern CPUs also have instructions that programmers can use to help avoid "least recently used" cache efficiency problems; including instructions that can be used to prefetch data into the caches (useful when you know that data will be used soon), instructions to explicitly remove/flush data from the cache (useful when you know data that you've used recently won't be used again soon), and instructions to write data directly to RAM and bypass the cache (which is similar to writing data normally and then flushing that data from the cache).
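
For example, a buffer that is written once and not read again soon can be filled with non-temporal stores instead of normal stores. The routine below is only a rough sketch using the GCC/Clang SSE2 intrinsics (on 80x86 these compile to MOVNTI and SFENCE); the alignment requirement and the decision to bypass the cache are assumptions that depend on the situation:

    #include <emmintrin.h>   /* SSE2 intrinsics: _mm_stream_si32, _mm_sfence */
    #include <stddef.h>

    /* Fill a buffer using non-temporal stores, so the written data bypasses
       the cache instead of evicting other (possibly useful) cache lines.
       Assumes 'buffer' is 4-byte aligned and won't be read again soon. */
    static void fill_bypassing_cache(int *buffer, size_t count, int value)
    {
        for (size_t i = 0; i < count; i++)
            _mm_stream_si32(&buffer[i], value);

        /* Non-temporal stores are weakly ordered; make sure they're globally
           visible before any later stores. */
        _mm_sfence();
    }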

There are situations where software needs to process a large amount of data, but can process this data in blocks (for example, decompression often falls into this category). In these cases, it can be desirable (to avoid "cache thrashing") to detect how large the caches are, and to make sure that each block fits inside the cache. For example, if the CPU's cache is 4 MiB and you want to process 100 MiB of data, then processing 50 blocks that are 2 MiB each can give a lot better performance than processing 20 blocks that are 5 MiB each. For this reason, it can be a good idea for an OS to detect the size of CPU caches and provide a way for software (e.g. applications) to easily obtain this information.
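
For example, the processing loop itself might look something like the sketch below, where process_block() is a hypothetical routine and "use half the cache" is just an assumed safety margin to leave room for the code and other data the loop touches:

    #include <stddef.h>

    /* Hypothetical routine that processes one block of data. */
    void process_block(unsigned char *block, size_t length);

    /* Process 'total' bytes in blocks that fit comfortably inside a cache of
       'cache_size' bytes (e.g. obtained from the OS). */
    void process_in_blocks(unsigned char *data, size_t total, size_t cache_size)
    {
        size_t block_size = cache_size / 2;   /* assumption: use half the cache */

        for (size_t offset = 0; offset < total; offset += block_size) {
            size_t length = total - offset;
            if (length > block_size)
                length = block_size;
            process_block(data + offset, length);
        }
    }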

Caches In Systems With Multiple CPUs

When there are multiple CPUs directly connected to memory (without caches), everything works correctly (but slowly). When there are multiple CPUs with their own caches connected to memory, then data stored in RAM can be stale/obsolete because the current version of that data may still be in one of the CPUs' caches. In most systems caches are meant to be transparent (e.g. it's meant to be safe for software to forget that caching is being done), and having caches in each CPU creates problems because it's no longer safe for programmers to forget that caching is being done. The hardware in most computers uses special protocols to ensure that these problems don't happen - for example, when a CPU accesses memory it accesses the current version of the data regardless of whether the current version of the data is in RAM or in another CPU's cache. This is called "cache coherency".

Some computer systems are not cache coherent; and it's the programmer's responsibility to avoid accessing stale data. In general these systems can be a nightmare to program for, and it can be easier to treat each CPU like a separate system (e.g. pretend that a computer with 2 GiB of RAM and 2 CPUs is actually 2 separate computers with 1 GiB of RAM and 1 CPU each). Fortunately non-cache coherent systems are rare, especially for desktop/server computers.

For systems with multiple CPUs (especially for multi-core), it's possible for several CPUs to share the same caches. For example, for a computer with a pair of quad core CPUs you might have a total of 8 CPUs, where each CPU has its own L1 caches, pairs of CPUs share L2 caches, and 4 CPUs share each L3 cache. This can complicate "cache efficiency" optimizations for operating system software (e.g. in schedulers and memory management) and applications.

Cache Organization

Normally a cache is split up into many cache lines, where a cache line is the minimum amount of data that can be stored in the cache. For example, a 4 MiB cache might have 65536 entries to hold up to 65536 cache lines, where each cache line is 64 bytes. When data is transferred between the cache and RAM (or between the cache at one level and another cache at a different level) all 64 bytes are transferred. For a given cache line size, a larger cache has more entries.

The CPU needs to be able to find the entry in a cache that corresponds to a specific address. To do this each entry has a "tag", which says what is in that entry (e.g. whether the entry is being used or not, and if it is being used, which address the data corresponds to). For a very simple cache, this means that the CPU needs to look at every single tag just to find some data in the cache. This has obvious problems (it's slow).

To make the cache faster, CPU manufacturers divide the cache into sets of cache lines, where the data for a specific address can only go into one specific set, and each set holds a small number of cache lines (called "ways"). The number of ways in each set is called the "associativity", and a cache designed like this is called an "N-way set associative cache". For example, a 4 MiB cache that is 4-way set associative has 1 MiB per way; the data for a specific address can only be in one of 4 places (one per way in its set), so the CPU only needs to look at 4 tags to find some data in the cache.

To work out which set of cache lines to look in, the CPU uses some of the bits of the address. For example, for the address 0x12345678, with 64 bytes per cache line the lowest 6 bits don't matter (they're the offset within the cache line), and (if each way is 1 MiB, so there are 16384 sets) the next 14 bits select which set the data could be in. This also means that for a 32-bit address the tag only needs to store the highest 12 bits of the address (which helps to make the tag smaller).
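
The example above can be written out as a small calculation. This sketch assumes the same cache as before (4 MiB, 4-way set associative, 64 byte lines) and a 32-bit address:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE   64
    #define CACHE_SIZE  (4 * 1024 * 1024)
    #define WAYS        4
    #define NUM_SETS    (CACHE_SIZE / (WAYS * LINE_SIZE))     /* 16384 sets */

    int main(void)
    {
        uint32_t address = 0x12345678;

        uint32_t offset = address % LINE_SIZE;                /* lowest 6 bits   */
        uint32_t index  = (address / LINE_SIZE) % NUM_SETS;   /* next 14 bits    */
        uint32_t tag    = address / (LINE_SIZE * NUM_SETS);   /* highest 12 bits */

        printf("offset=0x%x index=0x%x tag=0x%x\n", offset, index, tag);
        return 0;
    }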

A cache where the data for any address can be anywhere in the cache (the simple/slower cache mentioned above) is called "fully associative", which is effectively a single set whose associativity equals the number of entries in the cache. A cache where the data for any address can only be in one place in the cache is called "direct mapped", which is effectively the same as "1-way associative".

For a direct mapped cache and for caches with low associativity, information in the cache can be found faster because there are fewer places it could be; but because there's a limited number of places it also increases the chance of cache misses. For example, with a 1 MiB direct mapped cache, data from the addresses 0x00000000, 0x00100000, 0x00200000, 0x00300000, etc. must share the same place in the cache; and data from 0x00000000 will cause data from 0x00100000 to be evicted from the cache, even if other entries in the cache are unused, and even if the data from 0x00100000 is used frequently. This can cause excessive cache misses. Therefore the design of a cache is a compromise between low associativity (fast lookup times) and high associativity (reduced cache misses).

For most modern CPUs, caches are 2-way, 4-way or 8-way set associative. With higher associativity, the overhead of finding entries in the cache outweighs the benefit of a slightly lower chance of cache misses.

Optimizing Cache Efficiency

For modern computers, CPUs are very fast and RAM is only "fast". Caches help performance a lot if software uses the cache efficiently, but if software doesn't use the caches efficiently then caches won't help performance much. There are several techniques that can be used when trying to use caches more efficiently.

Cache Blocking

The first technique is already mentioned in the introduction - if you can, split large amounts of data up into smaller blocks that fit inside the cache to avoid cache thrashing. For this to work you need to know how large the cache/s are and how many CPUs are sharing the cache/s.
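
On 80x86 this information can often be obtained with the CPUID instruction. The sketch below uses the Intel "deterministic cache parameters" leaf (CPUID leaf 4) through GCC/Clang's <cpuid.h> helper; note that AMD CPUs expose the same layout through leaf 0x8000001D instead, and very old CPUs support neither, so a real OS needs fallbacks:

    #include <cpuid.h>    /* GCC/Clang wrapper for the CPUID instruction */
    #include <stdio.h>

    int main(void)
    {
        for (unsigned index = 0; ; index++) {
            unsigned eax, ebx, ecx, edx;

            if (!__get_cpuid_count(4, index, &eax, &ebx, &ecx, &edx))
                break;

            unsigned type = eax & 0x1F;                       /* 0 = no more caches */
            if (type == 0)
                break;

            unsigned level      = (eax >> 5) & 0x7;
            unsigned sharing    = ((eax >> 14) & 0xFFF) + 1;  /* logical CPUs sharing it */
            unsigned line_size  = (ebx & 0xFFF) + 1;
            unsigned partitions = ((ebx >> 12) & 0x3FF) + 1;
            unsigned ways       = ((ebx >> 22) & 0x3FF) + 1;
            unsigned sets       = ecx + 1;
            unsigned size       = ways * partitions * line_size * sets;

            printf("L%u %s cache: %u KiB, %u-way, %u byte lines, shared by up to %u logical CPUs\n",
                   level, type == 1 ? "data" : type == 2 ? "instruction" : "unified",
                   size / 1024, ways, line_size, sharing);
        }
        return 0;
    }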

Page Colouring

The next technique is called "page colouring" or "cache colouring". Systems that use paging can choose pages to minimize the chance of cache misses caused by low associativity. Consider a 4 MiB 4-way set associative cache, where a process is extremely unlucky and every page it's using happens to map to the same sets in the cache. In this case the process would only be able to use 16 KiB of the cache (4 ways of 4 KiB each) and the remaining 99.6% of the cache would be wasted. Of course this is an extremely unlikely worst case used for illustration purposes - in practice it's likely that this problem will only cause a small decrease in cache efficiency. The point is that without page colouring an OS relies on probability and hopes that cache efficiency isn't too badly affected. Also note that this can cause non-deterministic performance (e.g. how fast software runs depends on luck), which may be undesirable (especially in real time systems).

The idea behind page colouring is to make sure that pages are allocated in a way that minimizes the chance of cache efficiency being affected. To do this an OS determines which bits in a physical address affect which sets in the cache the page's data will go to, and then makes sure that these bits in the virtual address of a page are the same as the corresponding bits in the physical address of the page. However, this can create an extra problem: typically some virtual addresses are used more often than others (for example, an OS might load all processes at the virtual address 0x00001000), which means that an OS might run out of physical pages that match the more frequently used virtual page addresses. To avoid this extra problem an OS can add an offset to a process's virtual addresses before deciding which physical pages are suitable, and make this offset different for different processes.

For page colouring to work you need to know the size of each way in the cache, which can be determined by dividing the total size of the cache by the cache's associativity. The number of page colours is then that way size divided by the page size.
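
As a sketch, for the 4 MiB 4-way set associative cache and 4 KiB pages used in the example above, the number of colours and the colour of a page might be computed like this:

    #include <stdint.h>

    #define PAGE_SIZE      4096
    #define CACHE_SIZE     (4 * 1024 * 1024)
    #define ASSOCIATIVITY  4

    /* Size of one way: how much contiguous memory maps across the whole
       cache before the set index wraps around. */
    #define WAY_SIZE       (CACHE_SIZE / ASSOCIATIVITY)   /* 1 MiB */
    #define NUM_COLOURS    (WAY_SIZE / PAGE_SIZE)         /* 256 colours */

    /* The colour of a page comes from the address bits that select the cache
       set but lie above the page offset; pages with the same colour compete
       for the same cache sets. */
    static unsigned page_colour(uint64_t address)
    {
        return (unsigned)((address / PAGE_SIZE) % NUM_COLOURS);
    }

A page colouring allocator would then hand out a physical page whose colour matches page_colour() of the (possibly offset) virtual address.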

Preventing Unnecessary Caching

In some situations software might access a lot of data, and the programmer knows that this data either won't be used again or won't be used again soon. In these cases you can improve cache efficiency by explicitly flushing cache lines (for example, the 80x86 "CLFLUSH" instruction) to free entries in the cache for more important data.
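
For example, a routine that flushes a whole buffer might look like the sketch below (a 64 byte cache line size is assumed; a real implementation would detect the line size instead of hard-coding it):

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Flush every cache line covering 'length' bytes at 'data', assuming
       64 byte cache lines. */
    static void flush_buffer(const void *data, size_t length)
    {
        uintptr_t start = (uintptr_t)data & ~(uintptr_t)63;   /* round down to a line boundary */
        uintptr_t end   = (uintptr_t)data + length;

        for (uintptr_t line = start; line < end; line += 64)
            _mm_clflush((const void *)line);

        _mm_mfence();   /* order the flushes against later memory accesses */
    }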

Some CPUs allow caching to be disabled on a per page basis. This feature can be used to avoid the need for software to explicitly flush cache lines in some situations.
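
On 80x86 this is controlled by the PCD and PWT bits in page table entries (and, with finer granularity, by the PAT). A minimal sketch of building an uncached page table entry, assuming the usual flag positions:

    #include <stdint.h>

    /* x86 page table entry flags (the low bits are the same in 32-bit paging
       and long mode). */
    #define PTE_PRESENT        (1ull << 0)
    #define PTE_WRITABLE       (1ull << 1)
    #define PTE_WRITE_THROUGH  (1ull << 3)   /* PWT */
    #define PTE_CACHE_DISABLE  (1ull << 4)   /* PCD */

    /* Map 'physical' with caching disabled, e.g. for data that is written
       once and never read back by the CPU. */
    static uint64_t make_uncached_pte(uint64_t physical)
    {
        return (physical & ~0xFFFull) | PTE_PRESENT | PTE_WRITABLE | PTE_CACHE_DISABLE;
    }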

Prefetching

In some situations it's easy to predict which cache lines will be needed soon. In these cases, some CPUs allow programmers to explicitly request that the data is loaded into the cache (prefetched) to avoid cache misses.
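
For example, with GCC/Clang the __builtin_prefetch() builtin (which compiles to PREFETCH instructions on 80x86) can be used; the prefetch distance below (8 cache lines, i.e. 512 bytes ahead) is just an assumption that would need tuning:

    #include <stddef.h>

    /* Sum an array while asking the CPU to start loading data ahead of
       where the loop is currently reading. */
    static long sum_with_prefetch(const int *data, size_t count)
    {
        long total = 0;

        for (size_t i = 0; i < count; i++) {
            if (i + 128 < count)                            /* 128 ints = 512 bytes */
                __builtin_prefetch(&data[i + 128], 0, 3);   /* for reading, keep in cache */
            total += data[i];
        }
        return total;
    }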

Also, some CPUs will detect certain access patterns (e.g. accesses to sequentially increasing and sequentially decreasing addresses), and will automatically prefetch cache lines.

Scheduling With Multiple CPUs

For computers with more than one CPU, when a task runs its code and data will end up in the cache/s of the CPU that it was running on. The next time that the scheduler gives the task CPU time it's possible for some of the task's code and data to still be in a CPU's cache, and some cache misses can be avoided by making the task run on the same CPU that it was running on last time. It isn't a good idea to always do this because an OS's scheduler may also take into account many other things (for performance, power management and other reasons) - it's better to think of this optimization as a scheduling hint.

For more complex systems with several layers of caching, where different caches are shared by different CPUs, if the scheduler can't schedule the task on the CPU that the task used last time then a CPU that shares cache/s with the previously used CPU may be preferable to other CPUs that don't share cache/s with the previously used CPU.
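
A very simplified sketch of how a scheduler might use this as a hint when picking a CPU for a task; cpu_is_idle(), cpus_share_cache() and pick_least_loaded_cpu() are hypothetical helpers a kernel would already have:

    /* Hypothetical helpers provided elsewhere in the kernel. */
    int cpu_is_idle(int cpu);
    int cpus_share_cache(int cpu_a, int cpu_b);
    int pick_least_loaded_cpu(void);

    /* Pick a CPU for a task that last ran on 'last_cpu'. */
    int choose_cpu(int last_cpu, int cpu_count)
    {
        if (cpu_is_idle(last_cpu))
            return last_cpu;        /* best case: its caches may still be warm */

        for (int cpu = 0; cpu < cpu_count; cpu++) {
            if (cpu != last_cpu && cpu_is_idle(cpu) && cpus_share_cache(cpu, last_cpu))
                return cpu;         /* next best: shares L2/L3 with the previous CPU */
        }

        return pick_least_loaded_cpu();   /* otherwise fall back to normal load balancing */
    }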