Tuning and Performance on Multi-Core Processors |
Tuning the performance of application programs on multi-core processors using compiler optimization
and code restructuring techniques is challenging. Understanding programming paradigms
(MPI, OpenMP, Pthreads), making effective use of the right compiler optimization flags,
and obtaining correct results for a given application are all important. Enhancing the performance
and scalability of a given application on multi-core processors as the problem size increases
requires serious effort. Several optimization techniques are discussed below.
|
Multi-Core Processors, Hierarchical Memory Features, Cache Overview |
(a). Hierarchical Memory Features
|
The performance of a modern computer depends on how fast the system can move data between processors and memory.
For the processor to work at full speed, it must be fed the required data and instructions
continuously. When processors worked at a clock rate of 10 MHz, memory latencies of 10 nanoseconds
could be tolerated. However, current microprocessors operate at clock rates of 800 MHz and above.

Figure 1.1 Memory features of a computing system
Furthermore, as discussed in the previous section, a super-scalar processor may execute multiple
instructions in each clock cycle. This imposes stringent requirements on effective memory latency
and bandwidth. Memory latencies have not kept pace with processor speeds. The fastest available
DRAMs (Dynamic Random Access Memory) have latencies of about 50 ns. This reflects a major
mismatch between the rate at which the processor requires data and the rate at which DRAM can
supply it. Below we discuss the role of hierarchical memory structures and memory banking in
reducing the effective latency and increasing the bandwidth.

Figure 1.2 Memory subsystem hierarchy of a computing system
The main memory includes the local memory in a node and the global memory for machines with a
centralized shared memory, such as PVPs and SMPs. The memory subsystem hierarchy is shown
in Figure 1.1 and Figure 1.2. In general, a program references a variable that can physically reside in a number
of different locations, some of which are very close to the processor (fast access) and some of
which are distant from the processor (slow access). This is illustrated in Figure 1.2, which shows
how latency (that is, memory access time) increases depending on whether the variable being referenced
is found in a local register, cache, main memory, or secondary storage (disk).
|
(b). Multi-Core Processors
|
Software developers can no longer rely on increasing clock speeds alone to speed up single-threaded
applications; instead, to gain a competitive advantage, developers must learn how to properly design
their applications to run in a threaded environment. Multi-core architectures have a single processor
package that contains two or more processor execution cores, or computational engines,
and deliver, with appropriate software, fully parallel execution of multiple threads.
A multi-core microprocessor (or chip-level multiprocessor, CMP) is one that combines two or more
independent processors into a single package, often a single integrated circuit (IC). A dual-core
device contains two independent microprocessors and a quad-core device contains four.
The terms multi-core and dual-core usually refer to some sort of central processing unit (CPU),
but are sometimes also applied to digital signal processors (DSPs) and systems-on-a-chip (SoCs).
Additionally, some use these terms only for multi-core microprocessors that are manufactured on the
same integrated circuit die; here, both terms refer to CPUs manufactured on the same integrated circuit,
unless otherwise noted. A multi-core microprocessor implements symmetric multiprocessing (SMP) in a single
physical package. A simple view of a Core 2 Duo processor is given in Figure 1.3.

Figure 1.3 Generic dual-core processor with CPU-local Level 1 caches and a shared, on-die Level 2 cache.
The typical multi-core processor integrates multiple cores that share a single coherent cache at the
highest on-device cache level (e.g. L2 for the Intel Core 2). The cores also share a single
interconnect (usually a bus) to the rest of the system. Each core independently
implements optimizations such as super-scalar execution, pipelining, and hyper-threading.
A system with N cores is most effective when it is presented with N or more threads concurrently.
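As a minimal illustration of the threaded environment described above, the following sketch uses OpenMP (one of the paradigms mentioned in the introduction) to split a loop across the available cores. The array size and the compilation command shown in the comment are illustrative assumptions, not part of any particular application.

/* Minimal OpenMP sketch: the loop iterations are divided among the
 * threads, which the operating system schedules onto the cores.
 * Compile (GCC, illustrative): gcc -fopenmp -O2 vec_add.c -o vec_add
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1000000   /* illustrative problem size */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Each thread handles a chunk of the iteration space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("threads available: %d, c[N-1] = %f\n",
           omp_get_max_threads(), c[N - 1]);
    free(a); free(b); free(c);
    return 0;
}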
|
(c). Cache Overview
|
A cache is a small amount of very fast, expensive memory that acts as a proxy for main memory.
It is divided into a number of equal-sized slots known as lines. Each line contains
a handful of sequential main memory locations, generally four to eight integers or real numbers.
Whereas the data within a line comes from the same part of memory, neighboring lines may contain data
that is far apart within your program, or perhaps data from somebody else's program. When you ask
for something from memory, the computer checks whether the data is available in one of these cache
lines. If it is, the data is returned with minimal delay. If it is not, your program may be delayed
while a new line is fetched from main memory. Of course, if a new line is brought in, another one has
to be thrown out. If you are lucky, it won't be the one containing the data that you are just about
to need. When every reference can be found in the cache, you have a 100 % hit rate. Generally,
a hit rate of 95 % or better is considered good; below that, application performance drops off steeply.
The general idea of a cache is that when a piece of data is needed most, it is moved to a place where it can
be accessed quickly. The remaining data continues to reside in main memory, where it can be accessed later.
Problems occur when the program accesses data that is not already in the cache (a so-called cache miss). As
described earlier, the data element must then be retrieved by fetching a cache line's worth of data from secondary
cache or main memory, and this data overwrites the contents of an existing cache line. If the entry
that was just replaced is subsequently needed, it has to be retrieved again, resulting in a great deal of wasted
overhead. In cases like these (the technical term is thrashing), performance is degraded and the cache
may seem like more of a headache than a plus.
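The cost of cache misses described above can be made visible with a small experiment that reads the same array once with unit stride (consecutive elements share a cache line) and once with a large stride (almost every access touches a new line). The sketch below is only illustrative; the array size, stride, and use of clock() for timing are assumptions, and actual numbers vary from machine to machine.

/* Illustrative sketch: unit-stride vs. large-stride traversal of the
 * same array.  The strided loop touches a new cache line on almost
 * every access and therefore suffers far more cache misses.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)     /* assumed array size: 16M ints, about 64 MB */
#define STRIDE 16       /* assumed stride: 16 ints = 64 bytes, one line */

int main(void)
{
    int *a = malloc((size_t)N * sizeof(int));
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;

    long sum = 0;
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)          /* unit stride: good line reuse */
        sum += a[i];
    clock_t t1 = clock();
    for (long s = 0; s < STRIDE; s++)     /* same total work, poor reuse  */
        for (long i = s; i < N; i += STRIDE)
            sum += a[i];
    clock_t t2 = clock();

    printf("sum=%ld  unit-stride %.3fs  strided %.3fs\n", sum,
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}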
Another way to optimize the use of the cache is to keep in mind that once a piece of data is in the cache,
it should be reused as much as possible. Looping through multidimensional arrays provides excellent
opportunities for cache reuse, as the loop-ordering sketch below illustrates.
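In C, two-dimensional arrays are stored in row-major order, so the innermost loop should run over the rightmost index to walk through memory with unit stride and reuse each fetched cache line. The following is a minimal sketch of that idea; the array dimension is an illustrative assumption.

/* Loop-ordering sketch for a row-major C array: the "good" version
 * reuses each fetched cache line, the "bad" version jumps a whole
 * row between consecutive accesses.
 */
#define M 1024              /* illustrative dimension */
static double a[M][M];

void sum_good(double *total)
{
    double s = 0.0;
    for (int i = 0; i < M; i++)       /* row index outer                 */
        for (int j = 0; j < M; j++)   /* column index inner: unit stride */
            s += a[i][j];
    *total = s;
}

void sum_bad(double *total)
{
    double s = 0.0;
    for (int j = 0; j < M; j++)       /* column outer                    */
        for (int i = 0; i < M; i++)   /* row inner: stride of M doubles  */
            s += a[i][j];
    *total = s;
}

int main(void)
{
    double t1, t2;
    sum_good(&t1);
    sum_bad(&t2);
    return (t1 == t2) ? 0 : 1;
}

In Fortran, which stores arrays in column-major order, the favourable loop order is the opposite.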
For each layer of the memory hierarchy, three parameters, namely capacity, latency, and bandwidth,
play a vital role in performance. The faster and smaller devices are closer to the processor.
The devices closest to the processor are the registers, which are in fact part of the processor chip.
Normally, no extra cycles are needed to fetch a word from a register to a functional unit (e.g. the ALU),
but the registers have limited capacity.
On most modern computers, the latency to fetch a word from a register to a functional unit may
vary from 2 to 4 ns. The level-1 cache is usually on the processor chip, while the level-2 cache is off the
chip. Remote memory refers to all the local memories of other nodes. Cache memories are faster static
RAM devices that are used as a buffer between the processor and the DRAM. They are much more expensive
than DRAM and have significantly higher power consumption. For these reasons, the size of cache
memories is typically much smaller than the available DRAM.
There are various approaches in current use that attempt to ensure, based on the statistical behavior
of program execution or on compiler analysis of memory reference patterns, that when a variable
is actually referenced by the processor it will be as close to the processor as possible. This means
moving the data, before it is referenced, so that it will be in the fastest part of memory (register
or cache) when it is referenced.
Cache memories work on the principles of spatial and temporal locality of data references by typical
programs. In most programs, if a data word is accessed in memory, it is likely that the words in
its neighborhood will also be accessed. This property is referred to as the spatial locality of data
references. In some programs, it is also possible that words are repeatedly accessed within windows of time.
We refer to this as the temporal locality of data accesses; it is sometimes also called the reuse
locality of the program. If the processor attempts to access data that is not available in the cache,
a cache miss occurs and the data is fetched from the DRAM into the cache. However, instead of fetching
a single word corresponding to the requested data item, a whole block of contiguous data is
transferred from the DRAM to the cache. This block of data is also referred to as a cache line. Due
to spatial locality, subsequently accessed data locations are likely to have been fetched
into the cache already. With an effective cache, subsequent accesses to these data items can therefore be
serviced quickly, since the data is already waiting in fast memory when required. Many other
factors influence cache performance. First, the cache size is critical for controlling cache misses.
The cache-line size also has an impact on miss ratios and memory access times.
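Temporal locality can also be engineered deliberately by blocking (tiling) a loop nest so that a small sub-block of data is reused many times while it is still resident in the cache. The blocked matrix-multiplication sketch below is a common illustration of this technique; the matrix dimension and block size are assumptions that would have to be tuned to the actual cache sizes.

/* Blocked (tiled) matrix multiplication sketch: each BS x BS sub-block
 * of A, B, and C is reused many times while it is still cache-resident.
 */
#define N  512      /* assumed matrix dimension           */
#define BS 64       /* assumed block size, tuned to cache */

static double A[N][N], B[N][N], C[N][N];

void matmul_blocked(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* work on one cache-resident block at a time */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
    matmul_blocked();
    return (C[0][0] == (double)N) ? 0 : 1;
}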
Caches are not only good for data that is being read; they are good for writes too. In a simple uni-processor
system, data written into the cache generally stays in the cache until the cache line is replaced, at which point
it is written out to main memory.
All modern virtual-memory machines have a special cache called a translation lookaside buffer, or TLB, for
virtual-to-physical memory address translation. The two inputs to the TLB are an integer that identifies
the program making the memory request and the virtual page requested. TLB lookups occur in parallel
with instruction execution, so if the address data is in the TLB, memory references proceed quickly.
Caches come in several flavors: direct mapped, set associative, and fully associative. The process of
pairing memory locations with cache locations is called mapping. Direct mapping is the simplest
algorithm for deciding how memory maps onto the cache. At the other extreme from a direct-mapped cache
is a fully associative cache, where any memory location can be mapped into any cache line, regardless
of its memory address. Set-associative caches generally come in two-way and four-way designs.
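To make the mapping concrete, the small sketch below computes which set of a set-associative cache a given address falls into; the cache geometry (32 KB, 64-byte lines, 4-way) and the example address are illustrative assumptions.

/* Illustrative mapping of an address onto a set-associative cache.
 * Assumed geometry: 32 KB cache, 64-byte lines, 4-way set associative.
 */
#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE  (32 * 1024)
#define LINE_SIZE   64
#define WAYS        4
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * WAYS))   /* 128 sets */

int main(void)
{
    uint64_t addr = 0x7ffe12345678ULL;          /* example address    */
    uint64_t line = addr / LINE_SIZE;           /* which memory line  */
    uint64_t set  = line % NUM_SETS;            /* which set (index)  */
    uint64_t tag  = line / NUM_SETS;            /* tag stored in set  */

    /* Direct mapped is the special case WAYS == 1 (one line per set);
     * fully associative is the other extreme, NUM_SETS == 1.         */
    printf("address 0x%llx -> set %llu, tag 0x%llx\n",
           (unsigned long long)addr,
           (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}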
On multi-processors (SMPs), written data must be returned to main memory so that the rest of the processors
can see it, or all other processors must be made aware of local cache activity. Perhaps they need to
be told to invalidate old lines containing the previous value of the written variable so that they
do not accidentally use stale data. This is known as maintaining coherency between the different caches.
The problem can become very complex in a multi-processor system.
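One practical consequence of this coherence activity is that threads updating different variables that happen to share a cache line keep invalidating each other's copies of that line. A common remedy is to pad per-thread data so that each item occupies its own line. The sketch below uses OpenMP for the threads; the 64-byte line size, the thread count, and the iteration count are illustrative assumptions.

/* Sketch: padding per-thread counters onto separate cache lines so that
 * coherence invalidations from one thread do not slow down the others.
 */
#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 16
#define LINE_SIZE   64   /* assumed cache-line size */

struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];  /* keep counters 64 bytes apart */
};

static struct padded_counter counters[MAX_THREADS];

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000L; i++)
            counters[id].value++;        /* no line shared between threads */
    }

    long total = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        total += counters[t].value;
    printf("total = %ld\n", total);
    return 0;
}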
Three approaches to solving the memory latency problem, namely latency avoidance, latency reduction, and
latency hiding, have been suggested in the past. Latency avoidance is a technique that tries to organize
user applications at the architectural, compiler, or application level to achieve data/program locality.
The purpose is to avoid the long latency of remote data or program accesses. This is possible only when
applications exhibit either temporal locality or spatial locality. These techniques fall into the
following categories: architectural support, user support, and software support.
|
(d). Pre-fetching and Multi-threading
|
Pre-fetching and multi-threading are techniques for hiding memory latency. If it is possible to
anticipate the need for a data item well in advance, a request can be issued for it in the hope
that the data will have arrived by the time it is actually needed, while the processor works on
other tasks that can be performed in parallel. This use of concurrency to hide latency is the key
idea behind pre-fetching. Pre-fetching is an effective tool for hiding memory latency in serial computing and
communication latency in parallel computers.
Given multiple threads of control, when a thread makes a memory access it can be swapped out and other
threads can be executed. If the memory access is completed before control returns to the thread, the
latency of the access has been effectively hidden. In an extreme case, one can visualize uniprocessors
with hundreds of threads that are switched at each cycle. If there is adequate bandwidth to memory,
it is conceivable that the entire latency can be hidden. This technique can hide latency more
effectively than software pre-fetching, because it is often impossible to determine data access
patterns in advance owing to their complexity. Although multi-threading may yield better performance than
software pre-fetching, it has considerable overheads in terms of hardware cost and programmer effort.
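As a concrete example of software pre-fetching, the sketch below uses the __builtin_prefetch intrinsic provided by GCC and Clang to request data a few iterations ahead of its use. The prefetch distance is an assumption that must be tuned, and other compilers would need a different intrinsic or rely on compiler-inserted prefetches.

/* Software pre-fetching sketch (GCC/Clang): request a[i + DIST] while
 * the processor is still working on a[i], hoping it has arrived in the
 * cache by the time it is needed.  DIST is an assumed tuning parameter.
 */
#include <stddef.h>

#define DIST 16   /* assumed prefetch distance, in elements */

double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low reuse */);
        s += a[i];
    }
    return s;
}

Hardware prefetchers on modern processors often handle simple strided patterns automatically, so the benefit of explicit pre-fetching should always be measured.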
From the above, it can be concluded that memory system design is very important from the parallel
processing point of view. One of the critical issues in parallel processing is single-processor performance.
Single-processor performance depends on the spatial and temporal locality of data references,
and it is critical to serial performance as well as to parallel processing efficiency. While optimizing
inter-processor communication is important for parallel programming, single-processor performance should not
be neglected. The most common parallel programming paradigm in scientific numerical computation is either the
SPMD (Single Program Multiple Data) model or a non-SPMD model. In both paradigms, each processor
essentially executes the same program on different data, or performs the required amount of computation
as a worker. Therefore, optimizing the performance of the single-processor program has
a tremendous multiplicative effect on the overall performance of the parallel code. Consequently, single
PE performance is an important issue in both serial and parallel code development.
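To illustrate the SPMD model referred to above, the following minimal MPI sketch has every process execute the same program while working on its own rank-determined slice of the index range; the global problem size and the compile/run commands in the comment are illustrative assumptions.

/* Minimal SPMD sketch with MPI: every process executes the same
 * program, but each works on its own rank-determined slice of data.
 * Compile/run (illustrative): mpicc spmd.c -o spmd && mpirun -np 4 ./spmd
 */
#include <stdio.h>
#include <mpi.h>

#define N 1000000   /* assumed global problem size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process sums its own contiguous block of indices. */
    long lo = (long)rank * N / size;
    long hi = (long)(rank + 1) * N / size;
    double local = 0.0, global = 0.0;
    for (long i = lo; i < hi; i++)
        local += (double)i;

    /* Combine the partial results on rank 0. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %.0f\n", global);

    MPI_Finalize();
    return 0;
}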
|