April 3, 2003

A key component of any processor is the cache memory: extremely fast memory, generally several times faster than the main computer memory, or RAM. In fact, there's a hierarchy of different types of memory inside a PC, and cache tops the list.

Cache memory isn't a new concept, but it has undergone changes in capacity, speed, and location over time as processors have changed. The CPU is the fastest component inside a PC, and it can process data much faster than the far slower RAM can feed it. That's why cache memory was introduced: to act as an intermediary between the CPU and the RAM, so that the CPU doesn't waste time waiting for data. The cache memory is placed close to the CPU so that it can supply data quickly.

RAM comes after the cache memory in this hierarchy, with other storage media such as the hard drive, optical media, and tape drives placed after it.

Using this memory hierarchy helps ensure better performance. Another concept, called locality of reference, also comes into play: it refers to the statistical probability of a program using and reusing a particular area of memory. It has been found that in a given interval of time, programs tend to access only a few localized areas of memory. Take the example of a loop.

Although the time spent in a loop can be considerable, the amount and area of memory accessed is relatively small; a simple loop counter, for instance, is accessed many times, but its memory location remains the same. So if this small amount of data is shifted to an extremely fast memory, meaning the cache, instead of being kept in main RAM, there is a considerable improvement in performance. The same argument holds for shifting a program from the hard drive to RAM.
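To make the loop example concrete, here is a toy sketch (it models variables, not real hardware addresses) that records every location a simple summing loop touches. The loop performs thousands of memory accesses, yet only two distinct locations are ever used, which is exactly the pattern a small, fast cache exploits.

```python
def access_trace(n):
    """Record every variable a simple summing loop touches."""
    trace = []
    total = 0           # lives at one fixed location: "total"
    for i in range(n):  # the loop counter lives at another: "i"
        trace.append("i")
        trace.append("total")
        total += i
    return trace

trace = access_trace(1000)
print(len(trace))       # 2000 accesses in total...
print(len(set(trace)))  # ...but only 2 distinct locations
```

Keeping those two hot locations in cache rather than RAM means almost every one of the 2,000 accesses is served at cache speed.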

Cache levels
Typically, the cache in processors is divided into two parts, called L1 (Level 1) and L2 (Level 2). L1 cache is part of the processor itself, which also means that it runs at the same speed. L2 cache is bigger and slightly slower than L1, but still faster than RAM. The L1 cache sends new instructions to the CPU core as quickly as the CPU can take them. The L2 cache assists the L1, and can also feed data directly to the CPU. The L2 cache also buffers data and tries to predict which part of memory will be accessed next. This is called pre-fetching, and it too helps improve performance. Different processors take different approaches here; for instance, the PIII uses ATC (Advanced Transfer Cache), while the P4 uses branch-prediction algorithms.
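The L1-then-L2-then-RAM lookup order described above can be sketched as a toy model. The dictionaries, the promotion policy, and the contents of RAM below are illustrative assumptions, not any real processor's design.

```python
# Toy two-level cache lookup. L1 is checked first (fastest), then L2,
# then main memory; a miss fills the caches on the way back.
L1, L2 = {}, {}
RAM = {addr: addr * 2 for addr in range(1024)}  # made-up memory contents

def load(addr):
    """Return the data at addr, checking the fastest memories first."""
    if addr in L1:             # fastest: runs at core speed
        return L1[addr]
    if addr in L2:             # slower than L1, faster than RAM
        L1[addr] = L2[addr]    # promote to L1 for future accesses
        return L2[addr]
    data = RAM[addr]           # slowest: main memory
    L2[addr] = data            # fill both cache levels
    L1[addr] = data
    return data

load(7)                  # first access misses both caches and fills them
print(7 in L1, 7 in L2)  # True True: later accesses are served from L1
```

The point of the sketch is simply the ordering: the CPU only pays the RAM penalty when both cache levels miss.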

The L2 cache has an interesting history. In earlier x86 processors, up to and including the Pentium, the L2 cache was located on the motherboard. While it was much faster than the system memory, its performance was hindered because it was physically separate from the CPU and the L1 cache. With the Pentium Pro, Intel moved the L2 cache off the motherboard and into the processor package; on the Pentium II it sat on the SECC module (the processor packaging) and ran at half the processor speed. This led to a significant performance improvement, but became a bottleneck once processor clock speeds rose. Later, Intel moved the L2 cache from the SECC packaging onto the same die as the CPU. This had two direct results: one, the size of the cache was reduced, because of die-space constraints and the high cost of integrating memory with the CPU core; and two, the cache speed became the same as the processor's. On the AMD side, the K6-III was the first to move the L2 cache onto the die.

See the table below for how cache memory moved in different processors.

Changes in P4 cache
With increasing processor speeds and architectural upgrades, cache memory has also undergone several changes. We'll take the Intel P4 as an example here. The Pentium 4's L2 cache is a unified (i.e. there is no separate data and instruction L2 cache), 8-way set-associative, 256 KB cache with a 128-byte line size. By 8-way, we mean that each set has 8 lines (ways) in which a given block of memory can be placed; this is one technique for mapping data between the cache and main memory. Other techniques include fully associative and direct mapping. An 8-way set-associative cache gives a miss rate that is only about 60% of that of a 1-way, direct-mapped cache of the same size.
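Using the figures above (256 KB, 128-byte lines, 8 ways), a minimal sketch of how an address splits into a tag, a set index, and a byte offset in a set-associative cache:

```python
CACHE_SIZE = 256 * 1024  # 256 KB, per the article
LINE_SIZE  = 128         # bytes per cache line
WAYS       = 8           # 8-way set associative

lines = CACHE_SIZE // LINE_SIZE  # 2048 lines in total
sets  = lines // WAYS            # 256 sets; each address maps to one set

def locate(addr):
    """Split an address into (tag, set index, byte offset)."""
    offset    = addr % LINE_SIZE
    set_index = (addr // LINE_SIZE) % sets
    tag       = addr // (LINE_SIZE * sets)
    return tag, set_index, offset

print(sets)              # 256
print(locate(0x12345))   # (2, 70, 69)
```

Any block whose set index is, say, 70 can live in any of the 8 ways of set 70; the stored tag tells the hardware which block currently occupies each way.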

While the 128-byte line size improves the hit rate, it requires more time to refill a cache line. To compensate for this, the P4 uses a quad-pumped 100 MHz system bus (Intel calls it a 400 MHz bus, because it is able to transfer data 4 times in a single clock cycle).
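A back-of-envelope check of what quad-pumping buys, assuming the P4's 64-bit-wide (8-byte) data bus:

```python
bus_clock = 100_000_000  # 100 MHz system bus
transfers = 4            # quad-pumped: 4 transfers per clock cycle
bus_width = 8            # 64-bit data bus = 8 bytes per transfer

bandwidth = bus_clock * transfers * bus_width
print(bandwidth / 1e9)   # 3.2 -- peak bandwidth in GB/s

# Bus clocks needed to move one 128-byte cache line:
line_fill = 128 / (transfers * bus_width)
print(line_fill)         # 4.0
```

In other words, a full 128-byte line refill takes only four bus clocks at peak rate, which is what makes the large line size affordable.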

The P4 uses a trace cache rather than a traditional L1 instruction cache. P4s decode the x86-style instructions into RISC-style instructions called µops (micro-ops). The trace cache stores these decoded µops rather than the original instructions, which was the case with the L1 instruction cache of processors like the PIII or Athlon. The trace cache can house up to 12,000 µops, although its physical size, and that of a µop, is not documented by Intel. Because the instructions have already been decoded before being stored in the cache, a unit called the branch prediction unit plays an important role: once the instructions have been decoded into µops, the system knows what branches are in the code. Using this information, the unit fetches instructions from the L2 cache and builds them into program-ordered sequences of µops called traces.

Cache performance 
Cache performance is measured in terms of hit ratio. If the processor looks for data and finds it in the cache, it is termed a hit; if it does not, the request is propagated to RAM, and it is termed a miss. The ratio of the number of hits to the total number of references made to memory is called the hit ratio. This ratio lies between 0 and 1; it is less than 1 because the cache cannot hold everything a program touches, so some references inevitably miss. The hit ratio is found by executing representative programs, and in some cases it has been found to be as high as 0.9, which can also be interpreted as statistical support for the locality-of-reference theory.
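The hit-ratio definition can be measured directly on a toy direct-mapped cache. The two access traces below are made-up illustrations of a high-locality (loop-like) pattern and a zero-locality (scattered) one.

```python
def hit_ratio(trace, cache_lines=64):
    """Hit ratio of a direct-mapped toy cache over an address trace."""
    cache = [None] * cache_lines
    hits = 0
    for addr in trace:
        slot = addr % cache_lines  # direct mapping: one possible slot
        if cache[slot] == addr:
            hits += 1
        else:
            cache[slot] = addr     # miss: fetch the line and replace
    return hits / len(trace)

# A loop-like trace: the same 8 addresses visited 200 times over.
loopy = [addr for _ in range(200) for addr in range(8)]
# A scattered trace with no reuse at all.
scattered = list(range(1600))

print(hit_ratio(loopy))      # 0.995 -- only the first 8 accesses miss
print(hit_ratio(scattered))  # 0.0  -- every access is a miss
```

The loop-like trace lands close to the 0.9-plus figures mentioned above, while the no-reuse trace shows why a cache is useless without locality.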

Ankit Khare

Cache memory in different processors


Processor | L1 cache | L2 cache | Comments
8086/8088 | None | None |
80286/80386 | None | None |
486 | 8 KB unified | 0 to 256 KB on motherboard | DX4 version of the CPU had 16 KB unified L1
Pentium | 8 KB data + 8 KB instruction | 256 to 512 KB on motherboard |
Pentium MMX | 16 KB data + 16 KB instruction | 256 to 512 KB on motherboard |
Pentium Pro | 8 KB data + 8 KB instruction | 256 KB to 1 MB, integrated | L2 moved into the processor package
Pentium II | 16 KB data + 16 KB instruction | 512 KB | CPU on SECC (Single Edge Contact Cartridge) form factor
AMD K6 | 32 KB data + 32 KB instruction | 256 KB to 1 MB on motherboard |
AMD K6-III | 64 KB | 256 KB on die | Supported a tri-level cache architecture, sporting L3 cache on the motherboard, scalable to 2 MB
Athlon XP (Thoroughbred) | 128 KB | 256 KB |
Athlon XP (Barton) | 128 KB | 512 KB |
Celeron | 16 KB data + 16 KB instruction | 128 to 256 KB | Initially introduced without L2 cache for cost reduction; later versions included it
AMD Duron | 128 KB | 64 KB | Features an exclusive cache, i.e. data is not duplicated between the L1 and L2 caches
PIII (Katmai) | 16 KB data + 16 KB instruction | 512 KB | L2 running at half CPU speed
PIII (Coppermine) | 16 KB data + 16 KB instruction | 256 KB | L2 running at CPU speed
PIII (Tualatin) | 16 KB data + 16 KB instruction | 256 KB | L2 running at CPU speed
Pentium 4 | 8 KB data + trace cache (size undocumented) | 256 KB | Uses a trace cache instead of the regular L1 instruction cache
