
High-speed Cache Memory

PCQ Bureau

A key component in any processor is the cache memory: extremely fast memory, generally around 3.5 times faster than the main computer memory, or RAM. In fact, there's a hierarchy of different types of memory inside a PC, and cache tops the list.


Cache memory isn't a new concept, but it has undergone changes in capacity, speed, and location over time as processors have evolved. Since the CPU is the fastest component inside a PC, it can process data much faster than the comparatively slow RAM can feed it. Cache memory was therefore introduced as an intermediary between the CPU and RAM, so that the CPU doesn't waste time waiting for data. The cache is placed close to the CPU so that it can supply data faster.

RAM sits below the cache in this hierarchy, with other storage media such as the hard drive, optical media, and tape drives placed below it.

Using this memory hierarchy helps ensure better performance. Another concept, locality of reference, also comes into play: it refers to the statistical probability of a program using and reusing a particular area of memory. It has been found that in a given interval of time, programs tend to access only a few localized areas of memory. Take the example of a loop.


Although the time spent in a loop can be considerably long, the amount and area of memory accessed are relatively small; a simple loop counter, for instance, is accessed many times, but its memory location remains the same. So if this small amount of data is shifted to extremely fast memory, meaning the cache, instead of being kept in main RAM, there is a considerable improvement in performance. The same argument holds for shifting a program from the hard drive to RAM.
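The loop argument above can be made concrete with a toy trace. The sketch below records a symbolic "address" for every access a simple summing loop makes, then compares the total number of accesses with the number of distinct locations touched; the labels are illustrative, not real machine addresses.

```python
# Toy illustration of locality of reference: record the memory
# "addresses" a simple summing loop touches. Labels such as "i" and
# "total" stand in for real addresses.

def loop_trace(n):
    trace = []
    total = 0
    for i in range(n):
        trace.append("i")              # read/update the loop counter
        trace.append("data[%d]" % i)   # one new data element per iteration
        trace.append("total")          # accumulate into the same location
        total += i
    return trace

trace = loop_trace(1000)
print("total accesses:", len(trace))           # 3000
print("distinct locations:", len(set(trace)))  # 1002
```

Three thousand accesses land on just over a thousand locations, and two of those locations ("i" and "total") absorb two-thirds of all the traffic, which is exactly the reuse pattern a cache exploits.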

Cache levels



Typically, cache in processors is divided into two parts called L1 (Level 1) and L2 (Level 2). L1 cache is part of the processor itself, which also means that it runs at the same speed. L2 cache is bigger and slower than L1, but faster than RAM. The L1 cache sends new instructions to the CPU core as quickly as the CPU can take them. The L2 cache assists the L1, and also feeds data directly to the CPU. The L2 cache also buffers data and tries to predict which part of memory will be accessed next. This is called pre-fetching, and it also helps improve performance. Different processors use different types of pre-fetching; for instance, the PIII uses ATC (Advanced Transfer Cache), while the P4 uses branch-prediction algorithms.
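The L1-then-L2-then-RAM lookup order described above can be sketched as a small simulation. The latencies below are illustrative assumptions chosen only to show the relative ordering, not figures for any real processor.

```python
# Minimal sketch of a two-level cache lookup. Latencies (in cycles)
# are illustrative assumptions, not real processor figures.

L1_LATENCY, L2_LATENCY, RAM_LATENCY = 1, 10, 100

def access(addr, l1, l2, ram):
    """Return (value, cycles) for one memory access."""
    if addr in l1:                 # fastest path: L1 hit
        return l1[addr], L1_LATENCY
    if addr in l2:                 # L2 hit: promote the data into L1
        l1[addr] = l2[addr]
        return l2[addr], L1_LATENCY + L2_LATENCY
    value = ram[addr]              # miss everywhere: go to RAM
    l2[addr] = value               # fill both cache levels on the way back
    l1[addr] = value
    return value, L1_LATENCY + L2_LATENCY + RAM_LATENCY

ram = {addr: addr * 2 for addr in range(16)}
l1, l2 = {}, {}
_, cold = access(5, l1, l2, ram)   # first access misses everywhere
_, warm = access(5, l1, l2, ram)   # repeat access hits in L1
print(cold, warm)                  # 111 1
```

The cold access pays the full trip to RAM; the repeated access, thanks to locality, costs a single cycle.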

The L2 cache has an interesting history. In earlier x86 processors up to the Pentium, the L2 cache was located on the motherboard. While it was much faster than the system memory, its performance was hindered because it was physically separate from the CPU and the L1 cache. With the Pentium Pro and Pentium II processors, Intel moved the L2 cache from the motherboard into the processor packaging (the SECC module) and ran it at half the processor speed. This led to a significant performance improvement, but became a bottleneck once processor clock speeds rose. Later, Intel moved the L2 cache from the SECC packaging onto the CPU die itself. This had two direct results: one, the size of the cache was reduced because of die-space constraints and the high cost of integrating memory with the CPU core; and two, the cache speed became the same as the processor's. With the K6-III, AMD likewise moved the L2 cache onto the die.


See the table at the end of this article for how cache memory has moved across processor generations.

Changes in P4 cache



With increasing processor speeds and architectural upgrades, cache memory has also undergone several changes. We'll take the Intel P4 as an example here. The Pentium 4's L2 cache is an 8-way unified cache (i.e. no separate data and instruction L2 caches) of 256 KB with a 128-byte line size. By 8-way, we mean that each set holds 8 tagged lines, making it an 8-way set-associative cache. Set-associative mapping is one technique for mapping data between the cache and main memory; the other techniques are fully associative and direct mapping. An 8-way set-associative cache gives a miss rate that is only about 60% of that of a 1-way, direct-mapped cache of the same size.
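The set-associative mapping can be illustrated with a little arithmetic on the figures quoted above (256 KB total, 8 ways, 128-byte lines). The address-splitting function below is a generic sketch of set-associative indexing, not Intel's exact P4 scheme.

```python
# How a byte address maps into an 8-way set-associative cache, using
# the P4 figures quoted in the text. A generic sketch, not the exact
# indexing scheme of any particular processor.

CACHE_BYTES = 256 * 1024
LINE_BYTES  = 128
WAYS        = 8

lines = CACHE_BYTES // LINE_BYTES   # total cache lines
sets  = lines // WAYS               # sets of WAYS lines each

def split_address(addr):
    """Split a byte address into (tag, set index, offset within line)."""
    offset    = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % sets
    tag       = addr // (LINE_BYTES * sets)
    return tag, set_index, offset

print(lines, sets)            # 2048 256
print(split_address(0x12345)) # tag, set and offset for one sample address
```

A given memory block can live in any of the 8 ways of its set, so the hardware compares 8 tags per lookup instead of all 2,048, a compromise between the cheap lookup of direct mapping and the flexibility of a fully associative cache.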

While the 128-byte line size improves the hit rate, it requires more time to refill a cache line. To compensate for this, the P4 uses a quad-pumped 100 MHz system bus (Intel calls it a 400 MHz bus, because it is able to transfer data four times in a single clock cycle).
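The peak bandwidth of that bus is simple arithmetic: four transfers per 100 MHz clock over a 64-bit (8-byte) wide bus. The 64-bit width is an assumption stated here for the calculation, not a figure from the text above.

```python
# Peak bandwidth of the quad-pumped P4 front-side bus.
# The 8-byte bus width is an assumption for this calculation.

clock_hz      = 100_000_000   # base bus clock: 100 MHz
transfers     = 4             # "quad-pumped": 4 transfers per clock
bytes_per_xfr = 8             # 64-bit wide data bus (assumed)

peak = clock_hz * transfers * bytes_per_xfr
print(peak / 1e9, "GB/s")     # 3.2 GB/s
```

At 3.2 GB/s peak, a 128-byte line refill is far quicker than it would be over a single-pumped bus of the same clock, which is the point of quad-pumping.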


P4s use a trace cache rather than a traditional L1 instruction cache. The P4 decodes x86-style instructions into RISC-style instructions called µops (micro-ops), and the trace cache stores these decoded µops rather than the raw instructions, which was the case with the L1 instruction cache of processors like the PIII or Athlon. The trace cache can house up to 12,000 µops, though Intel does not document the exact size of the trace cache or of a µop. Because the instructions have already been decoded before being stored in the cache, a unit called the branch prediction unit plays an important role: once the instructions have been decoded into µops, the system knows where the branches in the code are. Using this information, the unit fetches instructions from the L2 cache and builds them into program-ordered sequences of µops called traces.

Cache performance 



Cache performance is measured in terms of the hit ratio. If the processor looks for data and finds it in the cache, it is termed a hit; if it does not, the request is propagated to RAM and it is termed a miss. The ratio of the number of hits to the total number of references made to memory is called the hit ratio; since some references inevitably miss, it lies between 0 and 1. The hit ratio is found by executing representative programs, and in some cases it has been measured as high as 0.9, which can also be interpreted as statistical support for the locality-of-reference theory.
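The payoff of a high hit ratio shows up in the effective (average) memory access time. The sketch below uses the 0.9 hit ratio quoted above; the cache and RAM latencies are illustrative assumptions, and a miss is modelled as paying the cache lookup plus the RAM access.

```python
# Hit ratio and the effective access time it buys. The 0.9 hit ratio
# is the figure quoted in the text; the latencies are assumptions.

hits, misses = 900, 100
hit_ratio = hits / (hits + misses)      # 0.9

cache_ns, ram_ns = 10, 100              # assumed access times (ns)
effective = (hit_ratio * cache_ns
             + (1 - hit_ratio) * (cache_ns + ram_ns))
print(hit_ratio, effective)             # 0.9 20.0
```

With these assumed numbers, a 0.9 hit ratio brings the average access down to 20 ns, five times faster than going to RAM every time.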

Ankit Khare


Cache memory in different processors

| Processor family | L1 cache | L2 cache | Comments |
|---|---|---|---|
| Intel 80286 | None | None | |
| Intel 80386 | None | None | |
| Intel 80486 | 8 KB unified | 0 to 256 KB | DX4 version of the CPU had 16 KB unified L1; L3 cache on motherboard |
| Intel Pentium | 8 KB data + 8 KB instruction | 256 to 512 KB | L3 cache on motherboard |
| Intel Pentium MMX | 16 KB data + 16 KB instruction | 256 to 512 KB | |
| Cyrix 6x86 | 16 KB unified | 256 to 512 KB | L3 cache on motherboard |
| AMD K5 | 8 KB data + 16 KB instruction | 256 to 512 KB | |
| Intel Pentium Pro | 8 KB data + 8 KB instruction | Integrated, 256 KB to 1 MB | |
| Intel Pentium II | 16 KB data + 16 KB instruction | 512 KB | CPU on SECC (Single Edge Contact Cartridge) form factor |
| AMD K6 | 32 KB data + 32 KB instruction | On motherboard, 256 KB to 1 MB | |
| AMD K6-III | 64 KB | On die, 256 KB | CPU supported a tri-level cache architecture, sporting L3 cache on the motherboard, scalable to 2 MB |
| AMD Athlon | 128 KB | 256 KB | Thoroughbred core |
| AMD Athlon XP | 128 KB | 512 KB | Barton core |
| Intel Celeron | 16 KB data + 16 KB instruction | 128 to 256 KB | Celeron was initially introduced without L2 cache for cost reduction; later versions included it |
| AMD Duron | 128 KB | 64 KB | Features an exclusive cache, i.e. data is not duplicated between L1 and L2 |
| Intel PIII Katmai | 64 KB | 512 KB | L2 running at half CPU speed |
| Intel PIII Coppermine | 64 KB | 256 KB | L2 running at CPU speed |
| Intel PIII Tualatin | 64 KB | 256 KB | L2 running at CPU speed |
| Intel Pentium 4 | 8 KB data + trace cache (size undocumented) | 256 KB | Uses a trace cache instead of the regular L1 instruction cache |