April 3, 2003

A key component of any processor is the cache memory: extremely fast memory, generally several times faster than the main computer memory, or RAM. In fact, there's a hierarchy of different types of memory inside a PC, and cache tops the list.

Cache memory isn't a new concept, but it has undergone changes in capacity, speed, and location over time as processors have changed. The CPU is the fastest component inside a PC, and it can process data much faster than the far slower RAM can feed it. That's why cache memory was introduced: to act as an intermediary between the CPU and the RAM, so that the CPU doesn't waste time waiting for data. The cache memory is placed close to the CPU so that it can supply data quickly.

RAM comes after the cache memory in this hierarchy, with other storage media such as the hard drive, optical media, and tape drives placed after it.

Using this memory hierarchy helps ensure better performance. Another concept, called locality of reference, also comes into play: it refers to the statistical probability of a program using and reusing a particular area of memory. It has been found that in a given interval of time, programs tend to access only a few localized areas of memory. Take the example of a loop.

Although the time spent in a loop can be considerable, the amount and area of memory accessed is relatively small; a simple loop counter, for instance, is accessed many times, but its memory location remains the same. So if this small amount of data is shifted to an extremely fast memory, meaning the cache, instead of being kept in main RAM, there is a considerable improvement in performance. The same argument holds for shifting a program from the hard drive to RAM.
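To make the loop example concrete, here is a toy sketch (it models variables, not real hardware addresses) that records every location a simple summing loop touches. The loop performs thousands of memory accesses, yet only two distinct locations are ever used, which is exactly the pattern a small, fast cache exploits.

```python
def access_trace(n):
    """Record every variable a simple summing loop touches."""
    trace = []
    total = 0           # lives at one fixed location: "total"
    for i in range(n):  # the loop counter lives at another: "i"
        trace.append("i")
        trace.append("total")
        total += i
    return trace

trace = access_trace(1000)
print(len(trace))       # 2000 accesses in total...
print(len(set(trace)))  # ...but only 2 distinct locations
```

Keeping those two hot locations in cache rather than RAM means almost every one of the 2,000 accesses is served at cache speed.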

Cache levels
Typically, the cache in processors is divided into two parts, called L1 (Level 1) and L2 (Level 2). L1 cache is part of the processor itself, which also means that it runs at the same speed. L2 cache is bigger and slightly slower than L1, but still faster than RAM. The L1 cache sends new instructions to the CPU core as quickly as the CPU can take them. The L2 cache assists the L1, and can also feed data directly to the CPU. The L2 cache also buffers data and tries to predict which part of memory will be accessed next. This is called pre-fetching, and it too helps improve performance. Different processors take different approaches here; for instance, the PIII uses ATC (Advanced Transfer Cache), while the P4 uses branch-prediction algorithms.
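The L1-then-L2-then-RAM lookup order described above can be sketched as a toy model. The dictionaries, the promotion policy, and the contents of RAM below are illustrative assumptions, not any real processor's design.

```python
# Toy two-level cache lookup. L1 is checked first (fastest), then L2,
# then main memory; a miss fills the caches on the way back.
L1, L2 = {}, {}
RAM = {addr: addr * 2 for addr in range(1024)}  # made-up memory contents

def load(addr):
    """Return the data at addr, checking the fastest memories first."""
    if addr in L1:             # fastest: runs at core speed
        return L1[addr]
    if addr in L2:             # slower than L1, faster than RAM
        L1[addr] = L2[addr]    # promote to L1 for future accesses
        return L2[addr]
    data = RAM[addr]           # slowest: main memory
    L2[addr] = data            # fill both cache levels
    L1[addr] = data
    return data

load(7)                  # first access misses both caches and fills them
print(7 in L1, 7 in L2)  # True True: later accesses are served from L1
```

The point of the sketch is simply the ordering: the CPU only pays the RAM penalty when both cache levels miss.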

The L2 cache has an interesting history. In earlier x86 processors, up to and including the Pentium, the L2 cache was located on the motherboard. While it was much faster than the system memory, its performance was hindered because it was physically separate from the CPU and the L1 cache. With the Pentium Pro, Intel moved the L2 cache off the motherboard and into the processor package; on the Pentium II it sat on the SECC module (the processor packaging) and ran at half the processor speed. This led to a significant performance improvement, but became a bottleneck once processor clock speeds rose. Later, Intel moved the L2 cache from the SECC packaging onto the same die as the CPU. This had two direct results: one, the size of the cache was reduced, because of die-space constraints and the high cost of integrating memory with the CPU core; and two, the cache speed became the same as the processor's. On the AMD side, the K6-III was the first to move the L2 cache onto the die.

See the table below for how cache memory moved in different processors.

Changes in P4 cache
With increasing processor speeds and architectural upgrades, cache memory has also undergone several changes. We'll take the Intel P4 as an example here. The Pentium 4's L2 cache is a unified (i.e. there is no separate data and instruction L2 cache), 8-way set-associative, 256 KB cache with a 128-byte line size. By 8-way, we mean that each set has 8 lines (ways) in which a given block of memory can be placed; this is one technique for mapping data between the cache and main memory. Other techniques include fully associative and direct mapping. An 8-way set-associative cache gives a miss rate that is only about 60% of that of a 1-way, direct-mapped cache of the same size.
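Using the figures above (256 KB, 128-byte lines, 8 ways), a minimal sketch of how an address splits into a tag, a set index, and a byte offset in a set-associative cache:

```python
CACHE_SIZE = 256 * 1024  # 256 KB, per the article
LINE_SIZE  = 128         # bytes per cache line
WAYS       = 8           # 8-way set associative

lines = CACHE_SIZE // LINE_SIZE  # 2048 lines in total
sets  = lines // WAYS            # 256 sets; each address maps to one set

def locate(addr):
    """Split an address into (tag, set index, byte offset)."""
    offset    = addr % LINE_SIZE
    set_index = (addr // LINE_SIZE) % sets
    tag       = addr // (LINE_SIZE * sets)
    return tag, set_index, offset

print(sets)              # 256
print(locate(0x12345))   # (2, 70, 69)
```

Any block whose set index is, say, 70 can live in any of the 8 ways of set 70; the stored tag tells the hardware which block currently occupies each way.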

While the 128-byte line size improves the hit rate, it requires more time to refill a cache line. To compensate for this, the P4 uses a quad-pumped 100 MHz system bus (Intel calls it a 400 MHz bus, because it is able to transfer data 4 times in a single clock cycle).
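A back-of-envelope check of what quad-pumping buys, assuming the P4's 64-bit-wide (8-byte) data bus:

```python
bus_clock = 100_000_000  # 100 MHz system bus
transfers = 4            # quad-pumped: 4 transfers per clock cycle
bus_width = 8            # 64-bit data bus = 8 bytes per transfer

bandwidth = bus_clock * transfers * bus_width
print(bandwidth / 1e9)   # 3.2 -- peak bandwidth in GB/s

# Bus clocks needed to move one 128-byte cache line:
line_fill = 128 / (transfers * bus_width)
print(line_fill)         # 4.0
```

In other words, a full 128-byte line refill takes only four bus clocks at peak rate, which is what makes the large line size affordable.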

The P4 uses a trace cache rather than a traditional L1 instruction cache. P4s decode the x86-style instructions into RISC-style instructions called µops (micro-ops). The trace cache stores these decoded µops rather than the original instructions, which was the case with the L1 instruction cache of processors like the PIII or Athlon. The trace cache can house up to 12,000 µops, although its physical size, and that of a µop, is not documented by Intel. Because the instructions have already been decoded before being stored in the cache, a unit called the branch prediction unit plays an important role: once the instructions have been decoded into µops, the system knows what branches are in the code. Using this information, the unit fetches instructions from the L2 cache and builds them into program-ordered sequences of µops called traces.

Cache performance 
Cache performance is measured in terms of hit ratio. If the processor looks for data and finds it in the cache, it is termed a hit; if it does not, the request is propagated to RAM, and it is termed a miss. The ratio of the number of hits to the total number of references made to memory is called the hit ratio. This ratio lies between 0 and 1; it is less than 1 because the cache cannot hold everything a program touches, so some references inevitably miss. The hit ratio is found by executing representative programs, and in some cases it has been found to be as high as 0.9, which can also be interpreted as statistical support for the locality-of-reference theory.
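The hit-ratio definition can be measured directly on a toy direct-mapped cache. The two access traces below are made-up illustrations of a high-locality (loop-like) pattern and a zero-locality (scattered) one.

```python
def hit_ratio(trace, cache_lines=64):
    """Hit ratio of a direct-mapped toy cache over an address trace."""
    cache = [None] * cache_lines
    hits = 0
    for addr in trace:
        slot = addr % cache_lines  # direct mapping: one possible slot
        if cache[slot] == addr:
            hits += 1
        else:
            cache[slot] = addr     # miss: fetch the line and replace
    return hits / len(trace)

# A loop-like trace: the same 8 addresses visited 200 times over.
loopy = [addr for _ in range(200) for addr in range(8)]
# A scattered trace with no reuse at all.
scattered = list(range(1600))

print(hit_ratio(loopy))      # 0.995 -- only the first 8 accesses miss
print(hit_ratio(scattered))  # 0.0  -- every access is a miss
```

The loop-like trace lands close to the 0.9-plus figures mentioned above, while the no-reuse trace shows why a cache is useless without locality.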

Ankit Khare

Cache memory in different processors


Processor | L1 cache | L2 cache | Comments
8086/8088 | None | None |
80286/80386 | None | None |
486 | 8 KB unified | 0 to 256 KB on motherboard | DX4 version of the CPU had 16 KB unified L1
Pentium | 8 KB data + 8 KB instruction | 256 to 512 KB on motherboard |
Pentium MMX | 16 KB data + 16 KB instruction | 256 to 512 KB on motherboard |
Pentium Pro | 8 KB data + 8 KB instruction | 256 KB to 1 MB, integrated | L2 moved into the processor package
Pentium II | 16 KB data + 16 KB instruction | 512 KB | CPU on SECC (Single Edge Contact Cartridge) form factor
AMD K6 | 32 KB data + 32 KB instruction | 256 KB to 1 MB on motherboard |
AMD K6-III | 64 KB | 256 KB on die | Supported a tri-level cache architecture, sporting L3 cache on the motherboard, scalable to 2 MB
Athlon XP (Thoroughbred) | 128 KB | 256 KB |
Athlon XP (Barton) | 128 KB | 512 KB |
Celeron | 16 KB data + 16 KB instruction | 128 to 256 KB | Initially introduced without L2 cache for cost reduction; later versions included it
AMD Duron | 128 KB | 64 KB | Features an exclusive cache, i.e. data is not duplicated between the L1 and L2 caches
PIII (Katmai) | 16 KB data + 16 KB instruction | 512 KB | L2 running at half CPU speed
PIII (Coppermine) | 16 KB data + 16 KB instruction | 256 KB | L2 running at CPU speed
PIII (Tualatin) | 16 KB data + 16 KB instruction | 256 KB | L2 running at CPU speed
Pentium 4 | 8 KB data + trace cache (size undocumented) | 256 KB | Uses a trace cache instead of the regular L1 instruction cache
