We’ve been seeing the shining stickers of the Pentium III and Celeron
processors for a long time now. Let’s now focus our attention on a new silicon
masterpiece–the Pentium 4 processor from Intel. With this new processor, Intel
has changed to a new architecture, called NetBurst. Let’s get inside the
Pentium 4 and see what makes it tick.
Pentium 4 offers clock speeds ranging from 1.3 to 1.5 GHz (1 GHz = 1,000
MHz). The processor is built on a 0.18 micron manufacturing process and
carries about 42 million transistors. As you may know, all microprocessors have
what’s called the L1 cache memory. This memory is broken into two sections:
the data cache for storing frequently required data, and the instruction cache
for storing frequently used instructions. The Pentium 4 has just 8 kB of data
cache, as opposed to 16 kB in the PIII. The L1 instruction cache has been replaced
by what’s called the Execution Trace cache (see below), and made a part of the
NetBurst architecture. The L2 cache has also been modified and is now called
Advanced Transfer cache under the NetBurst architecture. It’s 256 kB in size
and runs at the processor speed.
The NetBurst
The NetBurst architecture consists of the following:
- Advanced Transfer cache
- 400 MHz (Quad-pumped) System bus
- Hyper Pipeline technology
- Execution Trace cache
- Rapid Execution Engine
- Streaming SIMD Extensions 2 (SSE 2)
Let’s see what these mean.
Quad-pumped System bus
All processors support a particular FSB (Front Side Bus) speed. The FSB is the
channel that connects the processor to the main memory (RAM). All kinds of
processing, ranging from loading device drivers at startup, starting Windows,
editing an MS Word file to playing MP3s in Winamp and fighting against bots in
Quake require the processor to communicate with the main memory. For good
performance, the FSB channel must support a high data transfer rate. In the
Pentium 4, the FSB channel is 64-bit wide and has a maximum effective speed of
400 MHz. We say effective because the actual bus speed is just 100 MHz. It has
been elevated to an effective 400 MHz by utilizing the clock cycle more
efficiently: instead of making one data transfer per clock pulse, the bus makes four.
The next step is to complement this high-speed bus with equally fast memory
modules. For this, Intel has paired the Quad-pumped bus with dual-channel RDRAM.
Hyper Pipeline technology
What gives a microprocessor the ability to process multiple tasks in
parallel? It is the Instruction Pipeline. A pipeline consists of the Arithmetic
and Logic unit, CPU registers, decoders, and so on. Ideally, the more stages
working in parallel, the faster the processor. The concept of a pipeline is to
have separate and independent
units doing their particular subtasks concurrently. The figure below shows a
4-stage pipeline.
Here four instructions are processed concurrently. The Fetch, Load, Decode, and
Execute stages each perform one subtask per clock cycle. While instruction 1 is
being loaded into the register, instruction 2 is being fetched from memory. By
the time instruction 1 is being executed, instruction 2 is being decoded,
instruction 3 is being loaded into the register, and instruction 4 is being
fetched from a memory location for processing. Assuming that all these
instructions are independent of each other, we can see that more work is done
per clock cycle. The PIII had a 10-stage pipeline, while the Pentium 4 has
revved it up to twice as much: a 20-stage pipeline.
For a pipeline to do its job ideally, it is important that all instructions
fed to the pipeline are independent of each other. Instructions like:
A=A+2;
if (A>10) B=A;
are not independent of each other as the second instruction depends on the
first. So in such cases, a branch prediction algorithm is used. Such an
algorithm monitors the path of execution of a program and tries to learn its
pattern. It then continues with the next instruction by making an assumption,
such as "let A be greater than 10". Here's where the problem arises. Suppose a pipeline
as deep as 20 stages carries on the execution of 20 instructions based on the
prediction algorithm. Now if the prediction made for an instruction goes wrong
and other instructions in the pipeline depend on it, then the entire pipeline
has to be cleared, thus wasting valuable clock cycles. Imagine this
happening at the 19th stage of the Pentium 4 pipeline. The solution to this has
been incorporated as the Execution Trace cache (the L1 instruction cache) in the
Pentium 4.
Execution Trace cache
The application programs and operating system communicate with the processor
using an instruction set. In the case of Intel, AMD, and Cyrix processors, this
instruction set is called the x86 instruction set. The x86 instructions can be
easily used by programmers, but are rather complicated for a microprocessor.
That’s because they don’t have a fixed length, which makes calculating the
location of the next instruction difficult. Thus what the processor works on are
simpler fixed-length instructions called micro-ops. It’s the job of the
decoder unit in the pipeline to convert x86 instructions into micro-ops, and
this conversion consumes useful clock cycles. In the case of a wrong
prediction, the decoder's effort is wasted: a new set of instructions is
fetched, and the decoder has to work all over again. This drastically slows
down the entire pipeline and hence the processing speed. In
the Pentium 4, the Execution Trace cache stores decoded micro-ops (at least
those most likely to be needed). After a few cycles of execution, the cache
holds these decoded micro-ops, which can be reused immediately after a
misprediction without running through the decoder again. Apart from this, the Pentium 4 uses a new branch
prediction algorithm, which Intel claims eliminates about 33 percent of the
mispredictions seen on the PIII.
Rapid Execution Engine
In the Pentium 4, two ALUs (Arithmetic Logic Unit) and two AGUs (Address
Generation Unit) run at twice the processor speed. The Arithmetic and Logic
unit is responsible for carrying out all integer calculations (addition,
subtraction, multiplication, division) and logical comparisons (A>B, A<B, and
so on). The instructions that are stored in the memory (RAM) can be addressed in two ways: direct
and indirect. In case of direct addressing, the given address of a memory
location directly yields the instruction. In case of indirect addressing, the
given address of a memory location in turn contains the memory address of the
instruction. AGUs are primarily used to resolve indirect addresses. As can be
comprehended, these units are quite important for high-speed processing which
includes frequent fetching of instructions and arithmetic calculations. In the
Pentium 4, the doubled speed of these units means they can complete twice as
many operations per clock cycle.
SSE2
SSE stands for Streaming SIMD (Single Instruction Multiple Data) Extensions.
Just as the PIII added 70 new SSE instructions, called KNI (Katmai New
Instructions), the Pentium 4 has 144 new instructions hard-wired into it.
This new instruction set further improves the double-precision (very accurate
decimal values) floating point operations. The performance of applications like
speech recognition, 3D games and animations, audio-video streaming,
compression-decompression can improve by using these instructions. According to
Intel, many such applications have been developed and will be known as Pentium 4
optimized applications.
That was about the architecture of the Pentium 4. We also ran some tests on
the processor to see how good it really is. So, flip a few pages to see how it
fared in our tests.
Shekhar Govindarajan