We’ve been seeing the shining stickers of the Pentium III and Celeron
processors for a long time now. Let’s now focus our attention on a new silicon
masterpiece–the Pentium 4 processor from Intel. With this new processor, Intel
has changed to a new architecture, called NetBurst. Let’s get inside the
Pentium 4 and see what makes it tick.
Pentium 4 offers clock speeds ranging from 1.3 to 1.5 GHz (1 GHz = 1,000
MHz). The processor is built on a 0.18 micron manufacturing process and
carries about 42 million transistors. As you may know, all microprocessors have
what’s called the L1 cache memory. This memory is broken into two sections:
the data cache for storing frequently required data, and the instruction cache
for storing frequently used instructions. The Pentium 4 has just 8 kB of data
cache, as opposed to 16 kB in the PIII. The L1 instruction cache has been replaced
by what’s called the Execution Trace cache (see below), and made a part of the
NetBurst architecture. The L2 cache has also been modified and is now called
Advanced Transfer cache under the NetBurst architecture. It’s 256 kB in size
and runs at the processor speed.
The NetBurst
The NetBurst architecture consists of the following:
- Advanced Transfer cache
- 400 MHz (Quad-pumped) System bus
- Hyper Pipeline technology
- Execution Trace cache
- Rapid Execution Engine
- Streaming SIMD Extensions 2 (SSE 2)
Let’s see what these mean.
Quad-pumped System bus
All processors support a particular FSB (Front Side Bus) speed. The FSB is the
channel that connects the processor to the main memory (RAM). All kinds of
processing, ranging from loading device drivers at startup, starting Windows,
editing an MS Word file to playing MP3s in Winamp and fighting against bots in
Quake require the processor to communicate with the main memory. For good
performance, the FSB channel must support a high data transfer rate. In the
Pentium 4, the FSB channel is 64-bit wide and has a maximum effective speed of
400 MHz. We say effective because the actual bus speed is just 100 MHz. It has
been elevated to an effective 400 MHz by utilizing the clock cycle more
efficiently: instead of making one data transfer per clock pulse, the bus makes four.
The next step is to complement this high-speed bus with equally fast memory
modules. For this, Intel has paired the Quad-pumped bus with dual-channel RDRAM.
Hyper Pipeline technology
What gives a microprocessor the ability to process multiple tasks in
parallel? It is the Instruction Pipeline. A pipeline consists of the Arithmetic
and Logic unit, CPU registers, decoders, and so on. Ideally, the more stages
working in parallel, the faster the processor. The concept of a pipeline is to
have separate and independent
units doing their particular subtasks concurrently. The figure below shows a
4-stage pipeline.
Here four instructions are processed concurrently. The Fetch, Load, Decode, and
Execute stages each perform one subtask per clock cycle. While instruction 1 is
being loaded into the register, instruction 2 is being fetched from memory. By
the time instruction 1 is being executed, instruction 2 is being decoded,
instruction 3 is being loaded into the register, and instruction 4 is being
fetched from a memory location for processing. Assuming that all these
instructions are independent of each other, we can see that more work is done
per clock cycle. The PIII had a 10-stage pipeline, while the Pentium 4 has
revved it up to twice as much: a 20-stage pipeline.
For a pipeline to do its job ideally, it is important that all instructions
fed to the pipeline are independent of each other. Instructions like:
A=A+2;
if (A>10) B=A;
are not independent of each other as the second instruction depends on the
first. So in such cases, a branch prediction algorithm is used. Such an
algorithm monitors the path of execution of a program and tries to learn its
pattern. It then continues with the next instruction by making an assumption,
such as "let A be greater than 10". Here's where the problem arises. Suppose a pipeline
as deep as 20 stages carries on the execution of 20 instructions based on the
prediction algorithm. Now if the prediction made for an instruction goes wrong
and other instructions in the pipeline depend on it, then the entire pipeline
has to be cleared, thus wasting valuable clock cycles. Imagine this
happening at the 19th stage of the Pentium 4 pipeline. The solution to this has
been incorporated as the Execution Trace cache (the L1 instruction cache) in the
Pentium 4.
Execution Trace cache
The application programs and operating system communicate with the processor
using an instruction set. In the case of Intel, AMD, and Cyrix processors, this
instruction set is called the x86 instruction set. The x86 instructions can be
easily used by programmers, but are rather complicated for a microprocessor.
That’s because they don’t have a fixed length, which makes calculating the
location of the next instruction difficult. Thus what the processor works on are
simpler fixed-length instructions called micro-ops. It’s the job of the
decoder unit in the pipeline to convert x86 instructions into micro-ops, and
this conversion consumes useful clock cycles. In the case of a wrong
prediction, the decoder's effort is wasted: a new set of instructions is
fetched, and the decoder has to work all over again. This drastically slows
down the entire pipeline and hence the processing speed. In
the Pentium 4, the Execution Trace cache stores decoded micro-ops (at least
those most likely to be needed). After a few cycles of execution, the cache
holds these decoded micro-ops, which can be reused immediately after a
misprediction without running through the decoder again. Apart from this, the Pentium 4 uses a new branch
prediction algorithm, which Intel claims eliminates about 33 percent of the
mispredictions seen on the PIII.
Rapid Execution Engine
In the Pentium 4, two ALUs (Arithmetic Logic Unit) and two AGUs (Address
Generation Unit) run at twice the processor speed. The Arithmetic and Logic
unit is responsible for carrying out all integer calculations (addition,
subtraction, multiplication, division) and logical comparisons (A>B, A<B, and
so on). The instructions that are stored in the memory (RAM) can be addressed in two ways: direct
and indirect. In case of direct addressing, the given address of a memory
location directly yields the instruction. In case of indirect addressing, the
given address of a memory location in turn contains the memory address of the
instruction. AGUs are primarily used to resolve indirect addresses. As can be
comprehended, these units are quite important for high-speed processing which
includes frequent fetching of instructions and arithmetic calculations. In the
Pentium 4, the doubled speed of these units means they can complete twice as
many operations per clock cycle.
SSE2
SSE stands for Streaming SIMD (Single Instruction Multiple Data) Extensions.
Just as the PIII added 70 new SSE instructions, called KNI (Katmai New
Instructions), the Pentium 4 has 144 new instructions hard-wired into it.
This new instruction set further improves the double-precision (very accurate
decimal values) floating point operations. The performance of applications like
speech recognition, 3D games and animations, audio-video streaming,
compression-decompression can improve by using these instructions. According to
Intel, many such applications have been developed and will be known as Pentium 4
optimized applications.
That was about the architecture of the Pentium 4. We also ran some tests on
the processor to see how good it really is. So, flip a few pages to see how it
fared in our tests.
Shekhar Govindarajan