Why are Multicore CPUs So Hot?

PCQ Bureau

These days, processor codenames and architecture names are used with abandon.
The names can confuse anyone, and after a while it becomes nearly impossible to
tell them apart. This month, we take you through this jargon jungle and unveil
the secrets of how modern x86 processors work. There are two parts to the
story. One is the micro-architecture, which is a combination of multiple
technologies. For instance, when Intel says 'Core 2' or AMD says 'AM2', it
denotes that a particular combination of technologies is at work to produce a
processor that can be called a 'Core 2' or an 'AM2'.



The second part of our story is the architecture itself, which is known by a
specific codename (say 'Conroe' or 'Manchester'). This month, we look at the
technologies behind the different micro-architectures; next month, we'll see
what each combination (the architectures themselves) is about.

Direct Hit!
Applies To: IT managers
USP: Learn how multicore processors work under the hood and what makes them so special
Primary Link:
Google Keywords: Multicore architecture

Macro Fusion

A term coined by Intel, it refers to the processor's ability to combine certain
pairs of instructions into a single internal operation, so that they decode and
execute faster. The typical case is a compare followed by a conditional jump,
which the processor can treat as one operation instead of two. Likewise, if
values from two different memory locations already in the processor cache are
to be compared, but the instruction sequence first loads them elsewhere and
then compares them, the Macro Fusion technique lets the processor compare them
directly, skipping a step. Spread across an entire application or thread, this
can significantly cut execution time.
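
To make the idea concrete, here is a minimal sketch in Python of what fusing adjacent instructions amounts to. It is a toy model, not Intel's decoder: the instruction strings, the `fuse` function and the two sets of mnemonics are our own assumptions for illustration, and the rules real processors use to decide which pairs may fuse vary by generation.

```python
# Toy model of macro fusion: a compare that is immediately followed by a
# conditional jump is merged into one internal operation.
# The mnemonic sets below are illustrative assumptions, not Intel's rules.
COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def fuse(instructions):
    """Return a new list where each compare + conditional jump pair is merged."""
    fused = []
    i = 0
    while i < len(instructions):
        op = instructions[i].split()[0]
        has_next = i + 1 < len(instructions)
        if op in COMPARES and has_next and instructions[i + 1].split()[0] in COND_JUMPS:
            # Two x86 instructions become a single operation for the core.
            fused.append(instructions[i] + " + " + instructions[i + 1])
            i += 2
        else:
            fused.append(instructions[i])
            i += 1
    return fused


program = [
    "mov eax, [a]",
    "cmp eax, [b]",
    "jne not_equal",
    "add ecx, 1",
    "test ecx, ecx",
    "je done",
]
fused_program = fuse(program)
for line in fused_program:
    print(line)
```

In this sketch, six instructions go in and four operations come out, which is the saving that adds up over a whole application.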


Handling L2 cache

When you have more than one core on a processor die, deciding where to place
the L2 cache is key. L1 is not touched here since it is a small cache that
holds the instructions and data being fed immediately to the processing core.
One method uses a common L2 cache for multiple cores (Intel) while the other
gives each core a dedicated cache (AMD). There are pros and cons to both
approaches.

When you go the AMD way, you have two L2 caches on a dual-core die and four of
them on a quad-core. Each core gets its own dedicated cache, but the drawback
is that when some cores are more active than others, their caches overflow and
those cores take a performance hit as they wait for fetches from main memory,
even though the caches of the less active cores have room to spare. Intel's
shared cache (what it calls an 'Advanced Smart Cache') shares one L2 cache
between two cores, so a busy core can use more of the cache when the other core
is relatively lightly loaded. But this method introduces the headache of having
a memory controller that manages the shared cache between the two cores (who
gets to put what where and so forth). The bandwidth Intel promises with its
approach peaks at 96 GB per second at 3 GHz, which works out to 32 bytes per
clock cycle. Obviously, it is hard to take a call on which one is better, as
each method suits a different scenario.
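
The trade-off between dedicated and shared caches can be sketched with a small Python simulation. This is a deliberately extreme toy case under our own assumptions: the cache sizes, the working set and the simple least-recently-used replacement policy are hypothetical, and real L2 caches are organized quite differently. One core loops over six cache lines while the other core sits idle.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that only counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1                 # would be a fetch from main memory
            self.lines[address] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used


def run(cache, working_set, passes):
    """Loop over `working_set` distinct lines `passes` times."""
    for _ in range(passes):
        for address in range(working_set):
            cache.access(address)
    return cache.hits, cache.misses


LINES_PER_CORE = 4  # hypothetical cache size per core, in lines
WORKING_SET = 6     # the busy core touches six lines in a loop
PASSES = 10

# Dedicated caches: the busy core has only its own 4 lines; the idle core's
# 4 lines go unused.
dedicated = run(LRUCache(LINES_PER_CORE), WORKING_SET, PASSES)

# Shared cache: the busy core can spill into the idle core's share (8 lines).
shared = run(LRUCache(2 * LINES_PER_CORE), WORKING_SET, PASSES)

print("dedicated (hits, misses):", dedicated)
print("shared    (hits, misses):", shared)
```

With dedicated caches the busy core misses on every access, while the shared cache misses only during the first pass. Load both cores heavily and the picture changes, since a shared cache then has to be arbitrated between them.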

In an FSB-based architecture, data hits a memory controller which routes it to the CPU, memory, etc. AMD's Direct Connect places the memory controller in the CPU, causing all data to pass through the processor, which then routes it appropriately.

Intel Smart Memory Access: When you have a shared L2 cache, as Intel has with
its Advanced Smart Cache technology, the headache of managing the cache between
two cores falls on a memory controller. This is the bundle Intel calls 'Smart
Memory Access' (SMA). Along with memory management, SMA also resolves memory
locations and adjusts internal pointers so that when the core finds an
instruction to jump elsewhere in the code, or needs an instruction or data
that has already been cached, it can be fetched directly from its L2 or main
memory location instead of being reloaded. This speeds up out-of-order
execution. The speed arises from the fact that an instruction that is
independent of the ones before it can be executed as soon as it has been
decoded, without having to wait for the preceding instructions to finish. Thus,
if a block of code loads two values and stores them elsewhere, with unrelated
code in between, the instructions can be reordered and the LOAD/STORE can be
executed independently, many clock cycles earlier.
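
The gain from out-of-order execution can be sketched with a small Python scheduler. This is a simplified model under our own assumptions: one instruction is decoded per cycle, the latencies are made-up numbers, and the out-of-order core is idealized as able to start any instruction the moment its inputs are ready. The instruction names are hypothetical.

```python
def schedule(program, out_of_order):
    """Return the cycle in which the last instruction finishes.

    `program` is a list of (name, latency, dependencies) tuples in program
    order. One instruction is decoded per cycle.
    """
    finish = {}
    prev_start = -1
    for index, (name, latency, deps) in enumerate(program):
        # Cycle at which all the inputs of this instruction are available.
        ready = max((finish[d] for d in deps), default=0)
        if out_of_order:
            # Start as soon as it is decoded and its own inputs are ready.
            start = max(ready, index)
        else:
            # In order: it also has to wait for the previous instruction
            # to start, even if the two are unrelated.
            start = max(ready, prev_start + 1)
        finish[name] = start + latency
        prev_start = start
    return max(finish.values())


# Two independent load/use pairs; latencies are illustrative only.
program = [
    ("load_a", 3, []),
    ("add_a", 1, ["load_a"]),
    ("load_b", 3, []),
    ("store_b", 1, ["load_b"]),
]

in_order_cycles = schedule(program, out_of_order=False)
out_of_order_cycles = schedule(program, out_of_order=True)
print("in order:    ", in_order_cycles, "cycles")
print("out of order:", out_of_order_cycles, "cycles")
```

In the in-order run, the second load is stuck behind the add that is waiting for the first load. Out of order, the second load starts while the first is still in flight, so the whole block finishes sooner.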

Intel's vPro technology lets IT admins create system partitions for troubleshooting, maintenance and inventory management.

AMD Direct Connect: AMD solves the problem of managing L2 cache in multi-core
or multi-processor systems by doing away with the need for an FSB and talking
directly to the various components through something called the HyperTransport
link. Taken together, this architecture is called 'Direct Connect' (DCA). Each
processor built around DCA has an integrated memory controller and is
HyperTransport enabled. Each processor with DCA is linked to specific portions
of memory. When one CPU needs to access data that is in the memory linked to
another processor, it uses HyperTransport to reach that processor and its
memory. This processor-to-processor HyperTransport linkage is called Coherent
HyperTransport.
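
With each processor owning part of the memory and reaching the rest over HyperTransport, the time taken by an access depends on where the data lives. The Python sketch below shows that in its simplest form. The latency figures, the one-hop assumption and the function names are ours, chosen purely for illustration; they are not AMD's numbers.

```python
LOCAL_LATENCY_NS = 60  # hypothetical cost of reaching a CPU's own memory
HOP_LATENCY_NS = 40    # hypothetical extra cost of one HyperTransport hop

GIB = 1024 ** 3


def owner_of(address, bytes_per_cpu):
    """Which CPU's memory controller is this address attached to?"""
    return address // bytes_per_cpu


def access_latency(cpu, address, bytes_per_cpu):
    """Latency seen by `cpu`, assuming at most one hop between processors."""
    hops = 0 if owner_of(address, bytes_per_cpu) == cpu else 1
    return LOCAL_LATENCY_NS + hops * HOP_LATENCY_NS


# Two processors, each with 1 GiB of memory attached.
local = access_latency(cpu=0, address=0x1000, bytes_per_cpu=GIB)
remote = access_latency(cpu=0, address=GIB + 0x1000, bytes_per_cpu=GIB)
print("CPU 0 reading its own memory:  ", local, "ns")
print("CPU 0 reading CPU 1's memory:  ", remote, "ns")
```

An operating system that knows about this layout can try to keep a program's data in the memory attached to the processor that runs it.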


Memory access speeds for processor transactions are boosted when you put the
controller on the processor die itself. But in a multi-processor system, when
you add more processors, you also add more memory controllers. To solve the
problems of access overlaps, violations and race conditions, AMD uses something
called NUMA (Non-Uniform Memory Access), which plays the role that SMP
(Symmetric Multi-Processing) plays on Intel's FSB-based systems. NUMA deviates
from SMP in that it is asymmetric: each processor reaches its own memory faster
than memory attached to another processor. Although NUMA and DCA were
originally meant for multiple processors rather than multiple cores, they can
be applied to a multi-core system just as well.


Multimedia instructions

With the Pentium III class of processors, we had SSE (AMD's equivalent being
3DNow!), followed by SSE2 and SSE3, which are themselves descendants of the
older MMX instruction set. All of these are SIMD (Single Instruction, Multiple
Data) instructions, and the new micro-architectures take them one step further.
128-bit SSE instructions traditionally took two clock cycles each to execute.
The new designs allow these instructions to complete in one clock, doubling
their throughput. In addition, the SSE family brought about 70 new instructions
to process packed floating-point data, control memory without cache pollution
and extend the MMX instruction set. This lets your application apply one
operation to several pieces of data at once. The expanded instruction set is
called the 'Streaming SIMD Extensions' because it works best for streaming data
(into the processor), letting you do jobs like encoding video and processing
other multimedia data faster.
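
What 'packed' data means is easy to show in Python, though Python only emulates what the hardware does in a single instruction. The sketch below packs four 32-bit floats into one 128-bit value and adds two such values lane by lane, in the manner of the SSE ADDPS instruction. The function names and the variables named after XMM registers are our own, for illustration.

```python
import struct


def pack4(values):
    """Pack four single-precision floats into one 16-byte (128-bit) value."""
    return struct.pack("<4f", *values)


def packed_add(reg_a, reg_b):
    """Emulate a packed add: one operation, four lane-wise additions."""
    a = struct.unpack("<4f", reg_a)
    b = struct.unpack("<4f", reg_b)
    return struct.pack("<4f", *(x + y for x, y in zip(a, b)))


xmm0 = pack4([1.0, 2.0, 3.0, 4.0])
xmm1 = pack4([10.0, 20.0, 30.0, 40.0])
result = struct.unpack("<4f", packed_add(xmm0, xmm1))

print("register width:", len(xmm0) * 8, "bits")
print("result:", result)
```

A video encoder doing the same arithmetic on long runs of pixels or samples gets through four values with every such instruction instead of one.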

Intel vPro

If you've heard the talk about having system-level partitions, with each
partition able to run an isolated operating system and software, and your IT
department being able to manage your system remotely using hidden or
access-restricted areas of your system while you are busy working with your
section of it, vPro is the technology that is set to make that a reality.


With vPro, your IT department can remotely monitor, diagnose and repair your
desktop even if it is switched off. The system can also send out its
configuration information, such as what cards are installed, how the BIOS has
been configured and so on. This improves asset tracking and inventory, since
nobody needs to go to each computer to audit it, install specific software or
open (desktop) firewall ports to do this. Additionally, vPro lowers power usage
by turning off processor functions when they are not in use. Yes, vPro can turn
off specific portions inside the processor when they are not being used, rather
than turning an entire core on or off.

New in virtualization

This is a big topic outside the discussion on processors as well. The reason
this is hot is well known: the more applications you can put into a single box
(usually by cramming more virtual machines into that box), the lower your cost
of ownership and running costs and the better your cost efficiency. But the
biggest limiting factor to that argument so far has been that while
virtualization has been possible in the non-x86 world for some time now, it has
been primarily software based in the x86 realm. The VMware, Xen and Virtual
Servers/PCs of the world have ruled this area. However, the performance you get
that way, when the hardware does not inherently support virtualization, is not
that great. The traditional way to boost virtual machine performance has been
to jack up the amount of RAM and the processor speed.

Intel's vPro and VT (Virtualization Technology) and AMD's AMD-V are rather big
steps towards taking the x86 into a fully virtualizable environment. We have
discussed vPro above; let's take some time to understand VT and AMD-V.


Intel VT: Codenamed 'Vanderpool', it lets virtualization software layers
control the processing actually being done inside a virtual machine. This
allows such software to monitor virtual systems and marshal their resource
(processor and memory) usage. Intel VT is in use not just in x86 desktop
processors, but also in the Itanium family (where it is called VT-i). Three new
features for the x86 version of VT (called VT-x) are: a more coordinated way of
dealing with blocked NMIs (Non-Maskable Interrupts) when VMs (the guest OS in a
VM) are exiting; virtual processor IDs, which VMMs set up and the processor
then uses to tag the address translations held in its translation buffers; and
separate page tables for each VM, instead of common memory page tables for the
host and the guest OS. The Xen hypervisor gets extra support from the Intel VT
architecture, letting it expose resources and configuration to its guest
software in a way that returns better virtualized performance. This includes
presenting all the real processor information bits (CPUID) to the guest, with
the exception of the VMX and MCA bits.

AMD-V: This is a step further from AMD's 64-bit architecture (AMD64). AMD-V
adds two modes of operation to the systems it runs on: host mode and guest
mode. Also added is a new instruction called VMRUN, which lets virtual
machines, along with the guest OS and its applications, work a little faster
inside a VM. An AMD-V processor initially boots up with its new VM capabilities
disabled (the 'guest' mode) until a compatible VMM (VM Manager) is detected.
Once such a VMM is detected, the processor switches to 'host' mode and turns on
all its capabilities. An AMD-V processor in host mode features a number of
interesting capabilities, such as setting up, at the hardware level itself, the
kinds of resources the system software is allowed to access. Using these
instructions, one can even assign different VMs exclusive access to different
resources (like network interfaces).
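
If you want to know whether a given machine has either of these extensions, Linux reports the processor's feature flags in /proc/cpuinfo, where 'vmx' stands for Intel VT-x and 'svm' for AMD-V. The Python sketch below reads such text and reports which one is present. The function name and the sample text are our own, and note that a flag being listed does not guarantee that the feature has been enabled in the BIOS.

```python
def virtualization_support(cpuinfo_text):
    """Return 'Intel VT-x', 'AMD-V' or None, going by the CPU feature flags.

    `cpuinfo_text` is text in the format of /proc/cpuinfo on Linux.
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            if "vmx" in flags:
                return "Intel VT-x"
            if "svm" in flags:
                return "AMD-V"
    return None


# Shortened, made-up sample in the style of /proc/cpuinfo.
sample = "processor\t: 0\nflags\t\t: fpu vme de pse msr sse sse2 vmx\n"
print(virtualization_support(sample))
```

On a Linux box, you would pass in the real file contents, for example `virtualization_support(open("/proc/cpuinfo").read())`.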

As we continue adding cores to processors and adding more such processors to
our computer systems, it is natural that their technologies will advance to let
multiple processor cores access what has so far been a resource dedicated to
one processor core. Next month, we will look to demystify the various processor
codenames in the multi-core domain.