Tech Explained

Why are Multicore CPUs So Hot?

PCQ Bureau

01 Jan 2007 06:35 IST


These days it is common to have processor codenames and architecture names
used with abandon. The names can confuse anyone, and after a while, it becomes
nearly impossible to tell them apart. This month, we take you through this
jargon jungle and unveil the secrets of how modern x86 processors work. There
are two parts to the story. One is the micro-architecture part, which is a
combination of multiple technologies. For instance, when Intel says 'Core 2'
or AMD says 'AM2', it denotes that a particular combination of technologies
is at work to produce a processor that can be called a 'Core 2' or an 'AM2'
product.

The second part to our story is the architecture itself, which is known by a
specific codename (say 'Conroe' or 'Manchester'). This month, we look at
technologies behind the different micro-architectures and next month, we'll
see what each combination (the architectures themselves) is about.

Direct Hit!

Applies To:
IT managers

USP: Learn how multicore processors work under the hood and
what makes them so special

Primary Link: en.wikipedia.org/wiki/Multi_core

Google Keywords: Multicore architecture

Macro Fusion

A term coined by Intel, it refers to the processor's ability to combine
certain pairs of instructions into one internal operation, making for faster
execution. The typical case is a compare followed by a conditional jump:
instead of decoding and executing the two as separate steps, Macro Fusion lets
the processor treat them as a single operation, skipping a step. Spread across
an entire application or thread, this can significantly cut execution times.
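
As a rough illustration, the toy model below counts how many internal
operations (micro-ops) a decoder would emit for a short instruction stream, with
and without fusion. Intel's published descriptions of Macro Fusion give a
compare followed by a conditional jump as the fusible pair, so the sketch uses
that case. The mnemonics are real x86 names, but the function count_uops and the
rule that every instruction otherwise costs exactly one micro-op are simplifying
assumptions, not a model of any actual decoder.

```python
# Toy model: count micro-ops with and without fusing a compare + conditional jump.
# Assumption: every instruction decodes to exactly one micro-op unless fused.

COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def count_uops(instructions, fuse=True):
    """Return the number of micro-ops for a list of instruction mnemonics."""
    uops = 0
    i = 0
    while i < len(instructions):
        current = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if fuse and current in COMPARES and nxt in COND_JUMPS:
            i += 2  # the pair travels down the pipeline as one micro-op
        else:
            i += 1
        uops += 1
    return uops


loop_body = ["mov", "add", "cmp", "jne"]
print(count_uops(loop_body, fuse=False))  # 4
print(count_uops(loop_body, fuse=True))   # 3
```

In this made-up loop body, fusion saves one micro-op out of four on every
iteration; real savings depend on how often such pairs occur in the code.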

Handling L2 cache

When you have more than one core on a processor die, selecting the right place
to have the L2 cache is key. L1 is not touched here since it is a small-sized
cache that contains instructions being immediately fed to the processing core.
One method uses a common L2 cache for multiple cores (Intel) while the other
seeks a dedicated cache location per core (AMD). There are pros and cons to both
sides of the coin.

When you go the AMD way, you have two L2 caches in a dual core die and four
of them on a quad core. While each core gets its own dedicated cache, the
drawback is that when some cores are more active than the others, their caches
overflow and those cores suffer a performance hit while they wait for fetches
from the main memory, even though the caches of the less active cores sit
underused. Intel's shared cache (what it calls an 'Advanced Smart Cache')
shares one L2 cache between two cores, letting the busier core use more of the
cache when the other core is relatively less loaded. But this method introduces
the headache of having a controller that manages the cache between the two
cores (who gets to put what where and so forth). The bandwidth Intel promises
with this approach peaks at 96 GB per second at 3 GHz. It is hard to take a call
on which one is better, as each method compensates for a different scenario.
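
The trade-off can be sketched with a toy simulation. The code below is only an
illustration under stated assumptions: the cache sizes (4 lines per private
cache, 8 lines shared), the LRU replacement policy, the workload (one busy core
cycling over 6 lines, one lightly loaded core touching 2) and all names are
invented for the example and do not describe any real Intel or AMD part.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache that only counts misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            return
        self.misses += 1
        self.lines[address] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line


def make_trace(rounds=10):
    """Hypothetical workload: a busy core and a lightly loaded core."""
    trace = []
    for _ in range(rounds):
        trace += [("busy", a) for a in range(6)]
        trace += [("idle", a) for a in range(2)]
    return trace


def total_misses(trace, cache_for_core):
    for core, address in trace:
        cache_for_core[core].access((core, address))
    return sum(c.misses for c in set(cache_for_core.values()))


trace = make_trace()

# Dedicated caches: each core owns 4 lines.
private = {"busy": LRUCache(4), "idle": LRUCache(4)}
private_misses = total_misses(trace, private)

# Shared cache: both cores use the same 8 lines.
shared_cache = LRUCache(8)
shared = {"busy": shared_cache, "idle": shared_cache}
shared_misses = total_misses(trace, shared)

print(private_misses, shared_misses)  # 62 8
```

With this lopsided workload the shared cache wins easily, because the busy core
can borrow the space the other core is not using. A workload where both cores
are equally hungry would erase that advantage, which is the point of the
paragraph above. As a side calculation, 96 GB per second at 3 GHz works out to
32 bytes per clock cycle (96 divided by 3), taking a GB as a billion bytes.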

In FSB-based architecture, data hits a memory controller which routes it to the CPU, memory, etc. AMD's Direct Connect places the memory controller in the CPU, causing all data to pass through the processor, which then routes it appropriately

Intel Smart Memory Access: When you have a shared L2 cache like Intel has
with its Advanced Smart Cache technology, the headache of managing the cache
between two cores falls on a controller. This is the bundle Intel calls 'Smart
Memory Access' (SMA). Along with memory management, SMA also resolves memory
locations and adjusts internal pointers so that when the core finds an
instruction to jump elsewhere in the code, or needs an instruction or data
that has already been cached, it can be fetched directly from the L2 cache
instead of being reloaded from the main memory. This speeds up out-of-order
execution. The speed arises from the fact that an instruction that does not
depend on the instructions before it can be executed as soon as it has been
decoded, without having to wait for the earlier instructions to finish. Thus,
if a block of code loads two values and stores them elsewhere, and the code in
between does not touch those values, the instructions can be reordered and the
LOAD/STORE can be executed independently, many clock cycles earlier.
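
The benefit of out-of-order execution can be seen in a small sketch. The model
below is deliberately crude and every detail in it is an assumption made for
illustration: instructions are tuples of (destination, sources, latency), loads
take 3 cycles, arithmetic takes 1, there are unlimited execution units, and
register renaming and memory ordering are ignored.

```python
def finish_time(program, out_of_order):
    """Return the cycle at which the last instruction completes."""
    ready = {}         # register name -> cycle at which its value is available
    previous_done = 0  # completion time of the preceding instruction
    last = 0
    for dest, sources, latency in program:
        # An instruction can never start before its inputs are ready.
        start = max((ready.get(s, 0) for s in sources), default=0)
        if not out_of_order:
            # In-order model: also wait for the previous instruction to finish.
            start = max(start, previous_done)
        done = start + latency
        ready[dest] = done
        previous_done = done
        last = max(last, done)
    return last


# Two independent load-then-add chains (hypothetical registers r1..r4).
program = [
    ("r1", [], 3),      # load a
    ("r2", ["r1"], 1),  # r2 = r1 + 1
    ("r3", [], 3),      # load b, independent of the first chain
    ("r4", ["r3"], 1),  # r4 = r3 + 1
]

print(finish_time(program, out_of_order=False))  # 8
print(finish_time(program, out_of_order=True))   # 4
```

Because the second load does not depend on the first chain, the out-of-order
model starts it at cycle 0 instead of cycle 4, and the whole block finishes in
half the time.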

Intel's vPro technology lets IT admins create system partitions for troubleshooting, maintenance and inventory management

AMD Direct Connect: AMD solves the problem of managing L2 cache in multi-core
or multi-processor systems by doing away with the need for an FSB and talking
directly to the various components through the HyperTransport link. Taken
together, this architecture is called 'Direct Connect' (DCA).
Each processor built around DCA has an integrated memory controller and is
HyperTransport enabled. Each processor with DCA is linked to specific portions
of memory. When one CPU needs to access data that is in the memory linked to the
other processor, it will use HyperTransport to link to that processor/memory.
This HyperTransport linkage is called Coherent HyperTransport.

Memory access speeds for processor transactions are boosted when you put the
controller on the processor die itself. But in a multi-processor system, when
you add more processors, you also add more memory controllers. To deal with the
resulting access overlaps, violations and race conditions, AMD uses NUMA
(Non-Uniform Memory Access), which plays the role that SMP (Symmetric
Multi-Processing) plays on shared-bus systems. NUMA deviates from SMP in that
it is asymmetric: a processor reaches its own memory faster than memory
attached to another processor. Although the original use for NUMA and DCA
involves multiple processors rather than cores, it can be applied to a
multi-core system just as easily.
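
A small sketch can make the 'non-uniform' part concrete. The cost figures below
are arbitrary units chosen for illustration, not measured HyperTransport
latencies, and the function names and the even split of memory between
processors are likewise assumptions.

```python
LOCAL_COST = 1.0   # hypothetical cost of reaching memory on the CPU's own controller
REMOTE_COST = 1.5  # hypothetical cost when the request crosses to another CPU


def home_node(address, bytes_per_node):
    """Which processor's memory controller owns this address."""
    return address // bytes_per_node


def access_cost(cpu, address, bytes_per_node):
    if home_node(address, bytes_per_node) == cpu:
        return LOCAL_COST
    return REMOTE_COST


def average_cost(cpu, addresses, bytes_per_node):
    costs = [access_cost(cpu, a, bytes_per_node) for a in addresses]
    return sum(costs) / len(costs)


# CPU 0 owns addresses 0..1023, CPU 1 owns 1024..2047.
addresses = [0, 100, 1024, 2000]
print(average_cost(0, addresses, bytes_per_node=1024))  # 1.25
```

The practical consequence is that software runs best when each processor works
mostly on data held in its own memory, which is why NUMA-aware operating
systems try to keep a program and its memory on the same node.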

Multimedia instructions

With the Pentium III class of processors, we got SSE (AMD's counterpart being
3DNow!), followed by SSE2 and SSE3, all of which build on the older MMX
instruction set. These are SIMD (Single Instruction, Multiple Data) instruction
sets, where one instruction operates on several packed data values at once.
128-bit SSE instructions traditionally took two clock cycles each to execute.
Newer micro-architectures such as Intel's Core allow these instructions to
complete in one clock, doubling their throughput. SSE itself added about 70 new
instructions to process packed floating-point data, control memory without
cache pollution and extend the MMX instruction set. This lets your application
apply the same operation to several data elements in parallel, which is not the
same thing as running parallel threads with independent control flow. The
instruction set is called 'Streaming SIMD Extensions' because it works best for
data streaming into the processor, which makes video encoding and other
multimedia processing faster.
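
To make the idea of a packed operation concrete, the sketch below uses Python's
struct module to treat 16 bytes as a 128-bit register holding four 32-bit
floats, and adds two such registers lane by lane. This is roughly what the SSE
instruction ADDPS does in hardware. The helper names (pack4, packed_add,
clocks_needed) and the instruction count of 1,000 are made up for illustration;
only the two-clocks-versus-one figure comes from the text above.

```python
import struct


def pack4(values):
    """Pack four numbers into 16 bytes, i.e. one 128-bit register image."""
    return struct.pack("<4f", *values)


def packed_add(a, b):
    """Add two packed registers lane by lane (a software stand-in for ADDPS)."""
    lanes_a = struct.unpack("<4f", a)
    lanes_b = struct.unpack("<4f", b)
    return struct.pack("<4f", *(x + y for x, y in zip(lanes_a, lanes_b)))


def clocks_needed(num_instructions, clocks_per_instruction):
    return num_instructions * clocks_per_instruction


reg_a = pack4([1.0, 2.0, 3.0, 4.0])
reg_b = pack4([10.0, 20.0, 30.0, 40.0])
print(len(reg_a) * 8)                                 # 128
print(struct.unpack("<4f", packed_add(reg_a, reg_b)))  # (11.0, 22.0, 33.0, 44.0)

# 1,000 packed instructions at two clocks each versus one clock each.
print(clocks_needed(1000, 2), clocks_needed(1000, 1))  # 2000 1000
```

One packed add replaces four scalar adds, and halving the clocks per packed
instruction halves the total again.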

Intel vPro

If you've heard the talk about having system level partitions with each
partition being able to run isolated operating systems and software, and your IT
department being able to manage your system remotely using hidden or access
restricted areas of your system while you were busy working with your section of
it... vPro is the technology that is set to make that a reality.

With vPro, your IT department can remotely monitor, diagnose and repair your
desktop even if it is switched off. Also, the system can send out its
configuration information like what cards are installed, how the BIOS has been
configured and so on. This also improves asset tracking and inventory, since
administrators no longer need to go to each computer to audit this, nor do they
need to install specific software or open (desktop) firewall ports to do so.
Additionally,
vPro also has the benefit of lower power usage by turning off processor
functions when they are not in use. Yes, vPro can turn off specific portions
inside the processor when they are not being used rather than turning an entire
core on or off.

New in virtualization

This is a big topic outside the discussion on processors as well. The reason
this is hot is well known: the more applications you can put into a single box
(usually by cramming more virtual machines into that box), the lower your cost
of ownership and running costs, and the better your cost efficiency. But the
biggest limiting factor to that argument so far has been that while
virtualization has been possible in the non-x86 world for some time now, it has
been primarily software-based in the x86 realm. The VMware, Xen and Virtual
Server/PC products of the world have ruled this area. However, the performance
you get this way, on hardware that does not inherently support virtualization,
is not that great. The traditional way to boost virtual machine performance has
been to jack up the amount of RAM and the processor speed.

Intel's vPro and VT (Virtualization Technology) and AMD's AMD-V are
rather big steps towards taking the x86 into a fully virtualizable environment.
We have discussed vPro above; let's take some time to understand VT and AMD-V.

Intel VT: Codenamed 'Vanderpool', it provides hardware support that lets
virtualization software layers control the processing actually being done
inside a virtual machine. This allows such software to monitor virtual systems
and marshal their resource (processor and memory) usage. Intel VT is in use not
just in x86 processors (where it is called VT-x), but also in the Itanium
family (where it is called VT-i). Three new features for VT-x are: a more
coordinated way of dealing with blocked NMIs (Non-Maskable Interrupts) when VMs
(guest OSs in a VM) are exiting; virtual processor IDs, which VMMs (Virtual
Machine Monitors) set up and which are then used to tag address translations
held in the translation lookaside buffer; and separate page tables for guest
VMs, instead of common memory page tables for host and guest OSs. The Xen
Hypervisor gets extra support from the Intel VT architecture, letting it expose
resources and configuration to its guest software for better virtualized
performance. This includes presenting all the real processor information bits
(CPUID) to the guest, with the exception of the VMX and MCA bits.
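
The virtual processor IDs mentioned above are, going by Intel's public
descriptions, used to tag entries in the translation lookaside buffer (TLB),
the small cache of recent address translations. The toy model below shows why
that helps: without tags, the TLB has to be emptied every time the processor
switches between VMs. The class, the VM names and the workload (two VMs, four
pages each, five rounds of switching) are all invented for the illustration.

```python
class TLB:
    """Toy translation cache, optionally tagged with a virtual processor ID."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = {}
        self.current_vm = None
        self.misses = 0

    def switch_to(self, vm_id):
        if not self.tagged and vm_id != self.current_vm:
            # Untagged entries would be wrong for the next VM, so flush them.
            self.entries.clear()
        self.current_vm = vm_id

    def translate(self, virtual_page):
        key = (self.current_vm, virtual_page) if self.tagged else virtual_page
        if key not in self.entries:
            self.misses += 1
            # Stand-in for a real page-table walk.
            self.entries[key] = ("frame", self.current_vm, virtual_page)
        return self.entries[key]


def count_misses(tagged, rounds=5, pages=4):
    tlb = TLB(tagged)
    for _ in range(rounds):
        for vm in ("vm_a", "vm_b"):
            tlb.switch_to(vm)
            for page in range(pages):
                tlb.translate(page)
    return tlb.misses


print(count_misses(tagged=False))  # 40
print(count_misses(tagged=True))   # 8
```

In the untagged case every switch throws away translations that would have been
useful a moment later; with tags, each VM's entries survive until it runs again.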

AMD-V: This is a step further from AMD's 64-bit architecture
(AMD64). AMD-V adds two modes of operation to the systems it runs on: Host mode
and Guest mode. Also added is a new instruction called VMRUN, which a VMM
(Virtual Machine Monitor) uses to start running a guest, helping the guest OS
and its applications work a little faster inside a VM. An AMD-V processor
initially boots up with its new VM capabilities disabled until a compatible VMM
enables them. The VMM itself then runs in 'host' mode, and the VMs it launches
run in 'guest' mode. In host mode, AMD-V offers a number of interesting
capabilities, like setting up at the hardware level itself the kinds of
resources the guest software is allowed to access. Using these, one can even
assign different VMs exclusive access to different resources (like network
interfaces).
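
If you want to check whether a machine has either technology, Linux reports it
in the 'flags' line of /proc/cpuinfo: 'vmx' indicates Intel VT-x and 'svm'
indicates AMD-V. The helper below is a minimal sketch; the function name and
the sample text are made up for the example, and a present flag does not
guarantee that the feature has been enabled in the BIOS.

```python
def virtualization_support(cpuinfo_text):
    """Return 'Intel VT-x', 'AMD-V' or None from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    if "vmx" in flags:
        return "Intel VT-x"
    if "svm" in flags:
        return "AMD-V"
    return None


# Hypothetical, heavily shortened sample of what the file might contain.
sample = "model name\t: Example CPU\nflags\t\t: fpu sse sse2 vmx\n"
print(virtualization_support(sample))  # Intel VT-x
```

On a real Linux system you would pass in the contents of the file instead, for
example virtualization_support(open("/proc/cpuinfo").read()).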

As we continue adding cores to processors and add more of such processors to
our computer systems, it is natural that their technologies will advance to let
multiple processor cores access what so far has been a resource dedicated to one
processor core. Next month, we will be looking to demystify the various
processor code names in the multi-core domain.
