The next generation of processors to power our rich-media devices and applications is already here. In this article, we demystify what lies inside it.
On November 29, 2005, three top corporate guns (Sony,
Toshiba and IBM) got together to announce the specifications of the first
implementation of the Cell BE (Broadband Engine) processor.
Codenamed Mambo, the Cell Broadband Engine is an
architecture jointly developed by the three companies and aimed at cutting-edge,
processor-intensive applications such as multimedia, video and image processing.
However, that is not all it can do. The name 'Cell Broadband Engine', or Cell BE
as it is called, is a trademark of Sony Computer Entertainment. To start off,
let us see what is inside the processor that makes it so special.
What's inside it?
You're about to learn a few new acronyms, but let's take it one step at
a time. First up, the Cell BE belongs to the Power Architecture family. The
basic design (in layman's terms) of the Cell BE CPU is similar to that of modern
multi-core processors. Only, in the Cell BE's case, you have two kinds of cores,
the PPE and the SPE. Each Cell BE has eight SPEs but only one PPE. The easiest
way to explain the difference is the table given below.
Feature | SPE | PPE |
Name | Synergistic Processing Element | POWER Processing Element |
Function | Computation | Running the OS, arbitration between SPEs |
Cache | 256 KB local store | 512 KB L2 |
Instruction issue | Dual | Dual |
Order | In-order | In-order with limited out-of-order for LOAD |
Elements within the processor, such as the various PEs and the
memory controller, are interconnected by the EIB (Element Interconnect Bus).
This bus gives a maximum bandwidth of 204.8 GB per second and runs at half the
speed of the processor. For memory, the processor has a built-in controller for
XDR memory, which can exchange data at 25.6 GB per second with external XDR
memories. For I/O, the processor includes an integrated I/O controller that
provides peak bandwidths of 25 GB per second inbound and 35 GB per second
outbound.
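The EIB's peak figure can be sanity-checked with a little arithmetic. The sketch below assumes a 3.2 GHz processor clock and one 128-byte transfer started per bus cycle; neither number is stated in this article, so treat both as assumptions rather than specifications.

```python
# Back-of-the-envelope check of the EIB peak bandwidth quoted above.
# ASSUMPTIONS (not from the article): a 3.2 GHz core clock, and one
# 128-byte transfer started per bus cycle.
CORE_CLOCK_GHZ = 3.2        # assumed
BYTES_PER_BUS_CYCLE = 128   # assumed


def eib_peak_gb_per_s(core_clock_ghz=CORE_CLOCK_GHZ,
                      bytes_per_cycle=BYTES_PER_BUS_CYCLE):
    """Peak EIB bandwidth in GB per second under the assumptions above."""
    bus_clock_ghz = core_clock_ghz / 2   # the bus runs at half the processor speed
    return bus_clock_ghz * bytes_per_cycle


print(round(eib_peak_gb_per_s(), 1))
```

Under those assumptions the result matches the 204.8 GB per second peak; a slower clock would scale the figure down proportionally.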
The PPE is similar to the PowerPC processors, including the
G5 (PowerPC 970), because of its AS architecture. It also uses AltiVec (VMX)
instructions to parallelize arithmetic operations, and supports simultaneous
multithreading. Each SPU (the processing unit inside an SPE) can use the same
instructions to perform both 32-bit scalar and 128-bit vector processing. There
are no memory controllers or instruction/data caches in the SPU. But it can
access any 128-bit location in its local memory at L1 speeds.
One piece of good news for developers is that this is the end of
their pointer hell. Pointer data gets aligned and truncated to the size of the
local memory within the SPU. This means 'access violation' errors can no longer
be generated. Quite a lot of otherwise common CPU errors are not even possible
on the SPU because of its design.
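One way to picture the truncation and alignment is as a pair of masking steps. The sketch below is only a model: it assumes the 256 KB local memory from the table and 16-byte (128-bit) accesses, and the function name is made up for illustration rather than taken from any SDK.

```python
LOCAL_STORE_BYTES = 256 * 1024   # 256 KB local memory per SPE, from the table
QUADWORD_BYTES = 16              # one 128-bit access


def effective_local_address(addr):
    """Model: wrap addr into the local memory, then align it down to 16 bytes."""
    wrapped = addr % LOCAL_STORE_BYTES            # truncate: high bits are dropped
    return wrapped - (wrapped % QUADWORD_BYTES)   # align: low 4 bits are dropped


for addr in (0x0, 0x1234F, 0x40000, 0xFFFFFFFF):
    print(hex(addr), "->", hex(effective_local_address(addr)))
```

In this model every possible pointer value maps to some valid 128-bit location, which is why nothing can fault. The flip side is that a stray pointer silently touches a different part of the local memory instead of raising an error.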
Let's now go ahead and examine some of its other elements
in some detail.
Data transfer
As we said earlier, the EIB is what connects everything in the processor.
It is actually a combination of a bus and some data storage arranged in four
rings. Each ring can hold 16 bytes of data at a time and allows up to three
simultaneous non-overlapping read/write operations, and all 16 bytes can be
sent or received at a time. By non-overlapping, we mean that if the data
requested lies discontinuously in a ring and another request is reading
the data that lies in between, the second request will have to wait till the
first one is done. Because of all this, in real-life conditions, sustained EIB
bandwidth lies between 78 GB per second and 197 GB per second.
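The non-overlap rule can be modelled in a few lines. This is a deliberately simplified sketch, not the real arbiter: it assumes 12 stops on a ring and one-directional travel, and it treats a transfer as occupying every segment between its source and destination. All names here are hypothetical.

```python
RING_STOPS = 12              # ASSUMED number of units attached to a ring
MAX_TRANSFERS_PER_RING = 3   # limit quoted in the text


def hops(src, dst, stops=RING_STOPS):
    """Ring segments used by a clockwise transfer from src to dst."""
    segments = set()
    node = src
    while node != dst:
        segments.add(node)           # segment from node to the next stop
        node = (node + 1) % stops
    return segments


def schedule(requests):
    """Greedily admit transfers onto one ring; return (granted, waiting)."""
    busy, granted, waiting = set(), [], []
    for src, dst in requests:
        path = hops(src, dst)
        if len(granted) < MAX_TRANSFERS_PER_RING and not (path & busy):
            busy |= path             # reserve the segments for this transfer
            granted.append((src, dst))
        else:
            waiting.append((src, dst))
    return granted, waiting


granted, waiting = schedule([(0, 3), (2, 5), (4, 7), (8, 10), (10, 11)])
print("granted:", granted)   # granted: [(0, 3), (4, 7), (8, 10)]
print("waiting:", waiting)   # waiting: [(2, 5), (10, 11)]
```

Here the transfer from 2 to 5 waits because it shares a segment with the one from 0 to 3, while the transfer from 10 to 11 waits only because the ring already carries three transfers. Contention of this kind is one reason sustained bandwidth falls below the peak.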
The I/O system is a set of Rambus RRAC FlexIO links and has
12 byte-wide channels. Of these, seven are for outbound and five are for inbound
data. Both data and commands are transmitted through the I/O system as packets.
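The channel counts line up with the peak I/O bandwidths quoted earlier (35 GB per second outbound, 25 GB per second inbound): dividing each direction's bandwidth by its number of channels gives the same per-channel rate. A minimal check, using only figures from this article:

```python
# Peak I/O bandwidth and channel counts per direction, as quoted in the article.
LINKS = {
    "outbound": {"lanes": 7, "peak_gb_per_s": 35.0},
    "inbound": {"lanes": 5, "peak_gb_per_s": 25.0},
}


def per_lane_rate(direction):
    """Peak bandwidth of one byte-wide channel, in GB per second."""
    link = LINKS[direction]
    return link["peak_gb_per_s"] / link["lanes"]


for direction in LINKS:
    print(direction, per_lane_rate(direction), "GB/s per channel")
```

Both directions work out to 5 GB per second per channel.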
Memory
There are two XIO channels available for use, and together they can hold up to
512 MB of XDR DRAM. XDR is among the fastest memory technologies currently
available, faster than both DDR and DDR2. These interfaces are based on Rambus
technology. This memory is external to the processor and is connected through a
memory interface controller. The controller can handle read/write queues to the
memory on either channel separately and prioritizes the requests too. There is
also a capability to directly write data to the memory in chunks over 16 bytes
but less than 128 bytes. The raw memory speed works out to 25.6 GB per second at
peak, before factors like refresh cycles bring it down.
Resource allocation logic ensures that access to critical resources is
controlled and time-critical applications are not affected by race conditions.
Spufs
Spufs is an IBM research project that is quite interesting in its
applications. What the IBM team has done is expose the SPEs within the Cell BE
processor as a virtual file system, similar to procfs. This lets applications
access and use the resources within the CPU the way they would use files in a
normal file system.
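The 'hardware as files' idea can be illustrated without any Cell hardware. The sketch below fakes one SPE as a directory inside a temporary folder, with a `mem` file standing in for its 256 KB local memory. The directory layout and the helper names are assumptions made for illustration; this is not the real spufs interface.

```python
import os
import tempfile

LOCAL_STORE_BYTES = 256 * 1024   # SPE local memory size, from the table


def make_mock_context(root, name):
    """Create a stand-in directory for one SPE (a mock, NOT real spufs)."""
    ctx = os.path.join(root, name)
    os.mkdir(ctx)
    with open(os.path.join(ctx, "mem"), "wb") as f:
        f.write(bytes(LOCAL_STORE_BYTES))   # zero-filled "local memory"
    return ctx


def poke(ctx, offset, data):
    """Write bytes into the mock local memory using ordinary file I/O."""
    with open(os.path.join(ctx, "mem"), "r+b") as f:
        f.seek(offset)
        f.write(data)


def peek(ctx, offset, length):
    """Read bytes back from the mock local memory."""
    with open(os.path.join(ctx, "mem"), "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as root:
    ctx = make_mock_context(root, "spe0")
    poke(ctx, 0x100, b"hello")
    print(peek(ctx, 0x100, 5))
    print(os.path.getsize(os.path.join(ctx, "mem")))
```

The appeal of the approach is that open, seek, read and write are all an application needs to know; no special-purpose API is required to reach the processor's resources.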
Applications of Cell BE
Some of the currently envisaged applications for Cell BE include powering
the Sony PlayStation 3 console. But it can also be put to use in computational
areas such as MPEG-2 video decoding, rendering advanced graphics and
cryptography.
What's available?
The Barcelona Supercomputing Center (www.bsc.es) hosts Cell BE specific
packages collected from a variety of sources, including patches for the Linux
kernel (v2.6.14), the GCC toolchain, SDKs and more. The Cell BE platform has
been included in the Linux kernel tree as a new platform under the ppc64
architecture. Hardware-wise, Cell BE based blade servers have been
available since October 2005. There are also simulator and emulator
kits available for this platform from IBM.
The processor can run both 32- and 64-bit software. It can
already run the 64-bit PowerPC Linux kernels and supports ELF binaries. It
is also expected to run all PowerPC applications out of the box.
Sujay V Sarma