March 10, 2007

Thought graphics cards could do only graphics? Think again! If you have followed
the action in the x86 graphics processing arena over the past six months or so,
you will have noticed talk of something called ‘Stream Computing.’ To be
truthful, stream computing by itself is nothing new. It means taking a process
with multiple threads that can be executed independently of each other, and
executing them in one go on multiple processors in parallel. How is this
related to graphics processors? We will get into the details shortly, but
consider this: a modern graphics processor (GPU) contains up to 128 parallel
processing units called ‘pixel shaders.’ Individually, these pixel shaders
cannot do anything fantastic. But take a bunch of them, run parallel threads
through them, and you get much better compute performance.

This has value in applications that have a large pool of data to which
the same instruction needs to be applied. Applications that require this sort
of computing include protein sequence analysis and the mapping of geographical
data from satellites. So how does stream computing on graphics processors
affect you? Until now, stream computing required special stream computing
machines (much like supercomputers) and was restricted to high-budget research
institutions. But now that graphics cards have this spare computing capacity
within them, it can be harnessed for the same effect at a much lower cost. And
vendors of such cards (like AMD/ATi and NVIDIA) have come up with mechanisms
that let software developers produce stream computing applications that run on
an off-the-shelf workstation with nothing more powerful than a modern graphics
card. So maybe your next payroll processing engine will need the latest
high-end gaming card, instead of the latest in server clusters, to do its job
faster.

Direct Hit!

Applies To: IT managers
USP: Learn about GPUs and how they are harnessed for other
computing applications
Primary Link:

Google Keywords: gpgpu, stream computing

Inside a modern GPU
Inside the GPU on any modern graphics card you pick up are specialized
processors (let’s call them ‘cores’) called ‘vertex’ and ‘fragment’ shaders. A
vertex shader operates on a vertex of any polygon being considered for
rendering; for instance, it would determine the position and color of that
vertex. A fragment shader works on sets of pixels and generates, for example,
lighting effects on those pixels. Different shader models, evolving in step
with successive DirectX specifications, have steadily expanded what GPUs can
do. For instance, higher floating-point precision, called double precision and
involving 64-bit values instead of the traditional 16- or 32-bit ones, has
enabled a lighting paradigm called HDR (High Dynamic Range) that allows for
more realistic shadows and effects.

Consider a GPU with 8 layers of 8 fragment shaders each. That makes 64 cores,
and the group can render 64 fragments simultaneously. This sort of processing
is, of course, SIMD (Single Instruction Multiple Data), which we discussed in
an earlier part of this series. To sum up, SIMD is an operation where you apply
a single instruction (or transformation) to multiple data elements; increasing
the brightness of a set of rendered pixels, say, is SIMD. Let’s take SIMD a
step further.

A GPU can be used to
perform non-graphics tasks by sending those threads of a multi-threaded
application that can be executed independently of each other to pixel-shader
cores in the GPU, using a special instruction helper framework

Stream processing
Because these cores operate independently, they cannot share any data, so
there are two types of memory locations associated with them: ‘Texture’
memory, which can only be read from, and the ‘Framebuffer’, which can only be
written to. Of course, data in the Framebuffer can be routed back to Texture
memory to serve as input for another stage of processing before it is sent to
the screen. Since all the data in Texture memory has the same instructions
applied to it, modern parlance calls it a ‘Stream’. Applying the shader logic
to this stream is called ‘Stream Processing’, and the logic itself is called a
‘Kernel’. Related terms are DPP (Data Parallel Processing) and GPGPU (General
Purpose computing on Graphics Processing Units).

Regular GPU-based stream computing platforms use identical hardware (that is,
graphics cards with identical capabilities). But this need not be the rule;
dissimilar hardware can also be used, which AMD calls ‘Asymmetric processing’.
ATi’s cards can be used this way to handle different tasks depending on the
capabilities of each card plugged in: one card can do physics, for instance,
while the other renders. Of course, in such a system it is not easy to scale
up performance by simply adding more cards, as this only adds to the system’s
asymmetry. Another problem is distributing the workload to optimize
performance. So asymmetric processing is best used by an application with
multiple independent tasks that need no direct communication amongst them.
The GPU contains sets of
Pixel Shader cores which can execute parallel threads very quickly

Developer support
Both ATi and NVIDIA have released SDKs (software development kits) to let
developers build applications that make use of GPUs. The problem is that
traditional compilers cannot easily optimize for graphics-related processing.
NVIDIA’s solution is called CUDA (Compute Unified Device Architecture), while
ATi’s is called CTM (Close To the Metal). CUDA is essentially a C-like
compiler for GPU applications; its engine provides a thread-controller
architecture that takes care of thread management for applications that stream
through the GPU. CTM, on the other hand, is more of a hardware driver that
exposes an API which GPU applications can call to achieve similar results.

Applications that want to harness the power of stream computing through the
GPU must use CTM or CUDA to route their code through it. CTM or CUDA then
arbitrates which code runs on the GPU and which on the regular CPU.

Sadly, there is no vendor-agnostic way to let applications run seamlessly
regardless of which of the two vendors supplied your graphics card. For now,
early adopters will have to be content with building single-vendor stream
computing infrastructure, which may not be the best-of-breed or the most
cost-effective solution around. Clearly, much work remains to be done, and it
is far too early to declare this a success or write it off as too complicated.

Applications of Stream Computing

The technology behind stream computing can be
used in a variety of applications where you have a large pool of data, each
element of which requires the same set of instructions, independent of the
others. In developer parlance: stream computing is best suited for massively
multi-threaded applications with parallel threads. Application areas include
protein folding analysis in biotechnology; seismic analysis for oil and gas
exploration; signal processing for defense services; simulations and
forecasting models in the financial industry; and face and speech
recognition. Folding@Home, a distributed computing project run by Stanford
University, has been around since 2000, using idle time on traditional CPUs
to crunch through the complicated problem of analyzing protein folding
patterns and their effects. In September 2006, the project announced a client
that uses the stream computing cores in ATi GPUs to become faster. At present
the project is helping the medical world understand Alzheimer’s disease,
cancer, Huntington’s disease, Osteogenesis Imperfecta, Parkinson’s disease,
and the relationships between various ribosomes and antibiotics.
