In this age of supercomputers and parallel computing, many software developers and researchers who do computationally intensive work struggle to exploit the full potential of their hardware. But with several advances in technology, things are changing at a rapid pace. With the paradigm shifting towards parallel computing, the focus is on solving difficult scientific and engineering problems by dividing them into smaller ones. Parallel computing has long been considered the high end of computing, and the HPC market has recently seen a surge of interest centred on modern GPUs. Enterprises are developing new techniques that offer tremendous potential for performance and efficiency, so developers must be able to adapt to the evolving capabilities of GPUs. Meanwhile, the general-purpose processor is itself evolving, offering a more flexible architecture. With the advent of advanced GPUs, developers have started to explore applications beyond rendering in 3D apps. In this article we will look at how GPUs have evolved and how NVIDIA and ATI are taking GPU computing to the next level.
Supercomputing capabilities
Refreshing our memories: the Central Processing Unit (CPU) has been known as the brain of the PC. It is the part where all the logical work is done, and its job is to execute a sequence of instructions one after another. Earlier, CPUs were delegated to perform all the basic functions and handle the complex computations; the CPU also did all the traditional graphics processing before GPUs were included within display adapters. CPUs are designed with several goals in mind: high performance and throughput for single and multiple threads, low power consumption, and lower cost for the same performance level. A CPU is composed of only a few cores, each designed to execute a wide variety of instructions. Originally, GPUs were hardware blocks optimized for graphics use, but as the technology advanced they became more flexible and more programmable. The graphics processing unit, also known as the visual processing unit (VPU), goes well beyond a basic graphics controller: a GPU is a programmable, high-end computational device that can be harnessed more broadly for crunching difficult tasks at a greater pace.
Caching, architectural differences
A cache is nothing but a small, faster memory that holds copies of the data a CPU needs repeatedly. Cache memories increase the performance of the CPU by reducing memory access latencies: with the help of large caches, the latency of main memory can be hidden. In GPUs, by contrast, the cache is used mainly to increase effective memory bandwidth; the interesting point is that GPUs hide memory access latency by executing numerous threads simultaneously, so that while one thread waits for data, others keep the execution units busy. With this ability, a GPU can accelerate a suitable application many times over a CPU. If you take a look inside a GPU, there are special processors (cores) called 'vertex' and 'fragment' shaders. The basic function of a vertex shader is to apply special effects to objects in a 3D environment; a vertex is location-dependent and is defined by x, y and z coordinates. A fragment shader computes the lighting effect on pixels. A GPU receives a set of polygons, performs all the necessary operations on them, and outputs pixels. Because pixels can be processed in parallel, unlike the sequential instruction stream on a CPU, a GPU uses a large number of execution units; CPUs, on the other hand, are optimized for high performance on a single instruction stream processing integer and floating-point numbers. Memory operations also differ between GPUs and CPUs: GPUs contain several memory controllers, while some CPUs have the controllers built in, and the higher bandwidth relevant to parallel computation is more easily available to graphics cards. Considering multi-threaded operation, CPUs use SIMD vector units, while GPUs use SIMT for scalar thread processing. A fine example for which GPU computing is perfectly suited is molecular modelling, which requires high processing power.
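The SIMT idea can be sketched in CUDA C: the programmer writes plain scalar, per-thread code (branches included), and the hardware runs groups of threads in lock-step. A minimal sketch; the kernel name and the clamp-and-scale operation are only illustrative:

```cuda
// SIMT in CUDA C: each thread handles one array element with
// ordinary scalar code; the hardware executes warps of threads
// that share the same instruction.
__global__ void clampScale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) {
        float v = in[i];
        // Per-thread control flow is allowed under SIMT; threads in a
        // warp that diverge are simply serialized by the hardware.
        out[i] = (v < 0.0f) ? 0.0f : v * 2.0f;
    }
}
```

Contrast this with CPU SIMD, where one instruction explicitly operates on a whole vector register and the programmer (or compiler) must pack the data into vectors.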
NVIDIA's CUDA
CUDA, developed by NVIDIA, is a parallel computing architecture that increases computing performance. This computing engine in NVIDIA GPUs enables software developers, researchers and scientists to perform complex computational tasks. Developers program the CUDA architecture using the C programming language. GPUs with the CUDA architecture consist of cores that enable hundreds of computing threads to run collectively. CUDA has been widely used in scientific research and programming; some of the key areas are fluid dynamics simulation, computational biology and chemistry, ray tracing and more. NVIDIA supports CUDA on the following GPUs: GeForce, Ion, Tesla and Quadro.
CUDA supports heterogeneous computation, wherein the serial portions of an application run on the CPU and the parallel portions on the GPU; in this manner the capabilities of both CPU and GPU are utilized. The configuration is designed keeping in mind that each has its own memory space, so the two are treated as separate devices, which allows simultaneous computation on both.
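This serial/parallel split can be seen in the classic CUDA C vector-addition example. A minimal sketch (buffer names are illustrative): the CPU prepares data in its own memory, copies it to the GPU's separate memory, launches the parallel kernel, and copies the result back.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Parallel portion: one GPU thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Serial portion runs on the CPU: prepare inputs in host memory.
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // CPU and GPU have separate memory spaces, so device buffers must
    // be allocated and the data copied across explicitly.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch the parallel portion: enough 256-thread blocks to cover n.
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", hc[10]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

While the kernel runs, the CPU is free to do other work; the copy at the end is what synchronizes the two devices.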
Fermi: GPU computing architecture
Fermi, incorporating billions of transistors and featuring up to 512 CUDA cores, is the latest buzz. Fermi incorporates new features, including support for error-correcting code (ECC) memory. Apart from ECC, the addition of 512 cores also enhances floating-point performance, and Fermi supports GDDR5 memory with an increased memory reach of up to one terabyte. Fermi is available with a Visual Studio development environment and targets a number of areas, among them ray tracing, physics, sorting and search algorithms, finite element analysis and more. The innovations Fermi features include NVIDIA's Parallel DataCache and GigaThread engine, the 512 cores and ECC support. Each Fermi multiprocessor consists of cores (stream processors) arranged in two groups of sixteen. Each core executes a single thread's instructions in a sequential manner, while the cores as a group execute in a Single Instruction, Multiple Thread (SIMT) fashion. Shared among the cores and attached to each multiprocessor is a small software-managed data cache that NVIDIA has termed 'shared memory'. This indexable memory runs at register speed and offers low latency and high bandwidth. The 64 KB of on-chip memory on Fermi offers flexibility and can be configured in two ways: as a 48 KB software-managed data cache with a 16 KB hardware cache, or as a 16 KB software-managed data cache along with a 48 KB hardware cache.
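The CUDA runtime exposes this split as a per-kernel preference via cudaFuncSetCacheConfig. A minimal sketch, assuming a hypothetical kernel that stages a tile of data in shared memory (the kernel must be launched with 256-thread blocks for the tile size used here):

```cuda
#include <cuda_runtime.h>

// A kernel that stages a tile in the software-managed 'shared memory'
// before operating on it.
__global__ void tileSum(const float *in, float *out)
{
    __shared__ float tile[256];       // lives in the 64 KB on-chip memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                  // wait until the whole tile is filled
    out[i] = tile[threadIdx.x] + tile[(threadIdx.x + 1) % 256];
}

int main()
{
    // Request the 48 KB shared memory / 16 KB hardware cache split for
    // this kernel; cudaFuncCachePreferL1 requests the opposite 16/48 split.
    cudaFuncSetCacheConfig(tileSum, cudaFuncCachePreferShared);
    // ... allocate device buffers and launch tileSum as usual ...
    return 0;
}
```

Kernels that stage lots of data by hand benefit from the larger shared memory, while kernels with irregular access patterns tend to prefer the larger hardware cache.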
NVIDIA's GigaThread engine: This is a new technology designed by NVIDIA that allows thousands of threads to be scheduled and executed in parallel. The GigaThread engine also supports bi-directional data-transfer engines and a faster kernel execution engine.
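The bi-directional transfer engines are what make it possible to overlap copies with kernel work using CUDA streams. A hedged sketch (buffer names are illustrative); asynchronous copies require page-locked host memory allocated with cudaMallocHost:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *host, *dev;
    cudaMallocHost(&host, n * sizeof(float));  // page-locked, needed for async copies
    cudaMalloc(&dev, n * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Upload, kernel and download are queued on one stream; with two
    // streams, copies in one can overlap kernel execution in the other.
    cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);             // wait for the whole pipeline

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```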
NVIDIA's Parallel DataCache: Fermi supports a cache hierarchy with L1 and L2 caches. The L1 cache in the Parallel DataCache improves bandwidth and reduces latency for the GPU, while the L2 cache improves coherent data sharing.
ATI's Stream Technology
ATI FireStream (or AMD Stream Processor) technology harnesses the power of the AMD graphics processor, working in tandem with the system's central processor, to speed up many applications beyond graphics. Stream allows the many parallel stream cores inside AMD graphics processors to accelerate general-purpose applications, which lets Stream-enabled programs be optimized or gain extra features. ATI Stream uses a parallel computing architecture that takes advantage of the graphics card's stream processors to execute tasks that can be broken down into identical, parallel operations and run at the same time on a single graphics processor. The main advantage is that Stream uses SIMD (Single Instruction, Multiple Data), whereas a CPU uses a modified SISD (Single Instruction, Single Data) stream.