As conventional computing races towards the physical
limits of hardware (the speed of light, the uncertainty principle, unpredictable
quantum effects), researchers all over the world are working on methods that will let
them continue increasing available computing power. Moore's Law, the famous assertion
that the computing power available at a fixed price point doubles every eighteen months,
must be upheld.
Though DNA, optical and quantum computers are at various
stages of development, and very exciting breakthroughs are being made almost every day, we
are not likely to find these highly experimental computational techniques powering the
machines on our desktops any time soon. But a more prosaic approach, clustering commodity
PCs into a single parallel machine, is already delivering results. The Beowulf project was
born at NASA's Goddard Space Flight Center, where researchers needed supercomputer
performance on a space science problem. Recently a 16-node Beowulf system, at a total
system cost of under $50,000 (the price of a high-end workstation), exceeded 1.25
gigaflops of sustained performance on a gravitational N-body simulation of 10 million
particles: a price-performance breakthrough of significant importance for a wide range of
industrial applications, including aerospace.
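Run the numbers: $50,000 for 1.25 sustained gigaflops works out to roughly $40 per
sustained megaflops.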
Beowulf's foremost design objective was to deliver the
highest parallel processing performance for the lowest price. That explains the designers'
choice of mass-market commodity components, dedicated processors (rather than scavenging
cycles from other workstations, as some other systems do), and a high-throughput System
Area Network (SAN): anything from fast Ethernet, or multiple Ethernets combined into a
single virtual network, all the way to a 64-node, 128-processor hypercube with 8-node
clusters connected internally through fast Ethernet and the meta-nodes connected through
a 1.2-gigabit Myrinet crossbar.
Donald Becker, one of the developers of Beowulf, is
responsible for many of the network interface drivers available for Linux, especially for
high-throughput devices. The project has led to the development of much high-performance
networking code that has now been incorporated into the Linux source tree.
The Beowulf software environment is called Grendel. It's
implemented as a set of add-on programs to the Red Hat Linux distribution, and includes
several independently developed parallel programming environments, providing
message-passing interfaces through the PVM (Parallel Virtual Machine) and MPI (Message
Passing Interface) libraries. Beowulf also integrates the BSP (Bulk Synchronous Parallel)
libraries, which reduce the need to add explicit message-passing code when porting serial
applications, or parallel applications written for shared-memory architectures, resulting
in faster porting and development times. System V IPC and POSIX threads (pthreads) are
also supported.
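Since Grendel bundles standard environments rather than defining an API of its own,
parallel code for a Beowulf looks just like parallel code anywhere else. A minimal MPI
program, sketched here using only standard MPI calls (nothing in it is Beowulf-specific),
shows the flavour:

/* hello.c: each task reports its rank.  Build with mpicc, launch
 * one copy per node with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* join the parallel job     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this task's id in the job */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of tasks     */

    printf("Hello from task %d of %d\n", rank, size);

    MPI_Finalize();                       /* leave the job cleanly     */
    return 0;
}

On a 16-node cluster, mpirun -np 16 hello starts sixteen tasks; the same source compiles
unchanged for a Paragon or an SP2.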
With so many programming environments on offer, Beowulf is
probably one of the most flexible parallel processing systems available. This greatly aids
in porting applications from traditional parallel supercomputers. In the past, code was
often developed and debugged on an inexpensive Beowulf system and then transferred to a
large distributed-memory machine, like an Intel Paragon, a Cray T3D or a Thinking Machines
CM-5. Recently, however, a pair of scientists ported their highly optimized N-body codes
the other way, from a CM-5 to a Beowulf. The port was accomplished with a simple recompile.
In a Beowulf cluster, every machine runs its own copy of the
kernel, and nodes are generally autonomous at the kernel level. However, all the kernels
participate in a number of global namespaces, such as the global process ID space. All
processes running on a Unix machine have a unique process ID, or PID, to identify them. In
a parallel distributed-processing system, you'd want unique process IDs across the entire
cluster, spanning several kernels. Beowulf provides at least two mechanisms to support
this: one internal, and one inherited from PVM.
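The PVM mechanism is easy to picture: every task enrolled in the virtual machine receives
a task ID that is unique across the whole cluster, no matter which node's kernel it runs
under. A minimal sketch using the standard PVM 3 calls (link against libpvm3):

/* tid.c: demonstrate cluster-wide task ids under PVM. */
#include <pvm3.h>
#include <stdio.h>

int main(void)
{
    int mytid = pvm_mytid();  /* enrol in the virtual machine and get a
                                 task id unique across all nodes       */
    int ptid  = pvm_parent(); /* task id of the task that spawned us
                                 (PvmNoParent if started by hand)      */

    printf("task id 0x%x, parent 0x%x\n", mytid, ptid);

    pvm_exit();               /* leave the virtual machine */
    return 0;
}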
Linux uses the /proc pseudo-filesystem to present
system and process information in real time through a familiar ‘filesystem’
interface. A basic /proc filesystem presents a subdirectory for each process on the
local processor. The Linux implementation extends /proc to present almost all
system information in this format. Integrating /proc filesystems across a Beowulf
cluster would allow standard Unix tools for process management and job control to work
transparently on the virtual machine without the need for any modifications, because most
of these utilities get all of their information from the /proc
filesystem. Work is
underway on a unified /proc filesystem for Beowulf.
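It's easy to see why that would work. Here is a bare-bones version of ps, written the same
way the standard tools are: it walks /proc, treats each numeric directory as a process,
and reads the Name field from that process's status file. Pointed at a cluster-wide /proc,
the identical loop would list every process on the virtual machine (ordinary Linux /proc
layout assumed; nothing here is Beowulf-specific):

/* procls.c: list processes by walking the /proc filesystem. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    char path[288], name[256];

    if (!proc)
        return 1;
    while ((de = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                 /* only the per-process entries */
        snprintf(path, sizeof path, "/proc/%s/status", de->d_name);
        FILE *f = fopen(path, "r");
        if (f != NULL) {
            if (fscanf(f, "Name:\t%255s", name) == 1)
                printf("%5s  %s\n", de->d_name, name);
            fclose(f);
        }
    }
    closedir(proc);
    return 0;
}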
The Linux kernel also makes it easy to hook into the virtual memory system. That makes it
simpler to create page-based shared-memory systems that allow the entire memory of a
cluster to be accessed transparently from any node. Other environments being added to
Beowulf to implement distributed shared memory include the page-based NVM (Network
Virtual Memory). Beowulf also supports parallel filesystem libraries like PVFS (Parallel
Virtual File System) and PPFS (Portable Parallel File System) to allow high-performance
disk I/O. NASA Goddard's cluster of 16 Pentium Pros contains 100 GB of distributed disk,
with a whopping 1 Gbit per second memory-to-disk bandwidth.
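Page-based distributed shared memory leans on the processor's memory-protection hardware:
a page not held locally is mapped with no access rights, the first touch triggers a fault,
and the fault handler fetches the page from its owner before the instruction restarts.
Here is a toy single-machine sketch of that trick; the network fetch is simulated by
filling the page locally, so treat it as an illustration of the technique, not NVM's
actual code:

/* dsm.c: the page-fault trick behind page-based shared memory (Linux/glibc). */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;      /* the "shared" page */
static long  pagesize;

/* The first touch of the protected page lands here. */
static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    char *addr = (char *)info->si_addr;
    (void)sig; (void)ctx;
    if (addr < page || addr >= page + pagesize)
        _exit(1);       /* a genuine crash, not our page */
    /* A real DSM system would request the page from the owning node
       here; we just make it accessible and fill it in. */
    mprotect(page, pagesize, PROT_READ | PROT_WRITE);
    strcpy(page, "page faulted in on first touch");
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    pagesize = sysconf(_SC_PAGESIZE);
    /* Map one page with no access rights: any touch faults. */
    page = mmap(NULL, pagesize, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;

    printf("%s\n", page);   /* the read faults, the handler fills the
                               page, and the read then completes */
    return 0;
}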
Beowulf stacks up well against machines that cost far
more. A Beowulf has equaled the performance of IBM SP2s with comparable nodes at one-tenth
the price. Beowulf systems running on the latest generation of Intel or DEC Alpha chips
perform comparably with supercomputers like CDAC's Param, Intel's Paragon and the Cray T3D
(which have similar parallel internal architectures). For high-performance distributed
computing, you can't go far wrong with Beowulf.
The American federal government has always been paranoid
about letting high-performance computing technology out of the country. But it's
already losing its war against strong cryptography exports, and with Beowulf it's no
longer a very big deal for anyone with the motive, the means and the skill to put together
a supercomputer that the CIA and the NSA would consider a national security threat. Even
if all you really want to do is set another record at the Oscars.
Next month: How you can play around with
multiprocessing and increase the computing throughput in your organization using PC
clusters.