by December 9, 2006 0 comments

A database today is a vastly complex affair. The ‘big’ implementations
can be a cluster of thousands of database servers spread across the geography.
Yet, as enterprises scale up and grow beyond those geographical boundaries, they
continue to demand and expect that the data will always be there, at their
fingertips. The magic that works behind the scenes to make this happen is what
makes a modern relational database tick. If you consider a database deployment
as vast as those of a Google or a Yahoo, the amount of data that is there, with
the primary ambition of being found in response to any number of any manner of
queries runs into Petabytes.

How does this work? What is the secret databases engines hold that stores all
this information efficiently and in a manner that they can be queried and
processed in fractions of minutes? The technologies that apply are the very same
ones that power your enterprise. And we’re going to plumb the depths of a
modern RDBMS system and see why and how it just works. Before we move on, let’s
get an overview of what a modern RDBMS looks like on the inside.

Quick refresher
A modern RDBMS has a few logical sections to it, although you may not see them
as such discreet pieces (as in separate modules) in reality. Exactly how the
different pieces fit together also differs between each implementation. But
broadly speaking, they would have the following:

Storage engine: This would determine where each data element goes and
stores it in the best possible way. Sometimes, compression is also applied to
make efficient usage of storage capacity.

Buffer/Cache engine: This takes care of buffering and caching, both
data as well as instructions between the database backend server and the user
front-end (which may be over a Web connection). This is also important when
cursors several data pages long are being retrieved for reporting or analysis.

Transaction engine: RDBMS operations are atomized into transactions
and sets of transactions can be rolled back in case of failure. This is well
known. The engine that makes this happen is the transaction engine, which has
built-in atomization logic, along with the ability to detect different kinds of
failures (recoverable and irrecoverable) and handle them appropriately.

Recovery engine: In case a transaction fails or data is corrupted, the
recovery engine is called upon to get things to working order again. This means
that you get the data in a previously ‘known’ and ‘sane’ state as
against something that’s completely unknown or damaged.

Query engine: This engine is responsible for fetching the queries
(from the application or user), parsing it, optimizing it and then queuing it
for execution.

Each of these engines, in turn, would use sub-engines that perform further
specialized tasks. For instance, the storage engine would need to call on one of
several discreet file system drivers to perform the storage depending on if the
storage is happening locally, over the network onto a SAN or if it is in a
distributed environment.

Now let’s turn to some of the key players in the RDBMS arena and see what
technologies make them work better for us.

Databases for SOA
There is a new set of databases aimed at getting better performance out of SOA
applications that use databases. IBM’s DB2 9 ‘VIPER’ and MS SQL Server
2005 are two such SOA ready databases.

SOA enables databases to interact better with SOA applications like web
services, and are targeted at large e-commerce deployments. Regular databases
and database applications are not geared to meet the demands of SOA deployments.
To do that, an RDBMS must do the following at a minimum:

1. Be able to talk and understand XML. All SOA transactions happen via XML so
it is important for the database engine to know how these messages look and work
so to know what are the really interesting ones that it must store and what is
part of regular chatter that can be safely thrown away.

Special database
This is an open source database engine based around XML and built on top
of the Berkeley DB database engine. It provides a quick and easy way to
store and retrieve XML based data, using XQuery. The engine runs
in-process with the application that uses it and does not require external
administration. As such, it provides ACID transactions along with XML
validation, document meta-data and document indexing features. It is
supported on multiple platforms and useable with all the major developer

ORACLE Berkeley DB Java Edition
Again built around the Berkeley DB architecture, this database is entirely
built with Java and meant for EJB object persistence. The idea is that
instead of having to use external databases for Java EJB objects and then
using SQL queries to store and retrieve them and incur the penalty of
having to translate Java objects into SQL entities and back each time, one
can use powerful Java statements to do the same. Thus, the Berkeley DB
Java Edition uses Java code via a DPL (Direct Persistence Layer) to
persist and retrieve EJB objects in a relational database. The database
functions as a Managed Transaction Resource within your EJB application as
it exposes a JTA (Java Transaction API) and JCA (Java Connector
Architecture). Data storage is always local and in-process.

2. It must do HTTP and SOAP. Most databases simply communicate on a
predetermined TCP port (like 1433 for MS SQL Server over TDS [Tabular Data
Stream] protocol,1521 for Oracle and port 3306 for MySQL) and this has nothing
to do with regular SOA traffic. A true SOA database must communicate on the
regular channels like HTTP (port 80) atleast.

In order to act as a service provider for SOA applications or data, the RDBMS
must be further aware of SOAP authentication methods and be able to proxy that
authentication between token providers and consumers in the SOA chain. Future
databases will need to be SODA instead of just an RDBMS to serve the purposes of
modern computing.

Oracle 10g
Automatic Storage Mgmt: ASM for short, think of it as your SAN’s
storage manager embedded into the kernel of your RDBMS engine. Hierarchically,
the ASM would fit in between the different database instances (see Jargon Buster
in this article for more on this) and the storage they use. Thus, it virtualizes
the storage available to Oracle databases across a clustered storage environment
and exposes them as a single available set of locations to the database. Where
the exact file (or file segments) are located and how they are laid out,
accessed or managed becomes transparent to the database engine, with the
positive side effect of removing the headache of managing storage from the DBA’s
workload. ASM is used by the Oracle kernel only for the database files and
cannot be used for other storage.

A set of physical disks are grouped into ‘disk groups’ and a set of these
are available to a storage node. An ASM manages storage at the node level and
different nodes are formed into a storage cluster. ASMs among different nodes in
a cluster can communicate among themselves to manage storage between them.

At start up, the kernel would fire up the ASM which would mount the disk
groups after making sure everything is ok (or performing maintenance on them).
It is only after this that the DB instances are started up, which in turn mount
the databases.

MySQL Server allows DBAs
to choose from a set of storage engines on the fly, each engine is
optimized for a specific app area from archival to cluster computing

Cluster File System: When an Oracle RAC (Real Application Cluster) is
spread across several cluster-nodes, a robust and high performance CFS (Cluster
File System) is required to efficiently access and store data across the
cluster. Oracle’s open source OCFS (Oracle Cluster File System) is a
contribution in this direction. Unlike what we said about the ASM above, OCFS
lets the database store data files, log files as well as archives. But, regular
files other than Oracle database files cannot be stored on the OCFS and this
support is planned for OCFS 2.0 (current release is 1.0.14-1 at the time of
going to press). The OCFS can work only on a SAN or a DAS where local servers
have direct control over the storage. The system runs on a RHEL Advanced Server
based server, and standard Linux commands (FSCK, MKFS, etc) are available to
manage and maintain OCFS storage.

MS SQL Server
Automatic Consistency Checking: Since version 7.0, SQL Server has
supported file based databases. This lets a single database be extended among
different storage units and managed together from a single interface. SQL Server
2000 removed the need to run DBCC (consistency checks) on databases prior to
backup, since the engine anyway performs these checks at regular intervals.
Also, as and when the storage engine encounters a problem, it is handled without
needing a separate management operation. With the 2005 edition, running a DBCC
CHECK (although still available) is no longer a necessary step, as it is
completely automated.

Server Instance: Many RDBMS backends
can run multiple instances, and each of these instances service one or
more applications.

This is an efficient way to fine tune the engine
configuration depending on the application(s) being serviced, instead of
being forced to run at the same configuration for all databases
and applications.

Each instance can use the same database, but due to file
locking when an instance mounts a database, only one of it can use it at a
time (the other instances will need to place the database offline or
un-mount it).

Cursor: A cursor is a range (‘page’) of
information being pulled from a database in response to a query. Cursors
are used to reduce bandwidth utilization while processing a large amount
of data. They do this by accessing only a portion of it at a time.

Transaction: Database operations are usually
multi-step. Each step has a possibility of failure. If the operation fails
and data has been written to the database (or deleted), then the database
can be in an inconsistent state since the entire operation has not
completed (for eg, the database may have recorded that the order was
dispatched, but details like how it was paid for and how it was sent away
were not yet saved). Transactions help batch these operations and on
failure within a batch that batch or the entire set can be undone (‘rolled
back’) and retried afresh after corrective measures.

Atomization: The process of breaking up an
operation or information into smaller parts is called atomization. The
smaller a piece, the less likely the entire operation fails. Also errors
can be caught and handled better.

Copy-on-write: Primarily a file system/backup
related technology, it refers to the type of data copy operation that
occurs only when a WRITE occurs on a corresponding file. This means a CoW
agent will ignore all READ operations as well as idle states of the data
and gets triggered only when data content is changed in some way. This is
a more efficient way to snapshot/backup information.

Data partition: When a database contains a large
amount/types of data, it becomes difficult to analyze properly, since the
application will be looking at all of it at a time. The RDBMS if it
supports partitioning, can let you create ‘windows’ into the data,
much like the VIEWS that let you see only a subset of that data. Different
types of partitioning exist (range, hash, key, list and composite).
Sub-partitioning lets you create smaller subsets from existing data

Page-wise Checksum: Instead of checksums being computed at the file
level, SQL Server 2005 checksums pages of data, which can increase data
protection. This can, for instance catch problems that are not reported by the
OS or I/O hardware — an ALTER DATABASE option enables page level checksums for

Read-only Database Compression: Another improvement in this edition is that
when you have compressed NTFS volumes, you can place a database file group that’s
been marked ‘read only’ onto that compressed volume. Since the data in these
files are not subject to change (since they are read only), the file does not
grow and hence compression can help better utilize its space.

Quick Initialization: To improve performance, SQL Server 2005 permits
creating or extending a database’s data files (not log files) without first
zeroing the space allocated to it. But, to be able to do this, the SQL Server
instance needs the SE_MANAGE_VOLUME_NAME privilege, which is not available to
the Network Service account (default for the service) and hence by default this
facility is turned off. This kind of quick-initialization potentially exposes
uninitialized portions of the disk that may contain previous data to the
application or user. But, the storage engine itself restricts access only to
those portions where data has been written to by the database engine, preventing
such leaks. In a broad manner of speaking, this is similar to the ‘quick
format’ option for disks available in major OSs.

Snapshots: Point-in-time backups can be taken using database
snapshots. This uses NTFS Sparse File technology to efficiently manage the space
used. Pages of data are copied from the database to the snapshot on a
copy-on-write basis, letting the snapshot file attain similar sizes as the
original database.

Row-wise versioning: And imagine having different versions of each row
in a database! That’s also now supported with SQL Server 2005’s row-level
versioning. This also works on a copy-on-write basis, but stamps each row with
its transaction ID to
enable a trace back.

Data partitioning: Data in the latest version of this RDBMS can be
partitioned into independent segments and accessed and modified separately as
long as they are assigned to different file groups. Thus, the storage engine in
SQL Server 2005 also handles file groups (actually data partitions), along with
switching partitions between file groups and aligning them to different

Pluggable Engines: This well-known open source enterprise class database uses a
pluggable storage architecture. While about eight different types are shipped
with different editions of MySQL, about five engines are also available from
their partners and the MySQL community. Each storage engine is specifically
written and optimized for a particular application area. Some handle
read-intensive operations more suited to Web based deployments, while others are
raw OLTP, clustering and warehousing optimized. The default storage engine is
called ‘MyISAM’. Other native engines include: Cluster, Federated, Archive,
Merge, Memory, CSV and BlackHole. Oracle/InnoDB’s own InnoDB engine is tasked
with transactional data storage with row-level locking and referential
integrity. PrimeBase XT, Open OLAP and Berkeley DB (Oracle) are among the
community developed contributions.

Till now, this architecture allowed MySQL DBAs to select the best available
storage engine for a MySQL instance. From version 5.1, the architecture has been
changed to allow database developers change the storage engine on the fly using
what MySQL terms as a ‘clap-on
clap-off’ approach with special LOAD and UNLOAD statements.

Other enhancements include the transactional engine, OLAP engine and
multi-dimensional engine. The MySQL community also got its long standing wish to
be able to do data partitioning (range, hash, key, list, composite and
sub-partition) on its databases with the latest version, although the support
for this is rather shaky and under development.

Falcon: This new database engine for MySQL was originally supposed to
enter public beta testing in summer this year, has not yet come into the open
even with broad specifications at the time of this going to press. All that is
known about this engine is that it is based on the Netfrastructure engine by Jim
Starkey that makes better use of modern processor and RAM capabilities. More
cores in a processor means the engine would use them to marshal data from one
place to another more efficiently and go multi-threaded to do so. Larger
memories mean more caching, but Falcon wants to do better than that by using
in-memory databases that fill up from disk-based content. Tuning databases are
also rumored to be a thing of the past with Falcon handling your database.
Multiple versions of a table will exist in memory while there would be just a
single version on the disk,
making versioning faster and more consistent. Similarly, transactions are also
performed in memory with the outputs flowing to the disk. For the rest, we’ll
have to wait till a version of Falcon actually hits the beta servers.

Not just relational DBMS: This is version 9 of the IBM DB2 and was almost
named “DB3” given the wide ranging changes to the DB2 system. IBM says that
as of Viper, DB2 is no longer a pure relational environment but provides an XML
plus relational environment.

pureXML: DB2 Viper is intensely an XML environment. XML documents can
be stored in a database using several methods: you could break up the document
into its constituent parts and store each one separately or, store entire XML as
a single piece. Both have their pros and cons. When you break up the file, you
can store it more efficiently, replacing duplicates with pointers to the already
stored copy. But, this has a performance penalty when a document has to be
reconstructed in response to a query. DB2 Viper stores XML documents as a single
LOB (Large Object) entity, making it compliance friendly, as when you break up
something into pieces and recombine it later, you do not end up with the
original (for eg, digital certificates will fail to verify as the original
document). Storing XML as a LOB overcomes this problem as well.

SOA friendly: Having a database that can directly do XML is great.
But, having it do a part of SOA for you as well would make for a great deal.
This is what DB2 Viper also enables with its pureXML. So, instead of having
separate DB2 instances for each platform or application that needs the database,
you can have just a single DB2 deployment that can serve data via XML to the
various applications, thus complying with the primary requirement of SOA:

Tokenized Compression: To compress data effectively, DB2 v9 replaces
duplicate data even within columns with tokens. This reduces redundancy. DB2
will look for duplication across the database and replace the duplicates with
numeric tokens each time it appears. A central token ‘dictionary’ helps
refresh the data item with the actual information when it is retrieved. Numeric
values, LOB, XML and indexes are not currently tokenized.

Optimized for SAP: It not onl detects its deployed environment on
installation, but can also detect a SAP deployment in that environment. It can
discover SAP’s configuration and use these values to work better with that
application. DB2 can be deployed in a special ‘silent install’ mode as a
part of the SAP installation. The information picked up from SAP is used for
database tuning, indexing and multi-dimensional clustering.

Security partitioning: This is a form of data partitioning (see Jargon
Buster) where DB2 lets you partition data based on security tokens. This means
you can put confidential information in one partition, top secret data in
another and see-this-and-die information in a third one. You can do partitioning
using an ALTER TABLE command. This function is useful when you need to perform
decision-support analysis (what-if, scenario building, etc) where you may need
to temporarily ‘delete’ data for the analysis query period.

Replicated partitions: This is a feature that is a part of data
warehouse appliances and may soon find its way into DB2. Partition replication
is useful in parallel processing environments where you can replicate a
partition onto another instance.

Self-Tuning: In both memory and storage, the database engine can
automatically detect application workloads (especially SAP) and tune itself to
better performance. This is done by redistributing memory among the different
DB2 processes to optimize the usage.

ILM: The DMS (Data Managed Storage) module of Viper lets you have
disks of different speeds in the same system and it will automatically allocate
data to each of them based on their age. That is, older data will go onto slower
disks and current data on faster ones. Thus, DB2 has a built-in ILM module as

No Comments so far

Jump into a conversation

No Comments Yet!

You can be the one to start a conversation.