
Petabytes of Data @ Your Fingertips: How?

PCQ Bureau

A database today is a vastly complex affair. The 'big' implementations can be clusters of thousands of database servers spread across geographies. Yet, as enterprises scale up and grow beyond those geographical boundaries, they continue to demand and expect that the data will always be there, at their fingertips. The magic that works behind the scenes to make this happen is what makes a modern relational database tick. Consider a database deployment as vast as Google's or Yahoo's: the amount of data sitting there, with the primary ambition of being found in response to any number and manner of queries, runs into petabytes.


How does this work? What secret do database engines hold that lets them store all this information efficiently, in a manner that it can be queried and processed in fractions of a minute? The technologies that apply are the very same ones that power your enterprise, and we're going to plumb the depths of a modern RDBMS to see why and how it just works. Before we move on, let's get an overview of what a modern RDBMS looks like on the inside.

Quick refresher



A modern RDBMS has a few logical sections to it, although you may not see them as such discrete pieces (as in separate modules) in reality. Exactly how the different pieces fit together also differs between implementations. But broadly speaking, they would have the following:


Storage engine: This determines where each data element goes and stores it in the best possible way. Sometimes, compression is also applied to make efficient use of storage capacity.

Buffer/Cache engine: This takes care of buffering and caching, of both data and instructions, between the database backend server and the user front-end (which may be over a Web connection). This is also important when cursors several data pages long are being retrieved for reporting or analysis.
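Real buffer engines are far more elaborate, but the core idea is usually some variant of an LRU (least recently used) page cache. Here is a minimal sketch in Python; the class name, the capacity and the fetch_from_disk stand-in are all illustrative, not any vendor's actual design:

```python
from collections import OrderedDict

class PageCache:
    """Minimal LRU page cache, the core idea behind a buffer engine."""

    def __init__(self, capacity, fetch_from_disk):
        self.capacity = capacity
        self.fetch = fetch_from_disk        # stand-in for the storage engine
        self.pages = OrderedDict()          # page_id -> page bytes
        self.hits = self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        self.misses += 1
        page = self.fetch(page_id)           # slow path: go to disk
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict least recently used page
        return page

cache = PageCache(2, lambda pid: b"page-%d" % pid)
cache.get(1); cache.get(2); cache.get(1)   # second read of page 1 is a hit
cache.get(3)                               # cache is full, page 2 is evicted
```

Evicting the least recently used page keeps the hottest data in memory, which is exactly what you want when the same pages are read over and over.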


Transaction engine: RDBMS operations are atomized into transactions, and sets of transactions can be rolled back in case of failure. This is well known. The engine that makes this happen is the transaction engine, which has built-in atomization logic, along with the ability to detect different kinds of failures (recoverable and irrecoverable) and handle them appropriately.

Recovery engine: In case a transaction fails or data is corrupted, the recovery engine is called upon to get things back to working order. This means that you get the data in a previously 'known' and 'sane' state, as against something that's completely unknown or damaged.

Query engine: This engine is responsible for fetching queries (from the application or user), parsing them, optimizing them and then queuing them for execution.
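You can watch the parse-and-optimize step at work in any database that exposes its query plans. A small sketch using Python's built-in sqlite3 module (SQLite stands in here purely for illustration; the big commercial engines expose similar facilities, such as EXPLAIN in MySQL and Oracle or SHOWPLAN in SQL Server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE INDEX idx_customer ON orders (customer)")

# Ask the query engine for its plan instead of the results: the output
# shows whether the optimizer picked the index or a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?",
    ("acme",),
).fetchall()
plan_text = " ".join(str(row) for row in plan)
```

Because idx_customer exists, the plan reports an index search; drop the index and the same query falls back to scanning the whole table.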


Each of these engines, in turn, would use sub-engines that perform further specialized tasks. For instance, the storage engine would need to call on one of several discrete file system drivers to perform the storage, depending on whether the storage is happening locally, over the network onto a SAN, or in a distributed environment.

Now let's turn to some of the key players in the RDBMS arena and see what technologies make them work better for us.

Databases for SOA



There is a new set of databases aimed at getting better performance out of SOA applications that use databases. IBM's DB2 9 'Viper' and MS SQL Server 2005 are two such SOA-ready databases.


These databases interact better with SOA applications such as web services, and are targeted at large e-commerce deployments. Regular databases and database applications are not geared to meet the demands of SOA deployments. To do that, an RDBMS must do the following at a minimum:

1. Be able to talk and understand XML. All SOA transactions happen via XML, so it is important for the database engine to know how these messages look and work, in order to tell which are the really interesting ones that it must store and which are part of regular chatter that can safely be thrown away.

Special database engines
ORACLE Berkeley DB XML



This is an open source database engine built around XML, on top of the Berkeley DB database engine. It provides a quick and easy way to store and retrieve XML-based data using XQuery. The engine runs in-process with the application that uses it and does not require external administration. As such, it provides ACID transactions along with XML validation, document meta-data and document indexing features. It is supported on multiple platforms and usable with all the major developer languages.
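To get a feel for what 'store XML, query XML' means, here is a toy stand-in written with Python's standard library. Berkeley DB XML itself uses XQuery and a real storage engine; this sketch substitutes a plain dictionary and ElementTree's limited XPath, purely to illustrate querying documents by structure rather than by rows and columns:

```python
import xml.etree.ElementTree as ET

documents = {}  # toy document store: name -> parsed XML tree

def put_document(name, xml_text):
    documents[name] = ET.fromstring(xml_text)

def query(xpath):
    # Run one XPath expression across every stored document.
    hits = []
    for name, root in documents.items():
        for node in root.findall(xpath):
            hits.append((name, node.text))
    return hits

put_document("order-1", "<order><customer>acme</customer></order>")
put_document("order-2", "<order><customer>globex</customer></order>")
matches = query(".//customer")
```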

ORACLE Berkeley DB Java Edition



Again built around the Berkeley DB architecture, this database is written entirely in Java and meant for EJB object persistence. The idea is that instead of using an external database for Java EJB objects and SQL queries to store and retrieve them, incurring the penalty of translating Java objects into SQL entities and back each time, one can use plain Java statements to do the same. Thus, the Berkeley DB Java Edition uses Java code via a DPL (Direct Persistence Layer) to persist and retrieve EJB objects. The database functions as a Managed Transaction Resource within your EJB application, as it exposes a JTA (Java Transaction API) and JCA (Java Connector Architecture) interface. Data storage is always local and in-process.


2. It must do HTTP and SOAP. Most databases simply communicate on a predetermined TCP port (like 1433 for MS SQL Server over the TDS, or Tabular Data Stream, protocol; 1521 for Oracle; and 3306 for MySQL), and this has nothing to do with regular SOA traffic. A true SOA database must communicate on regular channels like HTTP (port 80) at least.

In order to act as a service provider for SOA applications or data, the RDBMS must further be aware of SOAP authentication methods and be able to proxy that authentication between token providers and consumers in the SOA chain. Future databases will need to be SODA (service-oriented database architecture) offerings instead of just RDBMSs to serve the purposes of modern computing.

Oracle 10g



Automatic Storage Management: ASM for short; think of it as your SAN's storage manager embedded into the kernel of your RDBMS engine. Hierarchically, the ASM fits in between the different database instances (see Jargon Buster in this article for more on this) and the storage they use. Thus, it virtualizes the storage available to Oracle databases across a clustered storage environment and exposes it as a single available set of locations to the database. Where the exact files (or file segments) are located and how they are laid out, accessed or managed becomes transparent to the database engine, with the positive side effect of removing the headache of managing storage from the DBA's workload. ASM is used by the Oracle kernel only for the database files and cannot be used for other storage.


Physical disks are grouped into 'disk groups', and a set of these is available to a storage node. An ASM instance manages storage at the node level, and different nodes are formed into a storage cluster. ASM instances on different nodes in a cluster can communicate among themselves to manage storage between them.

At start-up, the kernel fires up the ASM, which mounts the disk groups after making sure everything is OK (or after performing maintenance on them). Only after this are the DB instances started up, which in turn mount the databases.

MySQL Server allows DBAs to choose from a set of storage engines on the fly; each engine is optimized for a specific app area, from archival to cluster computing

Cluster File System: When an Oracle RAC (Real Application Cluster) is spread across several cluster nodes, a robust and high performance CFS (Cluster File System) is required to efficiently access and store data across the cluster. Oracle's open source OCFS (Oracle Cluster File System) is a contribution in this direction. Unlike the ASM above, OCFS lets the database store data files, log files as well as archives. But regular files other than Oracle database files cannot be stored on the OCFS; this support is planned for OCFS 2.0 (the current release is 1.0.14-1 at the time of going to press). The OCFS can work only on a SAN or a DAS where local servers have direct control over the storage. The system runs on an RHEL Advanced Server based server, and standard Linux commands (fsck, mkfs, etc) are available to manage and maintain OCFS storage.

MS SQL Server



Automatic Consistency Checking: Since version 7.0, SQL Server has supported databases spread over multiple files. This lets a single database be extended across different storage units and managed together from a single interface. SQL Server 2000 removed the need to run DBCC consistency checks on databases prior to backup, since the engine performs these checks at regular intervals anyway. Also, as and when the storage engine encounters a problem, it is handled without needing a separate management operation. With the 2005 edition, running a DBCC check (although still available) is no longer a necessary step, as it is completely automated.

JARGON BUSTER
Server Instance: Many RDBMS backends can run multiple instances, and each of these instances services one or more applications. This is an efficient way to fine-tune the engine configuration depending on the application(s) being serviced, instead of being forced to run the same configuration for all databases and applications. Each instance can use the same database, but due to file locking when an instance mounts a database, only one of them can use it at a time (for another instance to use it, the database must first be placed offline or un-mounted).

Cursor: A cursor is a range ('page') of information being pulled from a database in response to a query. Cursors are used to reduce bandwidth utilization while processing a large amount of data. They do this by accessing only a portion of it at a time.
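The idea is easy to demonstrate with Python's built-in sqlite3 module (any DB-API database behaves similarly): rather than calling fetchall() on 10,000 rows, the cursor hands them over a page at a time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO readings (value) VALUES (?)",
                 [(i * 0.5,) for i in range(10_000)])
conn.commit()

cur = conn.execute("SELECT id, value FROM readings")
pages = total = 0
while True:
    page = cur.fetchmany(1000)   # pull one 'page' of rows at a time
    if not page:
        break
    pages += 1
    total += len(page)           # process this page, not all 10,000 rows at once
```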

Transaction: Database operations are usually multi-step. Each step has a possibility of failure. If the operation fails after data has been written to the database (or deleted), then the database can be in an inconsistent state, since the entire operation has not completed (for example, the database may have recorded that the order was dispatched, but details like how it was paid for and how it was sent were not yet saved). Transactions help batch these operations; on failure within a batch, that batch or the entire set can be undone ('rolled back') and retried afresh after corrective measures.
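The order-dispatch example above can be played out directly with Python's built-in sqlite3 module (the table names are invented for the illustration). A failure midway through the batch leaves no half-written rows behind:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE payments (order_id INTEGER, amount REAL)")
conn.commit()

try:
    with conn:  # one transaction: both inserts commit together, or neither does
        conn.execute("INSERT INTO orders VALUES (1, 'dispatched')")
        conn.execute("INSERT INTO payments VALUES (1, -50.0)")
        raise ValueError("payment validation failed")  # simulated mid-batch failure
except ValueError:
    pass  # the engine rolled the whole batch back

orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Even though the 'dispatched' row was inserted before the failure, the rollback removes it, so the database never records a dispatch without its payment.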

Atomization: The process of breaking up an operation or information into smaller parts is called atomization. The smaller a piece, the less likely it is that the entire operation fails. Also, errors can be caught and handled better.

Copy-on-write: Primarily a file system/backup related technology, it refers to the type of data copy operation that occurs only when a WRITE occurs on a corresponding file. This means a CoW agent will ignore all READ operations as well as idle states of the data, and gets triggered only when data content is changed in some way. This is a more efficient way to snapshot/backup information.
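A copy-on-write snapshot is simple enough to sketch in a few lines of Python (the page store here is just a dictionary; real CoW agents work on file system blocks):

```python
class Snapshot:
    """Copy-on-write snapshot of a set of pages (illustrative sketch).

    READs never copy anything; a page is copied into the snapshot only
    the first time a WRITE touches it."""

    def __init__(self, store):
        self.store = store   # live page store: page_id -> bytes
        self.frozen = {}     # pages preserved at snapshot time

    def write(self, page_id, data):
        if page_id in self.store and page_id not in self.frozen:
            self.frozen[page_id] = self.store[page_id]  # copy on first write
        self.store[page_id] = data

    def read_snapshot(self, page_id):
        # Unchanged pages are read straight from the live store.
        return self.frozen.get(page_id, self.store.get(page_id))

store = {1: b"alpha", 2: b"beta"}
snap = Snapshot(store)
snap.write(1, b"ALPHA")   # page 1 is preserved, then overwritten in place
```

Only the one modified page costs any extra space; page 2 is never copied.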

Data partition: When a database contains a large amount (or many types) of data, it becomes difficult to analyze properly, since the application will be looking at all of it at a time. The RDBMS, if it supports partitioning, can let you create 'windows' into the data, much like the VIEWs that let you see only a subset of that data. Different types of partitioning exist (range, hash, key, list and composite). Sub-partitioning lets you create smaller subsets from existing data partitions.
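Two of these partitioning schemes can be sketched as plain functions; the boundaries and partition names below are invented for illustration. Range partitioning maps a key into a window by value, while hash partitioning spreads rows evenly with no natural grouping:

```python
def range_partition(order_year):
    # Range partitioning: each partition holds one contiguous span of the key.
    if order_year < 2000:
        return "archive"
    if order_year < 2005:
        return "recent"
    return "current"

def hash_partition(customer_id, n_partitions=4):
    # Hash partitioning: rows are spread evenly across a fixed set of buckets.
    return customer_id % n_partitions

tiers = [range_partition(y) for y in (1998, 2003, 2006)]
buckets = {hash_partition(cid) for cid in range(100)}
```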

Page-wise Checksum: Instead of checksums being computed at the file level, SQL Server 2005 checksums pages of data, which can increase data protection. This can, for instance, catch problems that are not reported by the OS or I/O hardware. An ALTER DATABASE option enables page-level checksums for databases.
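The gain from checksumming per page rather than per file is that corruption gets pinned down to a single page. A sketch in Python using CRC32 (SQL Server's actual checksum algorithm is an internal detail; the 8 KB page size below mirrors its page layout but is otherwise illustrative):

```python
import zlib

PAGE_SIZE = 8192  # 8 KB pages, as in SQL Server

def page_checksums(data):
    """Checksum each page independently, so corruption is pinned to one
    page instead of invalidating the checksum of the whole file."""
    pages = [data[i:i + PAGE_SIZE] for i in range(0, len(data), PAGE_SIZE)]
    return [zlib.crc32(p) for p in pages]

data = bytearray(b"x" * (3 * PAGE_SIZE))
good = page_checksums(data)
data[2 * PAGE_SIZE + 7] ^= 0xFF          # flip one byte: silent corruption
bad = page_checksums(data)
damaged = [i for i, (g, b) in enumerate(zip(good, bad)) if g != b]
```

A single flipped byte shows up as exactly one mismatched page checksum, so only that page needs to be restored.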

Read-only Database Compression: Another improvement in this edition is that when you have compressed NTFS volumes, you can place a database file group that has been marked 'read only' onto such a compressed volume. Since the data in these files is not subject to change (being read only), the files do not grow, and hence compression can help better utilize the space.

Quick Initialization: To improve performance, SQL Server 2005 permits creating or extending a database's data files (not log files) without first zeroing the space allocated to them. But to be able to do this, the SQL Server instance needs the SE_MANAGE_VOLUME_NAME privilege, which is not available to the Network Service account (the default for the service), and hence this facility is turned off by default. This kind of quick initialization potentially exposes uninitialized portions of the disk, which may contain previous data, to the application or user. But the storage engine itself restricts access to only those portions where data has been written by the database engine, preventing such leaks. In a broad manner of speaking, this is similar to the 'quick format' option for disks available in major OSs.
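The file system trick underneath is easy to demonstrate: extend a file by seeking past its end instead of writing out zeroes. On most modern file systems the operation is near-instant because the gap is not physically zero-filled up front (whether the result is actually stored sparsely depends on the file system):

```python
import os
import tempfile

SIZE = 100 * 1024 * 1024  # pretend we're growing a data file to 100 MB

path = os.path.join(tempfile.mkdtemp(), "datafile.mdf")
with open(path, "wb") as f:
    f.seek(SIZE - 1)   # jump straight to the new end of file...
    f.write(b"\0")     # ...and write a single byte to set the logical size

logical_size = os.path.getsize(path)  # reports the full 100 MB immediately
```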

Snapshots: Point-in-time backups can be taken using database snapshots. This uses NTFS sparse file technology to efficiently manage the space used. Pages of data are copied from the database to the snapshot on a copy-on-write basis, so the snapshot file grows only as pages change, rather than starting out at the size of the original database.

Row-wise versioning: And imagine having different versions of each row in a database! That's also now supported with SQL Server 2005's row-level versioning. This also works on a copy-on-write basis, but stamps each row with its transaction ID to enable a trace back.
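A bare-bones model of row-level versioning can be written in a few lines of Python: every write appends a fresh copy of the row stamped with the transaction ID that produced it, and a read can 'trace back' to the value as of any earlier transaction. (The class and method names are invented for the sketch.)

```python
import itertools

class VersionedTable:
    """Sketch of row-level versioning: writes never overwrite; they append
    a new copy of the row stamped with a transaction ID."""

    def __init__(self):
        self.rows = {}                 # key -> list of (txn_id, value)
        self._txn = itertools.count(1)

    def write(self, key, value):
        txn_id = next(self._txn)
        self.rows.setdefault(key, []).append((txn_id, value))
        return txn_id

    def read(self, key, as_of_txn=None):
        versions = self.rows[key]
        if as_of_txn is None:
            return versions[-1][1]     # latest version
        # trace back: newest version no newer than the given transaction
        return max(v for v in versions if v[0] <= as_of_txn)[1]

t = VersionedTable()
t.write("order:1", "pending")   # transaction 1
t.write("order:1", "shipped")   # transaction 2
latest = t.read("order:1")
old = t.read("order:1", as_of_txn=1)
```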

Data partitioning: Data in the latest version of this RDBMS can be partitioned into independent segments and accessed and modified separately, as long as the partitions are assigned to different file groups. Thus, the storage engine in SQL Server 2005 also handles file groups (actually data partitions), along with switching partitions between file groups and aligning them to different boundaries.

MySQL



Pluggable Engines: This well-known open source enterprise-class database uses a pluggable storage architecture. While about eight different engine types are shipped with different editions of MySQL, about five more engines are available from partners and the MySQL community. Each storage engine is specifically written and optimized for a particular application area. Some handle read-intensive operations more suited to Web-based deployments, while others are optimized for raw OLTP, clustering and warehousing. The default storage engine is called 'MyISAM'. Other native engines include Cluster, Federated, Archive, Merge, Memory, CSV and BlackHole. InnoDB, from the Oracle-owned Innobase, is tasked with transactional data storage with row-level locking and referential integrity. PrimeBase XT, Open OLAP and Berkeley DB (Oracle) are among the community-developed contributions.

Till now, this architecture allowed MySQL DBAs to select the best available storage engine for a MySQL instance. From version 5.1, the architecture has been changed to allow database developers to change the storage engine on the fly, using what MySQL terms a 'clap-on, clap-off' approach with special LOAD and UNLOAD statements.

Other enhancements include the transactional engine, OLAP engine and multi-dimensional engine. The MySQL community also got its long-standing wish to be able to do data partitioning (range, hash, key, list, composite and sub-partition) on its databases with the latest version, although support for this is rather shaky and still under development.

Falcon: This new database engine for MySQL, originally supposed to enter public beta testing in summer this year, has not yet come into the open even with broad specifications at the time of this going to press. All that is known about this engine is that it is based on the Netfrastructure engine by Jim Starkey and makes better use of modern processor and RAM capabilities. More cores in a processor mean the engine would use them to marshal data from one place to another more efficiently, going multi-threaded to do so. Larger memories mean more caching, but Falcon wants to do better than that by using in-memory databases that fill up from disk-based content. Tuning databases is also rumored to be a thing of the past with Falcon handling your database. Multiple versions of a table will exist in memory while there would be just a single version on the disk, making versioning faster and more consistent. Similarly, transactions are also performed in memory, with the outputs flowing to the disk. For the rest, we'll have to wait till a version of Falcon actually hits the beta servers.

IBM DB2 VIPER



Not just a relational DBMS: This is version 9 of IBM DB2, and it was almost named “DB3” given the wide-ranging changes to the DB2 system. IBM says that as of Viper, DB2 is no longer a pure relational environment but provides an XML-plus-relational environment.

pureXML: DB2 Viper is intensely an XML environment. XML documents can be stored in a database using several methods: you could break up the document into its constituent parts and store each one separately, or store the entire XML as a single piece. Both have their pros and cons. When you break up the file, you can store it more efficiently, replacing duplicates with pointers to the already stored copy. But this carries a performance penalty when a document has to be reconstructed in response to a query. DB2 Viper stores XML documents as a single LOB (Large Object) entity, making it compliance-friendly: when you break up something into pieces and recombine it later, you do not end up with the original (for example, digital certificates will fail to verify against the original document). Storing XML as a LOB overcomes this problem as well.

SOA friendly: Having a database that can directly do XML is great. But having it do a part of SOA for you as well makes for a great deal. This is what DB2 Viper also enables with its pureXML. So, instead of having separate DB2 instances for each platform or application that needs the database, you can have just a single DB2 deployment that serves data via XML to the various applications, thus complying with the primary requirement of SOA: disconnectedness.

Tokenized Compression: To compress data effectively, DB2 v9 replaces duplicate data, even within columns, with tokens. This reduces redundancy. DB2 will look for duplication across the database and replace each occurrence of a duplicate with a numeric token. A central token 'dictionary' helps restore the data item to the actual information when it is retrieved. Numeric values, LOBs, XML and indexes are not currently tokenized.
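The principle is simple to sketch in Python: repeated values are replaced by small integers, and a central dictionary maps them back on retrieval. (This is a toy model of the idea, not DB2's actual on-disk format.)

```python
def tokenize(rows):
    """Replace repeated string values with integer tokens plus a central
    dictionary: a simplified sketch of token-based compression."""
    dictionary, encoded = {}, []
    for value in rows:
        token = dictionary.setdefault(value, len(dictionary))
        encoded.append(token)
    return encoded, dictionary

def detokenize(encoded, dictionary):
    # Reverse the dictionary to restore the original values on retrieval.
    reverse = {token: value for value, token in dictionary.items()}
    return [reverse[t] for t in encoded]

cities = ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai", "Chennai"]
encoded, dictionary = tokenize(cities)
```

Six city names compress to six small integers plus a three-entry dictionary; the longer and more repetitive the column, the bigger the saving.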

Optimized for SAP: It not only detects its deployment environment on installation, but can also detect a SAP deployment in that environment. It can discover SAP's configuration and use these values to work better with that application. DB2 can be deployed in a special 'silent install' mode as a part of the SAP installation. The information picked up from SAP is used for database tuning, indexing and multi-dimensional clustering.

Security partitioning: This is a form of data partitioning (see Jargon Buster) where DB2 lets you partition data based on security tokens. This means you can put confidential information in one partition, top secret data in another and see-this-and-die information in a third one. You can do partitioning using an ALTER TABLE command. This function is useful when you need to perform decision-support analysis (what-if, scenario building, etc) where you may need to temporarily 'delete' data for the analysis query period.

Replicated partitions: This is a feature that is a part of data warehouse appliances and may soon find its way into DB2. Partition replication is useful in parallel processing environments where you can replicate a partition onto another instance.

Self-Tuning: In both memory and storage, the database engine can automatically detect application workloads (especially SAP) and tune itself for better performance. This is done by redistributing memory among the different DB2 processes to optimize usage.

ILM: The DMS (Database Managed Space) module of Viper lets you have disks of different speeds in the same system, and it will automatically allocate data to each of them based on age. That is, older data will go onto slower disks and current data onto faster ones. Thus, DB2 has a built-in ILM module as well.
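Age-based placement boils down to a rule like the one below; the tier names and thresholds are invented for the illustration, not DB2's actual policy:

```python
from datetime import date

def place_on_tier(last_accessed, today=date(2006, 11, 1)):
    """Age-based placement sketch: the older the data, the slower (and
    cheaper) the disk tier it is moved to."""
    age_days = (today - last_accessed).days
    if age_days > 365:
        return "slow-sata"   # year-old data: archival-grade disks
    if age_days > 30:
        return "midrange"    # month-old data: mid-tier disks
    return "fast-fc"         # current data: the fastest spindles
```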
