
Hot Technologies in Storage

PCQ Bureau

Storage is one area where any amount of coverage is insufficient. That's

because no matter which application you deploy, you'll always require storage

for it. Meanwhile, the amount of data grows at such an exponential rate that

you quickly run out of what you have and end up buying more. With so much data

spread across so many devices, manageability also becomes a key challenge. How

do you ensure that there's no unwanted data eating up precious storage space on

your network? How do you ensure that there's no duplicate data consuming your

precious storage space? It becomes very difficult. That's why, beyond a certain

point, you have to decide whether to buy more storage, or invest in technologies

and solutions to manage what you already have.


Moreover, as the amount of data grows, security becomes a key concern. How do

you secure the ever growing amount of data? How do you keep it away from prying

eyes? You can keep it secure while the data is being transported, but what about

data that's lying stationary inside storage devices? How safe is it? There are

technologies evolving to safeguard that data as well.

This brings us to the third critical element. When you've got so much

data, you not only have to protect it from prying eyes, but also from disasters.

So you need a mechanism by which data can be instantly replicated to a remote

location. There are technologies available for doing that as well.


The storage market is vast. Lots of technologies have already been developed

to cater to existing requirements, and work is in progress to cater to new and

future requirements. In this story, we look at the key developments in the

storage space that address the concerns raised above.

There's storage virtualization to help you utilize your existing storage

resources more efficiently. There are encryption technologies to keep your data away

from prying eyes. There are replication technologies that ensure efficient and

quick synchronization of data with a remote site, and work is even in progress

to store data on the most unheard of things, like bacteria. The technologies

behind all these areas have been discussed in this article.

Storage Virtualization: Consolidation and more


As your business grows, your data also grows exponentially. As a result, the

available storage capacity for storing all this data shrinks. So you buy more

storage devices to cater to this growing requirement, and the cycle continues.

The trouble is that over a period of time, both user preferences as well as the

available storage technologies change, and you end up deploying disparate

storage resources in your IT infrastructure. Eventually this makes the

management of so many storage resources difficult, and often leads to

underutilized storage. In fact, it wouldn't be surprising if only 50% of storage

capacity is actually utilized. It's a fairly common sight these days. So what do

you do with the unused 50% capacity? How do you leverage it?

One answer that the storage industry has for this is storage virtualization,

which not only consolidates the storage infrastructure but also makes it more

usable. It allows storage administrators to identify, provision and manage

disparate storage as a single aggregated resource. The end result is that it

eases the management headaches and allows higher levels of storage utilization,

which in turn forestalls the expense of added storage. Let's examine this

exciting technology in more detail and even look at some of the key issues

involved in using it.

Key issues



The technology basically works by adding a layer of abstraction between the

storage systems and applications. So, applications no longer need to know what

disks, partitions or storage subsystems their data is stored on. They look at

the storage as one pool, which results in improved disk utilization. It also

helps automate storage capacity expansion, and allows storage resources to be

altered and updated on the fly without disrupting application performance.
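To make the idea concrete, here is a minimal Python sketch, not any vendor's implementation, in which disparate "devices" are simulated as in-memory buffers and pooled into a single logical block address space. The StoragePool class, the block size and the mapping scheme are illustrative assumptions only.

```python
# A toy illustration of the abstraction layer described above: disparate
# "devices" (in-memory buffers standing in for disks or LUNs) are pooled into
# one logical block address space. Real virtualization layers do this in
# volume managers, arrays or fabric appliances; this only shows the
# logical-to-physical mapping idea.

BLOCK_SIZE = 512

class StoragePool:
    def __init__(self, device_sizes_in_blocks):
        # Each device is simulated as a bytearray; sizes may differ (disparate storage).
        self.devices = [bytearray(n * BLOCK_SIZE) for n in device_sizes_in_blocks]
        # Mapping table: logical block number -> (device index, physical block).
        self.map = []
        for dev_idx, n in enumerate(device_sizes_in_blocks):
            for phys in range(n):
                self.map.append((dev_idx, phys))

    def write_block(self, logical_block, data):
        dev_idx, phys = self.map[logical_block]
        start = phys * BLOCK_SIZE
        self.devices[dev_idx][start:start + BLOCK_SIZE] = data.ljust(BLOCK_SIZE, b"\x00")

    def read_block(self, logical_block):
        dev_idx, phys = self.map[logical_block]
        start = phys * BLOCK_SIZE
        return bytes(self.devices[dev_idx][start:start + BLOCK_SIZE])

# Applications see one pool of 100 + 250 + 50 = 400 blocks, regardless of
# which physical device a given block actually lands on.
pool = StoragePool([100, 250, 50])
pool.write_block(120, b"application data")
print(pool.read_block(120)[:16])
```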


Earlier, an application would be associated with specific storage resources, and any interruption to those resources would adversely affect the application's availability. With storage virtualization, applications are no longer tied to particular storage units, thereby improving data availability.

Storage virtualization can also aid in disaster recovery (DR) planning.

Traditionally you needed identical hardware at the DR site for replication of

data, but virtualization eases that requirement. Moreover, you can speed up

backups through the use of snapshots, which basically eliminates the backup

window. Data migration can also be handled through storage virtualization

instead of using vendor-specific tools, supporting greater heterogeneity in the

data center.

But all this also adds significant complexity. The

virtualization layer is one more element of the storage environment that must be

managed and maintained as and when virtualization products are patched and

updated. It's also important to consider the impact of storage virtualization on

interoperability and compatibility between storage devices. In some cases, the

virtualization layer may potentially interfere with certain special features of

storage systems, such as remote replication.


If you face such issues with storage virtualization, then undoing it can be

very challenging. Therefore it is advisable to go one step at a time. Implement

it in parts.

Technologies behind storage virtualization



Currently, storage virtualization is being done at three architectural

levels: (a) in the host, (b) at the storage sub-system and (c) in the storage

network. Each method provides specific advantages but is limited in its

capabilities.

Virtualization could be seen as an important element of storage consolidation, easing management headaches and allowing higher levels of storage utilization.

Host-based virtualization is the easiest and most straightforward. Abstraction is implemented in servers, typically in Logical Volume Managers (LVMs). But scalability and maintainability eventually become an issue with this kind of virtualization, because it assumes prior partitioning of the entire SAN's resources (disks or LUNs) among the various servers.

You can also have storage virtualization in the storage array itself, e.g.,

Hitachi Data Systems' TagmaStore. This offers convenience, but it's

vendor-centric and generally not heterogeneous. Pooling all SAN storage

resources and managing virtual volumes across several storage subsystems

generally requires homogeneous SANs using a single type of RAID subsystem.

Today, the most popular point of implementation is in the network fabric

itself. It is winning because of its neutrality for storage and servers. It is

often done through a dedicated virtualization appliance or an intelligent switch

running virtualization software, such as IBM's SVC software. Network-based

virtualization can have the following two major architectures:

  •  Symmetric approach: intelligent switches and/or appliances sit in the data path of the storage network infrastructure.
  •  Asymmetric approach: separate appliances are installed out of the data path of the storage network infrastructure. The appliance can be small and inexpensive, because it handles only control traffic, not the actual data transfers.

Network-based storage virtualization is the most scalable and interoperable,

making it particularly well suited to storage consolidation projects. But as a

downside, there may be a slight impact on network performance due to in-band

processing (symmetric approach) in the virtualization layer.

iSCSI vis-à-vis FC: Where does it fit?

Another hot storage trend relates to the use of iSCSI in SAN deployments. But

before you leap onto the bandwagon, you need to ensure that it meets your

requirements. Here's a technical insight.

Small Computer Systems Interface (SCSI) commands are not new to the storage

world, but SCSI over IP (iSCSI) is still a nascent phenomenon and is evolving

with every passing day. Until recently, it was not even seen as a viable alternative to the high-performing and feature-rich Fibre Channel (FC) option for storage networking. Both vendors and users were skeptical about its performance under mission-critical workloads, support was lacking, and products were scarce. But things are changing fast for iSCSI, and today it has become the most widely deployed form of IP storage.



What's motivating users to adopt iSCSI? Lower capital and operational costs are
the main motivators. Ease of management is right behind, which includes easier

and faster deployment, leveraging existing IP skills. Let's delve deeper into

the use of iSCSI protocol in storage networking and see where it fits the bill.

SCSI over IP



The iSCSI protocol is the means to enable block storage applications over TCP/IP networks. It follows a client/server model: the client is the initiator, which actively issues commands, and the target is the server, which passively serves the requests generated by the initiator. A target has one or more logical units, identified by unique Logical Unit Numbers (LUNs), to process these commands. Each SCSI command is expressed as a Command Descriptor Block (CDB), which the iSCSI layer encapsulates and reliably transports to the target; after processing, the target returns the requested data along with a status response to the initiator. Basically, the iSCSI layer interfaces to the operating system's standard SCSI command set. In practical deployments, an initiator may have multiple targets, resulting in multiple concurrent TCP connections being active.
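To illustrate the initiator/target model, here is a toy Python sketch. It is not the real iSCSI protocol (there are no PDUs, logins or LUN discovery); it only shows an initiator actively issuing block read/write commands to a target that passively serves one LUN over TCP. The JSON message format and helper names are made up for the example.

```python
# Toy initiator/target block I/O over TCP: conceptually similar to iSCSI's
# command/response model, but using a made-up JSON wire format.
import json
import socket
import threading
import time

PORT = 3260  # iSCSI's well-known port number, reused here purely for flavour


def run_target():
    """The 'server' side: passively serves block requests for a single LUN."""
    lun0 = {}                                   # block number -> data
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    with conn:
        for line in conn.makefile("r"):
            cmd = json.loads(line)              # stand-in for a SCSI CDB
            if cmd["op"] == "WRITE":
                lun0[cmd["block"]] = cmd["data"]
                reply = {"status": "GOOD"}
            else:                               # READ
                reply = {"status": "GOOD", "data": lun0.get(cmd["block"], "")}
            conn.sendall((json.dumps(reply) + "\n").encode())


def run_initiator():
    """The 'client' side: actively issues commands and checks the status."""
    sock = socket.create_connection(("127.0.0.1", PORT), timeout=5)
    chan = sock.makefile("rw")
    for cmd in ({"op": "WRITE", "block": 7, "data": "cafe" * 4},
                {"op": "READ", "block": 7}):
        chan.write(json.dumps(cmd) + "\n")
        chan.flush()
        print(json.loads(chan.readline()))      # status (and data for READ)
    sock.close()


if __name__ == "__main__":
    threading.Thread(target=run_target, daemon=True).start()
    time.sleep(0.5)                             # give the target time to listen
    run_initiator()
```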

iSCSI adapters



Unlike FC Host Bus Adapters (HBAs), iSCSI network interface cards do not

require FC components at the host end or in the network, which reduces the ongoing support

and management costs. In addition, it offers several architectural options that

vary based on the price, bandwidth, CPU usage, etc. The most cost effective

option is to use a software-only solution, wherein a server can do block I/O

over IP through an ordinary Gigabit Ethernet card.

Unfortunately, this approach suffers in terms of performance and CPU

overhead, making it best suited for low-performance applications such as tape

backup. But as CPUs are becoming more powerful, this is becoming an acceptable

trade-off for the convenience of iSCSI storage access.

The other approach to iSCSI involves using TCP offload engines (TOEs). These are incorporated in network adapters to remove the burden of TCP processing from the host CPU, making it more efficient. The third approach lies at the other extreme: high-performance iSCSI adapters offload both TCP and iSCSI processing. Though it adds to the cost, this provides both high-speed iSCSI transport and minimal CPU usage.

iSCSI or FC



High performance iSCSI adapters and IP storage switches are proliferating in

the market and dramatically changing the SAN landscape. It's difficult to say

which type of SAN will meet your unique requirements, but prefer FC over iSCSI

where availability, high-performance and reliability are paramount, for example,

to store transactions in banks. In other cases, you can choose iSCSI. For

example, an iSCSI SAN will be appropriate for an organization which does not

want to invest in acquiring or training staff to support FC.

IP networks often have higher latency than localized SANs. That means even when 10 Gbps Ethernet becomes widely available, say by 2008, alongside 8 Gbps FC, such a small speed advantage alone is unlikely to win people over. For companies that have already standardized on FC, a switch is even less likely.

Some OSs still don't have native iSCSI support, and iSCSI management tools (SRMs) are also comparatively less developed. A lot of trust still needs to be built around iSCSI.

Nonetheless, people will find iSCSI enticing for the other major reason: cost. Although prices of FC components are coming down fast, they're still much higher than their IP counterparts. Moreover, this price difference is likely to persist for the foreseeable future. Even if you add in the cost of a TOE, it does not exceed the FC cost.

Security for Data-in-store: Can't take it for Granted

Over the past few years, several incidents have driven home an underlying need for securing data-in-store. It's not always the data-in-transit that is vulnerable. Let's see how it is done.

Over the past few years, there's been a sizable increase in malicious attacks

on corporate computer systems and electronic thefts of private information. To

provide protection from these attacks, most companies have secured their systems

and network from outsiders, implementing perimeter-based security strategies

with firewalls and virtual private networks (VPNs) to ensure that external users

can't access sensitive data without authorization. But that's not enough

anymore. Today, you also have to secure data from unauthorized employees and

erroneous or unwanted use by an authorized user.

What comprises storage security?



Typically, there are three parts to storage security: authentication, access control and encryption. Authentication ensures that only authorized people can access the data. For authentication on a network, we have several standards and protocols, such as Remote Authentication Dial-In User Service (RADIUS) and Challenge Handshake Authentication Protocol (CHAP). Meanwhile, new storage-specific methods and standards, such as Diffie-Hellman CHAP, are also emerging that enable organizations to add authentication to the storage infrastructure.

Access control maps and controls a user or a system to a particular set of

data. On a network, users can only view data allowed by router access control

lists (ACLs) and directory services that control access. Within the storage

infrastructure, which servers have access to what data is controlled by

techniques like zoning and LUN (Logical Unit Number) masking (discussed in PCQuest May 2007, pages 76 and 77).

Encryption also forms a key part of an effective storage strategy. There are

two key components in encryption, viz., the encryption algorithm and the key.

There are several standards for implementing encryption. Most systems use

specific algorithms for specific operations, such as 3DES for encrypting data at

rest and AES for encrypting data in flight.
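As a concrete illustration of symmetric encryption for data at rest, here is a minimal Python sketch using the third-party cryptography package with AES-GCM. The choice of mode, key size and variable names are assumptions made for the example, not a recommendation of any particular product's scheme.

```python
# A minimal sketch of encrypting a block of stored data with AES, using the
# third-party 'cryptography' package (pip install cryptography). AES-GCM is
# used here purely as an illustration; actual products pick their own
# algorithms and modes.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # the encryption key
aesgcm = AESGCM(key)

plaintext = b"customer records destined for the disk array"
nonce = os.urandom(12)                      # must be unique per encryption
ciphertext = aesgcm.encrypt(nonce, plaintext, None)

# What goes to storage is the nonce plus ciphertext; the key is kept
# separately (see the key management discussion below).
stored = nonce + ciphertext
recovered = aesgcm.decrypt(stored[:12], stored[12:], None)
assert recovered == plaintext
```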

Key lifecycle for data-in-store

Additional considerations



Besides encryption, it is also necessary to ensure that the encrypted data remains unaltered until decryption. Message digests or secure hashes are fixed-length bit strings that can be used to verify the integrity of data. There are various mechanisms to calculate a secure hash: the most common ones are MD5 and SHA-1, but stronger hashes like SHA-256, SHA-384 and SHA-512 are recommended. Alternatively, a Keyed-Hash Message Authentication Code (HMAC) with a stronger hash is also an advisable option. An Internet-based Certificate Authority can also be involved for making key exchanges.
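Here is a minimal sketch of these integrity checks using only the Python standard library; the data and key in it are placeholders.

```python
# A plain SHA-256 digest detects accidental corruption, while an HMAC keyed
# with a secret also detects deliberate tampering.
import hashlib
import hmac
import secrets

data = b"encrypted blob as written to storage"

# Plain secure hash (anyone can recompute it).
digest = hashlib.sha256(data).hexdigest()

# Keyed-Hash Message Authentication Code: only holders of the key can
# produce or verify it.
mac_key = secrets.token_bytes(32)
tag = hmac.new(mac_key, data, hashlib.sha256).digest()

# Later, before decryption, verify that the stored data is unaltered.
assert hashlib.sha256(data).hexdigest() == digest
assert hmac.compare_digest(hmac.new(mac_key, data, hashlib.sha256).digest(), tag)
```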

Key management



Encryption is not the end of it. In fact, it gives rise to another important aspect: key management. There are basically two kinds of encryption that one can use, symmetric key encryption or asymmetric key encryption. Symmetric key encryption is preferred for data at rest. Two standards for symmetric key encryption, P1619 (for disks) and P1619.1 (for tapes), are being worked on by the IEEE.

No matter what encryption standards you follow, you should always have a key hierarchy. The hierarchy must consist of at least two levels of keys: a data encryption key (DEK) and a key encryption key (KEK). As the name suggests, the KEK is used to encrypt the data encryption key before it is stored. The deeper the hierarchy of keys, the more robust the key management system required for operations.
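A minimal sketch of such a two-level hierarchy, assuming the third-party cryptography package's Fernet recipe as the cipher; the names and layout are illustrative, not any standard's format.

```python
# A data encryption key (DEK) encrypts the data; a key encryption key (KEK)
# wraps the DEK before it is stored alongside the data.
from cryptography.fernet import Fernet

kek = Fernet.generate_key()        # key encryption key, held by the key manager
dek = Fernet.generate_key()        # data encryption key, used on the data path

# Encrypt the data with the DEK.
ciphertext = Fernet(dek).encrypt(b"records to be stored")

# Wrap (encrypt) the DEK with the KEK; only the wrapped DEK is stored.
wrapped_dek = Fernet(kek).encrypt(dek)

# To read the data back: unwrap the DEK with the KEK, then decrypt the data.
recovered_dek = Fernet(kek).decrypt(wrapped_dek)
plaintext = Fernet(recovered_dek).decrypt(ciphertext)
assert plaintext == b"records to be stored"
```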

A key management system is one that combines the devices, people and

operations required to create, maintain and control keys. Some of the common

components of a key management system are:

  • Key generators: These can be manual or, preferably, random number generators. Ideally, the random number generators should be non-deterministic, although to date no certified or accepted Non-deterministic Random Bit Generator (NRBG) exists, due to the lack of a standard verification process. Until one is developed, it is good practice to use a Deterministic Random Bit Generator (DRBG) certified by an appropriate agency or by the government of the country where the storage device is kept (see the sketch after this list).
  • Transport of keys: This can again be done manually through smart

    cards or automatically.
  • Encryption devices: This defines the granularity of the key, that

    is, at which level (disk, directory or individual file level) to encrypt data.

    It depends on the sensitivity of data.
  • Key archive systems: This provides for easy recoverability of keys.

    These are maintained in some tamper-proof hardware to ensure key security.
  • Key backup files or devices: This is different from key archiving; it is the practice of securely backing up keys and restoring them in case of a disaster. Ensuring control of the key backup system is extremely important to key security and integrity.
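As promised above, here is a minimal sketch of key generation using the Python standard library's secrets module, which draws on the operating system's cryptographically strong random source; the 256-bit length is an illustrative choice.

```python
# Generate a random symmetric key from the OS's strong random source.
import secrets

def generate_key(num_bytes: int = 32) -> bytes:
    """Generate a random symmetric key (e.g. 256 bits for AES-256)."""
    return secrets.token_bytes(num_bytes)

key = generate_key()
print(len(key), key.hex()[:16], "...")
```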

Conclusion



Data encryption is nothing new, but it becomes especially important in the case of high-volume

storage. It poses some formidable challenges as well. For example, encryption

and decryption are processor-intensive activities which may slow down access to

stored data. The situation will become worse when organizations are storing and

accessing massive amounts of information.

Then, the management of the encryption keys should also be handled in an efficient, reliable and secure way. Moreover, applications should be completely abstracted from where and how the encryption is being done, and the whole process of storage security should be automated.

Data Replication over WANs

Not only do you need a secure first copy, you also need an available second copy of the data to restore the normal course of business in case disaster strikes.

You never know the value of a backup until a disaster strikes and wipes out the principal copy of your data. Many of us have realized the value of investing in backups at remote locations only after seeing the mishaps of recent times and the devastation they caused to the unprepared. The traditional method of data protection has been to back data up on tape and then physically move it to a safe location, preferably far away from the main office. However, with the proliferation of a wide range of disk-based automated backup systems, the stage is set for data replication over WANs to mature. On one hand, we are seeing advances in networking technologies; to name a few, optical networks now support storage networking protocols, flow control mechanisms, and efficient transport capabilities. On the other hand, in the storage world we have data de-duplication technologies coming up, which significantly reduce the volume of data to be backed up.

Networking technologies



To send data to geographically dispersed storage, you need a resilient storage networking infrastructure. For inter-datacenter data replication, you need a network with low latency and minimal packet loss. Plus, the bandwidth of such a network should be scalable. With these sensitivities in mind, there are two options for building a network to support data replication: Coarse/Dense Wavelength Division Multiplexing (C/DWDM) and Synchronous Optical Network/Synchronous Digital Hierarchy (SONET/SDH).

C/DWDM is a technology that maps data from different sources and protocols

together on an optical fiber with each signal carried on its own separate and

private light wavelength. It can be used to interconnect data centers via a

variety of storage protocols such as Fibre Channel, FICON, and ESCON. It has

been verified to support data replication over distances up to several hundred

kilometers. C/DWDM provides bandwidth from one to several hundreds of

gigabits/second (Gbps).

SONET/SDH technology is based on Time Division Multiplexing (TDM). With this

technology, enterprise data centers can be interconnected over thousands of

kilometers for data replication and other storage-networking needs. Storage over

SONET/SDH is a reliable and readily available networking option.

Data de-duplication



Though data de-duplication technologies have been around for years, there is a renewed focus on them as they are adopted by products in the disk-based backup market. Data reduction makes disk a feasible medium for long-term backup retention, at the same or lower cost than tape-based systems.

Moreover, data de-duplication addresses the issue of data replication for

disaster recovery. With the reduced amount of data after de-duplication, the

network bandwidth required for replication reduces significantly. This makes

replication possible even for smaller companies with lower budgets.



There are two primary methods of data reduction found in disk-based backup systems. One is byte-level delta data reduction, which compares versions of data over time and stores only the differences at the byte level. The other is block-level data de-duplication, in which blocks of data are read from the written data and only the unique blocks are stored.
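A minimal Python sketch of block-level de-duplication, under simplifying assumptions (fixed-size blocks, SHA-256 fingerprints, no collision handling), helps show why repeated backup data shrinks so dramatically.

```python
# Cut the backup stream into fixed-size blocks, fingerprint each block, and
# store only blocks not seen before. Real products also handle variable-size
# chunking, hash collisions and restore ordering.
import hashlib

BLOCK_SIZE = 4096

def deduplicate(data: bytes):
    store = {}          # fingerprint -> unique block
    recipe = []         # ordered list of fingerprints to rebuild the stream
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:
            store[fp] = block
        recipe.append(fp)
    return store, recipe

def restore(store, recipe):
    return b"".join(store[fp] for fp in recipe)

# A backup stream with lots of repeated content de-duplicates well.
backup = (b"A" * BLOCK_SIZE) * 100 + (b"B" * BLOCK_SIZE) * 100
store, recipe = deduplicate(backup)
print(f"{len(backup)} bytes reduced to {sum(len(b) for b in store.values())} bytes of unique blocks")
assert restore(store, recipe) == backup
```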

A byte level delta data reduction outperforms data de-duplication in a

disk-based backup system, as it scales to larger amounts of data. It avoids hash

table and restore fragmentation issues. It also processes the backup data after

it's been written to disk, and on top of all that, it is content-aware and optimized

for your specific backup application. Therefore, it knows how each backup

application operates, and understands file content and boundaries. All in all,

it helps in optimizing the de-duplication process.

Flavors of replication



There are several products available in the market for data replication over
WANs. There are four flavors of replication to choose from. You can do it at the

application level, host level, in the storage arrays, or with a storage

networking appliance. The advantage of doing it at the application level is that the application is fully aware of the replication, so DBAs have confidence in data integrity. It supports both synchronous and asynchronous replication, and it is not hardware dependent. The disadvantages of this approach are that application owners themselves are responsible for recovery, it is specific to a particular application, and it often does not protect application files.
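The difference between synchronous and asynchronous replication can be sketched in a few lines of Python; the queue and the simulated WAN delay below are purely illustrative, not any product's design.

```python
# 'remote' stands in for the WAN-connected secondary copy.
import queue
import threading
import time

local, remote = {}, {}

def send_to_remote(key, value):
    time.sleep(0.05)            # simulated WAN round trip
    remote[key] = value

def write_sync(key, value):
    """Synchronous: acknowledge the write only after the remote copy
    confirms, so both copies stay identical (at a latency cost)."""
    local[key] = value
    send_to_remote(key, value)
    return "ack"

replication_queue = queue.Queue()

def replicator():
    while True:
        key, value = replication_queue.get()
        send_to_remote(key, value)
        replication_queue.task_done()

threading.Thread(target=replicator, daemon=True).start()

def write_async(key, value):
    """Asynchronous: acknowledge immediately and replicate in the background;
    faster, but the remote copy may lag behind if disaster strikes."""
    local[key] = value
    replication_queue.put((key, value))
    return "ack"

write_sync("order-1", "paid")
write_async("order-2", "shipped")
replication_queue.join()        # wait for background replication to drain
print(local == remote)          # True once the queue has drained
```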

If you do it at the host level, failover can be automated, whereas at the application level the DBAs themselves need to pull up the data in case of a failure. Host-level replication supports disparate hardware and can facilitate many-to-one replication. On the downside, it depends on the OS, requires additional resources on the host, and has to be explicitly integrated with applications. If you plan to replicate at the storage array level, you must know that it is unlikely to support dissimilar hardware, secondary copies are only usable as point-in-time copies, it requires integration with applications, you need to work on fabric extension, and there is the added complexity of keeping everything in sync. But replicating at the storage array makes it agnostic to applications and OS; it does not use any host resources, replicates all kinds of data and is also easier to manage.

Despite being a little costlier, since a separate appliance and additional hardware are required, replication in the fabric provides modern functionality like continuous data protection (CDP). In this approach, no array or host resources are required. Understandably, it is also agnostic to applications and OS. Besides, it is highly scalable.

Bacteria-based Storage: Right from the Labs

If atomic storage was not enough to baffle a common man's wit, these days we

are hearing talk of bacteria-based storage. Let's find out where the research

has reached.

We have heard researchers talk about atomic storage, where atoms can hold 250

Terabits of data on a surface area of one square inch. Then we also know about

the organic thin-film structures that can have more than 20,000

write-read-rewrite cycles. But, now researchers have started talking about the

potential of bacteria to store data. Sounds interesting? It is, and it's not

something out of science fiction.

State of research



Researchers from two prominent institutions have come out with results that indicate the possibility of storing digital data in the genome of a living organism. Not only that, the data thus stored could be retrieved hundreds or even thousands of years later. They believe that the potential capacity of bacteria-based memory is enormous, as more than a billion bacteria are contained in just one milliliter of liquid; you can imagine the scale.

To this effect, a group of Pacific Northwest National Laboratory (PNNL) researchers described an experiment some three years ago in which they stored roughly one encoded English sentence in a bacterium. Taking this further, earlier this year scientists at Japan's Keio University Institute for Advanced Biosciences reported similar results. They claim to have successfully encoded "e=mc2 1905!", Einstein's famous equation and the year he enunciated it, on the common soil bacterium Bacillus subtilis.
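The exact encoding scheme the researchers used is not described here, so the following Python sketch is only a hypothetical illustration of the general idea: digital data can be mapped onto a DNA sequence by assigning two bits to each of the four nucleotide bases, and mapped back again on retrieval.

```python
# Hypothetical two-bits-per-base mapping between bytes and DNA bases.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(text: str) -> str:
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(sequence: str) -> str:
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

sequence = encode("e=mc2 1905!")
print(sequence[:24], "...")
assert decode(sequence) == "e=mc2 1905!"
```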

According to the scientists, one of the challenges in carrying out the research was to provide a safe haven for DNA molecules, which are easily destroyed in any open environment inhabited by people or other hazards. The solution to this fragility was found in a living host for the DNA. The host tolerates the addition of artificial gene sequences and survives extreme environmental conditions. Also, the host with the embedded information was able to grow and multiply.

Potential applications



Storing is not the end of the story. If you can store data in a bacterium but cannot retrieve it successfully and easily, the purpose remains unserved. For now, retrieval of the data remains a wet-laboratory process requiring a certain amount of time and effort.

Most of the potential applications for DNA-based data storage relate to the

core missions of the US Department of Energy (DoE), which funded the research

carried out at PNNL. The technology, once developed, could also be used for other security-related applications such as information hiding and data steganography, in commercial products as well as in those related to state security. Data steganography is the hiding of data inside other data. As DNA-based storage has the capability to survive nuclear threats, one possible application is to keep copies of data that might otherwise be destroyed in a nuclear explosion. Anyway, who wants to talk about nuclear explosions; that's insane!
