Storage is one area where no amount of coverage is ever sufficient. That's
because no matter which application you deploy, you'll always require storage
for it. The amount of data grows at such an exponential rate that you quickly
run out of what you have and end up buying more. With so much data spread
across so many devices, manageability also becomes a key challenge. How do you
ensure that there's no unwanted data eating up precious storage space on your
network? How do you ensure that there's no duplicate data consuming that space?
It becomes very difficult. That's why, beyond a certain point, you have to
decide whether to buy more storage, or invest in technologies and solutions to
manage what you already have.
Moreover, as the amount of data grows, security becomes a key concern. How do
you secure this ever-growing volume of data? How do you keep it away from prying
eyes? You can keep it secure while it's being transported, but what about data
lying at rest inside storage devices? How safe is it? Technologies are evolving
to safeguard that data as well.
This brings us to the third critical element. When you've got so much data, you
not only have to protect it from prying eyes, but also from disasters. So you
need a mechanism by which data can be instantly replicated to a remote
location. There are technologies available for doing that as well.
The storage market is vast. Lots of technologies have already been developed
to cater to existing requirements, and work is in progress to cater to new and
future ones. In this story, we look at the key developments in the storage
space that address the challenges we've just described.
There's storage virtualization to help you utilize your existing storage
resources more efficiently. There are encryption technologies to keep your data
away from prying eyes. There are replication technologies that ensure efficient
and quick synchronization of data with a remote site, and work is even in
progress to store data on the most unlikely of media, like bacteria. The
technologies behind all these areas are discussed in this article.
As your business grows, your data also grows exponentially. As a result, the
available capacity for storing all this data shrinks. So you buy more storage
devices to cater to this growing requirement, and the cycle continues. The
trouble is that over a period of time, both user preferences and the available
storage technologies change, and you end up deploying disparate storage
resources in your IT infrastructure. Eventually this makes managing so many
storage resources difficult, and often leads to underutilized storage. In fact,
it wouldn't be surprising if only 50% of your storage capacity were actually
utilized; it's a fairly common sight these days. So what do you do with the
idle 50%? How do you leverage it?
One answer that the storage industry has for this is storage virtualization,
which not only consolidates the storage infrastructure but also makes it more
usable. It allows storage administrators to identify, provision and manage
disparate storage as a single aggregated resource. The end result is that it
eases the management headaches and allows higher levels of storage utilization,
which in turn forestalls the expense of added storage. Let's examine this
exciting technology in more detail and even look at some of the key issues
involved in using it.
Key issues
The technology basically works by adding a layer of abstraction between the
storage systems and applications. So, applications no longer need to know what
disks, partitions or storage subsystems their data is stored on. They look at
the storage as one pool, which results in improved disk utilization. It also
helps automate storage capacity expansion, and allows storage resources to be
altered and updated on the fly without disrupting application performance.
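To picture what that abstraction layer does, here is a minimal, hypothetical Python sketch of a virtualization layer that pools disparate devices and hands out logical volumes. The class and device names (VirtualizationLayer, "array-A", "oracle-data") are purely illustrative and not any vendor's actual implementation.

```python
# Hypothetical sketch of a storage virtualization layer: logical volumes are
# carved out of a pool of disparate physical devices, and the application
# addresses only logical volumes, never specific disks or LUNs.

class PhysicalDevice:
    def __init__(self, name, capacity_gb):
        self.name = name
        self.free_gb = capacity_gb

class VirtualizationLayer:
    def __init__(self, devices):
        self.devices = devices      # disparate arrays, any vendor
        self.volumes = {}           # logical volume -> list of (device, size) extents

    def pool_capacity(self):
        return sum(d.free_gb for d in self.devices)

    def create_volume(self, name, size_gb):
        """Allocate a logical volume from whichever devices have space."""
        if size_gb > self.pool_capacity():
            raise ValueError("pool exhausted")
        extents, remaining = [], size_gb
        for dev in self.devices:
            if remaining == 0:
                break
            chunk = min(dev.free_gb, remaining)
            if chunk > 0:
                dev.free_gb -= chunk
                extents.append((dev.name, chunk))
                remaining -= chunk
        self.volumes[name] = extents
        return extents

# Usage: three disparate arrays appear to the application as one pool.
pool = VirtualizationLayer([PhysicalDevice("array-A", 500),
                            PhysicalDevice("array-B", 300),
                            PhysicalDevice("array-C", 200)])
print(pool.create_volume("oracle-data", 650))   # spans array-A and array-B
```

The point of the sketch is simply that the application asks for "oracle-data", not for a particular disk or LUN, and capacity can be added to the pool later without the application noticing.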
Earlier, an application would be tied to specific storage resources, and any
interruption to those resources would adversely affect the application's
availability. After storage virtualization, applications are no longer tied to
particular storage units, thereby improving data availability.
Storage virtualization can also aid in disaster recovery (DR) planning.
Traditionally you needed identical hardware at the DR site for replication of
data, but virtualization eases that requirement. Moreover, you can speed up
backups through the use of snapshots, which basically eliminates the backup
window. Data migration can also be handled through storage virtualization
instead of using vendor-specific tools, supporting greater heterogeneity in the
data center.
But all this also adds significantly to the complexity. The virtualization
layer is one more element of the storage environment that must be managed and
maintained as virtualization products are patched and
updated. It's also important to consider the impact of storage virtualization on
interoperability and compatibility between storage devices. In some cases, the
virtualization layer may potentially interfere with certain special features of
storage systems, such as remote replication.
If you face such issues with storage virtualization, then undoing it can be
very challenging. Therefore it is advisable to go one step at a time. Implement
it in parts.
Technologies behind storage virtualization
Currently, storage virtualization is being done at three architectural
levels: (a) in the host, (b) at the storage sub-system and (c) in the storage
network. Each method provides specific advantages but is limited in its
capabilities.
Virtualization could be seen as an important element of storage consolidation, easing management headaches and allowing higher levels of storage utilization
Host-based virtualization is the easiest and most straightforward. Abstraction
is implemented in the servers, typically in Logical Volume Managers (LVMs). But
scalability and maintainability become an issue with this kind of
virtualization after a while, because it assumes prior partitioning of the
entire SAN's resources (disks or LUNs) across the various servers.
You can also have storage virtualization in the storage array itself, e.g.,
Hitachi Data Systems' TagmaStore. This offers convenience, but it's
vendor-centric and generally not heterogeneous. Pooling all SAN storage
resources and managing virtual volumes across several storage subsystems
generally requires homogeneous SANs using a single type of RAID subsystem.
Today, the most popular point of implementation is in the network fabric
itself. It is winning because of its neutrality towards both storage and
servers. It is often done through a dedicated virtualization appliance or an
intelligent switch running virtualization software, such as IBM's SVC software.
Network-based virtualization can have two major architectures:
- Symmetric approach: intelligent switches and/or appliances sit in the data
path of the storage network infrastructure.
- Asymmetric approach: separate appliances are installed out of the data path
of the storage network infrastructure. The appliance can be a small and
inexpensive unit, because it handles only control traffic and not the actual
data transfers.
Network-based storage virtualization is the most scalable and interoperable,
making it particularly well suited to storage consolidation projects. As a
downside, there may be a slight impact on network performance due to in-band
processing (in the symmetric approach) in the virtualization layer.
Another hot storage trend relates to the use of iSCSI in SAN deployments. But
before you jump on the bandwagon, you need to ensure that it meets your
requirements. Here's a technical insight.
Small Computer Systems Interface (SCSI) commands are not new to the storage
world, but SCSI over IP (iSCSI) is still a nascent phenomenon and is evolving
with every passing day. Until recently, it was not even seen as a viable
alternative to the high-performance, feature-rich Fibre Channel option for
storage networking. Both vendors and users were skeptical about its performance
in mission-critical roles, support was lacking, and products were scarce. But
things are changing fast for iSCSI, and today it has become the most widely
deployed form of IP storage.
What's motivating users to adopt iSCSI? Lower capital and operational costs are
the main motivators. Ease of management is right behind, which includes easier
and faster deployment that leverages existing IP skills. Let's delve deeper into
the use of the iSCSI protocol in storage networking and see where it fits the bill.
SCSI over IP
The iSCSI protocol is the means to enable block storage applications over
TCP/IP networks. It follows a client/server model: the client is the initiator,
which actively issues commands, and the target is the server, which passively
serves the requests generated by the initiator. A target has one or more
logical units, identified by unique Logical Unit Numbers (LUNs), to process
these commands. Each command is carried in a Command Descriptor Block (CDB),
which is encapsulated and reliably delivered to the target; after processing
the command, the target returns the requested data along with a status report
to the initiator. Basically, the iSCSI layer interfaces to the operating
system's standard SCSI command set. In practical applications, an initiator may
have multiple targets, resulting in multiple concurrent TCP connections being
active.
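The following Python sketch models only the roles in that exchange, not the wire protocol: the initiator builds a SCSI READ(10) CDB, a stand-in "PDU" carries it to the target, and the target's LUN returns data plus a status. The class and field names are illustrative assumptions; real iSCSI PDUs follow the binary layout defined in RFC 3720 and travel over TCP.

```python
# Highly simplified model of the iSCSI initiator/target exchange.
import struct

def read10_cdb(lba, blocks):
    """Build a SCSI READ(10) Command Descriptor Block (10 bytes)."""
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, blocks, 0)

class Target:
    def __init__(self, luns):
        self.luns = luns                      # LUN number -> bytearray "disk"

    def serve(self, pdu):
        lun = self.luns[pdu["lun"]]
        opcode, _, lba, _, blocks, _ = struct.unpack(">BBIBHB", pdu["cdb"])
        assert opcode == 0x28                 # READ(10)
        data = lun[lba * 512:(lba + blocks) * 512]
        return {"data": data, "status": "GOOD"}

class Initiator:
    def __init__(self, target):
        self.target = target                  # stands in for a TCP connection

    def read(self, lun, lba, blocks):
        pdu = {"lun": lun, "cdb": read10_cdb(lba, blocks)}
        return self.target.serve(pdu)         # in reality: encapsulated over TCP/IP

# One LUN of 1000 blocks; read 2 blocks starting at LBA 4.
target = Target({0: bytearray(1000 * 512)})
reply = Initiator(target).read(lun=0, lba=4, blocks=2)
print(reply["status"], len(reply["data"]))    # GOOD 1024
```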
iSCSI adapters
Unlike FC Host Bus Adapters (HBAs), iSCSI network interface cards do not
require FC infrastructure at the host end or in the network, which reduces
ongoing support and management costs. In addition, iSCSI offers several
architectural options that vary in price, bandwidth, CPU usage, etc. The most
cost-effective option is a software-only solution, wherein a server does block
I/O over IP through an ordinary Gigabit Ethernet card.
Unfortunately, this approach suffers in terms of performance and CPU
overhead, making it best suited for low-performance applications such as tape
backup. But as CPUs are becoming more powerful, this is becoming an acceptable
trade-off for the convenience of iSCSI storage access.
The second approach involves TCP off-load engines (TOEs). These are
incorporated into network adapters to remove the burden of TCP processing from
the host CPU, making it more efficient. The third approach lies at the other
extreme: high-performance iSCSI adapters off-load both TCP and iSCSI
processing. Though it adds to the cost, it provides both high-speed iSCSI
transport and minimal CPU usage.
iSCSI or FC
High-performance iSCSI adapters and IP storage switches are proliferating in
the market and dramatically changing the SAN landscape. It's difficult to say
which type of SAN will meet your unique requirements, but consider FC over
iSCSI where availability, high performance and reliability are paramount, for
example, for storing transactions in banks. In other cases, you can choose
iSCSI. For example, an iSCSI SAN will be appropriate for an organization that
does not want to invest in acquiring or training staff to support FC.
IP networks often have higher latency than localized SANs. That means even when
10 Gbps Ethernet and, say, 8 Gbps FC are both available, which in all
probability will be the case by 2008, people are unlikely to switch for such a
small raw-speed advantage, and companies that have already standardized on FC
are even less likely to. Some OSs still don't have native iSCSI support, and
iSCSI management tools (SRMs) are also comparatively less developed. A lot of
trust still needs to be built around iSCSI.
Nonetheless, people will find iSCSI enticing for the other major reason: cost.
Although prices of FC components are coming down fast, they're still much
higher than their IP counterparts, and this price difference is likely to
remain for the foreseeable future. Even if you add in the cost of a TOE, it
does not exceed the FC cost.
Over the past few years we have seen instances that reveal an underlying need
for securing data-in-store. It's not always the data-in-transit that is
vulnerable. Let's see how it is done.
Over the past few years, there's been a sizable increase in malicious attacks
on corporate computer systems and electronic thefts of private information. To
provide protection from these attacks, most companies have secured their systems
and network from outsiders, implementing perimeter-based security strategies
with firewalls and virtual private networks (VPNs) to ensure that external users
can't access sensitive data without authorization. But that's not enough
anymore. Today, you also have to secure data from unauthorized employees and
erroneous or unwanted use by an authorized user.
What comprises storage security?
Typically, there are three parts to storage security: authentication, access
control and encryption. Authentication ensures that only people who have been
authorized can access data. For authentication on a network we have several
standards and protocols, such as Remote Authentication Dial-In User Service
(RADIUS) and Challenge Handshake Authentication Protocol (CHAP). Meanwhile, new
storage-specific methods and standards, such as Diffie-Hellman CHAP, are also
emerging that enable organizations to add authentication to the storage
infrastructure.
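To get a feel for how challenge-response authentication works, here is a minimal Python sketch of the CHAP computation from RFC 1994; DH-CHAP, used in storage networks, adds a Diffie-Hellman exchange on top of this basic scheme, which the sketch does not show. The secret and identifier values are, of course, made up for illustration.

```python
# Sketch of the CHAP exchange (RFC 1994): the verifier sends a random
# challenge and the claimant proves knowledge of the shared secret without
# ever sending the secret itself.
import hashlib, hmac, os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    # CHAP response = MD5(identifier || secret || challenge)
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

secret = b"shared-secret"               # provisioned on both initiator and target
identifier, challenge = 7, os.urandom(16)

# The claimant computes the response; the verifier recomputes and compares.
response = chap_response(identifier, secret, challenge)
expected = chap_response(identifier, secret, challenge)
print("authenticated:", hmac.compare_digest(response, expected))
```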
Access control maps and restricts a user or a system to a particular set of
data. On a network, users can only view data allowed by router access control
lists (ACLs) and the directory services that control access. Within the storage
infrastructure, which servers have access to what data is controlled by
techniques like zoning and LUN (Logical Unit Number) masking (discussed in
PCQuest, May 2007, pg 76-77).
Encryption also forms a key part of an effective storage strategy. There are
two key components in encryption, viz., the encryption algorithm and the key.
There are several standards for implementing encryption. Most systems use
specific algorithms for specific operations, such as 3DES for encrypting data at
rest and AES for encrypting data in flight.
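As a purely illustrative sketch of encrypting a block of data before it hits the storage media, here is a short Python example using AES-256-GCM from the third-party cryptography package (pip install cryptography); whichever algorithm a given product standardizes on, the pattern of key, nonce and ciphertext is the same.

```python
# Illustrative only: AES-256-GCM encryption of a block of data at rest.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # the data encryption key
aesgcm = AESGCM(key)

block = b"customer records destined for the storage array"
nonce = os.urandom(12)                      # must never repeat for the same key
ciphertext = aesgcm.encrypt(nonce, block, None)

# What lands on disk is the ciphertext (plus the nonce); only a holder of
# the key can recover the original block.
assert aesgcm.decrypt(nonce, ciphertext, None) == block
```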
Key lifecycle for data-in-store
Additional considerations
Besides encryption, it is also necessary to ensure that the encrypted data
remains unaltered until decryption. Message digests or secure hashes are
fixed-length bit strings that can be used to verify the integrity of data.
There are various mechanisms to calculate a secure hash: the most common ones
are MD5 and SHA1, but stronger hashes like SHA256, SHA384 and SHA512 are
recommended. Alternatively, a Keyed-Hash Message Authentication Code (HMAC)
with a stronger hash is also an advisable option. An Internet-based Certificate
Authority should also be involved for making key exchanges.
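The integrity check itself is simple to demonstrate. The sketch below, using only Python's standard library, computes a plain SHA-256 digest and an HMAC-SHA512 tag over some data and verifies them on retrieval; the key and data values are hypothetical.

```python
# Verifying that stored data has not been altered: a SHA-256 digest detects
# modification, and an HMAC with a stronger hash ties the check to a secret key.
import hashlib, hmac

data = b"archived financial records"
key = b"integrity-key"                                   # hypothetical shared secret

digest = hashlib.sha256(data).hexdigest()                # plain secure hash
tag = hmac.new(key, data, hashlib.sha512).hexdigest()    # keyed hash (HMAC-SHA512)

# On retrieval, recompute and compare; any change to the data changes both values.
assert hmac.compare_digest(digest, hashlib.sha256(data).hexdigest())
assert hmac.compare_digest(tag, hmac.new(key, data, hashlib.sha512).hexdigest())
print("data intact")
```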
Key management
Encryption is not the end of it. In fact, it gives rise to another important
aspect: key management. There are basically two kinds of encryption that one
can use, symmetric key encryption and asymmetric key encryption. Symmetric key
encryption is preferred for data at rest. IEEE is working on two standards for
symmetric key encryption, namely P1619 (for disks) and P1619.1 (for tapes).
No matter which encryption standards you follow, you should always have a key
hierarchy. The hierarchy must consist of at least two levels of keys: a data
encryption key and a key encryption key (KEK). As the name suggests, the KEK is
used to encrypt the data encryption key before it is stored. The deeper the
hierarchy of keys, the more robust the key management system required to
operate it.
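A minimal sketch of that two-level hierarchy, again using AES-GCM from the cryptography package purely as an example cipher: the data encryption key (DEK) encrypts the data, and only the KEK-wrapped form of the DEK is ever written out.

```python
# Sketch of a two-level key hierarchy: the KEK wraps the DEK for storage.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kek = AESGCM.generate_key(bit_length=256)    # held in the key management system
dek = AESGCM.generate_key(bit_length=256)    # used to encrypt the actual data

# Wrap the DEK under the KEK; only the wrapped form leaves the key manager.
nonce = os.urandom(12)
wrapped_dek = AESGCM(kek).encrypt(nonce, dek, None)

# At read time, the KEK unwraps the DEK, which then decrypts the data.
recovered_dek = AESGCM(kek).decrypt(nonce, wrapped_dek, None)
assert recovered_dek == dek
```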
A key management system is one that combines the devices, people and
operations required to create, maintain and control keys. Some of the common
components of a key management system are:
- Key generators: These can be manual or, preferably, a random number
generator. Ideally, the random number generator should be non-deterministic,
although to date no certified or accepted Non-deterministic Random Bit
Generator (NRBG) exists, due to the lack of a standard verification process.
Until one is developed, it is good practice to use a Deterministic Random Bit
Generator (DRBG) certified by an appropriate agency or by the government of the
country where the storage device is kept.
- Transport of keys: This can again be done manually, through smart cards, or
automatically.
- Encryption devices: These define the granularity of the key, that is, at
which level (disk, directory or individual file) to encrypt data. It depends on
the sensitivity of the data.
- Key archive systems: These provide for easy recoverability of keys. They are
maintained in tamper-proof hardware to ensure key security.
- Key backup files or devices: This is different from key archiving; it is the
practice of securely backing up keys and restoring them in case of a disaster.
Ensuring control of the key backup system is extremely important to key
security and integrity.
Conclusion
Data encryption is nothing new, but it becomes critical with high-volume
storage, and it poses some formidable challenges as well. For example,
encryption and decryption are processor-intensive activities that may slow down
access to stored data. The situation only worsens as organizations store and
access ever larger amounts of information.
The management of the encryption keys should also be handled in an efficient,
reliable and secure way. Moreover, the application should be completely
abstracted from where and how the encryption is being done, and the whole
process of storage security should be automated.
Not only do you need a secure first copy, you also need an available second
copy of the data to restore the normal course of business in case disaster
strikes.
You never know the value of a backup until a disaster strikes and wipes out
the principal copy of your data. Many of us realized the value of investing in
backups at remote locations only after seeing the mishaps of recent times and
the devastation they caused to the unprepared. The traditional method of data
protection has been to back data up on tape and then physically move it to a
safe location, preferably far away from the main office. However, with the
proliferation of a wide range of automated disk-based backup systems, the stage
is set for WAN data replication technologies to mature. On one hand, we are
seeing advances in networking technologies; to name a few, optical networks now
support storage networking protocols, flow control mechanisms and efficient
transport capabilities. On the other hand, in the storage world, data
de-duplication technologies are coming up, which significantly reduce the
volume of data to be backed up.
Networking technologies
To send data to geographically dispersed storage, you need a resilient storage
networking infrastructure. For inter-datacenter data replication, you need a
network with low latency and minimal packet loss, and its bandwidth should also
be scalable. With these sensitivities in mind, there are two options for
building a network to support data replication: Coarse/Dense Wavelength
Division Multiplexing (C/DWDM) and Synchronous Optical Network/Synchronous
Digital Hierarchy (SONET/SDH).
C/DWDM is a technology that maps data from different sources and protocols
together on an optical fiber with each signal carried on its own separate and
private light wavelength. It can be used to interconnect data centers via a
variety of storage protocols such as Fibre Channel, FICON, and ESCON. It has
been verified to support data replication over distances up to several hundred
kilometers. C/DWDM provides bandwidth from one to several hundreds of
gigabits/second (Gbps).
SONET/SDH technology is based on Time Division Multiplexing (TDM). With this
technology, enterprise data centers can be interconnected over thousands of
kilometers for data replication and other storage-networking needs. Storage over
SONET/SDH is a reliable and readily available networking option.
Data de-duplication
Though data de-duplication technologies have been around for years, there is
renewed focus on them recently, as they are being utilized by products in the
disk-based backup market. Data reduction makes disk a feasible medium for
long-term backup retention, bringing it to the same or lower cost than
tape-based systems.
Moreover, data de-duplication addresses the issue of data replication for
disaster recovery. With the reduced amount of data after de-duplication, the
network bandwidth required for replication drops significantly. This makes
replication possible even for smaller companies with lower budgets.
There are two primary methods of data reduction found in disk-based backup
systems. One is byte-level delta data reduction, which compares versions of
data over time and stores only the differences at the byte level. The other is
block-level data de-duplication, in which blocks are read from the written data
and only the unique blocks are stored.
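Here is a toy Python sketch of the block-level approach: split the backup stream into fixed-size blocks, fingerprint each block, and physically store a block only the first time its fingerprint is seen. Commercial products use variable-size chunking and far more sophisticated indexes; the 4 KB block size and sample data below are assumptions for illustration.

```python
# Sketch of block-level de-duplication on two consecutive backup runs.
import hashlib

BLOCK = 4096

def dedupe(stream: bytes, store: dict) -> list:
    """Return a recipe (list of fingerprints); new blocks go into 'store'."""
    recipe = []
    for i in range(0, len(stream), BLOCK):
        block = stream[i:i + BLOCK]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:              # unique block: keep it
            store[fp] = block
        recipe.append(fp)                # duplicates keep only the reference
    return recipe

store = {}
monday = b"A" * 20000 + b"B" * 20000
tuesday = b"A" * 20000 + b"C" * 20000    # half the data unchanged since Monday

dedupe(monday, store)
dedupe(tuesday, store)
stored = sum(len(b) for b in store.values())
print(f"logical: {len(monday) + len(tuesday)} bytes, physically stored: {stored} bytes")
```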
Byte-level delta data reduction outperforms block-level de-duplication in a
disk-based backup system, as it scales to larger amounts of data. It avoids
hash-table and restore-fragmentation issues. It also processes the backup data
after it's been written to disk and, above all, it is content-aware and
optimized for your specific backup application: it knows how each backup
application operates and understands file content and boundaries. All in all,
this helps optimize the de-duplication process.
Flavors of replication
There are several products available in the market for data replication over
WANs. There are four flavors of replication to choose from. You can do it at the
application level, host level, in the storage arrays, or with a storage
networking appliance. The advantage of doing it at the application level is
that the application is fully aware of the replication, so DBAs have confidence
in data integrity. It supports both synchronous and asynchronous replication,
and it is not hardware-dependent. The disadvantages of this approach are that
application owners themselves are responsible for recovery, it is specific to a
particular application, and it often does not protect application files.
If you do it at the host level, failover can be automated, whereas at the
application level the DBAs themselves need to pull up the data in case of a
failure. Host-level replication supports disparate hardware, and many-to-one
replication can be facilitated. As cons, you can count its dependence on the
OS, the additional resources it requires on the host, and the need to
explicitly integrate replication with the applications. If you plan to
replicate at the storage array level, you must know that it is unlikely to
support dissimilar hardware, secondary copies are only usable with
point-in-time copies, it requires integration with applications, you need to
work on fabric extension, and there is the added complexity of keeping
everything in sync. On the plus side, replicating at the storage array is
agnostic to applications and OS, does not use any host resources, replicates
all kinds of data and is also easier to manage.
Having replication in the fabric is a little costlier, since a separate
appliance and additional hardware are required, but it provides modern
functionality like Continuous Data Protection (CDP). In this approach, no array
or host resources are consumed; understandably, it is also agnostic to
applications and OS. Besides, it is highly scalable.
If atomic storage was not enough to baffle a common man's wit, these days we
are hearing talk of bacteria-based storage. Let's find out how far the research
has reached.
We have heard researchers talk about atomic storage, where atoms can hold 250
Terabits of data in a surface area of one square inch. We also know about
organic thin-film structures that can withstand more than 20,000
write-read-rewrite cycles. But now researchers have started talking about the
potential of bacteria to store data. Sounds interesting? It is, and it's not
something out of science fiction.
State of research
Researchers from two prominent institutions came out with results that indicate
the possibility of storing digital data in the genome of a living organism. Not
only that, the data thus stored could be retrieved hundreds or even thousands
of years later. They believe that the potential capacity of bacteria-based
memory is enormous, as more than a billion bacteria are contained in just one
milliliter of liquid; you can imagine the scale.
To this effect, a group of Pacific Northwest National Laboratory (PNNL)
researchers described an experiment some three years ago in which they stored
roughly one encoded English sentence in a bacterium. Taking this further,
earlier this year scientists at Japan's Keio University Institute for Advanced
Biosciences reported similar results in their research. They claim to have
successfully encoded "e=mc2 1905!", Einstein's famous equation and the year he
enunciated it, on the common soil bacterium Bacillus subtilis.
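To see how digital data can be expressed as DNA at all, here is an illustrative Python sketch that maps two bits onto each nucleotide base. This is not the specific encoding scheme used by the Keio or PNNL teams, just a minimal sketch of the idea.

```python
# Illustrative two-bits-per-base mapping of text onto a DNA-like sequence.
TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
FROM_BASE = {v: k for k, v in TO_BASE.items()}

def encode(text: str) -> str:
    bits = "".join(f"{byte:08b}" for byte in text.encode())
    return "".join(TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> str:
    bits = "".join(FROM_BASE[b] for b in strand)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode()

strand = encode("e=mc2 1905!")
print(strand)                 # a synthetic nucleotide sequence, 4 bases per byte
assert decode(strand) == "e=mc2 1905!"
```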
According to the scientists, one of the challenges faced in carrying out the
research was providing a safe haven for DNA molecules, which are easily
destroyed in any open environment inhabited by people or other hazards. The
solution to the fragility of DNA was found in a living host: one that tolerates
the addition of artificial gene sequences and survives extreme environmental
conditions. The host with the embedded information was also able to grow and
multiply.
Potential applications
Storing is not the end of the story. If you can store data on a bacterium but
cannot retrieve it successfully and easily, the purpose remains unserved. For
now, retrieval of the data remains a wet-laboratory process requiring a certain
amount of time and effort.
Most of the potential applications for DNA-based data storage relate to the
core missions of the US Department of Energy (DoE), which funded the research
carried out at PNNL. The technology, once developed, could also be used for
other security-related applications such as information-hiding and data
steganography (the hiding of data inside other data), both in commercial
products and for State security. Since DNA-based storage can survive nuclear
threats, one possible application is keeping copies of data that might
otherwise be destroyed in a nuclear explosion. Anyway, who wants to talk about
nuclear explosions; that's insane!