No matter what kind of business you are in, your data is bound to grow, and
you end up adding storage space continually. Often, however, multiple copies of
the same data occupy your storage pool. For instance, a presentation about your
company's products might be stored by various users in various departments,
wasting storage space. On top of this, whenever you back up your primary
storage device, those copies are duplicated again on the backup storage, be it
tape or disk. Data de-duplication technologies can do away with all this
duplication.
Data de-duplication refers to the removal of redundant data. In the
de-duplication process, a single copy of the data is maintained along with an
index of the original data, so that the data can be easily retrieved when
required. Besides saving disk space and reducing hardware costs (storage
hardware, cooling, backup media, etc.), another major benefit of data
de-duplication is bandwidth optimization.
Data de-duplication can be deployed in two ways: source based and target
based. Source based de-duplication is done before the backup, i.e., at the
primary storage such as NAS, while in the target based method de-duplication is
done after the backup (post-process). In a target based method, de-duplication
can also be done during the backup, which is known as inline de-duplication.
The benefit of inline de-duplication over post-process de-duplication is that
it requires less storage, because redundant data is never written to the backup
storage in the first place.
Source based data de-duplication is usually deployed in environments such as
file systems, remote branch offices and virtualization environments. In a
remote backup scenario, source based de-duplication also means that less data
travels through the WAN pipe, resulting in effective bandwidth utilization.
Target based de-duplication is a good option where bandwidth is not an issue,
such as SAN or LAN backup environments.
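To illustrate why source based de-duplication saves WAN bandwidth, here is a minimal Python sketch. It assumes fixed 8 KB blocks and SHA-256 hashes, and the names BackupTarget and source_side_backup are hypothetical, not taken from any vendor's product. The source first asks the target which block hashes it lacks and then sends only those blocks.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # assumed fixed block size for this sketch


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BackupTarget:
    """Hypothetical remote backup server that stores blocks by hash."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def missing(self, digests: list[str]) -> set[str]:
        """Return the hashes the target does not hold yet."""
        return {d for d in digests if d not in self.blocks}

    def store(self, digest: str, block: bytes) -> None:
        self.blocks[digest] = block


def source_side_backup(data: bytes, target: BackupTarget) -> int:
    """Send only blocks the target lacks; return the payload bytes sent."""
    blocks = split_blocks(data)
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    needed = target.missing(digests)
    sent = 0
    for digest, block in zip(digests, blocks):
        if digest in needed:
            target.store(digest, block)
            needed.discard(digest)  # a block repeated within one backup is sent once
            sent += len(block)
    return sent
```

In this sketch, backing up the same data a second time sends no block payload at all, only the hashes. A real product would also have to account for the hash traffic itself and for network failures.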
How does it work?
Broadly, data de-duplication vendors use three techniques: file level, block
level and byte level. File level de-duplication, also known as single instance
storage (SIS), searches for identical files on the disk and keeps only one
copy. The biggest drawback of this method is that if the same file is present
under two different names, it won't be eliminated. Block level de-duplication
works at a more granular level than file level de-duplication. Here data is
broken down into blocks, which can be logical (variable length) or fixed length
blocks, and the de-duplication solution looks for unique blocks (most solutions
do this by calculating a hash of each block). When a unique block is stored, an
identifier for it is created in the index. Whenever a repeated block comes
along, instead of storing the entire block again, a pointer to the existing
block is recorded, thus saving storage space. Block level de-duplication offers
several advantages over file level de-duplication. The same file stored under
two different names, which file level de-duplication does not remove, is easily
removed by block level de-duplication. Also, if only a part of a file is
modified, only the modified part is stored as new data rather than the entire
file.
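The block level process described above can be sketched in a few lines of Python. This is an illustrative toy and not any vendor's implementation. It assumes fixed length blocks and SHA-256 as the hash, and the class name BlockStore and its methods are hypothetical.

```python
import hashlib


class BlockStore:
    """Toy block level de-duplicating store (illustrative only)."""

    def __init__(self, block_size: int = 8 * 1024) -> None:
        self.block_size = block_size
        self.index: dict[str, bytes] = {}      # block hash -> the single stored copy
        self.files: dict[str, list[str]] = {}  # file name -> ordered block pointers

    def put(self, name: str, data: bytes) -> None:
        pointers = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if it is new; otherwise just point to it.
            self.index.setdefault(digest, block)
            pointers.append(digest)
        self.files[name] = pointers

    def get(self, name: str) -> bytes:
        """Rebuild a file by following its pointers into the index."""
        return b"".join(self.index[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self.index.values())


# A tiny block size keeps the example readable.
store = BlockStore(block_size=4)
store.put("products.ppt", b"aaaabbbbcccc")
store.put("copy_of_products.ppt", b"aaaabbbbcccc")  # same content, new name
store.put("edited.ppt", b"aaaaXXXXcccc")            # one block changed
```

After these three calls the store holds 16 bytes of block data instead of 36. The renamed copy adds nothing, and the edited file adds only its one changed block.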
Byte level data de-duplication is mostly used in post-processing scenarios.
Here, new data is compared at byte level with already existing data and only the
changes are stored. Byte level de-duplication can deliver accurate backups, but
byte by byte comparison is time consuming, which is precisely why the
de-duplication is done after the backup, but before data is finally written.
The catch is that this requires extra disk space to hold the data while
de-duplication is being done. For block level de-duplication to work
effectively, data needs to be broken into very small chunks, mostly around
8 KB. The drawback here is that the smaller the block size, the more entries
there are in the hash table, and handling the table itself can become a
challenge. By comparison, byte level de-duplication stores data in large
segments, mostly around 100 MB.
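The effect of block size on the index can be estimated with some rough arithmetic. The sketch below makes several assumptions that are not in the text above: a worst case of one index entry per block (no duplicates found), 40 bytes per entry (a 32-byte hash plus an 8-byte block location), and binary units (8 KB taken as 8 x 1024 bytes). The function names are hypothetical.

```python
def index_entries(data_bytes: int, block_bytes: int) -> int:
    """Worst-case entry count: one entry per block, rounded up."""
    return -(-data_bytes // block_bytes)  # ceiling division


def index_size_bytes(data_bytes: int, block_bytes: int, entry_bytes: int = 40) -> int:
    """Approximate index size for an assumed per-entry cost."""
    return index_entries(data_bytes, block_bytes) * entry_bytes


ONE_TB = 2 ** 40

small_blocks = index_entries(ONE_TB, 8 * 1024)         # 8 KB blocks
large_segments = index_entries(ONE_TB, 100 * 2 ** 20)  # 100 MB segments
```

Under these assumptions, 1 TB of data in 8 KB blocks needs about 134 million index entries (roughly 5 GB of index), while 100 MB segments need only about 10,500 entries.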
Benefits
In an enterprise, most redundant data comes from backing up the same data
again and again. Vendors claim that, depending on the environment and the type
of de-duplication technology used, enterprises can achieve a de-duplication
ratio of 50:1 in a source based scenario and 20:1 in a target based scenario.
However, before choosing a particular solution, it's important to find out
which technologies will suit your current environment and what your priorities
are. For instance, are you looking to cut down your WAN costs along with your
storage costs? Also, target based de-duplication can come as an add-on to your
existing data backup solution, whereas for source based de-duplication you
might need to deploy the solution from scratch. Last but not least, data
de-duplication is also considered a green technology, as a reduction in storage
space also means lower power consumption and reduced carbon emissions.
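To put the vendor-quoted ratios in perspective, a de-duplication ratio of N:1 means the stored data is 1/N of the original size, so the fraction of space saved is 1 - 1/N. The short Python helper below, with the hypothetical name space_saved, does this conversion. It takes the ratios at face value and says nothing about whether a given environment will actually reach them.

```python
def space_saved(ratio: float) -> float:
    """Fraction of storage saved for a de-duplication ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("a de-duplication ratio cannot be below 1:1")
    return 1.0 - 1.0 / ratio


source_based = space_saved(50)  # 50:1 ratio
target_based = space_saved(20)  # 20:1 ratio
```

A 50:1 ratio works out to about 98% of the space saved and 20:1 to about 95%, so the large gap between the two ratios is only a few percentage points of actual storage.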