Tech Explained

Data De-Duplication

PCQ Bureau

06 Oct 2009 08:14 IST

No matter what kind of business you are in, your data is bound to grow, which forces you to keep adding storage. Often, though, multiple copies of the same data occupy your storage pool. For instance, a presentation about your company's products might be saved by several users in different departments, wasting storage space. On top of this, whenever you back up your primary storage device, those copies are duplicated again on the backup storage, be it tape or disk. Data de-duplication technologies let you do away with all this duplication.

Data de-duplication refers to the removal of redundant data. In the de-duplication process, a single copy of the data is kept along with an index of the original data, so that the data can be easily retrieved when required. Besides saving disk space and reducing hardware costs (storage hardware, cooling, backup media, etc.), another major benefit of data de-duplication is bandwidth optimization.

Data de-duplication can be deployed in two ways: source based and target based. Source based de-duplication is done before the backup, i.e., at the primary storage such as a NAS, while in the target based method de-duplication is done after the backup (post-process de-duplication). However, in the target based method de-duplication can also be done during the backup, which is known as inline de-duplication. The benefit of inline de-duplication over post-process de-duplication is that it requires less storage, because the full backup is never written to disk in its duplicated form first.
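The storage difference between inline and post-process de-duplication can be illustrated with a rough calculation. The sketch below is a simplification under stated assumptions: the function name `peak_disk_usage_gb` is hypothetical, and it assumes a post-process system holds the whole raw backup on disk until the reduced copy has been written, which real products may handle differently.

```python
def peak_disk_usage_gb(backup_gb: float, dedup_ratio: float, inline: bool) -> float:
    """Rough peak disk space needed on the target to ingest one backup.

    Assumption: a post-process system stages the raw backup in full and
    frees it only after the de-duplicated copy has been written.
    """
    stored_gb = backup_gb / dedup_ratio   # what remains after de-duplication
    if inline:
        return stored_gb                  # duplicates are dropped before reaching disk
    return backup_gb + stored_gb          # staged raw copy plus the reduced copy


if __name__ == "__main__":
    for mode, inline in (("inline", True), ("post-process", False)):
        print(mode, peak_disk_usage_gb(1000, 20, inline), "GB")
```

For a 1,000 GB backup that reduces 20:1, this model gives a peak of 50 GB inline versus 1,050 GB post-process. The final stored size is the same in both cases; only the temporary staging space differs.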

Source based data de-duplication is usually deployed in environments such as file systems, remote branch offices and virtualization environments. In a remote backup scenario, source based de-duplication also means that less data travels through the WAN pipe, resulting in more effective bandwidth utilization. Target based de-duplication is a good option where bandwidth is not an issue, such as SAN or LAN backup environments.
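To see why source based de-duplication saves WAN bandwidth, consider this minimal sketch. It is not any vendor's protocol: `BackupTarget`, `source_side_backup` and the 8 KB chunk size are assumptions made for illustration. The source first asks the target which chunk hashes it lacks, then sends only those chunks.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # assumed fixed chunk size


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (digest, chunk) pairs for consecutive fixed-size chunks."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk


class BackupTarget:
    """Hypothetical stand-in for the remote backup store."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def missing(self, digests):
        """Return the digests the target does not hold yet."""
        return {d for d in digests if d not in self.chunks}

    def upload(self, digest, chunk):
        self.chunks[digest] = chunk


def source_side_backup(data: bytes, target: BackupTarget) -> int:
    """Back up data and return the payload bytes sent over the simulated WAN."""
    pairs = list(chunk_digests(data))
    needed = target.missing(d for d, _ in pairs)  # only hashes cross the wire here
    sent = 0
    for digest, chunk in pairs:
        if digest in needed:
            target.upload(digest, chunk)
            needed.discard(digest)  # do not resend a chunk repeated within this backup
            sent += len(chunk)
    return sent


if __name__ == "__main__":
    target = BackupTarget()
    data = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE + b"A" * CHUNK_SIZE
    print(source_side_backup(data, target))  # first backup: two unique chunks sent
    print(source_side_backup(data, target))  # repeat backup: nothing left to send
```

The sketch counts only chunk payloads. A real system also spends some bandwidth exchanging the hashes themselves.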

How it works

Broadly, data de-duplication vendors use three techniques: file level, block level and byte level. File level de-duplication, also known as single instance storage (SIS), searches for identical files on the disk and keeps only one copy. Its biggest drawback is that it works on whole files only: a solution that matches files by name and attributes rather than by content will not eliminate the same file stored under two different names, and even a small change makes a file count as entirely new. Block level de-duplication works at a more granular level than file level de-duplication. Here data is broken down into blocks, which can be logical or fixed length blocks, and the de-duplication solution identifies unique blocks (most solutions do this by calculating a hash of each block). When a unique block is stored, its identifier is added to the index. Whenever a repeated block comes along, instead of storing the entire block again, a pointer to the existing block is placed in the index, thus saving storage space. Block level de-duplication offers several advantages over file level de-duplication. The same file stored under two different names, which a name based file level solution would miss, is removed by block level de-duplication because its blocks are identical. Also, if only a part of a file is modified, only the modified blocks are stored rather than the entire file.
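The block level mechanism described above (hash each block, store unique blocks once, keep pointers for repeats) can be sketched as follows. This is a minimal illustration, not a production design: the `BlockStore` class is hypothetical, it uses fixed length blocks and SHA-256 as assumptions, and it keeps everything in memory.

```python
import hashlib


class BlockStore:
    """Toy block level de-duplicating store (illustrative only)."""

    def __init__(self, block_size: int = 8 * 1024):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes; each unique block is stored once
        self.files = {}   # file name -> ordered list of digests (the "pointers")

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # store only if not seen before
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name: str) -> bytes:
        """Rebuild a file by following its pointers into the block index."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())


if __name__ == "__main__":
    store = BlockStore(block_size=4)               # tiny blocks to keep the demo readable
    store.put("products.ppt", b"AAAABBBBCCCC")
    store.put("copy_of_products.ppt", b"AAAABBBBCCCC")  # same content, different name
    print(store.stored_bytes())                    # 12: the renamed copy adds nothing
    store.put("edited.ppt", b"AAAAXXXXCCCC")       # only the middle block changed
    print(store.stored_bytes())                    # 16: one new 4-byte block stored
```

Note that with fixed length blocks, inserting a single byte near the start of a file shifts every later block boundary, so the remaining blocks no longer match existing ones. This sketch does not address that case.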

Byte level data de-duplication is mostly used in post-processing scenarios. Here, new data is compared at byte level with already existing data and only the changes are stored. Byte level de-duplication can deliver accurate backups. However, byte-by-byte comparison is time consuming, which is precisely why de-duplication is done after the backup, but before the data is finally written. The catch is that this requires extra disk space, so that the data can be held while de-duplication is carried out. For block level de-duplication to work effectively, data needs to be broken into very small chunks, mostly around 8 KB. The drawback is that the smaller the block size, the more entries there are in the hash table, and handling the table can itself become a challenge. Compared to this, byte level de-duplication stores data in large segments, mostly around 100 MB.
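A quick calculation shows how fast the index grows as the block size shrinks. The helper names below are hypothetical, and the 40 bytes per entry (a 32-byte SHA-256 digest plus an 8-byte location) is an assumption for illustration. Actual entry sizes vary by product.

```python
KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4


def index_entries(unique_data_bytes: int, unit_bytes: int) -> int:
    """Number of index entries needed to track the data (rounded up)."""
    return -(-unique_data_bytes // unit_bytes)  # integer ceiling division


def index_size_bytes(unique_data_bytes: int, unit_bytes: int, entry_bytes: int = 40) -> int:
    """Approximate index size, assuming entry_bytes per entry."""
    return index_entries(unique_data_bytes, unit_bytes) * entry_bytes


if __name__ == "__main__":
    for label, unit in (("8 KB blocks", 8 * KIB), ("100 MB segments", 100 * MIB)):
        entries = index_entries(TIB, unit)
        size_mib = index_size_bytes(TIB, unit) / MIB
        print(f"{label}: {entries:,} entries, about {size_mib:,.1f} MiB of index")
```

Under these assumptions, 1 TiB of unique data needs 134,217,728 entries (about 5 GiB of index) at 8 KB blocks, compared with 10,486 entries at 100 MB segments.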

Benefits

In an enterprise, most redundant data comes from backing up the same data again and again. Depending upon the environment and the type of de-duplication technology used, vendors claim that enterprises can achieve a de-duplication ratio of 50:1 in a source based scenario and 20:1 in a target based scenario. However, before choosing a particular solution, it is important to find out which technologies suit your current environment and what your priorities are. For instance, are you looking to cut down your WAN costs along with your storage costs? Also, target based de-duplication can come as an add-on to your existing data backup solution, whereas for source based de-duplication you might need to deploy the solution from scratch. Last but not least, data de-duplication is also considered a green technology, as a reduction in storage space also means less power consumption and lower carbon emissions.
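To put the quoted ratios in perspective, a de-duplication ratio of N:1 means that only 1/N of the backed-up data is physically stored. The short sketch below converts a ratio into a space saving. The function names are hypothetical, and the ratios are the vendor claims cited above, not measured results.

```python
def space_saved_percent(ratio: float) -> float:
    """Percentage of logical data that needs no physical storage at ratio:1."""
    if ratio < 1:
        raise ValueError("a de-duplication ratio cannot be below 1:1")
    return (1 - 1 / ratio) * 100


def physical_tb(logical_tb: float, ratio: float) -> float:
    """Physical capacity needed to hold logical_tb of backups at ratio:1."""
    return logical_tb / ratio


if __name__ == "__main__":
    for label, ratio in (("source based", 50), ("target based", 20)):
        print(f"{label}: {ratio}:1 saves {space_saved_percent(ratio):.0f}% "
              f"({physical_tb(100, ratio):.0f} TB stored per 100 TB backed up)")
```

So 50:1 corresponds to a 98% saving and 20:1 to 95%. The jump from 20:1 to 50:1 sounds large, but it frees only three more terabytes out of every hundred.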
