
The Future of Computing

PCQ Bureau

Fifty years from now, when you save a document it will no

longer be a single file. You will not even know exactly where that document has

been saved. Automatically, the system will create a read-only copy of it and

save it to yet another location. You will be able to group your documents,

e-mail and other content into categories and say who has what kind of access to

them. You would also be able to instruct data either to self-destruct under certain
conditions or to remain alive somewhere for all time. Your PC will be a workstation

connected to a vast network of both peers and servers all over the world and it

will seamlessly exchange information over well-defined and architected P2P

protocols. Whenever you need your document, you will be able to see all the

different versions of it and open one of them as you need.


Sounds like fantasy? Think again. File systems (FS) that exist today already do most, if not all, of this, although pretty much in discrete islands. Where we are headed seems to be the above world of interconnected, seamless systems that provide a global computing experience. Now for the first important question: how will all this be accomplished? It will happen because of the radical changes happening to the way our computers store and handle data. The way that is about to happen is what our story is about. Let's dive in.

What is a file system?



Let's start with an analogy. In your office, you put away paper files into

filing cabinets and maintain separate records of where each file is. A file

system in a computer works the same way. If the hard disk is the equivalent of your filing cabinet, then the FS is your index of what files lie within the filing

cabinet.
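
To make the analogy concrete, here is a toy sketch (illustrative only, nothing like a real on-disk layout) of a file system as an index that maps names to the disk blocks holding the contents:

# A toy illustration (not any real FS): a directory table maps file names
# to the disk blocks that hold their contents.
disk = ["" for _ in range(16)]           # pretend disk with 16 blocks
index = {}                               # the "file system": name -> block numbers

def save(name, data, block_size=4):
    blocks = []
    for i in range(0, len(data), block_size):
        free = disk.index("")            # find a free block
        disk[free] = data[i:i + block_size]
        blocks.append(free)
    index[name] = blocks                 # record where the pieces went

def load(name):
    return "".join(disk[b] for b in index[name])

save("notes.txt", "hello world")
print(load("notes.txt"))                 # -> hello world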

FS have different ways and levels of maintaining

information. Some limit themselves to basic descriptions of what files are

stored. Others add capabilities like access control, encryption and compression,

and let you create links to refer to files/folders. The reason for this differentiation is the platform or the need that a file system serves. And the way an FS works has largely depended on the OS it runs under.

As far as FS go, there isn't a perfect solution for all needs yet and the

evolution continues. Every major OS release is accompanied by a minor or major

improvement or overhaul of the FS it depends on. NTFS for instance has been

revised five times in all so far and each of them has added a feature or

enhancement. In comparison, the “ext” FS for Linux has seen three technical

evolutions.


Tied to an OS?



The earliest operating systems singled out one FS (more if the FS was backward-compatible with another one) that they could use. That trend faded as one OS evolved out of another and needed to use the older one's

applications. This brought in the era where you could use multiple FS, sometimes

different from one another, under an OS. The way we compute and deal with data

also changed. We're now in an era where we save documents to a server

somewhere on the Web that does not even belong to us. This actually brings in

new challenges. The average-Joe user does not care where his files are and

what's happening behind the scenes. All he would want is to be able to store

his files and get them back when he wants them. This is why concepts like

distributed FS came in.

New requirements



The original requirements of needing to keep track of what files are stored

on a disk still hold for today's FS. But there are some new ones too. Let's

take them one at a time below.

Scalability:

A 630 MB hard disk can no longer fulfill your needs. Today, it is common

to find even a 120 or 200 GB hard disk filled up in a couple of months. If this

is a common global pattern, instead of each disk maintaining a separate discrete

FS, a common external FS would be more efficient. Such an FS would be scalable, with no limit beyond which more disks could not be added. Already, this is

an urgent need. Specialized storage systems like SAN and DAS systems already use

such FS to virtualize and manage lots of hot-swappable disks that may be removed

and replaced with a blank one anytime.



Virtualization:
The need is for a virtualizing FS that can virtualize

(at least to the user and his applications) a standard set of features and capabilities, regardless of what the system can do behind the scenes. Storage systems take care of this by running their own OS and abstracting a virtual FS

over the network.



Longevity and Destructibility:
Both are now required by laws like Sarbanes-Oxley (Sarb-Ox). You need data to be available for long periods of time and destroyed permanently at the end of that time. Traditional FS may let you do the first one easily. But when it comes to retrieving deleted data from a hard disk, there is plenty of recovery software. The best permanent-destruction solution today is to overwrite the data several times with random or patterned values. But is there something

better? If the previous FS records can still be accessed, then it's the logic

of the FS that needs to improve, not the means to hide content.
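
As an illustration of why the FS logic matters, here is a minimal sketch (in Python, not tied to any product) of the multi-pass overwrite approach. Note that it only helps on an FS that writes in place; copy-on-write or journaled FS may still hold the old blocks elsewhere, which is exactly the point above:

import os

def shred(path, passes=3):
    """Overwrite a file's blocks with random data several times, then delete it.
    Only effective on an FS that writes in place; copy-on-write or journaled FS
    may keep the old blocks elsewhere."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))    # overwrite with random bytes
            f.flush()
            os.fsync(f.fileno())         # push this pass down to the disk
    os.remove(path)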



Content independence:
Information in the enterprise can be in many

different forms and formats. The usual solution to retrieve and make use of them

is to use an enterprise application. But if the FS itself can help do this, that

would be ideal.



Performance:
The performance of an FS is a function of the ease and speed with which the OS can locate a file and access its content. Some FS, like

FAT, are optimized for floppies and small hard disks. Others like Red Hat's

GFS and Sun's ZFS are for high capacity storage. So, we have different file

systems for different sized storage and performance would depend on what you

selected.


How are these factors driving changes in FS? Let's

take up different areas of FS usage and examine them in detail.

Despite their closed nature, FAT and NTFS together have the biggest base of third-party tools for management (partitioning, recovery, etc). HPFS386 (HPFS for servers) was originally designed to be what NTFS is today, with reduced fragmentation, mixed code pages and more. The file system supporting the maximum single-volume size today is UFS2 (UNIX File System), at 1 YottaByte (10²⁴ bytes).

Keep the users happy

ZFS will reach its file limit in 9,000 years if you create 1,000 files a second

One of the challenges for the administrator at the desktop

and server level is to strike a balance between security and recoverability. In

this segment, when we talk of “server”, we refer to the basic file server,

which may be performing additional tasks like authentication. The enterprise

user needs to be given the right amount of security for his files. At the same

time, when data is lost for some reason, it should be easily recoverable. A

third key element in deciding the right desktop file system is efficient usage

of limited capacity. Unlike at the server, desktops have limited storage (say 40

GB). This space has to be utilized as efficiently as possible.


Traditional file

systems that we know and use today are 64-bit at best. This lets them

store a few terabytes of data independently before you need to start

thinking about clustering many of them to address more content. A 128-bit

system can store about 10²¹ bytes of data, which is a thousand times more than all of human knowledge till the end of the previous millennium. This

is the first notable point about Sun's ZFS file system. The theoretical

capacity of ZFS is 16 Exabytes per system. Other than this, it also has

other features that let it be used in mission-critical environs.

Loss prevention



    Data is susceptible to corruption. File systems follow various mechanisms
to prevent or minimize this loss. A reason for data loss is that blocks

get over-written on the disk because file sizes keep changing. In ZFS,

this is prevented by writing new data to new locations on the disk and

then deleting the old information. This way, file expansion is less likely

to overwrite adjacent blocks belonging to (other) files. This principle is similar to that of the WAFL file system (discussed later in this story), which works the same way. To verify the integrity of files, most modern file systems use checksums. Ordinarily, these are 32-bit checksums, but ZFS uses 256-bit checksums, letting it protect data a

little more aggressively. ZFS minimizes performance problems faced by

journaling file systems due to excessive writes by grouping write

operations into 'transaction blocks' and then treating these groups as

one.
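
The following toy model (our illustration, not ZFS code) shows the two ideas working together: new data always lands in fresh blocks, and every block carries a checksum that is verified on read:

import hashlib

blocks = {}            # block id -> (data, checksum)
next_id = 0

def write_block(data):
    """Copy-on-write style: data always goes to a fresh block, never in place."""
    global next_id
    blk = next_id
    next_id += 1
    blocks[blk] = (data, hashlib.sha256(data).hexdigest())
    return blk

def read_block(blk):
    data, checksum = blocks[blk]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError("checksum mismatch: block %d is corrupt" % blk)
    return data

# "Updating" a file means writing a new block and only then retiring the old one.
old = write_block(b"version 1 of the file")
new = write_block(b"version 2 of the file")
file_pointer = new          # the file now points at the new block
del blocks[old]             # the old block is freed only after the switch
print(read_block(file_pointer))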



    Like NTFS, ZFS also takes snapshots in a content-sensitive way. Normally,
snapshots are copies of the entire file, which end up occupying a lot of space as data grows. Sun's implementation will snapshot only that part

of the data that has changed, letting the file system simply use pointers

instead of copies of unchanged information. This way, the disk is also

utilized more efficiently.



Solaris 10, the parent OS for ZFS, also bundles virtualization
technologies. ZFS makes use of this by adding storage virtualization at a

very low level. It also removes the necessity for separate volume

management for each storage device. This also makes ZFS highly scalable,

since you can add more capacity without needing to make changes anywhere

else. This ease of management is also enhanced by the new ZFS paradigm of

creating policies instead of actions. Here, you can instruct the system on

a policy to apply (like quotas) instead of actually doing a step-by-step

implementation.



    ZFS natively supports mirroring data to other disks in the storage pool.
This means that, like EMC's CAS storage system, ZFS can correct corrupted data blocks using checksums and by retrieving data from locations with the correct

data. Since it can do this auto-magically, there is no need to perform a

'fsck' on a ZFS volume. In spite of all this functionality, the file

system appears to the user as a normal POSIX system. And if you're not

satisfied with the way ZFS works, the file system implementation is

available under CDDL with source code. It is a part of OpenSolaris as

well.



Shortcomings



It's not as if ZFS is free of shortcomings. Two major ones at the moment (although unconfirmed) are: one, you cannot mount ZFS volumes under any other OS should you so require it. But we can probably expect at least third-party solutions for this in the near future. The second is that you

cannot convert between existing UNIX File System (UFS) and ZFS formats,

meaning you need to install from scratch if you want ZFS on the system.

Now in a typical enterprise layout, users would be storing

their files mostly in a central file server where the above aspects have been

taken care of. Then the only files that would need recovery at the desktop would

be downloaded e-mail, instant message logs and any files that have not yet been

stored on the file server. If your organization uses centralized messaging

storage (as is likely if you're using something like Lotus Domino/Notes or MS

Exchange), this aspect is transferred to the messaging server. Also, because of

standardization of software in the enterprise, the desktops would be all on the

same OS (some flavor of Linux or Windows) and hence on ext3 or NTFS.


But all this is set to change. Remember, we said that capacity

is growing. What this also means is that the amount of data being stored is

increasing. A challenge therefore is to not only store TBs of data safely, but

also find it quickly when your user wants it. Traditional FS like FAT, NTFS or

ext3 are not geared for such activity. FAT is meant for low-volume storage and

has no security features. Ext3 and NTFS can address a lot of storage and provide

strong security but they are not inherently search-friendly. Some of the FS that

take care of these problems are: Reiser4, WinFS and ZFS. Strangely, a look at

the specialties of each of the three reveals no discernable pattern.

Limitless expansion:

While there are no known limits to the capacity WinFS or Reiser4 can address,

Reiser4 limits each file to a maximum of 8 TB. ZFS has a limit of 16 Exabytes.

For the uninitiated in the world of numbers, the entire human knowledge in any

known format at the end of the last millennium was only 12 Exabytes. ZFS will be

able to address data for a long time.



Journaling:
Logging changes before they are made is called 'journaling'. If there is a power failure or some error condition before all required operations have been completed, the changes can be rolled back. A journaling file system (JFS) can do away with the frequent CHKDSK/FSCK screens at boot-up. A performance limitation is that there are two write operations per actual write. So, if you simply saved this file on a JFS, the engine would first record that it is about to save the file and then save the file itself.



The Reiser4 FS features “wandering logs”, where B* trees not populated to a certain degree are not saved to disk unless the underlying transaction is complete. Reiser4 is supposed to be much faster than ext3. NTFS also has

journaling, using B+ Trees. ZFS uses out-of-place copies.
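
A minimal sketch of the write-ahead idea (our own illustration, not any particular FS's on-disk journal format) makes the 'two writes per actual write' cost visible:

import json, os

JOURNAL = "fs.journal"

def journaled_write(path, data):
    # First write: record the intent in the journal.
    entry = {"path": path, "data": data, "committed": False}
    with open(JOURNAL, "a") as j:
        j.write(json.dumps(entry) + "\n")
        j.flush(); os.fsync(j.fileno())
    # Second write: the actual data.
    with open(path, "w") as f:
        f.write(data)
        f.flush(); os.fsync(f.fileno())
    # Finally, mark the operation complete.
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"path": path, "committed": True}) + "\n")
        j.flush(); os.fsync(j.fileno())

def replay():
    """On boot: operations logged but never committed are known to be suspect."""
    pending = {}
    if not os.path.exists(JOURNAL):
        return []
    with open(JOURNAL) as j:
        for line in j:
            e = json.loads(line)
            if e.get("committed"):
                pending.pop(e["path"], None)
            else:
                pending[e["path"]] = e
    return list(pending)   # paths whose writes never completed; redo or roll back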



Finding data:
WinFS is known for its search-optimization. As we said

in our earlier article (“WinFS Beta 1”, Oct 2005), it would let you run

almost SQL-like queries on your file system to locate what you want. In a

retrogressive move, WinFS is slated to flatten the FS, meaning no more folders.

Instead, it would categorize data by adding attributes to your files and search

based on those values.
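
To get a feel for the flattened, attribute-driven model, here is a small stand-in (a hypothetical table held in SQLite, not the WinFS API) where files are located by their attributes rather than by folder paths:

import sqlite3

# A hypothetical, in-memory stand-in for an attribute-indexed file store:
# files carry tags instead of living in folders, and you query by attribute.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (name TEXT, author TEXT, kind TEXT, modified TEXT)")
db.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", [
    ("budget.xls",  "priya",  "spreadsheet", "2005-11-02"),
    ("notes.doc",   "arvind", "document",    "2005-10-21"),
    ("report.doc",  "priya",  "document",    "2005-11-10"),
])

# "Show me all documents by priya changed this month" -- no folder paths involved.
rows = db.execute(
    "SELECT name FROM files WHERE kind = ? AND author = ? AND modified >= ?",
    ("document", "priya", "2005-11-01"),
).fetchall()
print(rows)   # -> [('report.doc',)]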

So, we're going to see the ability to address more

capacity, locate our data faster and store it reliably.


Time Line

A smaller world

Data is not just stored on individual systems but also on

distributed systems, like network storage and cluster computers. When information is stored on such systems, it becomes irrelevant what platform or OS the specific storage device is using, as long as the broad requirements of the environment are met. It is then up to the particular device to virtualize an FS

and provide uniform services across each client. Therefore, you will find three

layers of FS when you talk of network and Web based storage mechanisms.

The lowest layer is what lies actually at the physical

(disk) level. The second layer is what administrators get to see. The end-user

sees the third layer, which usually takes the form of a network file system such as NFS or SMB. The role that each of these layers fulfills is different. While

the lowest layer takes care of the performance, reliability and efficiency of

the storage it addresses, the middle layer adds manageability and scalability.

The final layer is the user interface of the file system.

Freenet file system does not guarantee it will find your data,

but it will guarantee anonymity and free speech

Below, we take a tour of the various networking and Web-related FS and explore how they are changing the landscape. For it is here that

the picture gets murkier, and users along with applications and various OS start

demanding files from anywhere, anytime. And whichever of the below is in use, it

will have to ensure proper access to that data. When looking at the choices in

this area, we were amazed to note the sheer number of interesting FS available.

Therefore, we are deviating in style from how we looked at the previous segment

in what comes below.

See data not stars



Cluster computing and grid computing are the hot topics today. We have a

series running elsewhere in this magazine on what cluster computing is and what

it can do. In last month's part of that story, we talked about setting up

a cluster using OpenMosix. This environment has its own file system called the

Mosix File System (MFS) and its OpenMosix equivalent is 'oMFS'. Broadly

speaking, the MFS is similar to NFS, but provides consistency of cache,

timestamps and links.

And when you talk of cluster/grid computing, Oracle is in the running too, with its OCFS (Oracle Cluster FS). The OCFS creates an overlaid FS across all the

systems in a cluster. This is then used by their 10g application servers to

store their files (including configuration and logs) and data (including

databases) more efficiently. Similarly, for high-capacity storage environments

there is Red Hat's GFS (Global FS). Unlike other network file systems, the GFS

has no concept of servers and clients and works directly with the storage media.

As a product, it is a part of the FC 4 distribution and needs to be purchased as

an add-on to RHEL. The GFS works as a common FS for all storage devices in a

clustered environment. Among other features, it guarantees symmetry, modularity,

manageability and reliability.

IBM's AIX uses a different FS called the GPFS (General

Parallel FS). This file system can abstract disk-based FS from several systems

running AIX or Linux in a cluster and present them in one homogenous interface.

GPFS tries to overcome traditional bottlenecks in system-specific FS by letting

users access files across servers in parallel (hence its name). Redundancy is

provided by mirroring content across multiple disks on multiple servers or

devices. GPFS will read and write these blocks of data in parallel. Reliability

is achieved using multiplexed routing between the different systems.

Linked file systems



For long, work has been going on to virtualize the FS on individual servers

and clients onto a single network based layer. A common name for this type of FS

is the 'Distributed File System' or DFS. A variant is the 'Andrew File

System' or AFS. Both models are similar in the underlying effort to hide the

fact that files and folders shared on it are actually pooled from many different

systems on that network.  What

happens is that shares are created on a central file server using folders that

are actually on other systems on the network. So, if you have separate

authentication, file and Web servers and for a user you need to abstract all his

relevant shares into one folder, DFS/AFS is the way to do it.
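
The sketch below (server and share names are made up) shows the core DFS/AFS trick: one virtual tree whose branches resolve to shares on different machines:

# A sketch of the DFS/AFS idea: one virtual tree whose branches point at
# shares on different machines. Server and share names are hypothetical.
namespace = {
    "/users/ravi":   ("fileserver1", r"\\fileserver1\home\ravi"),
    "/projects/web": ("webserver",   r"\\webserver\sites"),
    "/mail/archive": ("mailserver",  r"\\mailserver\archive"),
}

def resolve(virtual_path):
    """Map a path in the unified tree to the server and share that really hold it."""
    for prefix, (server, share) in namespace.items():
        if virtual_path.startswith(prefix):
            remainder = virtual_path[len(prefix):]
            return server, share + remainder.replace("/", "\\")
    raise FileNotFoundError(virtual_path)

print(resolve("/users/ravi/todo.txt"))
# -> ('fileserver1', '\\\\fileserver1\\home\\ravi\\todo.txt')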

The DCE/DFS was used by IBM in 1996 for the

official Olympics website

We have carried two articles in the last year on OpenAFS-“Integrate

Storage with OpenAFS” (in May 2005) where we talked of implementing an OpenAFS

environment on your network and in June 2005 we had “Managing OpenAFS” where

we detailed how to manage the same deployment. There are different

implementations and improvements of the AFS model like Coda and Arla.

The DCE/DFS (Distributed Computing Environment /

Distributed File System) is a 16-year-old network FS for distributed environments. This framework uses RPC mechanisms with Kerberos-like authentication and authorization, and is linked to a DCE Directory Server for repository services. This is supposed to be the only fully POSIX-compliant file system out there. A variant (FreeDCE) has been available since 2000 under the BSD license.



There is an open source clustering file system for Linux and it is called

Lustre — which is an amalgamated form of 'Linux' and 'Cluster'.

This is a three-layer system consisting of a 'metadata server' (MDS),

systems hosting the Object Storage Target (OST) information and the

storage system called the Object-Based Disks (OBD). All three components

can be clustered themselves. This makes way for a very highly available

and high-performance network FS. Let's first see how it works.

Lustre is a heterogeneous environment with a failover system. Each parent calculates the best available child



The MDS is a server for only the metadata of the files the Lustre is

storing. Metadata is information about the information, and this makes

classifying, indexing and searching data easier. Systems would contact the

MDS only when the operation requires using or creating the metadata. For

example, to update the tags, create/move or delete files and folders. For

the rest of the operations, the client is directed to contact the OST or

the OBD systems. The OST stores information about the OBDs and acts something along the lines of a traditional distributed file system (DFS or AFS). The systems at this level would hold information about what OBDs are there and what they contain, much like a SAN's controller. Finally, for

the information itself, the client needs to contact the particular OBD,

which is like a normal file server or NAS device.
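
Here is a rough sketch (class and method names are our own, not the Lustre API) of that division of labour: one metadata call to the MDS, after which the client talks to the object stores directly:

# Hypothetical classes illustrating the Lustre-style split: metadata
# operations hit the MDS, bulk I/O goes straight to the object stores.
class MetadataServer:
    def __init__(self):
        self.files = {}                      # name -> list of (obd_id, object_id)

    def create(self, name, layout):
        self.files[name] = layout            # record where the objects will live

    def lookup(self, name):
        return self.files[name]

class ObjectStore:                           # stands in for an OST/OBD pair
    def __init__(self):
        self.objects = {}

    def write(self, object_id, data):
        self.objects[object_id] = data

    def read(self, object_id):
        return self.objects[object_id]

mds = MetadataServer()
obds = {0: ObjectStore(), 1: ObjectStore()}

# Client: one metadata call, then it talks to the object stores directly.
mds.create("results.dat", [(0, "obj-a"), (1, "obj-b")])
for (obd_id, obj_id), chunk in zip(mds.lookup("results.dat"),
                                   [b"first half", b"second half"]):
    obds[obd_id].write(obj_id, chunk)

data = b"".join(obds[o].read(i) for o, i in mds.lookup("results.dat"))
print(data)                                  # -> b'first halfsecond half'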



By design, OBDs are clustered and that information is tracked by the OSTs.

The OSTs are clustered and tracked by the MDS. The Lustre structure allows

MDS to be clustered too! And when this is done, you can publish that

information onto an LDAP directory. Clients must then query the LDAP store

for information on the various MDS, select one and then proceed from

there. All this is possible because Lustre uses a driver-model to

communicate with each component (network and hardware) so that it can work

oblivious of the complexities and specifics of the underlying system.



And the extensibility of the Lustre model doesn't end there either. The configuration is stored in XML files, meaning it is easily manageable using third-party tools. Also, you don't have to depend on improvements

flowing into Lustre source code to get better performance. Any improvement

in the protocol, network, and hardware level at the OST/OBD can also make

this happen. The FS can be used to integrate disparate network file

systems into one coherent cluster system, since it can mount a variety of

FS within itself like ext3, JFS, ReiserFS and XFS. It is also independent

of the network layer or topology and can work just as easily with TCP as

with Fibre Channel (support for FC is being worked on) in the same

deployment. Within the Lustre system itself, it uses a protocol called

'Portals' from Sandia.



For authentication purposes, Lustre does not have a built-in mechanism but can use existing systems like Kerberos 5 and PKI. These can be linked to Lustre using the 'Generic Security Service API' (GSSAPI). Authorization is

performed using POSIX style ACLs. All data is automatically encrypted when

stored and decrypted on authorized access.



It is planned that in future versions of Lustre, support for mount points

(directories that can be used to mount remote storage) will be added,

along with write-back caching and collaborative read-caching to improve

I/O performance. Snapshots are another feature on most network and cluster

FS, but absent currently in Lustre.



An Apple?



One platform often missed out when we talk of SAN and NAS devices is the

Apple Xserve. It utilizes an Apple-proprietary file system called Xsan that

features full 64-bit addressing, letting it use up to 2 petabytes of storage. The

Xserve works over Fibre Channel and serves high-end needs like video-editing

farms and computation clusters. The Xsan can be used on both SAN and NAS

devices. On either system, Xsan will use one machine to serve as a 'metadata

server', a role that it will shift to another system should the original fail.

Similarly Xsan can also function in high-availability environments using

multiplexed routes to different machines. This FS can also serve data when only

one of several systems in a SAN is online.

Split and store files



Although not yet officially released, work is on-going on a P2P architecture

file system. Its name is 'Freenet' and it is being designed to be a server-less, decentralized and 'censorship-resistant' system. It aims to use key-based

routing to share files anonymously which makes it suitable only for static file

content. This system aims to create a scalable version of what is called a

'darknet' where only trusted peers can establish mutual connections.

Previous attempts at darknets have included Nullsoft's WASTE and

Apple's iTunes.

Like most P2P environments, it is designed to be

self-organizing. Anonymity is guaranteed because the FS will replicate content

across multiple participating nodes, with no attached tracking information.

Files can also be broken up into multiple pieces and encrypted and each of these

pieces stored on different nodes. A similar (but networking-oriented) project is GNUnet from GNU.
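
The toy sketch below illustrates the splitting idea: pieces are 'encrypted' (with a stand-in XOR keystream; a real system would use a proper cipher) and parked on different nodes, while only the owner holds the manifest needed to reassemble them:

import hashlib, os

def keystream(key, length):
    """Toy keystream from repeated hashing; real systems use proper ciphers."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def scatter(data, nodes, piece_size=8):
    """Split data into pieces, 'encrypt' each, and park them on different nodes."""
    manifest = []                                    # what the owner keeps
    for i in range(0, len(data), piece_size):
        piece = data[i:i + piece_size]
        key = os.urandom(16)
        cipher = bytes(a ^ b for a, b in zip(piece, keystream(key, len(piece))))
        node = nodes[(i // piece_size) % len(nodes)]
        node.append(cipher)                          # node stores only an opaque blob
        manifest.append((node, len(node) - 1, key))
    return manifest

def gather(manifest):
    out = b""
    for node, idx, key in manifest:
        cipher = node[idx]
        out += bytes(a ^ b for a, b in zip(cipher, keystream(key, len(cipher))))
    return out

nodes = [[], [], []]                                 # three participating peers
m = scatter(b"a secret pamphlet", nodes)
print(gather(m))                                     # -> b'a secret pamphlet'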

Protocol-oriented systems



Although NFS, CIFS and SMB systems are associated with specific protocols,

one 'not so famous' but widely used network FS is the 9P. This DFS is the

native FS of the Plan 9 OS. Plan 9 is a UNIX-descendant from Bell Labs. Under

9P, almost everything is implemented as a file, including screen elements,

regular files and folders, networks, processes, etc.

A variant of the 9P is the Styx or Inferno file protocol. A server

implementation of 9P also exists for the Linux environment (v9fs). The v9fs has

been included in the Linux kernel 2.6.14 and is a SourceForge project.

File systems, really?



Modern usage of the Web for purposes other than to put up homepages has led

to improvements in the way Web hosted data is stored. This has been given a shot

in the arm because of the FUSE project. FUSE stands for 'Filesystem in

Userspace' and lets users mount custom FS abstractors/providers/drivers into

the OS at the user level itself. The FUSE Wiki catalogues about 55 such FS. Two

notable examples in the area of FUSE/Web are: Wiki-influenced FS and the

mythical 'Google File System'. What do you do if you want to mount, edit and

use Wikipedia articles or your blog entries as if they were on your hard disk?

Simple, you'd mount them using the virtual WikipediaFS or BlogFS respectively.

WikipediaFS supports standard Linux FS calls and lets users copy, move, rename,

change directories and use pipes and redirections too. And blogs are not behind

in the race either. BlogFS mounts Wordpress and metadata-rich blogs as FS with

read/write support.
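
To show how little it takes to put a custom FS into user space, here is a minimal read-only example written against the fusepy Python bindings (our choice of binding; the article refers to the FUSE project in general). It exposes a single in-memory file:

# A read-only toy FS over fusepy (pip install fusepy); exposes one in-memory file.
import errno, stat, sys
from fuse import FUSE, FuseOSError, Operations

CONTENT = b"Hello from user space!\n"

class HelloFS(Operations):
    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        if path == "/hello.txt":
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": len(CONTENT)}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello.txt"]

    def read(self, path, size, offset, fh):
        return CONTENT[offset:offset + size]

if __name__ == "__main__":
    # Usage: python hellofs.py /mnt/hello   (then `cat /mnt/hello/hello.txt`)
    FUSE(HelloFS(), sys.argv[1], foreground=True, ro=True)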

The only FS to support Execute in Place (XIP) is UDF, used on optical storage media. XIP programs run without using the RAM. Bootable CDs are written in an ISO9660 extension called the 'El Torito' format. El Torito lets the CD emulate a hard disk or floppy disk. WinFS will soon be the most metadata-rich FS. Previous such file systems have been ZFS, NTFS, JFS, NSS and NWFS.

The idea therefore seems to be to specialize and abstract.

There are so many choices of platforms today, with hardware, software and

implementation. The only way users and administrators can make sense of the data we store is to abstract it in an entirely new (and platform-independent) fashion. As applications themselves become more demanding, vendors are going back to the drawing board and creating their own abstractions.

Technology in motion

Mobile devices are everywhere and are used more and more in

enterprises. High-availability systems feature extensions that allow reports and

alerts to be sent to mobile devices so that their administrators are notified

instantly when there's something needing their attention. Also, end-users who

cannot be tied down to a desk (like agents in the field) are harnessing the

power of the mobile computer, whether a smart phone or a PDA, to stay in touch

with their offices as well as interface with regular intranet applications. All

these devices let users store files and other content (e-mail) locally. Unlike

desktops, a mobile device has far less CPU power and within that limit, it needs

to be instantaneous. Behind all that needs to be a powerful and yet compact file

system. There is actually one more challenge that mobile file systems face:

mobile memory is solid state, in the form of flash memory. And addressing it is quite a different task than it is on disk-based systems.

EEPROM chips use the compact and efficient romfs to map their content

This is where special purpose FS like JFFS, YAFFS, romfs

and Coda come into play. Both JFFS and YAFFS are journaling FS, meaning they are reliable. With the rise of

Bluetooth, specialized FS in that space (albeit at the user-mode level) have

come into the picture, notably the BtFS.

JFFS1

(Journaled Flash FS) was only meant for 'NOR'

chips. This FS was also slow because unlike on a disk where data can be

overwritten on locations already containing other data, flash memory must be

erased prior to writing to it. This meant the entire FS structure needed to be in memory at boot time. It also meant that every update required erasing the entire chip and rewriting all the information. In JFFS2, support was added for NAND chips as well. Here, there is another challenge, since NAND chips use sequential I/O. The advantages of JFFS2 were that it offered both compression and garbage collection, giving it better performance than the JFFS1 system. However, in-memory maps of the chip are still necessary and, as

flash memory rises well beyond the GB mark, this increases the demands on the

limited runtime memory (RAM) of the device itself.

To simplify matters and remove much of the complexity

involved with flash memory, the YAFFS (Yet Another Flash FS) was born. This

assumes that an erased flash memory is formatted. It then maintains a tree-list

of the file structure in the runtime memory. Garbage collection is enforced to

free up unused or erased blocks in the FS. Like JFFS, YAFFS has also undergone a

revision and the YAFFS2 is the latest. This revision adds a scheme to

sequentially number data blocks so that the newest block is most easily

available. Also, it allows the system to shrink files more efficiently using a

concept called 'shrink headers'.

OK, we said enterprise usage and mobiles. So, how about bringing in features like distributed computing, security, caching and replication, and trying to put them into a mobile device? There is an FS for just this, by the name of Coda. This unique FS is based on early editions of AFS and includes a paraphernalia of features that enterprise mobile computing needs. For one thing, you

can sync up your device to a network file server and continue working in a

disconnected mode. Coda lets the user replicate files across multiple servers.

However, there is one serious disadvantage at the moment in Coda and that is an

inherent limit of 10⁶ files per server. Coda has in fact spawned an FS on the

same principles called InterMezzo (Linux v2.4.15, removed in 2.6) that can

overlie an existing journaling file system.

Embedded Linux devices use JFFS2 for their file system. QNX,

a real-time OS, uses the Embedded Transactional FS (ETFS) or the QNX Flash FS for its devices. The ETFS treats file operations as atomic transactions and is used on NAND devices, while QNX FFS is a wear-optimized system for NOR devices. The

ETFS is optimized to provide both high-performance service as well as

recoverability from failure. An SDK is available from qnx.com for developers to

optimize their products to these file systems.

Windows CE on the other hand uses a flexible 32-bit file

system model, where it offers separate ROM, RAM and device file system types. It

also lets developers create their own file systems. The CE model is slightly

different in that it treats the storage media as an 'object store' and by

default compresses all the files. Additionally, Win CE devices can use .BIN files through a special file system called 'BinFS'. A program within the CE environment loads this image and presents it as a FAT file system to the user.

Special and fast from

Google




With Inputs from Arvind Jain, Head, R&D Center, Google India

Applications like Google Maps (above) derive their results from voluminous data, served up fresh by the Google File System

A globally accessible, almost instantaneously responsive application that works over the Web, and is powered by hundreds of servers spread across the globe, needs your FS to be perfect. One way to set a good example for similar application developers is to build your own FS. And that's what Google did. All their applications use the same basic pool of data that the Google servers harvest and store. Somehow, when you enter your query, they are instantly able to locate exactly what you want with a high degree of relevance and sometimes even show you on a map where it is. Unlike a run-of-the-mill enterprise application, the storage is both geographically spread as well as

large in capacity. On top of that, new nodes and data are added regularly.



File servers, storage boxes, backup devices, anti-virus, spam and firewall appliances and even search engines come in appliance form. All of these store files in one form or another, but not all of them expose their FS to the user. These boxes usually run a customized version of Linux and hence use one of its FS. File/storage and backup servers are the prime candidates for custom FS. One of the better known FS for a file server appliance is the WAFL (Write Anywhere File Layout) from NetApp. This FS writes data blocks to any location on the storage media. It inherently supports extensible metadata for the file, since even the metadata can be broken up and written at different places. The ONTAP server from NetApp uses WAFL.



In this space, each vendor has their own custom FS on their device. A simple Google search for embedded file systems throws up a huge 15 lakh results (although a lot of them are not unique). This makes interoperability an issue, even though devices of a certain type (like cellphones, PDAs or specific appliances) may be standardized on a particular platform/OS, and with it on a file system.

The Google File System (GFS) is built from scratch with no

compatibility with any FS. With its own API and methods of storing, indexing and

retrieving information, it's a proprietary system never to be used by anyone

except Google. GFS does not support common UNIX or Linux FS commands. And to

read/write data on GFS, applications talk to the GFS Client. And yet, this FS is

only a layer on top of a Linux file system! This way, the complexities of

actually managing the on-media data are done away with. Some specializations in

the GFS are: hardware fault-tolerance and recoverability, optimization for files

larger than 100 MB, queued I/O to handle simultaneous read and write operations

on a file from different sources. GFS is tuned to Google applications and those

applications are in turn tuned to GFS.

Architecture



GFS has a 3-tier model. You have the master server that talks to several 'chunkservers'.

Finally you have the systems running various Google applications. GFS's master

server stores metadata information for the entire FS. This entire data is

maintained in the master's RAM. If the master is reset, the data is rebuilt by

polling the chunkservers. Files are broken up into fixed-size chunks of 64 MB each and these are stored on the chunkservers to streamline I/O performance. Data is replicated across many

chunkservers, along with 32-bit checksum values for each chunk. GFS can recover

corrupted data from good copies from elsewhere in a short time frame. Versioning

is managed by electing one chunkserver as the primary and pipelining writes

through it, with globally unique chunk-level sequential numbering to track their versions. Loss of data is common, especially because of hardware failure. Google defines three levels of loss: undefined, inconsistent and corrupt. Undefined is when the state is not known. Chunks that do not match up across chunkservers are inconsistent. When data is permanently lost, it is classified corrupt. When a block of data is designated inconsistent or corrupt, attempts are made at the earliest to refresh it from known good copies (those whose block checksums are correct) on other chunkservers. If all chunks for that data are

lost, then the system simply returns a 'data not found' and never returns

corrupt data.
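
The toy model below (our sketch, not Google's code) captures that flow: the master holds only metadata, chunks carry checksums and live on several chunkservers, and a read falls back to another replica when a checksum fails:

import hashlib, random

CHUNK = 64 * 1024 * 1024            # GFS-style fixed chunk size (64 MB)
REPLICAS = 3

chunkservers = [dict() for _ in range(5)]   # server -> {chunk handle: (data, checksum)}
master = {}                                  # file name -> list of (handle, [server ids])

def store(name, data):
    master[name] = []
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        handle = "%s#%d" % (name, i // CHUNK)
        checksum = hashlib.sha1(piece).hexdigest()
        homes = random.sample(range(len(chunkservers)), REPLICAS)
        for s in homes:
            chunkservers[s][handle] = (piece, checksum)
        master[name].append((handle, homes))   # the master keeps only metadata

def read(name):
    out = b""
    for handle, homes in master[name]:
        for s in homes:                        # try replicas until one verifies
            piece, checksum = chunkservers[s][handle]
            if hashlib.sha1(piece).hexdigest() == checksum:
                out += piece
                break
        else:
            raise IOError("data not found: all replicas of %s are corrupt" % handle)
    return out

store("crawl.log", b"x" * 100)
first_home = master["crawl.log"][0][1][0]
chunkservers[first_home]["crawl.log#0"] = (b"garbage", "bad")   # corrupt one replica
print(len(read("crawl.log")))                  # -> 100, served from a good replica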

Feedback and effects of GFS



One of the feedback effects from the development of GFS by Google was

improvements in IDE disk drivers at the Linux end. Since GFS runs on top of

Linux systems, when Google experienced problems with various Linux components,

they usually tried to fix it themselves. These fixes then found their way back

into Linux itself.

Rinku Tyagi and Sujay V Sarma
