Fifty years from now, when you save a document it will no
longer be a single file. You will not even know exactly where that document has
been saved. Automatically, the system will create a read-only copy of it and
save it to yet another location. You will be able to group your documents,
e-mail and other content into categories and say who has what kind of access to
them. You will also be able to instruct data to either self-destruct under certain
conditions or remain alive somewhere for all time. Your PC will be a workstation
connected to a vast network of both peers and servers all over the world and it
will seamlessly exchange information over well-defined and architected P2P
protocols. Whenever you need your document, you will be able to see all the
different versions of it and open one of them as you need.
Sounds like fantasy? Think again. File systems (FS) that exist
today already do most, if not all, of this, although largely in discrete
islands. Where we are headed seems to be an era of interconnected,
seamless systems that provide a global computing experience. Now for the first
important question: how will all this be accomplished? It will happen because
of the radical changes happening to the way our computers store and handle data.
The way that is about to happen is what our story is about. Let's dive in.
What is a file system?
Let's start with an analogy. In your office, you put away paper files into
filing cabinets and maintain separate records of where each file is. A file
system in a computer works the same way. If the hard disk is the equivalent of your
filing cabinet, then the FS is your index of what files lie within the filing
cabinet.
FS differ in how, and at what level of detail, they maintain this
information. Some limit themselves to basic descriptions of what files are
stored. Others add capabilities like access control, encryption and compression,
and let you create links to refer to files and folders. The reason for this
differentiation is the platform or need that the file system serves. And the way
an FS works largely depends on the OS it runs under.
As far as FS go, there isn't a perfect solution for all needs yet and the
evolution continues. Every major OS release is accompanied by a minor or major
improvement or overhaul of the FS it depends on. NTFS for instance has been
revised five times in all so far and each of them has added a feature or
enhancement. In comparison, the “ext” FS for Linux has seen three technical
evolutions.
Tied to an OS?
The earliest operating systems singled out one FS (more if the FS
was backward-compatible with another) that they could use. That
trend faded as one OS evolved out of another and needed to use the older one's
applications. This brought in the era where you could use multiple FS, sometimes
quite different from one another, under a single OS. The way we compute and deal with data
also changed. We're now in an era where we save documents to a server
somewhere on the Web that does not even belong to us. This actually brings in
new challenges. The average-Joe user does not care where his files are or
what's happening behind the scenes. All he wants is to be able to store
his files and get them back when he wants them. This is why concepts like
distributed FS came in.
New requirements
The original requirements of needing to keep track of what files are stored
on a disk still hold for today's FS. But there are some new ones too. Let's
take them one at a time below.
Scalability:
A 630 MB hard disk can no longer fulfill your needs. Today, it is common
to find even a 120 or 200 GB hard disk filled up in a couple of months. If this
is a common global pattern, then instead of each disk maintaining a separate discrete
FS, a common external FS would be more efficient. This FS would be scalable, with
no limit on how many disks can be added to it. Already, this is
an urgent need. Specialized storage systems like SAN and DAS systems already use
such FS to virtualize and manage lots of hot-swappable disks that may be removed
and replaced with a blank one at any time.
Virtualization: The need is for a virtualizing FS that can present
(at least to the user and his applications) a standard set of features and
capabilities, regardless of what the system does behind the scenes. Storage
systems take care of this by running their own OS and abstracting a virtual FS
over the network.
Longevity and Destructibility: Both are now required by laws like the
Sarbanes-Oxley Act. You need data to be available for long periods of time and destroyed
permanently at the end of that time. Traditional FS may let you do the first one
easily. But when it comes to retrieving deleted data from a hard disk, there are
tons of recovery software. The best permanent-destruction solution today is to
overwrite the data with random patterns several times. But is there something
better? If the previous FS records can still be accessed, then it's the logic
of the FS that needs to improve, not the means to hide content.
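To make the multi-pass overwrite idea concrete, here is a minimal sketch in Python (all names are our own). Note that modern drives may remap sectors internally, so this illustrates the logic rather than guaranteeing physical destruction:

```python
import os
import secrets

def secure_overwrite(path: str, passes: int = 3) -> None:
    """Overwrite a file's contents with random bytes several times,
    then remove it. A toy sketch: real drives may remap sectors, so
    this does not guarantee the old bits are physically gone."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))
            f.flush()
            os.fsync(f.fileno())  # push each pass out to the device
    os.remove(path)
```

A smarter FS, as the article argues, would track where data has ever lived and scrub all of it, rather than leaving the job to a tool like this.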
Content independence: Information in the enterprise can be in many
different forms and formats. The usual solution to retrieve and make use of them
is to use an enterprise application. But if the FS itself can help do this, that
would be ideal.
Performance: The performance of an FS is a function of the ease and
speed of locating a file (by the OS) and accessing its content. Some FS, like
FAT, are optimized for floppies and small hard disks. Others like Red Hat's
GFS and Sun's ZFS are for high capacity storage. So, we have different file
systems for different sized storage and performance would depend on what you
selected.
How are these factors driving changes in FS? Let's
take up different areas of FS usage and examine them in detail.
Despite their closed nature, FAT and NTFS together have the biggest base of third party tools for management (partitioning, recovery, etc) | HPFS386 (HPFS for servers) was originally designed to be what NTFS is today, with reduced fragmentation, mixed code-page, and more | The file system supporting the maximum single-volume size today is UFS2 (UNIX File System), at 1 yottabyte (10^24 bytes) |
Keep the users happy
ZFS will reach its files limit in 9,000 years if you create 1,000 files a second |
One of the challenges for the administrator at the desktop
and server level is to strike a balance between security and recoverability. In
this segment, when we talk of “server”, we refer to the basic file server,
which may be performing additional tasks like authentication. The enterprise
user needs to be given the right amount of security for his files. At the same
time, when data is lost due to some reason, it should be easily recoverable. A
third key element in deciding the right desktop file system is efficient usage
of limited capacity. Unlike at the server, desktops have limited storage (say 40
GB). This space has to be utilized as efficiently as possible.
Now in a typical enterprise layout, users would be storing
their files mostly in a central file server where the above aspects have been
taken care of. Then the only files that would need recovery at the desktop would
be downloaded e-mail, instant message logs and any files that have not yet been
stored on the file server. If your organization uses centralized messaging
storage (as is likely if you're using something like Lotus Domino/Notes or MS
Exchange), this aspect is transferred to the messaging server. Also, because of
standardization of software in the enterprise, the desktops would be all on the
same OS (some flavor of Linux or Windows) and hence on ext3 or NTFS.
But all this is set to change. Remember we said that capacity
is growing. What this also means is that the amount of data being stored is
increasing. A challenge therefore is to not only store TBs of data safely, but
also find it quickly when your user wants it. Traditional FS like FAT, NTFS or
ext3 are not geared for such activity. FAT is meant for low-volume storage and
has no security features. Ext3 and NTFS can address a lot of storage and provide
strong security but they are not inherently search-friendly. Some of the FS that
take care of these problems are: Reiser4, WinFS and ZFS. Strangely, a look at
the specialties of each of the three reveals no discernible pattern.
Limitless expansion:
While there are no known limits to the capacity WinFS or Reiser4 can address,
Reiser4 limits each file to a maximum of 8 TB. ZFS has a limit of 16 Exabytes.
For the uninitiated in the world of numbers, the entire human knowledge in any
known format at the end of the last millennium was only 12 Exabytes. ZFS will be
able to address data for a long time.
Journaling: Logging changes before they are made is called
'journaling'. If there is a power failure, or some error condition before
all required operations have been completed, the changes can be rolled back. A
journaling file system can do away with frequent CHKDSK/FSCK screens at boot up.
A performance limitation is that there are two write operations per actual
write. So, if you simply saved this file on a journaling FS, the engine would
first record that it is saving the file and then save the file itself.
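The two-writes-per-write idea can be sketched as a toy write-ahead journal. Everything below (file names, JSON format) is our own illustration, not any real FS's on-disk layout:

```python
import json, os

class JournaledStore:
    """Toy write-ahead journal: every update is recorded in a log file
    before the data file is touched, so a crash mid-write can be
    repaired later by replaying the leftover journal entries."""

    def __init__(self, data_path, journal_path):
        self.data_path, self.journal_path = data_path, journal_path

    def save(self, key, value):
        # First write: journal the intent
        with open(self.journal_path, "a") as j:
            j.write(json.dumps({"key": key, "value": value}) + "\n")
        # Second write: apply the change to the data file
        data = self._load()
        data[key] = value
        with open(self.data_path, "w") as d:
            json.dump(data, d)
        # Transaction complete: clear the journal
        open(self.journal_path, "w").close()

    def recover(self):
        """Replay any journal entries left behind by a crash."""
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path) as j:
            for line in j:
                entry = json.loads(line)
                data = self._load()
                data[entry["key"]] = entry["value"]
                with open(self.data_path, "w") as d:
                    json.dump(data, d)
        open(self.journal_path, "w").close()

    def _load(self):
        if not os.path.exists(self.data_path):
            return {}
        with open(self.data_path) as d:
            return json.load(d)
```

Running `recover()` at "boot" replaces the long CHKDSK/FSCK scan: only the journal needs to be examined, not the whole volume.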
The Reiser4 FS features “wandering logs”, where B* trees not populated to a
certain degree are not saved to disk unless the underlying transaction is
complete. Reiser4 is supposed to be much faster than ext3. NTFS also has
journaling, using B+ trees. ZFS uses out-of-place copies.
Finding data: WinFS is known for its search-optimization. As we said
in our earlier article (“WinFS Beta 1”, Oct 2005), it would let you run
almost SQL-like queries on your file system to locate what you want. In a
retrogressive move, WinFS is slated to flatten the FS, meaning no more folders.
Instead, it would categorize data by adding attributes to your files and search
based on those values.
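The attribute-based search WinFS proposes can be mimicked in a few lines. This hypothetical sketch (names ours) filters a catalogue of files by metadata values rather than by folder path, much like a 'SELECT ... WHERE' query:

```python
def query(files, **criteria):
    """Return the files whose metadata matches every given attribute,
    akin to: SELECT * FROM files WHERE author = ... AND kind = ..."""
    return [f for f in files
            if all(f.get(k) == v for k, v in criteria.items())]

# A toy catalogue: each 'file' is just its attribute set
catalog = [
    {"name": "budget.xls", "author": "asha", "kind": "spreadsheet"},
    {"name": "memo.doc",   "author": "ravi", "kind": "document"},
    {"name": "plan.doc",   "author": "asha", "kind": "document"},
]

results = query(catalog, author="asha", kind="document")  # matches only plan.doc
```

With no folder tree, the attributes *are* the organization; a flattened FS stands or falls on how well such queries perform.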
So, we're going to see the ability to address more
capacity, locate our data faster and store it reliably.
Time Line
A smaller world
Data is not just stored on individual systems but also on
distributed systems-like network storage and cluster computers. When
information is stored on such systems, it becomes irrelevant what platform or OS
the specific storage device is using, as long as the broad requirements of the
environment are met. It is then up to the particular device to virtualize an FS
and provide uniform services across each client. Therefore, you will find three
layers of FS when you talk of network and Web based storage mechanisms.
The lowest layer is what lies actually at the physical
(disk) level. The second layer is what administrators get to see. The end-user
sees the third layer, which usually takes the form of a network file system
such as NFS or SMB. The area that each of these layers fulfills is different. While
the lowest layer takes care of the performance, reliability and efficiency of
the storage it addresses, the middle layer adds manageability and scalability.
The final layer is the user interface of the file system.
Freenet file system does not guarantee it will find your data, but it will guarantee anonymity and free speech |
Below, we take a tour of the various networking and Web
related FS and explore how they are changing the landscape. For it is here that
the picture gets murkier, and users along with applications and various OS start
demanding files from anywhere, anytime. And whichever of the below is in use, it
will have to ensure proper access to that data. When looking at the choices in
this area, we were amazed to note the sheer number of interesting FS available.
Therefore, we are deviating in style from how we looked at the previous segment
in what comes below.
See data not stars
Cluster computing and grid computing are the hot topics today. We have a
series running elsewhere in this magazine on what cluster computing is and what
it can do. In last month's part of that story, we talked about setting up
a cluster using OpenMosix. This environment has its own file system called the
Mosix File System (MFS) and its OpenMosix equivalent is 'oMFS'. Broadly
speaking, the MFS is similar to NFS, but provides consistency of cache,
timestamps and links.
And when you talk of cluster/grid computing, Oracle is in
the running too with its OCFS. The OCFS creates an overlaid FS across all the
systems in a cluster. This is then used by their 10g application servers to
store their files (including configuration and logs) and data (including
databases) more efficiently. Similarly, for high-capacity storage environments
there is Red Hat's GFS (Global FS). Unlike other network file systems, the GFS
has no concept of servers and clients and works directly with the storage media.
As a product, it is a part of the FC 4 distribution and needs to be purchased as
an add-on to RHEL. The GFS works as a common FS for all storage devices in a
clustered environment. Among other features, it guarantees symmetry, modularity,
manageability and reliability.
IBM's AIX uses a different FS called the GPFS (General
Parallel FS). This file system can abstract disk-based FS from several systems
running AIX or Linux in a cluster and present them in one homogenous interface.
GPFS tries to overcome traditional bottlenecks in system-specific FS by letting
users access files across servers in parallel (hence its name). Redundancy is
provided by mirroring content across multiple disks on multiple servers or
devices. GPFS will read and write these blocks of data in parallel. Reliability
is achieved using multiplexed routing between the different systems.
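The parallel, striped access GPFS performs can be illustrated with a toy round-robin scheme. This is our own simplification (real GPFS block placement and recovery are far more sophisticated): block i of a file lives on server i modulo the server count, and a read fetches all blocks concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def striped_read(servers, block_count):
    """Fetch the blocks of one file from several servers in parallel.
    Each 'server' here is simply a dict of block_index -> bytes; block i
    lives on server i % len(servers) (round-robin striping)."""
    def fetch(i):
        return servers[i % len(servers)][i]
    # Issue all block fetches concurrently, one worker per server
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        blocks = list(pool.map(fetch, range(block_count)))
    return b"".join(blocks)
```

Because each server only delivers every Nth block, no single machine becomes the bottleneck for a large sequential read.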
Linked file systems
For long, work has been going on to virtualize the FS on individual servers
and clients onto a single network based layer. A common name for this type of FS
is the 'Distributed File System' or DFS. A variant is the 'Andrew File
System' or AFS. Both models are similar in the underlying effort to hide the
fact that files and folders shared on it are actually pooled from many different
systems on that network. What
happens is that shares are created on a central file server using folders that
are actually on other systems on the network. So, if you have separate
authentication, file and Web servers and for a user you need to abstract all his
relevant shares into one folder, DFS/AFS is the way to do it.
The DCE/DFS was used by IBM in 1996 for the official Olympics website |
We have carried two articles in the last year on OpenAFS-“Integrate
Storage with OpenAFS” (in May 2005) where we talked of implementing an OpenAFS
environment on your network and in June 2005 we had “Managing OpenAFS” where
we detailed how to manage the same deployment. There are different
implementations and improvements of the AFS model like Coda and Arla.
The DCE/DFS (Distributed Computing Environment /
Distributed File System) is a 16 year old network FS for distributed
environments. This framework uses RPC mechanisms with Kerberos-like
authentication and authorization mechanisms, and is linked to a DCE Directory
Server for repository services. This is supposed to be the only fully POSIX-compliant
file system out there. A variant (FreeDCE) has been available since 2000
under the BSD license.
An Apple?
One platform often missed out when we talk of SAN and NAS devices is the
Apple Xserve. It utilizes an Apple-proprietary file system called Xsan that
features full 64-bit addressing, letting it use up to 2 petabytes of storage. The
Xserve works over Fibre Channel and serves high-end needs like video-editing
farms and computation clusters. The Xsan can be used on both SAN and NAS
devices. On either system, Xsan will use one machine to serve as a 'metadata
server', a role that it will shift to another system should the original fail.
Similarly Xsan can also function in high-availability environments using
multiplexed routes to different machines. This FS can also serve data when only
one of several systems in a SAN is online.
Split and store files
Although not yet officially released, work is ongoing on a P2P-architecture
file system. Its name is 'Freenet' and it is being designed as a server-less,
de-centralized and 'censorship-resistant' system. It aims to use key-based
routing to share files anonymously which makes it suitable only for static file
content. This system aims to create a scalable version of what is called a
'darknet' where only trusted peers can establish mutual connections.
Previous attempts at darknets have included Nullsoft's WASTE and
Apple's iTunes.
Like most P2P environments, it is designed to be
self-organizing. Anonymity is guaranteed because the FS will replicate content
across multiple participating nodes, with no attached tracking information.
Files can also be broken up into multiple pieces and encrypted and each of these
pieces stored on different nodes. A
similar (but networking oriented project) is the GNUnet from GNU.
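The split-encrypt-scatter idea can be sketched as follows. The 'nodes' are plain dicts standing in for peers, and the hash-counter XOR keystream is a toy cipher for illustration only, not Freenet's actual encryption or routing:

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Hash-counter keystream -- a stand-in cipher, illustration only."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def scatter(data: bytes, key: bytes, nodes: list, piece_size: int = 4):
    """Split data into pieces, encrypt each, and store the pieces
    round-robin across 'nodes'. Returns where each piece went."""
    locations = []
    for n, start in enumerate(range(0, len(data), piece_size)):
        node_idx = n % len(nodes)
        nodes[node_idx][n] = xor(data[start:start + piece_size], key)
        locations.append((node_idx, n))
    return locations

def gather(key: bytes, nodes: list, locations) -> bytes:
    """Fetch the pieces back and decrypt (XOR is its own inverse)."""
    return b"".join(xor(nodes[i][n], key) for i, n in locations)
```

No single node holds the whole file or can read its piece, which is the essence of the anonymity argument above.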
Protocol-oriented systems
Although NFS, CIFS and SMB systems are associated with specific protocols,
one 'not so famous' but widely used network FS is the 9P. This DFS is the
native FS of the Plan 9 OS. Plan 9 is a UNIX-descendant from Bell Labs. Under
9P, almost everything is implemented as a file, including screen elements,
regular files and folders, networks, processes, etc.
A variant of 9P is the Styx or Inferno file protocol. An implementation
of 9P also exists for the Linux environment (v9fs); v9fs has
been included in the Linux kernel since 2.6.14 and is a SourceForge project.
File systems, really?
Modern usage of the Web for purposes other than to put up homepages has led
to improvements in the way Web hosted data is stored. This has been given a shot
in the arm because of the FUSE project. FUSE stands for 'Filesystem in
Userspace' and lets users mount custom FS abstractors/providers/drivers into
the OS at the user level itself. The FUSE Wiki catalogues about 55 such FS. Two
notable examples in the area of FUSE/Web are: Wiki-influenced FS and the
mythical 'Google File System'. What do you do if you want to mount, edit and
use Wikipedia articles or your blog entries as if they were on your hard disk?
Simple, you'd mount them using the virtual WikipediaFS or BlogFS respectively.
WikipediaFS supports standard Linux FS calls and lets users copy, move, rename,
change directories and use pipes and redirections too. And Blogs are not behind
in the race either. BlogFS mounts Wordpress and metadata-rich blogs as FS with
read/write support.
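The FUSE idea of answering file calls from an arbitrary backend can be sketched with a toy in-memory mount. A real WikipediaFS driver would register handlers like these with the kernel and speak HTTP to the site; here a dict stands in for the remote wiki:

```python
class WikiMount:
    """Toy user-space 'filesystem': file operations are translated into
    reads and writes against a backing store. The store is a dict
    standing in for a remote wiki; a real FUSE driver would forward
    these calls over the network instead."""

    def __init__(self, backend: dict):
        self.backend = backend  # article title -> article text

    def listdir(self):
        """Like 'ls' on the mount point: one 'file' per article."""
        return sorted(self.backend)

    def read(self, name: str) -> str:
        if name not in self.backend:
            raise FileNotFoundError(name)
        return self.backend[name]

    def write(self, name: str, text: str) -> None:
        """Saving a file becomes an edit to the backing article."""
        self.backend[name] = text
```

The point is that the application above the mount never knows (or cares) that its 'files' are really wiki pages or blog posts.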
The only FS to support Execute in Place (XIP) is UDF, used in optical storage media. XIP programs run without using the RAM | Bootable CDs are written in an ISO9660 extension called the 'El Torito' format. El Torito lets the CD emulate a hard disk or floppy disk | WinFS will soon be the most metadata-rich FS. Previous such file systems have been ZFS, NTFS, JFS, NSS and NWFS |
The idea therefore seems to be to specialize and abstract.
There are so many choices of platforms today, with hardware, software and
implementation. The only way users and administrators can make sense of the data
we store is to abstract it in an entirely new (and platform-independent)
fashion. As applications themselves become more demanding, vendors are going back
to the drawing board and creating their own abstractions.
Technology in motion
Mobile devices are everywhere and are used more and more in
enterprises. High-availability systems feature extensions that allow reports and
alerts to be sent to mobile devices so that their administrators are notified
instantly when there's something needing their attention. Also, end-users who
cannot be tied down to a desk (like agents in the field) are harnessing the
power of the mobile computer-whether a smart phone or a PDA-to stay in touch
with their offices as well as interface with regular intranet applications. All
these devices let users store files and other content (e-mail) locally. Unlike
desktops, a mobile device has far less CPU power and within that limit, it needs
to be instantaneous. Behind all that needs to be a powerful and yet compact file
system. There is actually one more challenge that mobile file systems face:
mobile memory is solid state, in the form of flash memory. And addressing this
is quite a different task from disk based systems.
EEPROM chips use the compact and efficient romfs to map their content |
This is where special purpose FS like JFFS, YAFFS, romfs
and Coda come into play. Both JFFS
and YAFFS are journaling FS, meaning they are reliable. With the rise of
Bluetooth, specialized FS in that space (albeit at the user-mode level) have
come into the picture, notably the BtFS.
JFFS1
(Journaled Flash FS) was only meant for 'NOR'
chips. This FS was also slow because, unlike on a disk where data can be
overwritten on locations already containing other data, flash memory must be
erased prior to writing on it. This meant the entire FS structure needed to be in
memory at boot time, and every update required erasing the
entire chip and rewriting all the information. In JFFS2, support was added for
NAND chips as well. Here there is another challenge, since NAND chips are
sequential-I/O devices. However, the advantages of JFFS2 were that it offered
compression and introduced garbage collection, giving it better performance than
the JFFS1 system. However, in-memory maps of the chip are still necessary and, as
flash memory capacities rise well beyond the GB mark, this increases the demands on the
limited runtime memory (RAM) of the device itself.
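The erase-before-write constraint and the garbage collection JFFS2 introduced can be modelled in miniature. Every detail below is our own simplification of the mechanism described above: updates go to fresh blocks, stale copies pile up, and a collector erases them in bulk when space runs out:

```python
class FlashBlock:
    """Flash-style block: it can only be reused after an erase of the
    whole block, so in-place overwrites are impossible."""
    def __init__(self):
        self.data = None   # None = erased/blank
        self.stale = False

class ToyFlashFS:
    def __init__(self, block_count):
        self.blocks = [FlashBlock() for _ in range(block_count)]
        self.index = {}  # filename -> block number (held in RAM, as in JFFS)

    def write(self, name, data):
        old = self.index.get(name)
        blk = self._find_blank()
        self.blocks[blk].data = data
        self.index[name] = blk
        if old is not None:
            self.blocks[old].stale = True  # old copy awaits garbage collection

    def read(self, name):
        return self.blocks[self.index[name]].data

    def _find_blank(self):
        for i, b in enumerate(self.blocks):
            if b.data is None:
                return i
        self._garbage_collect()  # out of blanks: reclaim stale blocks
        for i, b in enumerate(self.blocks):
            if b.data is None:
                return i
        raise RuntimeError("flash full")

    def _garbage_collect(self):
        # Erase stale blocks in bulk to free space for new writes
        for b in self.blocks:
            if b.stale:
                b.data, b.stale = None, False
```

The RAM-resident `index` is exactly the in-memory map the article complains about: it grows with the chip, which is what YAFFS set out to tame.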
To simplify matters and remove much of the complexity
involved with flash memory, the YAFFS (Yet Another Flash FS) was born. This
assumes that an erased flash memory is formatted. It then maintains a tree-list
of the file structure in the runtime memory. Garbage collection is enforced to
free up unused or erased blocks in the FS. Like JFFS, YAFFS has also undergone a
revision and the YAFFS2 is the latest. This revision adds a scheme to
sequentially number data blocks so that the newest block is most easily
available. Also, it allows the system to shrink files more efficiently using a
concept called 'shrink headers'.
Ok, we said enterprise usage and mobiles. So, how about
bringing in features like distributed computing, security, caching and replication,
and trying to put them into a mobile device? There is an FS for just this, by the name
of Coda. This unique FS is based on early editions of AFS and includes a
paraphernalia of features that suit enterprise mobile computing. For one thing, you
can sync up your device to a network file server and continue working in
disconnected mode. Coda lets the user replicate files across multiple servers.
However, there is one serious disadvantage at the moment in Coda, and that is an
inherent limit of 10^6 (one million) files per server. Coda has in fact spawned an FS on the
same principles called InterMezzo (added in Linux 2.4.15, removed in 2.6) that can
overlie an existing journaling file system.
Embedded Linux devices use JFFS2 for their file system. QNX,
a real-time OS uses the Embedded Transactional FS (ETFS) or the QNX Flash FS for
its devices. ETFS treats file operations as atomic transactions and is
used on NAND devices, while QNX FFS is a wear-optimized system for NOR devices. The
ETFS is optimized to provide both high-performance service as well as
recoverability from failure. An SDK is available from qnx.com for developers to
optimize their products to these file systems.
Windows CE on the other hand uses a flexible 32-bit file
system model, where it offers separate ROM, RAM and device file system types. It
also lets developers create their own file systems. The CE model is slightly
different in that it treats the storage media as an 'object store' and by
default compresses all the files. Additionally, Win CE devices can use .BIN
files — this is a special file system called the 'BinFS'. A program within
the CE environment loads this image and emulates it as a FAT file system for the
user.
Special and fast from Google
With Inputs from Arvind Jain, Head, R&D Center, Google India
Applications like Google Maps derive their content from voluminous data harvested and served up fresh by the Google File System |
A globally accessible, almost instantaneously responsive
application that works over the Web, and is powered by hundreds of servers
spanned across the globe, needs your FS to be perfect. One way to set a good
example for similar application developers is to build your own FS. And that's
what Google did. All their applications use the same basic pool of data that the
Google servers harvest and store. Somehow, when you enter your query, they are
instantly able to locate exactly what you want with a high degree of relevance
and sometimes even show you on a map where it is. Unlike a run-of-the-mill
enterprise application, the storage is both geographically spanned as well as
large in capacity. On top of that, new nodes and data are added regularly.
File servers, storage boxes, backup devices, anti-virus, spam and firewall appliances and even search engines come in appliance form. All of these store files in one form or another, though not all of them expose their FS to the user. These boxes usually run a customized version of Linux and hence use one of its FS. File/storage and backup servers become the prime candidates for using custom FS on. One of the better known FS for a file server appliance is the WAFL (Write Anywhere File Layout) from NetApp. This FS writes data blocks to any location on the storage media. It inherently supports extensible metadata for the file, since even metadata can be broken up and written at different places. The ONTAP server from NetApp uses the WAFL. In this space, each vendor has their own custom FS on their device. A simple Google for embedded file systems throws up a huge 15 lakh results (although a lot of them are not unique). This makes interoperability an issue, even where devices of a certain type (like cellphones or PDAs or specific appliances) may be standardized on a particular platform/OS and its file system. |
The Google File System (GFS) is built from scratch with no
compatibility with any FS. With its own API and methods of storing, indexing and
retrieving information, it's a proprietary system never to be used by anyone
except Google. GFS does not support common UNIX or Linux FS commands. And to
read/write data on GFS, applications talk to the GFS Client. And yet, this FS is
only a layer on top of a Linux file system! This way, the complexities of
actually managing the on-media data are done away with. Some specializations in
the GFS are: hardware fault-tolerance and recoverability, optimization for files
larger than 100 MB, queued I/O to handle simultaneous read and write operations
on a file from different sources. GFS is tuned to Google applications and those
applications are in turn tuned to GFS.
Architecture
GFS has a 3-tier model. You have the master server that talks to several 'chunkservers'.
Finally you have the systems running various Google applications. GFS's master
server stores metadata information for the entire FS. This entire data is
maintained in the master's RAM. If the master is reset, the data is rebuilt by
polling the chunkservers. Files are
broken up into fixed-size chunks of 64 MB each, and these are stored on the
chunkservers to streamline I/O performance. Data is replicated across many
chunkservers, along with 32-bit checksum values for each chunk. GFS can recover
corrupted data from good copies from elsewhere in a short time frame. Versioning
is managed by electing one chunkserver as the primary and pipelining writes
through it, with globally unique chunk-level sequential numbering to track their
versions. Loss of data is common, especially because of hardware failure. Google
defines three levels of loss: undefined, inconsistent and corrupt. Undefined is
when the state is not known. Chunks that do not match up across chunkservers are
inconsistent. When data is permanently lost, it is classified corrupt. When a
block of data is designated to be inconsistent or corrupt, attempts are made at
the earliest to refresh it from known good copies-as in the checksums for the
blocks are correct-from other chunkservers. If all chunks for that data are
lost, then the system simply returns a 'data not found' and never returns
corrupt data.
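The checksum-and-repair behaviour described above can be sketched as follows. The CRC choice, method names and repair policy here are our assumptions for illustration, not GFS internals:

```python
import zlib

class ChunkStore:
    """Toy chunkserver: holds each chunk alongside a 32-bit checksum,
    in the spirit of the GFS description above."""
    def __init__(self):
        self.chunks = {}  # chunk_id -> (data, crc)

    def put(self, chunk_id, data):
        self.chunks[chunk_id] = (data, zlib.crc32(data))

    def get(self, chunk_id):
        data, crc = self.chunks[chunk_id]
        if zlib.crc32(data) != crc:
            raise IOError("chunk corrupt")
        return data

def read_chunk(chunk_id, replicas):
    """Return a verified copy of the chunk; repair any corrupt replica
    from a good one, and never hand corrupt data to the caller."""
    good = None
    for r in replicas:
        try:
            good = r.get(chunk_id)
            break
        except IOError:
            continue
    if good is None:
        # All copies lost: report 'data not found' rather than corrupt data
        raise FileNotFoundError("data not found")
    for r in replicas:
        try:
            r.get(chunk_id)
        except IOError:
            r.put(chunk_id, good)  # refresh the bad replica
    return good
```

The key property mirrored here is the last sentence of the paragraph above: a caller either gets a checksum-verified chunk or an explicit 'data not found', never silently bad bytes.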
Feedback and effects of GFS
One of the feedback effects from the development of GFS by Google was
improvements in IDE disk drivers at the Linux end. Since GFS runs on top of
Linux systems, when Google experienced problems with various Linux components,
they usually tried to fix them themselves. These fixes then found their way back
into Linux itself.
Rinku Tyagi and Sujay V Sarma