Next Gen File Systems for Open Source

PCQ Bureau

02 Feb 2012 05:37 IST

Open source operating systems such as Linux treat everything as either a file or a directory. Even hardware devices are represented as files and kept in a directory. A file system is the organization of data and metadata on a storage device, and it is expected to store and transfer data quickly without corrupting it. The Linux file system interface is implemented as a layered architecture that separates the user interface layer from the file system implementation, and the implementation from the drivers that control the storage devices. Linux file systems are expected to handle your day-to-day tasks, and several newer file systems now aim to do this better. Some of the key next-gen file systems for open source are discussed below.

BTRFS

BTRFS, also known as the B-tree file system, is a popular next-gen file system for Linux, available under the GPL license. It is developed by Oracle in association with contributors from the Linux community. BTRFS provides a number of features that make it a very attractive file system for local storage. It is designed for large files and large file systems, and offers easy administration with integrated RAID and volume management. It also detects and repairs data and file system corruption, improves backup operations, makes searching for files easier, allows quick rollback of software and OS upgrades, and makes better use of storage capacity.

BTRFS is intended to address the lack of pooling, snapshots, checksums and integral multi-device spanning in Linux file systems as the use of Linux scales upward into the larger storage configurations common in the enterprise. It is structured as several layers of trees, all using the same B-tree implementation to store their various data types as generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object ID. The middle 8 bits are an item type field, whose use is hardwired into the code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are used in type-specific ways.
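
The key layout described above (64-bit object ID, 8-bit item type, 64-bit type-specific field) can be illustrated with a small Python sketch. This is not BTRFS code: the function names, the name "offset" for the last field, and the item type numbers in the example are assumptions made purely for illustration.

```python
from typing import Tuple

# Illustrative sketch only, not BTRFS source code.
# Field widths follow the description above: 64 + 8 + 64 = 136 bits.
OBJECTID_BITS = 64
TYPE_BITS = 8
OFFSET_BITS = 64  # the "type-specific" right-hand field (name is an assumption)


def pack_key(objectid: int, item_type: int, offset: int) -> int:
    """Combine the three fields into one 136-bit integer sort key."""
    if not 0 <= objectid < (1 << OBJECTID_BITS):
        raise ValueError("objectid must fit in 64 bits")
    if not 0 <= item_type < (1 << TYPE_BITS):
        raise ValueError("item_type must fit in 8 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset must fit in 64 bits")
    return (objectid << (TYPE_BITS + OFFSET_BITS)) | (item_type << OFFSET_BITS) | offset


def unpack_key(key: int) -> Tuple[int, int, int]:
    """Split a packed key back into (objectid, item_type, offset)."""
    offset = key & ((1 << OFFSET_BITS) - 1)
    item_type = (key >> OFFSET_BITS) & ((1 << TYPE_BITS) - 1)
    objectid = key >> (TYPE_BITS + OFFSET_BITS)
    return objectid, item_type, offset


# Hypothetical items: the object IDs and type numbers are arbitrary.
keys = [pack_key(257, 1, 0), pack_key(256, 84, 5), pack_key(256, 1, 0)]
ordered = [unpack_key(k) for k in sorted(keys)]
```

Because the object ID occupies the most significant bits, sorting on the packed key places all items belonging to one object next to each other, ordered by type and then by the type-specific field.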

ZFS

ZFS is the feature-rich file system developed by Sun for Solaris, its version of UNIX. ZFS offers quick and easy snapshots of data, data checksumming, and integration of several tools to manage disks and file systems. It is based on a copy-on-write design that writes a new copy of the data every time it changes, rather than overwriting it in place. Once the new version of the data is written, the old version is marked as deleted and its space can be reclaimed. To implement a snapshot, the file system simply does not mark the old data as deleted, so the earlier state is preserved.
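
The copy-on-write behaviour described above can be sketched as a toy in-memory store in Python. This is only an illustration of the idea, not how ZFS is implemented: the class `CowStore` and all of its method names are hypothetical.

```python
class CowStore:
    """Toy copy-on-write store: an update never overwrites a block in place."""

    def __init__(self):
        self.blocks = {}      # block id -> data
        self.live = {}        # file name -> block id of the current version
        self.snapshots = {}   # snapshot label -> frozen copy of the live map
        self._next_id = 0

    def write(self, name, data):
        # Always write the new version to a fresh block.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        old = self.live.get(name)
        self.live[name] = block_id
        if old is not None:
            self._reclaim(old)

    def _reclaim(self, block_id):
        # The old block is freed only if no snapshot still refers to it.
        if not any(block_id in snap.values() for snap in self.snapshots.values()):
            del self.blocks[block_id]

    def snapshot(self, label):
        # A snapshot is just a saved copy of the name -> block map.
        self.snapshots[label] = dict(self.live)

    def read(self, name, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[name]]


store = CowStore()
store.write("report", b"v1")
store.snapshot("monday")
store.write("report", b"v2")   # v1 stays, because "monday" references it
store.write("report", b"v3")   # v2 is reclaimed, nothing references it
```

In this sketch taking a snapshot copies only the small name-to-block map, never the data, which is why snapshots on a copy-on-write design are cheap.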

All data written to a ZFS file system is checksummed to ensure its validity. Hard drives corrupting data has always been an issue, but with the exponential growth in storage requirements, data corruption has become a common phenomenon. To help mitigate the risk of silent data corruption, ZFS stores a checksum of all the data it stores and validates the data again before passing it on to the operating system. If one copy of the data has been corrupted, this is identified on read and the data is seamlessly restored from another copy.
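
A minimal Python sketch of this verify-on-read idea follows. It assumes two mirrored copies of a block and uses SHA-256 from the standard library as a stand-in checksum; the class `MirroredBlock` is hypothetical and does not reflect the actual ZFS on-disk format or checksum choice.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is used here only as a convenient stand-in checksum.
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """Toy block stored as several copies plus one expected checksum."""

    def __init__(self, data: bytes, copies: int = 2):
        self.expected = checksum(data)
        self.copies = [bytearray(data) for _ in range(copies)]

    def read(self) -> bytes:
        # Return the first copy that matches the stored checksum,
        # repairing any damaged copies from it on the way out.
        for copy in self.copies:
            if checksum(bytes(copy)) == self.expected:
                good = bytes(copy)
                for i, other in enumerate(self.copies):
                    if checksum(bytes(other)) != self.expected:
                        self.copies[i] = bytearray(good)
                return good
        raise IOError("all copies failed checksum verification")


block = MirroredBlock(b"payroll data")
block.copies[0][0] ^= 0xFF      # simulate silent corruption of one copy
data = block.read()             # verified against the checksum, then healed
```

If every copy fails verification, the sketch raises an error instead of returning bad data, which is the point of checksumming: corruption is reported rather than silently passed on.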

NILFS-2

NILFS-2 is the second iteration of a log-structured file system developed by Nippon Telegraph and Telephone. The first version of NILFS appeared in 2005 but lacked any form of garbage collection. In mid-2007, version 2 was first released, which included a garbage collector and the ability to create and maintain multiple snapshots. The NILFS-2 file system has since entered the mainline kernel and can be enabled simply by installing its loadable module.

An interesting aspect of NILFS-2 is its technique of continuous snapshotting. As NILFS is log structured, new data is written to the head of the log while old data still exists. Because the old data is there, you can step back in time to inspect epochs of the file system. These epochs are called checkpoints in NILFS-2 and are an integral part of the file system. NILFS-2 creates these checkpoints as changes are made. It is one of many file systems that incorporate snapshot behaviour; others include ZFS and LFS.
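
The relationship between an append-only log and checkpoints can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (whole files as log records, one checkpoint per write, no garbage collection); the class `LogFS` and its methods are hypothetical and are not the NILFS-2 interface.

```python
class LogFS:
    """Toy log-structured store: every write is appended, nothing is overwritten."""

    def __init__(self):
        self.log = []          # append-only list of (name, data) records
        self.checkpoints = []  # log length recorded after each change

    def write(self, name, data):
        self.log.append((name, data))            # new data goes to the head of the log
        self.checkpoints.append(len(self.log))   # record a checkpoint for this change
        return len(self.checkpoints) - 1         # checkpoint number

    def view(self, checkpoint=None):
        # Replay the log up to the chosen checkpoint (or to the head).
        end = len(self.log) if checkpoint is None else self.checkpoints[checkpoint]
        state = {}
        for name, data in self.log[:end]:
            state[name] = data
        return state


fs = LogFS()
c0 = fs.write("notes.txt", "draft")
c1 = fs.write("notes.txt", "final")
c2 = fs.write("todo.txt", "ship it")
earlier = fs.view(c0)   # the file system as it looked after the first write
current = fs.view()
```

Stepping back in time is simply a matter of replaying less of the log, which is why old data must stay in place until a garbage collector decides it is no longer needed.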

CEPH

CEPH is a distributed network storage and file system created to provide excellent performance, reliability, and scalability. CEPH is based on a reliable and scalable distributed object store, with a distributed metadata management cluster layered on top to provide a distributed file system with POSIX semantics. CEPH is released under the terms of the LGPL, which means it is free software. CEPH aims to provide a variety of key features that are generally lacking from existing open-source file systems, including the ability to simply add disks to expand volumes, intelligent load balancing, and efficient, easy-to-use snapshot functionality.

CEPH is designed to scale seamlessly and gracefully from gigabytes to petabytes and beyond. Scalability is considered in terms of workload as well as total storage. CEPH is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file or write to the same directory.

EXOFS

EXOFS (Extended Object File System) is a traditional Linux file system built over an object storage system. EXOFS was initially developed by IBM and at that time was called the OSD file system, or OSDFS. Panasas, an object storage systems company, has since taken over the project and renamed it EXOFS based on its ext2 file system ancestry.

EXOFS is a file system that uses an OSD (object storage device) and exports the API of a normal Linux file system. Users can access EXOFS like any other local file system, and EXOFS will in turn issue commands to the local OSD initiator. OSD is a new T10 command set that views a storage device not as a flat array of sectors but as a container of objects, each with length, quota and time attributes. Each object is addressed by a 64-bit ID and is contained in a partition that is itself identified by a 64-bit ID.

Next3

Next3 was developed by CTERA Networks, which has started shipping it on its C200 network storage device. It is more than a simple add-on to ext3: it works by creating a special "magic" file to represent each snapshot of the file system. These files have the same apparent size as the storage volume as a whole, but they are thin (sparse) files, so they take almost no space at the beginning.

When a change is made to a block on disk, the file system must first check to see whether that block has been saved in the most recent snapshot already. If not, the affected block is moved over to the snapshot file, and a new block is allocated to replace it. Thus, over time, disk blocks migrate to the snapshot file as they are rewritten with new contents. Deleting a snapshot requires moving changed blocks into the previous snapshot, if it exists, because the deleted snapshot holds blocks which are logically part of the earlier snapshots.
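
The block migration and snapshot deletion rules described above can be modelled with a small Python sketch. It is a simplified illustration, not Next3 code: the class `SnapshotVolume` and its methods are hypothetical, and each snapshot file is represented as a dictionary from block number to preserved contents.

```python
class SnapshotVolume:
    """Toy model of snapshot files that collect old block contents."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # the live volume
        self.snapshots = []         # oldest first; each maps block number -> old contents

    def take_snapshot(self):
        self.snapshots.append({})   # a new snapshot starts out empty (sparse)
        return len(self.snapshots) - 1

    def write(self, n, data):
        if self.snapshots:
            latest = self.snapshots[-1]
            if n not in latest:
                # First change since the latest snapshot: preserve the old contents.
                latest[n] = self.blocks[n]
        self.blocks[n] = data

    def read_snapshot(self, idx, n):
        # A block untouched during this snapshot's lifetime may have been
        # preserved by a later snapshot, or may still be on the live volume.
        for snap in self.snapshots[idx:]:
            if n in snap:
                return snap[n]
        return self.blocks[n]

    def delete_snapshot(self, idx):
        removed = self.snapshots.pop(idx)
        if idx > 0:
            # Blocks held by the deleted snapshot still belong, logically,
            # to the previous snapshot unless it already has its own copy.
            previous = self.snapshots[idx - 1]
            for n, data in removed.items():
                previous.setdefault(n, data)


vol = SnapshotVolume(["a0", "b0"])
s0 = vol.take_snapshot()
vol.write(0, "a1")
s1 = vol.take_snapshot()
vol.write(0, "a2")
vol.write(1, "b1")
vol.delete_snapshot(s1)   # block 1's old contents move into snapshot s0
```

After the deletion, the first snapshot still sees the volume exactly as it was when the snapshot was taken, because the blocks it depended on were moved into it rather than discarded.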

REISER 4

REISER 4 uses B-trees in conjunction with the dancing tree balancing approach, in which underpopulated nodes are not merged until a flush to disk, except under memory pressure or when a transaction completes. Such a system also allows REISER 4 to create files and directories without wasting time and space on fixed blocks. Synthetic benchmarks performed by Namesys in 2004 showed REISER 4 to be 10 to 15 times faster than its most serious competitor, ext3, when working on files smaller than 1 KB. The same benchmarks suggest that it delivers twice the performance of ext3 for general-purpose file system usage patterns.

As of 2012, REISER 4 hasn't been merged into the core Linux kernel and is still not supported on many Linux distributions; however, its predecessor REISER FS v3 has been widely adopted. REISER 4 is also available from Andrew Morton's mm kernel sources, and from the Zen patch set.
