Steve Best

An Advanced File System for Linux

Demanded by enterprises and beneficial to everyone

As Linux moved further into the enterprise, one key feature it lacked was a journaling file system. That was true in 1999; today, four journaling file systems can meet enterprise server requirements. This article focuses on one of them: JFS.

The file system is one of the most important parts of an operating system. It stores and manages user data on disk drives and ensures that what's read from storage is identical to what was originally written. In addition to storing user data in files, the file system also creates and manages information about files and about itself. Besides guaranteeing the integrity of all that data, file systems are also expected to be extremely reliable and have excellent performance.

Before the year 2000, Ext2 was the de facto file system for most Linux machines; it was robust, reliable, and suitable for most deployments. However, as Linux displaced Unix and other operating systems in more and more large server and computing environments, Ext2 was pushed to its limits. In fact, many now-common requirements - large hard-disk volumes, quick recovery from crashes, high-performance I/O, and the need to store millions of files representing terabytes of data - exceed the capabilities of Ext2.

Fortunately, a number of other Linux file systems pick up where Ext2 leaves off. Indeed, Linux now offers four alternatives to Ext2: Ext3, JFS, ReiserFS, and XFS. In addition to meeting some or all of the previously mentioned requirements, each of these alternative file systems also supports journaling, a feature certainly demanded by enterprises but beneficial to anyone running Linux. A journaling file system can simplify restarts, reduce fragmentation, and accelerate I/O. Better yet, journaling file systems make lengthy fsck runs after a crash largely a thing of the past.

To better appreciate the benefits of journaling file systems, it helps to know the vocabulary of file systems.

  • Logical block (or a file system's block size): The smallest unit of storage that can be allocated by the file system. A logical block is measured in bytes, and it may take several blocks to store a single file.
  • Logical volume: One or more physical disks or some subset of the physical disk space.
  • Block allocation: A method of allocating blocks in which the file system allocates one block at a time. With this method, a pointer to every block in a file is maintained and recorded. Ext2 uses block allocation.
  • Extent: A sequence of contiguous blocks, often a long one. Each extent is described by a triple consisting of file offset, starting block number, and length. File offset is the offset of the extent's first block from the beginning of the file; starting block number is the first block in the extent; and length is the number of blocks in the extent. Extents are allocated and tracked as a single unit, meaning that a single pointer tracks a group of blocks. For large files, extent allocation is a much more efficient technique than block allocation. Figure 1 shows how extents are used.
  • File system metadata: The file system's internal data structures - everything concerning a file except the actual data inside the file. Metadata includes date and time stamps, ownership information, file access permissions, other security information such as access control lists (if they exist), the file's size, and the storage location or locations on disk.
  • Inode: Stores all the information about a file except the data itself. You can think of an inode as a "bookkeeping" file for a file (indeed, an inode is a structure that consumes blocks, too). An inode contains file permissions, file types, and the number of links to the file. Every inode has a unique inode number that distinguishes it from every other inode.
An extent is described by its block offset in the file, the location of the first block in the extent, and the length of the extent. If file sample.txt requires 18 blocks, and the file system is able to allocate one extent of length 8, a second extent of length 5, and a third extent of length 5, the file system would look something like Figure 1. The first extent has offset 0 (block A in the file), location 10, and length 8. The second extent has offset 8 (block I), location 20, and length 5. The last extent has offset 13 (block N), location 35, and length 5.
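
The lookup that an extent triple makes possible can be sketched in a few lines of Python. This is only an illustration of the sample.txt example above, not JFS code; the names Extent, SAMPLE_EXTENTS, and logical_to_physical are made up for this sketch.

```python
from typing import NamedTuple, Optional


class Extent(NamedTuple):
    file_offset: int  # offset of the extent's first block within the file
    start_block: int  # location of the extent's first block on disk
    length: int       # number of contiguous blocks in the extent


# The three extents of sample.txt from the example above (18 blocks total).
SAMPLE_EXTENTS = [
    Extent(file_offset=0, start_block=10, length=8),
    Extent(file_offset=8, start_block=20, length=5),
    Extent(file_offset=13, start_block=35, length=5),
]


def logical_to_physical(extents, logical_block) -> Optional[int]:
    """Return the disk block that holds the given file block, or None."""
    for ext in extents:
        if ext.file_offset <= logical_block < ext.file_offset + ext.length:
            return ext.start_block + (logical_block - ext.file_offset)
    return None  # the block lies past the end of the file


print(logical_to_physical(SAMPLE_EXTENTS, 0))   # block A -> 10
print(logical_to_physical(SAMPLE_EXTENTS, 8))   # block I -> 20
print(logical_to_physical(SAMPLE_EXTENTS, 17))  # last block -> 39
```

Three triples describe all 18 blocks, whereas block allocation would need 18 separate pointers for the same file.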

How File Systems Go Bad

With these concepts in mind, here's what happens when a three-block file is modified and grows to be a five-block file:
  1. Two new blocks are allocated to hold the new data.
  2. The file's inode is updated to record the new size of the file.
  3. The actual data is written into the blocks.
As you can see, while writing data to a file appears to be a single atomic operation, the actual process involves a number of steps (even more steps than shown here if you consider all of the accounting required to remove the two blocks from the free list of blocks and other metadata changes).

If all the steps to write a file are completed correctly (and this happens most of the time), the file is saved successfully. However, if the process is interrupted at any point (perhaps by a power failure or other system failure), a non-journaling file system can end up in an inconsistent state. Corruption occurs because the logical operation of writing (or updating) a file is actually a sequence of I/O operations, and at any given moment only part of that sequence may have reached the media. A journaling file system uses transactions to keep track of metadata changes. Transactions are recorded in the log, and during log replay the file system is rolled back to the last commit point, which returns it to a consistent state.
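
The idea of replaying a log up to the last commit point can be illustrated with a toy model in Python. This is a heavily simplified sketch and not the JFS log format: the record layout (tuples of "begin", "update", and "commit"), the metadata keys, and the function name replay are all assumptions made for the example.

```python
def replay(journal, metadata):
    """Return the metadata with only committed transactions applied."""
    pending = {}                # transaction id -> updates not yet committed
    recovered = dict(metadata)  # start from what was found on disk
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "begin":
            pending[txid] = []
        elif kind == "update":
            key, value = record[2], record[3]
            pending[txid].append((key, value))
        elif kind == "commit":
            # Only now do the transaction's changes become visible.
            for key, value in pending.pop(txid):
                recovered[key] = value
    # Transactions still in `pending` never committed and are discarded.
    return recovered


# Hypothetical state after a crash: transaction 1 committed, transaction 2 did not.
on_disk = {"file_size_blocks": 3, "free_blocks": 100}
journal = [
    ("begin", 1),
    ("update", 1, "file_size_blocks", 5),
    ("update", 1, "free_blocks", 98),
    ("commit", 1),
    ("begin", 2),
    ("update", 2, "file_size_blocks", 7),  # crash happened before the commit
]

print(replay(journal, on_disk))  # {'file_size_blocks': 5, 'free_blocks': 98}
```

The growth from three blocks to five is kept because its transaction committed; the half-finished second update is dropped, so the metadata never describes a file size that was only partly recorded.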

Features of JFS

JFS for Linux is a file system based on IBM's JFS file system for OS/2 Warp Server for e-business. Released as open source in early 2000 with a GPL license and ported to Linux soon after, JFS is well suited for enterprise environments. JFS uses many advanced techniques to boost performance, provide for very large file systems, and, of course, journal changes to the file system. Some of the features of JFS include:
  • Extent-based addressing structures: JFS uses extent-based addressing structures, along with aggressive block allocation policies to produce compact, efficient, and scalable structures for mapping logical offsets within files to physical addresses on disk. This feature yields excellent performance.
  • Dynamic inode allocation: JFS dynamically allocates space for disk inodes as required, freeing the space when it is no longer required. This is a radical improvement over Ext2, which reserves a fixed amount of space for disk inodes at file system creation time. With dynamic inode allocation, users do not have to estimate the maximum number of files and directories that a file system will contain. Additionally, this feature decouples disk inodes from fixed disk locations.
  • Directory organization: Two different directory organizations are provided: one is used for small directories and the other for large directories. The contents of a small directory (up to eight entries) are stored within the directory's inode. This eliminates the need for separate directory block I/O and the need to allocate separate storage. The contents of larger directories are organized in a B+ tree keyed on name. B+ trees provide faster directory lookup, insertion, and deletion capabilities when compared to traditional unsorted directory organizations.
  • Online resizing: Allows the file system to grow while it is mounted. This feature is used with a volume manager.
  • Online snapshot: Enables backing up an active file system. It provides an online backup mechanism by creating a point-in-time image of the file system, so the system does not have to be taken offline to obtain a consistent backup. This feature is used with a volume manager.
  • No integrity mount option: Allows the file system to not journal file system metadata changes. This feature can be used by a restore program to decrease the restore time.
  • 64-bit: JFS is a full 64-bit file system. All of the appropriate file system structure fields are 64 bits in size. This allows JFS to support large files and volumes.
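
The two directory organizations described in the list above can be modeled roughly as follows. This is a sketch only: a sorted list searched with bisect stands in for the B+ tree, the class and method names are invented for illustration, and names are assumed to be unique within a directory. Only the limit of eight inline entries comes from the text.

```python
import bisect

INLINE_LIMIT = 8  # small directories hold up to eight entries in the inode


class Directory:
    def __init__(self):
        self.inline = []    # small form: (name, inode) pairs kept "in the inode"
        self.names = None   # large form: names kept sorted (B+ tree stand-in)
        self.inodes = None  # inode numbers, parallel to self.names

    def is_inline(self):
        return self.names is None

    def add(self, name, inode):
        if self.is_inline() and len(self.inline) < INLINE_LIMIT:
            self.inline.append((name, inode))
            return
        if self.is_inline():
            # The ninth entry arrives: migrate to the name-keyed organization.
            entries = sorted(self.inline)
            self.names = [n for n, _ in entries]
            self.inodes = [i for _, i in entries]
            self.inline = []
        pos = bisect.bisect_left(self.names, name)
        self.names.insert(pos, name)
        self.inodes.insert(pos, inode)

    def lookup(self, name):
        if self.is_inline():
            for entry_name, inode in self.inline:  # short linear scan
                if entry_name == name:
                    return inode
            return None
        pos = bisect.bisect_left(self.names, name)  # binary search on sorted keys
        if pos < len(self.names) and self.names[pos] == name:
            return self.inodes[pos]
        return None


d = Directory()
for n in range(8):
    d.add(f"file{n}", 100 + n)
print(d.is_inline())  # True
d.add("file8", 108)
print(d.is_inline())  # False
print(d.lookup("file3"), d.lookup("file8"), d.lookup("missing"))  # 103 108 None
```

A real B+ tree also keeps insertion and deletion cheap on disk, which a flat sorted list does not; the sketch only shows the switch between the two forms and the keyed lookup.
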
JFS has other advanced features, such as allocation groups, which speed file access by maximizing locality. Two additional features are extended attributes and Access Control Lists. Understanding Access Control Lists requires a short look at Linux's file permissions, since Access Control Lists give a user finer control over file permissions.

If you've spent even a little time with a Linux system, you're probably quite familiar with Linux's file permission scheme. In a nutshell, you may read, write, or execute a file (or in the case of a directory, search the directory) only if you have the proper permission. Furthermore, the traditional Linux read, write, and execute permissions are distinct, and each of those rights can be granted separately to the owner (a user) of the file, to the group that owns the file, and to other, which represents users other than the owner and users in the named group. Linux commands like chmod, chown, and chgrp affect the permissions and change the owners of files.

In general, Linux's simple permission scheme works well and is especially effective when access rights align with the users and groups on the system. But if you want to grant access rights to lists of users that do not belong to an existing group, the scheme fails miserably. For example, if you want to share one of your personal files, phones.txt, with every member of your group, say, staff, you can grant that access with two commands: chgrp staff phones.txt and chmod g+r phones.txt. However, if you want to give read access to friends.txt to Debbie and Bo, and read access to colleagues.txt to Bo and Abby, you'd have to create two different groups with Bo in each one. (Or, perhaps it's more accurate to say that your system administrator would have to create the groups.)
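
For readers who prefer to see the permission bits directly, here is a small Python sketch of what chmod g+r does: it reads the current mode and adds the group-read bit. The helper name grant_group_read and the temporary file are assumptions for the example; the result shown assumes a POSIX system. Changing the file's group itself is left out because it depends on a staff group existing on the machine (shutil.chown(path, group="staff") would be the standard-library call for that).

```python
import os
import stat
import tempfile


def grant_group_read(path):
    """Rough equivalent of `chmod g+r path`: add group read, keep other bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IRGRP)
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "phones.txt")
    with open(path, "w") as f:
        f.write("555-0100\n")
    os.chmod(path, 0o600)             # start as owner read/write only
    new_mode = grant_group_read(path)
    print(oct(new_mode))              # 0o640 on a POSIX system
```

Only three sets of bits exist (owner, group, other), which is why sharing with an arbitrary list of users requires a new group for each list.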

More Flexibility with Fine-Grained Control

As you can see, managing permissions through "special interest groups" is terribly inconvenient, and worse, it doesn't scale. A more flexible scheme is Access Control Lists, or ACLs. Instead of capturing permissions in just a few flags, ACLs record permissions in an individual and extensible list of access rights that are attached to each file or directory. Access control rights can be assigned to a specific user, a specific group, or to multiple users or groups in any combination. In a sense, ACLs are like the "Will Call" list at the hottest restaurant in town: if you're not on the access control list, you don't get in.

Reusing the example above, if you want to give access to friends.txt to Debbie and Bo, you simply grant read access to both users. No (administrative) group is needed. Need to grant access to a third user? Simply give that user the appropriate access rights. In a sense, ACLs enhance security because ACLs can implement an access policy directly, even if the policy is different for every file on the system.

ACLs can also be used to build advanced system applications like Samba, which, like the Windows systems it serves, depends on ACLs. (For more information on how Samba uses ACLs, see the sidebar "ACL Support in Samba.") Let's see how Extended Attributes work and how they can be used.

File Access Control Lists and Extended Attributes (EAs) are currently supported by the Ext2, Ext3, JFS, ReiserFS, and XFS file systems. You've already seen what an ACL is for; EAs are simply the underlying mechanism used to record ACLs.

An EA consists of a name/value pair, and associates arbitrary pieces of file metadata, or data about data, with a file or directory. EAs are not a part of the file's data. Instead, EAs are maintained separately and automatically managed by the file system.

More than one EA can be attached to a specific file or directory, and an EA can store system objects (such as access control lists or the capabilities of an executable) and user objects (such as the MIME type or character set of a file). Applications can define and associate extended attributes with a file object (remember, a directory is just a special file) through file system function calls.

Extended attributes can be used to store almost anything. You can maintain a file's history; categorize the contents of the file (such as text, icons, bitmaps); record the version of the file; append additional data; or do all of the above. For example, Figure 2 shows five extended attributes (Version, File Type, Additional data, Install, and History) of fileA.
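
On Linux, Python exposes extended attributes through os.setxattr and os.getxattr, so the name/value idea can be tried directly. The sketch below is an illustration under assumptions: the helper name set_user_attrs and the attribute names are made up, the "user." prefix is the Linux namespace for user-defined attributes, and the code returns None when the platform or the underlying file system does not support them.

```python
import os
import tempfile


def set_user_attrs(path, attrs):
    """Attach name/value pairs to a file as 'user.*' extended attributes.

    Returns the values read back from the file, or None if extended
    attributes are not available for this platform or file system.
    """
    if not hasattr(os, "setxattr"):  # these functions are Linux-only
        return None
    try:
        for name, value in attrs.items():
            os.setxattr(path, f"user.{name}", value.encode("utf-8"))
        return {
            name: os.getxattr(path, f"user.{name}").decode("utf-8")
            for name in attrs
        }
    except OSError:                  # e.g. the file system lacks EA support
        return None


with tempfile.NamedTemporaryFile() as tmp:
    wanted = {"version": "1.2", "history": "created by installer"}
    stored = set_user_attrs(tmp.name, wanted)
    if stored is None:
        print("extended attributes are not supported here")
    else:
        print(stored)
```

The attributes travel with the file but never appear in its contents, which matches the description of EAs as metadata maintained separately by the file system.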

With EAs in place, ACLs are relatively easy to implement. An Access Control Entry, or ACE, is an individual entry in an ACL. Each ACE is a triple defined by an entry type, either group or user; a group name, username, numeric UID, or numeric GID, depending on the value of the first field; and the access permission or right (read, write, execute) associated with the ACE. So, in the abstract, giving Debbie permission to read friends.txt means that the ACL attached to friends.txt contains an ACE (user, Debbie, read).
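
The ACE triple maps naturally onto a small data structure. The Python sketch below models the friends.txt and colleagues.txt example in the abstract; it is not the POSIX ACL evaluation algorithm (which also involves the file owner, a mask entry, and a defined evaluation order), and the names ACE and is_allowed are invented for illustration.

```python
from typing import NamedTuple


class ACE(NamedTuple):
    entry_type: str   # "user" or "group"
    name: str         # username or group name, depending on entry_type
    perms: frozenset  # subset of {"read", "write", "execute"}


def is_allowed(acl, user, groups, right):
    """Return True if any entry in the ACL grants `right` to the user."""
    for ace in acl:
        if right not in ace.perms:
            continue
        if ace.entry_type == "user" and ace.name == user:
            return True
        if ace.entry_type == "group" and ace.name in groups:
            return True
    return False  # not on the list: no access


READ = frozenset({"read"})
friends_acl = [ACE("user", "debbie", READ), ACE("user", "bo", READ)]
colleagues_acl = [ACE("user", "bo", READ), ACE("user", "abby", READ)]

print(is_allowed(friends_acl, "debbie", [], "read"))    # True
print(is_allowed(friends_acl, "abby", [], "read"))      # False
print(is_allowed(colleagues_acl, "abby", [], "read"))   # True
print(is_allowed(colleagues_acl, "bo", [], "write"))    # False
```

Each file carries its own list, so Bo appears in both without any administrative group being created.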

Currently, ACLs are the only Linux feature dependent on EAs. Other operating systems have had EAs for several years, and uses of EAs on those operating systems are broader.

ACL Support in Samba

To make Samba as portable as possible, the designers of Samba decided against a custom implementation of ACLs. Instead, each Samba server converts NT ACL specifications (sent via MS-RPC) into a POSIX ACL, and then converts that neutral ACL into an ACL that's platform-specific. A conceptual illustration of Samba's ACL subsystem is shown below.

If the Samba server's underlying file system supports ACLs, and the POSIX ACL can be converted to a native ACL, Windows users can manipulate server-side ACLs on the Samba server using the common Windows NT commands.

Samba 2.2 included support for ACLs, but up until now, Samba has had no way to store ACLs directly on the file system since there was no ACL support available for Linux. That's no longer an issue, and Samba will preserve NTFS ACLs rather than mapping ACL permissions to the less-flexible, standard Unix permissions. (Windows NT and Windows 2000 use ACLs to set permissions on files and directories. That scheme offers a much finer-grained control over permissions than the traditional "one user, one group" solution that most Unix systems use.)

Native ACL support, in combination with winbind, allows a Linux-based system to "assimilate" Windows NT users, groups, and ACL permissions. Quite an impressive solution!

Resources

  • Extended Attributes and Access Control Lists: http://acl.bestbits.at
  • JFS for Linux: http://oss.software.ibm.com/jfs
  • ReiserFS: http://www.namesys.com
  • XFS: http://oss.sgi.com/projects/xfs
  • Samba: http://us1.samba.org/samba/samba.html

Steve Best is a Senior Software Engineer in the Linux Technology Center of IBM in Austin, Texas. He is currently working on the Journaled File System (JFS) for Linux project. Steve has done extensive work in operating system development, with a focus in the areas of file systems, internationalization, and security. He can be reached at sbest@us.ibm.com.

    Most Recent Comments
    Randy Kramer 03/21/04 08:05:42 AM EST

    Well, at least whitespace showed up properly, sorry about some of the typos (due ==> do).

    Randy Kramer 03/21/04 08:02:54 AM EST

    Nice article!

    One question it didn't answer (unless I missed something) is the difference between filesystems that journal metadata only vs. those that journal data as well, and the related question, what do you lose with a filesystem that journals metadata only?

    I don't know the answer, but I do know that some journalling filesystems do metadata only and some do data as well. I'm assuming that it's not as bad as it sounds, surely a filesystem that is described as journalling but only journals metadata has safeguards to avoid losing data -- maybe it has to due with the "atomicitity" (sp?) of the steps?

    This WikiLearn page describes what I think I know (knew once?) about which journalling filesystems do which (meta or meta plus data):

    http://twiki.org/cgi-bin/view/Wikilearn/LinuxFilesystems

    At the time, only ext3 did both, IIUC, and here's the WikiLearn page with my understanding of ext3:

    http://twiki.org/cgi-bin/view/Wikilearn/Ext3

    I understand I'll get an email notifying me of any response to this comment? That would be appreciated!

    I also hope whitespace (blank lines) show up properly, too bad there's not a preview button.

    BTW: feel free to improve the WikiLearn pages in any way.

    regards,
    Randy Kramer