The BeOS file system, an OS geek retrospective

By Andrew Hudson

Like most UNIX-derived file systems, BFS uses a node structure to keep track of disk allocations. The bfs_inode structure is defined below. The i-node handles basic file metadata including file creation time, owner, size of file, group ownership, where the file data is stored on the disk, etc. BFS does not update file size until a file is closed. In testing it was found that this gave a substantial performance gain.

 typedef struct bfs_inode { int32 magic1; inode_addr inode_num; int32 uid; int32 gid; int32 mode; int32 flags; bigtime_t create_time; bigtime_t last_modified_time; inode_addr parent; inode_addr attributes; uint32 type; int32 inode_size; binode_etc *etc; data_stream data; int32 pad[4]; int32 small_data[1]; } bfs_inode; 

A magic number is used again in bfs_inode for sanity checking and error recovery. The magic numbers for BeOS and Haiku are the same for compatibility, but different for the SkyOS implementation. BFS uses the file’s disk sectors as the i-node value, making sector mappings a straight lookup. UID, GID, and mode are used to maintain POSIX compliance.

Data_streams

The i-node structure holds the basic attributes of a file but not the actual file data itself. This is done through the data member structure. The data member is defined by the data_stream struct:

 typedef struct data_stream { block_run direct[NUM_DIRECT_BLOCKS]; off_t max_direct_range; block_run indirect; off_t max_indirect_range; block_run double_indirect; off_t max_double_indirect_range; off_t size; } data_stream; 

The data_stream struct maps data from the physical disk to the file stream API. Access using data_streams is optimized for throughput, bypassing the system cache, and using DMA into and out of user memory. In benchmark tests, BFS is able to achieve high throughput that approaches peak disk transfer rates.

Database functions using extended attributes

As mentioned earlier, handling attributes is an important file system function. The Mac HFS was the first file system to extensively use file attributes in support of GUIs. Consider that a windowed OS needs to persist and manage many GUI attributes such as frame size, location, coloring, text, etc., and needs to optimize access for a quick response time.

BFS supports extended attributes in the form of key/value associations with files. Keys have a fixed type and can be added at any time. Valid types are string, time, double, float, int, boolean, raw, and image. If a key is indexed, basic searching on a key is greatly optimized. The following are command line tools for managing extended file attributes.

  • addattr key value filename: adds the key/value pair to the specified file
  • catattr key filename: displays the specified fieldname value.
  • listattr filename: lists a file’s associated attributes, their types, and their sizes, in bytes.
  • rmattr key filename: removes an attribute from the specified file.

New fields are created globally with mkindex. For instance,

  • mkindex --type=indextype index: creates a new index, of type long, on the current volume.
  • reindex sourcefile filename: adds a file’s key to an index if the index is created after the file’s attributes.
  • rmindex index: removes an attribute from the current volume, making it unavailable for use.
  • lsindex: lists all attributes

In Haiku, access to file attributes is supported graphically through the Tracker, as well as with keyboard shortcuts. Any object in Haiku that has a graphical representation has the _trk/pinfo_le attribute set with its file icon position. The BEOS:TYPE attribute holds the application association for a file. More attribute usage detail can be found here.

Additional information and examples of using file attributes are provided here and here.

Searching with Query

Query is the command line tool used to search for matching attributes. It is easier to use than the "find" UNIX command line utility.

Here are some examples of the query syntax:

query "((name="**")&&(BEOS:TYPE=="audio/x-wav"))" - finds all files with MIME type of "wav". Useful if you have wav files that are missing the .wav extension.

query "(last_modified >= %yesterday%)" - finds files older than yesterday

The output from query can be used with scripting tools from the command line as follows:

touch 'query ((last_modified< %yesterday%)&&(BEOS:TYPE=="audio/mpeg"))'

This will update the last modified time on all MP3 files with a last modified time older than yesterday.

Additional information on scripting with query is provided here.

Its Haiku GUI counterpart is "find," found in the Deskbar and documented here. All find queries are cached for 7 days and appear in the drop-down list.

The BeBox was a dual-processor PC made to run BeOS (this one is running with an aftermarket monitor).
Enlarge / The BeBox was a dual-processor PC made to run BeOS (this one is running with an aftermarket monitor).

Live queries

An especially nice feature of BFS is the Live Query feature. When using Find, drag a query name from the select/pick list and drag it to the desktop or to Tracker. This hooks the query into the file system and saves it. Any time a file matching the query criteria is created or deleted, the query list is updated. Live queries are supported natively in BFS and use surprisingly little resources.

Applications use attributes

The Haiku Mail Kit is an example of an application that makes extensive use of attributes. Haiku mail does not have its own database for storing and managing email files. Instead, it stores each email directly into the BFS file system and uses more than 15 email-specific attributes for searching and sorting. Can you imagine having 8 exabytes worth of email? Haiku makes this theoretically possible.

This is a great example of abstracting functionality out of individual applications and locating it in the operating system, or in this case the file system. Because BFS supports extended attributes, any application can use the powerful database capability of the file system without having to reinvent the wheel.

Optimizing attribute lookups

Since every file on BFS could potentially have many extended attributes, and there could potentially be hundreds of thousands of files, there is a great need to optimize the access of attributes. In BFS, each index is implemented as a B+tree data structure. The B+tree is a balanced, sorted tree data structure that provides very fast lookup and scales up very well. Not surprisingly it's also used to manage directory structures and it is widely used in other file systems, such as NTFS. BFS indices are very optimized for lookups of the form:

query “(name==”temp”)”  or query “(name >”111”)”

BFS indices are not optimized for regular expression searches of the form:

query “(name==[aA][bB][cC])”

These searches degrade from a binary lookup to a sequential lookup and could potentially take much longer in a large file system.

Queries that use a regular expression but start with a fixed expression are optimized in Haiku, e.g., query “(name==temp[1234])” will run faster than query “(name==[tT]emp[1234])”.


Page 2

Computers with disk storage can experience many kinds of failure modes. The magnetic material of a disk sector can go bad, the servo mechanism that moves the disk head can fail, the electronics that interface with the computer can die, or the user can reboot the machine in the middle of a disk write operation. If you run a disk long enough, eventually the electronics or mechanics of a disk will fail. You generally don’t have to wait very long for an impatient user to reboot during a file write operation. This is unfortunately a fairly common occurrence and it can have devastating results on a file system.

Consider the following situation. A user has created a file and it is the process of saving it to disk. Maybe it’s a developer working on an overdue project. The file system must make several updates before this file is completely saved. It must first save the contents of the file. It must save the metadata for the file, creation date, owner, file size, etc. It must also update the superblock. Consider what happens when a system powers down before these operations complete. In addition to losing your valuable work, the data structures can become corrupted and point to i-nodes and blocks that don’t exist. In addition to being even more late, your file system may subsequently fail to boot, losing all of your other files.

Journaling, or file system logging, alleviates some of these problems. While journaling doesn’t guarantee that a premature reboot won’t lose your file, it does guarantee that it won’t corrupt your file system.

Consider the following example of how journaling works. Let’s say a user creates a new file and saves it. The data must be saved to disk, and a new directory entry with correct metadata must be saved as well. Before these disk operations occur, BFS locks the unwritten blocks into RAM, and makes log entries in the file system journal for each block about to be written. After each block is successfully flushed to disk, the journal entries are marked as completed. If the system shuts down before the blocks are successfully flushed, the journal entries will list the incomplete operations. When the system restarts, it inspects the journal for entries that are not completed. The remaining entries can be ‘replayed’ when the system reboots to successfully complete the write operation, or the whole thing can be "rolled back," and the operation aborted. Either way, the file system is left in a consistent state where the directory structure and metadata accurately match the file data.

In BFS, the journal logs any changes made to the directory, the bitmap block, i-nodes, and extended attributes. It does not journal the actual user data. In this way, journaling protects file system consistency but does not provide data recovery features like a redundant disk array, or RAID.

While not immune to all disk errors, BFS is resistant to the common failure mode of a premature system shutdown. File systems without journaling, such as Linux EXT2, are vulnerable to file system inconsistencies and rely on lengthy scanning processes for recovery. BFS does not need disk scanning, and Haiku can start up quickly after a premature shutdown.

Improvements of Haiku BFS over BeOS

Haiku’s version of BFS has a number of improvements over the original BeOS BFS implementation. The B+tree is more robust. Haiku BFS uses a file cache for file data in addition to a block cache. This resulted in a factor of 10 speed improvement. Haiku’s BFS implements status changed time for files, and also has more fine-grained file status capability. The POSIX atime file was omitted from BeOS BFS for performance's sake. Haiku BFS includes a query optimized for hybrid regex that allows mixing of a static string with a regular expression. New inspection tools bfsinfo, bfswhich, chkindex, and recover were added for Haiku BFS. The reindex command was added to improve indexing of extended attributes.

A quick interview with a Senior Developer at BeOS, who worked on BFS

(He asked that his name not be used, to comply with the wishes of his current employer)

The book Practical File System Design describes the BeOS development environment as being short on time and scarce on developers. What were some of the positive aspects of that environment?

It was just a fun place to work at the time.  We all got along really well, loved what we were doing and all fed off of each other's energy. Everyone was top-notch and it was just non-stop.  Usually you get one or two of those aspects in a company, but when you have them all it's a very intoxicating environment: change is rapid, progress is good, you feel like you're really doing something important and it's just a lot of fun.

From conception to release, how long did it take to code and debug BFS?

Ten or eleven months. I had help from one other engineer to write the code that handled writing to the indirect blocks in a file.

What tools did you use to develop and debug BFS?

Emacs, make, and gcc.  Initially I did some prototyping in user space and thus was able to use gdb but once it went into the kernel it was all "Welcome to Kernel Debugging Land" from there on out.

What was the biggest challenge in developing BFS?

Trying to get it done in such a short timeframe while supporting all the features we wanted.

Did BFS influence the design of any subsequent file systems?

Unlikely.

What are some hot topics of file systems today?

Data integrity and data de-duplication are probably the biggest areas of interest right now. People are also spending a lot of time trying to figure out how to provide reliable storage in the face of unreliable components. RAID-5/6 are OK, but as the size of drives go up, a lot of people are worried that when a failure happens, they're vulnerable to a total failure if another drive fails before they're done with reconstruction.

What new features or activities will the next generation of file systems support?

I'm not sure. I think it will be a little while yet before we sort out which layer is the right place for all the different parts of the problem of "storage." ZFS has gone down the path of putting everything in the FS. Other folks are putting in a different set of things. It's pretty clear that some of the functionality that existed in BFS does *not* belong in the FS.

In your opinion, what science fiction movie has the best use of a computer?

Hmmm, not sure. Star Trek? Their computers do everything and have a multitouch interface and they don't spend a lot of time futzing around with them.

A Quick Interview With Axel Dörfler, Developer of Open Source BFS

Before BFS, did you have previous experience working with file systems?

I had to write an application to recover some data from a BFS partitioned hard disk. That gave me all the knowledge I needed to write BFS.

What was the hardest part of rewriting BFS?

Making sure the B+tree implementation behaved correctly, as the one used in the original BFS (as mentioned in Dominic's book) was quite unstable, and inefficient.

What was the easiest part?

The actual BFS read-only implementation, as I could reuse most of the code I had written for recovery.

Did you find any surprises when rewriting BFS?

Yes, there are quite some illogical things like the log using block_runs, but requires them to have a length of one, or the almost superfluous double indirect block implementation.

Did you ever have any discussions with Dominic about BFS?

Yes, but they didn't really detail its flaws, rather my implementation.

What would you change in BFS 2.0?

A lot.

How long did the coding take?

I honestly don't remember. I think the read-only part took only two weeks or so, while the write part took a lot longer, and it took still longer to make it fully compliant to Be's BFS.

What was the biggest bug and how did you solve it?

The most annoying bug was the "vnode ID already exists" bug - there were like a dozen or so reasons why this one could happen, and that kind of minimized the joy of having found another issue with it (only a part of them were in BFS itself, though).

It's still there in a new incarnation as bug #5262, BTW, although this could have completely unrelated reasons (like memory corruption).

What lessons did you learn from rewriting BFS?

That it's much easier to have a working VFS when writing a file system.

Lastly, what is your current development PC?

Depending on where I am it's either a ThinkPad T60, or two year old Core 2 Quad with onboard Intel graphics.

Acknowledgements

This article relied heavily on the book Practical File System Design with the Be File System. It’s seldom that a file system is documented in such detail and in such a readable fashion. Also thanks to Axel Dörfler for fact checking.

References

About the Author

Andrew Hudson is a freelance technical project manager living in Florida, USA. In his abundant spare time he enjoys exploring caves and restoring antique cars.