Swinging from filesystem to filesystem

Paul.Dekkers -at-sign-here- surfnet.nl

In an attempt to find the perfect filesystem for our mail data (I had the impression that it could run a bit faster than the current one), I found some surprising results.

I started with the impression that reiserFS would be more efficient than ext3 or the UFS2 filesystem our data was currently stored on, at least for storing the many small mail files in the format Cyrus uses (a bit like Maildir). There was once a long thread on the Cyrus mailing list, and reiserFS seemed to be the most popular choice.

But I knew ext3 had changed in the meantime: there was directory indexing now (like with UFS2 under FreeBSD 5?), and maybe if I tweaked things (like mounting with noatime) I could keep a safe filesystem and still get performance equal to reiserFS.

At least this determined the (limited number of) filesystem types I wanted to test.

I had an extra reason for comparing the filesystems: the Linux flavour we tend to prefer at the moment is RedHat. (That's probably because we had some problems with Debian before in combination with Dell hardware, the Perc3 controller to be more precise, and because we can do a little more with hardware monitoring under RedHat, Dell again.) But RedHat has no reiserFS by default, so I would at least have to build a custom kernel and the reiserfs-utils myself...

I must warn that my experiments are not really scientific, but at least I tried to keep the environment comparable for all tests: I rebooted the machine to clear caches after (almost) every action, repeated the tests, used the same kind of disks (and compared both disks) to write the data to, used the same machine... well, more about this in the General section, but read on :-)

Results

I tend to say that read performance is a little more important than write performance (for our mail data, and for making backups of it), but I think that is biased by the experience that in a copy action under UFS2 reading data was a lot slower than writing it.
And looking at the results I must say that write times are all reasonable, and that it is mostly reads that are influenced by the filesystem choice ;-)

Write tests

I started out by writing data to the filesystem. To make things as realistic as possible, I used some real data from our mail system: the archives of some internal mailing lists, which make a total of about 20 gigabytes (20292014080 bytes) and about 1 million files (1075962). A few directories contain a lot of (small) files, one of them a little more than 100000, so that should be sufficient to test a few different aspects of the filesystems.
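To put those two numbers in perspective, the average file size follows from a simple division. The small sketch below is only an illustration; the helper names avg_file_size and count_files are made up here and are not part of the original test setup.

```shell
# Integer average file size in bytes: total bytes divided by file count.
avg_file_size() {
    total_bytes=$1
    file_count=$2
    echo $(( total_bytes / file_count ))
}

# Count the regular files below a directory (hypothetical helper for
# checking a data set of your own).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Figures quoted in the text: 20292014080 bytes in 1075962 files.
avg_file_size 20292014080 1075962   # prints 18859, i.e. roughly 18 KiB per file
```

So the test data consists mostly of files that are small compared to the total volume, which is exactly the case where directory handling and metadata matter.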

I copied the data to the disk from a tarball, since reading from a single tar file seemed to me the fastest way to read the data. Proof of this, and the maximum IO throughput of the disks, is in the General section below.

Writing to reiserFS seemed noticeably faster than doing the same writes (after a reboot ;-)) to a normal ext3 filesystem. But when enabling the dir_index option on the filesystem (mkfs -O dir_index, or with tune2fs), ext3 appears to be about as fast as reiserFS in writes.
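For reference, this is roughly how dir_index can be switched on and off with the e2fsprogs tools. It is a sketch, not a record of the exact commands used for these tests: the device name /dev/hdX1 is a placeholder, and the run wrapper (an invention for this example) only prints the commands unless DRY_RUN=0 is set, since they need root and mkfs destroys data.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Create a fresh ext3 filesystem with directory indexing enabled.
make_ext3_dir_index() {
    run mkfs.ext3 -O dir_index "$1"
}

# Enable dir_index on an existing, unmounted ext3 filesystem, then let
# e2fsck rebuild (index) the directories that already exist.
enable_dir_index() {
    run tune2fs -O dir_index "$1"
    run e2fsck -fD "$1"
}

# Turn the feature off again, e.g. to compare read times without it.
disable_dir_index() {
    run tune2fs -O ^dir_index "$1"
}

make_ext3_dir_index /dev/hdX1   # /dev/hdX1 is a placeholder device
```

Whether the feature is active can be checked in the "Filesystem features" line of tune2fs -l for the device.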

I would be happy with all of the options tried here, but based on this test I'd say that reiserFS and ext3 with dir_index enabled are both best for Linux.

As far as FreeBSD is concerned: write times with UFS2 are somewhere between those of reiserFS and ext3. However, UFS2 is not a journalling filesystem, so if you really want to rely on it, you might want to use the sync option at mount time. (Based on experience I can say that it saves some trouble with fsck!)

Optimizing UFS2 for the number of files seemed to improve things only slightly.

Note that I couldn't separate reads from writes under FreeBSD: neither iostat nor vmstat offered me this. So the FreeBSD graphs show the total IO throughput. Taken at face value, the "ufs2 with avgfiles 512 and sync" test would have transferred much more data to the disk than all other tests, but apparently writing with these options involves more reads as well.

Combining the results of reiserFS, (standard) EXT3 and UFS2 shows that there are quite some differences: reiserFS is the fastest here.
The reason for not including the dir_index option for EXT3 in this graph is explained in the read section...

Read performance

Next was a read test of the data on the reiserFS partition. Reading the data seemed reasonably fast.

But then I tried ext3, with dir_index first, since - I thought - this would be the fastest option for ext3 with my kind of files.

I couldn't have been more wrong: this took about 2 hours longer. At first I thought this had to do with metadata writes, since I saw more writes on the disk I was reading from than with reiserFS, but disabling atime updates didn't really change much.
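Disabling access-time updates is a mount option on Linux. A minimal sketch, assuming the test filesystem is mounted on /mnt/test (a made-up mount point); the run wrapper is an invention for this example and only prints the command unless DRY_RUN=0 is set, because remounting needs root.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Remount an already mounted filesystem without atime updates, so that
# plain reads no longer cause inode writes.
remount_noatime() {
    run mount -o remount,noatime "$1"
}

remount_noatime /mnt/test   # /mnt/test is a placeholder mount point
```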

I was about to jump to conclusions about ext3 and how badly it performed in comparison with reiserFS, but then I thought "let's see how much worse things are without dir_index enabled"...

All of a sudden things were comparable, no - even faster than reiserFS...

UFS2 is quite tunable, and it seemed possible to gain at least 10% by using the sync option (strange?) and even 30% by not using softupdates. (Unfortunately it appeared that writes are much slower without them...)
The default UFS2 settings weren't too bad though. (The tests were run with DIR_INDEX enabled in the kernel.)
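On FreeBSD the two knobs mentioned here are a mount option and a tunefs flag. The sketch below shows the general form only; the device /dev/adXs1d and the mount point /mnt/test are placeholders, the run wrapper is an invention for this example that prints the commands unless DRY_RUN=0 is set, and it makes no claim about the exact invocations used for the graphs.

```shell
# Print commands by default; set DRY_RUN=0 to really execute them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Switch soft updates on or off with tunefs; $1 is "enable" or "disable".
# The filesystem should be unmounted (or mounted read-only) for this.
set_softupdates() {
    run tunefs -n "$1" "$2"
}

# Mount a UFS2 filesystem with synchronous writes.
mount_ufs_sync() {
    run mount -o sync "$1" "$2"
}

set_softupdates disable /dev/adXs1d        # placeholder device
mount_ufs_sync /dev/adXs1d /mnt/test       # placeholder mount point
```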

Although reading from UFS2 wasn't too bad, it was slower than the other two tested filesystems (with steps of about 30% in between).

While I was at it...

(The ext3 tests were done with ext3 + dir_index only.)
I graphed the disk IO during a filesystem check (interesting after crashes, or for the periodic consistency check, which reiserFS doesn't have, I believe):

The default fsck of a (clean) FreeBSD UFS2 filesystem was considerably faster: it took only 3 minutes. I know it takes longer if something is wrong.

I also graphed the IO caused by removing all of the files. This was clearly faster with reiserFS; there were more writes with ext3. I didn't do this with UFS2.

Conclusions

It's strange to say, but somehow dir_index didn't work for me. This does sound like a bug, or else I don't understand what this option should do. But that leaves two options: I can either have a filesystem with slightly faster writes and slightly slower reads, but at least consistent performance (reiserFS), OR choose a filesystem that writes a little slower and reads a little faster (ext3 without dir_index).

Since the differences are marginal, and I tend to say that reiserFS is a bit more reliable than ext3 (that's just a feeling, but after the differing ext3 results I would say there is some reason for it), I would go for reiserFS.

The difficulty is that reiserFS isn't available in the stock RH kernels, but I could of course just compile one myself.

UFS2 isn't too bad, but if you want it to be safe in a crash, you would go for the sync option, and that makes it slower again. I do know that crashes are very well recoverable this way though.
Fortunately there are plans for journalling in the successor of UFS2, UFS3. It will probably take some time before that is stable enough for production use though.

General

Performing the read-tests

Testing the read performance was done by piping the tar output to "wc -c". At first I copied the data to a spare disk instead, but as the graph shows this causes a small delay: it's a difference of one minute, but well...

For some reason writing the data with tar directly to /dev/null didn't work. The process exited far too early, and looking at the transfer rates it didn't seem plausible that all data had really been read.
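A minimal sketch of such a read test is shown below. The function name read_test and the demo directory are made up for illustration. As far as I know, the early exit with /dev/null is expected behaviour with GNU tar: it recognises /dev/null as the archive and then skips reading the file contents, which is why piping through "wc -c" is the safer way to force a full read.

```shell
# Stream a directory tree through tar and count the bytes, so that every
# file really has to be read from disk.
read_test() {
    tar -cf - -C "$1" . | wc -c
}

# Small self-contained demo on a temporary directory.
demo=$(mktemp -d)
printf 'hello\n' > "$demo/msg1"
read_test "$demo"
rm -rf "$demo"

# For a real measurement in bash (the path is a placeholder):
#   time read_test /mnt/test
```

The byte count that is printed is the size of the tar stream, which includes headers and padding, so it is somewhat larger than the sum of the file sizes.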

Write tests

One of the important factors in a write test is how fast you can read what you want to write. Of course this is where sources like /dev/zero come in, but since I wanted to test with real data, I tested whether reading from a tarball on a different (SATA) disk would be good enough.
I decided it was :-)
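The write test then boils down to unpacking the tarball onto the filesystem under test. The sketch below is an assumption about the general shape, not the exact command used: write_test is a made-up name, and the trailing sync (so that buffered data is flushed before the timer stops) is my addition.

```shell
# Unpack a tarball into a target directory and flush buffered writes, so
# that the elapsed time covers the data actually reaching the disk.
write_test() {
    tarball=$1
    target=$2
    tar -xf "$tarball" -C "$target" && sync
}

# For a real measurement in bash (both paths are placeholders):
#   time write_test /data/mail-archive.tar /mnt/test
```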

Also, the maximum IO throughput of the test disks was important. I ran tests on both disks, also to check whether they were similar. They are.
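The text doesn't say how the raw throughput was measured. One common way, shown here purely as an assumption, is to read a fixed amount of data sequentially with dd and time it; raw_read is a made-up name, and in a real run the first argument would be the disk device.

```shell
# Read COUNT blocks of 1 MiB sequentially from a device or file and print
# the number of bytes read. Time the call to derive the throughput.
raw_read() {
    dd if="$1" bs=1048576 count="$2" 2>/dev/null | wc -c | tr -d ' '
}

# Small self-contained demo on a 2 MiB temporary file.
sample=$(mktemp)
dd if=/dev/zero of="$sample" bs=1048576 count=2 2>/dev/null
raw_read "$sample" 2
rm -f "$sample"

# For a real measurement in bash (the device name is a placeholder):
#   time raw_read /dev/hdX 1024
```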

Software, Hardware

The tests were done on a recent 3 GHz Pentium 4 with 2 GB of memory and a SATA disk for the OS. The filesystems under test were on two 200 GB Maxtor PATA disks that happened to be available.

Although we intend to run RedHat (4), the tests were done with Fedora Core 3 (which is presumably not so bad, since RH 4 is based on FC 3 ;-)), with the precompiled filesystem modules under UP kernel 2.6.12-1.1372_FC3.

The system was up2date at the moment of testing, during week 29 of 2005.

The FreeBSD version used (on the same hardware) was 5.3-RELEASE.