[Flaim-users] Improved FLAIM I/O performance (up to 9 times better!!!!)
From: Andrew H. <AHo...@no...> - 2006-09-08 11:27:26
All, I was recently able to acquire some used Fibre Channel equipment, including a cool Brocade switch with a front-panel display that can be configured to report I/O throughput. Being excited to test the switch, Fibre card, and disk array, I decided to run a FLAIM bulk-load via the gigatest utility. I expected to see the I/O channel maxed out at 100 MBytes/s when forcing a checkpoint and flushing hundreds of megabytes of dirty cache. Unfortunately, the I/O channel was hardly being used ... in fact the highest throughput number reported by the switch during a checkpoint was 10 MBytes/s!

Given that this was my first experience setting up a Fibre SAN, I figured that I had done something wrong. So, to validate the SAN configuration, I conducted a simple experiment by copying an ISO disk image from the server's internal U320 SCSI drive to the SAN (configured with eight 18GB U160 drives). Wow, what a difference! The switch reported a sustained throughput rate of about 91 MBytes/s for the duration of the copy. This, in turn, told me two things: 1) the SAN was configured correctly, and 2) FLAIM wasn't keeping the I/O channel busy.

This was perplexing because FLAIM tries to be ultra-efficient in the way it interacts with the disk. This includes using async and direct I/O when available, ordering writes to minimize seeking, using sector-aligned buffers, and various other techniques. As it turns out, although FLAIM (and XFLAIM) use async I/O and multiple write buffers, the code wasn't taking advantage of the fact that out-of-order I/O completion can result in later writes completing before earlier writes (seems somewhat obvious, right?). Basically, there are a limited number of I/O buffers managed by FLAIM. As FLAIM flushes dirty cache to disk, it acquires a buffer from the buffer manager, holds onto it until the write completes, and then releases the buffer back to the manager.
When all of the write buffers are in use, FLAIM must wait for a pending I/O to complete before queuing an additional I/O operation. The problem is that at any one time during a checkpoint, we may have thousands of pending writes and it is unknown which one will complete first. FLAIM was taking the simplistic approach of waiting for the earliest queued write to complete; this, however, was rarely the first I/O to complete. The result was that there were many usable I/O buffers available, but FLAIM was unaware of this fact because it was blocked waiting for the earliest I/O to complete and had not acknowledged the completion of the other writes.

Platforms that support async I/O all provide a simple way of determining if an async I/O has completed via a call to a routine like GetOverlappedResult (on Windows). Windows and NetWare (and possibly Linux, Solaris, AIX, OS X, etc.) provide an additional callback-based mechanism to notify an application that I/O has completed. It's possible to use this callback-based notification to leverage out-of-order I/O completion to FLAIM's advantage. Instead of waiting for a specific I/O operation to complete so that its buffer can be re-used, the callback can do all of the work needed to release the buffer back to the buffer manager and also alert (via a semaphore) any thread waiting for a buffer to become available. This results in efficient and timely buffer re-use and has a huge impact on I/O throughput (more details below).

I made some code changes in the Windows-specific FTK code to use an I/O completion callback and also added a "buffer waiter" queue, fired up a new bulk load, and waited for the first checkpoint. Being used to checkpoints that run for a long time (anywhere from 30-90 seconds) when flushing a large amount of dirty cache, I was somewhat disappointed to swing around to look at the Fibre switch and find that it was reporting throughput of only 10 MBytes/s. Argggh!
When I looked at the stats on the bulk load screen, I soon realized that the I/O stats I was seeing on the switch were attributed to RFL writes, NOT the checkpoint ... it had already completed and the foreground bulk load was continuing! At the start of the next checkpoint, I made sure that I was watching the switch from the start and saw that the throughput was 93 MBytes/s. This is an improvement of 9.3 times.

To further validate the changes, I started an unreasonably large bulk load and let it run overnight. This morning, to my surprise, the bulk load (via gigatest) had completed, resulting in a database with 2,000,000,000 (yes, TWO BILLION) objects and a total load time of just under 6 hours (that's 5.5 million objects a minute). In terms of the number of objects, this is the largest FLAIM database ever created (although I would be thrilled if someone contradicted this claim).

Of course, I proudly announced these results to my wife at breakfast in a feeble attempt to justify my recent eBay expenditures made in acquiring all of this equipment. I'm not sure if she was impressed.

In any case, these code changes have been checked in to the open-source SVN repository and will be included in the soon-to-be-released 4.9 version of FLAIM and the 5.1 version of XFLAIM. Currently, Windows will be the primary beneficiary of these improvements. I am still investigating similar improvements on the other supported platforms and will send out any new information as it becomes available. Thanks.