Andy,
All I can say is: Woo Hoo!
That is so cool. Not only the test, and the solution, and the
performance and scalability improvement, but that you are a true
craftsman who cares about his work. Woo hoo!
--Dale
On Fri, 2006-09-08 at 11:27 -0600, Andrew Hodgkinson wrote:
> All,
>
> I was recently able to acquire some used Fibre Channel equipment,
> including a cool Brocade switch with a front-panel display that can be
> configured to report I/O throughput. Being excited to test the
> switch, Fibre card and disk array, I decided to run a FLAIM bulk-load
> via the gigatest utility. I expected to see the I/O channel maxed out
> at 100 MBytes/s when forcing a checkpoint and flushing hundreds of
> megabytes of dirty cache. Unfortunately, the I/O channel was hardly
> being used ... in fact the highest throughput number reported by the
> switch during a checkpoint was 10 MBytes/s! Given that this was my
> first experience setting up a Fibre SAN, I figured that I had done
> something wrong. So, to validate the SAN configuration, I conducted a
> simple experiment by copying an ISO disk image from the server's
> internal U320 SCSI drive to the SAN (configured with eight 18GB U160
> drives). Wow, what a difference! The switch reported a sustained
> throughput rate of about 91 MBytes/s for the duration of the copy.
> This, in turn, told me two things: 1) The SAN was configured correctly
> and 2) FLAIM wasn't keeping the I/O channel busy.
>
> This was perplexing because FLAIM tries to be ultra efficient in the
> way it interacts with the disk. This includes using async and direct
> I/O when available, ordering writes to minimize seeking, using
> sector-aligned buffers, and various other techniques. As it turns
> out, although FLAIM (and XFLAIM) use async I/O and multiple write
> buffers, the code wasn't taking advantage of the fact that
> out-of-order I/O completion can result in later writes completing
> before earlier writes (seems somewhat obvious, right?).
>
> Basically, there are a limited number of I/O buffers managed by FLAIM.
> As FLAIM flushes dirty cache to disk, it acquires a buffer from the
> buffer manager, holds onto it until the write completes, and then
> releases the buffer back to the manager. When all of the write
> buffers are in use, FLAIM must wait for a pending I/O to complete
> before queuing an additional I/O operation. The problem is that at
> any one time during a checkpoint, we may have thousands of pending
> writes and it is unknown which one will complete first. FLAIM was
> taking the simplistic approach of waiting for the earliest queued
> write to complete; this, however, was rarely the first I/O to
> complete. The result was that there were many usable I/O buffers
> available, but FLAIM was unaware of this fact because it was blocked
> waiting for the earliest I/O to complete and had not acknowledged the
> completion of the other writes.
>
> Platforms that support async I/O all provide a simple way of
> determining if an async I/O has completed via a call to a routine like
> GetOverlappedResult (on Windows). Windows and NetWare (and possibly
> Linux, Solaris, AIX, OS X, etc.) provide an additional callback-based
> mechanism to notify an application that I/O has completed. It's
> possible to use this callback-based notification to leverage
> out-of-order I/O completion to FLAIM's advantage. Instead of waiting
> for a specific I/O operation to complete so that its buffer can be
> re-used, the callback can do all of the work needed to release the
> buffer back to the buffer manager and also alert (via a semaphore) any
> thread waiting for a buffer to become available. This results in
> efficient and timely buffer re-use and has a huge impact on I/O
> throughput (more details below).
>
> I made some code changes in the Windows-specific FTK code to use an
> I/O completion callback and also added a "buffer waiter" queue, fired
> up a new bulk load, and waited for the first checkpoint. Being used
> to checkpoints that run for a long time (anywhere from 30 - 90
> seconds) when flushing a large amount of dirty cache, I was somewhat
> disappointed to swing around to look at the Fibre switch and find that
> it was reporting throughput of only 10 MBytes/s. Argggh! When I
> looked at the stats on the bulk load screen, I soon realized that the
> I/O stats I was seeing on the switch were attributed to RFL writes,
> NOT the checkpoint ... it had already completed and the foreground
> bulk load was continuing! At the start of the next checkpoint, I made
> sure that I was watching the switch from the start and saw that the
> throughput was 93 MBytes/s. This is an improvement of 9.3 times.
>
> To further validate the changes, I started an unreasonably large bulk
> load and let it run overnight. This morning, to my surprise, the bulk
> load (via gigatest) had completed, resulting in a database with
> 2,000,000,000 (yes, TWO BILLION) objects and a total load time of just
> under 6 hours (that's 5.5 million objects a minute). In terms of the
> number of objects, this is the largest FLAIM database ever created
> (although I would be thrilled if someone contradicted this claim). Of
> course, I proudly announced these results to my wife at breakfast in a
> feeble attempt to justify my recent eBay expenditures made in
> acquiring all of this equipment. I'm not sure if she was impressed.
>
> In any case, these code changes have been checked in to the
> open-source SVN repository and will be included in the
> soon-to-be-released 4.9 version of FLAIM and the 5.1 version of
> XFLAIM. Currently, Windows will be the primary beneficiary of these
> improvements. I am still investigating similar improvements on the
> other supported platforms and will send out any new information as it
> becomes available.
>
> Thanks.
>
> _______________________________________________
> Flaim-devel mailing list
> Fla...@fo...
> http://forge.novell.com/mailman/listinfo/flaim-devel