[Flaim-users] Improved FLAIM I/O performance (up to 9 times better!!!!)
From: Andrew H. <AHo...@no...> - 2006-09-08 11:27:26
All, I was recently able to acquire some used Fibre Channel equipment, including a cool Brocade switch with a front-panel display that can be configured to report I/O throughput. Being excited to test the switch, Fibre card, and disk array, I decided to run a FLAIM bulk-load via the gigatest utility. I expected to see the I/O channel maxed out at 100 MBytes/s when forcing a checkpoint and flushing hundreds of megabytes of dirty cache. Unfortunately, the I/O channel was hardly being used ... in fact the highest throughput number reported by the switch during a checkpoint was 10 MBytes/s!

Given that this was my first experience setting up a Fibre SAN, I figured that I had done something wrong. So, to validate the SAN configuration, I conducted a simple experiment by copying an ISO disk image from the server's internal U320 SCSI drive to the SAN (configured with eight 18GB U160 drives). Wow, what a difference! The switch reported a sustained throughput rate of about 91 MBytes/s for the duration of the copy. This, in turn, told me two things: 1) the SAN was configured correctly, and 2) FLAIM wasn't keeping the I/O channel busy.

This was perplexing because FLAIM tries to be ultra-efficient in the way it interacts with the disk. This includes using async and direct I/O when available, ordering writes to minimize seeking, using sector-aligned buffers, and various other techniques. As it turns out, although FLAIM (and XFLAIM) use async I/O and multiple write buffers, the code wasn't taking advantage of the fact that out-of-order I/O completion can result in later writes completing before earlier writes (seems somewhat obvious, right?). Basically, there are a limited number of I/O buffers managed by FLAIM. As FLAIM flushes dirty cache to disk, it acquires a buffer from the buffer manager, holds onto it until the write completes, and then releases the buffer back to the manager.
When all of the write buffers are in use, FLAIM must wait for a pending I/O to complete before queuing an additional I/O operation. The problem is that at any one time during a checkpoint, we may have thousands of pending writes and it is unknown which one will complete first. FLAIM was taking the simplistic approach of waiting for the earliest queued write to complete; this, however, was rarely the first I/O to complete. The result was that there were many usable I/O buffers available, but FLAIM was unaware of this fact because it was blocked waiting for the earliest I/O to complete and had not acknowledged the completion of the other writes.

Platforms that support async I/O all provide a simple way of determining if an async I/O has completed via a call to a routine like GetOverlappedResult (on Windows). Windows and NetWare (and possibly Linux, Solaris, AIX, OS X, etc.) provide an additional callback-based mechanism to notify an application that I/O has completed. It's possible to use this callback-based notification to leverage out-of-order I/O completion to FLAIM's advantage. Instead of waiting for a specific I/O operation to complete so that its buffer can be re-used, the callback can do all of the work needed to release the buffer back to the buffer manager and also alert (via a semaphore) any thread waiting for a buffer to become available. This results in efficient and timely buffer re-use and has a huge impact on I/O throughput (more details below).

I made some code changes in the Windows-specific FTK code to use an I/O completion callback and also added a "buffer waiter" queue, fired up a new bulk load, and waited for the first checkpoint. Being used to checkpoints that run for a long time (anywhere from 30-90 seconds) when flushing a large amount of dirty cache, I was somewhat disappointed to swing around to look at the Fibre switch and find that it was reporting throughput of only 10 MBytes/s. Argggh!
When I looked at the stats on the bulk load screen, I soon realized that the I/O stats I was seeing on the switch were attributed to RFL writes, NOT the checkpoint ... it had already completed and the foreground bulk load was continuing! At the start of the next checkpoint, I made sure that I was watching the switch from the start and saw that the throughput was 93 MBytes/s. This is an improvement of 9.3 times.

To further validate the changes, I started an unreasonably large bulk load and let it run overnight. This morning, to my surprise, the bulk load (via gigatest) had completed, resulting in a database with 2,000,000,000 (yes, TWO BILLION) objects and a total load time of just under 6 hours (that's 5.5 million objects a minute). In terms of the number of objects, this is the largest FLAIM database ever created (although I would be thrilled if someone contradicted this claim).

Of course, I proudly announced these results to my wife at breakfast in a feeble attempt to justify my recent eBay expenditures made in acquiring all of this equipment. I'm not sure if she was impressed.

In any case, these code changes have been checked in to the open-source SVN repository and will be included in the soon-to-be-released 4.9 version of FLAIM and the 5.1 version of XFLAIM. Currently, Windows will be the primary beneficiary of these improvements. I am still investigating similar improvements on the other supported platforms and will send out any new information as it becomes available. Thanks.