Hi Sam,

Thanks, that clarifies things considerably: SACKs within a bio barrier/boundary, stop-and-wait between bio boundaries.  I thought it might be some kernel-fu.

Do you know of any Linux filesystem implementations that do not properly use barriers (or some other mechanism) to ensure their bio ordering, i.e. ones to think twice about using for DBs etc.?

Cheers,
Mike Sample


On 7/9/07, Sam Hopkins <sah@coraid.com> wrote:
Hello Mike,

In considering write ordering you've stumbled into a rather involved
systems issue.  At the AoE level, it does not matter what order the
writes get to the device.  The AoE driver is given I/O in units called
bios and only acknowledges the completion of a bio when the entire bio
is successfully transferred.
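
As a rough illustration of that completion rule, here is a toy model
in Python.  It is only a sketch and not the aoe driver's code: the
Bio class, its ack() method and the three-frame split are all made up
for the example.  It shows a bio being reported complete only once
every one of its frames has been acknowledged, whatever order the
acknowledgments arrive in.

```python
class Bio:
    """Toy stand-in for a bio split into several AoE frames.

    Hypothetical model for illustration; not kernel code.
    """

    def __init__(self, nframes):
        # Frames that have been sent but not yet acknowledged.
        self.pending = set(range(nframes))

    def ack(self, frame):
        """Record one acknowledgment; return True once the bio is complete."""
        self.pending.discard(frame)
        return not self.pending


bio = Bio(3)
# Frame 1 is lost and retransmitted, so its acknowledgment arrives last.
results = [bio.ack(frame) for frame in (0, 2, 1)]
print(results)
```

The bio is only reported done on the last acknowledgment, so a
retransmit inside a bio is invisible to the layers above.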

The AoE driver processes more than one bio simultaneously.  As a
result, it is possible for bios to complete out of order.
Traditionally filesystems requiring specific order to their writes
will communicate with the block layer to enforce boundaries in the I/O
stream.  Recent Linux kernels support this down to the device driver
in the form of barrier requests in an attempt to optimize the I/O
stream.  The AoE driver does not currently utilize this feature and
instead relies on the block layer to drain the queue and wait for the
completion of bios prior to any boundary point.
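
To make the effect of those boundaries concrete, here is a small
Python simulation.  It is a sketch under stated assumptions, not a
model of the real block layer: run(), flush(), the block number and
the use of a shuffle to stand in for out-of-order completion are all
invented for the example.  It replays the three writes to block n
from the original question, once with the queue drained after every
write and once with all three in flight together.

```python
import random


def flush(disk, in_flight, rng):
    """Complete every in-flight write, in an arbitrary order."""
    rng.shuffle(in_flight)
    for block, data in in_flight:
        disk[block] = data
    in_flight.clear()


def run(writes, drain_each, rng):
    """Issue writes to a toy disk and return its final contents.

    With drain_each set, the queue is drained after every write, which
    models a boundary between each pair of writes.
    """
    disk, in_flight = {}, []
    for write in writes:
        in_flight.append(write)
        if drain_each:
            flush(disk, in_flight, rng)
    flush(disk, in_flight, rng)
    return disk


N = 7  # arbitrary block number, chosen only for the example
writes = [(N, "000...10"), (N, "000...11"), (N, "000...12")]

drained = {run(writes, True, random.Random(s))[N] for s in range(20)}
undrained = {run(writes, False, random.Random(s))[N] for s in range(20)}
print("with boundaries:   ", sorted(drained))
print("without boundaries:", sorted(undrained))
```

With a drain after each write the last write always wins.  Without
one, the final contents depend on completion order, which is why the
ordering guarantee has to come from boundaries requested by the
filesystem rather than from AoE itself.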

The AoE protocol spec is kept small by deliberately documenting only
the protocol itself.  Implementors are free to choose whatever
retransmission mechanism works best for their environment.

Cheers,

Sam

>  From reading the AoE protocol document and other articles it
> sounds like the AoE protocol has selective acknowledgments and
> is not stop-and-wait.
>
> Given that block reads and writes may have an important ordering,
> does AoE have any mechanism to ensure that a re-transmit does not
> violate the ordering?  For example if three writes to the same
> block with different data are issued close together:
>
> 1. w(blk n, '000...10')
> 2. w(blk n, '000...11')
> 3. w(blk n, '000...12')
>
> If 1 and 3 are ACKed and 2 is lost, then successfully
> retransmitted afterward, does this not silently result in
> incorrect data? The system file cache flushing alg. probably
> avoids this situation normally unless there were syncs between
> 1-2 and 2-3.  But, is this possible?
>
> A section on timers/retransmits to the AoE protocol spec with
> some profiles of standard values would be handy...
>
> Thanks in advance
>
> _______________________________________________
> Aoetools-discuss mailing list
> Aoetools-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/aoetools-discuss