On Thu, Feb 4, 2010 at 5:14 AM, John Hughes <john@calva.com> wrote:
In order to do filesystem failover we need to mount filesystems with the
"chard" flag, which ensures that on failover the backup node sees the
same filesystem state as the last one the primary node saw.  This is
necessary to avoid programs running on nodes other than the one that
crashed seeing unexpected filesystem changes during the failover.

As I understand it on Linux the chard flag effectively changes all
writes into synchronous writes.  Is this correct?

Incorrect.  After reviewing the OpenSSI code, I found that the chard flag does not change how writes are done in Linux.  It is up to user-space programs to open their file descriptors with the O_SYNC flag to get synchronous-write behavior.
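For reference, the user-space side looks like this: opening with O_SYNC makes each write(2) block until the data is on stable storage, which is the behavior we had wrongly assumed chard forced on every file.  (The path and helper name below are just for illustration.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a buffer through an O_SYNC descriptor: write(2) does not
   return until the data has reached stable storage. */
static int write_sync(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, strlen(data));
    close(fd);
    return n == (ssize_t)strlen(data) ? 0 : -1;
}
```

Without O_SYNC (and without a later fsync) the kernel is free to keep the pages dirty in the page cache, so a failover could lose them.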

We probably made an incorrect assumption about CFS based on our experience with the older releases.

While looking over the code I believe I found a bug that explains why chard mounts sometimes seem slower.  For csoft mounts, dirty pages in the backing filesystem are not flushed during the fsync syscall, so depending on various factors it could be a while before those dirty pages reach disk.
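For reference, the affected user-space pattern is the ordinary write-then-fsync sequence (path and helper name below are just for illustration):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write then fsync: fsync(2) is supposed to flush the file's dirty
   pages in the backing filesystem to disk before returning.  The
   csoft bug described above means that flush can be skipped, leaving
   the dirty pages to be written back at some later time. */
static int write_and_fsync(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, strlen(data)) != (ssize_t)strlen(data)) {
        close(fd);
        return -1;
    }
    int rc = fsync(fd);  /* should not return until data is on disk */
    close(fd);
    return rc;
}
```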

What effect would the "data=journal" ext3 mount option have?  In fact,
isn't it needed?

With the exception of cached data, the surviving CFS nodes will see the state of the backing filesystem after journal recovery.  So the answer depends on what you want.

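For reference, data=journal is selected per filesystem at mount time or in /etc/fstab; the device and mount point below are placeholders:

```
# /etc/fstab entry (hypothetical device and mount point)
/dev/sda1   /cluster   ext3   data=journal   0 2
```

With data=journal, file data as well as metadata goes through the journal, so journal recovery replays data writes too, at a significant write-performance cost.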
As I remember the UnixWare cfs implementation would attempt to reduce
the performance loss of "chard" mounts by checkpointing writes to the
filesystem backup node rather than forcing them to disk.

Maybe we could obtain the same effect by writing our own version of the
Linux "jbd" (Journaling block device, http://kerneltrap.org/node/6741)
which would checkpoint stuff to another node, instead of to disk.  (We'd
still need to force real user requested sync data to disk though.)

Or maybe this is all nonsense.

Maybe similar results can be achieved with battery backed cache and DRBD.