I found a bug in my patch.  You'd have to add spin_unlock() in cfs_commit_inode() in cluster/ssi/cfs/write.c at approx. line 1325:

        spin_lock(&cfs_wreq_lock);
//      res = cfs_scan_commit(inode, &head, idx_start, npages);
        res = cfs_scan_commit(inode, &head, 0, 0);
        spin_unlock(&cfs_wreq_lock);
        if (res) {
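
For anyone wondering why the missing spin_unlock() matters, here is a minimal userspace sketch of the pattern. It is not the CFS code: a C11 atomic_flag stands in for the kernel spinlock, and every name in it (wreq_lock_sketch, scan_commit_stub, commit_sketch, and so on) is hypothetical. It only shows that a path which takes the lock and returns without releasing it leaves the lock held for the next caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a kernel spinlock. All names are hypothetical. */
static atomic_flag wreq_lock_sketch = ATOMIC_FLAG_INIT;

static void sketch_lock(void)
{
    /* Spin until the flag was previously clear, i.e. we now own the lock. */
    while (atomic_flag_test_and_set_explicit(&wreq_lock_sketch,
                                             memory_order_acquire))
        ;
}

static void sketch_unlock(void)
{
    atomic_flag_clear_explicit(&wreq_lock_sketch, memory_order_release);
}

/* Takes the lock and returns true if it was free; returns false if held. */
static bool sketch_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&wreq_lock_sketch,
                                              memory_order_acquire);
}

/* Placeholder for the scan done under the lock; always reports success. */
static int scan_commit_stub(void)
{
    return 0;
}

/* Same shape as the snippet above: lock, scan, unlock, then return res.
 * Passing with_unlock = false reproduces the forgotten unlock. */
static int commit_sketch(bool with_unlock)
{
    int res;

    sketch_lock();
    res = scan_commit_stub();
    if (with_unlock)
        sketch_unlock();
    return res;
}

/* Reports whether the lock is free after one commit call, then frees it. */
static bool lock_free_after_commit(bool with_unlock)
{
    bool was_free;

    commit_sketch(with_unlock);
    was_free = sketch_trylock();
    sketch_unlock();
    return was_free;
}

/* Quick check of both variants. */
static void sketch_selfcheck(void)
{
    assert(lock_free_after_commit(true));   /* unlock present: lock is free */
    assert(!lock_free_after_commit(false)); /* unlock missing: lock is held */
}
```

Calling sketch_selfcheck() from a main() exercises both variants. With a real spinlock the second caller would not get a "false" back, it would spin indefinitely.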

Roger


On 9/22/05, Roger Tsang <roger.tsang@gmail.com> wrote:
The patch is kinda messy as it's a straight diff of my latest work, but it will do.  Patch it against the kernel.  I can't check-in any of these until CVS is fixed...

Roger



On 9/22/05, Roger Tsang <roger.tsang@gmail.com> wrote:
I dunno about sync on same machine, I don't remember now.  It's not a crash, so it's hard to tell.  I'll do a few more tests later after work hours.  I'll put out a CFS patch for you to try.

Roger



On 9/22/05, Andy Phillips <Andrew.Phillips@betfair.com> wrote:
Hi,

If you type "sync" on the same machine in another window, does it
recover?

Any ideas as to the underlying cause?

Andy

On Sat, 2005-09-17 at 14:38 -0400, Roger Tsang wrote:
> Alright I can reproduce this by doing the large file copy.  It gets
> stuck here...
>
> Stack traceback for pid 136777
> 0xc57f3a80   136777   136733  0    0   D  0xc57f3c40  mc
> EBP        EIP        Function (args)
> 0xd8e01cc0 0xc03b4103 schedule+0x2b3
> 0xd8e01cc8 0xc03b462e io_schedule+0xe (0xc15001b0)
> 0xd8e01cd4 0xc0136745 sync_page+0x35 (0xc1251ea8, 0x0, 0xc0136710, 0xc57f3a80, 0xd8e01d24)
> 0xd8e01cf4 0xc03b49a9 __wait_on_bit_lock+0x49 (0x2, 0xc1251ea8, 0xc1251ea8, 0x0, 0x0)
> 0xd8e01d50 0xc0136f5a __lock_page+0x8a (0xd666c8a0, 0x1d838, 0xda0e96a0, 0x1d838, 0x2)
> 0xd8e01de8 0xc013764b do_generic_mapping_read+0x3db (0xd666c8a0, 0xda0e96e8, 0xda0e96a0, 0xd8e01f14, 0xd8e01e1c)
> 0xd8e01e38 0xc0137b24 __generic_file_aio_read+0x194 (0xd8e01ed8, 0xd8e01e50, 0x1, 0xd8e01f14, 0x8135998)
> 0xd8e01e64 0xc0137be2 generic_file_aio_read+0x52 (0xd8e01ed8, 0x8135998, 0x2000, 0x1d838000, 0x0)
> 0xd8e01ea0 0xc0268f40 __cfs_file_read+0xc0 (0xd8e01ed8, 0x0, 0x8135998, 0x2000, 0xd8e01ed0)
> 0xd8e01ebc 0xc0268ffe cfs_file_aio_read+0x2e (0xd8e01ed8, 0x8135998, 0x2000, 0x1d838000, 0x0)
> 0xd8e01f64 0xc015561b do_sync_read+0xab (0xda0e96a0, 0x8135998, 0x2000, 0xd8e01fa8, 0x0)
> 0xd8e01f90 0xc0155758 vfs_read+0xe8 (0xda0e96a0, 0x8135998, 0x2000, 0xd8e01fa8, 0x1d838000)
> 0xd8e01fbc 0xc0155a1b sys_read+0x4b
>            0xc0103c55 sysenter_past_esp+0x52
>
> On 9/17/05, Roger Tsang <roger.tsang@gmail.com> wrote:
>         Okay I ran into this hang just a moment ago while copying a
>         very large file from node 2 to the init node.  It hangs at the
>         very end of the file.  Then if I do "sync" as you have
>         suggested, the copy completes.  I guess next time I see this
>         I'll do a backtrace on the copy process.  My guess is it's
>         probably waiting in CFS wait_for_congestion().
>
>         Have you tried a different IO scheduler?  Try deadline if you
>         were using cfq.
>
>         Roger
>
>
>
>         On 8/25/05, John Byrne <john.l.byrne@hp.com> wrote:
>                 Andy Phillips wrote:
>                 > Following on;
>                 >
>                 >   It appears that if I remount the file system with the
>                 > "sync" option then this problem goes away. But performance
>                 > is bad. Shutting down the other node in the cluster does
>                 > not seem to affect this at all.
>                 >
>                 >   Would SSI or the CFS cause issues with async i/o? Would
>                 > that follow a different path to a normal kernel?
>                 >   Andy
>                 >
>
>                 There can certainly be bugs, and I do note that your hanging
>                 write is rather large. Maybe that is the cause of the problem.
>                 Maybe you could make a simple test case with 256k writes and
>                 see if that hangs.
>
>                 John
>
--
Andy Phillips, FRAS
Systems Architect, Performance and Test Manager.
Infrastructure.

Direct Line: 0208 834 8436

Waterfront | Hammersmith Embankment | Chancellors Road London | W6 9HP
