#104 Colinux thrashes on boot

Milestone: v0.6.x (release)
Status: closed-out-of-date
Owner: Henry N.
Labels: None
Priority: 5
Updated: 2008-05-01
Created: 2006-09-10
Private: No

I'm having a problem with colinux thrashing the disk on
launch. I'm running colinux 0.6.4-linux-2.6.11, and
I've got it set up with a debian install on a reiserfs
image on cobd0 (made by "cp -ax"ing the colinux stock
debian image after installing the reiserfs utils to
it). When I start my colinux setup it usually gets as
far as:

[... snip ...]

NET: Registered protocol family 1
NET: Registered protocol family 17
ReiserFS: cobd0: found reiserfs format "3.6" with standard journal
ReiserFS: cobd0: using ordered data mode
ReiserFS: cobd0: journal params: device cobd0, size 8192, journal first block 18, max trans len 1024, max batch 900,
ReiserFS: cobd0: checking transaction log (cobd0)

and sits there hitting the disk for several minutes
before continuing. If I force kill the colinux-daemon
process while it's doing this (taskkill /im
colinux-daemon.exe /f), it doesn't die for several
minutes (i.e. the amount of time usually spent
thrashing) presumably because it's blocked on a huge IO
operation.

But it doesn't always do this... Sometimes it boots
without unusual disk activity, especially on subsequent
colinux launches before I restart windows again. (That
could just be the effects of disk caching in windows,
but I'm not sure.)

This behaviour happens on both the systems I've tried
colinux on: my dual core athlon 64 X2 nforce 4 box at
work, and my athlon XP 2500 nforce 2 box at home. On
my work box, the cobd0 image is 20GB (21474836480
bytes); the one on my home box is substantially
smaller, (~8GB IIRC -- I don't have it handy right now.)

Discussion

  • Andrew Tonner

    Andrew Tonner - 2006-09-10

    Logged In: YES
    user_id=39760

    I should mention that I've removed the initrd section from
    my configuration file in case this bug is somehow related to
    the known problem with that, but this problem didn't go away.

     
  • George P Boutwell

    Logged In: YES
    user_id=30412

    Sounds like there is some big disk operation going on in
    coLinux. I don't know what that operation is (perhaps
    coLinux didn't get shut down correctly & reiserfs is trying
    to replay a long journal?), but you should probably leave it
    to complete instead of trying to kill it.

    Make sure that you are shutting down coLinux by logging in
    and running a proper linux shutdown command (halt, poweroff,
    shutdown -h now, etc.) and not just 'killing' coLinux processes.

     
  • Andrew Tonner

    Andrew Tonner - 2006-09-12

    Logged In: YES
    user_id=39760

    I had sort of assumed that even kernel space IO happening on
    the linux side wouldn't cause the colinux-daemon process to
    block for IO like this. But I don't know the internals so I
    guess I should stop making assumptions like that. =)

    Still, other things suggest to me that it's not a reiserfs
    journal replay:

    - Judging by the messages printed by the time the reads are
    happening, the system hasn't yet reached the point where the
    journal replay should happen, AFAIK.

    - I fired up Sysinternals' FileMon, and the disk activity is
    colinux-daemon doing a series of consecutive (in terms of
    offsets) 64k IRP_MJ_READs. FileMon doesn't show the target
    of the reads (it just gives C:) but it must be the volume
    file, judging by the eventually huge offsets (I don't have
    any other files that big) and the fact that the last read
    before colinux continues is right at 20GB (the last read
    offset & size line up with the volume file end position)...
    unless it's reading something other than a file.

    - Also, this behaviour happens even when the last run of
    colinux was one that worked fine and was shut down normally
    with halt or shutdown.
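The FileMon observation above can be turned into rough numbers. The sketch below is only an illustration: it counts how many consecutive 64 KiB reads are needed to cover the 21474836480-byte image and estimates how long one full sequential pass would take. The throughput figure (`ASSUMED_MB_PER_S`) is an invented assumption, not a measurement from either machine in this report.

```python
IMAGE_SIZE = 21474836480      # bytes, size of the cobd0 image from the report
READ_SIZE = 64 * 1024         # bytes, the IRP_MJ_READ size seen in FileMon
ASSUMED_MB_PER_S = 50         # assumption: sustained disk throughput, not measured


def full_pass_stats(image_size, read_size, mb_per_s):
    """Return (number of reads, offset of last read, seconds for one pass)."""
    reads = -(-image_size // read_size)          # ceiling division
    last_offset = (reads - 1) * read_size        # where the final read starts
    seconds = image_size / (mb_per_s * 1_000_000)
    return reads, last_offset, seconds


reads, last_offset, seconds = full_pass_stats(IMAGE_SIZE, READ_SIZE, ASSUMED_MB_PER_S)
print(f"{reads} reads, last at offset {last_offset}, about {seconds / 60:.1f} min")
```

At that assumed rate a single pass over the image takes several minutes, which is at least consistent with the stall being one sequential read of the whole volume file. It does not show why the read happens.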

     
  • Henry N.

    Henry N. - 2006-09-13

    Logged In: YES
    user_id=579204

    It could be a limit in one of colinux's block operations.

    Could you please boot from another image, for example the
    small Debian, ArchLinux or Fedora image?

    Then check your image, without mounting it, using the reiser
    tools. I don't know the tool; it should be something like
    "fsck.ext3 -f /dev/cobd1" for an ext3 system.

    Then mount this device, unmount it, and check again.

    Then mount it, write some data, unmount it, and check again.

    A totally different idea:
    I'm afraid that your shutdown doesn't complete the reiser
    umount. Please try going into runlevel S (single user mode
    without network). Check that no other tasks are running and
    that no task needs write access to your root filesystem.
    Then run this command sequence:
    "sync; sleep 1; sync; sleep 3; mount -o remount,ro /"
    The remount should not give an error.
    Now check your root filesystem device with the reiser tools.
    If it is clean, shut down your system and start it again.
    Does this help?
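The offline check sequence described above could be written down as an ordered command list. This is a minimal sketch, assuming `reiserfsck --check` as the read-only reiserfs check (the comment itself only names `fsck.ext3 -f` for ext3). The device `/dev/cobd1`, the mount point `/mnt/check` and the test file name are hypothetical placeholders. The script only prints the commands and does not run anything.

```python
import shlex


def check_sequence(device="/dev/cobd1", mountpoint="/mnt/check", fstype="reiserfs"):
    """Return the ordered commands for the offline check described above.

    device, mountpoint and the test file name are hypothetical placeholders.
    """
    # Assumption: reiserfsck --check for reiserfs; fsck.ext3 -f is the ext3
    # example quoted in the comment.
    if fstype == "reiserfs":
        check = ["reiserfsck", "--check", device]
    else:
        check = ["fsck.ext3", "-f", device]
    mount = ["mount", "-t", fstype, device, mountpoint]
    umount = ["umount", mountpoint]
    write = ["touch", f"{mountpoint}/colinux-write-test"]
    return [
        check,                          # 1. check without mounting
        mount, umount, check,           # 2. mount, unmount, check again
        mount, write, umount, check,    # 3. mount, write something, unmount, check
    ]


for argv in check_sequence():
    print(shlex.join(argv))
```

The commands are meant to be run by hand from the second image, so that the volume under test is never the mounted root filesystem.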

     
  • Andrew Tonner

    Andrew Tonner - 2007-01-04

    Logged In: YES
    user_id=39760
    Originator: YES

    I've gone through this sequence of checks, and fsck never encounters any file system errors, and except for the occasional thrashing for several minutes when I mount a reiserfs volume nothing unusual happens.

     
  • Ben Voigt

    Ben Voigt - 2007-01-04

    Logged In: YES
    user_id=782364
    Originator: NO

    reiserfs, being a journalled filesystem, usually checks itself very quickly. However, by default every 20th boot it forces a full check. The frequency of checks can be changed in the reiser metadata... but looking at reiserfstune I can't find the command for it right now.

     
  • Andrew Tonner

    Andrew Tonner - 2007-01-05

    Logged In: YES
    user_id=39760
    Originator: YES

    This doesn't seem to be a filesystem-specific problem. I mkfsed an identically-sized (21474836480 bytes) ext3 volume, cp -ax'd the contents of my reiserfs volume across to it, modified the fstab, and then put my colinux config back so that only the new ext3 volume is being used. After a windows restart, when I start colinux, it sits and thrashes for several minutes at roughly the same place.

    dmesg:

    Linux version 2.6.11-co-0.6.4 (george@CoDebianDevel) (gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)) #1 Mon Jun 19 05:36:13 UTC 2006
    520MB LOWMEM available.
    On node 0 totalpages: 133120
    DMA zone: 0 pages, LIFO batch:1
    Normal zone: 133120 pages, LIFO batch:16
    HighMem zone: 0 pages, LIFO batch:1
    Built 1 zonelists
    Kernel command line: root=/dev/cobd0
    Initializing CPU#0
    Setting proxy interrupt vectors
    PID hash table entries: 4096 (order: 12, 65536 bytes)
    Using cooperative for high-res timesource
    Console: colour CoCON 80x25
    Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
    Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
    Memory: 523648k/532480k available (1537k kernel code, 0k reserved, 521k data, 108k init, 0k highmem)
    Calibrating delay loop... 734.00 BogoMIPS (lpj=3670016)
    Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
    CPU: After generic identify, caps: 178bfbff e3d3fbff 00000000 00000000 00000001 00000000 00000003
    CPU: After vendor identify, caps: 178bfbff e3d3fbff 00000000 00000000 00000001 00000000 00000003
    CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
    CPU: L2 Cache: 512K (64 bytes/line)
    CPU: After all inits, caps: 178bfbff e3d3fbff 00000000 00000010 00000001 00000000 00000003
    CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ stepping 01
    Enabling fast FPU save and restore... done.
    Enabling unmasked SIMD FPU exception support... done.
    Checking 'hlt' instruction... OK.
    NET: Registered protocol family 16
    devfs: 2004-01-31 Richard Gooch (rgooch@atnf.csiro.au)
    devfs: boot_options: 0x0
    cofuse init 0.1 (API version 2.2)
    Initializing Cryptographic API
    serio: cokbd at irq 1
    io scheduler noop registered
    io scheduler anticipatory registered
    io scheduler deadline registered
    io scheduler cfq registered
    RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
    cobd: loaded (max 32 devices)
    loop: loaded (max 8 devices)
    conet: loaded (max 16 devices)
    conet0: initialized
    conet1: initialized
    mice: PS/2 mouse device common for all mice
    input: AT Translated Set 2 keyboard on cokbd
    NET: Registered protocol family 2
    IP: routing cache hash table of 4096 buckets, 32Kbytes
    TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
    TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
    TCP: Hash tables configured (established 131072 bind 65536)
    NET: Registered protocol family 1
    NET: Registered protocol family 17
    [[Here is where it sits and thrashes for several minutes, then]]
    EXT3 FS on cobd0, internal journal
    EXT3-fs: mounted filesystem with ordered data mode.
    VFS: Mounted root (ext3 filesystem).
    Freeing unused kernel memory: 108k freed
    kjournald starting. Commit interval 5 seconds
    Adding 524280k swap on /dev/cobd1. Priority:-1 extents:1
    EXT3 FS on cobd0, internal journal
    [... snip ...]

     
  • Andrew Tonner

    Andrew Tonner - 2007-01-12
    • milestone: --> 647239
     
  • Andrew Tonner

    Andrew Tonner - 2007-01-12

    Logged In: YES
    user_id=39760
    Originator: YES

    I switched to an 0.8.0 snapshot (20061212), still using my 21474836480 byte ext3 volume, and I get exactly the same behaviour; it stalls at mount time and hits the disk for several minutes before continuing.

     
  • Henry N.

    Henry N. - 2007-03-09

    Logged In: YES
    user_id=579204
    Originator: NO

    Hello,

    Could the size be the problem? 21474836480 bytes = 20GB

    Before you start coLinux, please run the debugger:
    colinux-debug-daemon.exe -d -p -s prints=31,misc=31 -f debug.xml
    or
    colinux-debug-daemon.exe -d -p -s prints=31,misc=31,messages=31 -f debug2.xml

    You can stop the debugger with CTRL-C once the "several minutes" problem has begun.
    Then look through the debug output. I'm interested in the drive geometry detection. Please also look for any mysterious messages about your drive there.

    The debug2.xml can be very big. You can remove all the duplicated block operations from the start of the problem to the end. But do look for problems or anything else out of the ordinary in the output.

    The format is XML; the human-readable text is between the "<strings>" tags. You can open it with IE.

     
  • Henry N.

    Henry N. - 2007-03-09

    Logged In: YES
    user_id=579204
    Originator: NO

    I'm sorry. The second line should be:

    colinux-debug-daemon.exe -d -p -s prints=31,misc=31,blockdev=31 -f debug2.xml
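Since the debug log can get very big, it may be easier to filter it than to read it in a browser. The sketch below is a hedged illustration, not a coLinux tool. It makes no assumption about the log's tag names and simply scans the text of every XML element for keywords. The keywords `geometry` and `error` are guesses at what might be worth searching for, and the function name `find_messages` is made up for this example.

```python
import xml.etree.ElementTree as ET


def find_messages(path, keywords):
    """Return the text of every element that mentions one of the keywords."""
    wanted = [k.lower() for k in keywords]
    hits = []
    try:
        for _event, elem in ET.iterparse(path, events=("end",)):
            text = (elem.text or "").strip()
            if text and any(k in text.lower() for k in wanted):
                hits.append(text)
            elem.clear()  # keep memory use flat on very large logs
    except ET.ParseError:
        # A log cut off with CTRL-C is probably not well-formed XML;
        # keep whatever was collected before the parser gave up.
        pass
    return hits


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        # Keywords are assumptions; adjust them to whatever looks suspicious.
        for line in find_messages(sys.argv[1], ["geometry", "error"]):
            print(line)
```

Run it with the log file as the only argument, for example `python filter_debug.py debug2.xml` (the script file name is arbitrary).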

     
  • Henry N.

    Henry N. - 2008-04-16
    • milestone: 647239 --> v0.6.x (release)
    • assigned_to: nobody --> henryn
    • status: open --> pending-out-of-date
     
  • Henry N.

    Henry N. - 2008-04-16

    Logged In: YES
    user_id=579204
    Originator: NO

    A bug in the page fault handler for sys_mount (mounting the root filesystem) could be the problem here. Similar bugs were fixed in 0.7.3 RC3 and in the devel snapshot 0.8.0-20080415; see http://www.colinux.org/snapshots/

     
  • SourceForge Robot

    Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 14 days (the time period specified by
    the administrator of this Tracker).

     
  • SourceForge Robot

    • status: pending-out-of-date --> closed-out-of-date