From: SourceForge.net <no...@so...> - 2009-01-21 09:20:00
Bugs item #2524658, was opened at 2009-01-20 20:54
Message generated for change (Comment added) made by aryairani
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=622063&aid=2524658&group_id=98788

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Crash / BSOD
Group: v0.7.x (release)
Status: Open
Resolution: Accepted
Priority: 5
Private: No
Submitted By: Arya (aryairani)
Assigned to: Henry N. (henryn)
Summary: GPF trying to activate raid5 md array in 0.7.3

Initial Comment:
I'm using the 0.7.3 release with the Ubuntu 7.10 disk image on XP Pro SP2. I'm very excited about being able to use coLinux to access md arrays under Windows, but I'm having some trouble. I've tried creating arrays using /dev/loopX, and also /dev/cobdX, with varying degrees of failure.

sudo apt-get install mdadm
mkdir ~/raid
for i in ~/raid/{a,b,c,d}; do dd if=/dev/zero of=$i bs=10M count=1; sudo losetup -f $i; done

arya@co-calculon:~/raid$ sudo losetup -a
/dev/loop0: [7500]:16841 (/home/arya/raid/a)
/dev/loop1: [7500]:16843 (/home/arya/raid/b)
/dev/loop2: [7500]:16844 (/home/arya/raid/c)
/dev/loop3: [7500]:16845 (/home/arya/raid/d)

$ sudo modprobe md-mod
$ sudo mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/loop{0,1,2,3}
mdadm: array /dev/md0 started.
arya@co-calculon:~/raid$

Looks good so far, right? But dmesg shows a GPF:

md: bind<loop0>
md: bind<loop1>
md: bind<loop2>
md: bind<loop3>
raid5: automatically using best checksumming function: pIII_sse
   pIII_sse  :  2469.600 MB/sec
raid5: using function: pIII_sse (2469.600 MB/sec)
raid6: int32x1    407 MB/s
raid6: int32x2    678 MB/s
raid6: int32x4    523 MB/s
raid6: int32x8    480 MB/s
raid6: mmxx1     1571 MB/s
raid6: mmxx2     1866 MB/s
raid6: sse1x1     933 MB/s
raid6: sse1x2    1696 MB/s
raid6: sse2x1    1742 MB/s
raid6: sse2x2    2623 MB/s
raid6: using algorithm sse2x2 (2623 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
raid5: device loop2 operational as raid disk 2
raid5: device loop1 operational as raid disk 1
raid5: device loop0 operational as raid disk 0
raid5: allocated 4196kB for md0
raid5: raid level 5 set md0 active with 3 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:3
 disk 0, o:1, dev:loop0
 disk 1, o:1, dev:loop1
 disk 2, o:1, dev:loop2
RAID5 conf printout:
 --- rd:4 wd:3
 disk 0, o:1, dev:loop0
 disk 1, o:1, dev:loop1
 disk 2, o:1, dev:loop2
 disk 3, o:1, dev:loop3
md: recovery of RAID array md0
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 10176 blocks.
general protection fault: 0000 [#1]
PREEMPT
Modules linked in: raid456 xor md_mod ipv6 fuse
CPU:    0
EIP:    0060:[<c0103d46>]    Not tainted VLI
EFLAGS: 00010002   (2.6.22.18-co-0.7.3 #1)
EIP is at math_state_restore+0x26/0x50
eax: 8005003b   ebx: d8814ab0   ecx: db4aed00   edx: 00000000
esi: db0a8000   edi: db4add00   ebp: db0a9ce0   esp: db0a9cd8
ds: 007b   es: 007b   fs: 0000  gs: 0000  ss: 0068
Process md0_raid5 (pid: 2803, ti=db0a8000 task=d8814ab0 task.ti=db0a8000)
Stack: db4abd00 db4acd00 db0a9d78 c01038fe db4abd00 db4aed00 00000003 db4acd00
       db4add00 db0a9d78 db0a9d2c c146007b d881007b 00000000 ffffffff e0814bf3
       00000060 00010206 8005003b db4ae000 00000010 00000000 00000000 00000000
Call Trace:
 [<c0103bba>] show_trace_log_lvl+0x1a/0x30
 [<c0103c79>] show_stack_log_lvl+0xa9/0xd0
 [<c01040eb>] show_registers+0x21b/0x3a0
 [<c0104365>] die+0xf5/0x210
 [<c010534d>] do_general_protection+0x1ad/0x1f0
 [<c02a999a>] error_code+0x6a/0x70
 [<c01038fe>] device_not_available+0x2e/0x33
 [<e081282e>] xor_block+0x6e/0xa0 [xor]
 [<e095eaf8>] compute_block+0xd8/0x130 [raid456]
 [<e095fce7>] handle_stripe5+0x1197/0x13c0 [raid456]
 [<e0961522>] handle_stripe+0x32/0x16f0 [raid456]
 [<e0964007>] raid5d+0x2f7/0x450 [raid456]
 [<e094db00>] md_thread+0x30/0x100 [md_mod]
 [<c0123d32>] kthread+0x42/0x70
 [<c01039c7>] kernel_thread_helper+0x7/0x10
 =======================
Code: c3 8d 74 26 00 55 89 e5 83 ec 08 89 74 24 04 89 e6 89 1c 24 81 e6 00 e0 ff ff 8b 1e 0f 06 f6 43 0d 20 75 07 89 d8 e8 9a 37 00 00 <0f> ae 8b 10 02 00 00 83 4e 0c 01 fe 83 8d 01 00 00 8b 1c 24 8b
EIP: [<c0103d46>] math_state_restore+0x26/0x50 SS:ESP 0068:db0a9cd8
note: md0_raid5[2803] exited with preempt_count 2

For reference:

arya@co-calculon:~/raid$ sudo mdadm /dev/md0
/dev/md0: 29.81MiB raid5 4 devices, 1 spare. Use mdadm --detail for more detail.
arya@co-calculon:~/raid$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Jan 20 15:36:06 2009
     Raid Level : raid5
     Array Size : 30528 (29.82 MiB 31.26 MB)
  Used Dev Size : 10176 (9.94 MiB 10.42 MB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Jan 20 15:36:06 2009
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 0% complete

           UUID : 106a26c3:fae68f05:bc6a5c1d:79f7e806 (local to host co-calculon)
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       2       7        2        2      active sync   /dev/loop2
       4       7        3        3      spare rebuilding   /dev/loop3

So due to the GPF the array is stuck in a degraded state. I'm pretty sure it's not actually recovering at this point.

arya@co-calculon:~/raid$ sudo mdadm --stop /dev/md0
mdadm: fail to stop array /dev/md0: Device or resource busy
arya@co-calculon:~/raid$ dmesg | tail -n 1
md: md0 still in use.
arya@co-calculon:~/raid$

-----------------

Anyway, I tried again using block devices mapped to Windows files:

arya@co-calculon:~/raid$ cat /proc/partitions
major minor  #blocks  name

 117     0    2097152 cobd0
 117     1        128 cobd1
 117     3    2144646 cobd3
 117     5      10240 cobd5
 117     6      10240 cobd6
 117     7      10240 cobd7
 117     8      10240 cobd8
 117     9      10240 cobd9
   7     0      10240 loop0
   7     1      10240 loop1
   7     2      10240 loop2
   7     3      10240 loop3
   9     0      30528 md0

arya@co-calculon:~/raid$ sudo mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/cobd{5,6,7,8}

dmesg shows:

md: bind<cobd5>
md: bind<cobd6>
md: bind<cobd7>
md: bind<cobd8>
raid5: device cobd7 operational as raid disk 2
raid5: device cobd6 operational as raid disk 1
raid5: device cobd5 operational as raid disk 0
raid5: allocated 4196kB for md1
raid5: raid level 5 set md1 active with 3 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:3
 disk 0, o:1, dev:cobd5
 disk 1, o:1, dev:cobd6
 disk 2, o:1, dev:cobd7
RAID5 conf printout:
 --- rd:4 wd:3
 disk 0, o:1, dev:cobd5
 disk 1, o:1, dev:cobd6
 disk 2, o:1, dev:cobd7
 disk 3, o:1, dev:cobd8
md: recovery of RAID array md1
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 10176 blocks.
general protection fault: 0000 [#2]
PREEMPT
Modules linked in: raid456 xor md_mod ipv6 fuse
CPU:    0
EIP:    0060:[<c0103d46>]    Not tainted VLI
EFLAGS: 00010002   (2.6.22.18-co-0.7.3 #1)
EIP is at math_state_restore+0x26/0x50
eax: 8005003b   ebx: daa2d530   ecx: d91a5500   edx: 00000000
esi: d8fbe000   edi: d91a4500   ebp: d8fbfce0   esp: d8fbfcd8
ds: 007b   es: 007b   fs: 0000  gs: 0000  ss: 0068
Process md1_raid5 (pid: 2888, ti=d8fbe000 task=daa2d530 task.ti=d8fbe000)
Stack: d91a2500 d91a3500 d8fbfd78 c01038fe d91a2500 d91a5500 0000000b d91a3500
       d91a4500 d8fbfd78 d8fbfd2c 0000007b 0000007b 00000000 ffffffff e0814aa8
       00000060 00010202 8005003b d91a5000 00000010 00000000 00000000 00000000
Call Trace:
 [<c0103bba>] show_trace_log_lvl+0x1a/0x30
 [<c0103c79>] show_stack_log_lvl+0xa9/0xd0
 [<c01040eb>] show_registers+0x21b/0x3a0
 [<c0104365>] die+0xf5/0x210
 [<c010534d>] do_general_protection+0x1ad/0x1f0
 [<c02a999a>] error_code+0x6a/0x70
 [<c01038fe>] device_not_available+0x2e/0x33
 [<e081282e>] xor_block+0x6e/0xa0 [xor]
 [<e095eaf8>] compute_block+0xd8/0x130 [raid456]
 [<e095fce7>] handle_stripe5+0x1197/0x13c0 [raid456]
 [<e0961522>] handle_stripe+0x32/0x16f0 [raid456]
 [<e0964007>] raid5d+0x2f7/0x450 [raid456]
 [<e094db00>] md_thread+0x30/0x100 [md_mod]
 [<c0123d32>] kthread+0x42/0x70
 [<c01039c7>] kernel_thread_helper+0x7/0x10
 =======================
Code: c3 8d 74 26 00 55 89 e5 83 ec 08 89 74 24 04 89 e6 89 1c 24 81 e6 00 e0 ff ff 8b 1e 0f 06 f6 43 0d 20 75 07 89 d8 e8 9a 37 00 00 <0f> ae 8b 10 02 00 00 83 4e 0c 01 fe 83 8d 01 00 00 8b 1c 24 8b
EIP: [<c0103d46>] math_state_restore+0x26/0x50 SS:ESP 0068:d8fbfcd8
note: md1_raid5[2888] exited with preempt_count 2

On a previous attempt, I've also had it take colinux-daemon.exe into an infinite loop or something:

arya@co-calculon:~$ cat /proc/partitions
major minor  #blocks  name

 117     0    2097152 cobd0
 117     1        128 cobd1
 117     5      10240 cobd5
 117     6      10240 cobd6
 117     7      10240 cobd7
 117     8      10240 cobd8
 117     9      10240 cobd9

arya@co-calculon:~$ sudo mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/cobd{5,6,7,8}
mdadm: error opening /dev/md0: No such device or address
arya@co-calculon:~$ sudo modprobe md-mod
arya@co-calculon:~$ sudo mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/cobd{5,6,7,8}
*crash*

I can't kill colinux-daemon.exe, which is using 100% of one HT "cpu". The colinux-daemon.exe output shows:

md: bind<cobd5>
md: bind<cobd6>
md: bind<cobd7>
md: bind<cobd8>
raid5: automatically using best checksumming function: pIII_sse
   pIII_sse  :  2282.400 MB/sec
raid5: using function: pIII_sse (2282.400 MB/sec)
raid6: int32x1    712 MB/s
raid6: int32x2    714 MB/s
raid6: int32x4    537 MB/s
raid6: int32x8    491 MB/s
raid6: mmxx1     1533 MB/s
raid6: mmxx2     1893 MB/s
raid6: sse1x1     954 MB/s

The console window shows this much; I can't see where it started:

 [<c0103c79>] show_stack_log_lvl+0xa9/0xd0
 [<c01040eb>] show_registers+0x21b/0x3a0
 [<c0104365>] die+0xf5/0x210
 [<c010b6fc>] do_page_fault+0x38c/0x6e0
 [<c02a999a>] error_code+0x6a/0x70
 [<c010c3f8>] deactivate_task+0x18/0x30
 [<c02a7777>] __sched_text_start+0x377/0x670
 [<c01144b4>] do_exit+0x7d4/0x940
 [<c010447d>] die+0x20d/0x210
 [<c010b6fc>] do_page_fault+0x38c/0x6e0
 [<c02a999a>] error_code+0x6a/0x70
 [<c010c3f8>] deactivate_task+0x18/0x30
 [<c02a7777>] __sched_text_start+0x377/0x670
 [<c01144b4>] do_exit+0x7d4/0x940
 [<c010447d>] die+0x20d/0x210
 [<c010b6fc>] do_page_fault+0x38c/0x6e0
 [<c02a999a>] error_code+0x6a/0x70
 [<c010c3f8>] deactivate_task+0x18/0x30
 [<c02a7777>] __sched_text_start+0x377/0x670
 [<c01144b4>] do_exit+0x7d4/0x940
 [<c010447d>] die+0x20d/0x210
 [<c010b6fc>] do_page_fault+0x38c/0x6e0
 [<c02a999a>] error_code+0x6a/0x70
 [<c010c3f8>] deactivate_task+0x18/0x30

Thanks for all the amazing work already!

----------------------------------------------------------------------

>Comment By: Arya (aryairani)
Date: 2009-01-21 09:19

Message:
The previous post was with the 0.8.0 daily. Here I tried again, this time creating a new array rather than reassembling one as in the previous example. It seems to have worked:

md: bind<loop0>
md: bind<loop1>
md: bind<loop2>
md: bind<loop3>
raid5: measuring checksumming speed
   8regs     :  1577.200 MB/sec
   8regs_prefetch:  1398.800 MB/sec
   32regs    :   788.000 MB/sec
   32regs_prefetch:   861.600 MB/sec
raid5: using function: 8regs (1577.200 MB/sec)
raid6: int32x1    323 MB/s
raid6: int32x2    321 MB/s
raid6: int32x4    277 MB/s
raid6: int32x8    258 MB/s
raid6: mmxx1      739 MB/s
raid6: mmxx2      992 MB/s
raid6: sse1x1     462 MB/s
raid6: sse1x2     873 MB/s
raid6: sse2x1    1012 MB/s
raid6: sse2x2    1511 MB/s
raid6: using algorithm sse2x2 (1511 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
raid5: device loop2 operational as raid disk 2
raid5: device loop1 operational as raid disk 1
raid5: device loop0 operational as raid disk 0
raid5: allocated 4196kB for md0
raid5: raid level 5 set md0 active with 3 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:3
 disk 0, o:1, dev:loop0
 disk 1, o:1, dev:loop1
 disk 2, o:1, dev:loop2
RAID5 conf printout:
 --- rd:4 wd:3
 disk 0, o:1, dev:loop0
 disk 1, o:1, dev:loop1
 disk 2, o:1, dev:loop2
 disk 3, o:1, dev:loop3
md: recovery of RAID array md0
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 10176 blocks.
md: md0: recovery done.
RAID5 conf printout:
 --- rd:4 wd:4
 disk 0, o:1, dev:loop0
 disk 1, o:1, dev:loop1
 disk 2, o:1, dev:loop2
 disk 3, o:1, dev:loop3

It works OK with cobdX devices too; I don't know why it crashed the first time.

----------------------------------------------------------------------

Comment By: Arya (aryairani)
Date: 2009-01-21 08:55

Message:
Bad news:

md: md0 stopped.
md: bind<loop1>
md: bind<loop2>
md: bind<loop3>
md: bind<loop0>
md: md0 stopped.
md: unbind<loop0>
md: export_rdev(loop0)
md: unbind<loop3>
md: export_rdev(loop3)
md: unbind<loop2>
md: export_rdev(loop2)
md: unbind<loop1>
md: export_rdev(loop1)
md: bind<loop1>
md: bind<loop2>
md: bind<loop3>
md: bind<loop0>
raid5: measuring checksumming speed
   8regs     :  1292.800 MB/sec
   8regs_prefetch:  1210.400 MB/sec
   32regs    :  1184.000 MB/sec
   32regs_prefetch:  1428.800 MB/sec
raid5: using function: 32regs_prefetch (1428.800 MB/sec)
raid6: int32x1    605 MB/s
raid6: int32x2    563 MB/s
raid6: int32x4    444 MB/s
raid6: int32x8    424 MB/s
raid6: mmxx1     1355 MB/s
raid6: mmxx2     1579 MB/s
raid6: sse1x1     758 MB/s
raid6: sse1x2    1541 MB/s
raid6: sse2x1    1667 MB/s

colinux-console-nt only shows this much, which is missing the registers / EIP dump:

 [<c0103c79>] show_stack_log_lvl+0xa9/0xd0
 [<c01040eb>] show_registers+0x21b/0x3a0
 [<c0104365>] die+0xf5/0x210
 [<c010b6cc>] do_page_fault+0x38c/0x6e0
 [<c03057fa>] error_code+0x6a/0x70
 [<c010c5c8>] deactivate_task+0x18/0x30
 [<c03035d7>] __sched_text_start+0x377/0x670
 [<c0114814>] do_exit+0x7f4/0x960
 [<c010447d>] die+0x20d/0x210
 [<c010b6cc>] do_page_fault+0x38c/0x6e0
 [<c03057fa>] error_code+0x6a/0x70
 [<c010c5c8>] deactivate_task+0x18/0x30
 [<c03035d7>] __sched_text_start+0x377/0x670
 [<c0114814>] do_exit+0x7f4/0x960
 [<c010447d>] die+0x20d/0x210
 [<c010b6cc>] do_page_fault+0x38c/0x6e0
 [<c03057fa>] error_code+0x6a/0x70
 [<c010c5c8>] deactivate_task+0x18/0x30
 [<c03035d7>] __sched_text_start+0x377/0x670
 [<c0114814>] do_exit+0x7f4/0x960
 [<c010447d>] die+0x20d/0x210
 [<c010b6cc>] do_page_fault+0x38c/0x6e0
 [<c03057fa>] error_code+0x6a/0x70
 [<c010c5c8>] deactivate_task+0x18/0x30

----------------------------------------------------------------------

Comment By: Arya (aryairani)
Date: 2009-01-21 00:29

Message:
Excellent, thanks for tracking that down, henryn. I'll test tomorrow's daily build, with the temporary fix, if that will still help?

----------------------------------------------------------------------

Comment By: Henry N. (henryn)
Date: 2009-01-20 23:31

Message:
Loading and unloading a special version of the module "raid456.ko" that only calls the function "calibrate_xor_block" reproduces exactly your results: a crash in math_state_restore, lots of page faults, and Windows cannot shut down.

----------------------------------------------------------------------

Comment By: Henry N. (henryn)
Date: 2009-01-20 22:53

Message:
I assume that "pIII_sse" uses special registers that we do not save and restore in the passage page. That can crash "math_state_restore". One option would be to support these xmm registers for "sse", but currently I have no idea how.

The problem is probably near the XMMS_SAVE/XMMS_RESTORE macros at the top of include/asm-i386/xor.h:

#define XMMS_SAVE do { \
	preempt_disable(); \
	cr0 = read_cr0(); \
	clts(); \

These operations are well-known candidates for crashes and for endless page faults; we know the same from the function math_state_restore in arch/i386/traps.c. A "clts" while hardware interrupts are enabled can crash coLinux, and manipulating the register cr0 is also high risk.

The function "xor_block_pIII_sse" with its special macros XMMS_SAVE/XMMS_RESTORE needs to be checked separately in a kernel test module, outside of the raid code. I think something needs to be done there.

As a temporary measure I have disabled the usage of the xmm and mmx registers under coLinux for the xor raid functions. Please check the next autobuild.
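To illustrate the kind of standalone test described above (exercising the xor code outside of raid456), here is a minimal sketch of such a kernel test module. It is not from this report: the module name, messages, and buffer handling are made up, and it assumes a 2.6.22-era tree where drivers/md/xor.c exports xor_block() (declared in include/linux/raid/xor.h) and where "modprobe xor" has already calibrated the active template to pIII_sse, so that calling xor_block() runs the XMMS_SAVE/clts() path directly:

/*
 * xortest.c -- hypothetical sketch, not the module actually used here.
 * Calls xor_block() once so the calibrated template (pIII_sse on this
 * CPU) runs outside of the raid456 code paths.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/raid/xor.h>	/* xor_block() declaration (2.6.22-era) */

static unsigned long dst, src;

static int __init xortest_init(void)
{
	void *ptrs[2];

	dst = __get_free_page(GFP_KERNEL);
	src = __get_free_page(GFP_KERNEL);
	if (!dst || !src) {
		free_page(dst);
		free_page(src);
		return -ENOMEM;
	}
	memset((void *)dst, 0x55, PAGE_SIZE);
	memset((void *)src, 0xaa, PAGE_SIZE);

	ptrs[0] = (void *)dst;	/* destination buffer: dst ^= src */
	ptrs[1] = (void *)src;

	printk(KERN_INFO "xortest: calling xor_block()\n");
	/* With pIII_sse active this enters XMMS_SAVE, i.e. preempt_disable(),
	 * cr0 = read_cr0(), clts(), ... -- the suspected trigger for the
	 * math_state_restore GPF under coLinux. */
	xor_block(2, PAGE_SIZE, ptrs);
	printk(KERN_INFO "xortest: xor_block() returned\n");
	return 0;
}

static void __exit xortest_exit(void)
{
	free_page(dst);
	free_page(src);
}

module_init(xortest_init);
module_exit(xortest_exit);
MODULE_LICENSE("GPL");

If loading a module like this alone triggers the same math_state_restore GPF, that would confirm the problem sits in the XMMS_SAVE/XMMS_RESTORE register handling rather than in raid456 itself.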
----------------------------------------------------------------------

Comment By: Arya (aryairani)
Date: 2009-01-20 21:09

Message:
The 100% CPU crash seems reproducible if the cobd array is the first one I try to create that boot (i.e. the modules are being loaded for the first time?).

Also (though not verified), I get the 100% CPU crash when making an array based on loop block devices if the loop devices are created to point to files using relative paths rather than absolute ones.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=622063&aid=2524658&group_id=98788