From: SourceForge.net <no...@so...> - 2006-10-05 17:00:27
Bugs item #1569947, was opened at 2006-10-03 14:48
Message generated for change (Comment added) made by henryn
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=622063&aid=1569947&group_id=98788

Please note that this message contains a full copy of the comment thread, including the initial issue submission, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Markus Müller (privi)
Assigned to: Nobody/Anonymous (nobody)
Summary: Data corruption on md/raid5 under 0.6.3 and also 0.7.1-hn14

Initial Comment:
There is a problem somewhere in the coLinux system and/or kernel involving md and/or raid5: data corruption, filesystem errors, and sometimes kernel panics. If you write a 1 GB file, for example, it is damaged in most cases. The corruption gets much worse while the RAID is rebuilding: with no rebuild running, roughly every second 1 GB file comes out undamaged; during a rebuild there are kernel panics, and any file you write that is larger than about 20 MB is corrupted. The panics mostly occur when you read the file back; while writing there is no hint at all that something bad is going on. On reading there are often messages like:

attempt to access beyond end of device
md0: rw=0, want=18446744061337158144, limit=4367491072

The problem occurs both when the cobds are mapped via files:

cobd1=M:\data.img
cobd2=N:\data.img
...

and when they are mapped via block devices (cobd1=\Device\HarddiskVolume1, ...). It occurs under 0.6.3 as well as under 0.7.1-hn14; I reproduced it on both.

The problem does NOT occur when simply creating files via dd on a single mounted (cobd) device. It is also NOT reproducible when running the same raid under native Linux, without coLinux. So this is NOT a hardware problem! I have hardware here to test with; I simply need guidance on what to test to narrow the problem down. Please send input to col...@pr....
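Since every test file in this report is written from /dev/zero, all intact files of the same size must produce one and the same md5 sum. A reference sum can be computed without touching the suspect array at all (a sketch using plain POSIX tools; paths are illustrative):

```shell
# Hash 1 GiB of zeros straight from /dev/zero; an intact file written
# with "dd_rescue -m 1G /dev/zero ./f.N" must match this sum exactly.
REF=$(dd if=/dev/zero bs=1M count=1024 2>/dev/null | md5sum | cut -d' ' -f1)
echo "reference md5 for 1 GiB of zeros: $REF"

# Then compare against the files on the array (mount point illustrative):
# md5sum /mnt/md0/f.1 /mnt/md0/f.2 /mnt/md0/f.3
```

In the sessions quoted in this thread, the undamaged files indeed all show the same sum (cd573cfaace07e7949bc0c46028904ff); any other sum, or an I/O error from md5sum, marks a corrupted file.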
Produced with:
==============
chieftec:/mnt/md0# while true; do for i in 1 2 3; do dd_rescue -m 1G /dev/zero ./f.$i; done; for i in 1 2 3; do ls -la ./f.$i; md5sum ./f.$i; done; for i in 1 2 3; do rm ./f.$i; done; done

dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 5564kB/s, avg.rate: 8975kB/s, avg.load: 18.5%
Summary for /dev/zero -> ./f.1:
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 0kB/s, avg.rate: 8975kB/s, avg.load: 18.5%
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 5759kB/s, avg.rate: 7320kB/s, avg.load: 15.7%
Summary for /dev/zero -> ./f.2:
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 0kB/s, avg.rate: 7320kB/s, avg.load: 15.7%
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 5602kB/s, avg.rate: 7194kB/s, avg.load: 15.9%
Summary for /dev/zero -> ./f.3:
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 0kB/s, avg.rate: 7194kB/s, avg.load: 15.9%

-rw-r----- 1 root root 1073741824 Oct  3 14:17 ./f.1
md5sum: ./f.1: Input/output error
-rw-r----- 1 root root 1073741824 Oct  3 14:20 ./f.2
2d261173f8d9f9b8d4257df01d51be48  ./f.2
-rw-r----- 1 root root 1073741824 Oct  3 14:22 ./f.3
md5sum: ./f.3: Input/output error
Segmentation fault

Kernel Panic:
=============
md: using 128k window, over a total of 311963648 blocks.
md: md0: sync done.
md: md0 stopped.
md: unbind<cobd8>
md: export_rdev(cobd8)
md: unbind<cobd7>
md: export_rdev(cobd7)
md: unbind<cobd6>
md: export_rdev(cobd6)
md: unbind<cobd5>
md: export_rdev(cobd5)
md: unbind<cobd4>
md: export_rdev(cobd4)
md: unbind<cobd3>
md: export_rdev(cobd3)
md: unbind<cobd2>
md: export_rdev(cobd2)
md: unbind<cobd1>
md: export_rdev(cobd1)
md: bind<cobd1>
md: bind<cobd2>
md: bind<cobd3>
md: bind<cobd4>
md: bind<cobd5>
md: bind<cobd6>
md: bind<cobd7>
md: bind<cobd8>
raid5: device cobd7 operational as raid disk 6
raid5: device cobd6 operational as raid disk 5
raid5: device cobd5 operational as raid disk 4
raid5: device cobd4 operational as raid disk 3
raid5: device cobd3 operational as raid disk 2
raid5: device cobd2 operational as raid disk 1
raid5: device cobd1 operational as raid disk 0
raid5: allocated 8374kB for md0
raid5: raid level 5 set md0 active with 7 out of 8 devices, algorithm 2
RAID5 conf printout:
 --- rd:8 wd:7 fd:1
 disk 0, o:1, dev:cobd1
 disk 1, o:1, dev:cobd2
 disk 2, o:1, dev:cobd3
 disk 3, o:1, dev:cobd4
 disk 4, o:1, dev:cobd5
 disk 5, o:1, dev:cobd6
 disk 6, o:1, dev:cobd7
RAID5 conf printout:
 --- rd:8 wd:7 fd:1
 disk 0, o:1, dev:cobd1
 disk 1, o:1, dev:cobd2
 disk 2, o:1, dev:cobd3
 disk 3, o:1, dev:cobd4
 disk 4, o:1, dev:cobd5
 disk 5, o:1, dev:cobd6
 disk 6, o:1, dev:cobd7
 disk 7, o:1, dev:cobd8
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 311963648 blocks.
ReiserFS: dm-9: found reiserfs format "3.6" with standard journal
ReiserFS: dm-9: using ordered data mode
ReiserFS: dm-9: journal params: device dm-9, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
ReiserFS: dm-9: checking transaction log (dm-9)
ReiserFS: dm-9: Using r5 hash to sort names
attempt to access beyond end of device dm-9: rw=0, want=11015784704, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=18446744068373508432, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=6890187672, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=16134801512, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=18446744062966200928, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=18446744068915172520, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=4661400440, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=11234349824, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=18446744070176263304, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=13845667976, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=18446744070797588552, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=8587809376, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=13814691040, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=8825944912, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=15052302704, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=6184617992, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=18446744061014929280, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=15635940952, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=18446744073305828400, limit=4367491072
attempt to access beyond end of device dm-9: rw=0, want=11015784704, limit=4367491072
Buffer I/O error on device dm-9, logical block 1376973087
attempt to access beyond end of device dm-9: rw=0, want=11015784704, limit=4367491072
Buffer I/O error on device dm-9, logical block 1376973087
ReiserFS: md0: found reiserfs format "3.6" with standard journal
ReiserFS: md0: using ordered data mode
ReiserFS: md0: journal params: device md0, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
ReiserFS: md0: checking transaction log (md0)
ReiserFS: md0: Using r5 hash to sort names
attempt to access beyond end of device md0: rw=0, want=8341820328, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744063147890160, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744063115474816, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744069584729480, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744063052644416, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=16766954656, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744059358993104, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744059733663912, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744063091516640, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744061337158144, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=12829122048, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=12804994112, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744058686830464, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=8341820328, limit=4367491072
Buffer I/O error on device md0, logical block 1042727540
attempt to access beyond end of device md0: rw=0, want=8341820328, limit=4367491072
Buffer I/O error on device md0, logical block 1042727540
attempt to access beyond end of device md0: rw=0, want=8341820328, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744063147890160, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744063115474816, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744069584729480, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744063052644416, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=16766954656, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744059358993104, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744059733663912, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744063091516640, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744061337158144, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=12829122048, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=12804994112, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=18446744058686830464, limit=4367491072
attempt to access beyond end of device md0: rw=0, want=8341820328, limit=4367491072
Buffer I/O error on device md0, logical block 1042727540
attempt to access beyond end of device md0: rw=0, want=8341820328, limit=4367491072
Buffer I/O error on device md0, logical block 1042727540
ReiserFS: md0: warning: vs-4075: reiserfs_free_block: block 2417127151 is out of range on md0
ReiserFS: md0: warning: vs-4075: reiserfs_free_block: block 1600624263 is out of range on md0
ReiserFS: md0: warning: vs-4075: reiserfs_free_block: block 1603640255 is out of range on md0
ReiserFS: md0: warning: vs-4075: reiserfs_free_block: block 2748418111 is out of range on md0
ReiserFS: md0: warning: vs-4075: reiserfs_free_block: block 2967712923 is out of range on md0
ReiserFS: md0: warning: vs-4080: reiserfs_free_block: free_block (md0:255068793)[dev:blocknr]: bit already cleared
ReiserFS: md0: warning: vs-4075: reiserfs_free_block: block 2547981332 is out of range on md0
ReiserFS: md0: warning: vs-4075: reiserfs_free_block: block 2501147481 is out of range on md0
ReiserFS: md0: warning: vs-4080: reiserfs_free_block: free_block (md0:255815333)[dev:blocknr]: bit already cleared
ReiserFS: md0: warning: vs-4080: reiserfs_free_block: free_block (md0:532900159)[dev:blocknr]: bit already cleared
Unable to handle kernel NULL pointer dereference at virtual address 00000db0
 printing eip:
c01a9c51
*pde = 00000000
Oops: 0002 [#1]
PREEMPT
Modules linked in:
CPU:    0
EIP:    0060:[<c01a9c51>]    Not tainted VLI
EFLAGS: 00010286   (2.6.12-co-default)
EIP is at set_bit_in_list_bitmap+0x31/0x60
eax: 00000000   ebx: 0000f9d8   ecx: d6918c00   edx: e0a3c724
esi: 00006d93   edi: e09c80f4   ebp: d5867b4c   esp: d5867b40
ds: 007b   es: 007b   ss: 0068
Process rm (pid: 1414, threadinfo=d5866000 task=deb3ba40)
Stack: d5867f0c 7cec6d93 d6918c00 d5867b78 c01aeafc d6918c00 7cec6d93 e09c80f4
       0006a402 00000000 e09c8000 d5867f0c d6918c00 7cec6d93 d5867ba0 c018a54b
       d5867f0c d6918c00 7cec6d93 00000001 dde1107c d5867bec 000000fa c3cc6c14
Call Trace:
 [<c0103c15>] show_stack+0x75/0x90
 [<c0103d6b>] show_registers+0x11b/0x190
 [<c0103f55>] die+0xd5/0x170
 [<c010ee02>] do_page_fault+0x422/0x70c
 [<c010381f>] error_code+0x4f/0x60
 [<c01aeafc>] journal_mark_freed+0xac/0x230
 [<c018a54b>] reiserfs_free_block+0x2b/0x70
 [<c01a6dcc>] prepare_for_delete_or_cut+0x55c/0x750
 [<c01a7f68>] reiserfs_cut_from_item+0xb8/0x500
 [<c01a8671>] reiserfs_do_truncate+0x231/0x560
 [<c01a799b>] reiserfs_delete_object+0x2b/0x80
 [<c01912ff>] reiserfs_delete_inode+0x6f/0xd0
 [<c016d6f3>] generic_delete_inode+0x93/0x140
 [<c016358c>] sys_unlink+0xbc/0x110
 [<c0102849>] syscall_call+0x7/0xb
Code: 56 53 8b 4d 08 8b 7d 10 8b 41 0c 8d 1c c5 00 00 00 00
8b 45 0c f7 f3 89 c3 8b 47 04 89 d6 8b 14 98 85 d2 74 15 8b 04 98 8b 40 04 <0f> ab 30 31 c0 8d 65 f4 5b 5e 5f 5d c3 89 f6 51 e8 aa fe ff ff

----------------------------------------------------------------------

>Comment By: Henry N. (henryn)
Date: 2006-10-05 19:00

Message:
Logged In: YES
user_id=579204

Hello, I tend not to trust reiserfs plus the md tools, and all of this running on a cobd. Does this also happen if you use ext2 as the file system? What does "dd_rescue" do? I don't have it under SuSE. What can I use instead? A file with some random data, for example from memory?

Henry

----------------------------------------------------------------------

Comment By: Markus Müller (privi)
Date: 2006-10-04 20:01

Message:
Logged In: YES
user_id=1612112

Mistake!! Please ignore the "NOT" in my last post! Correct is: the problem IS also reproducible if I write to a single drive, for example to cobd1=M:\data.img. ...

----------------------------------------------------------------------

Comment By: Markus Müller (privi)
Date: 2006-10-04 20:00

Message:
Logged In: YES
user_id=1612112

I have now reproduced the problem on just one cobd, mapped into coLinux via "cobd1=M:\linux.img.a". I created three files there, 20 GB per file, and joined them via "mdadm -C /dev/md0 /dev/loop[0-2] --level 5 --raid-disks 3; mkreiserfs /dev/md0; mount /dev/md0 /dev/md0; for i in 1 2 3 4; do dd_rescue -m 1G /dev/zero ./f.$i; done; for i in 1 2 3 4; do ls -la ./f.$i; md5sum ./f.$i; done":

-rw-r----- 1 root root 1073741824 Oct  4 19:47 ./f.1
cd573cfaace07e7949bc0c46028904ff  ./f.1
-rw-r----- 1 root root 1073741824 Oct  4 19:49 ./f.2
8bb182ea4a691e880808700df7a8c6cd  ./f.2
-rw-r----- 1 root root 1073741824 Oct  4 19:50 ./f.3
cd573cfaace07e7949bc0c46028904ff  ./f.3
-rw-r----- 1 root root 1073741824 Oct  4 19:51 ./f.4
cd573cfaace07e7949bc0c46028904ff  ./f.4

Then, without removing the files, I computed the md5 sums again, and saw that the sum of file 2 is no longer wrong, but now the 3rd file is wrong:

chieftec:/mnt/md0# for i in 1 2 3 4; do ls -la ./f.$i; md5sum ./f.$i; done
-rw-r----- 1 root root 1073741824 Oct  4 19:47 ./f.1
cd573cfaace07e7949bc0c46028904ff  ./f.1
-rw-r----- 1 root root 1073741824 Oct  4 19:49 ./f.2
cd573cfaace07e7949bc0c46028904ff  ./f.2
-rw-r----- 1 root root 1073741824 Oct  4 19:50 ./f.3
7f2c9cba45d23fd0c9c8f9cb82125df6  ./f.3
-rw-r----- 1 root root 1073741824 Oct  4 19:51 ./f.4
cd573cfaace07e7949bc0c46028904ff  ./f.4
chieftec:/mnt/md0#

OK, I think this shows, at least in this test, a problem on reading from a raid5... Please help me find the problem!

chieftec:/mnt/md0# uname -a
Linux chieftec.priv.de 2.6.12-co-default #2 Sun May 28 02:50:44 CEST 2006 i686 GNU/Linux
chieftec:/mnt/md0#

----------------------------------------------------------------------

Comment By: Markus Müller (privi)
Date: 2006-10-03 18:15

Message:
Logged In: YES
user_id=1612112

The problem is also NOT reproducible if I write to a single drive, for example to cobd1=M:\data.img. A further indicator that this is not a hardware- or driver-caused problem, but that the cause lies in coLinux.
chieftec:/mnt/sda1# for i in 1 2 3 4 5; do dd_rescue -m 1G /dev/zero ./f.$i; done; for i in 1 2 3 4 5; do ls -la ./f.$i; md5sum ./f.$i; done; for i in 1 2 3 4 5; do rm ./f.$i; done

dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 43279kB/s, avg.rate: 45956kB/s, avg.load: 85.9%
Summary for /dev/zero -> ./f.1:
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 0kB/s, avg.rate: 45954kB/s, avg.load: 85.9%
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 37371kB/s, avg.rate: 40409kB/s, avg.load: 86.6%
Summary for /dev/zero -> ./f.2:
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 0kB/s, avg.rate: 40406kB/s, avg.load: 86.6%
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 41206kB/s, avg.rate: 40830kB/s, avg.load: 85.4%
Summary for /dev/zero -> ./f.3:
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 0kB/s, avg.rate: 40828kB/s, avg.load: 85.4%
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 43374kB/s, avg.rate: 38770kB/s, avg.load: 85.3%
Summary for /dev/zero -> ./f.4:
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 0kB/s, avg.rate: 38768kB/s, avg.load: 85.3%
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 40382kB/s, avg.rate: 40590kB/s, avg.load: 84.4%
Summary for /dev/zero -> ./f.5:
dd_rescue: (info): ipos: 1048576.0k, opos: 1048576.0k, xferd: 1048576.0k
errs: 0, errxfer: 0.0k, succxfer: 1048576.0k
+curr.rate: 0kB/s, avg.rate: 40587kB/s, avg.load: 84.4%

-rw-r----- 1 root root 1073741824 Oct  3 18:10 ./f.1
cd573cfaace07e7949bc0c46028904ff  ./f.1
-rw-r----- 1 root root 1073741824 Oct  3 18:11 ./f.2
cd573cfaace07e7949bc0c46028904ff  ./f.2
-rw-r----- 1 root root 1073741824 Oct  3 18:11 ./f.3
cd573cfaace07e7949bc0c46028904ff  ./f.3
-rw-r----- 1 root root 1073741824 Oct  3 18:12 ./f.4
cd573cfaace07e7949bc0c46028904ff  ./f.4
-rw-r----- 1 root root 1073741824 Oct  3 18:12 ./f.5
cd573cfaace07e7949bc0c46028904ff  ./f.5

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=622063&aid=1569947&group_id=98788
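The single-cobd reproduction quoted in the thread assembles /dev/loop[0-2] with mdadm, which implies a losetup step that is not shown: the three 20 GB backing files must first be attached to loop devices. A sketch of the full setup under that assumption (backing-file names and mount points are illustrative, not from the report; requires root):

```shell
# Attach three backing files (created beforehand on the cobd-mounted
# filesystem) to loop devices, then build the RAID5 on top of them,
# as in the 2006-10-04 20:00 comment above.
for i in 0 1 2; do
    losetup /dev/loop$i /mnt/cobd1/raidfile.$i
done
mdadm -C /dev/md0 /dev/loop[0-2] --level 5 --raid-disks 3
mkreiserfs /dev/md0
mount /dev/md0 /mnt/md0
```

On Henry's question about dd_rescue: it is a dd variant built for data recovery (it keeps going after read errors and prints progress), but for a pure write-and-verify test like the one here, plain dd should be an adequate stand-in, e.g. "dd if=/dev/zero of=./f.1 bs=1M count=1024" in place of "dd_rescue -m 1G /dev/zero ./f.1".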