Thread: [Aoetools-discuss] AoE problem with Ext4fs
From: Christoph P. <pit...@se...> - 2009-10-09 15:42:33
Hello,

I tried to install the Ubuntu 9.10 Karmic beta release to an AoE disk, with vblade running on an Ubuntu 9.04 Jaunty server. The installation works fine, but at 80% it hangs with "Scanning the CD-ROM..." and I find the following error in the dmesg output:

[ 1491.897974] ------------[ cut here ]------------
[ 1491.897976] kernel BUG at /build/buildd/linux-2.6.31/drivers/block/aoe/aoeblk.c:177!
[ 1491.897979] invalid opcode: 0000 [#1] SMP
[ 1491.897982] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0/evt_media_change
[ 1491.897984] CPU 1
[ 1491.897986] Modules linked in: nls_utf8 ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs exportfs reiserfs aoe ppdev lp parport snd_hda_codec_realtek snd_hda_intel bridge snd_hda_codec stp snd_pcm_oss snd_mixer_oss snd_pcm bnep snd_seq_dummy snd_seq_oss snd_seq_midi btusb snd_rawmidi iptable_filter snd_seq_midi_event snd_seq ip_tables x_tables dm_crypt snd_timer snd_seq_device joydev isight_firmware appletouch snd applesmc led_class input_polldev soundcore snd_page_alloc squashfs aufs isofs hid_apple usbhid nls_iso8859_1 nls_cp437 vfat fat ohci1394 sky2 ssb ieee1394 intel_agp i915 drm i2c_algo_bit video output
[ 1491.898027] Pid: 11780, comm: apt-get Not tainted 2.6.31-11-generic #36-Ubuntu MacBook4,1
[ 1491.898029] RIP: 0010:[<ffffffffa0317833>]  [<ffffffffa0317833>] aoeblk_make_request+0x243/0x260 [aoe]
[ 1491.898038] RSP: 0018:ffff88001057dc88  EFLAGS: 00010296
[ 1491.898040] RAX: 000000000000002c RBX: ffff88003d1e9540 RCX: 000000000000001e
[ 1491.898042] RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0000000000000246
[ 1491.898044] RBP: ffff88001057dcd8 R08: 0000000000000033 R09: 000000000000bd77
[ 1491.898046] R10: 0000000000000005 R11: 0000000000000000 R12: ffff88004851bc00
[ 1491.898048] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000009800001
[ 1491.898050] FS:  00007f795560c710(0000) GS:ffff8800019e9000(0000) knlGS:0000000000000000
[ 1491.898053] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1491.898055] CR2: 00007f3d777b6a20 CR3: 0000000008d4c000 CR4: 00000000000006a0
[ 1491.898057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1491.898059] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1491.898061] Process apt-get (pid: 11780, threadinfo ffff88001057c000, task ffff880040d296b0)
[ 1491.898063] Stack:
[ 1491.898064]  0000000000000000 ffff88003d1e9540 ffffffff81073b90 ffff88001057dca0
[ 1491.898067] <0> ffff88001057dca0 0000000000000246 ffff88003d1e9540 ffff880078590fd0
[ 1491.898071] <0> 0000000000000000 0000000000000000 ffff88001057dd98 ffffffff8125be13
[ 1491.898075] Call Trace:
[ 1491.898081]  [<ffffffff81073b90>] ? autoremove_wake_function+0x0/0x40
[ 1491.898086]  [<ffffffff8125be13>] generic_make_request+0x1a3/0x4d0
[ 1491.898091]  [<ffffffff810d7df0>] ? mempool_alloc_slab+0x10/0x20
[ 1491.898094]  [<ffffffff81073a76>] ? bit_waitqueue+0x16/0xc0
[ 1491.898097]  [<ffffffff8125c1bd>] submit_bio+0x7d/0x110
[ 1491.898100]  [<ffffffff8125eee6>] blkdev_issue_flush+0x96/0xe0
[ 1491.898103]  [<ffffffff811a90ac>] ext4_sync_file+0x12c/0x190
[ 1491.898107]  [<ffffffff810de9d8>] ? do_writepages+0x28/0x50
[ 1491.898110]  [<ffffffff8113e406>] vfs_fsync+0x86/0xf0
[ 1491.898114]  [<ffffffff810fa4a9>] sys_msync+0x149/0x1f0
[ 1491.898118]  [<ffffffff81011fc2>] system_call_fastpath+0x16/0x1b
[ 1491.898119] Code: b2 31 a0 31 c0 e8 c1 94 20 e1 48 8b 7d b8 be f4 ff ff ff e8 00 c6 e2 e0 e9 62 ff ff ff 48 c7 c7 30 af 31 a0 31 c0 e8 a0 94 20 e1 <0f> 0b eb fe 48 c7 c7 08 b2 31 a0 31 c0 e8 8e 94 20 e1 0f 0b eb
[ 1491.898148] RIP  [<ffffffffa0317833>] aoeblk_make_request+0x243/0x260 [aoe]
[ 1491.898153] RSP <ffff88001057dc88>
[ 1491.898155] ---[ end trace eb4bd991ec51f1af ]---

The strange thing is that everything works if I use an ext3 filesystem instead of ext4 (which is the default for Ubuntu 9.10). The exception is installing the boot loader, but I think that is a different problem. On the server I am running vblade version 16-1ubuntu2, but I also tried a self-compiled vblade version 20. The client runs the standard Ubuntu 9.10 kernel, which is 2.6.31-11 with aoe module version 47.

I'm not sure whether this is AoE's fault or ext4's, and I don't know how to investigate it further, so please let me know if you have any clue how to find the problem :-)

best regards,
Christoph
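P.S.  Judging from the call trace (blkdev_issue_flush called from
ext4_sync_file), the crash seems to be triggered when ext4 sends a
cache flush down to the aoe driver.  A rough way to provoke that path
by hand might be something like the following (the device path
/dev/etherd/e0.0 and the mount point are just placeholders for my
setup):

  mkfs.ext4 /dev/etherd/e0.0
  mount /dev/etherd/e0.0 /mnt
  # conv=fsync makes dd call fsync() at the end, which forces a flush
  # request through the block layer down to the aoe driver
  dd if=/dev/zero of=/mnt/testfile bs=1M count=64 conv=fsync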
From: Ed C. <ec...@co...> - 2009-10-09 16:42:35
On Fri Oct 9 11:43:17 EDT 2009, pit...@se... wrote:
> Hello,
> I tried to install Ubuntu 9.10 Karmic beta release to an AoE disk. Running
> vblade on Ubuntu 9.04 Jaunty server.
>
> Installation works fine but at 80% it hangs with "Scanning the CD-ROM..."
> and I find following error in dmesg output:
>
> [ 1491.897974] ------------[ cut here ]------------
> [ 1491.897976] kernel BUG at
> /build/buildd/linux-2.6.31/drivers/block/aoe/aoeblk.c:177!

Hi.  This issue is resolved in the stable release, 2.6.31.y.

Here are references to the bugzilla cases:

  http://bugzilla.kernel.org/show_bug.cgi?id=14343
  http://bugzilla.kernel.org/show_bug.cgi?id=13942

And the specific fix is ...

  http://bugzilla.kernel.org/attachment.cgi?id=23068

--
  Ed Cashin <ec...@co...>
  http://www.coraid.com/
  http://noserose.net/e/
From: Ed C. <ec...@co...> - 2009-10-09 17:07:06
On Fri Oct 9 12:43:10 EDT 2009, ec...@co... wrote:
...
> Hi.  This issue is resolved in the stable release, 2.6.31.y.

Sorry for the cryptic reference.  The kernels released at kernel.org
have three-part version numbers, like 2.6.31, but "stable" kernels are
released based on these three-part releases.  The current stable
release of 2.6.31 is 2.6.31.3.  These four-part stable versions are
often called "2.6.x.y."

--
  Ed
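P.S.  To check which release a machine is actually running:

  uname -r    # e.g. 2.6.31.3 for the current stable release, or a
              # distribution string like 2.6.31-11-generic on Ubuntu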
From: Christoph P. <pit...@se...> - 2009-10-09 17:09:00
On Fri, 9 Oct 2009 12:40:42 -0400, Ed Cashin <ec...@co...> wrote:
> Hi.  This issue is resolved in the stable release, 2.6.31.y.
>
> And the specific fix is ...
>   http://bugzilla.kernel.org/attachment.cgi?id=23068

Thank you very much for your fast answer!  I just downloaded
aoe6-73.tar.gz and used that aoe.ko module to install the client, which
worked fine.

Just one question left: what's the difference between the aoe6 tar.gz
and the aoe kernel module in the mainline kernel?  Which one should I
use?

regards,
Christoph
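P.S.  For the archives: building the standalone driver was roughly the
following (the exact make targets may differ between aoe6 releases, so
check the README in the tarball):

  tar xzf aoe6-73.tar.gz
  cd aoe6-73
  make                # builds aoe.ko against the running kernel
  make install        # installs it over the in-tree module
  modprobe -r aoe     # unload the old driver, if loaded ...
  modprobe aoe        # ... and load the new one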
From: Ed C. <ec...@co...> - 2009-10-09 17:27:17
On Fri Oct 9 13:08:44 EDT 2009, pit...@se... wrote:
...
> Whats the difference between the aoe6 tar.gz and the aoe kernel module in
> the mainline kernel?
> Which one should I use?

The driver at CORAID's website has features that we intend to push into
the kernel.org driver.  Which one you use will depend on whether you
need these features right now.  It might be more convenient to use the
aoe driver in your Linux distribution.

The changelog is the definitive reference for the answer to your
question:

  http://support.coraid.com/support/linux/aoe6-Changelog

... but some highlights include:

  * the driver handles I/O requests instead of bios, so that the I/O
    schedulers can be used,

  * system device numbers are allocated dynamically, so that the
    website driver can support a greater number of AoE targets and a
    wider range of LUN numbers per shelf address,

  * AoE responses can return from a MAC address other than the one the
    command was sent to.  The CORAID VS relies on this feature.

  * packets with a data payload over a page size can be used in the
    CORAID website driver.  When the network can support jumbo frames
    over 4200 octets, this feature can increase performance.

--
  Ed
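P.S.  If you ever want to double-check which aoe driver build a machine
is actually running, something along these lines usually tells you (the
exact output format can vary between the kernel.org and website
builds):

  modinfo aoe | grep -i version
  dmesg | grep -i 'aoe:'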
From: Matthew I. <ma...@di...> - 2009-10-14 17:55:34
I'm testing an active/active setup that uses aoe and was hoping for
some feedback.  The basic design has a storage device that runs vbladed
(aoe-storage) and an aoe initiator (controller) that creates a raid0
using mdadm from the aoe devices.  From there, lvm (clustered) slices
out storage for sharing block devices.  I don't want to concentrate on
the layers above that (clustered file systems, etc) but mainly on the
active/active portion of this interacting with raid0 (via md, not lvm)
and aoe.

Some diagrams that may help in understanding...

-----
"Physical" diagram:

    (export network block devices [iscsi, aoe, etc])
                   -----------
                  /           \
           active/             \active
    +-------------+           +-------------+
    | controller1 |           | controller2 |
    +-------------+           +-------------+
                  \           /
                   +---------+
                        |
    +-------------+                    +-------------+
    | aoe-storage | --- drbd raid1 --- |  secondary  |
    +-------------+                    +-------------+

-----
Layer diagram:

    +-----------------+
    | (net block dev) |  controller1/2
    +-----------------+
    | lvm2 clustered  |  controller1/2
    +-----------------+
    | md: linux-raid0 |  controller1/2
    +-----------------+
    |  aoe initiator  |  controller1/2
    +-----------------+
    |   aoe target    |  aoe-storage
    +-----------------+

--
Matth
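P.S.  For concreteness, the controller-side setup is roughly the
following; the device names (e0.0, e0.1) and the volume group name are
just placeholders:

  # assemble a raid0 across the exported AoE devices
  mdadm --create /dev/md0 --level=0 --raid-devices=2 \
        /dev/etherd/e0.0 /dev/etherd/e0.1

  # put clustered LVM on top of the md device
  pvcreate /dev/md0
  vgcreate --clustered y sharedvg /dev/md0
  lvcreate -L 100G -n vol0 sharedvg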
From: Ed C. <ec...@co...> - 2009-10-15 14:56:02
If you would like to have two AoE initiators running md, with AoE
targets as the md components, then I think you will need an
active/passive setup instead of an active/active setup, because md is
not "cluster aware."

However, most of the problems I can think of with using md on two
initiators at the same time over a common set of AoE targets have to do
with failing individual components, or with the consistency of the
state of the RAID itself, and those issues go away with RAID 0, since
it's not redundant and immediately fails when one component fails.

So I think that if you forget about most of your scenario and only ask
in a Linux Software RAID forum about the feasibility of having two
hosts doing RAID 0 over the same AoE targets at the same time, you
might get an answer.  You'll probably get a lot of confused looks, too.

Good luck!

--
  Ed
From: Ed C. <ec...@co...> - 2009-10-15 15:19:59
On Thu Oct 15 10:56:55 EDT 2009, ec...@co... wrote:
...
> However, most of the problems I can think of with using md on two
> initiators at the same time over a common set of AoE targets have to
> do with failing individual components, or with the consistency of the
> state of the RAID itself, and these issues go away with a RAID 0,
> since it's not redundant and immediately fails when one component
> fails.

What am I saying?  The main problem is the page cache.  The I/O for the
components of the RAID 0 will go through the page cache on each of the
two AoE initiators, and without some way for the two hosts to make sure
that the page cache contents match (e.g., "I just changed the data
here---update your cache!") the scenario won't work.

Because not all the data is on the md components but can also be in RAM
on the host doing md RAID, you will have data corruption if you
introduce a third location for data to be: the RAM of the second AoE
initiator doing md.

--
  Ed Cashin <ec...@co...>
  http://www.coraid.com/
  http://noserose.net/e/
From: Matthew I. <ma...@di...> - 2009-10-15 16:42:28
Thanks for the response Ed - I'll follow up on the linux raid list.
What are your thoughts on a cluster-aware LVM (CLVM) for striping and
removing the md devices altogether?

--
Matth
From: Ed C. <ec...@co...> - 2009-10-15 18:27:08
On Thu, Oct 15, 2009 at 09:42:15AM -0700, Matthew Ingersoll wrote:
> Thanks for the response Ed - I'll follow up on the linux raid list.
> What are your thoughts on a cluster aware LVM (CLVM) for striping and
> removing the md devices all together?

If you learn more about that, please get back to the list about it.

From what I've heard about CLVM, it's just like regular LVM2 but with
some messaging between hosts when the LVM configurations are updated.
So if you were using striped logical volumes, it seems like you'd have
the same problem as with md, if you were using two different AoE
initiators that were writing to the same areas on the AoE targets.

--
  Ed Cashin
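P.S.  By "striped logical volumes" I mean something like this (the
names and sizes are only illustrative):

  # a logical volume striped across two physical volumes, 64 KiB stripe size
  lvcreate -i 2 -I 64 -L 100G -n stripedlv somevg

The striping itself behaves much like md RAID 0, which is why the same
page cache consistency problem would apply.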
From: Matthew I. <ma...@di...> - 2009-10-15 16:56:15
On Oct 15, 2009, at 8:19 AM, Ed Cashin wrote:
> What am I saying?  The main problem is the page cache.  The I/O for
> the components of the RAID 0 will go through the page cache on each of
> the two AoE initiators, and without some way for the two hosts to make
> sure that the page cache contents match (e.g., "I just changed the
> data here---update your cache!") the scenario won't work.

At what layer(s) does the page cache come into play?  I know the gist
of how it functions but not the dirty details.  I thought it only came
into play when using a filesystem?

For example, if the controllers shared a block device using vbladed,
wouldn't running with direct io ensure consistency on both controllers?

--
Matth
From: Matthew I. <ma...@di...> - 2009-10-15 17:44:46
On Oct 15, 2009, at 9:55 AM, Matthew Ingersoll wrote:
>
> On Oct 15, 2009, at 8:19 AM, Ed Cashin wrote:
>> What am I saying?  The main problem is the page cache.  The I/O for
>> the components of the RAID 0 will go through the page cache on each
>> of the two AoE initiators, and without some way for the two hosts to
>> make sure that the page cache contents match (e.g., "I just changed
>> the data here---update your cache!") the scenario won't work.
>
> At what layer(s) does page cache come into play?  I know the gist of
> how it functions but not the dirty details.  I thought it only came
> into play when using a filesystem?
> For example, if the controllers shared a block device using vbladed,
> running with direct io wouldn't ensure consistency on both
> controllers?

And what about the buffer cache?  Does direct/O_DIRECT mode bypass this
also?  In the setup I described there are no filesystems, so the page
cache wouldn't play a role but the buffer cache would (right?).

--
Matth
From: Ed C. <ec...@co...> - 2009-10-15 17:50:08
On Thu Oct 15 12:56:00 EDT 2009, ma...@di... wrote:
...
> At what layer(s) does page cache come into play?  I know the gist of
> how it functions but not the dirty details.  I thought it only came
> into play when using a filesystem?
> For example, if the controllers shared a block device using vbladed,
> running with direct io wouldn't ensure consistency on both controllers?

Even when you're not using a filesystem you have caching going on,
unless you request otherwise.

You can see the difference with a "dd" that has the option for doing
O_DIRECT.  With that option, you're bypassing the page cache.  Without
it, you will notice that there's lots of writing at first, but then it
slows down to a crawl when the system finally decides to flush out the
data from all the pages in RAM that have been dirtied.

I guess mkfs is a more obvious one, since it has a progress meter, but
you can get progress from dd via SIGUSR1---see the manpage for details.

--
  Ed
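P.S.  For example (the device path below is just a placeholder: use
whatever AoE device you're testing with, and note that this overwrites
it):

  # buffered: data lands in the page cache first and is flushed later
  dd if=/dev/zero of=/dev/etherd/e0.0 bs=1M count=1024

  # O_DIRECT: bypasses the page cache on the initiator
  dd if=/dev/zero of=/dev/etherd/e0.0 bs=1M count=1024 oflag=direct

  # in another terminal, ask a running dd to print its progress
  kill -USR1 $(pidof dd)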