From: Marc G. <gr...@at...> - 2009-01-29 10:21:24
Hi,

I opened a bug for this problem:
https://bugzilla.atix.de/show_bug.cgi?id=325

I will describe/discuss my findings there. I think I have a solution already.

Gordan, you might want to add yourself to this bug.

Regards
Marc.

BTW: I would not use the current comoonics-bootimage from preview for
clusters with multiple NICs (meaning ones with different drivers)! Wait a
day or two and I'll come up with a new version.

On Wednesday 21 January 2009 16:52:44 Gordan Bobic wrote:
> On Wed, 21 Jan 2009 13:19:45 +0100, Marc Grimme <gr...@at...> wrote:
> >> It would appear that
> >> /opt/atix/comoonics-bootimage/boot-scripts/etc/rhel5/hardware-lib.sh has
> >> gone through a few changes in the past few months, which, unfortunately,
> >> break it for me.
> >>
> >> The problem is in the ordering of the detected NICs. On one of my
> >> systems I have a dual e1000 built into the mobo, and an e100 as an
> >> add-in card. /etc/modprobe.conf lists eth0 and eth1 as the e1000s, and
> >> eth2 as e100. This works fine with hardware-lib.sh v1.5, but with v1.7
> >> the ordering seems to be both unstable (about 1/10 of the time it'll
> >> actually get the NIC ordering as expected and specified in cluster.conf
> >> and the rest of the time it'll do something different) and inconsistent
> >> with what is in cluster.conf and modprobe.conf.
> >
> > That's strange. I have the same problem on one cluster, just as you
> > describe it. One time everything works and the other time it doesn't.
> > But all other clusters work.
> >
> > The reason why I changed the hw detection for rhel5 is because it didn't
> > work for VMs (especially kvm) and I didn't find any problems on all the
> > other clusters (except for mine and yours).
> >
> > I think I have to look deeper into that matter.
>
> I made a rudimentary attempt at rectifying it by explicitly sorting the
> module list, but that didn't fix it.
> The problem is that the eth* binding ends up being done in the order the
> drivers are loaded (i.e. if I load the e100 driver before the e1000 driver,
> e100 ends up being eth0). This seems to override and ignore any settings
> listed in modprobe.conf, and more disturbingly, it seems to ignore the
> by-MAC bindings in cluster.conf which should really have the highest
> precedence (but either way they should agree with modprobe.conf if
> everything is set up right).
>
> > So what you say is if you just change hardware-lib.sh from 1.7 to 1.5
> > everything works fine?
>
> Yes. Note, however, that it could just be that the failure in 1.5 is always
> consistent with my hardware so it always comes up the right way around.
> 1.7, however, definitely doesn't come up right, and more importantly, it
> doesn't come up consistently.
>
> > Cause I thought it was due to the order (that's what I've changed) of
> > udevd and kudzu/modprobe eth* being called. Older versions first called
> > kudzu then probed for the nics and then started udevd.
> >
> > Now I'm first starting udevd then - if appropriate - kudzu and then probe
> > for the NICs. I always thought that it was because of the order. But if
> > the new order works with hardware-lib.sh (v1.5) but not for 1.7 it isn't
> > because of the order. As the order is defined by linuxrc.generic.sh.
> >
> > Can you acknowledge that it's only the version of hardware-lib.sh?
>
> Yes, it's the only file I copied across from the older package. Note,
> however, the caveat above - it could just be that it makes things work on
> this one system where I observed it. In other words, just because 1.5 makes
> it work doesn't mean that the bug is in hardware-lib.sh. It could just be
> covering up a problem elsewhere.
> It could be some kind of a weird kudzu problem, too - I've found it to be
> unreliable and break things in the past, albeit not recently (having said
> that, it's the first thing I switch off on a new system, so maybe I just
> didn't notice before).
>
> >> The last version that works for me is v1.5, and the latest released
> >> version (I'm talking about CVS version numbers here) appears to be v1.7
> >> for this file (in the comoonics-bootimage-1.3-40.noarch.rpm release).
> >>
> >> Needless to say, trying to boot off an iSCSI shared root with the NIC
> >> not starting because eth designation doesn't match the MAC doesn't get
> >> very far. :-/
> >
> > Very needless. It's the same for non iscsi clusters ;-) . So this needs
> > to be fixed.
>
> Indeed. DRBD is even worse, as it has extra scope for split-brain,
> particularly if IP addresses are fail-over resources and they happen to
> live on an interface that does end up coming up correctly.
>
> > Thanks and sorry about that ugly bug.
>
> The fact that you observed it, too, is rather a relief, actually. It took
> me a fair while and a number of initrd rebuilds and a bit of digging to
> make sure that I was seeing what I _thought_ I was seeing, and not a weird
> side-effect of something I'd done to the configuration. Please, do post
> when you have a fix. :-)
>
> Gordan
>
> _______________________________________________
> Open-sharedroot-devel mailing list
> Ope...@li...
> https://lists.sourceforge.net/lists/listinfo/open-sharedroot-devel

--
Gruss / Regards,

Marc Grimme
http://www.atix.de/ http://www.open-sharedroot.org/
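
For anyone trying to confirm the symptom discussed in this thread (an eth* name ending up on a NIC whose MAC is not the one configured for it), here is a minimal diagnostic sketch. It is not part of comoonics-bootimage or hardware-lib.sh. The function name check_nic_mac is hypothetical, and the expected MAC has to be supplied by hand, e.g. copied from cluster.conf. It only relies on the kernel exposing each interface's MAC in /sys/class/net/<name>/address; the optional third argument lets you point it at another directory for testing.

```shell
#!/bin/sh
# Hypothetical diagnostic helper, not taken from comoonics-bootimage.
# Usage: check_nic_mac NAME EXPECTED_MAC [SYSFS_NET_DIR]
# Returns 0 if the MAC the kernel reports for NAME equals EXPECTED_MAC
# (compared case-insensitively), 1 on mismatch, 2 if NAME does not exist.
check_nic_mac() {
    name=$1
    # Normalise hex digits to lower case so AA:BB and aa:bb compare equal.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    netdir=${3:-/sys/class/net}
    addrfile="$netdir/$name/address"

    if [ ! -r "$addrfile" ]; then
        echo "$name: no such interface" >&2
        return 2
    fi

    actual=$(tr 'A-F' 'a-f' < "$addrfile")
    if [ "$actual" = "$expected" ]; then
        echo "$name: ok ($actual)"
        return 0
    fi

    echo "$name: expected $expected, found $actual" >&2
    return 1
}
```

Called as, say, check_nic_mac eth0 00:11:22:aa:bb:cc (a placeholder MAC, not one from this thread), it reports whether eth0 currently sits on the expected card. It only detects the mismatch; it makes no attempt to rename interfaces or change driver load order.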