From: Gordan B. <go...@bo...> - 2009-01-21 15:53:01
On Wed, 21 Jan 2009 13:19:45 +0100, Marc Grimme <gr...@at...> wrote:

>> It would appear that
>> /opt/atix/comoonics-bootimage/boot-scripts/etc/rhel5/hardware-lib.sh has
>> gone through a few changes in the past few months, which, unfortunately,
>> break it for me.
>>
>> The problem is in the ordering of the detected NICs. On one of my
>> systems I have a dual e1000 built into the mobo, and an e100 as an
>> add-in card. /etc/modprobe.conf lists eth0 and eth1 as the e1000s, and
>> eth2 as e100. This works fine with hardware-lib.sh v1.5, but with v1.7
>> the ordering seems to be both unstable (about 1/10 of the time it'll
>> actually get the NIC ordering as expected and specified in cluster.conf
>> and the rest of the time it'll do something different) and inconsistent
>> with what is in cluster.conf and modprobe.conf.
>
> That's strange. I have the same problems on one cluster like you describe
> it. One time everything works and the other time it doesn't. But all other
> clusters work.
>
> The reason why I changed the hw detection for rhel5 is because it didn't
> work for VMs (especially kvm) and I didn't find any problems on all the
> other clusters (except for the one me and the one from you).
>
> I think I have to look deeper into that matter.

I made a rudimentary attempt at rectifying it by explicitly sorting the
module list, but that didn't fix it. The problem is that the eth* binding
ends up being done in the order the drivers are loaded (i.e. if I load the
e100 driver before the e1000 driver, e100 ends up being eth0). This seems
to override and ignore any settings listed in modprobe.conf, and more
disturbingly, it seems to ignore the by-MAC bindings in cluster.conf which
should really have the highest precedence (but either way they should agree
with modprobe.conf if everything is set up right).

> So what you say is if you just change hardware-lib.sh from 1.7 to 1.5
> everything works fine?

Yes.
Note, however, that it could just be that the failure in 1.5 is always
consistent with my hardware so it always comes up the right way around.
1.7, however, definitely doesn't come up right, and more importantly, it
doesn't come up consistently.

> Cause I thought it was due to the order (that's what I've changed) of
> udevd and kudzu/modprobe eth* being called. Older versions first called
> kudzu then probed for the nics and then started udevd.
>
> Now I'm first starting udevd then - if appropriate - kudzu and then probe
> for the NICs. I always thought that it was because of the order. But if
> the new order works with hardware-lib.sh (v1.5) but not for 1.7 it isn't
> because of the order. As the order is defined by linuxrc.generic.sh.
>
> Can you acknowledge that it's only the version of hardware-lib.sh?

Yes, it's the only file I copied across from the older package. Note,
however, the caveat above - it could just be that it makes things work on
this one system where I observed it. In other words, just because 1.5 makes
it work doesn't mean that the bug is in hardware-lib.sh. It could just be
covering up a problem elsewhere. It could be some kind of a weird kudzu
problem, too - I've found it to be unreliable and break things in the past,
albeit not recently (having said that, it's the first thing I switch off on
a new system, so maybe I just didn't notice before).

>> The last version that works for me is v1.5, and the latest released
>> version (I'm talking about CVS version numbers here) appears to be v1.7
>> for this file (in the comoonics-bootimage-1.3-40.noarch.rpm release).
>>
>> Needless to say, trying to boot off an iSCSI shared root with the NIC
>> not starting because eth designation doesn't match the MAC doesn't get
>> very far. :-/
>
> Very needless. It's the same for non iscsi clusters ;-) . So this needs to
> be fixed.

Indeed.
DRBD is even worse, as it has extra scope for split-brain, particularly if
IP addresses are fail-over resources and they happen to live on an
interface that does end up coming up correctly.

> Thanks and sorry about that ugly bug.

The fact that you observed it, too, is rather a relief, actually. It took
me a fair while and a number of initrd rebuilds and a bit of digging to
make sure that I was seeing what I _thought_ I was seeing, and not a weird
side-effect of something I'd done to the configuration. Please, do post
when you have a fix. :-)

Gordan
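
For anyone wanting to confirm the same symptom on their own nodes, the check described above (does each ethN carry the MAC it is supposed to?) can be sketched roughly as follows. This is only an illustration, not part of comoonics or hardware-lib.sh: the function name check_nic_binding is made up, the expected interface/MAC pairs are typed in by hand rather than parsed from cluster.conf or modprobe.conf, and the MAC addresses in the usage line are placeholders. It assumes the kernel exposes each interface's MAC under /sys/class/net/<iface>/address.

```shell
#!/bin/sh
# Hypothetical helper: report whether each interface name is bound to the
# MAC address it is expected to have.
#
# usage: check_nic_binding NET_DIR IFACE=MAC [IFACE=MAC ...]
#   NET_DIR is normally /sys/class/net; it is a parameter so the function
#   can also be pointed at a directory of test fixtures.
check_nic_binding() {
    netdir=$1
    shift
    rc=0
    for pair in "$@"; do
        iface=${pair%%=*}
        # Normalise the expected MAC to lower case before comparing.
        want=$(printf '%s' "${pair#*=}" | tr 'A-F' 'a-f')
        if [ ! -r "$netdir/$iface/address" ]; then
            echo "$iface: missing (expected $want)"
            rc=1
            continue
        fi
        have=$(tr 'A-F' 'a-f' < "$netdir/$iface/address")
        if [ "$have" = "$want" ]; then
            echo "$iface: ok ($have)"
        else
            echo "$iface: MISMATCH (expected $want, found $have)"
            rc=1
        fi
    done
    return $rc
}

# Example call (placeholder MACs, substitute the ones from your own config):
# check_nic_binding /sys/class/net eth0=00:11:22:33:44:55 eth2=00:aa:bb:cc:dd:ee
```

The function returns non-zero if any interface is missing or carries the wrong MAC, so running it a few times across reboots should show whether the ordering is stable.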