From: Gordan B. <go...@bo...> - 2009-01-29 11:22:00
Hi,

Replying here because I thought it was too discussiony for a bugzilla comment.

# A new attribute driver per nic per clusternode was introduced.
# <eth name=".." mac=".." driver=".."/>

I can see that this is useful for heterogeneous clusters, but if "driver" isn't specified, the NIC driver "probing" shouldn't really occur at all. It should be done according to the content of modprobe.conf. I think modprobe.conf should be deemed authoritative unless specifically overridden by the driver parameter in the NIC spec.

Also, what happens if we have an alternating NIC driver setup, e.g.

    eth0 e1000
    eth1 e100
    eth2 e1000

Will this work correctly, or will loading the e1000 driver wrongly make the two e1000 NICs eth0 and eth1?

If udev configuration is dynamically generated from cluster.conf by MAC address using a line like:

    KERNEL=="eth*", SYSFS{address}=="00:11:22:33:44:55", NAME="eth0"

that should probably suffice. Unfortunately, AFAIK this does not make the modprobe.conf settings redundant (the driver has to be loaded before we can read the MAC). Still, I feel there is a strong argument for making modprobe.conf the default.

Or, as a potentially easier-to-implement alternative, maybe it would be better to make the driver parameter mandatory (assuming it isn't at the moment) and abort mkinitrd if it isn't provided.

Gordan

On Thu, 29 Jan 2009 11:21:07 +0100, Marc Grimme <gr...@at...> wrote:
> Hi,
> I opened a bug for this problem
> https://bugzilla.atix.de/show_bug.cgi?id=325
>
> I will describe/discuss my findings there.
>
> I think I have a solution already.
>
> Gordan, you might want to add yourself to this bug.
>
> Regards Marc.
>
> BTW: I would not use the current comoonics-bootimage from preview for clusters with multiple NICs (meaning ones with different drivers)!
> Wait a day or two and I'll come up with a new version.
> On Wednesday 21 January 2009 16:52:44 Gordan Bobic wrote:
>> On Wed, 21 Jan 2009 13:19:45 +0100, Marc Grimme <gr...@at...> wrote:
>> >> It would appear that /opt/atix/comoonics-bootimage/boot-scripts/etc/rhel5/hardware-lib.sh has gone through a few changes in the past few months, which, unfortunately, break it for me.
>> >>
>> >> The problem is in the ordering of the detected NICs. On one of my systems I have a dual e1000 built into the mobo, and an e100 as an add-in card. /etc/modprobe.conf lists eth0 and eth1 as the e1000s, and eth2 as e100. This works fine with hardware-lib.sh v1.5, but with v1.7 the ordering seems to be both unstable (about 1/10 of the time it'll actually get the NIC ordering as expected and specified in cluster.conf and the rest of the time it'll do something different) and inconsistent with what is in cluster.conf and modprobe.conf.
>> >
>> > That's strange. I have the same problem on one cluster as you describe. One time everything works and the other time it doesn't. But all other clusters work.
>> >
>> > The reason why I changed the hw detection for rhel5 is because it didn't work for VMs (especially kvm) and I didn't find any problems on all the other clusters (except for mine and yours).
>> >
>> > I think I have to look deeper into that matter.
>>
>> I made a rudimentary attempt at rectifying it by explicitly sorting the module list, but that didn't fix it. The problem is that the eth* binding ends up being done in the order the drivers are loaded (i.e. if I load the e100 driver before the e1000 driver, e100 ends up being eth0).
>> This seems to override and ignore any settings listed in modprobe.conf, and more disturbingly, it seems to ignore the by-MAC bindings in cluster.conf which should really have the highest precedence (but either way they should agree with modprobe.conf if everything is set up right).
>>
>> > So what you say is if you just change hardware-lib.sh from 1.7 to 1.5 everything works fine?
>>
>> Yes. Note, however, that it could just be that the failure in 1.5 is always consistent with my hardware so it always comes up the right way around. 1.7, however, definitely doesn't come up right, and more importantly, it doesn't come up consistently.
>>
>> > Because I thought it was due to the order (that's what I've changed) of udevd and kudzu/modprobe eth* being called. Older versions first called kudzu then probed for the NICs and then started udevd.
>> >
>> > Now I'm first starting udevd then - if appropriate - kudzu and then probe for the NICs. I always thought that it was because of the order. But if the new order works with hardware-lib.sh (v1.5) but not for 1.7 it isn't because of the order. As the order is defined by linuxrc.generic.sh.
>> >
>> > Can you acknowledge that it's only the version of hardware-lib.sh?
>>
>> Yes, it's the only file I copied across from the older package. Note, however, the caveat above - it could just be that it makes things work on this one system where I observed it. In other words, just because 1.5 makes it work doesn't mean that the bug is in hardware-lib.sh. It could just be covering up a problem elsewhere. It could be some kind of a weird kudzu problem, too - I've found it to be unreliable and break things in the past, albeit not recently (having said that, it's the first thing I switch off on a new system, so maybe I just didn't notice before).
>>
>> >> The last version that works for me is v1.5, and the latest released version (I'm talking about CVS version numbers here) appears to be v1.7 for this file (in the comoonics-bootimage-1.3-40.noarch.rpm release).
>> >>
>> >> Needless to say, trying to boot off an iSCSI shared root with the NIC not starting because eth designation doesn't match the MAC doesn't get very far. :-/
>> >
>> > Very needless. It's the same for non-iSCSI clusters ;-) . So this needs to be fixed.
>>
>> Indeed. DRBD is even worse, as it has extra scope for split-brain, particularly if IP addresses are fail-over resources and they happen to live on an interface that does end up coming up correctly.
>>
>> > Thanks and sorry about that ugly bug.
>>
>> The fact that you observed it, too, is rather a relief, actually. It took me a fair while and a number of initrd rebuilds and a bit of digging to make sure that I was seeing what I _thought_ I was seeing, and not a weird side-effect of something I'd done to the configuration. Please, do post when you have a fix. :-)
>>
>> Gordan
>>
>> _______________________________________________
>> Open-sharedroot-devel mailing list
>> Ope...@li...
>> https://lists.sourceforge.net/lists/listinfo/open-sharedroot-devel
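
To make the two proposals in the reply at the top of this message more concrete (modprobe.conf as the default driver source unless a per-NIC driver is given, and udev rules generated by MAC address), here is a minimal Python sketch. It is only an illustration under stated assumptions, not how comoonics-bootimage actually does it: the function names, the dictionary layout for a NIC entry and the MAC addresses are all hypothetical, and the rule text simply copies the SYSFS{address} form quoted above (newer udev releases use ATTR{address} instead).

```python
import re


def parse_modprobe_aliases(text):
    """Return {interface: driver} from 'alias ethN <driver>' lines.

    Assumes the RHEL5-style modprobe.conf layout discussed in the thread.
    """
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        match = re.match(r"alias\s+(eth\d+)\s+(\S+)$", line)
        if match:
            aliases[match.group(1)] = match.group(2)
    return aliases


def resolve_drivers(nics, aliases):
    """Pick a driver per NIC: explicit 'driver' wins, else modprobe.conf.

    Raises ValueError when neither source names a driver, which mirrors
    the "abort instead of probing" suggestion. 'nics' is a hypothetical
    list of dicts with 'name', 'mac' and optional 'driver' keys.
    """
    resolved = []
    for nic in nics:
        driver = nic.get("driver") or aliases.get(nic["name"])
        if driver is None:
            raise ValueError("no driver known for %s" % nic["name"])
        resolved.append({"name": nic["name"], "mac": nic["mac"], "driver": driver})
    return resolved


def udev_rules(nics):
    """Build one by-MAC naming rule per NIC, in the form quoted above."""
    template = 'KERNEL=="eth*", SYSFS{address}=="%s", NAME="%s"'
    # sysfs reports MAC addresses in lower case, so normalise before matching.
    return [template % (nic["mac"].lower(), nic["name"]) for nic in nics]


if __name__ == "__main__":
    modprobe_conf = "alias eth0 e1000\nalias eth1 e100\nalias eth2 e1000\n"
    example_nics = [  # made-up MAC addresses, for illustration only
        {"name": "eth0", "mac": "00:11:22:33:44:55"},
        {"name": "eth1", "mac": "00:11:22:33:44:66", "driver": "e100"},
        {"name": "eth2", "mac": "00:11:22:33:44:77"},
    ]
    aliases = parse_modprobe_aliases(modprobe_conf)
    for nic in resolve_drivers(example_nics, aliases):
        print(nic["name"], nic["driver"])
    for rule in udev_rules(example_nics):
        print(rule)
```

The point of the split is that driver selection and interface naming stay independent: the drivers can be loaded in any order, and the by-MAC rules are what pin eth0/eth1/eth2, which is the behaviour the alternating e1000/e100/e1000 case needs.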