On Wednesday 25 June 2008 08:02, Dan Porat wrote:
> Well, the thing is that a former revision worked on exactly the same
> hardware.
i think it is more complex, because there is always a strong relationship
between hardware and software, e.g. more optimized code -> shorter latency ->
higher IO rate -> more power, higher temperature, higher probability of
something nasty happening. (this means that a fault may be caused indirectly
by a newer revision.) as i recall, a supermicro x7db3 with 2x quad-core 3.0G
cpus draws about 450W at idle, and that rises by about 100W under heavy FC
and ethernet load.
> This makes it difficult for me to think that it is a PSU problem.
>
> The hardware is an HS20 blade, and there are 13 other blades connected to
> the same backplane, with the same PSUs.
yes, i wouldn't suspect PSU/cooling on a well-known server brand.
>
> I guess that subtly buggy PCIe hardware could make the difference between
> different revisions of SCST.
> I will check it on some other blades and update.
yes, this may give a clue. i'm curious whether you can find some "procedure"
which ends with the spurious NMI, i.e. how often does it happen, and are
there any special conditions? btw, what is your exact hw/scst configuration?
if you could find a way to trigger the NMI "on demand", this would be a big
step forward and would likely make further investigation easier.
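
As a starting point for the "how often does it happen" question, here is a
minimal sketch that counts the unknown-NMI messages in a syslog file and
decodes the reason byte. It assumes the reason is printed in hex (which is how
i read the kernel source) and that it is the value of the NMI status/control
port 0x61; the bit names are from memory of the Intel chipset datasheets, and
the function names and the /var/log/messages path are only illustrative, so
please verify all of that against your platform.

```python
import re
from collections import Counter

# Matches the kernel line quoted in this thread, e.g.
# "fester kernel: Uhhuh. NMI received for unknown reason 35 on CPU 0."
NMI_RE = re.compile(
    r"NMI received for unknown reason ([0-9a-fA-F]{2}) on CPU (\d+)")

# ASSUMPTION: bit names of port 0x61 as remembered from Intel chipset
# datasheets; check them against the datasheet for your chipset.
PORT61_BITS = {
    7: "SERR# / memory parity NMI status",
    6: "IOCHK# NMI status",
    5: "timer 2 output",
    4: "refresh toggle",
    3: "IOCHK# NMI disabled",
    2: "SERR# NMI disabled",
    1: "speaker data enable",
    0: "timer 2 gate enable",
}


def decode_reason(reason):
    """Return the names of the bits set in the reason byte, high bit first."""
    return [name
            for bit, name in sorted(PORT61_BITS.items(), reverse=True)
            if reason & (1 << bit)]


def scan(lines):
    """Count unknown-NMI messages per (reason, cpu) pair."""
    hits = Counter()
    for line in lines:
        m = NMI_RE.search(line)
        if m:
            # The reason is parsed as hex, the CPU number as decimal.
            hits[(int(m.group(1), 16), int(m.group(2)))] += 1
    return hits


def report(path):
    """Print a per-reason summary for one log file (hypothetical helper)."""
    with open(path, errors="replace") as f:
        for (reason, cpu), n in sorted(scan(f).items()):
            print(f"reason {reason:#04x} cpu {cpu}: {n} time(s)")
            for name in decode_reason(reason):
                print("    " + name)
```

Calling report("/var/log/messages") after each test run gives a count to
compare between blades and between SCST revisions. If the assumptions above
hold, reason 35 has neither of the two status bits (7 and 6) set, which would
be why the kernel calls the reason "unknown".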
>
> About REPORTING-BUGS, I will do it later, once the resources become
> available.
>
> Thanks so far.
>
>
> Dan Porat
>
Krzysztof Blaszkowski
>
> On Tue, Jun 24, 2008 at 8:07 PM, Krzysztof Błaszkowski <kb@...> wrote:
> > my guess, based on my experience with very subtly buggy PCIe hardware,
> > is that such a message can be triggered by some misbehaving card (or by
> > a driver for this card which doesn't correctly handle all possible
> > states, which may not even be possible because the card is buggy).
> >
> >
> > Another reason can be a slightly overloaded PSU. let me tell a story. i
> > have an ST 500G sata drive in a backplane and it used to drop the link
> > from time to time. after long observation i found that it happens with
> > some PSUs and not with others. so i now have an indicator of which PSU
> > is good ;).
> >
> > PSUs are demystified here:
> > http://www.playtool.com/pages/psumultirail/multirails.html
> > and on other pages of this site. i think this is a really comprehensive
> > review. so with a PSU that is not good enough and a high load, such an
> > error may happen.
> >
> > Krzysztof Blaszkowski
> >
> > On Tuesday 24 June 2008 16:35, Dan Porat wrote:
> > > rev 242 worked flawlessly on a CentOS machine.
> > > After upgrading to rev 420, the following messages appeared (and after
> > > them the machine reboots).
> > >
> > >
> > > Message from syslogd@ at Tue Jun 24 12:01:12 2008 ...
> > > fester kernel: Uhhuh. NMI received for unknown reason 35 on CPU 0.
> > > Message from syslogd@ at Tue Jun 24 12:01:12 2008 ...
> > > fester kernel: Do you have a strange power saving mode enabled?
> > > Message from syslogd@ at Tue Jun 24 12:01:12 2008 ...
> > > fester kernel: Dazed and confused, but trying to continue
> > >
> > > Does anyone know anything about it?
> > >
> > >
> > > Dan Porat