Re: [Openipmi-developer] IPMI driver performance problem


Hi Matt,

On Mon, 2008-03-17 at 05:47 -0600, Matt Domsch wrote:
> On Mon, Mar 17, 2008 at 03:11:27PM +1100, Nathan Scott wrote:
> > Hi all,
> > 
> > We're seeing an IPMI related performance problem on our production
> > servers, which I hope someone can help me with.  These are Dell
> > boxes so I've CC'd Matt in case the answer is known already (sorry
> > for the intrusion if not, Matt).
> 
> Yes, this is well known.

Ah, good to know.

> Short story is, the KCS interface is horrible, but unfortunately has
> become the typical implementation.  Longer story is, you can either
> poll for each character to arrive, or you can get an interrupt for
> each character.  Except that there's no interrupt line present in most
> systems, including all Dell servers with KCS to date.  So, the driver
> must poll (which is what the kipmi0 thread is all about).
> 
> You can disable the polling thread, but then the driver operates at
> the speed of the timer interrupt, so 1ms per character in RHEL4.
> System startup, starting OMSA, in this mode, takes an extra 3-5
> minutes.  Firmware updates take ~15 minutes.  With the polling thread,
> these times are "a few seconds" and 1.5 minutes respectively.

OK, thanks for the explanation!
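
For a rough sense of scale, the figures quoted above can be turned
into a back-of-the-envelope estimate.  This is only a sketch: it
assumes the 1ms-per-character rate quoted for RHEL4 holds throughout
and that nothing else contributes to the delay, and the function name
is made up for illustration.

```python
# Rough estimate only: assumes one KCS character per 1 ms timer tick
# (the RHEL4 figure quoted above) and no other source of delay.
MS_PER_CHAR = 1.0


def chars_transferred(minutes, ms_per_char=MS_PER_CHAR):
    """Characters that fit into `minutes` at the given per-character cost."""
    return int(minutes * 60 * 1000 / ms_per_char)


# 3-5 minutes is the extra OMSA startup time quoted above,
# 15 minutes the firmware update time without the polling thread.
for minutes in (3, 5, 15):
    count = chars_transferred(minutes)
    print(f"{minutes:>2} min at {MS_PER_CHAR} ms/char ~ {count:,} characters")
```

So on those assumptions the extra 3-5 minutes would correspond to
something in the low hundreds of thousands of characters, which makes
it clear why a per-tick transfer rate hurts.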

> I tried to get a hardware interrupt line added to the Dell 10G
> PowerEdge servers.  However, this greatly confused some "other
> operating systems" to the point of making them completely unusable, so
> it had to be disabled.

Bother.

> The polling thread runs at the lowest possible priority, so it
> shouldn't get scheduled that often.  However, when it does get
> scheduled, because it's not pre-emptable, it will consume CPU cycles
> polling until a command completes, at which point it'll get scheduled
> out again.
> 
> If this is seriously impacting your system performance (more than just
> appearing in top/ps/etc.), one option is to disable OMSA (if you
> aren't using it), which is the userspace systems management software
> that occasionally polls the state of all the sensors, which is why you
> see the CPU spikes.

We are using OMSA, and I got my wrist slapped for disabling it
while trying to figure out what was going on here (production
operations folks missed a predictive disk failure alert).

So, I'm kinda stuck I guess.  It is impacting performance for
us: the oprofile example I gave earlier is from a 2 CPU NAS box,
so we lose one CPU while kipmid is doing its thing, which is not
helping the software that's meant to be consuming CPU resources
on the machine at those times (nfsd threads and constant rsyncs
to a secondary NAS).  That 31% was 31% of the CPU events sampled
across both CPUs - the oprofile run was for 5 seconds, and was
started _after_ the initial spike was detected (so 31% in that
worst case was probably not the whole picture either).
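
To put that 31% in per-CPU terms, here is a small sketch of the
arithmetic.  It assumes oprofile collected roughly the same number of
samples on each of the two CPUs during the run; the function name is
just for illustration.

```python
def per_cpu_equivalent(fraction_of_all_samples, num_cpus):
    """Convert a share of samples across all CPUs into a share of one CPU.

    Assumes every CPU contributed about the same number of samples.
    """
    return fraction_of_all_samples * num_cpus


# 31% of the events sampled across both CPUs of the 2 CPU NAS box
share = per_cpu_equivalent(0.31, 2)
print(f"roughly {share:.0%} of one CPU over the sampled window")
```

That is an average over the 5 second window only, so it says nothing
about the part of the spike that happened before sampling started.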

Oh well, so it goes.  If you could try again to get hardware
that supports interrupt-driven IPMI processing out there (even
if disabled by default, or something, to avoid the issues you
had with those other OSes) - that would be wonderful and very
much appreciated.

cheers.

-- 
Nathan