From: Michael E. <mi...@el...> - 2008-01-07 22:32:00
On Mon, 2008-01-07 at 13:30 -0600, Bob Nelson wrote:
> On Monday 07 January 2008 09:13:13 am Maynard Johnson wrote:
> > Michael Ellerman wrote:
> > > Hi all,
> > >
> > > Running oprofile (0.9.3) on a cell machine (2.6.24-rc7 kernel) I see the
> > > oprofiled intermittently crashing. It only seems to happen when I run an
> > > SPU program.
> > >
> > > When it crashes I see this in the log:
> > >
> > > oprofiled started Mon Jan 7 18:23:21 2008
> > > kernel pointer size: 8
> > > Read buffer of 98307 entries.
> > > No anon map for pc 0, app anonymous.
> > >
> > Well, that's definitely badness, but this, in itself, would not cause
> > oprofiled to crash. Is this the last thing you see in the log? Does
> > the daemon fail both with and without the --verbose option?
> > > Compared to a working run:
> > >
> > > oprofiled started Mon Jan 7 18:21:12 2008
> > > kernel pointer size: 8
> > > Read buffer of 11 entries.
> > > Dangling ESCAPE_CODE.
> > > <snip>
> > >
> > A dangling ESCAPE code is badness, too. For Cell, a buffer with 11
> > entries could mean 3 entries for profiling start header info + 8 entries
> > for SPU context info. The 11th entry would be the offset of the SPU ELF
> > data, if embedded; otherwise 0. According to the above log snippet, the
> > 11th entry is an ESCAPE_CODE. This implies to me that another event
> > record may be getting intermingled in the buffer. There were locks and
> > memory barriers in place to prevent this from happening. Has there been
> > a change in the Cell-oprofile kernel code recently that might be causing
> > this? Did you see this problem on earlier kernels? Are there any more
> > details you can provide to reproduce the problem?
>
> Actually I think the dangling escape code message is a bug I ran into a
> little while back but I haven't put out a patch for it yet. I only saw it
> in one weird case IIRC. I think it was when the only or last thing in the
> buffer was a context switch. You indicate this was the 'working' run but
> it doesn't look like you are getting any data collected in this case.
> If you are compiling OProfile from source it is a one-line change.
>
> In the module oprofile-0.9.3/daemon/opd_spu.c in the following line the 7
> should be changed to a 6.
>
> if (!enough_remaining(trans, 7)) {

OK I can't reproduce it now so perhaps it is the same bug you saw once.
If I can build oprofile from source I'll try your patch.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
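
For anyone following the thread, here is a minimal sketch of why the 7 -> 6 change Bob suggests would matter when a context switch is the last thing in the buffer. It is not the oprofile source: `transient_sketch`, `enough_remaining_sketch`, `ctx_switch_accepted` and `CTX_SWITCH_ENTRIES` are hypothetical stand-ins, and the assumption that exactly six entries are still unread at the point of the check is only inferred from the proposed fix.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the daemon's buffer cursor.
 * Only the count of unread entries matters for this sketch. */
struct transient_sketch {
    size_t remaining;   /* entries left to read in the event buffer */
};

/* Returns nonzero when at least `size` entries are still unread. */
static int enough_remaining_sketch(const struct transient_sketch *trans,
                                   size_t size)
{
    return trans->remaining >= size;
}

/* ASSUMPTION: number of entries still unread for one SPU context-switch
 * record at the point of the check, inferred from the 7 -> 6 change. */
#define CTX_SWITCH_ENTRIES 6

/* Returns 1 if the record would be parsed with the given requirement,
 * 0 if it would be rejected as incomplete. */
static int ctx_switch_accepted(const struct transient_sketch *trans,
                               size_t required)
{
    return enough_remaining_sketch(trans, required) ? 1 : 0;
}
```

Under that assumption, a buffer that ends with a context-switch record has exactly six entries left, so a check for 7 rejects the record while a check for 6 accepts it. A buffer with further records after the context switch passes either check, which would fit Bob only having seen the problem in one unusual case.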