On 06.06.09 08:33:52, John Levon wrote:
> On Mon, Jun 01, 2009 at 02:06:35PM -0500, Maynard Johnson wrote:
> > I then started looking at the oprofile kernel code and identified multiple holes
> > where we could miss logging a context switch of some sort and then subsequently
> > record a sample against the wrong task (or wrong binary, in the case of a lost
> Yes, this is the expected case. We don't attempt for correctness if
> we're running out of space etc. Although it would be interesting to at
> least document these holes in the sources.
There are several ways to lose data or, much worse, to get a corrupt
* writing to the cpu buffer:
All write data sequences to the cpu buffer are in this form:
ESC_CODE, FLAGS, [TASK,]
PC (EIP), EVENT,
PC (EIP), 0 (1st)
PC (EIP), 0, (n-th)
TASK is only written if it is a user space sample. Data in each line
above is written to the buffer atomically. If there is no space left
the _remaining_ data of the sequence is discarded completely. A
sequence is always started with an ESC_CODE, PCs with the value of the
ESC_CODE are discarded. So, a sequence can be identified by a leading
ESC_CODE. This is counted:
cpu_buf->sample_received (every sample)
cpu_buf->sample_invalid_eip (eip equals ESC_CODE)
cpu_buf->sample_lost_overflow (no space left to write the sample)
cpu_buf->backtrace_aborted (no space left to write the backtrace)
The code looks safe until here.
In sync_buffer() the data is copied to the event buffer. Since the
buffer state is reset at buffer boundaries (when no buffer data is
available), there could be corrupt data too. But maybe it works
because the cpu buffers are written with a higher priority since this
code is called from an interrupt handler.
* writing to the event buffer:
All functions finaly use add_event_entry() that writes a single
unsigned long from the cpu buffer to the event buffer. On buffer
overflows or other failures, there could be arbitrarily missing data,
even escacpe sequences that are needed to re-sync the parser in the
user land could get lost. This seems to be very unsafe and I am not
sure if the daemon can handle this. To reproduce corrupt data I
suggest to setup a small event buffer and high interrupt/sampling
If there are missing events, oprofile_stats.event_lost_overflow is
* writing the event buffer into user space:
event_buffer_read() copies the data to the user space. The whole event
buffer must fit into the read buffer for the file operation, the
operation fails otherwise, but the event buffer is not discarded. In
this functions the usage of variable buffer_pos looks racy. Maybe also
buffer boundaries are not safe.
This code is already a long time in the kernel. I am not sure if this
is really a problem. At least this should not stop the next release. A
fix would be in the kernel and not part of the user land release. But
the user land should be fault-tolerant and robust if there is corrupt
data. We should check this.
Finally, the kernel should be reworked to fix this at all. This will
be a lot of work and maybe not worth the effort (since the side
effects are small). Ok, it would be nice to have. Maynard and John,
what do you think?
Advanced Micro Devices, Inc.
Operating System Research Center