From: Maynard J. <may...@us...> - 2009-06-19 21:17:57
Attachments:
op-ovflw-fix.patch
|
As has been discussed on this list (and reported in bug http://sourceforge.net/tracker/?func=detail&aid=1685267&group_id=16191&atid=116191), the oprofile kernel driver has sometimes recorded incorrect data leading to mis-attribution of samples, warnings of files that can't be found (that are bogus filenames), and "invalid filename" errors that cause opreport to choke and die. It's clear that high sample rates (or high logging rates due to callgraph data collection) can result in opening a timing window where task switches are lost and samples are, thus, attributed to the wrong task and binary file. The attached patch does the following to address this problem: 1. I inserted the fix that was first proposed by "rutt" in the above-mentioned bug report which catches the "invalid filename" error and simply prints a message to stderr instead of aborting. 2. I implemented a set of changes to keep overflow stats with the profile session data, including when oparchive is used to archive sample data. When any post-processing tools are used against the session data, the code added in this patch checks the overflow stats and prints a warning message if it finds any overflows. Comments are welcome. Thanks. -Maynard |
From: John L. <le...@mo...> - 2009-06-19 23:56:19
|
On Fri, Jun 19, 2009 at 04:17:50PM -0500, Maynard Johnson wrote: > 1. I inserted the fix that was first proposed by "rutt" in the > above-mentioned bug report which catches the "invalid filename" error and > simply prints a message to stderr instead of aborting. This is going to hide future real bugs. The patch is the equivalent of: some_class::some_method() { if (this == 0) return; } My comments on this "approach" still stand. We should just be biting the bullet and refusing to open any sample file that contains a '{' inside oprofiled. We'd need to cache the negative sfiles for this of course. It's also an excellent spot to report the problem I would think. regards john |
From: Maynard J. <may...@us...> - 2009-06-22 22:06:04
|
John Levon wrote: > On Fri, Jun 19, 2009 at 04:17:50PM -0500, Maynard Johnson wrote: > >> 1. I inserted the fix that was first proposed by "rutt" in the >> above-mentioned bug report which catches the "invalid filename" error and >> simply prints a message to stderr instead of aborting. > > This is going to hide future real bugs. The patch is the equivalent of: > > some_class::some_method() { > if (this == 0) > return; > } > > My comments on this "approach" still stand. > > We should just be biting the bullet and refusing to open any sample file > that contains a '{' inside oprofiled. We'd need to cache the negative > sfiles for this of course. I can do that, but I'm not convinced this will catch 100% of these anomalies. We see this kind of thing with oprofiled when it's overloaded. The crux of the matter is samples being mis-attributed to the wrong task, which can happen with tasks other than oprofiled. But I suppose it's unlikely we'll choke on other filenames like we do on the sample filenames from oprofiled. So I'll re-spin the patch with this approach. Thanks. -Maynard > > It's also an excellent spot to report the problem I would think. > > regards > john |
From: John L. <le...@mo...> - 2009-06-22 22:09:24
|
On Mon, Jun 22, 2009 at 05:04:52PM -0500, Maynard Johnson wrote: > I can do that, but I'm not convinced this will catch 100% of these anomalies. That's fine. It's a statistical profiler: we don't provide 100% accuracy. I agree with the warning (it's useful) but there is no crime in being a little bit off as long as there isn't a significant bias. I can't see an argument that there's bias here. regards john |
From: Maynard J. <may...@us...> - 2009-06-24 17:34:33
|
John Levon wrote: > On Fri, Jun 19, 2009 at 04:17:50PM -0500, Maynard Johnson wrote: > >> 1. I inserted the fix that was first proposed by "rutt" in the >> above-mentioned bug report which catches the "invalid filename" error and >> simply prints a message to stderr instead of aborting. > > This is going to hide future real bugs. The patch is the equivalent of: > > some_class::some_method() { if (this == 0) return; } > > My comments on this "approach" still stand. > > We should just be biting the bullet and refusing to open any sample file that > contains a '{' inside oprofiled. We'd need to cache the negative sfiles for > this of course. I was reviewing some saved sample data where I was seeing the bogus sample file problem. While oprofiled is usually found in these bad sample file names, it's not always the case (haven't really determined how that happens yet). Here are a few examples. ===== From opreport: ========================== warning: /lib64/libdl-2.9.so/PM_RUN_CYC_GRP12.10000.0.all.all.all/{dep}/{root}/var/lib/oprofile/samples/current/{root}/lib64/libdl-2.9.so/{dep}/{root}/lib64/libdl-2.9.so/PM_RUN_CYC_GRP12.10000.0.all.all.all could not be found. warning: /lib64/libreadline.so.5.2/PM_RUN_CYC_GRP12.10000.0.all.all.all/{dep}/{root}/var/lib/oprofile/samples/current/{root}/lib64/libreadline.so.5.2/{dep}/{root}/lib64/libreadline.so.5.2/PM_RUN_CYC_GRP12.10000.0.all.all.all could not be found. =============================== NOTE: In the above case, we just get warning messages. The invalid filename errors (as reported in bug 1685267) are due to the failed attempt to parse bogus sample files when a cg segment is also present. Instrumenting opreport, I found the sample files corresponding to the above warnings were: Sample file name: /var/lib/oprofile/samples/mpj_hpcc2/{root}/var/lib/oprofile/samples/current/{root}/lib64/libdl-2.9.so/{dep}/{root}/lib64/libdl-2.9.so/PM_RUN_CYC_GRP12.10000.0.all.all.all/{dep}/{root}/var/lib/oprofile/samples/current/{root}/lib64/libdl-2.9.so/{dep}/{root}/lib64/libdl-2.9.so/PM_RUN_CYC_GRP12.10000.0.all.all.all/PM_RUN_CYC_GRP12.10000.0.all.all.all Sample file name: /var/lib/oprofile/samples/mpj_hpcc2/{root}/var/lib/oprofile/samples/current/{root}/lib64/libreadline.so.5.2/{dep}/{root}/lib64/libreadline.so.5.2/PM_RUN_CYC_GRP12.10000.0.all.all.all/{dep}/{root}/var/lib/oprofile/samples/current/{root}/lib64/libreadline.so.5.2/{dep}/{root}/lib64/libreadline.so.5.2/PM_RUN_CYC_GRP12.10000.0.all.all.all/PM_RUN_CYC_GRP12.10000.0.all.all.all I had instrumented the oprofile daemon for this run to print a message to the log for each bad cookie it got from the event buffer (i.e., if filename starts with "/var/lib/oprofile"). Here are the corresponding messages from the oprofiled.log: ++++++++++++++++ cookie: /var/lib/oprofile/samples/current/{root}/lib64/libdl-2.9.so/{dep}/{root}/lib64/libdl-2.9.so/PM_RUN_CYC_GRP12.10000.0.all.all.all cookie: /var/lib/oprofile/samples/current/{root}/lib64/libreadline.so.5.2/{dep}/{root}/lib64/libreadline.so.5.2/PM_RUN_CYC_GRP12.10000.0.all.all.all ++++++++++++++++ So, I think rather than filtering on sample files that contain "{" inside oprofiled, a more general approach of simply filtering out any sample file of the form <session-dir>/*<session-dir>*. -Maynard > > It's also an excellent spot to report the problem I would think. > > regards john |
From: John L. <le...@mo...> - 2009-06-24 17:47:19
|
On Wed, Jun 24, 2009 at 12:34:13PM -0500, Maynard Johnson wrote: > So, I think rather than filtering on sample files that contain "{" inside > oprofiled, a more general approach of simply filtering out any sample file > of the form <session-dir>/*<session-dir>*. I probably wasn't clear: we should drop /anything/ that contains a '{'. It's easiest and safest... regards john |
From: Maynard J. <may...@us...> - 2009-06-24 18:12:46
|
John Levon wrote: > On Wed, Jun 24, 2009 at 12:34:13PM -0500, Maynard Johnson wrote: > >> So, I think rather than filtering on sample files that contain "{" inside >> oprofiled, a more general approach of simply filtering out any sample file >> of the form <session-dir>/*<session-dir>*. > > I probably wasn't clear: we should drop /anything/ that contains a '{'. > It's easiest and safest... Are you suggesting to do the filtering in the oprofile daemon? I thought you mentioned that the sfiles should still be kept for these. Anyway, I think it would be better to defer this to post-processing time in profile_spec::valid_candidate. -Maynard > regards > john |
From: John L. <le...@mo...> - 2009-06-24 21:46:50
|
On Wed, Jun 24, 2009 at 01:12:40PM -0500, Maynard Johnson wrote: > > I probably wasn't clear: we should drop /anything/ that contains a '{'. > > It's easiest and safest... > Are you suggesting to do the filtering in the oprofile daemon? I thought you Yes. > mentioned that the sfiles should still be kept for these. I meant that an 'opd_sfile' should exist for them that's na negative cache entry, not the file should ever be written to disk. > would be better to defer this to post-processing time in > profile_spec::valid_candidate. It seems simpler for the daemon to do this, since it means that all sample files on disk are considered for processing? Maybe not. regards john |
From: Maynard J. <may...@us...> - 2009-06-24 22:09:36
|
John Levon wrote: > On Wed, Jun 24, 2009 at 01:12:40PM -0500, Maynard Johnson wrote: > >>> I probably wasn't clear: we should drop /anything/ that contains a '{'. >>> It's easiest and safest... >> Are you suggesting to do the filtering in the oprofile daemon? I thought you > > Yes. > >> mentioned that the sfiles should still be kept for these. > > I meant that an 'opd_sfile' should exist for them that's na negative > cache entry, not the file should ever be written to disk. > >> would be better to defer this to post-processing time in >> profile_spec::valid_candidate. > > It seems simpler for the daemon to do this, since it means that all > sample files on disk are considered for processing? Maybe not. Here's the code to do the filtering in valid_candidate: unsigned int j = base_dir.rfind('/'); string session_samples_dir = base_dir.substr(0,j); if (sub.find(session_samples_dir) != string::npos ) { cerr << "Warning: Removing invalid sample file. " << "May be due to kernel buffer overflows." << endl; return false; } Doesn't get much simpler. This valid_candidate() is called via generate_file_list() from profile_spec, so it's done right up-front. -Maynard > > regards > john |
From: John L. <le...@mo...> - 2009-06-24 22:28:59
|
On Wed, Jun 24, 2009 at 05:08:52PM -0500, Maynard Johnson wrote: > Here's the code to do the filtering in valid_candidate: fair enough. We should have a big comment though as to what's going on thanks john |
From: Maynard J. <may...@us...> - 2009-06-25 20:12:57
Attachments:
op-overflow.patch
|
This is version 2 of a patch, originally posted on Jun 19. ======================================================= As has been discussed on this list (and reported in bug http://sourceforge.net/tracker/?func=detail&aid=1685267&group_id=16191&atid=116191), the oprofile kernel driver has sometimes recorded incorrect data leading to mis-attribution of samples, warnings of files that can't be found (that are bogus filenames), and "invalid filename" errors that cause opreport to choke and die. It's clear that high sample rates (or high logging rates due to callgraph data collection) can result in opening a timing window where task switches are lost and samples are, thus, attributed to the wrong task and binary file. The attached patch does the following to address this problem: 1. Extremely high sampling rates can result in samples being mis-attributed to the oddest things -- for example, a sample file that oprofiled has open at the time. So then a cookie switch is stored in sample data stream coming from the kernel. When the daemon processes this cookie switch, it creates a sample file for the sample file! This craziness has the form: <session-dir>/samples/[blah]/<session-dir>/samples/[blah] When any post-processing tools are done, sample filenames will be examined to determine if they are of the above form; if so, we discard them and print out one message (not a message per file) that warns the user we found invalid sample files, which were probably caused by too high of a sampling rate. 2. Implements a set of changes to keep overflow stats with the profile session data, including when oparchive is used to archive sample data. When any post-processing tools are used against the session data, the code added in this patch checks the overflow stats and prints a warning message if it finds any overflows. Comments are welcome. Thanks. -Maynard |
From: John L. <le...@mo...> - 2009-07-10 21:57:53
|
On Thu, Jun 25, 2009 at 03:12:49PM -0500, Maynard Johnson wrote: > + string session_samples_dir = base_dir.substr(0,j); > + if (sub.find(session_samples_dir) != string::npos ) { Please clean up spacing in both these lines. Otherwise looks OK. Might like to consider storing all the stats after the release, and using that to give a %age of lost samples, instead of the currently slightly vaguer warning. regards john |
From: Maynard J. <may...@us...> - 2009-07-10 20:20:23
|
John Levon wrote: > On Thu, Jun 25, 2009 at 03:12:49PM -0500, Maynard Johnson wrote: > >> + string session_samples_dir = base_dir.substr(0,j); >> + if (sub.find(session_samples_dir) != string::npos ) { > > Please clean up spacing in both these lines. Spacing fixed and patch committed. Thanks for reviewing! > > Otherwise looks OK. Might like to consider storing all the stats after > the release, and using that to give a %age of lost samples, instead of > the currently slightly vaguer warning. Perhaps some real-life data from users can give us an idea of approximately how much loss is acceptable before results start getting noticeably screwy. -Maynard > > regards > john |