From: graydon h. <gr...@re...> - 2002-11-28 06:10:41
Attachments:
oprofile-p4-dropped-overflow-2.patch
The patch I posted yesterday, which supposedly "fixed" the dropped overflows I'm seeing on P4, did not. On further testing it still let plenty through; as many as 1% of total NMIs when counting at an appreciably high frequency.

So I played around with it some more today, tried implementing Intel's suggested fix to the letter (didn't work), tried a few other things, and now have what I think is a more stable solution, attached here. I read and check the counter for positiveness as well as the CCCR OVF flag, and do a read after writing to check that the counter was updated properly, re-writing if it was not. This seems to catch both failure types I was seeing; I ran it at ~200k NMIs/second for 30 minutes churning disk, network, CPU, etc. and didn't lose any overflows. I also added some bits to clear out all the ESCRs on initialization, to mirror the clearing of CCCRs I added two days ago.

The counter reading and writing adds a little more overhead. I haven't completely quantified it yet, but it doesn't look to add much more than an extra couple percent of oprofile's existing overhead (which itself varies rather widely depending on count frequency).

Comments? This is all a little flaky-sounding, but it needs fixing.

-graydon
From: Philippe E. <ph...@wa...> - 2002-11-28 17:15:00
Attachments:
bz2-test
do-plot.pl
graydon hoare wrote:
> the patch I posted yesterday which supposedly "fixed" the dropped
> overflows I'm seeing on p4 did not. on further testing, it still let
> plenty through; as many as 1% of total NMIs when counting at an
> appreciably high frequency.
>
> so I played around with it some more today, tried implementing intel's
> suggested fix to the letter (didn't work), tried a few other things, and
> now have what I think is a more stable solution, attached here. I read
> and check the counter for positiveness as well as the CCCR OVF flag, and
> do a read after writing to check that the counter was updated properly,
> re-writing if it was not. this seems to catch both failure types I was
> seeing; I ran it at ~ 200k NMIs / second for 30 minutes churning disk,
> network, cpu, etc. and didn't lose any overflows. I also added some bits
> to clear out all the ESCRs on initialization, to mirror the clearing of
> CCCRs I added 2 days ago.
>
> the counter reading and writing adds a little more overhead. I haven't
> completely quantified it yet but it doesn't look to do much more than
> add an extra couple percent of oprofile's existing overhead (which
> itself varies rather widely depending on count freq).

A couple of percent is big. My last experiment with a bz2 test on P6 shows an overhead of 1% at 300000 and 1.77% at 100000. You can look at rdpmc to read the counter, and also rdpmc(counter_nr | 0x80000000) to reduce the overhead. We need measurements of the P4 overhead.

Basically we use two separate tests now: a kernel compile with and without profiling at 300000/100000/25000, and a bz2 compression/decompression of a 128 MB text file (a gcc tar distro, to get sufficient data). I attach the script I use to produce the .plot file and the bz2-test script (it assumes a gcc.txt.bz2 exists in the current dir; you must tweak start_profiler for P4). Each test is run three times under a VT with nothing loaded except the test itself. The kernel_compile test works the same way.

The bz2 test is intended to measure the overhead of the NMI handler plus sample storing by the daemon. The kernel test adds to this the overhead of syscall interception plus the image/map handling in the daemon. After a test, .out contains various other information: memory used by the daemon, a dump of oprofiled.log, /proc/interrupts, etc.

> Index: module/x86/op_model_p4.c
> ===================================================================
...
> @@ -442,20 +461,36 @@
> 		struct op_msrs const * const msrs,
> 		struct pt_regs * const regs)
> {
> -	ulong low, high;
> +	ulong ctr, low, high;
> 	int i;
> -
> 	for (i = 0; i < NUM_COUNTERS; ++i) {
> +
> +		if (!sysctl.ctr[i].event)
> +			continue;
> +
> +		CTR_READ(ctr, high, i);
> 		CCCR_READ(low, high, i);
> -		if (CCCR_OVF_P(low)) {
> +		if (CCCR_OVF_P(low) || CTR_OVERFLOW_P(ctr)) {

Why are both tests required? Isn't "if (CTR_OVERFLOW_P(ctr))" sufficient?

> 			op_do_profile(cpu, regs, i);
> +			CTR_WRITE(oprof_data[cpu].ctr_count[i], i);
> 			CCCR_CLEAR_OVF(low);
> -			CTR_WRITE(oprof_data[cpu].ctr_count[i], i);
> 			CCCR_WRITE(low, high, i);

Later we need testing on various P4 stepping numbers; the HW is crappy. I'm especially interested in pre-C0 P4 tests: there is no erratum in the Intel docs, but the vtune module doesn't reset OVF the same way on P4 pre/post stepping C0. The easiest way is perhaps to call for testers on lkml when porting the P4 stuff to 2.5. Is that acceptable for 2.5, after more testing, now that the feature-freeze point for 2.5 has been reached, John?

regards, Phil
From: graydon h. <gr...@re...> - 2002-11-28 20:23:23
On Thu, 2002-11-28 at 13:11, Philippe Elie wrote:
> a couple of percent is big. My last experiment with a bz2 test
> on P6 shows an overhead of 1% at 300000 and 1.77% at 100000.

I think you misread me. What I meant is: my correction hack adds a couple percent *of the oprofile time* to the overall overhead, not of total system time. Though, well, it's all quite nonlinear. That's somewhat true at the lower counts, but it changes as you get into higher frequencies (lower event counts). Here is output from some benchmarking stuff I'm working on now. I'm still trying to work out the statistical accuracy of these things; it's tricky work.

before my change:

w/ counter @ 500000: 206299  overhead: 0.594489%
w/ counter @ 300000: 206969  overhead: 0.921151%
w/ counter @ 100000: 212665  overhead: 3.69889%
w/ counter @  50000: 222489  overhead: 8.4892%
w/ counter @  30000: 235878  overhead: 15.0176%
w/ counter @  10000: 321618  overhead: 56.8257%

with my change:

w/ counter @ 500000: 205103  overhead: 0.392054%
w/ counter @ 300000: 207518  overhead: 1.5743%
w/ counter @ 100000: 213438  overhead: 4.47179%
w/ counter @  50000: 224113  overhead: 9.69693%
w/ counter @  30000: 239917  overhead: 17.4326%
w/ counter @  10000: 336167  overhead: 64.5443%

I think there's still too much sampling noise in here to trust it too much, but clearly there is an effect. I'm not really sure I can do much else though; rdpmc is non-serializing, so I can't really rely on it to catch corner cases like dropped overflows.

-graydon