From: Yi Xu <yx...@su...> - 2007-02-27 12:13:27
Attachments:
fix-gettimeofday.patch
|
Hi,

The test gettimeofday02 (testcases/kernel/syscalls/gettimeofday) compares
the tv_usec fields of tv1 and tv2 when their tv_sec fields are equal, to
check that gettimeofday is monotonic over a 30-second run.

The original test calls gettimeofday again after only two comparisons
(tv2.tv_usec against tv1.tv_usec, and tv2.tv_sec against tv1.tv_sec) and
one assignment (tv1 = tv2).

On most computers the test passes, but one of my test machines sometimes
fails. I am a bit suspicious of error messages like "Time is going
backwards (old 1172499241.27215 vs new 1172499241.27214!". The resolution
of timeval is 1 usec, so a 1 usec difference might occasionally appear,
because two comparisons and an assignment take the system very little
time, e.g. a few nanoseconds.

So I added a nanosleep between the first gettimeofday call and the next
one, to make sure that a time measurable by gettimeofday has elapsed.

Thanks in advance for reviewing.

Yi Xu
----------------------------------------------------------------------
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
|
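When reading the quoted error message, note that the test prints the microsecond field with a plain %d, without zero padding, so "1172499241.27215" stands for 27215 usec past the second. A minimal sketch, assuming a hypothetical helper format_timeval that is not part of LTP, prints a struct timeval with the microseconds padded to six digits:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical helper (not LTP code): write "seconds.microseconds" into buf,
 * with the microsecond field zero-padded to six digits.
 * Returns the snprintf() result: the number of characters the full text
 * needs, or a negative value on an output error. */
int format_timeval(const struct timeval *tv, char *buf, size_t len)
{
    assert(tv != NULL && buf != NULL);
    /* The casts avoid depending on the platform-specific widths of
     * time_t and suseconds_t. */
    return snprintf(buf, len, "%lld.%06ld",
                    (long long)tv->tv_sec, (long)tv->tv_usec);
}
```

With tv_sec = 1172499241 and tv_usec = 27215 this writes "1172499241.027215", so the two values in the message are visibly 1 usec apart.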
From: Mike M. <mme...@si...> - 2007-02-27 16:22:09
|
Prior to last weekend, I'd run LTP full for no more than 20 repetitions.
Last weekend, I set -t to 60h. When I looked at the output on Monday,
I noticed, first, that 518 runs of LTP full had completed and, second,
that the number of FAILs was roughly seven times that expected for a
single run on my system. So I set about locating the source of the
increase. I wrote a script converting the human readable log file into
a CSV file that I could chart in Open Office. The chart was revealing
and still puzzling. About run 330, the number of FAILs started to climb
and hit 148 about run 346 and held there for the remainder of the 60
hours. I checked the detailed output but only found that the return
code on the failing tests had changed. There was no test output to help
understand the change.

My questions:

Has anyone seen this behavior before?
Has anyone done test runs this long?
Is there some means of getting more information about the test failures?

The launch command (I used runltp to get here):

COMMAND: ltp-full-20061222/pan/pan -e -S -x 6 -O /tmp/ltp-20642 -t 60h -a 20642 -n 20642 -p -f /tmp/ltp-20642/alltests -l <PATH>/results/ltp_full_20061222_070223144637.log -o <PATH>/results/ltp_full_20061222_070223144637.out

# uname -a
Linux gsrv027 2.6.15-sc-lustre-1.6b7-devo #1 SMP Thu Feb 22 11:42:19 EST 2007 x86_64 AMD Opteron(tm) Processor 248 AuthenticAMD GNU/Linux

The system runs a modified Gentoo kernel on a Lustre (from CFS) file system.

Thanks for any direction you can provide,

Mike Melendez
Test Lead
SiCortex
Maynard, MA
|
From: Patrick K. <pk...@su...> - 2007-02-27 17:59:32
|
Hey,

> Prior to last weekend, I'd run LTP full for no more than 20 repetitions.
> Last weekend, I set -t to 60h. When I looked at the output on Monday,
> I noticed, first, that 518 runs of LTP full had completed and, second,
> that the number of FAILs was roughly seven times that expected for a
> single run on my system.

Can you list the fail(ing|ed) testcase names?

> So I set about locating the source of the
> increase. I wrote a script converting the human readable log file into
> a CSV file that I could chart in Open Office. The chart was revealing

You can use CTCS (http://sourceforge.net/projects/va-ctcs/) for
producing better (human-) readable logfiles.

> and still puzzling. About run 330, the number of FAILs started to climb
> and hit 148 about run 346 and held there for the remainder of the 60
> hours. I checked the detailed output but only found that the return
> code on the failing tests had changed. There was no test output to help
> understand the change.
>
> My questions:
>
> Has anyone seen this behavior before?

Some similar behavior.
Can you check for testcases that did not finish? I have seen several
testcases that are buggy and did not finish, so they ran longer than the
maximum specified lifetime.

> Has anyone done test runs this long?

The tests I run are usually 24h long.

> Is there some means at getting more information about the test failures?
>

Greetings,
--
Patrick Kirsch - Quality Assurance Department
SUSE Linux Products GmbH
GF: Markus Rex, HRB 16746 (AG Nuernberg)
|
From: Mike M. <mme...@si...> - 2007-02-27 18:33:16
Attachments:
tryit-endfails.txt
|
Thanks Patrick,

Patrick Kirsch wrote:
> Can you list the fail(ing|ed) testcase names?

I've attached the list of 148 tests that are all failing every time as
of run 346 and continue failing to the end of the run.

>>Has anyone seen this behavior before?
>
> Some similar behavior.
> Can you check for testcases that did not finish, e.g. i have seen
> several testcases that (are buggy and) did not finish, so they ran
> longer than the max specified lifetime.

No tests are left in the process table. Anyplace else I can look?

> You can use CTCS (http://sourceforge.net/projects/va-ctcs/) for
> producing better (human-) readable logfiles.

Are you using 1.3.0 or 1.3.1pre1?

Am I correct in assuming that the tests would have to be migrated from
the pan harness to get the better logfiles?

Mike Melendez
Test Lead
SiCortex
Maynard, MA
|
From: Nate S. <nat...@re...> - 2007-03-04 04:38:45
|
On Tue, Feb 27, 2007 at 11:21:54AM -0500, Mike Melendez wrote:
> and still puzzling. About run 330, the number of FAILs started to climb
> and hit 148 about run 346 and held there for the remainder of the 60
> hours. I checked the detailed output but only found that the return

Did the file system fill up?

Nate
|
From: Mike M. <mme...@si...> - 2007-03-05 21:06:58
|
Last week I began experimenting with the background load generation
available in the runltp and runltplite.sh scripts, beginning with memory
load, i.e. -m. My experiments showed that requesting a memory load of
anything more than 1MB grew to consume all memory and all CPU time, by
proliferating processes that allocated and deallocated memory. Larger
sizes did so faster than smaller ones. Needless to say, this made
background memory load of little value, so I investigated.

First discovery: all load is generated by the OSS Stress program, renamed
genload within the LTP scripts.

Second discovery: the version of Stress used by LTP (070228 as well as
061222) is 0.17pre11, while the currently available version is 0.18.9.
I downloaded the latest Stress for comparison.

Third discovery: the latest Stress does not allow the value of stress --vm
used by LTP, namely 0. The Stress parameter --vm specifies the number of
processes that generate memory load. The argument 0 used an algorithm that
the author of Stress apparently decided not to continue to use.

So I replaced stress --vm 0 with stress --vm 1 within LTP when the -m
parameter is used, and am trying that now. My thought is that CPU loading
already uses process multiplication, so that capability is not needed in
memory loading.

Unless I am misunderstanding something, I recommend that --vm in LTP be
set to 1. Here are context diffs for the launch scripts.
----------
Index: runltplite.sh
===================================================================
--- runltplite.sh (revision 28286)
+++ runltplite.sh (working copy)
@@ -157,7 +157,7 @@
     m)
         MEMSIZE=$(($OPTARG * 1024 * 1024))
-        $LTPROOT/testcases/bin/genload --vm 0 --vm-bytes $MEMSIZE \
+        $LTPROOT/testcases/bin/genload --vm 1 --vm-bytes $MEMSIZE \
             >/dev/null 2>&1 &
         GENLOAD=1;;
----------

----------
Index: runltp
===================================================================
--- runltp (revision 28284)
+++ runltp (working copy)
@@ -176,7 +176,7 @@
     m)
         MEMSIZE=$(($OPTARG * 1024 * 1024))
-        $LTPROOT/testcases/bin/genload --vm 0 --vm-bytes $MEMSIZE \
+        $LTPROOT/testcases/bin/genload --vm 1 --vm-bytes $MEMSIZE \
             >/dev/null 2>&1 &
         GENLOAD=1;;
----------

I also recommend updating to the most recent version of Stress.

Mike Melendez
Test Lead
SiCortex
3 Clock Tower Place
Suite 210
Maynard, MA 01754
|
From: Michael R. <mr...@us...> - 2007-02-27 23:44:53
|
Patch Applied Thanks Michael |
From: Patrick K. <pk...@su...> - 2007-02-28 08:42:22
|
Hey,

> I've attached the list of 148 tests that are all failing every time as
> of run 346 and continue failing to the end of the run.

Interesting, your listed tests are different from my "candidates"
(testcases which show random or unexpected behavior).

>
> No tests are left in the process table. Anyplace else I can look?

I guess you already looked at syslog, for example. In the worst case, some
testcases segfault and disturb other ones.

>
>> You can use CTCS (http://sourceforge.net/projects/va-ctcs/) for
>> producing better (human-) readable logfiles.
>
> Are you using 1.3.0 or 1.3.1pre1?

I'm using version 1.3.0.

>
> Am I correct in assuming that the tests would have to be migrated from
> the pan harness to get the better logfiles?

Ctcs2 does not migrate testcases from pan; pan is still used. Ctcs2
creates a system snapshot (rpm files, kernel version ..) and observes
the execution of the testcases.

Furthermore, your command:

COMMAND: ltp-full-20061222/pan/pan -e -S -x 6 -O /tmp/ltp-20642 -t 60h -a 20642 -n 20642 -p -f /tmp/ltp-20642/alltests

You ran ltp 6 times in parallel (-x 6). Maybe you should try running only
one instance of ltp at a time; running ltp multiple times at once can end
in unpredictable results.

Greetings,
--
Patrick Kirsch - Quality Assurance Department
SUSE Linux Products GmbH
GF: Markus Rex, HRB 16746 (AG Nuernberg)
|
From: Mike M. <mme...@si...> - 2007-02-28 13:12:47
|
Patrick Kirsch wrote:
> Hey,
>
>>I've attached the list of 148 tests that are all failing every time as
>>of run 346 and continue failing to the end of the run.
>
> interesting, your listed tests are other ones from my "candidates"
> (testcases which show random or unexpected behavior).
>
>>No tests are left in the process table. Anyplace else I can look?
>
> I guess you already looked up syslog for example, in the worst case some
> testcases creates segfault and disturb other ones.
>
>>>You can use CTCS (http://sourceforge.net/projects/va-ctcs/) for
>>>producing better (human-) readable logfiles.
>>
>>Are you using 1.3.0 or 1.3.1pre1?
>
> I'm using version 1.3.0.
>
>>Am I correct in assuming that the tests would have to be migrated from
>>the pan harness to get the better logfiles?
>
> Ctcs2 does not migrate testcases from pan, pan is still used. Ctcs2
> creates a system snapshot (rpm files, kernel version ..) and observes
> the execution of the testcases
>
> Furthermore your command:
> COMMAND: ltp-full-20061222/pan/pan -e -S -x 6 -O /tmp/ltp-20642 -t 60h
> -a 20642 -n 20642 -p -f /tmp/ltp-20642/alltests
> You ran ltp 6 times parallel (-x 6), maybe you should try to run only
> one instance of ltp at one time, if you ran ltp multiple times at once,
> that can end in unpredictable results.

I appreciate these bits of advice very much. Though I am concerned
about the final one. If I understand correctly, LTP is unstable when
multiple tests are run in parallel. Is that right? If so, is that also
true when you run with random order?

Mike Melendez
Test Lead
SiCortex
Maynard, MA
|
From: Helge D. <de...@gm...> - 2007-03-17 11:44:17
|
Michael Reed wrote:
> Patch Applied
>
> Thanks
> Michael

To be honest, I think this patch is wrong and should be reverted.

On Yi's machine, the second of two consecutive calls to gettimeofday()
returned a time lower than the first. This is a kernel bug, and adding a
nanosleep() just hides the real bug in the kernel.

gettimeofday() must always return monotonically increasing time values (or
at least the same value if the resolution is too low), and if it doesn't,
this has to be fixed in the Linux kernel. Just search the Linux kernel
mailing list archives for the keywords "TSC", "clocksource" or "TSC
unstable" and you will find a lot of work that was done to keep the values
correct.

Many applications really depend on stable monotonic clock values from
gettimeofday().

Another point:
The "if"-clause itself is wrong as well.
In my opinion it should be:

Index: gettimeofday02.c
===================================================================
RCS file: /cvsroot/ltp/ltp/testcases/kernel/syscalls/gettimeofday/gettimeofday02.c,v
retrieving revision 1.5
diff -u -p -r1.5 gettimeofday02.c
--- gettimeofday02.c 27 Feb 2007 23:43:21 -0000 1.5
+++ gettimeofday02.c 17 Mar 2007 11:40:09 -0000
@@ -109,9 +109,9 @@ int main(int ac, char **av)
 		}
 
 		gettimeofday(&tv2,NULL);
-		if ( (tv2.tv_usec < tv1.tv_usec) &&
-		     (tv2.tv_sec <= tv1.tv_sec)
-		   ) {
+		if ( (tv2.tv_sec < tv1.tv_sec) ||
+		     (tv2.tv_sec == tv1.tv_sec) && (tv2.tv_usec < tv1.tv_usec)
+		   ) {
 			tst_resm(TFAIL, "Time is going backwards (old %d.%d vs new %d %d!",tv1.tv_sec,tv1.tv_usec,tv2.tv_sec,tv2.tv_usec);
 			cleanup();
 			exit(1);
|
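The corrected condition in the patch is a lexicographic comparison of (tv_sec, tv_usec). A minimal sketch, assuming two hypothetical helpers named went_backwards_fixed and went_backwards_old (neither is LTP code), puts the fixed and the original conditions side by side so they can be compared on concrete values:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper (not LTP code): the fixed check, written as a
 * lexicographic comparison of (tv_sec, tv_usec). Returns 1 if "cur" is
 * strictly earlier than "prev", 0 otherwise. */
int went_backwards_fixed(const struct timeval *prev, const struct timeval *cur)
{
    assert(prev != NULL && cur != NULL);
    if (cur->tv_sec != prev->tv_sec)
        return cur->tv_sec < prev->tv_sec;
    return cur->tv_usec < prev->tv_usec;
}

/* Hypothetical helper (not LTP code): the condition the test used before
 * the fix, kept only for comparison. */
int went_backwards_old(const struct timeval *prev, const struct timeval *cur)
{
    assert(prev != NULL && cur != NULL);
    return (cur->tv_usec < prev->tv_usec) && (cur->tv_sec <= prev->tv_sec);
}
```

For prev = 10 s + 100 usec and cur = 9 s + 500 usec, the old condition reports nothing although the clock moved back by almost a second, while the fixed one flags it. Whenever the old condition fires the fixed one fires too, so the old check could only miss regressions, not report false ones.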
From: Mike F. <va...@ge...> - 2007-03-19 07:09:55
|
On Saturday 17 March 2007, Helge Deller wrote:
> To be honest, I think this patch is wrong and should be reverted.

agreed, done so

> Another point:
> The "if"-clause itself is wrong as well.
> In my opinion it should be:

indeed, added to cvs

-mike
|
From: Yi Xu <yx...@su...> - 2007-03-22 11:09:13
|
On Saturday 17 March 2007 12:44, Helge Deller wrote:
> Michael Reed wrote:
> > Patch Applied
> >
> > Thanks
> > Michael
>
> To be honest, I think this patch is wrong and should be reverted.
>
> On Yi's machine the second call of two consecutive calls to gettimeofday(),
> the second one returned a time which was lower than the first call. This is
> a kernel bug and adding a nanosleep() just hides the real bug in the
> kernel.

Thanks for pointing it out.
I have talked with my colleagues, and they pointed out that it is a
hardware issue (jitter) rather than a testcase or kernel issue. Sorry I
didn't give an update here.

Yi

>
> gettimeofday() has to always return monotonic growing (or at least the same
> value if the resolution is too low) time values, and if it doesn't this has
> to be fixed in the Linux kernel. Just search the Linux kernel mailing
> archives for the keywords "TSC", "clocksource" or "TSC unstable" and you
> will find a lot of work which was done to keep the values correct.
>
> Many applications really depend on stable monotonic clock values by
> gettimeofday().
>
> Another point:
> The "if"-clause itself is wrong as well.
> In my opinion it should be:
>
> Index: gettimeofday02.c
> ===================================================================
> RCS file: /cvsroot/ltp/ltp/testcases/kernel/syscalls/gettimeofday/gettimeofday02.c,v
> retrieving revision 1.5
> diff -u -p -r1.5 gettimeofday02.c
> --- gettimeofday02.c 27 Feb 2007 23:43:21 -0000 1.5
> +++ gettimeofday02.c 17 Mar 2007 11:40:09 -0000
> @@ -109,9 +109,9 @@ int main(int ac, char **av)
>  		}
>
>  		gettimeofday(&tv2,NULL);
> -		if ( (tv2.tv_usec < tv1.tv_usec) &&
> -		     (tv2.tv_sec <= tv1.tv_sec)
> -		   ) {
> +		if ( (tv2.tv_sec < tv1.tv_sec) ||
> +		     (tv2.tv_sec == tv1.tv_sec) && (tv2.tv_usec < tv1.tv_usec)
> +		   ) {
>  			tst_resm(TFAIL, "Time is going backwards (old %d.%d vs new %d %d!",tv1.tv_sec,tv1.tv_usec,tv2.tv_sec,tv2.tv_usec);
>  			cleanup();
>  			exit(1);

--
Yi Xu --- RD-QA-Kernel
SUSE LINUX Products GmbH
Maxfeldstrasse 5, D-90409 Nuernberg
Tel: +49-911-740 53 - 607
----------------------------------------------------------------------
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
|