|
|
|
@550
|
[550]
|
4 years |
sonicsoft70 |
|
|
updated docs for cbench 1.2.1 from the new trac site
http://apps.sourceforge.net/trac/cbench
|
|
|
|
@549
|
[549]
|
4 years |
sonicsoft70 |
|
|
expected release date
|
|
|
|
@548
|
[548]
|
4 years |
sonicsoft70 |
|
|
1.2.1 prep
|
|
|
|
@547
|
[547]
|
4 years |
sonicsoft70 |
|
|
prepping for cbench 1.2.1 release
|
|
|
|
@546
|
[546]
|
4 years |
sonicsoft70 |
|
|
Add some basic stats during cbench_start_jobs.pl
|
|
|
|
@545
|
[545]
|
4 years |
sonicsoft70 |
|
|
add a status line at the end of generation
|
|
|
|
@544
|
[544]
|
4 years |
sonicsoft70 |
|
|
starting to prepare for cbench 1.2.1
|
|
|
|
@541
|
[541]
|
4 years |
sonicsoft70 |
|
|
new job in the LAMMPS testset, rhodolong.scaled, which is a
work in progress
|
|
|
|
@540
|
[540]
|
4 years |
sonicsoft70 |
|
|
found that the lammps eam.scaled and lj.scaled job templates
were blowing up memory. tweak the scaling factors for them.
add a new lammps job template rhodolong.scaled which will
be a long-running job (24-96 hours targeted) with restart file
dumping and restarting the job from a restart file within
the same batch job
tweak some debug output levels
|
|
|
|
@539
|
[539]
|
4 years |
sonicsoft70 |
|
|
install the lammps testset 'bench' directory, which all the
input decks and data files are referenced from, using
the explicitly version-tracked openapps/lammps/bench
directory instead of the dynamic lammps tarball
|
|
|
|
@538
|
[538]
|
4 years |
sonicsoft70 |
|
add cbench_runin_tempdir() function that job scripts can
use to run in an isolated unique directory. real apps need
to do this more often than not.
|
|
|
|
@536
|
[536]
|
4 years |
sonicsoft70 |
|
|
when --customparse mode is enabled, keep a hash summarizing
the various matches (some of which will likely be repeated) and
print out a section with the summary information. For example:
Customparse Matches Summary:
'FORRTL: error 78, process killed via SIGTERM' => 4 matches
'OMPI says orterun killing job' => 15 matches
'SLURM JOB 975 NODE FAILURE' => 1 matches
'SLURM JOB WALLTIME EXCEEDED' => 16 matches
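A minimal sketch of the summary bookkeeping described above, written in Python rather than the Perl that Cbench itself uses. The function names `summarize_matches` and `format_summary` are hypothetical and not part of Cbench; the output layout only imitates the example in this log message.

```python
from collections import Counter


def summarize_matches(matches):
    """Count how often each customparse match string occurred."""
    return Counter(matches)


def format_summary(summary):
    """Render the counts in a layout similar to the example above."""
    lines = ["Customparse Matches Summary:"]
    for text in sorted(summary):
        lines.append(f"'{text}' => {summary[text]} matches")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative input only; real matches would come from the parse run.
    seen = [
        "SLURM JOB WALLTIME EXCEEDED",
        "OMPI says orterun killing job",
        "SLURM JOB WALLTIME EXCEEDED",
    ]
    print(format_summary(summarize_matches(seen)))
```

Keying the hash on the match text means repeated matches across many output files collapse into one line each, which is the point of the summary section.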
|
|
|
|
@535
|
[535]
|
4 years |
sonicsoft70 |
|
|
Catch a LAMMPS memory allocation failure
|
|
|
|
@534
|
[534]
|
4 years |
sonicsoft70 |
|
|
Simple utility script to remove output files from jobs
that had ERROR states so they no longer show up in output
parsing. The --force parameter must be given to actually
delete anything.
Usage looks like:
bandwidth_output_parse.pl --ident ompi13-intel11 --diag --report | cbench_rm_failed_jobs.pl
For example:
[n280 cbench-test]$ cbench_output_parse.pl --meta --nodata --diag --report --custom | cbench_rm_failed_jobs.pl
Would remove: lammps/ompi13-intel11/lj.scaled-1ppn-1/slurm.o856
Would remove: lammps/ompi13-intel11/eam.scaled-1ppn-1/slurm.o854
Would remove: lammps/ompi13-intel11/eam.scaled-1ppn-2/slurm.o867
Would remove: lammps/ompi13-intel11/eam.scaled-1ppn-8/slurm.o888
Would remove: lammps/ompi13-intel11/lj.scaled-1ppn-9/slurm.o911
.
.
.
Without piping into cbench_rm_failed_jobs.pl, the parse output looks like:
[n280 cbench-test]$ cbench_output_parse.pl --meta --nodata --diag --report --custom
..........**DIAG**(lammps/ompi13-intel11/lj.scaled-1ppn-1/slurm.o856) had a ERROR with status STARTED
**DIAG**(lammps/ompi13-intel11/eam.scaled-1ppn-1/slurm.o854) had a ERROR with status STARTED
**DIAG**(lammps/ompi13-intel11/eam.scaled-1ppn-2/slurm.o867) had a ERROR with status STARTED
.**DIAG**(lammps/ompi13-intel11/eam.scaled-1ppn-8/slurm.o888) had a ERROR with status STARTED
**DIAG**(lammps/ompi13-intel11/lj.scaled-1ppn-9/slurm.o911) had a ERROR with status STARTED
**DIAG**(lammps/ompi13-intel11/eam.scaled-1ppn-9/slurm.o912) had a ERROR with status STARTED
.
.
.
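The filtering step can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual Perl script: the names `failed_files` and `remove_failed` are hypothetical, the regular expression is inferred from the DIAG lines shown in this log, and the "Removed:" wording for the forced case is an assumption (only "Would remove:" appears above).

```python
import os
import re

# Inferred from the DIAG lines in this log; tolerates "had a ERROR" and "had ERROR".
DIAG_RE = re.compile(r"\*\*DIAG\*\*\((.+?)\) had (?:an? )?ERROR")


def failed_files(lines):
    """Yield the output file path from every DIAG line that reports an ERROR."""
    for line in lines:
        for match in DIAG_RE.finditer(line):
            yield match.group(1)


def remove_failed(lines, force=False):
    """Return one action string per failed job; delete files only when forced."""
    actions = []
    for path in failed_files(lines):
        if force:
            if os.path.exists(path):
                os.remove(path)
            actions.append(f"Removed: {path}")  # assumed wording
        else:
            actions.append(f"Would remove: {path}")
    return actions


if __name__ == "__main__":
    import sys
    for action in remove_failed(sys.stdin, force="--force" in sys.argv[1:]):
        print(action)
```

Defaulting to a dry run keeps the destructive path opt-in, matching the --force behaviour described in the message.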
|
|
|
|
@533
|
[533]
|
4 years |
sonicsoft70 |
|
|
didn't mean to change the layout of the DIAG line...
|
|
|
|
@532
|
[532]
|
4 years |
sonicsoft70 |
|
|
Since Slurm spools job stdout/stderr output continually into the
slurm.oNNNN files, the Cbench output parsing structure appears to have
phantom jobs that are in an ERROR state that later disappear. This
is because the output parser is looking at output from a live job.
Torque/PBS does not behave this way because the .oNNNN file does not
show up until the job has completed.
To deal with this intelligently, if cbench_output_parse.pl finds itself
parsing output files from Slurm batch jobs, it will call the slurm_query()
subroutine once to cache the state of Slurm jobs. Then if the parse
module for a job returns an ERROR status of some sort, the job is cross-
referenced against jobs known to be running in Slurm. If the job is
running according to the cached Slurm data, the job is flagged as RUNNING
and not as an error. For example, here is a snippet from an output parse
run with running jobs:
**DIAG**(lammps/ompi13-intel11/eam.scaled-4ppn-100/slurm.o1012) had ERROR with status STARTED
.**DIAG**(qcd/ompi13-intel11/qcd-4ppn-4/slurm.o1137) is still RUNNING
**DIAG**(cth/ompi13-intel11/amr3doblique-1ppn-1/slurm.o1037) had ERROR with status FATALERROR
**DIAG**(cth/ompi13-intel11/rsrl-1ppn-1/slurm.o1038) had ERROR with status STARTED
**PARSEMATCH**(cth/ompi13-intel11/rsrl-1ppn-1/slurm.o1038) => SLURM JOB WALLTIME EXCEEDED
**PARSEMATCH**(cth/ompi13-intel11/rsrl-1ppn-1/slurm.o1038) => OMPI says orterun killing job
**DIAG**(cth/ompi13-intel11/amr3doblique-4ppn-8/slurm.o1064) is still RUNNING
slurm_query() was updated a bit to cache Jobid data as well as job name.
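The cross-referencing logic can be sketched as below. This is a Python illustration, not the Perl in cbench_output_parse.pl: `jobid_from_output_file` and `classify` are hypothetical names, and the set of running job IDs stands in for whatever slurm_query() caches.

```python
import re

# Slurm output files in this log are named slurm.oNNNN, where NNNN is the job ID.
JOBID_RE = re.compile(r"slurm\.o(\d+)$")


def jobid_from_output_file(path):
    """Extract the Slurm job ID from an output file path, or None."""
    match = JOBID_RE.search(path)
    return int(match.group(1)) if match else None


def classify(path, status, is_error, running_jobids):
    """Return a DIAG-style line for one output file, or None if nothing to report.

    running_jobids is assumed to be a set built once from a single Slurm query,
    so every file is checked against the same cached snapshot.
    """
    jobid = jobid_from_output_file(path)
    if is_error and jobid is not None and jobid in running_jobids:
        return f"**DIAG**({path}) is still RUNNING"
    if is_error:
        return f"**DIAG**({path}) had ERROR with status {status}"
    return None


if __name__ == "__main__":
    cached_running = {1137, 1064}  # illustrative snapshot
    print(classify("qcd/ompi13-intel11/qcd-4ppn-4/slurm.o1137",
                   "STARTED", True, cached_running))
```

Querying once and reusing the snapshot avoids calling the scheduler for every output file, at the cost of a small window in which a job could finish after the snapshot was taken.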
|
|
|
|
@531
|
[531]
|
4 years |
sonicsoft70 |
|
|
make the slurm job cancelled regex catch newer and older
syntax
|
|
|
|
@530
|
[530]
|
4 years |
sonicsoft70 |
|
|
look for 'Elapsed time' as the end of a sweep job.
the fortran stop looks to be compiler dependent as i
don't see it with intel 11.0
|
|
|
|
@529
|
[529]
|
4 years |
sonicsoft70 |
|
|
help output was ordered poorly
|
|
|
|
@528
|
[528]
|
4 years |
sonicsoft70 |
|
|
add --maxnodes, --minnodes, --nodes options
|
|
|
|
@527
|
[527]
|
4 years |
sonicsoft70 |
|
|
bugfix
|
|
|
|
@526
|
[526]
|
4 years |
braithr |
|
|
Change bonnie++ so the Makefile dynamically downloads and builds the program, similar to other Cbench tests
|
|
|
|
@525
|
[525]
|
4 years |
sonicsoft70 |
|
|
add --minnodes, --maxnodes, --nodes command line options
|
|
|
|
@524
|
[524]
|
4 years |
braithr |
|
|
Get rid of "make[1]: *** No rule to make target `distclean'. Stop." errors by making distclean targets where none existed. All of the distclean targets just point to their clean target for now.
|
|
|
|
@523
|
[523]
|
4 years |
braithr |
|
|
First-pass addition of High-Performance Linpack 2.0 to Cbench.
|
|
|
|
@521
|
[521]
|
4 years |
sonicsoft70 |
|
|
alias_spec() must return undef and not empty string
if it does not want to provide any aliases
|
|
|
|
@520
|
[520]
|
4 years |
sonicsoft70 |
|
|
adding support for mpiBench collective benchmark
from LLNL Phloem benchmarks
added three mpibench jobs to the collective testset
|
|
|
|
@519
|
[519]
|
4 years |
sonicsoft70 |
|
|
added support for LLNL Sequoia message rate benchmark
named SQMR from the Phloem benchmarks
added a sqmr job to the BANDWIDTH testset
|
|
|
|
@518
|
[518]
|
4 years |
sonicsoft70 |
|
|
bugfixes
message rate output wasn't getting parsed right
|
|
|
|
@517
|
[517]
|
4 years |
sonicsoft70 |
|
|
make the interactive mode JOBID generation more
resistant to collisions
|
|
|
|
@516
|
[516]
|
4 years |
sonicsoft70 |
|
|
'com' output parse module updated to deal with the
com version found in the Phloem 1.0.0 benchmark
updated the com job template in the bandwidth testset
for the newer com version usage
added a com job template in the latency testset
for the newer com version which can do latency as well
|
|
|
|
@515
|
[515]
|
4 years |
sonicsoft70 |
|
|
bugfixes
|
|
|
|
@514
|
[514]
|
4 years |
sonicsoft70 |
|
|
smarter makefileness
|
|
|
|
@513
|
[513]
|
4 years |
sonicsoft70 |
|
|
adding Phloem MPI Benchmarks v1.0.0 from ASCI Sequoia
benchmarks. has mpiBench and mpiGraph and Presta among others
|
|
|
|
@512
|
[512]
|
4 years |
sonicsoft70 |
|
|
move 'npb' down to the end of $core_testsets
|
|
|
|
@511
|
[511]
|
4 years |
sonicsoft70 |
|
|
more tweaking on the stressful IOR job templates
|
|
|
|
@510
|
[510]
|
4 years |
sonicsoft70 |
|
|
catch another test module elapsed time edge case
|
|
|
|
@509
|
[509]
|
4 years |
sonicsoft70 |
|
|
catch the iostress and iosanity jobs as well
|
|
|
|
@508
|
[508]
|
4 years |
sonicsoft70 |
|
|
change the 'doitall' target to compile correctly
|
|
|
|
@507
|
[507]
|
4 years |
sonicsoft70 |
|
|
add distclean target
|
|
|
|
@506
|
[506]
|
4 years |
sonicsoft70 |
|
|
tweak the cbench-init.{sh,csh}
|
|
|
|
@505
|
[505]
|
4 years |
sonicsoft70 |
|
|
adding a job in the Shakedown testset to put IOR
stress on a filesystem
rename job templates to be more clear on what io load
they are creating
|
|
|
|
@504
|
[504]
|
4 years |
sonicsoft70 |
|
|
tweaks to ior params and comment updates
|
|
|
|
@503
|
[503]
|
4 years |
sonicsoft70 |
|
|
distclean target
|
|
|
|
@502
|
[502]
|
4 years |
sonicsoft70 |
|
|
allow testsets to install alternate job templates or
other files properly named. Anything in
templates/TESTSETNAME_*.* will be installed into TESTSETNAME
appropriately.
|
|
|
|
@501
|
[501]
|
4 years |
sonicsoft70 |
|
|
IO testset:
- remove the Nto1 test case as a default job
- update NtoN test case params to mirror SWL setup which
randy likes
IOSANITY testset:
- update params to mirror changes from IO NtoN case except
for data scale
|
|
|
|
@500
|
[500]
|
4 years |
sonicsoft70 |
|
|
install all the header templates all the time...
|
|
|
|
@499
|
[499]
|
4 years |
braithr |
|
|
First pass at SWEEP3D output parsing
|
|
|
|
@498
|
[498]
|
4 years |
sonicsoft70 |
|
|
implemented slurm_query() so throttledbatch mode works
with slurm
|
|
|
|
@497
|
[497]
|
4 years |
sonicsoft70 |
|
|
add iotest target to just make IO testing binaries
|
|
|
|
@496
|
[496]
|
4 years |
sonicsoft70 |
|
|
honor CFLAGS from Cbench make.def
|
|
|
|
@495
|
[495]
|
4 years |
sonicsoft70 |
|
|
make the makefile smarter about configure
have stress print out how many procs it ran on
|
|
|
|
@494
|
[494]
|
4 years |
sonicsoft70 |
|
|
check for more errors from stress
|
|
|
|
@493
|
[493]
|
4 years |
sonicsoft70 |
|
|
tokensmash rises again...
|
|
|
|
@492
|
[492]
|
4 years |
sonicsoft70 |
|
|
better error checking
|
|
|
|
@491
|
[491]
|
4 years |
sonicsoft70 |
|
|
silence silly compiler warning about printf specifiers
change the msgrate reporting to be per rank
|
|
|
|
@490
|
[490]
|
4 years |
sonicsoft70 |
|
|
update help output
|
|
|
|
@489
|
[489]
|
4 years |
sonicsoft70 |
|
|
cruft
|
|
|
|
@488
|
[488]
|
4 years |
sonicsoft70 |
|
|
cosmetic cleanups
|
|
|
|
@487
|
[487]
|
4 years |
sonicsoft70 |
|
|
some bugfixes to handle mpi_request accounting better
add message rate stats
|
|
|
|
@486
|
[486]
|
4 years |
sonicsoft70 |
|
|
reinstating mpi_tokensmash since this may be
useful to me soon
|
|
|
|
@485
|
[485]
|
4 years |
braithr |
|
|
Add sweep3d installation and job generation trickery.
|
|
|
|
@483
|
[483]
|
4 years |
sonicsoft70 |
|
|
node failure message in slurm 1.3
|
|
|
|
@482
|
[482]
|
4 years |
sonicsoft70 |
|
|
handle more gracefully the case where open_and_slurp()
attempts to slurp a file that is too big to sanely parse
|
|
|
|
@481
|
[481]
|
4 years |
sonicsoft70 |
|
|
slight bug in which $status is returned, not sure
why i never noticed this before...
|
|
|
|
@471
|
[471]
|
5 years |
sonicsoft70 |
|
|
prototype code for the --usecache cache feature talked
about in ticket #13
this is not perfect yet and i think will always behave
a bit differently than the non-cached mode
|
|
|
|
@470
|
[470]
|
5 years |
sonicsoft70 |
|
|
updated for post 1.2.0 dev
|