root/trunk/cbench

Rev Chgset Date Author
(edit) @550 [550] 5 years sonicsoft70

updated docs for cbench 1.2.1 from the new trac site
http://apps.sourceforge.net/trac/cbench

(edit) @549 [549] 5 years sonicsoft70

expected release date

(edit) @548 [548] 5 years sonicsoft70

1.2.1 prep

(edit) @547 [547] 5 years sonicsoft70

prepping for cbench 1.2.1 release

(edit) @546 [546] 5 years sonicsoft70

Add some basic stats during cbench_start_jobs.pl

(edit) @545 [545] 5 years sonicsoft70

add a status line at the end of generation

(edit) @544 [544] 5 years sonicsoft70

starting to prepare for cbench 1.2.1

(edit) @541 [541] 5 years sonicsoft70

new job in the LAMMPS testset, rhodolong.scaled, which is a
work in progress

(edit) @540 [540] 5 years sonicsoft70

found that the lammps eam.scaled and lj.scaled job templates
were blowing up memory. tweak the scaling factors for them.

add a new lammps job template rhodolong.scaled which will
be a long-running job (24-96 hours targeted) that dumps restart
files and restarts itself from a restart file within
the same batch job

tweak some debug output levels

(edit) @539 [539] 5 years sonicsoft70

install the lammps testset 'bench' directory, from which all the
input decks and data files are referenced, out of
the explicitly version-tracked openapps/lammps/bench
directory instead of the dynamic lammps tarball

(edit) @538 [538] 5 years sonicsoft70

add cbench_runin_tempdir() function that job scripts can
use to run in an isolated unique directory. real apps need
to do this more often than not.
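
A minimal sketch of the idea behind running a job in an isolated unique directory, written in Python for illustration only. The real cbench_runin_tempdir() is part of Cbench and its behavior is not shown in this log; the helper name run_in_tempdir, the prefix, and the choice to restore the old working directory are all assumptions.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def run_in_tempdir(base_dir=None, prefix="cbench_job_"):
    """Create a unique directory, chdir into it, restore the old cwd on exit.

    Hypothetical helper: the name, prefix, and cleanup policy are
    assumptions, not Cbench's actual implementation.
    """
    old_cwd = os.getcwd()
    # mkdtemp guarantees a fresh, uniquely named directory
    workdir = tempfile.mkdtemp(prefix=prefix, dir=base_dir)
    os.chdir(workdir)
    try:
        yield workdir
    finally:
        # always return to the original directory, even if the job fails
        os.chdir(old_cwd)
```

A job script would wrap its application launch in `with run_in_tempdir() as workdir:` so that scratch files from concurrent jobs never collide.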

(edit) @536 [536] 5 years sonicsoft70

when --customparse mode is enabled, keep a hash summarizing
the various matches (some of which will likely be repeated) and
print out a section with the summary information. For example:

Customparse Matches Summary:


'FORRTL: error 78, process killed via SIGTERM' => 4 matches
'OMPI says orterun killing job' => 15 matches
'SLURM JOB 975 NODE FAILURE' => 1 matches
'SLURM JOB WALLTIME EXCEEDED' => 16 matches
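
The summary above amounts to a tally keyed by match text. A minimal Python sketch, for illustration only: the function names are hypothetical and the alphabetical ordering is an assumption based on the sample output, not a statement about how cbench_output_parse.pl sorts.

```python
from collections import Counter


def summarize_matches(matches):
    """Count how often each custom parse match string was seen."""
    return Counter(matches)


def format_summary(counts):
    """Render the tally in the same shape as the summary section above."""
    lines = ["Customparse Matches Summary:"]
    for label in sorted(counts):  # assumed ordering: alphabetical by match text
        lines.append(f"'{label}' => {counts[label]} matches")
    return "\n".join(lines)
```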

(edit) @535 [535] 5 years sonicsoft70

Catch a LAMMPS memory allocation failure

(edit) @534 [534] 5 years sonicsoft70

Simple utility script to remove output files from jobs
that had ERROR states so they no longer show up in output
parsing. The --force parameter must be given to actually
delete anything.

Usage looks like:

bandwidth_output_parse.pl --ident ompi13-intel11 --diag --report | cbench_rm_failed_jobs.pl

For example:
[n280 cbench-test]$ cbench_output_parse.pl --meta --nodata --diag --report --custom | cbench_rm_failed_jobs.pl
Would remove: lammps/ompi13-intel11/lj.scaled-1ppn-1/slurm.o856
Would remove: lammps/ompi13-intel11/eam.scaled-1ppn-1/slurm.o854
Would remove: lammps/ompi13-intel11/eam.scaled-1ppn-2/slurm.o867
Would remove: lammps/ompi13-intel11/eam.scaled-1ppn-8/slurm.o888
Would remove: lammps/ompi13-intel11/lj.scaled-1ppn-9/slurm.o911
.
.
.

Otherwise the output looks like:
[n280 cbench-test]$ cbench_output_parse.pl --meta --nodata --diag --report --custom
..........**DIAG**(lammps/ompi13-intel11/lj.scaled-1ppn-1/slurm.o856) had a ERROR with status STARTED


**DIAG**(lammps/ompi13-intel11/eam.scaled-1ppn-1/slurm.o854) had a ERROR with status STARTED


**DIAG**(lammps/ompi13-intel11/eam.scaled-1ppn-2/slurm.o867) had a ERROR with status STARTED


.**DIAG**(lammps/ompi13-intel11/eam.scaled-1ppn-8/slurm.o888) had a ERROR with status STARTED


**DIAG**(lammps/ompi13-intel11/lj.scaled-1ppn-9/slurm.o911) had a ERROR with status STARTED


**DIAG**(lammps/ompi13-intel11/eam.scaled-1ppn-9/slurm.o912) had a ERROR with status STARTED
.
.
.
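
A rough Python sketch of the filtering described in this entry: read the DIAG report lines, pick out the ones that report an ERROR, and only delete when forced. This is not the code of cbench_rm_failed_jobs.pl; the function names and the "Removed:" message are assumptions, and the regex is inferred solely from the sample lines shown in this log.

```python
import os
import re

# Matches both "had a ERROR" and "had ERROR" as seen in the sample output.
DIAG_ERROR = re.compile(r"\*\*DIAG\*\*\((?P<path>[^)]+)\) had (?:an? )?ERROR")


def failed_job_files(report_lines):
    """Return the output file paths that the report flags as ERROR."""
    paths = []
    for line in report_lines:
        m = DIAG_ERROR.search(line)
        if m:
            paths.append(m.group("path"))
    return paths


def remove_failed(report_lines, force=False):
    """Dry-run by default; delete files only when force is True."""
    actions = []
    for path in failed_job_files(report_lines):
        if force:
            if os.path.exists(path):
                os.remove(path)
            actions.append(f"Removed: {path}")  # assumed message text
        else:
            actions.append(f"Would remove: {path}")
    return actions
```

Defaulting to a dry run mirrors the entry's point that nothing is deleted unless --force is given.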

(edit) @533 [533] 5 years sonicsoft70

didn't mean to change the layout of the DIAG line...

(edit) @532 [532] 5 years sonicsoft70

Since Slurm spools job stdout/stderr output continually into the
slurm.oNNNN files, the Cbench output parsing structure appears to have
phantom jobs that are in an ERROR state that later disappear. This
is because the output parser is looking at output from a live job.
Torque/PBS does not behave this way because the .oNNNN file does not
show up until the job has completed.

To deal with this intelligently, if cbench_output_parse.pl finds itself
parsing output files from Slurm batch jobs, it will call the slurm_query()
subroutine once to cache the state of Slurm jobs. Then if the parse
module for a job returns an ERROR status of some sort, the job is cross-
referenced against jobs known to be running in Slurm. If the job is
running according to the cached Slurm data, the job is flagged as RUNNING
and not as an error. For example, here is a snippet from an output parse
run with running jobs:


**DIAG**(lammps/ompi13-intel11/eam.scaled-4ppn-100/slurm.o1012) had ERROR with status STARTED


.**DIAG**(qcd/ompi13-intel11/qcd-4ppn-4/slurm.o1137) is still RUNNING


**DIAG**(cth/ompi13-intel11/amr3doblique-1ppn-1/slurm.o1037) had ERROR with status FATALERROR


**DIAG**(cth/ompi13-intel11/rsrl-1ppn-1/slurm.o1038) had ERROR with status STARTED
**PARSEMATCH**(cth/ompi13-intel11/rsrl-1ppn-1/slurm.o1038) => SLURM JOB WALLTIME EXCEEDED
**PARSEMATCH**(cth/ompi13-intel11/rsrl-1ppn-1/slurm.o1038) => OMPI says orterun killing job


**DIAG**(cth/ompi13-intel11/amr3doblique-4ppn-8/slurm.o1064) is still RUNNING


slurm_query() was updated a bit to cache Jobid data as well as job name.
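
The cross-referencing described in this entry can be sketched in Python as follows. This is an illustration under stated assumptions, not Cbench's Perl code: SlurmStateCache, classify, the state strings, and the shape of the query result (a dict of job id to state) are all hypothetical.

```python
import re

# Slurm output files in this log are named slurm.oNNNN, where NNNN is the job id.
SLURM_OUTPUT = re.compile(r"slurm\.o(\d+)$")


class SlurmStateCache:
    """Call the (expensive) scheduler query once and reuse the answer."""

    def __init__(self, query):
        self._query = query      # callable returning {jobid: state}, assumed shape
        self._states = None

    def states(self):
        if self._states is None:
            self._states = self._query()
        return self._states


def classify(output_file, is_error, cache):
    """Downgrade an apparent ERROR to RUNNING if the job is still live."""
    if not is_error:
        return "OK"
    m = SLURM_OUTPUT.search(output_file)
    if m and cache.states().get(m.group(1)) == "RUNNING":
        return "RUNNING"
    return "ERROR"
```

Caching matters because the parser may visit many output files in one run, and one scheduler query is enough to answer all of them.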

(edit) @531 [531] 5 years sonicsoft70

make the slurm job cancelled regex catch newer and older
syntax

(edit) @530 [530] 5 years sonicsoft70

look for 'Elapsed time' as the end of a sweep job.
the fortran stop looks to be compiler dependent as i
don't see it with intel 11.0

(edit) @529 [529] 5 years sonicsoft70

help output was ordered poorly

(edit) @528 [528] 5 years sonicsoft70

add --maxnodes, --minnodes, --nodes options

(edit) @527 [527] 5 years sonicsoft70

bugfix

(edit) @526 [526] 5 years braithr

Change bonnie++ so the Makefile dynamically downloads and builds the program, similar to other Cbench tests

(edit) @525 [525] 5 years sonicsoft70

add --minnodes, --maxnodes, --nodes command line options

(edit) @524 [524] 5 years braithr

Get rid of "make[1]: *** No rule to make target `distclean'. Stop." errors by making distclean targets where none existed. All of the distclean targets just point to their clean target for now.

(edit) @523 [523] 5 years braithr

First-pass addition of High-Performance Linpack 2.0 to Cbench.

(edit) @521 [521] 5 years sonicsoft70

alias_spec() must return undef and not empty string
if it does not want to provide any aliases

(edit) @520 [520] 5 years sonicsoft70

adding support for mpiBench collective benchmark
from LLNL Phloem benchmarks

added three mpibench jobs to the collective testset

(edit) @519 [519] 5 years sonicsoft70

added support for LLNL Sequoia message rate benchmark
named SQMR from the Phloem benchmarks

added a sqmr job to the BANDWIDTH testset

(edit) @518 [518] 5 years sonicsoft70

bugfixes

message rate output wasn't getting parsed right

(edit) @517 [517] 5 years sonicsoft70

make the interactive mode JOBID generation more
resistant to collisions

(edit) @516 [516] 5 years sonicsoft70

'com' output parse module updated to deal with the
com version found in the Phloem 1.0.0 benchmark

updated the com job template in the bandwidth testset
for the newer com version usage

added a com job template in the latency testset
for the newer com version which can do latency as well

(edit) @515 [515] 5 years sonicsoft70

bugfixes

(edit) @514 [514] 5 years sonicsoft70

smarter makefileness

(edit) @513 [513] 5 years sonicsoft70

adding Phloem MPI Benchmarks v1.0.0 from ASCI Sequoia
benchmarks. has mpiBench and mpiGraph and Presta among others

(edit) @512 [512] 5 years sonicsoft70

move 'npb' down to the end of $core_testsets

(edit) @511 [511] 5 years sonicsoft70

more tweaking on the stressful IOR job templates

(edit) @510 [510] 5 years sonicsoft70

catch another test module elapsed time edge case

(edit) @509 [509] 5 years sonicsoft70

catch the iostress and iosanity jobs as well

(edit) @508 [508] 5 years sonicsoft70

change the 'doitall' target to compile correctly

(edit) @507 [507] 5 years sonicsoft70

add distclean target

(edit) @506 [506] 5 years sonicsoft70

tweak the cbench-init.{sh,csh}

(edit) @505 [505] 5 years sonicsoft70

adding a job in the Shakedown testset to put IOR
stress on a filesystem

rename job templates to be more clear on what io load
they are creating

(edit) @504 [504] 5 years sonicsoft70

tweaks to ior params and comment updates

(edit) @503 [503] 5 years sonicsoft70

distclean target

(edit) @502 [502] 5 years sonicsoft70

allow testsets to install alternate job templates or
other files properly named. Anything in
templates/TESTSETNAME_*.* will be installed into TESTSETNAME
appropriately.
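
As an illustration of the naming rule, here is a small Python sketch that selects which files in a templates directory would belong to a given testset under the TESTSETNAME_*.* pattern. The function name and the sample file names are hypothetical; the actual install logic lives in Cbench and is not shown in this log.

```python
import fnmatch


def templates_for_testset(filenames, testset):
    """Return the file names matching TESTSETNAME_*.* for one testset."""
    pattern = f"{testset}_*.*"
    # fnmatchcase keeps the match case-sensitive on every platform
    return sorted(n for n in filenames if fnmatch.fnmatchcase(n, pattern))
```

Because the pattern requires the underscore right after the testset name, a file meant for a testset whose name merely starts with the same letters is not picked up by mistake.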

(edit) @501 [501] 5 years sonicsoft70

IO testset:

  • remove the Nto1 test case as a default job
  • update NtoN test case params to mirror SWL setup which randy likes

IOSANITY testset:

  • update params to mirror changes from IO NtoN case except for data scale

(edit) @500 [500] 5 years sonicsoft70

install all the header templates all the time...

(edit) @499 [499] 5 years braithr

First pass at SWEEP3D output parsing

(edit) @498 [498] 5 years sonicsoft70

implemented slurm_query() so throttledbatch mode works
with slurm

(edit) @497 [497] 5 years sonicsoft70

add iotest target to just make IO testing binaries

(edit) @496 [496] 5 years sonicsoft70

honor CFLAGS from Cbench make.def

(edit) @495 [495] 5 years sonicsoft70

make the makefile smarter about configure

have stress print out how many procs it ran on

(edit) @494 [494] 5 years sonicsoft70

check for more errors from stress

(edit) @493 [493] 5 years sonicsoft70

tokensmash rises again...

(edit) @492 [492] 5 years sonicsoft70

better error checking

(edit) @491 [491] 5 years sonicsoft70

silence silly compiler warning about printf specifiers

change the msgrate reporting to be per rank

(edit) @490 [490] 5 years sonicsoft70

update help output

(edit) @489 [489] 5 years sonicsoft70

cruft

(edit) @488 [488] 5 years sonicsoft70

cosmetic cleanups

(edit) @487 [487] 5 years sonicsoft70

some bugfixes to handle mpi_request accounting better

add message rate stats

(edit) @486 [486] 5 years sonicsoft70

reinstating mpi_tokensmash since this may be
useful to me soon

(edit) @485 [485] 5 years braithr

Add sweep3d installation and job generation trickery.

(edit) @483 [483] 5 years sonicsoft70

node failure message in slurm 1.3

(edit) @482 [482] 5 years sonicsoft70

more gracefully handle the case where open_and_slurp()
attempts to slurp a file that is too big to sanely
parse
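
One way to guard a slurp against oversized files is to check the size before reading. The Python sketch below is illustrative only: the helper name slurp_if_sane, the 50 MiB limit, and returning None for oversized files are assumptions, and the log does not say how open_and_slurp() actually handles this case.

```python
import os

MAX_SLURP_BYTES = 50 * 1024 * 1024  # assumed limit, not taken from Cbench


def slurp_if_sane(path, max_bytes=MAX_SLURP_BYTES):
    """Return the file's text, or None if it is too large to parse sanely."""
    if os.path.getsize(path) > max_bytes:
        return None  # caller can report the file as unparseable
    # errors="replace" avoids crashing on stray non-UTF-8 bytes in job output
    with open(path, "r", errors="replace") as fh:
        return fh.read()
```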

(edit) @481 [481] 5 years sonicsoft70

slight bug with what $status is returned, not sure
why i never noticed this before...

(edit) @471 [471] 6 years sonicsoft70

prototype code for the --usecache cache feature talked
about in ticket #13

this is not perfect yet and i think will always behave
a bit differently than the non-cached mode

(edit) @470 [470] 6 years sonicsoft70

updated for post 1.2.0 dev
