
HipMer v 1.0, Copyright (c) 2019, The Regents of the University of California,
through Lawrence Berkeley National Laboratory (subject to receipt of any
required approvals from the U.S. Dept. of Energy).  All rights reserved.

If you have questions about your rights to use or distribute this software,
please contact Berkeley Lab's Innovation & Partnerships Office at  IPO@lbl.gov.

NOTICE.  This Software was developed under funding from the U.S. Department
of Energy and the U.S. Government consequently retains certain rights. As such,
the U.S. Government has been granted for itself and others acting on its behalf
a paid-up, nonexclusive, irrevocable, worldwide license in the Software to
reproduce, distribute copies to the public, prepare derivative works, and
perform publicly and display publicly, and to permit others to do so.

HipMer -- High Performance Meraculous

HipMer is a high-performance, scalable, distributed-memory version of Meraculous, a de novo genome assembler.

HipMer is a PGAS application; its main software dependencies are the UPC language and the UPC++ library, both of which use GASNet-EX for communication.
It can run on almost any system, from laptops to supercomputers, but to scale efficiently on a cluster it requires a high-speed, low-latency network interconnect such as InfiniBand.

This project is a collaboration between JGI,
NERSC and CRD, and is primarily funded by the ExaBiome project, part of the US Department of Energy (DOE) Exascale Computing Project (ECP).

Primary authors are:
Evangelos Georganas, Aydın Buluç, Steven Hofmeyr, Rob Egan and Eugene Goltsman,
with leadership, direction and advice from Kathy Yelick and Leonid Oliker.

The original Meraculous was developed by Jarrod Chapman, Isaac Ho, Eugene Goltsman,
and Daniel Rokhsar.


Building and installing

Please see the README.md for the most up-to-date instructions.


Running

To run, use the src/hipmer/run_hipmer.sh script, which requires the install
directory and a meraculous.config file:

${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous.config

There are several configuration files in test/pipeline/*.config:

Config File                   Description
meraculous-validation.config  a small validation test
meraculous-ecoli.config       E. coli dataset; easy to run on single-node systems with limited cores and memory
meraculous-chr14.config       human chromosome 14 (diploid); can be run on single-node systems, but will be slower than E. coli
meraculous-human.config       full human dataset; requires around 1 TB of memory
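
As a concrete illustration, the small E. coli test could be launched as
follows. The install path and the location of the config file are assumptions;
adjust them to match your installation:

  # Illustrative sketch (paths are assumptions, not part of the documented interface)
  export PREFIX=$HOME/hipmer-install                       # HipMer install directory
  cp /path/to/hipmer-source/test/pipeline/meraculous-ecoli.config .
  ${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous-ecoli.config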

For convenience, there are run scripts in the .<platform>_deploy directories
that make it easier to run jobs. For systems like Edison, these scripts can be
submitted directly to the job queue (with the queue, mppwidth and wall time
options overridden as needed). For .generic_deploy, the scripts can be executed
directly. The scripts expect the data for the tests to be in hipmer_<name>_data,
where <name> is 'ecoli', 'validation', 'human' or 'chr14'. If a dataset doesn't
exist, the script will download and install it (with a stripe of 72 on Edison).
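
For example, an Edison run script might be submitted roughly as follows; the
qsub invocation, resource values and script name are illustrative, not a
documented interface:

  # Illustrative sketch: submit a deploy script on Edison, overriding
  # the queue, mppwidth and wall time (all values are assumptions)
  qsub -q regular -l mppwidth=48,walltime=02:00:00 .edison_deploy/<run_script>.sh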

The run scripts automatically set the CORES_PER_NODE and THREADS variables. On
Edison, the number of threads is determined from the number of processors found
at runtime, and CORES_PER_NODE is fixed at 24. For the generic scripts, the
number of threads defaults to all those available on the single node, and
CORES_PER_NODE is set to the same value. You can override these values, but
make sure to set CORES_PER_NODE appropriately if you change anything.
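
For instance, the values could be overridden before invoking a run script (the
script name below is a placeholder):

  # Illustrative sketch: override the automatically detected values
  export CORES_PER_NODE=16
  export THREADS=16
  ./.generic_deploy/<run_script>.sh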

The pipeline can also be run with all the intermediate per-thread files, plus
the FASTQ inputs, in shared memory (/dev/shm). Set the environment variable
USE_SHM=1 to achieve this. Some scripts already enable the shared memory
option, e.g. .edison_deploy/test_hipmer_human-edison-shm.sh. Using shared
memory will be faster, especially at larger concurrencies, but requires a lot
more memory (around 1 TB for human).
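
A minimal sketch of such a run, assuming the same invocation as above:

  # Illustrative sketch: keep intermediates and FASTQ inputs in /dev/shm
  export USE_SHM=1
  ${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous.config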

Before launching run_hipmer.sh, some settings can be changed through
environment variables:

Physical memory properties:
  PHYS_MEM_MB=413135

The number of UPC threads in the job (MPI will not use hyperthreads):
  THREADS=20 HYPERTHREADS=1

The run directory in which to place all the files (it will be created if necessary):
  RUNDIR=/global/homes/r/regan/workspace/hipmeraculous/cori09-116095-20160309-160941
  DATADIR=
  LFS_STRIPE=72

GASNet properties (by default 80% of physical memory):
  GASNET_PHYSMEM_MAX=413135MB
  GASNET_PHYSMEM_NOPROBE=1

UPC properties:
  UPC_SHARED_HEAP_MB=15500 (leave unset to use 80% of the node memory)
  UPC_PTHREADS=
  UPC_PTHREADS_OPT=

HipMer options (will override config file defaults):
  UFX_HLL=0
  MIN_DEPTH_CUTOFF=0 # use 0 for auto-detect after UFX generation
  BUBBLE_MIN_DEPTH_CUTOFF=
  KMER_LENGTH=
  MIN_CONTIG_LENGTH=
  NUM_ALIGNMENTS=192000
  ILLUMINA_VERSION=
  HIPMER_SW_CACHE=6144
  NO_CONCAT=
  CAN_SPLIT_JOB=0
  ONO_ACTIVE_THREADS=20

MPI/UPC environment:
  MPIRUN=mpirun -n 20
  UPCRUN=upcrun -shared-heap=15500M -n

Note: the Illumina Version is automatically detected, and will be reported in
the output for the run. To override, set the ILLUMINA_VERSION environment
variable.
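
As an illustrative sketch (the values below are assumptions, not
recommendations), several of these variables could be set before launching:

  # Illustrative sketch: override a few settings before launching
  export THREADS=20
  export KMER_LENGTH=51           # assumed example value
  export MIN_DEPTH_CUTOFF=0       # 0 = auto-detect after UFX generation
  ${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous.config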

Some features of run_hipmer.sh:

  • Should detect and run the alternate diploid workflow if set in the config
    file
  • Runs the proper set and ordering of splinter, spanner, bmaToLinks and oNo
    for all libraries based on the config file
  • Organizes inputs by library, finds the files specified in the config file
  • Calls the proper version of binaries (KMER_LENGTH & READ_LENGTH)
  • Logs all commands and timings in timings.log
  • Aborts on error; on a rerun, continues from the first failed step
  • Validates that outputs are generated
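
For instance, the logged commands and per-stage timings can be inspected after
a run (the path layout is an assumption):

  # Illustrative sketch: review command and stage timings after a run
  cat ${RUNDIR}/timings.log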

To rerun specific stages in the pipeline, first delete the .log file for the
stage or stages, and then execute:

export RUNDIR=<name of output dir>
${PREFIX}/bin/run_hipmer.sh <install_path> <output_dir>/<config_file>
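
For example, to force a single stage to rerun (the log file name below is
hypothetical; delete the actual log file of the stage you want to repeat):

  # Illustrative sketch: rerun from the merAligner stage onward
  export RUNDIR=<name of output dir>
  rm ${RUNDIR}/merAligner*.log        # assumed log name for the stage to rerun
  ${PREFIX}/bin/run_hipmer.sh ${PREFIX} ${RUNDIR}/<config_file>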

Workflow

The HipMer workflow is controlled by the configuration file, where the
libraries are specified. For each library, you can specify which round of oNo
to use it in, and whether or not to use it for splinting. The workflow is as
follows (see run_hipmer.sh for details):

  1. (prepare the input FASTQ files)
    1. they must be uncompressed
    2. they should be striped for efficient parallel access
  2. prepare meraculous.config
  3. ufx
  4. contigs
  5. contigMerDepth
  6. if diploid:
    1. contigEndAnalyzer
    2. bubbleFinder
  7. (optionally upc_canonical_assembly: canonical_contigs.fa)
  8. for each library:
    1. merAligner
    2. splinter (if specified in the config file)
  9. for each oNoSetId:
    1. for each library in that oNoSetId:
      1. merAlignerAnalyzer (histogrammer)
      2. spanner
    2. bmaToLinks
    3. merger
    4. for each oNo run, choose a -p:
      1. oNo
    5. splitter
  10. gapclosing
  11. upc_canonical_assembly: final_assembly.fa

This means that the first round of bmaToLinks could end up processing the
outputs from multiple iterations of splinter plus multiple iterations of
spanner. Subsequent calls to bmaToLinks will only process outputs from spanner.