HipMer v 1.0, Copyright (c) 2019, The Regents of the University of California,
through Lawrence Berkeley National Laboratory (subject to receipt of any
required approvals from the U.S. Dept. of Energy). All rights reserved.
If you have questions about your rights to use or distribute this software,
please contact Berkeley Lab's Innovation & Partnerships Office at IPO@lbl.gov.
NOTICE. This Software was developed under funding from the U.S. Department
of Energy and the U.S. Government consequently retains certain rights. As such,
the U.S. Government has been granted for itself and others acting on its behalf
a paid-up, nonexclusive, irrevocable, worldwide license in the Software to
reproduce, distribute copies to the public, prepare derivative works, and
perform publicly and display publicly, and to permit others to do so.
HipMer is a high-performance, scalable, distributed-memory version of Meraculous, a de novo genome assembler.
HipMer is a PGAS application; its main software dependencies are the UPC language and the UPC++ library, both of which use GASNet-EX for communication.
It can run on almost any system, from laptops to supercomputers, but on clusters it requires a high-speed, low-latency network interconnect, such as InfiniBand, to scale efficiently.

This project is a joint collaboration between JGI, NERSC and CRD, and is primarily funded by the ExaBiome project, one of the US Department of Energy (DOE) Exascale Computing Project (ECP) efforts.

Primary authors are: Evangelos Georganas, Aydın Buluç, Steven Hofmeyr, Rob Egan and Eugene Goltsman,
with leadership, direction and advice from Kathy Yelick and Leonid Oliker.

The original Meraculous was developed by Jarrod Chapman, Isaac Ho, Eugene Goltsman, and Daniel Rokhsar.
- Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Steven Hofmeyr, Chaitanya Aluru, Rob Egan, Leonid Oliker, Daniel Rokhsar and Katherine Yelick, "HipMer: An Extreme-Scale De Novo Genome Assembler". 27th ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC 2015), Austin, TX, USA, November 2015.
- Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar and Katherine Yelick, "merAligner: A Fully Parallel Sequence Aligner". 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2015), Hyderabad, India, May 2015.
- Jarrod A Chapman, Martin Mascher, Aydın Buluç, Kerrie Barry, Evangelos Georganas, Adam Session, Veronika Strnadova, Jerry Jenkins, Sunish Sehgal, Leonid Oliker, Jeremy Schmutz, Katherine A Yelick, Uwe Scholz, Robbie Waugh, Jesse A Poland, Gary J Muehlbauer, Nils Stein and Daniel S Rokhsar, "A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome". Genome Biology 2015, 16:26.
- Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar and Katherine Yelick, "Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly". 26th ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC 2014), New Orleans, LA, USA, November 2014.
Please see the README.md for the most up-to-date instructions.
To run, use the src/hipmer/run_hipmer.sh script, which requires the install
directory and a meraculous.config file:
${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous.config
There are several configuration files in test/pipeline/*.config:
| Config File | Description |
|---|---|
| meraculous-validation.config | a small validation test |
| meraculous-ecoli.config | E. coli dataset, easy to run on single-node systems with limited cores and memory |
| meraculous-chr14.config | human chromosome 14 (diploid); can be run on single-node systems, but will be slower than the E. coli test |
| meraculous-human.config | full human dataset, requires around 1 TB of memory |
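For example, a minimal run of the E. coli test might look like the following sketch; the paths are placeholders (not part of the distribution), and THREADS and RUNDIR are the environment variables described further below:

```sh
# Illustrative paths only -- adjust for your own installation.
export PREFIX=$HOME/hipmer-install             # HipMer install directory
export THREADS=16                              # number of UPC threads to use
export RUNDIR=$SCRATCH/hipmer-ecoli-run        # output directory (created if necessary)

${PREFIX}/bin/run_hipmer.sh ${PREFIX} /path/to/test/pipeline/meraculous-ecoli.config
```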
For convenience, there are run scripts in the .platform_deploy directories
(e.g. .edison_deploy and .generic_deploy) that make it easier to run jobs. For
systems like Edison, these scripts can be submitted directly to the job queue
(with the queue, mppwidth and wall time options overridden as needed). The
scripts in .generic_deploy can be executed directly. The scripts expect the
data for the tests to be in hipmer_name_data, where name is one of 'ecoli',
'validation', 'chr14' or 'human'. If a dataset doesn't exist, the script will
download and install it (with a Lustre stripe count of 72 on Edison).
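As a rough sketch of a submission on a PBS/Torque-style scheduler (the script names and resource values below are illustrative, not actual file names; check the deploy directories for the real scripts):

```sh
# Illustrative only: override queue, mppwidth and walltime as appropriate.
qsub -q regular -l mppwidth=96,walltime=01:00:00 .edison_deploy/test_hipmer_ecoli-edison.sh

# The .generic_deploy scripts can simply be executed on a single node:
bash .generic_deploy/test_hipmer_ecoli-generic.sh
```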
The run scripts automatically set the CORES_PER_NODE and THREADS variables. On
Edison, the number of threads is determined from the number of processors
found at runtime, and CORES_PER_NODE is fixed at 24. In the generic case, the
number of threads defaults to all of those available on the single node, and
CORES_PER_NODE is set to the same value. You can override these values, but
make sure to set CORES_PER_NODE appropriately if you change anything.
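For example, to override both values before launching (the numbers are placeholders for your own hardware, not recommendations):

```sh
# Make sure CORES_PER_NODE matches the actual cores per node on your system.
export CORES_PER_NODE=32
export THREADS=128      # e.g. 4 nodes x 32 cores each
${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous-ecoli.config
```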
The pipeline can also be run with all the intermediate per-thread files, plus
the FASTQ inputs, kept in shared memory (/dev/shm). Set the environment
variable USE_SHM=1 to achieve this. Some scripts already include the
shared-memory option, e.g. .edison_deploy/test_hipmer_human-edison-shm.sh.
Using shared memory will be faster, especially at larger concurrencies, but
requires considerably more memory (around 1 TB for human).
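For example, assuming the same invocation as above, a shared-memory run can be requested like this:

```sh
# Keep the intermediate per-thread files and the FASTQ inputs in /dev/shm.
# Requires much more memory per node (around 1 TB for the human dataset).
USE_SHM=1 ${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous-ecoli.config
```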
Before launching run_hipmer.sh, some settings can be changed through
environment variables:
Physical memory properties:
PHYS_MEM_MB=413135
The number of UPC threads in the job (MPI will not use hyperthreads):
THREADS=20 HYPERTHREADS=1
The run directory in which to place all the files (will be created if necessary):
RUNDIR=/global/homes/r/regan/workspace/hipmeraculous/cori09-116095-20160309-160941
DATADIR=
LFS_STRIPE=72
GASNet properties (by default 80% of physical memory):
GASNET_PHYSMEM_MAX=413135MB
GASNET_PHYSMEM_NOPROBE=1
UPC properties:
UPC_SHARED_HEAP_MB=15500 (leave unset to use 80% of the node memory)
UPC_PTHREADS=
UPC_PTHREADS_OPT=
HipMer options (will override config file defaults):
UFX_HLL=0
MIN_DEPTH_CUTOFF=0 # use 0 for auto-detect after UFX generation
BUBBLE_MIN_DEPTH_CUTOFF=
KMER_LENGTH=
MIN_CONTIG_LENGTH=
NUM_ALIGNMENTS=192000
ILLUMINA_VERSION=
HIPMER_SW_CACHE=6144
NO_CONCAT=
CAN_SPLIT_JOB=0
ONO_ACTIVE_THREADS=20
MPI/UPC environment:
MPIRUN=mpirun -n 20
UPCRUN=upcrun -shared-heap=15500M -n
Note: the Illumina Version is automatically detected, and will be reported in
the output for the run. To override, set the ILLUMINA_VERSION environment
variable.
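As an illustration, a launch that overrides a few of these settings might look like the following sketch (the values are placeholders, not recommendations):

```sh
export THREADS=48                    # UPC threads for the job
export KMER_LENGTH=51                # overrides the config file default
export MIN_DEPTH_CUTOFF=0            # 0 = auto-detect after UFX generation
export LFS_STRIPE=72                 # Lustre stripe count
export RUNDIR=$SCRATCH/hipmer-run    # created if necessary

${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous-ecoli.config
```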
Some features of run_hipmer.sh:
To rerun specific stages in the pipeline, first delete the .log file for the
stage or stages, and then execute:
export RUNDIR=<name of output dir>
${PREFIX}/bin/run_hipmer.sh <install_path> <output_dir>/<config_file>
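For example, to rerun a single stage (the stage log name is a placeholder; the actual .log files appear in the run directory):

```sh
export RUNDIR=$SCRATCH/hipmer-ecoli-run
rm ${RUNDIR}/<stage>.log                 # delete the log(s) of the stage(s) to rerun
${PREFIX}/bin/run_hipmer.sh ${PREFIX} ${RUNDIR}/meraculous-ecoli.config
```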
The HipMer workflow is controlled within the configuration file, where the
libraries are specified. For each library, you can specify which round of oNo
it is used in, and whether or not it is used for splinting. The workflow is as
follows (see run_hipmer.sh for details):
This means that the first round of bmaToLinks may end up processing the
outputs from multiple iterations of splinter plus multiple iterations of
spanner; subsequent calls to bmaToLinks only process outputs from spanner.
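As a hedged sketch only (the field names and layout below are illustrative, not the authoritative syntax; see the sample configs in test/pipeline/*.config for the real format), the library section of the config conceptually distinguishes libraries by their oNo round and splinting use:

```
# Illustrative sketch of two library entries -- not real config syntax.
# Among the per-library fields are the oNo (scaffolding) round in which the
# library is used and a flag for whether it is used for splinting.
lib_seq  frag_reads_*.fastq  FRAG  ...  oNo round = 1,  use for splinting = yes
lib_seq  jump_reads_*.fastq  JUMP  ...  oNo round = 2,  use for splinting = no
```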