HipMer Wiki

A High performance distributed memory assembler for big genomic data

Brought to you by: aydozz, egeor, egoltsman, robegan, shofmeyr

Home

HipMer v 1.0, Copyright (c) 2019, The Regents of the University of California,
through Lawrence Berkeley National Laboratory (subject to receipt of any
required approvals from the U.S. Dept. of Energy).  All rights reserved.

If you have questions about your rights to use or distribute this software,
please contact Berkeley Lab's Innovation & Partnerships Office at  IPO@lbl.gov.

NOTICE.  This Software was developed under funding from the U.S. Department
of Energy and the U.S. Government consequently retains certain rights. As such,
the U.S. Government has been granted for itself and others acting on its behalf
a paid-up, nonexclusive, irrevocable, worldwide license in the Software to
reproduce, distribute copies to the public, prepare derivative works, and
perform publicly and display publicly, and to permit other to do so.

HipMer -- High Performance Meraculous

HipMer is a high performance, distributed memory and scalable version of Meraculous, a de novo genome assembler.

HipMer is a PGAS application, and the main software dependencies are the UPC language and UPC++ library, both of which use GASNet-EX for communication.
It can run on most any system from laptops to supercomputers, but requires a high speed and low latency networking interface for any cluster such as infiniband to scale efficiently.

This project is a joint collaboration between JGI,
NERSC and CRD and is primarily funded by the ExaBiome project, one of the US Department of Energy (DOE)'s Exascale Computing Projects (ECP)

Primary authors are:
Evangelos Georganas, Aydın Buluç, Steven Hofmey, Rob Egan and Eugene Goltsman,
with leadershipm, direction and advice from Kathy Yelick and Leonid Oliker.

The original Meraculous was developed by Jarrod Chapman, Isaac Ho, Eugene Goltsman,
and Daniel Rokhsar.

Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Steven Hofmeyr, Chaitanya Aluru, Rob Egan, Leonid Oliker, Daniel Rokhsar and Katherine Yelick, "HipMer: An Extreme-Scale De Novo Genome Assembler". 27th ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC 2015), Austin, TX, USA, November 2015.
Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar and Katherine Yelick, "merAligner: A Fully Parallel Sequence Aligner". 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2015), Hyderabad, INDIA, May 2015.
Jarrod A Chapman, Martin Mascher, Aydın Buluç, Kerrie Barry, Evangelos Georganas, Adam Session, Veronika Strnadova, Jerry Jenkins, Sunish Sehgal, Leonid Oliker, Jeremy Schmutz, Katherine A Yelick, Uwe Scholz, Robbie Waugh, Jesse A Poland, Gary J Muehlbauer, Nils Stein and Daniel S Rokhsar "A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome" . Genome Biology 2015, 16:26 .
Evangelos Georganas, Aydın Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar and Katherine Yelick, "Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly". 26th ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC 2014), New Orleans, LA, USA, November 2014.

Building and installing

Please see the README.md for the most up to date instructions

Running

To run, use the src/hipmer/run_hipmer.sh script which requires the install
directory and meraculous.config file:

${PREFIX}/bin/run_hipmer.sh ${PREFIX} meraculous.config

There are several configuration files in test/pipeline/*.config:

Config File	Description
meraculous-validation.config	a small validation test
meraculous-ecoli.config	ecoli dataset, easy to run on single node systems with limited cores & memory
meraculous-chr14.config	human chromosome 14 (diploid). Can be run on single node systems, but will be slower than ecoli.
meraculous-human.config	full human dataset, requires around 1TB of memory

For convenience, there are run scripts in .platform_deploy that make it easier
to run jobs. For systems like Edison, these scripts can be submitted directly
to the job queue (with overridden queue, mppwidth and wall time options). For
.generic_deploy, the scripts can be executed directly. The scripts expect the
data for the tests to be in hipmer_name_data, where name is 'ecoli',
'validation', 'human', 'chr14'. If a dataset doesn't exist, the script will
download and install it (with a stripe of 72 on Edison).

The run scripts automatically set the CORES_PER_NODE and THREADS variables. In
the case of Edison, the number of threads is determined from the number of
processors found at runtime, and the CORES_PER_NODE is fixed to 24. In the case
of generic, the number of threads by default is all those available on the
single node, and the CORES_PER_NODE is the same value. You can override these
values, but make sure to set the CORES_PER_NODE appropriately if you change
anything.

The pipeline can also be run with all the intermediate per-thread files in
shared memory (/dev/shm), plus the FASTQ inputs. Set the environment variable
USE_SHM=1 to achieve this. There are some scripts that have the shared memory
option, e.g. .edison_deploy/test_hipmer_human-edison-shm.sh. Using shared
memory will be faster, especially at larger concurrencies, but require a lot
more memory (around 1TB for human).

Before launching run_hipmer.sh, some settings can be changed through
environmental variables:

Physical memory properties:
  PHYS_MEM_MB=413135

The number of UPC threads in the job (MPI will not use hyperthreads):
  THREADS=20 HYPERTHREADS=1

The rundirectory to place all the files (will be created if necessary):
  RUNDIR=/global/homes/r/regan/workspace/hipmeraculous/cori09-116095-20160309-160941
  DATADIR=
  LFS_STRIPE=72

Gasnet properties (by default 80% of physmem):
  GASNET_PHYSMEM_MAX=413135MB
  GASNET_PHYSMEM_NOPROBE=1

UPC properties:
  UPC_SHARED_HEAP_MB=15500 (Do not set to use 80% of the node memory)
  UPC_PTHREADS=
  UPC_PTHREADS_OPT=

HipMer options (will override config file defaults):
  UFX_HLL=0
  MIN_DEPTH_CUTOFF=0 # use 0 for auto-detect after UFX generation
  BUBBLE_MIN_DEPTH_CUTOFF=
  KMER_LENGTH=
  MIN_CONTIG_LENGTH=
  NUM_ALIGNMENTS=192000
  ILLUMINA_VERSION=
  HIPMER_SW_CACHE=6144
  NO_CONCAT=
  CAN_SPLIT_JOB=0
  ONO_ACTIVE_THREADS=20

MPI/UPC environment:
  MPIRUN=mpirun -n 20
  UPCRUN=upcrun    -shared-heap=15500M  -n

Note: the Illumina Version is automatically detected, and will be reported in
the output for the run. To override, set the ILLUMINA_VERSION environment
variable.

Some features of run_hipmer.sh:

Should detect and run the alternate diploid workflow if set in the config
file
Runs the proper set and ordering of splinter, spanner, bmaToLinks and oNo
for all libraries based on the config file
Organizes inputs by library, finds the files specified in the config file
Calls the proper version of binaries (KMER_LENGTH & READ_LENGTH)
Logs all commands and timings in timings.log
Aborts on error, continues on first failed step
Validates that outputs are generated

To rerun specific stages in the pipeline, first delete the .log file for the
stage or stages, and then execute:

export RUNDIR=<name of output dir>
${PREFIX}/bin/run_hipmer.sh <install_path> <output_dir>/<config_file>

Workflow

The HipMer workflow is controlled within the configuration file, when the
libraries are specified. For each library, you can specify what round of oNo to
use it in, and you can specify whether or not to use it for splinting. The
workflow is as follows (see run_hipmer.sh for details):

(prepare input fastq files)
They must be uncompressed
They ought to be striped for efficient parallel access
prepare meraculous.config
ufx
contigs
contigMerDepth
if diploid:
1. contigEndAnalyzer
2. bubbleFinder
(optionally upc_canonical_assembly: canonical_contigs.fa)
for each library:
1. merAligner
2. splinter (if specified in config file)
for each oNoSetID:
1. for each library in that oNoSetId:
  1. merAlignerAnalyzer (histogrammer)
  2. spanner
2. bmaToLinks
3. merger
  4 for each oNoRuns choose a -p
  1. oNo
4. splitter
gapclosing
upc_canonical_assembly: final_assembly.fa

This means that the first round of bmaToLinks could end up processing the
outputs from multiple iterations of splinter plus multiple ones of spanner. The
subsequent calls to bmaToLinks will only process outputs from spanner.