qPileup

Authors:

Introduction

qpileup is a standalone Java application for gathering base-level metrics across a whole genome from a group of bams. It is conceptually similar to samtools or Picard pileup in that it parses BAM files to create a summary of information for each position in a reference sequence(s). It differs from existing pileup implementations in that it captures a much larger range of metrics and it makes use of the HDF5 data storage format to store the data, allowing for smaller files sizes (via compression) and much faster access (via indexing).

Installation

qpileup requires java 7 and (ideally) a multi-core machine (5 threads are run concurrently) with at least 20GB of RAM.
Download the qpileup tar file
Untar the tar file into a directory of your choice
You should see jar files for qpileup and its dependencies:

[oholmes@minion0 qsnp]$ tar xjvf qsnp.tar.bz2
x antlr-3.2.jar
x ini4j-0.5.2-SNAPSHOT.jar
x jopt-simple-3.2.jar
x picard-1.110.jar
x qbamfilter-1.0pre.jar
x qcommon-0.1pre.jar
x qio-0.1pre.jar
x qpicard-0.1pre.jar
x qpileup-0.1pre.jar
x sam-1.110.jar
x jhdfobj.jar
x jhdf5obj.jar
x jhdf5.jar
x jhdf.jar
[oholmes@minion0 qsnp]$

Usage

java -Xmx20g -jar qpileup-0.1pre.jar --ini /path/to/ini/file/demo.ini

Modes

qpileup has several modes:

Mode	Description
bootstrap	Creates a HDF5 file for a reference genome, storing position and reference base information for the genome. The output from this mode is a complete HDF5 qpileup file but with no values in any of the summary metrics. For a given reference genome, it would make sense to run bootstrap mode once to create a "clean" initialised qpileup HSF5 and then use that bootstrap file as the basis for numerous BAM collections. Using 12 threads on a cluster node with 2 x 6-core CPUs, bootstrapping the human genome takes approximately 12 minutes.
add	Takes one or more BAM files as input and performs a pileup for the reads, adding the new data to an existing qpileup file. It is expected that this and "view" will be the most often used qpileup modes as this mode allows for addition of new BAMs to existing qpileup files and "view" allows for querying of locations in the qpileup file.
remove	Takes one or more BAM files as input and performs a pileup for the reads, appropriately decrementing the values of all of the qpileup metrics. This should rarely be necessary but could be invaluable in cases of sample or tumour/normal swaps, or incorrect labelling of BAM files. Without this mode, any case where a BAM was incorrcetly added to an HDF5 file would require the HDF5 file to be regenerated from scratch.
view	Reads metrics from the HDF5 file for a genome and write to a CSV file. Can be the whole genome or for specific chromosomal locations. Will output a separate CSV for each chromosome/contig that is part of the list of ranges queried. Each file will contain metadata at the top including the HDF5 file the summary was extracted from and the regions extacted. View mode can also be run in a limited fashion from the command line. This mode can be to view metadata for the HDF file and the qpileup metrics for a small (up to one chromosome) region of the reference genome.
merge	Merge 2 or more HDF files together. Must use the same reference genome, and the same values for lowreadcount and percentnonref.
metrics	Information required for custom metrics modes that have been implemented.

HDF5 file

HDF5 is a binary file format that allows for high-speed random access to very large datasets, and includes support for user-defined composite data structures and compression. The first version was created in 1987 and it has been used by hundreds of organisations worldwide including NASA, Deutsche Bank, Baylor College of Medicine and ANSTO.
The data stored in the qpileup HDF5 file conceptually fits into 3 categories: position which relates to the reference genome, strand summary which holds the per-base metrics derived from the reads in the the BAMs added to the HDF5 file, and metadata which is a log of the bootstrap/add/remove operations that have been applied to the HDF5.

Position Information

Position is stored in the HDF as 1D Scalar datasets (Integer for postion, char (as byte) for base)
Data Elements are chunked (size=10000), with compression level of 1.

Data Element	Type	Description
Position	Integer	Offset of this base within the sequence. Should be 1-based so the first base is numbered 1.
Reference	Char	Reference base at this position

Strand Summary

Each of the following data elements is compiled independently for each strand so these elements will exist in the H5 file in _for (forward) and _rev (reverse) versions.
Strand Data Elements are created individually as 1D Scalar Datasets (This structure is used due to speed considerations: use of compound datasets or 2D datasets results in much slower run times due to inefficiency of data structure compatabilities with Java/C.
Data Elements are chunked (size=10000), with compression level of 1.

Data Element	Type	Description
A	Integer	Count of all the A bases observed
C	Integer	Count of all the C bases observed
G	Integer	Count of all the G bases observed
T	Integer	Count of all the T bases observed
N	Integer	Count of all the N bases observed
ReferenceNo	Integer	Count of the number of bases at this position which are the same as the reference base
NonreferenceNo	Integer	Count of the number of bases at this position which are not the same as the reference base
HighNonreference	Integer	Count of the number of bams/samples that have a high number of non-reference bases at this position. By default is defined as having a minimum of 10 bases at this position and where non-reference bases account is at least 20% of the total number of bases. The minimum number of bases can be defined in bootstrap mode using the lowreadcount inifile option. The non-reference base percentage minimum threshold can ve defined using the percentnonref inifile option.
LowReadCount	Integer	Count of the number of bams/samples that have a low number of reads covering this position. By default this is define as below 10. The lowreadcount inifile option can be used to set this in bootstrap mode. If a sample has a low read count it is not used when calculating HighNonreference base count
AQual	Long	Sum of the qualities of all the A bases at this position
CQual	Long	Sum of the qualities of all the C bases at this position
GQual	Long	Sum of the qualities of all the G bases at this position
TQual	Long	Sum of the qualities of all the T bases at this position
NQual	Long	Sum of the qualities of all the N bases at this position
MapQual	Long	Sum of the mapping qualities of all reads that provide bases at this position
StartAll	Integer	Count of all reads where alignment starts at this base (obeys clipping)
StartNondup	Integer	As for StartAll except that we only count non-duplicate reads (obeys clipping)
StopAll	Integer	Count of all reads where alignment stops at this base (obeys clipping)
Dup	Integer	Count of reads that were flagged as duplicate and have a base at this position
MateUnmapped	Integer	Count of reads at this position that have an unmapped mate
CigarI	Integer	Count of reads that have an "I" in the CIGAR string at this position. Only is counted at the first position at which the insertion occurs. Defined as where there is an insertion between this reference position and the next reference position.
CigarD	Integer	Count of reads that have an "D" in the CIGAR string at this position
CigarD_start	Integer	Count of reads that have an "D" in the CIGAR string that starts at this position
CigarS	Integer	Count of reads that have an "S" in the CIGAR string at this position
CigarS_start	Integer	Count of reads that have an "S" in the CIGAR string that starts at this position
CigarH	Integer	Count of reads that have an "H" in the CIGAR string at this position
CigarH_start	Integer	Count of reads that have an "H" in the CIGAR string that starts at this position
CigarN	Integer	Count of reads that have an "N" in the CIGAR string at this position (only valid for RNA alignments)

Metadata

Stored as chunked (size=1) 1D Scalar DS, with compression level of 1.
Strings are stored as bytes
Two types of metadata:

Record metadata

A comma separated string with the following elements:

Data Element	Description
Mode	Mode carried out: bootstrap/add/remove
Date	Date a bootstrap/add/remove was performed
Run time	Run time for the analysis
Bam	Path of bam file added/removed
Record Count	Number of records in the bam file

The following attributes are also associated with the metadata and are added during bootstrap mode and potentially modified in other modes:

BAMS_ADDED: a count of the bams that have been added in add mode. Is modified during add mode.
LOW_READ_COUNT: The lowreadcount iniFile option. Cannot be modified after bootstrapping and will return an error if this option is different than the one added during bootstrap.
MIN_NONREF_PERCENT: The percentnonref iniFile option. Cannot be modified after bootstrapping and will return an error if this option is different than the one added during bootstrap.

Reference metadata

The reference metadata contains length and name information for the reference genome. The information is added during bootstrap and cannot be modified after this point. It is a comma separated string with the following elements:

Data Element	Description
Sequence	Name of the reference sequence (ie chromosome or contig)
Length	Number of base pairs in the sequence

Options

Option                        Description                            
------                        -----------                            
-H                            View the header of the HDF file        
-V, -v, --version             Print version info.                    
--element                     Select the qpileup data elements to view                                 
--group                       Select the group of qpileup data elements to view: 
                                   forward, reverse, bases, quals, cigars, readStats      
-h, --help                    Shows this help message.               
--hdf                         HDF File that will be viewed.                              
--ini                         Ini file with required options         
--range                       The range of the viewing to be undertaken: 
                                   in format of chromosome number:start-end eg chr1:1-1000
--view                        Use this option to view the HDF file header

ini file

Required. Use this option to specify the iniFile containing relevant qpileup options. See the example below:

[general]
log=path to log file (required)
loglevel=logging level - DEBUG or INFO. (optional)
hdf=path to hdf file (required. Must end in .h5)
mode=bootstrap, add, remove or view (required)
bam_override=whether duplicate bam files can be added. (optional. False by default)
thread_no=number of threads to use for multithreading. Must be between 1 and 12. (required). 
               Total threads will be the number specified + 3. 
output_dir=the place that pileup files are written to (required for view,metrics modes)
range=the range to view eg all, chr1, chr:1-1000. (Required for view, metrics modes. More than one range option allowed.)

[bootstrap]
reference=path to reference genome fasta file (required for bootstrap mode)
low_read_count=the number below which the LowReadCount element is defined. Also used to define the HighNonReference element. 
              If not defined, the default is 10
nonref_percent=Used for HighNonreference strand summary element. Minimum percent that non-reference bases 
to total bases must have in order to be counts as HighNonReference. If not defined the default is 20(%)

[add_remove]
bam_list=path of file with list of bams OR
name=path to bam files (Required for add, remove modes. More than one name allowed)

[merge]
input_hdf=path to the hdf file/s that will be merged (required for merge mode)

[view]
group=Group of qpileup data elements to view. Possible groups are: forward, reverse, bases, quals, cigars, readStats. (Optional for view mode)
element=Qpileup data element to view. See strand summary table above: eg A, Aqula,CigarI etc. (Optional for view mode)

[metrics]
min_bases=minimum average coverage (base count) per reference position
bigwig_path=directory where wigToBigWig program is located
chrom_sizes=file listing chromosome lengths for wigToBigWig conversion

[metrics/clip]
position_value=minimum value as percent of total bases per position
window_count=minimum number of positions that pass the position_value in the window

[metrics/nonreference_base]
position_value=minimum value as percent of total bases per position
window_count=minimum number of positions that pass the position_value in the window

[metrics/indel]
position_value=minimum value as percent of total bases per position
window_count=minimum number of positions that pass the position_value in the window

[metrics/mapping_qual]
position_value=maximum value as percent of total bases per position
window_count=minimum number of positions that pass the position_value in the window

[metrics/high_coverage]
position_value=minimum value as percent of total bases per position
window_count=minimum number of positions that pass the position_value in the window

[metrics/snp]
dbSNP=path to dbSNP VCF file
germlineDB=path to qSNP germlineDB file
nonref_percent=Minimum percentage that non-reference bases that must be in order to be included as a SNP. Default 10 (ie 10% non reference)
nonref_count=Minimum count of non reference bases in order to be included. Default 10. 
high_nonref_count=Minimum number of patients that had a high non_reference count. Default 0.
position_value=minimum value as percent of total bases per position
window_count=minimum number of positions that pass the position_value in the window

[metrics/strand_bias]
min_percent_diff=minimum percent difference in % nonreference bases between strands
min_nonreference_bases=minimum number of non-reference bases per reference position

View mode options

qpileup offers a limited view mode option from the command line. Users may use this option to view the HDF header/metadata or qpileup output for a single reference genome range (maximum size is one chromosome)

Option	Description
--view	Required
--H	View header only
--V	View version information
--range	The range to view. eg chr1 or chr1:1-1000
--hdf	Path to the HDF file to view. Required in view mode
--element	qpileup data element to view (see strand summary table above: eg A, Aqula,CigarI etc). Optional for view mode
--group	Group of qpileup data elements to view. Optional for view mode.

Possible groups are:

forward: all elements forward strand reads.
reverse: all elements: reverse strand reads
bases: elements: A,C,G,T,N,ReferenceNo,NonreferenceNo,HighNonreference,LowReadCount
quals: AQual,CQual,GQual,NQual,MapQual
cigars: CigarI,CigarD,CigarD_start,CigarS,CigarS_start,CigarH,CigarH_start,CigarN,CigarN_start
readStats: StartAll,StartNondup,StopAll,Dup,MateUnmapped

Examples

Bootstrap options for ini file

[bootstrap]
reference=/genomes/GRCh37_ICGC_standard_v2/GRCh37_ICGC_standard_v2.fa
low_read_count=20
nonref_percent=30

Add mode options for ini file

Bam files can be listed in 2 ways:
within the ini file using name=path to bam file
provide a file with bam files listed using bamlist=path to bam list file

[general]
log=/qpileup/runs/test/test.log
loglevel=DEBUG
hdf=/qpileup/runs/test/target_109.qpileup.h5
mode=add
bam_override=true
thread_no=12

[add_remove]
bam_list=/qpileup/bam_list_20130202.txt OR
name=/ABCD_1234/T02_20120318_153_FragBC.nopd.IonXpress_001.sorted.bam
name=/ABCD_1234/T02_20120318_153_FragBC.nopd.IonXpress_003.sorted.bam

Merge mode options for ini file

[general]
log=/qpileup/runs/test/test.log
loglevel=DEBUG
hdf=/qpileup/runs/test/target_109.qpileup.h5
mode=merge
bam_override=true
threadNo=12

[merge]
input_hdf==/qpileup/runs/test/testA.h5
input_hdf==/qpileup/runs/test/testB.h5

View mode options for ini file

[general]
log=/qpileup/runs/test/test.log
loglevel=DEBUG
hdf=/qpileup/runs/test/target_109.qpileup.h5
mode=add
bamoverride=true
thread_no=12
output_dir=/qpileup/runs/test/
range=all

[view]

; choose one of the following
range=all        ;View whole genome
range=chr1       ;View whole chromosome
range=chr1:1-1000    ;View part of a chromosome
range=chr1:1-1000 --element A    ;View part of a chromosome, A base element only
range=chr1:1-1000 --group forward    ;View part of a chromosome, forward group elements only

View mode (command line)

View version

java -Xmx20g -jar qpileup-0.1pre.jar  --view -V --hdf /qpileup/runs/test/testhdf.h5

View header

java -Xmx20g -jar qpileup-0.1pre.jar  --view -H --hdf /qpileup/runs/test/testhdf.h5

View pileup for region of chromosome

java -Xmx20g -jar qpileup-0.1pre.jar  --view -H --hdf /qpileup/runs/test/testhdf.h5 --range chr1:1-1000

Metrics mode

[metrics]
min_bases=3
bigwig_path=/qpileup/wiggle
chrom_sizes=/qpileup/wiggle/grc37.chromosomes.txt

[metrics/clip]
position_value=2
window_count=10

[metrics/nonreference_base]
position_value=2
window_count=10

[metrics/indel]
position_value=0.2
window_count=3

[metrics/mapping_qual]
position_value=10
window_count=10

[metrics/high_coverage]
position_value=2000
window_count=10

[metrics/snp]
dbSNP=/dbSNP/135/00-All_chr.vcf
germlineDB=/qpileup/files/icgc_germline_qsnp.vcf
nonref_percent=20
nonref_count=100
high_nonref_count=1
position_value=1
window_count=2

[metrics/strand_bias]
min_percent_diff=30
min_nonreference_bases=100