qprofiler 2.0

System Requirements
Installation
Usage
Options
Output
Log file

qprofiler is a standalone Java application designed to provide quality control reporting for next-generation sequencing (NGS). qprofiler takes FASTQ, BAM or VCF files as input and outputs an XML file containing summary statistics tailored to the input file type. The VCF mode is new and is under active development so it should be considered experimental and subject to change.

While the XML file is useful for extracting values for further analysis, a second tool (qvisualise) was created to parse qprofiler XML files and produce HTML output with embedded graphs drawn by the Google Charts JavaScript library. qvisualise exists as a standalone program but is also integrated into qprofiler, so an HTML file will always be output by qprofiler unless the --nohtml flag is set. The HTML file name is the XML output file name with the extension .html appended. There is currently no visualisation for the experimental VCF mode, but one will be developed once the format and contents of the VCF XML report stabilise.

qprofiler uses the Picard library to access BAM files.

If no output file is specified then the default file name will be used: qprofiler.xml.
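The naming rules above can be sketched as a small shell helper. `qp_html_name` is a hypothetical function and not part of qprofiler; it only mirrors the documented behaviour (a default output of qprofiler.xml, with .html appended for the visualisation).

```shell
# Hypothetical helper (not part of qprofiler): predict the HTML report name
# that corresponds to a given --output value.
qp_html_name() {
    # Fall back to the documented default output name when no argument is given.
    local xml="${1:-qprofiler.xml}"
    echo "${xml}.html"
}

qp_html_name                      # prints: qprofiler.xml.html
qp_html_name sample.bam.qp.xml    # prints: sample.bam.qp.xml.html
```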

System Requirements

Java 1.8
Multi-core machine (ideally) and 20GB of RAM

Installation

Obtain a copy of the qprofiler2.0.tar.bz2 file from John Pearson (john.pearson@qimrberghofer.edu.au). The tar file contains a directory qprofiler2.0 which holds the qprofiler-2.0.jar file along with the third-party jar files it depends on.
Untar the tar file.

You should see something like this:

qprofiler2.0/
qprofiler2.0/qvisualise-2.0.jar
qprofiler2.0/commons-lang3-3.5.jar
qprofiler2.0/commons-math3-3.3.jar
qprofiler2.0/qprofiler-2.0.jar
qprofiler2.0/picard-lib.jar
qprofiler2.0/qcommon-0.4.jar
qprofiler2.0/qpicard-1.1.jar
qprofiler2.0/trove-3.1a1.jar
qprofiler2.0/qio-0.1pre.jar
qprofiler2.0/jopt-simple-4.6.jar
qprofiler2.0/htsjdk-1.140.jar

Usage

The full option list is described below, but there are 3 options that you should probably specify every time you call qprofiler: --input, --output, and --log. If you have access to a multi-core machine (e.g. a compute node on a cluster) and are processing BAM files, you should also look at the thread-count parameters: --ntProducer and --ntConsumer.

In general, we recommend using as many consumer threads as you have cores available (so 16 consumers for a 16-core machine), with approximately a 1:4 ratio between producer and consumer threads. The producer threads are relatively lightweight and will not occupy a full core each.

For example, to run on a 16-core computer, we would suggest something like:

java -jar qprofiler-2.0.jar \
     --input ~/sample_virus.BWA-backtrack.bam \
     --log ~/sample_virus.BWA-backtrack.bam.qp.log \
     --output ~/sample_virus.BWA-backtrack.bam.qp.xml \
     -ntP 4 -ntC 16

The recommendations on counts of consumer and producer threads are empirical so if you are going to do lots of qprofiler work, you should probably do some testing of your own to see what thread counts and ratios work best on your servers or cluster nodes. This is especially important for cluster work where core count is critical - if you request 8 cores, you need to make sure that your threading parameters are dialled to keep qprofiler inside the number of cores you requested. It's also worth noting that hyperthreaded cores can cause the counts to be off - clusters often count each hyperthreaded core as two cores, i.e. capable of running 2 threads, but they will not be as efficient as 2 separate cores so again you will need some empirical testing to see what thread counts and producer/consumer ratios work best for you.
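As a starting point for that testing, the rule of thumb (one consumer per core, roughly one producer per four consumers) can be sketched as a shell helper. `qp_thread_counts` is a hypothetical name, and rounding the producer count up is an assumption made here so that small core counts still get one producer.

```shell
# Hypothetical helper (not part of qprofiler): turn a core budget into
# the -ntP / -ntC arguments suggested by the 1:4 rule of thumb.
qp_thread_counts() {
    local cores="$1"
    local consumers="$cores"
    # Assumption: round up, so 1-4 cores still yields one producer thread.
    local producers=$(( (cores + 3) / 4 ))
    echo "-ntP ${producers} -ntC ${consumers}"
}

qp_thread_counts 16   # prints: -ntP 4 -ntC 16
qp_thread_counts 8    # prints: -ntP 2 -ntC 8
```

On a cluster, pass the number of cores you actually requested rather than the number the node reports, and treat the result as a first guess to be tuned.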

It is not unusual to find BAM files that contain headers or reads that the Picard library considers invalid; Picard then throws exceptions, which cause qprofiler to exit. This is why the default for --validation is SILENT, although this is not an ideal situation. If you are primarily a consumer of BAMs then it's probably OK to always operate in SILENT mode, but if anything odd happens with your output, you should rerun with STRICT or LENIENT to see if there are problems with the BAM. If, on the other hand, you are a BAM producer, you should probably use STRICT, and if any of your BAMs cause exceptions to be thrown, you should try to fix the underlying causes.
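The two workflows can be sketched as a wrapper that prints, rather than runs, the command for one BAM. `qp_bam_command` is a hypothetical helper; the `<input>.qp.xml` and `<input>.qp.log` naming is an assumption borrowed from the usage example above, and only documented options are used.

```shell
# Hypothetical wrapper (not part of qprofiler): print the command line for one BAM.
# Assumption: outputs are named <input>.qp.xml and <input>.qp.log.
# Paths containing spaces would need extra quoting before the command is run.
qp_bam_command() {
    local bam="$1"
    # SILENT is the documented default for --validation.
    local stringency="${2:-SILENT}"
    echo "java -jar qprofiler-2.0.jar --input ${bam} --output ${bam}.qp.xml --log ${bam}.qp.log --validation ${stringency}"
}

# Routine run as a BAM consumer:
qp_bam_command sample.bam
# Rerun after odd-looking output, or routine run as a BAM producer:
qp_bam_command sample.bam STRICT
```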

Options

Option	Description
--input	Input file in FASTQ, BAM or VCF format. required
--output	Output file. optional, defaults to qprofiler.xml
--include	This option is deprecated. It produces some additional visualisations primarily targeted at sequencing from the Life Technologies SOLiD platform. It is not thread-safe and will not work correctly in conjunction with the `-ntC` and `-ntP` options. It will be removed in a future version.
--maxRecords	Specify how many records should be parsed by qprofiler. Note that qprofiler will always start at the beginning of a BAM file, meaning that you will always get the first `maxRecords` records back. This option is designed for testing or for when you want a quick look at a BAM and can't wait for the full file to be processed.
--nohtml	If this flag is set, qprofiler will not call the qvisualise library so no HTML visualisation file will be created. This option is only relevant for FASTQ and BAM files. optional, defaults to not set
--ntProducer	Only relevant to BAM files. Specifies how many threads (integer) should be used to produce reads from the input file. optional
--ntConsumer	Only relevant to BAM files. Specifies how many threads (integer) should be used to consume reads from the input file. optional
--tags	Perform aggregations on user-defined tags for BAM files. Example values are "ZC", "XY", etc. This option is considered legacy and may be deprecated in a future release. As the contents of BAM files have stabilised, custom reporting and visualisations have been created for the most common and useful tags.
--log	Log file. optional but VERY highly recommended
--loglevel	Level at which logging should be applied. Possible values in increasing order of detail are INFO, DEBUG, ALL. At DEBUG level and above, the logging is very granular so you should not use these levels unless you truly are debugging a qprofiler run. optional, defaults to INFO
--help	Show usage and help text and exit.
--validation	How strict to be when reading a SAM or BAM file. Possible values are STRICT, LENIENT, SILENT and the default is SILENT. This value is passed to `Picard` as the parameter `Validation Stringency`.
--version	Print version info and exit.
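Because a full BAM run can take hours, it can be worth checking option values before submitting a job. The following is a minimal sketch of such a pre-flight check; `qp_is_allowed` and `qp_check_options` are hypothetical helpers, and the sketch accepts only the upper-case spellings listed in the table (whether qprofiler itself is case-sensitive is not stated here).

```shell
# Hypothetical pre-flight check (not part of qprofiler): reject values that
# the options table does not list before a long-running job is submitted.

# Return 0 if the first argument equals any of the remaining arguments.
qp_is_allowed() {
    local value="$1"
    shift
    local candidate
    for candidate in "$@"; do
        [ "$value" = "$candidate" ] && return 0
    done
    return 1
}

# Check a --validation value and a --loglevel value against the documented choices.
qp_check_options() {
    local validation="$1"
    local loglevel="$2"
    if ! qp_is_allowed "$validation" STRICT LENIENT SILENT; then
        echo "unsupported --validation value: $validation" >&2
        return 1
    fi
    if ! qp_is_allowed "$loglevel" INFO DEBUG ALL; then
        echo "unsupported --loglevel value: $loglevel" >&2
        return 1
    fi
    return 0
}

qp_check_options STRICT DEBUG && echo "options look valid"
```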

Output

This example output shows XML from running qprofiler against a BAM file. This is a high-level view and most of the contents have been elided (...).

<qProfiler finish_time="2017-07-05 22:09:53" run_by_os="Linux" run_by_user="christiX" start_time="2017-07-05 17:36:42" version="2.0 (1954)">
  <BAMReport execution_finished="2017-07-05 22:09:33" execution_started="2017-07-05 17:36:42" file="/mnt/lustre/working/genomeinfo/sample/c/9/c9a6be94-bdb7-4c0d-a89d-4addbf76e486/aligned_read_group_set/0f443106-e17d-4200-87ec-bd66fe91195f.bam">
    <HEADER>...</HEADER>
    <SUMMARY>...</SUMMARY>
    <SEQ>...</SEQ>
    <QUAL>...</QUAL>
    <TAG>...</TAG>
    <ISIZE>...</ISIZE>
    <RNEXT>...</RNEXT>
    <CIGAR>...</CIGAR>
    <MAPQ>...</MAPQ>
    <RNAME_POS>...</RNAME_POS>
    <FLAG>...</FLAG>
  </BAMReport>
</qProfiler>
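For a quick check of which report sections a run produced, the section names can be listed with standard shell tools. This is only a sketch: `qp_sections` is a hypothetical helper, and it assumes the report is pretty-printed exactly as in the example above (one upper-case section element per line, indented by four spaces). A proper XML parser is the safer choice for anything beyond a quick look.

```shell
# Write the elided example report to a temporary file so the sketch is self-contained.
report=$(mktemp)
trap 'rm -f "$report"' EXIT
cat > "$report" <<'EOF'
<qProfiler finish_time="2017-07-05 22:09:53" run_by_os="Linux" run_by_user="christiX" start_time="2017-07-05 17:36:42" version="2.0 (1954)">
  <BAMReport execution_finished="2017-07-05 22:09:33" execution_started="2017-07-05 17:36:42" file="/mnt/lustre/working/genomeinfo/sample/c/9/c9a6be94-bdb7-4c0d-a89d-4addbf76e486/aligned_read_group_set/0f443106-e17d-4200-87ec-bd66fe91195f.bam">
    <HEADER>...</HEADER>
    <SUMMARY>...</SUMMARY>
    <SEQ>...</SEQ>
    <QUAL>...</QUAL>
    <TAG>...</TAG>
    <ISIZE>...</ISIZE>
    <RNEXT>...</RNEXT>
    <CIGAR>...</CIGAR>
    <MAPQ>...</MAPQ>
    <RNAME_POS>...</RNAME_POS>
    <FLAG>...</FLAG>
  </BAMReport>
</qProfiler>
EOF

# Hypothetical helper: print one section name per line.
# Assumption: sections are upper-case elements indented by exactly four spaces.
qp_sections() {
    grep -o '^    <[A-Z_]*>' "$1" | tr -d ' <>'
}

qp_sections "$report"
```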

The following screenshots show HTML visualisations created by qvisualise from a qprofiler BAM-mode XML file:

Summary page including breakdown of bases usable for analysis:

[Screenshot: qprofiler_bam_summary.png]

Base quality by read and cycle:

[Screenshot: qprofiler_bam_qual.png]

By-chromosome representation of read depth in million-base windows:

[Screenshot: qprofiler_bam_rdepth.png]

Log file

This example log file is from running qprofiler against a BAM file. The majority of the log file has been elided (...) to save space.

17:36:42.356 [main] EXEC org.qcmg.qprofiler.QProfiler - Uuid c637af74-2c8f-4682-944a-ccd42dd57967
17:36:42.357 [main] EXEC org.qcmg.qprofiler.QProfiler - StartTime 2017-07-05 17:36:42
17:36:42.358 [main] EXEC org.qcmg.qprofiler.QProfiler - OsName Linux
17:36:42.358 [main] EXEC org.qcmg.qprofiler.QProfiler - OsArch amd64
17:36:42.359 [main] EXEC org.qcmg.qprofiler.QProfiler - OsVersion 3.10.0-327.3.1.el7.x86_64
17:36:42.360 [main] EXEC org.qcmg.qprofiler.QProfiler - RunBy christiX
17:36:42.360 [main] EXEC org.qcmg.qprofiler.QProfiler - ToolName qprofiler
17:36:42.361 [main] EXEC org.qcmg.qprofiler.QProfiler - ToolVersion 2.0 (1954)
17:36:42.362 [main] EXEC org.qcmg.qprofiler.QProfiler - CommandLine qprofiler --log /mnt/lustre/home/christiX/qprofiler/colo_829.analysis/qprofiler2.0/output/0f443106-e17d-4200-87ec-bd66fe91195f.bam.qp.xml.log --loglevel INFO --output /mnt/lustre/home/christiX/qprofiler/colo_829.analysis/qprofiler2.0/output/0f443106-e17d-4200-87ec-bd66fe91195f.bam.qp.xml --input /mnt/lustre/working/genomeinfo/sample/c/9/c9a6be94-bdb7-4c0d-a89d-4addbf76e486/aligned_read_group_set/0f443106-e17d-4200-87ec-bd66fe91195f.bam -ntP 4 -ntC 20
17:36:42.363 [main] EXEC org.qcmg.qprofiler.QProfiler - JavaHome /software/java/jdk1.8.0_77/jre
17:36:42.363 [main] EXEC org.qcmg.qprofiler.QProfiler - JavaVendor Oracle Corporation
17:36:42.364 [main] EXEC org.qcmg.qprofiler.QProfiler - JavaVersion 1.8.0_77
17:36:42.365 [main] EXEC org.qcmg.qprofiler.QProfiler - host hpcnode040.adqimr.ad.lan
17:36:42.367 [main] TOOL org.qcmg.qprofiler.QProfiler - Running in multi-threaded mode (BAM files only). No of available processors: 56, no of requested consumer threads: 20, producer threads: 4
17:36:42.415 [main] INFO org.qcmg.qprofiler.QProfiler - processing file /mnt/lustre/working/genomeinfo/sample/c/9/c9a6be94-bdb7-4c0d-a89d-4addbf76e486/aligned_read_group_set/0f443106-e17d-4200-87ec-bd66fe91195f.bam
17:36:42.418 [pool-1-thread-1] INFO org.qcmg.qprofiler.QProfiler - running BamSummarizerMT
17:36:42.770 [pool-1-thread-1] INFO org.qcmg.qprofiler.bam.BamSummarizerMT - will create 20 consumer threads
17:36:42.777 [pool-1-thread-1] INFO org.qcmg.qprofiler.bam.BamSummarizerMT - waiting for Producer thread to finish (max wait will be 20 hours)
17:36:42.948 [pool-3-thread-2] INFO org.qcmg.qprofiler.bam.BamSummarizerMT$Producer - retrieving records for sequence: chr1
17:36:42.969 [pool-3-thread-1] INFO org.qcmg.qprofiler.bam.BamSummarizerMT$Producer - retrieving records for sequence: chr2
17:36:42.974 [pool-3-thread-4] INFO org.qcmg.qprofiler.bam.BamSummarizerMT$Producer - retrieving records for sequence: chr3
...
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - StopTime 2017-07-05 22:09:54
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - TimeTaken 04:33:12
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - ExitStatus 0
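The EXEC lines make it straightforward to pull run metadata out of a log, for example to confirm that a batch of runs all exited cleanly. The sketch below assumes the line layout shown in the example above (timestamp, thread, level, class, a dash, then a key followed by its value); `qp_log_field` is a hypothetical helper and not part of qprofiler.

```shell
# Write a few lines of the example log to a temporary file so the sketch is self-contained.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
17:36:42.361 [main] EXEC org.qcmg.qprofiler.QProfiler - ToolVersion 2.0 (1954)
...
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - StopTime 2017-07-05 22:09:54
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - TimeTaken 04:33:12
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - ExitStatus 0
EOF

# Hypothetical helper: print the value recorded for one EXEC key.
# Assumption: field 3 is the level, field 6 the key, fields 7 onwards the value.
qp_log_field() {
    awk -v key="$2" '$3 == "EXEC" && $6 == key {
        for (i = 7; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "$1"
}

qp_log_field "$log" TimeTaken   # prints: 04:33:12
[ "$(qp_log_field "$log" ExitStatus)" = "0" ] && echo "run finished cleanly"
```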