qMotif 1.0

Authors:

Introduction
Installation
Usage
Options
Config file
How it works
Output
- XML
- BAM

This document refers to an old version of qMotif. The current version is 1.2

Introduction

qmotif runs against a BAM file searching for user specified motifs. It's primary use to date has been to search for repeats commonly found within telomeres.

Installation

qmotif requires java 7 and (ideally) a multi-core machine with at least 20GB of RAM. To install:

Download the qmotif tar file.
Untar the tar file into a directory of your choice.
Invoke java with the qmotif jar file

When you untar the qmotif tar file, you should see java jar files for qmotif and its dependencies:

tar xjvf qmotif.tar.bz2
x antlr-3.2.jar
x jopt-simple-3.2.jar
x picard-1.110.jar
x qbamfilter-1.0pre.jar
x qcommon-0.1pre.jar
x qpicard-0.1pre.jar
X qio-0.1pre.jar
x qmotif-1.0.jar
x sam-1.110.jar
x qmotif.ini

The qmotif-1.0.jar jar file contains the qmotif code and the other jar files contain libraries that are used by qmotif.

Usage

java -Xmx20g -jar qmotif-1.0.jar \
            -n 8 \
            --bam /mydata/bamFile.bam \
            --bai /mydata/bamFile.bam.bai \
            --log /mydata/bamFile.bam.qmotif.log \
            --loglevel DEBUG \
            -i /mydata/qmotif.ini \
            -o /mydata/bamFile.bam.qmotif.xml \
            -o /mydata/bamFile.telomere.bam

Options

The following command line options are supported:

Option	Required	Description
i	Yes	Configuration file in INI format. This file is designed to contain parameters that are likely to be fixed for a given type of analysis. So the general idea is that you test parameters in the INI file until you are happy and then you use that single INI file with all of the BAM files that are part of this analysis.
bam	Yes	BAM file on which to operate.
bai	Yes	Index file for the BAM file being processed.
log	Yes	Log file.
loglevel	No	Level of verbosity of logging. The default level is INFO and valid values are INFO, DEBUG.
n	No	Number of threads to use when running qmotif. Defaults to 1.
o	Yes	Output XML file containing the motifs found and the regions they were found in.
o	Yes	Output BAM file containing the reads that matched the stage1 search regex/string

Config file

The qmotif config file is in the standard INI-file format and has 3 sections, PARAMS, INCLUDES, and EXCLUDES. The PARAMS sections lists the search criteria along with other customisable parameters, the INCLUDES section contains a list of genomic regions that are to be targeted for analysis, and the EXCLUDES section contains a list of genomic regions that are to be excluded from analysis.

Lines can be commented out by prepending a ';' character as can be seen in the example below where a number of different stage1 and stage2 motif parameters appear but only one of each type is ''not'' commented out.

An example ini file follows:

[PARAMS]
stage1_motif_string=TTAGGGTTAGGGTTAGGG
;stage1_motif_string=TTAGGGTTAGGGTTAGGGTTAGGG
;stage2_motif_string=TTAGGG
stage2_motif_regex=(...GGG){2,}|(CCC...){2,}
;stage2_motif_regex=((TTA|TCA|TTC|GTA|TGA|TTG|TAA|ATA|CTA|TTT|TTAA)GGG){2,}|(CCC(TAA|TGA|GAA|TAC|TCA|CAA|TTA|TAT|TAG|AAA|TTAA)){2,}
revcomp=true
window_size=10000
cutoff_size=5

[INCLUDES]
; name    regions (sequence:start-stop)
chr1p   chr1:10001-12464
chr1q   chr1:249237907-249240620
chr2p   chr2:10001-12592
chr2q   chr2:243187373-243189372
chr2xA  chr2:243150480-243154648
chr3p   chr3:60001-62000
chr3q   chr3:197960430-197962429
chr3xB  chr3:197897576-197903397
chr4p   chr4:10001-12193
chr4q   chr4:191041613-191044275
chr5p   chr5:10001-13806
chr5q   chr5:180903260-180905259
chr6p   chr6:60001-62000
chr6q   chr6:171053067-171055066
...

[EXCLUDES]
; regions (sequence:start-stop)
;chr1:143274114-143274336

This table shows the parameters that can be specified in [PARAMS] :

Parameter	Example	Description
stage1_motif_string	TTAGGGTTAGGG	String that reads are matched against in the stage 1 match. More than one string can be specified separated by commas.
stage1_motif_regex	(...GGG){2,}\|(CCC...){2,}	Regular expression that reads are matched against in the stage 1 match.
stage2_motif_string	TTAGGGTTAGGG	As per `stage1_motif_string` but for the second stage matching.
stage2_motif_regex	(...GGG){2,}\|(CCC...){2,}	As per `stage1_motif_regex` but for the second stage matching.
revcomp	true	If this parameter is set to true then the reverse complement is also matched for any motifs specified in `stage1_motif_string` or `stage2_motif_string`. It has no effect if the `_regex` parameters are used so any regex should include the reverse complement if you care about it.
window_size	10000	For GENOMIC regions (see below), this is the size of the windows that are considered for reporting. If window_size is not specified, the default value is 10000.
cutoff_size	5	Regions with less than this number of matching reads are not reported. This was intended to remove noise in the output XML file by removing regions with read counts below a useful threshold. The default is 10 so if you want all regions with even a single matching read to be reported then you need to explicitly set a value of 0 for this parameter.

The _string and _regex parameters are mutually exclusive within a stage, i.e. you can set either stage1_motif_string or stage1_motif_regex but not both.

How it works

Every single read in the bam is put through the stage1 match (string or regex). If the read passes the stage1 match, we decide which region of the BAM the read falls within and depending on the type of region (see table below for a discussion of region types), the read may or may not go on to stage2 matching. If the read passes the stage1 match, the stage1Cov tally for the region is incremented.

Stage2 involves another match against the read sequence. Again, the match could be against a string or a regex but this time, the actual matches are retrieved and a tally is kept for each region of how many of which motifs were seen in reads from that region. If the read passes the stage2 match, the stage2Cov tally for the region is incremented.

The 2-stage matching system was specifically designed to help us with speed so in practice, we always use quick string matching for stage1 and only reads that pass stage1 (potentially) go on to the much slower regex match in stage2. There is no reason why you must do it this way but this is how we designed the system. This is worth understanding because if you decide to use a stage1 regex on a large BAM, your runtimes could blow out significantly. And if you use string matching in stage2, your tally of motifs will of course be very simple.

The reference genome is split into regions of 4 different types: INCLUDES, EXCLUDES, UNMAPPED, and GENOMIC. Depending on the region, mapped reads or unmapped reads or both types may be considered for stage2 matching. All reads are always put through stage1 so region type only becomes a factor for stage2 matching.

Region Type	Reads passed to stage2 match	Description
INCLUDES	mapped, unmapped	These regions are specified in the ini file and are the regions of particular interest. In the case of telomere analysis, these are regions that are defined as telomeric. All reads from pairs that fall within these regions are analysed, i.e. mapped reads and unmapped reads where the paired read maps within the region.
EXCLUDES	neither	These regions are specified in the ini file but in this case these are regions that we specifically want to exclude from the analysis. For example, in the case of telomeres these might include centromeric regions or non-telomeric places in the genome that are known to contain the telomeric motif and so should be actively ignored. EXCLUDE regions are still analysed and reported (only unmapped reads are included) but this gives us a mechanism to ensure that the region will not appear in any GENOMIC windows.
GENOMIC	unmapped	All other regions in the genome not covered by INCLUDES and EXCLUDES. Only unmapped reads are included in analysis of these regions. The size of each GENOMIC region is determined by the window_size entry in the ini file. The length of the chromosome is divided by the window size to give the number of genomic regions. Note that INCLUDES and EXCLUDES are overlaid on top of GENOMIC regions so if one of then occurs in the middle of a GENOMIC window, that window is split into two smaller windows on either side of the INCLUDE/EXCLUDE. This means that you need to pay attention to the chrPos attribute of region elements in the XML to work out how big the region is - you can't assume they are all the window_size bases.
UNMAPPED	unmapped	Single region for all pairs where both reads are unmapped. Obviously only unmapped reads are included in this analysis.

A matching example - a read with the following sequence:

ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT

will pass the stage1 check (assuming a filter of TTAGGGTTAGGGTTAGGG with revcomp set to true)
and will add the following string to the list of matches for the region:

CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA

which is identical to the read sequence apart from the first character and the last 4 characters.

This information is gathered for all reads in all regions and summarised in the output xml file.
An output bam file is also produced which contains only the reads that passed both the stage1 and stage2 filters.

Output

XML

Sample XML output from motif:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<qmotif>
  <summary bam="/path/to/my/bamfile.bam">
    <counts>
      <windowSize count="10000"/>
      <cutoff count="5"/>
      <totalReadCount count="1809130333"/>
      <noOfMotifs count="34497"/>
      <rawUnmapped count="15452"/>
      <rawIncludes count="107983"/>
      <rawGenomic count="214"/>
      <scaledUnmapped count="8541"/>
      <scaledIncludes count="59687"/>
      <scaledGenomic count="118"/>
    </counts>
  </summary>
  <motifs>
    <motif id="1" motif="AAAGGGAGAGGGGGAGGGGGAGGG" noOfHits="1"/>
    <motif id="2" motif="AAAGGGATAGGG" noOfHits="2"/>
    <motif id="3" motif="AAAGGGATTGGGCTAGGGATAGGGTTAGGGGTAGGG" noOfHits="1"/>
    <motif id="4" motif="AAAGGGCTAGGG" noOfHits="1"/>
    <motif id="5" motif="AAAGGGGAAGGG" noOfHits="1"/>
    ...
  </motifs>
  <regions>
    <region chrPos="chr1:0-9999" stage1Cov="30" stage2Cov="30" type="genomic">
      <motif id="16087" number="3" strand="R"/>
      <motif id="28504" number="1" strand="R"/>
      <motif id="16053" number="1" strand="R"/>
      <motif id="30640" number="1" strand="R"/>
      <motif id="22632" number="2" strand="R"/>
      ...
    </region>
    <region chrPos="chr1:10000-10000" stage1Cov="6" stage2Cov="6" type="genomic">
      <motif id="25994" number="6" strand="F"/>
    </region>
    <region chrPos="chr1:10001-12464" stage1Cov="1805" stage2Cov="1805" type="includes">
      <motif id="397" number="1" strand="F"/>
      <motif id="26591" number="1" strand="F"/>
      <motif id="7782" number="1" strand="F"/>
      <motif id="18961" number="1" strand="F"/>
      <motif id="3065" number="1" strand="F"/>
      ...
    </region>
    ...
    <region chrPos="unmapped:0-9999" stage1Cov="15452" stage2Cov="15452" type="unmapped">
      <motif id="5017" number="3" strand="F"/>
      <motif id="19855" number="1" strand="F"/>
      <motif id="20772" number="5" strand="F"/>
      <motif id="20256" number="1" strand="F"/>
      <motif id="18776" number="1" strand="F"/>
      ...
    </region>
  </regions>
</qmotif>

The output file has 3 sections:

summary - the input parameters plus any parameters that were not specified but that have defaults
motifs - an entry for each stage2_motif found along with a count of the total number of times the motif was seen
regions - counts of the different motifs seen in each of the regions (INCLUDES, EXCLUDES, GENOMIC, UNMAPPED) where each motif ID relates to one of the motif elements in "motifs"

BAM

Any reads that pass the stage 1 filter will make it into the output bam file, as long as the region that the read is in allows that type of read. For example:

If the read falls within an INCLUDES region, passes the stage 1 filter, and is mapped or unmapped, then it will make it into the bam.
If the read falls within an GENOMIC, or UNMAPPED region, passes the stage 1 filter, and is mapped it will NOT make it into the bam.
If the read is unmapped and passes the filter, then it will make it into the bam.
If the read is in an EXCLUDES region, then it will not make it into the bam, regardless of filters and mapped status.

Output bams are coordinate sorted by default.