The qbamfilter library is used in almost all AdamaJava tools as a way of filtering out reads that are not appropriate for the analysis being conducted. The heart of qbamfilter is its query language which determines which records pass and are included in the analysis, and which fail and are discarded.
A qbamfilter query is a string with combination of conditions, the form is:
operator( condition, condition+ )
i.e., it consists of operators and conditions and operators can themselves be conditions . Currently there are only two operators available - and and or. Conditions take the form:
key comparator value
There is no operator required, if the query only contain a single condition. At this point, examples might make more sense:
RNAME =~ chr*
This first example has a single condition (RNAME =~ chr*). Remembering that BAM records are only kept if the qbamfilter string resolves to TRUE for them, this query has the effect of rejecting any BAM record where the RNAME field does not contain the string "chr". This is useful in cases where your reference contains chromosomes plus additional non-chromosome sequences, e.g. GL000191.1, and where you wish to ignore the non-chromosomal records.
and( RNAME =~ chr*, Cigar_M > 35 )
This query is based on the previous but with the addition of a second condition. The second condition requires that more than 35 of the bases in the CIGAR string are "M" indicating a match or mismatch. The effect of this condition is that reads where most of the bases are clipped are discarded.
It is worth noting that conditions are evaluated left to right so standard short-circuit thinking applies, i.e. in an and() operator clause, if the first condition evaluates to false, the rest of the conditions are not evaluated. This means it is cheaper (and faster) to put the conditions that will reject the most records on the left, i.e. first.
and( RNAME =~ chr*, Cigar_M > 35, or( MAPQ > 20, option_ZM == 1 ) )
This example shows the addition of a third condition and that condition is another operator clause. The or() clause shown has 2 conditions and will return TRUE if either of the conditions are true. The first condition requires that a record's MAPQ score be above 20 and the second condition requires that a user-supplied ZM field is present and set to 1.
and (Cigar_M > 35, RNAME =~ chr*, or (MAPQ > 50, option_ZM == 1), Flag_DuplicateRead == false )
This query string, joined by operator and and contains three conditions. here we treat the sub query or (MAPQ > 50, option_ZM == 1) as one condition.
Below table lists current avalible condition key types for four of the most important SAM/BAM fields:
BAM Field | Key | Comparator | Value | Examples |
---|---|---|---|---|
Flag | flag_ReadPaired flag_ProperPair flag_ReadUnmapped flag_Mateunmapped flag_ReadNegativeStrand flag_MateNegativeStrand flag_FirstOfpair flag_SecondOfpair flag_NotprimaryAlignment flag_ReadFailsVendorQuality flag_DuplicateRead flag_SupplementaryRead |
"==" "!=" |
"1" "0" "true" "false" (case insenstive) |
to report all duplicated reads: flag_duplicated == true flag_duplicated != 0 |
Cigar | Cigar_M Cigar_I Cigar_D Cigar_N Cigar_S Cigar_H Cigar_P |
"==" "!=" ">=" "<=" ">" "<" |
int | to report all read with mapped base more than 15 Cigar_M >= 16 Cigar_M > 15 |
MAPQ | MAPQ | as above | int | to report all reads with higher mapping quality than 16 MAPQ > 16 |
SEQ | seq_numberN | as above | int | At moment, we can only count N base. eg. to report all reads containing less than 5 'N' base Seq_numberN < 5 |
QUAL | Qual_average | as above | int | At moment, we can count calculate average base quality. eg. to report all reads which average base quality is less than 20 Qual_average < 20 |
ISIZE | ISIZE | as above | int | to report all reads with longer insert size than 16 ISIZE > 16 |
POS | pos | as above | int | to report all reads with start position between 1000 to 2000 and ( pos >= 1000, pos <= 2000 ) |
optional field | option_<tag> | as above | int | to report the tag "ZM" which value lower than 2 option_ZM < 2 |
==" "!=" |
string ('*' is NOT wild case) |
to report the tag "RG" with value "Tumor" option_RG == Tumor to report the tag "ZP" with value "Z" option_ZP == Z** |
||
"=~" "!~" |
string treat '*' as wild case for string match only allow one or zero '*' appears at the begin or end of the value string |
to report the tag "RG" without containing subString "known" option_RG !~ known to report the tag "ZP" without starting with subString "unll" option_ZP !~= null* to report the tag "LB" without ending with subString "null" option_ZP !~= *null |
||
RNAME MRNM |
RNAME MRNM |
==" "!=" |
string ('*' is NOT wild case) |
to report all reads except the one mapping on chr1 RNAME != chr1 to report paired reads mapped on diff reference and ( RNAME != MRNM, MRNM != * ) |
"=~" "!~" |
string '*' is a wild case, allow one or zero '*' appears at the begin or end of the value string |
to report all reads mapping on the reference name start with chr RNAME =~ chr1* to report paired reads mapped on diff reference but mate reference name contain NG and ( RNAME != MRNM, MRNM =~ NG ) to report all reads mapping on the reference name end with "1", eg "chr1" , "chr11" RNAME =~ *1 |
additional query for MD field
An extra filter called MDFilter is used to count the mismatched base number based on the MDField. Below is an example of query the MD field when user want to filter out all SAM record contains one or two mismatched bases:
"and (MD_mismatch < 3, MD_Mismatch > 0)"
This query can combine with all queries shown in above table. eg.
" and ( CIGAR_M > 35 , FLAG_DuplicateRead != true , or (MD_mismatch < 3, MAPQ > 20) )"
This table lists current available additional condition types for four SAM/BAM optional fields (at moment only for MD tag):
optional|field |key (case insenstive)|comparator|value| examples
----|----|----|----|----
MD|MD_mismatch| ">="
"<="
"=="
"!="
">"
"<"|int| Discard any read with 2 or more mismatches as determined by inspecting the MD tag:
MD_mismatch < 2