Menu

qbamfilter_query

John Pearson christina Xu

The qbamfilter library is used in almost all AdamaJava tools as a way of filtering out reads that are not appropriate for the analysis being conducted. The heart of qbamfilter is its query language which determines which records pass and are included in the analysis, and which fail and are discarded.

A qbamfilter query is a string with combination of conditions, the form is:

operator( condition, condition+ )  

i.e., it consists of operators and conditions and operators can themselves be conditions . Currently there are only two operators available - and and or. Conditions take the form:

key comparator value 

There is no operator required, if the query only contain a single condition. At this point, examples might make more sense:

RNAME =~ chr* 

This first example has a single condition (RNAME =~ chr*). Remembering that BAM records are only kept if the qbamfilter string resolves to TRUE for them, this query has the effect of rejecting any BAM record where the RNAME field does not contain the string "chr". This is useful in cases where your reference contains chromosomes plus additional non-chromosome sequences, e.g. GL000191.1, and where you wish to ignore the non-chromosomal records.

and( RNAME =~ chr*, Cigar_M > 35 )

This query is based on the previous but with the addition of a second condition. The second condition requires that more than 35 of the bases in the CIGAR string are "M" indicating a match or mismatch. The effect of this condition is that reads where most of the bases are clipped are discarded.

It is worth noting that conditions are evaluated left to right so standard short-circuit thinking applies, i.e. in an and() operator clause, if the first condition evaluates to false, the rest of the conditions are not evaluated. This means it is cheaper (and faster) to put the conditions that will reject the most records on the left, i.e. first.

and( RNAME =~ chr*, Cigar_M > 35, or( MAPQ > 20, option_ZM == 1 ) )

This example shows the addition of a third condition and that condition is another operator clause. The or() clause shown has 2 conditions and will return TRUE if either of the conditions are true. The first condition requires that a record's MAPQ score be above 20 and the second condition requires that a user-supplied ZM field is present and set to 1.

and (Cigar_M > 35, RNAME =~ chr*, or (MAPQ > 50, option_ZM == 1), Flag_DuplicateRead == false )

This query string, joined by operator and and contains three conditions. here we treat the sub query or (MAPQ > 50, option_ZM == 1) as one condition.

Below table lists current avalible condition key types for four of the most important SAM/BAM fields:

BAM Field Key Comparator Value Examples
Flag flag_ReadPaired
flag_ProperPair
flag_ReadUnmapped
flag_Mateunmapped
flag_ReadNegativeStrand
flag_MateNegativeStrand
flag_FirstOfpair
flag_SecondOfpair
flag_NotprimaryAlignment
flag_ReadFailsVendorQuality
flag_DuplicateRead
flag_SupplementaryRead
"=="
"!="
"1"
"0"
"true"
"false"
(case insenstive)
to report all duplicated reads:
flag_duplicated == true
flag_duplicated != 0
Cigar Cigar_M
Cigar_I
Cigar_D
Cigar_N
Cigar_S
Cigar_H
Cigar_P
"=="
"!="
">="
"<="
">"
"<"
int to report all read with mapped base more than 15
Cigar_M >= 16
Cigar_M > 15
MAPQ MAPQ as above int to report all reads with higher mapping quality than 16
MAPQ > 16
SEQ seq_numberN as above int At moment, we can only count N base. eg. to report all reads containing less than 5 'N' base
Seq_numberN < 5
QUAL Qual_average as above int At moment, we can count calculate average base quality. eg. to report all reads which average base quality is less than 20
Qual_average < 20
ISIZE ISIZE as above int to report all reads with longer insert size than 16
ISIZE > 16
POS pos as above int to report all reads with start position between 1000 to 2000
and ( pos >= 1000, pos <= 2000 )
optional field option_<tag> as above int to report the tag "ZM" which value lower than 2
option_ZM < 2
=="
"!="
string
('*' is NOT wild case)
to report the tag "RG" with value "Tumor"
option_RG == Tumor
to report the tag "ZP" with value "Z"
option_ZP == Z**

"=~"
"!~"
string
treat '*' as wild case for string match only allow one or zero '*' appears at the begin or end of the value string
to report the tag "RG" without containing subString "known"
option_RG !~ known
to report the tag "ZP" without starting with subString "unll"
option_ZP !~= null*
to report the tag "LB" without ending with subString "null"
option_ZP !~= *null
RNAME
MRNM
RNAME
MRNM
=="
"!="
string
('*' is NOT wild case)
to report all reads except the one mapping on chr1
RNAME != chr1
to report paired reads mapped on diff reference
and ( RNAME != MRNM, MRNM != * )
"=~"
"!~"
string
'*' is a wild case, allow one or zero '*' appears at the begin or end of the value string
to report all reads mapping on the reference name start with chr
RNAME =~ chr1*
to report paired reads mapped on diff reference but mate reference name contain NG
and ( RNAME != MRNM, MRNM =~ NG )
to report all reads mapping on the reference name end with "1", eg "chr1" , "chr11"
RNAME =~ *1

additional query for MD field
An extra filter called MDFilter is used to count the mismatched base number based on the MDField. Below is an example of query the MD field when user want to filter out all SAM record contains one or two mismatched bases:

 "and (MD_mismatch < 3, MD_Mismatch > 0)"

This query can combine with all queries shown in above table. eg.
" and ( CIGAR_M > 35 , FLAG_DuplicateRead != true , or (MD_mismatch < 3, MAPQ > 20) )"
This table lists current available additional condition types for four SAM/BAM optional fields (at moment only for MD tag):

optional|field |key (case insenstive)|comparator|value| examples
----|----|----|----|----
MD|MD_mismatch| ">="
"<="
"=="
"!="
">"
"<"|int| Discard any read with 2 or more mismatches as determined by inspecting the MD tag:
MD_mismatch < 2


Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.