TriageTools Wiki

Tools for partitioning and prioritizing fastq data

Brought to you by: tkonopka

TriageRegex

Triage by regular expression

The regex tool selects reads based on a regular expression. This is useful for extracting reads with special labels or simple sequence motifs.

Single-end samples

To extract reads that contain a sequence pattern with 10 or more A nucleotides in a row, use

java -jar triagetools.jar regex --pattern "A{10,}" -i allreads.txt.gz -o myreads.txt.gz

This creates a single output file, myreads-hits.txt.gz, containing the reads that match the pattern. Note that the pattern matching is not aware of sequence complementarity, so reads containing long runs of Ts are not included in the hits output.
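
The strand caveat can be checked outside the tool. The following is a minimal sketch using Python's re module rather than TriageTools itself. The two example reads are made up for illustration, and the sketch assumes only that A{10,} carries its standard regular-expression meaning.

```python
import re

# Hypothetical example reads (not taken from any real dataset).
poly_a_read = "ACGT" + "A" * 12 + "CGT"
poly_t_read = "ACGT" + "T" * 12 + "CGT"

# Same pattern as in the command above.
pattern = re.compile(r"A{10,}")
# Alternation that also covers the complementary run (an illustrative variant).
either_strand = re.compile(r"A{10,}|T{10,}")

def has_hit(regex, sequence):
    """Return True if the regex matches anywhere in the sequence."""
    return regex.search(sequence) is not None

print(has_hit(pattern, poly_a_read))        # True
print(has_hit(pattern, poly_t_read))        # False
print(has_hit(either_strand, poly_t_read))  # True
```

If reads from both strands are wanted, an alternation such as "A{10,}|T{10,}" passed to --pattern should capture both, assuming the tool accepts standard regex alternation.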

To obtain a full partition of the input, i.e. one file with pattern-containing reads and another with the remaining reads, add the --all flag:

java -jar triagetools.jar regex --pattern "B{10,}" --all -i allreads.txt.gz -o myreads.txt.gz

Here, the pattern is changed to match runs of 10 or more B characters. In Illumina reads whose quality strings use B as a low-quality indicator, this picks out reads with long low-quality tails that require trimming.
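
Conceptually, --all splits the input into two disjoint sets. A minimal Python sketch of that partition follows. It is not the tool's implementation: the partition helper and the example quality strings are hypothetical, and the sketch assumes the pattern is tested against the quality line of each read.

```python
import re

def partition(records, regex):
    """Split records into (hits, rest) by whether the regex matches."""
    hits, rest = [], []
    for record in records:
        (hits if regex.search(record) else rest).append(record)
    return hits, rest

# Hypothetical quality strings in an older Illumina-style encoding.
qualities = [
    "hhhhhhhhgggggfffffeeeee",   # good quality throughout
    "hhhhhggggg" + "B" * 14,     # long low-quality tail
    "hhhhBBBhhhhh",              # short run of B, below the threshold
]

hits, rest = partition(qualities, re.compile(r"B{10,}"))
print(len(hits), len(rest))  # 1 2
```

Every record lands in exactly one of the two lists, which is the property that makes the result a full partition.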

Paired-end samples

Paired samples are processed with two -i and two -o flags, e.g.

java -jar triagetools.jar regex --pattern chr17 -i allreads_1.txt.gz -i allreads_2.txt.gz \
    -o myreads_1.txt.gz -o myreads_2.txt.gz

Here, the pattern is a string with a chromosome name. This can be useful for processing synthetic data in which the chromosome of origin has been encoded in the read ID.
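
A plain string pattern matches wherever it occurs, which matters for chromosome names that are prefixes of others. The sketch below uses Python's re module on made-up read IDs. The "@chrN_position/mate" layout is an assumption, since simulators differ in how they encode the origin.

```python
import re

# Hypothetical read IDs; the ID layout is assumed for illustration only.
read_ids = ["@chr17_41196312/1", "@chr1_1500200/1", "@chr10_735000/1"]

def matching_ids(pattern, ids):
    """Return the IDs in which the pattern occurs anywhere."""
    regex = re.compile(pattern)
    return [read_id for read_id in ids if regex.search(read_id)]

print(matching_ids("chr17", read_ids))  # ['@chr17_41196312/1']
print(matching_ids("chr1", read_ids))   # matches all three IDs
print(matching_ids("chr1_", read_ids))  # ['@chr1_1500200/1']
```

For chr17 the plain name is unambiguous, but a pattern such as chr1 would also select chr10 to chr19, so including the delimiter that follows the name in the read ID is a safer choice.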