The duplicates tool selects, from a large set of reads, those items that have unique sequences. Non-unique sequences, i.e. duplicates, can appear in a sample due to PCR amplification and thus inflate apparent coverage depth without being informative about genetic content.
Duplicate sequences can be removed from alignment (bam) files with existing tools such as Picard. With Triagetools, many of them can also be removed at the fastq level. Keeping duplicates out of alignment programs and downstream analysis can reduce computational costs and avoid skewing calculations that depend on sequencing depth.
Identifying duplicate sequences involves sorting a large set of reads, which is achieved in two stages. In the first stage, read sequences are placed into buckets, which are stored as temporary files on disk. In the second stage, each bucket is sorted in memory and duplicates are identified. Depending on the size of the buckets and the number of duplicate reads, this second stage may consume a sizable amount of memory.
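As an illustration, here is a minimal Java sketch of the two-stage scheme, with an in-memory map standing in for the temporary files. This is not Triagetools' actual implementation, and bucketing by sequence prefix is an inference from the --prefixsize parameter described below.

    import java.util.*;

    // Minimal sketch of two-stage duplicate detection (illustration only).
    // Stage 1: distribute reads into buckets keyed by a short sequence prefix
    // (the real tool writes each bucket to a temporary file on disk).
    // Stage 2: sort each bucket in memory so identical sequences become
    // adjacent, then flag the repeats.
    public class TwoStageSketch {
        public static void main(String[] args) {
            List<String> reads = Arrays.asList("ATCGGT", "ATCGGT", "CCGTAA", "ATTGCA");
            int prefixSize = 1;

            // Stage 1: bucket by prefix.
            Map<String, List<String>> buckets = new HashMap<>();
            for (String read : reads) {
                String key = read.substring(0, prefixSize);
                buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(read);
            }

            // Stage 2: sort and scan each bucket independently.
            for (List<String> bucket : buckets.values()) {
                Collections.sort(bucket);
                for (int i = 0; i < bucket.size(); i++) {
                    boolean dup = i > 0 && bucket.get(i).equals(bucket.get(i - 1));
                    System.out.println(bucket.get(i) + (dup ? "  duplicate" : "  unique"));
                }
            }
        }
    }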
By default, reads are treated as duplicates if their sequences are identical up to reverse complementation; sequences such as ATCG and CGAT are thus considered identical. When two or more reads are identical, one of them is output as unique and the others as duplicates. The unique representative is the read with the maximum sum of base qualities.
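One way to implement this comparison, sketched below, is to map each read to a canonical form (the lexicographic minimum of the sequence and its reverse complement) and to sum base qualities when picking the representative. The Phred+33 quality encoding is an assumption; the tool's internals may differ.

    // Sketch of the default comparison rule (illustration only).
    public class CanonicalSketch {
        // Reverse complement: ATCG -> CGAT.
        static String reverseComplement(String seq) {
            StringBuilder sb = new StringBuilder(seq.length());
            for (int i = seq.length() - 1; i >= 0; i--) {
                switch (seq.charAt(i)) {
                    case 'A': sb.append('T'); break;
                    case 'T': sb.append('A'); break;
                    case 'C': sb.append('G'); break;
                    case 'G': sb.append('C'); break;
                    default:  sb.append('N');
                }
            }
            return sb.toString();
        }

        // Two reads are duplicates if their canonical forms match.
        static String canonical(String seq) {
            String rc = reverseComplement(seq);
            return seq.compareTo(rc) <= 0 ? seq : rc;
        }

        // Sum of base qualities (Phred+33 encoding assumed); the read with
        // the highest sum is kept as the unique representative.
        static int qualitySum(String quals) {
            int sum = 0;
            for (int i = 0; i < quals.length(); i++) sum += quals.charAt(i) - 33;
            return sum;
        }

        public static void main(String[] args) {
            System.out.println(canonical("ATCG").equals(canonical("CGAT"))); // true
        }
    }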
It is also possible to search for duplicates based on a part of the read sequence. This feature is useful for longer reads, which can contain sequencing errors toward the end of the read that would otherwise prevent exact string matching. Specifying an interval from the third to the fiftieth base, for example, allows duplicates to be detected despite end-of-read sequencing errors.
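A minimal sketch of such an interval key follows, using 1-based inclusive coordinates to match the --from/--to options shown below; the clamping behavior for reads shorter than the interval is an assumption.

    // Sketch: compare reads on a sub-interval only (illustration only).
    public class RegionKey {
        // Extract the comparison key: bases from..to, 1-based and inclusive,
        // clamped so that reads shorter than 'to' still yield a key.
        static String keyRegion(String seq, int from, int to) {
            int start = Math.min(from - 1, seq.length());
            int end = Math.min(to, seq.length());
            return seq.substring(start, end);
        }

        public static void main(String[] args) {
            // Reads differing only outside bases 3..8 share the same key.
            System.out.println(keyRegion("ACGTACGTAC", 3, 8)); // GTACGT
        }
    }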
To extract unique reads from a sample:
java -Xmx8g -jar triagetools.jar duplicates -i allreads.txt.gz -o myreads.txt.gz
This will create one output file, myreads-unique.txt.gz, containing reads with unique sequences. To also obtain the set of duplicates, add the --all flag:
java -Xmx8g -jar triagetools.jar duplicates --all -i allreads.txt.gz -o myreads.txt.gz
This will create two output files myreads-unique.txt.gz and myreads-duplicate.txt.gz.
To detect duplicates based on a sub-interval of the read sequence, from the third to the 48th base:
java -Xmx8g -jar triagetools.jar duplicates --from 3 --to 48 --all -i allreads.txt.gz -o myreads.txt.gz
The commands above create and then clean up some temporary files. The number of these files can be controlled via the parameter --prefixsize, e.g.
java -Xmx8g -jar triagetools.jar duplicates --prefixsize 3 -i allreads.txt.gz -o myreads.txt.gz
The value 3 is the default and causes the tool to use 65 (= 1 + 4^3) temporary files. A larger --prefixsize implies more temporary files.
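The count is consistent with one bucket per possible prefix over the alphabet {A, C, G, T}, plus one extra file; the role of that extra file (e.g. for reads containing N) is an assumption, not documented behavior. A small sketch of the arithmetic:

    public class BucketCount {
        // Temporary-file count implied by --prefixsize p: 1 + 4^p, i.e.
        // 4^p prefix buckets over {A,C,G,T} plus one extra file (the role
        // of the extra file is an assumption, not documented behavior).
        static int bucketCount(int p) {
            return 1 + (int) Math.pow(4, p);
        }

        public static void main(String[] args) {
            System.out.println(bucketCount(3)); // 65, the default
            System.out.println(bucketCount(4)); // 257
        }
    }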
Paired-end data can be processed using multiple -i and -o flags, e.g.:
java -Xmx8g -jar triagetools.jar duplicates -i allreads_1.txt.gz -i allreads_2.txt.gz -o myreads_1.txt.gz -o myreads_2.txt.gz
For two paired-end reads to be considered duplicates, the pair of mates must be identical up to sequence complementarity. For example, a pair AAA/CCC is treated as identical to CCC/AAA, but not to TTT/GGG.
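One way to realize the behavior in this example, sketched below, is an order-insensitive pair key: the lexicographically smaller mate is placed first, so AAA/CCC and CCC/AAA map to the same key while TTT/GGG does not. This captures only the mate-order aspect of the example; the tool's full pair comparison may differ.

    // Sketch: an order-insensitive key for a read pair (illustration only).
    public class PairKey {
        static String pairKey(String mate1, String mate2) {
            // Put the lexicographically smaller mate first, so the key
            // does not depend on mate order.
            return mate1.compareTo(mate2) <= 0 ? mate1 + "/" + mate2
                                               : mate2 + "/" + mate1;
        }

        public static void main(String[] args) {
            System.out.println(pairKey("AAA", "CCC").equals(pairKey("CCC", "AAA"))); // true
            System.out.println(pairKey("AAA", "CCC").equals(pairKey("TTT", "GGG"))); // false
        }
    }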
Note: the commands above include the -Xmx8g flag to allow the Java virtual machine to use a large pool of memory. 8 GB is typically sufficient for exome and transcriptome samples of average coverage depth; the amount actually used will depend on the input data (number of reads, read length, number of duplicates).