Hongshan Jiang - 2013-09-15

Background

As sequencing technologies continue to improve in throughput and read length, the sequencing read length increasingly exceeds the length of some or most of the target DNA/RNA fragments. When that happens, the read runs past the end of the fragment into the adapter, so adapter trimming is required after the sequencing run. The most popular application of adapter trimming is small RNA sequencing, where the typical fragment length ranges from 16 to 30 nt and the sequencing read length is typically 36 nt.
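The small RNA case can be illustrated with a minimal sketch. It is not the method of any particular tool: it only handles exact matches, whereas real trimmers must also tolerate sequencing errors. The function name trim_adapter, the min_overlap parameter, the 22-nt fragment, and the adapter sequence are all assumptions chosen for illustration.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter from a read (exact matches only, illustrative)."""
    # Case 1: the complete adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter, so only a prefix
    # of the adapter is present. Try the longest prefix first.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # No adapter found: the fragment is at least as long as the read.
    return read


# Hypothetical example: a 22-nt fragment sequenced with 36-nt reads.
fragment = "TGAGGTAGTAGGTTGTATAGTT"        # 22 nt, illustrative
adapter = "TGGAATTCTCGGGTGCCAAGG"          # 21 nt, assumed adapter sequence
read = (fragment + adapter)[:36]           # the sequencer reads 36 cycles

trimmed = trim_adapter(read, adapter)
print(len(read), len(trimmed), trimmed == fragment)
```

Here the last 14 bases of the read come from the adapter rather than from the fragment, which is why they must be removed before alignment.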

In resequencing projects for genotyping, such as SNP calling, indel calling, and copy number variation detection, longer reads improve alignment specificity and reduce the fraction of reads that have multiple hits in the genome. The improved specificity in turn benefits the final variant calling. In practice, however, DNA sample preparation is a stochastic process involving DNA fragmentation, enrichment, and sometimes PCR, so the prepared DNA fragments span a range of lengths, and one cannot guarantee collecting enough DNA templates that are longer than the specified read length. Depending on the DNA sequence context and the fragmentation technology used, some regions tend to be more fragmented while other regions produce longer DNA fragments. To balance uniformity of genome coverage against specificity of the sequence alignment, one has to choose a fragment set that includes shorter fragments, and reads from those shorter fragments run into the adapter.

Many off-the-shelf adapter trimming tools are available today, e.g. FASTQC, BTrim, and cutadapt. Surprisingly, none of them uses the information in paired-end reads for adapter trimming. Furthermore, none of them parallelizes the trimming task, which is likely to become more and more routine in the future.