Read Filter

Marc Strous

Filters raw sequencing reads in a fastq or fasta file. It can remove tags, and trim or discard reads based on per-base sequencing quality. You may want to run this module on its own before running the remainder of the pipeline, to determine the optimal parameters for the Read Filter. You can inspect the results in the user interface under the tab "reads", which shows the length distribution of the reads, the quality as a function of read length, and the base composition as a function of read length. Atypical base composition at the start of the reads probably means that you still need to clip off tags. The tab also reports how much of the original data was removed by the filter.
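The check for leftover tags can also be done outside the user interface. The following is a minimal Python sketch, not the module's own code: it tabulates the base composition at the first few positions of a set of read sequences. The function name `base_composition` and its `max_positions` argument are hypothetical and chosen for illustration only.

```python
from collections import Counter


def base_composition(reads, max_positions=20):
    """Return one dict per read position, mapping base -> fraction of reads.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    counts = [Counter() for _ in range(max_positions)]
    for seq in reads:
        for pos, base in enumerate(seq[:max_positions]):
            counts[pos][base] += 1
    composition = []
    for counter in counts:
        total = sum(counter.values())
        if total:
            composition.append({b: n / total for b, n in counter.items()})
        else:
            composition.append({})  # no read is long enough to cover this position
    return composition


# Toy example: every read starts with "AC", which hints at a 2-base tag.
reads = ["ACGTA", "ACTTC", "ACGGG", "ACCAT"]
for pos, fractions in enumerate(base_composition(reads, max_positions=3)):
    print(pos, fractions)
```

If the first positions are dominated by a single base, as in the toy example, setting the "Read start position" parameter to the length of that prefix is a reasonable starting point.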


Modules per project

One Read Filter module is created for every readset added to the project.


Runtime

Seconds to minutes, depending on the number of reads.

External dependencies

None.


Parameters (type, default)

  • Minimum read length to keep (int, 25): Shorter reads will be discarded.

  • Maximum read length (int, 1000): Longer reads will be trimmed to the maximum read length.

  • Minimum base quality (int, 15): When a window (see Minimum base quality window size below) of bases is encountered that all have a lower quality than the minimum base quality, the read is trimmed at the start of the window.

  • Minimum base quality window size (int, 4): When a window of x bases is encountered that all have a lower quality than the Minimum base quality (see above), the read is trimmed at the start of the window.

  • Minimum read quality (int, 10): If the average base quality of the read is below this value, the read is discarded.

  • Read start position (int, 0): The bases before the start position are discarded; use this parameter to clip tags.

  • Format of read file (enumeration, autodetect): Fastq, fasta, or autodetect.

  • Fastq encoding (enumeration, autodetect): Phred+33, Phred+64, or autodetect.

  • Filter for low complexity (boolean, off): Use this filter for the mapping of transcriptomic data to prevent ambiguous mappings. This is a simple filter based on tetranucleotide frequencies. If the read contains fewer distinct tetranucleotides than the minimum of ([read length] * [Low complexity filter D/L]) and (136 * [Low complexity filter max D]), the read is discarded.

  • Low complexity filter D/L (double, 0.25): See above; the length-dependent minimum.

  • Low complexity filter max D (double, 0.66): See above; the length-independent minimum.

  • Number of reads used for statistics (int, 100,000): This is the maximum number of reads that is used for the computation of the overall read statistics.
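How the parameters above interact can be sketched in Python. This is an illustration under stated assumptions, not the module's implementation: the order of the steps (clip the start, cap the length, trim by quality window, then apply the length and average quality filters) is assumed, and the low complexity check assumes that the filter counts distinct tetranucleotides with reverse complements merged, which gives 136 possible values. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def decode_qualities(quality_string, offset=33):
    """Convert a fastq quality string to Phred scores (offset 33 or 64)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_and_filter(seq, quals, start=0, max_length=1000, min_base_quality=15,
                    window_size=4, min_read_quality=10, min_length=25):
    """Return the trimmed (seq, quals), or None if the read is discarded."""
    # Assumed order: clip the start, cap the length, then trim by quality.
    seq = seq[start:start + max_length]
    quals = quals[start:start + max_length]
    # Trim at the start of the first window in which every base is below
    # the minimum base quality.
    for i in range(len(quals) - window_size + 1):
        if all(q < min_base_quality for q in quals[i:i + window_size]):
            seq, quals = seq[:i], quals[:i]
            break
    if not seq or len(seq) < min_length:
        return None  # too short after trimming
    if sum(quals) / len(quals) < min_read_quality:
        return None  # average base quality too low
    return seq, quals


def count_tetranucleotides(seq):
    """Count distinct tetranucleotides, merging reverse complements (assumption)."""
    distinct = set()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            reverse_complement = kmer.translate(COMPLEMENT)[::-1]
            distinct.add(min(kmer, reverse_complement))
    return len(distinct)


def is_low_complexity(seq, d_over_l=0.25, max_d=0.66):
    """Apply the threshold min(length * D/L, 136 * max D) described above."""
    threshold = min(len(seq) * d_over_l, 136 * max_d)
    return count_tetranucleotides(seq) < threshold


# A 40-base read whose last 10 bases have low quality is trimmed to 30 bases.
print(trim_and_filter("ACGT" * 10, [30] * 30 + [5] * 10))
# A homopolymer contains a single distinct tetranucleotide.
print(is_low_complexity("A" * 100))
```

With the default values, a 100-base read needs at least 25 distinct tetranucleotides to pass the low complexity filter; for reads of 360 bases or more, the length-independent minimum of 136 * 0.66 = 89.76 applies instead.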


Files generated

  • ./temp/[readfilename].filtered: Fastq/fasta file with the filtered reads.

  • ./temp/[readfilename].mates.filtered: Fastq/fasta file with the filtered mate reads.

  • ./temp/[readfilename].stats: File with some statistics on the raw and filtered reads.


Related

Wiki: Pipeline modules
