David Rosenkranz

NGS tools for the novice

This toolbox comprises simple and handy Perl scripts for processing next-generation sequencing (NGS) data. The Perl scripts are command-line based and thus well suited for automated sequence analysis pipelines. Usage details and instructions on how to create personalized pipelines are available in this Wiki or at:

http://www.uni-mainz.de/FB/Biologie/Anthropologie/487_ENG_HTML.php

NGS tools for the novice was written by David Rosenkranz, Institute of Anthropology, Johannes Gutenberg University Mainz, Germany. Visit the author's homepage for further information and bioinformatics software.

http://www.uni-mainz.de/FB/Biologie/Anthropologie/428_ENG_HTML.php

Email contact: rosenkrd@uni-mainz.de

The complete toolbox is packed in the compressed folder NGS-toolbox.zip.

List of tools (21.02.2012):

  • basic_analyses.pl
    Counts the number of sequences, shows length distribution and
    calculates overall base composition

  • discard_redundant_sequences.pl
    Discards redundant sequences from the dataset. FASTA titles
    in the output will refer to the sequence abundance.

  • FASTQ_to_FASTA.pl
    Converts sequence files from FASTQ to FASTA format

  • filter_simple_repeats.pl
    Filters sequences that contain or consist solely of stretches
    of simple repeats (homo- and/or dipolymeric stretches).

  • length_cutoff.pl
    Applies a user-defined length cutoff. Sequences will be sorted
    into three output files (<min length, >max length, >min<max length).

  • map_sequences.pl
    Maps sequences to an arbitrary number of reference sequences from
    one or several files.

  • merge_FASTA.pl
    Concatenates an arbitrary number of FASTA files.

  • q_filter.pl
    Filters sequence reads based on Phred quality scores. Several
    options for the filtering process are available. Low quality
    ends of sequence reads (indicated by B for Illumina 1.5+ or #
    for Illumina 1.8+) can be clipped prior to the filtering process.

  • q_analyzer.pl
    Analyzes FASTQ files (Illumina or Sanger format) based on Phred
    quality scores. Outputs helpful statistics like average overall
    read accuracy and average positional Phred score.

  • remove_TAGs.pl
    Removes TAG sequences from input files. Several options for removal
    (e.g. only the TAG, everything preceding the TAG but not the TAG
    itself, etc.) are available. Sequences will automatically be sorted by TAG.

  • reverse_complement.pl
    Manipulates sequences and makes them reverse, complementary or
    reverse complementary.

  • sort_by_TAGs.pl
    Sorts sequences by TAG without removing the TAG. Several options
    for TAG tracing are available (sequence has to start/end with TAG
    etc.).

  • split_FASTA.pl
    Splits a FASTA file into several output files. The user can set a
    maximum number of sequences per output file or determine a fixed
    number of output files per input file.
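
To make a few of the operations listed above more concrete, here is a minimal sketch in Python. It is not taken from the Perl scripts and does not claim to reproduce their behavior or options. All function names are hypothetical, and the sketch assumes four-line FASTQ records (no wrapped sequence lines), DNA sequences containing only A, C, G, T and N, and a Phred offset of 33 (Sanger / Illumina 1.8+) unless stated otherwise.

```python
from collections import Counter

# Complement table; assumption: only A, C, G, T and N occur (no other IUPAC codes).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA text (idea behind FASTQ_to_FASTA.pl)."""
    lines = fastq_text.strip().splitlines()
    out = []
    for i in range(0, len(lines) - 3, 4):
        title, seq = lines[i], lines[i + 1]
        out.append(">" + title[1:] + "\n")  # replace the leading '@' with '>'
        out.append(seq + "\n")              # quality lines are dropped
    return "".join(out)


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def collapse_redundant(seqs):
    """Return (sequence, abundance) pairs, most abundant first."""
    return Counter(seqs).most_common()


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string.

    offset=33 for Sanger / Illumina 1.8+, offset=64 for Illumina 1.3 to 1.7.
    """
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def clip_low_quality_tail(seq, quality, marker="#"):
    """Clip a trailing run of the low-quality marker ('#' or 'B') from a read."""
    kept = len(quality.rstrip(marker))
    return seq[:kept], quality[:kept]


if __name__ == "__main__":
    record = "@read1\nACGTTGCA\n+\nIIIIII##\n"
    print(fastq_to_fasta(record), end="")
    print(reverse_complement("AACG"))
    print(collapse_redundant(["ACGT", "ACGT", "TTTT"]))
    print(clip_low_quality_tail("ACGTTGCA", "IIIIII##"))
```

The marker characters correspond to a Phred score of 2 in both encodings: `B` with offset 64 and `#` with offset 33. A real pipeline should rely on the scripts themselves and their embedded instructions.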

Short usage instructions for each Perl script are embedded within the script itself. You can also browse the local Wiki or visit the project homepage at:

http://www.uni-mainz.de/FB/Biologie/Anthropologie/472_ENG_HTML.php

IMPORTANT NOTE / DISCLAIMER:
It is strongly recommended to work in a separate folder. Create backup copies of all your datasets in a separate folder. Files may be overwritten without confirmation by the user! We assume no liability for loss of data or for the correctness of results.