This toolbox comprises simple and handy Perl scripts for processing next-generation sequencing (NGS) data. The Perl scripts are command-line based and thus well suited for automated sequence analysis pipelines. Usage details and instructions on how to create personalized pipelines are available in this Wiki or at:
http://www.uni-mainz.de/FB/Biologie/Anthropologie/487_ENG_HTML.php
NGS tools for the novice was written by David Rosenkranz, Institute of Anthropology, Johannes Gutenberg University Mainz, Germany. Visit the author's homepage for further information and bioinformatics software.
http://www.uni-mainz.de/FB/Biologie/Anthropologie/428_ENG_HTML.php
Email contact: rosenkrd@uni-mainz.de
The complete toolbox is packed in the compressed folder NGS-toolbox.zip.
List of tools (21.02.2012):
basic_analyses.pl
Counts the number of sequences, shows length distribution and
calculates overall base composition
discard_redundant_sequences.pl
Discards redundant sequences from the dataset. The FASTA titles
in the output state the abundance of each sequence.
FASTQ_to_FASTA.pl
Converts sequence files from FASTQ to FASTA format
filter_simple_repeats.pl
Filters sequences that contain or consist solely of stretches
of simple repeats (homo- and/or dipolymeric stretches).
length_cutoff.pl
Applies a user-defined length cutoff. Sequences will be sorted
into three output files (<min length, >max length, >min<max length).
map_sequences.pl
Maps sequences to an arbitrary number of reference sequences from
one or several files.
merge_FASTA.pl
Concatenates an arbitrary number of FASTA files.
q_filter.pl
Filters sequence reads based on Phred quality scores. Several
options for the filtering process are available. Low quality
ends of sequence reads (indicated by B for Illumina 1.5+ or #
for Illumina 1.8+) can be clipped prior to the filtering process.
q_analyzer.pl
Analyzes FASTQ files (Illumina or Sanger format) based on Phred
quality scores. Outputs helpful statistics like average overall
read accuracy and average positional Phred score.
remove_TAGs.pl
Removes TAG sequences from input files. Several options for removal
(e.g. only the TAG, everything preceding the TAG but not the TAG itself
etc.) are available. Sequences will automatically be sorted by TAG.
reverse_complement.pl
Converts sequences to their reverse, complement or
reverse complement.
sort_by_TAGs.pl
Sorts sequences by TAG without removing the TAG. Several options
for TAG tracing are available (sequence has to start/end with TAG
etc.).
split_FASTA.pl
Splits a FASTA file into several output files. The user can set a
maximum number of sequences per output file or determine a fixed
number of output files per input file.
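To illustrate what some of the listed operations do, here is a minimal sketch in Python of three of them: FASTQ to FASTA conversion, reverse complementing, and collapsing redundant sequences. It is not the code of the Perl scripts and does not reproduce their options or output naming. The function names, the assumption of strict four-line FASTQ records, and the use of the bare count as the FASTA title are all assumptions made for this example.

```python
from collections import Counter

# Translation table for the complement of DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def fastq_to_fasta(fastq_text):
    """Convert FASTQ text to FASTA text.

    Assumes strict four-line records (title, sequence, '+', qualities)
    and no empty sequences; blank lines are skipped.
    """
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    out = []
    for i in range(0, len(lines) - 3, 4):
        title, seq = lines[i], lines[i + 1]
        if not title.startswith("@"):
            raise ValueError("expected '@' at start of record: " + title)
        out.append(">" + title[1:])  # FASTQ '@title' becomes FASTA '>title'
        out.append(seq)              # quality lines are dropped
    return "\n".join(out) + "\n" if out else ""


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def collapse_redundant(seqs):
    """Keep one copy of each sequence and use its count as FASTA title.

    The title format (count only) is a hypothetical choice for this sketch.
    """
    counts = Counter(seqs)  # keeps first-seen order of distinct sequences
    return "".join(">{}\n{}\n".format(n, s) for s, n in counts.items())


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n"
    print(fastq_to_fasta(example), end="")
    print(reverse_complement("AACG"))
    print(collapse_redundant(["ACGT", "GG", "ACGT"]), end="")
```

For real datasets, use the Perl scripts themselves; the sketch reads everything into memory and performs no format validation beyond the title check.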
Short usage instructions are embedded in each Perl script. You can also browse the local Wiki or visit the project homepage at:
http://www.uni-mainz.de/FB/Biologie/Anthropologie/472_ENG_HTML.php
IMPORTANT NOTE / DISCLAIMER:
It is strongly recommended to work in a separate folder. Create backup copies of all your datasets in another separate folder. Files may be overwritten without confirmation by the user! We assume no liability for loss of data or for the correctness of results.