PReprocessing and INformation of SEQuence data (http://prinseq.sourceforge.net)
PRINSEQ will help you to preprocess your genomic or metagenomic sequence data in
FASTA or FASTQ format.
The lite version is a standalone perl script (prinseq-lite.pl) that does not
require any non-core perl modules for processing.
The modules used are:
Fcntl qw(:flock SEEK_END)
List::Util qw(sum min max)
To run the lite version, you can either use "perl prinseq-lite.pl [options]" or
rename "prinseq-lite.pl" to "prinseq-lite", make it executable
("chmod +x prinseq-lite") and use "prinseq-lite [options]".
The available options are processed in the following order: seq_num, trim_left,
trim_right, trim_left_p, trim_right_p, trim_qual_left, trim_qual_right,
trim_tail_left, trim_tail_right, trim_ns_left, trim_ns_right, trim_to_len,
min_len, max_len, range_len, min_qual_score, max_qual_score, min_qual_mean,
max_qual_mean, min_gc, max_gc, range_gc, ns_max_p, ns_max_n, noniupac,
lc_method, derep, seq_id, seq_case, dna_rna, out_format
The graphs version is a standalone perl script (prinseq-graphs.pl) that
generates graphs and HTML report files. The input is generated by the lite
version.
The modules used are:
Fcntl qw(:flock SEEK_END)
Due to issues with the Statistics::PCA module on certain platforms, a graphs
version without this module is additionally provided (prinseq-graphs-noPCA.pl).
To run the graphs version, you can either use "perl prinseq-graphs.pl [options]"
or rename "prinseq-graphs.pl" to "prinseq-graphs", make it executable
("chmod +x prinseq-graphs") and use "prinseq-graphs [options]".
If you have trouble installing the required modules or want to see an output
example report, upload the graph data file at:
http://edwards.sdsu.edu/prinseq -> Choose "Get Report"
The web version runs the lite and graphs version in the backend. Therefore, all
dependencies for those apply to the web version.
Additional Perl modules required are:
1. Copy the files in the html directory to your html directory
2. Copy the files in the cgi-bin directory to your cgi-bin directory
3. Change the forwarding URL in the index.html file in the html directory.
4. Adjust the config file (prinseqConfig.pm) in lines 8-31 if necessary.
5. Example data is commented out. If you want to show example data, process data
using the web version, put the example Data IDs in the config file and
uncomment the code in the 'access' function to show the example data in the
6. Change the PHP file size limit for uploads in the php.ini file. Depending on
your system, open the file with one of:
sudo vim /etc/php5/apache2/php.ini
sudo vim /etc/php.ini
Change the following lines based on the files you are expecting:
max_execution_time = 6000 ; Maximum execution time of each script, in seconds
max_input_time = 6000 ; Maximum time each script may spend parsing data
memory_limit = 512M ; Maximum amount of memory a script may consume
post_max_size = 9999M
upload_max_filesize = 9999M
Then restart Apache with whichever of these commands applies to your system:
sudo /usr/sbin/apachectl restart
sudo /etc/init.d/apache2 restart
7. Use Firebug (http://getfirebug.com/) or similar to find any other issues such
as links and change them according to your system setup.
8. Set up a cron job (or similar) to clean the output directory and store the data
for the stats in the archive directory.
By default, the image files are written to a directory from which Apache is
configured to execute files (cgi-bin), so Apache tries to execute the image
files. Search for a file called "httpd.conf" on your web server. In that file,
you have to specify that images should not be executed. Here is a copy of the
relevant part of my config file. The important line starts with "AddHandler";
files with any of the extensions listed there will not be executed.
ScriptAlias /cgi-bin/ /var/www/cgi-bin/
Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
AddHandler default-handler .html .jpg .png .gif .zip .gz .fasta .qual .fastq .txt
Allow from all
If you find a bug please email me at <rschmieder_at_gmail_dot_com> so that I can
make PRINSEQ better.
If you want to receive emails for new releases and updates, please sign up for
the mailing list at https://lists.sourceforge.net/lists/listinfo/prinseq-news
COPYRIGHT AND LICENSE
Copyright (C) 2010-2013 Robert SCHMIEDER
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program. If not, see <http://www.gnu.org/licenses/>.
Certain updates apply only to the web version, but these should be easy to
identify since they are most likely modifications of the interface.
Fixed issue of incorrect duplicate counts when a sequence is both an exact
duplicate and reverse complement exact duplicate of another sequence.
Added support for STDOUT output to paired-end processing.
Release of web version files to run the web version on a local machine.
Fixed issue with FASTA inputs that caused the program to exit.
Fixed deprecated use of 'defined' on aggregates. Added options "trim_left_p" and
"trim_right_p" to trim reads by a percentage value in addition to options that
trim by number of nucleotides. Added option "stats_assembly" to report N50, N90,
etc. contig sizes in the standalone version's summary statistics output. Added
support for paired-end data (new options "fasta2" and "fastq2").
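For readers unfamiliar with the N50 and N90 values mentioned for "stats_assembly",
here is a minimal Python sketch of the usual Nx definition. The function name
nx_size and the example lengths are hypothetical, and this is not taken from the
PRINSEQ source, which may differ in details.

```python
def nx_size(contig_lengths, fraction):
    """Return the Nx contig size (hypothetical helper, not PRINSEQ code).

    Contigs are sorted longest first; the result is the length of the contig
    at which the running total first reaches `fraction` of all bases.
    """
    if not contig_lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(contig_lengths) * fraction
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length


# Made-up contig lengths for illustration only (290 bases in total).
lengths = [80, 70, 50, 40, 30, 20]
n50 = nx_size(lengths, 0.5)  # 80 + 70 = 150 reaches 50% of 290
n90 = nx_size(lengths, 0.9)  # 80 + 70 + 50 + 40 + 30 = 270 reaches 90% of 290
print(n50, n90)
```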
prinseq-graphs-0.6 / prinseq-web-0.20:
Added support for paired-end data.
Fixed issue of incorrect quality trimming with arguments "min" and "max" for
prinseq-lite-0.19.4 / prinseq-graphs-0.5.1:
Fixed issues related to the use of qw() in loops for Perl version 5.14+ (thanks
to Evan Staton for pointing out the issue and providing the link with details:
#Use_of_qw%28...%29_as_parentheses ). Fixed issue with 5'/3' duplicate removal
that forced option -exact_only (thanks to Stephanie Pierson for reporting the
issue). Fixed issue with missing duplicate statistics in graph data output if
-derep or -graph_stats was not specified. Suppressed output of PCA module when
generating PCA plots.
Added new output file option to keep track of sequence identifier renaming
(option -seq_id_mappings). Fixed trim_qual_rule parameter listed twice in the
log file. Fixed issue with sequences of length 3bp when calculating DUST scores.
Fixed issue with exact_only parameter check.
Increased memory efficiency for graph data calculation on big input files.
Fixed rounding issue in sequence complexity calculation.
prinseq-lite-0.19 / prinseq-graphs-0.5:
Added check for counts of filtered sequence to report when no sequences were
filtered. Optimized dinucleotide calculation (~400% faster), sequence complexity
calculation (~80% faster) and quality filtering and trimming (~90% faster when
both filtering and trimming). Added option (-graph_stats) to select what
statistics should be calculated and included in the graph_data file (useful if
you e.g. do not need sequence complexity information, which requires a lot of
computation). Added binned base quality data to graph data output (as generated
in web version up to 0.17.4). Removed annotations from length distribution graph
if standard deviation is zero.
Fixed phred64 scaling issue for graph data outputs (thanks to Komal Jain for
pointing out the issue).
Fixed typo in selection of graph data elements that resulted in missing quality
prinseq-lite-0.18.1 / prinseq-graphs-0.4.1:
Fixed missing zero count for Ns when generating graphs data file. Fixed
duplication count table output for HTML report.
prinseq-lite-0.18 / prinseq-graphs-0.4:
Added options for web version processing (lite+graphs). Added custom parameter
processing (same as already available in web version). Added option to input
parameters saved in a file (lite). Fixed issue with output to STDOUT for
"-out_format 4" option. Added counts by type for filtered sequences to verbose
output (lite). Updated layout of HTML report to match web version and to use
less colors to reduce printing costs and increase readability (graphs).
Fixed issue with spaces and parentheses in filename when compressing files.
relying on lite and graphs Perl scripts (all further lite and graphs updates
will automatically apply to the web version).
Fixed issue with MID tag output when using the -graph_data option.
Fixed issue with non-exact duplicate removal that caused incorrect out_bad files
(filtered out outputs) introduced in last version.
Fixed issue with non-exact duplicate removal when graph data and data processing
is performed at the same time.
prinseq-lite-0.17.1 / prinseq-graphs-0.3:
Added support for tag sequence check to the HTML output.
prinseq-lite-0.17 / prinseq-graphs-0.2:
Added error message if statistics and graph data are generated at the same time.
Prevented generation of graphs for missing data that might otherwise generate
errors. Prevent the use of -stats outputs when generating graphs data. Added
example data for prinseq-graphs. Fixed issue with filenames containing a
non-alphanumeric character after the period (thanks to Marmaduke for pointing
out the issue). New option -no_qual_header reduces the file size of FASTQ files
by preventing any header information output for the quality data.
New option -derep_min to specify the duplication threshold (e.g. only filter
sequences that occur more than 5 times).
Fixed issue with mean and max quality score rule for trimming and changed trim
"until" to "while" (web only, lite version is not affected).
prinseq-web-0.16 / prinseq-lite-0.16:
Check if sequence qualities are in Phred+64 format, if specified. Added the
reporting of errors during processing of data. Multiple output formats are now
supported (prinseq-lite). Extended the input format from ACGTN to full nucleic
acid ambiguity code (ACGTURYKMSWBDHVNX-). Allow processing of amino acid
sequences (prinseq-lite). Replace option -si13 with -phred64 to specify input
files in Phred+64 format. New options to generate graphs in standalone lite
version (using prinseq-graphs.pl or online form).
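As an illustration of what a Phred+64 check involves, the following Python
sketch decodes a FASTQ quality string with either offset and rejects strings
that cannot be Phred+64. The function name decode_qualities is hypothetical and
the check shown is an assumption about a reasonable approach, not the logic used
by prinseq-lite.

```python
def decode_qualities(quality_string, phred64=False):
    """Convert a FASTQ quality string into a list of integer scores.

    Hypothetical helper: uses ASCII offset 64 when phred64 is True, else 33.
    A negative score means the string contains characters below the offset,
    so it cannot be in the requested encoding.
    """
    offset = 64 if phred64 else 33
    scores = [ord(char) - offset for char in quality_string]
    if any(score < 0 for score in scores):
        raise ValueError("quality string is not valid for offset %d" % offset)
    return scores


print(decode_qualities("hhh@", phred64=True))  # 'h' is ASCII 104, '@' is 64
print(decode_qualities("II5"))                 # 'I' is ASCII 73, '5' is 53
```

Note that a string passing the check is not proof of Phred+64 encoding, since
high-quality Phred+33 data can also consist only of characters above "@".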
prinseq-web-0.15.1 / prinseq-lite-0.15.1:
Fixed problem with dots in directory names (prinseq-lite). Fixed problem with
trimming from left of reads that are shorter than the specified trim length.
Fixed error in calculation of Phred quality scores for Solexa/Illumina 1.3+
prinseq-web-0.15 / prinseq-lite-0.15:
New file input by URL (web version). Corrected typo in regex (missing \ before
s*) and sequence id hash value (was seqi_d instead of seq_id). Added quality
score scaling for Solexa/Illumina 1.3+ data. New option to trim poly-N tails.
New option to read from STDIN and write to STDOUT (lite version). Adjusted graph
labels for datasets with more than 1 million reads (web version).
prinseq-web-0.14.4 / prinseq-lite-0.14.4:
Corrected line break position in output format for QUAL files. Fixed warnings
for quality trimming from the 3'-ends (lite version).
Fixed warnings in tag sequence function.
prinseq-web-0.14.2 / prinseq-lite-0.14.2:
Fixed issue of file format check with non-Unix line breaks causing
misidentification of FASTQ files.
prinseq-web-0.14.1 / prinseq-lite-0.14.1:
Fixed warning when trimming and dereplicating.
prinseq-web-0.14 / prinseq-lite-0.14:
Added status report for writing data after duplicate removal (lite version).
Added number of bases and mean length to output summary statistics in verbose
and log mode (lite version). Modified data processing to allow larger files and
more highly compressed input files that previously caused callback timeout (web
version).
Fixed warning when out_good or out_bad is set to null.
Fixed issue of renaming sequence identifiers when additionally removing read
prinseq-web-0.13 / prinseq-lite-0.13:
Fixed issue with leading spaces in first quality score. Added length check that
ensures that the number of bases matches the number of quality scores. (This
also ensures that each sequence has quality scores, if a QUAL file is provided
prinseq-web-0.12 / prinseq-lite-0.12:
Fixed issue when sequences are 3bp or shorter that caused a division by zero and
incorrect DUST complexity scores. Added -log option to generate a log file with
the used command and basic input/output statistics (lite version). Fixed
renaming issue for duplicate removal (lite version). Fixed issue for sequences
with a single base and a quality score of 0.
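The division by zero for very short sequences follows from the form of a
DUST-style score, which divides by the number of triplets minus one. The Python
sketch below shows a simplified, unscaled version with a guard for that case.
The function name dust_score is hypothetical, and PRINSEQ's own implementation
(windowing, scaling) is not reproduced here.

```python
from collections import Counter


def dust_score(sequence):
    """Simplified DUST-style complexity score (illustrative, not PRINSEQ code).

    Counts overlapping triplets and sums c * (c - 1) / 2 over the triplet
    counts, divided by (number of triplets - 1). Sequences of 3 bp or shorter
    have at most one triplet, so the divisor would be zero or negative; they
    are given a score of 0.0 here instead.
    """
    seq = sequence.upper()
    num_triplets = len(seq) - 2
    if num_triplets <= 1:
        return 0.0
    counts = Counter(seq[i:i + 3] for i in range(num_triplets))
    total = sum(count * (count - 1) / 2 for count in counts.values())
    return total / (num_triplets - 1)


print(dust_score("AAAAAA"))    # low complexity: one triplet repeated 4 times
print(dust_score("ACGTACGT"))  # more diverse triplets give a lower score
print(dust_score("ACG"))       # guarded case: a single triplet
```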
prinseq-web-0.11 / prinseq-lite-0.11:
Improved tag sequence probability estimation with additional check for MID tags
(454 GSMIDs and RLMIDs) and report of MID sequence if found. Visualization for
odds ratios to easily identify over- and under-represented dinucleotides (web
version only). Added table with minimum and maximum complexity values and the
respective sequence to the web version.
prinseq-web-0.10 / prinseq-lite-0.10:
Corrected typos in option description and user interface. Fixed bug when both
output options out_good and out_bad are set to "null" in standalone version.
Added summary statistics calculation for basic information (stats_info), length
(stats_len), dinucleotide odds ratios (stats_dinuc), tag probabilities
(stats_tag), sequence duplicates (stats_dupl), ambiguous base N (stats_ns) and
all together (stats_all) to the standalone version.
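To show what a dinucleotide odds ratio measures, here is a minimal Python sketch
of the basic single-strand form, the observed dinucleotide frequency divided by
the product of the two base frequencies. The function name
dinucleotide_odds_ratios is hypothetical, and the values reported by
"stats_dinuc" may be computed differently (for example, with both strands taken
into account).

```python
from collections import Counter


def dinucleotide_odds_ratios(sequence):
    """Return {dinucleotide: f_xy / (f_x * f_y)} for one sequence.

    Illustrative helper, not PRINSEQ code. A ratio above 1 means the
    dinucleotide occurs more often than expected from the base frequencies,
    a ratio below 1 means it occurs less often.
    """
    seq = sequence.upper()
    num_bases = len(seq)
    num_pairs = num_bases - 1
    base_counts = Counter(seq)
    pair_counts = Counter(seq[i:i + 2] for i in range(num_pairs))
    ratios = {}
    for pair, count in pair_counts.items():
        freq_x = base_counts[pair[0]] / num_bases
        freq_y = base_counts[pair[1]] / num_bases
        ratios[pair] = (count / num_pairs) / (freq_x * freq_y)
    return ratios


for pair, ratio in sorted(dinucleotide_odds_ratios("ACACGTGT").items()):
    print(pair, round(ratio, 3))
```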
prinseq-web-0.9 / prinseq-lite-0.9:
Fixed parameter loading for JSON data. Fixed ID type in sequence complexity
method. Changed order of tail trimming and quality trimming. Fixed 3'-end tail
trimming bug. Extended documentation and verbose print output of lite-version.
Added option to prevent output generation of certain files to lite-version.
Fixed issue of maximum number of sequences in combination with duplicate
prinseq-web-0.9 beta / prinseq-lite-0.8:
Fixed missing quality trimming in trimming of sequence to fixed length. Fixed GC
content range filtering. Changed integer to float for percentage value
filtering. Removed debugging Data::Dumper output. Forced single line output for
Use JSON to manage parameters on server and user side. Add mean and standard
deviations to length and GC content plot to guide choice of minimum and maximum
values. Added example data.
Add dinucleotide odds ratio calculation and PCA plots including several viral
and microbial metagenomes. Add sequence complexity plots and filters using DUST
and entropy methods. Reorganize input counts and input info to merge into single
table. Add tables with counts to most plots. Add percentage for sequence
Use Cairo graphics library to generate graphics. Added parameter management
functionality and pre-defined parameter sets. Separate duplication plot into
separate plot and add reverse complement counts. Add two plots to show
duplication level and number of duplicate counts. Use box-plots for quality
Use ExtJS for web-interface. Change progress bars and other functionality to JS.
Use bi-histograms for duplicate identification in GC content and length
distribution plots. Only show graphs when there are values to plot to reduce
load on user side. Added sequence quality scores plot and filter functionality.
Removed rarely used information shown in "Input stats". Added base frequencies
at sequence ends and tag sequence probability for tag sequence check. Added line
width formatting option for FASTA (and QUAL) output. Use binning for datasets
with long sequences.
Added information to "Reformat Options" field for renaming sequence ids. Remove
spaces, ">" and quotes automatically from sequence ids before renaming. Fixed
problem with saving "0" values instead of default values into parameters file.
Fixed header line keep/remove mismatch. Fixed "division by 0" bug when
calculating 1/length for sequence fractions. Automatically remove space and dash
from sequences when parsing the input data. Add function to convert base U to T.
Fixed length range filter bug. Fixed issue parsing FASTQ files with no
information in '+' header line.
Fixed .qual file linebreak bug.
First release of prinseq web version.