*** mfsizes v. 1.8.7 - 2017-11-22 - by J
Program that reports, for a FASTA file, the size of each sequence, the number of Ns per
sequence, the total and weighted average size for the whole set of sequences, the GC% of
each sequence (Met proportion and total, if protein), and the N50, N90, and N95 (with
their L numbers, i.e. the contig at which each value was reached). Optionally, it
calculates the GC content for each codon position separately.
Input file can be compressed; mfsizes looks for the .gz extension to detect that.
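As an illustration of the N50/N90/N95 and L numbers mentioned above, here is a minimal Python sketch. It is not taken from the mfsizes source: the function name nx_lx and the choice to stop at the first contig whose cumulative size reaches or exceeds the given percentage of the total are assumptions, following the usual definition of these statistics.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for a list of sequence lengths.

    percent is 50 for N50, 90 for N90, 95 for N95.
    Nx is the length of the contig at which the running total,
    taken over contigs sorted from largest to smallest, first
    reaches percent% of the summed length; Lx is that contig's rank.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, rank
    return 0, 0  # no sequences at all


if __name__ == "__main__":
    sizes = [100, 80, 60, 40, 20]  # hypothetical contig sizes
    for percent in (50, 90, 95):
        n, l = nx_lx(sizes, percent)
        print(f"N{percent} = {n}, L{percent} = {l}")
```

With the hypothetical sizes above (300 in total), the running totals are 100, 180, 240, 280 and 300, so the 50% mark is reached at the second contig and the 90% mark at the fourth.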
* Normal usage: mfsizes -i filename [-o outfile] [-p] [-c] [-q] [-e] [-n]
* Options:
-i Input FASTA file, mandatory;
-o Output file (optional, use STDOUT for standard output);
-p Input sequences are protein (default: DNA);
-c Calculate GC separately for each codon position (default: no);
-e Ignore "empty" sequences (default: no);
-q Quiet output, see below (default: detailed output);
-n Do NOT sort sequences by descending size (default: sort);
-v Prints program version and exits;
-h Prints this help message and exits.
- Order of options is irrelevant, and switches can be bundled (e.g. -cq);
- If the output file is not specified (-o option), the program will generate an
outfile with the same name as the input file, but with the .sizes extension added. If
you want output to go to standard output, use STDOUT as the value of this option;
- If your sequences are protein, use the -p switch, otherwise meaningless results will
be produced;
- For "quiet" output (no general statistics, no comments, just each sequence's ID,
size and GC or Met content), use the -q switch;
- Use -e to ignore "empty" sequences (definition lines after which there is no
sequence) in the output -- these sequences are never used to calculate the averages,
even if not ignored, so it is safe to use the default behavior;
- The -c switch turns on the separate calculation of GC for each codon position (output
will also have three GC compositions for each sequence, one for each codon position).
It is ignored if you use -p, and might give meaningless results (as well as some
warnings or even errors) if sequences are not protein coding;
- The program will list each sequence's name followed by its size and percent GC
(or %Met and total number of Met). At the end of the file, it reports the average size,
standard deviation and weighted average GC (or Met).
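
The per-sequence GC, the codon-position GC produced by -c, and the summary statistics described in the notes above can be sketched in Python as follows. This is an illustration under stated assumptions, not the mfsizes implementation: the function names are hypothetical, Ns are assumed to be excluded from the GC denominator, the standard deviation is assumed to be the sample standard deviation, and the weighted average GC is assumed to be weighted by sequence length.

```python
import statistics


def gc_percent(seq):
    """Percent G+C among unambiguous bases (A, C, G, T).

    Assumption: Ns and other ambiguity codes are left out of the denominator.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def codon_position_gc(seq):
    """GC percent at codon positions 1, 2 and 3.

    Assumption: the sequence is protein coding and starts in frame.
    """
    return tuple(gc_percent(seq[offset::3]) for offset in range(3))


def summarize(records):
    """Return (average size, standard deviation, weighted average GC).

    records maps sequence IDs to sequences. Empty sequences are never
    counted, matching the note about "empty" sequences above.
    """
    seqs = [s for s in records.values() if s]
    if not seqs:
        return 0.0, 0.0, 0.0
    sizes = [len(s) for s in seqs]
    mean_size = statistics.mean(sizes)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd_size = statistics.stdev(sizes) if len(sizes) > 1 else 0.0
    # Pooling all bases weights each sequence's GC by its length.
    weighted_gc = gc_percent("".join(seqs))
    return mean_size, sd_size, weighted_gc


if __name__ == "__main__":
    records = {"seq1": "ATGCGC", "seq2": "ATAT", "empty": ""}  # hypothetical input
    for name, seq in records.items():
        print(name, len(seq), round(gc_percent(seq), 2))
    print(codon_position_gc(records["seq1"]))
    print(summarize(records))
```

Pooling the bases rather than averaging the per-sequence percentages keeps short sequences from dominating the overall GC figure, which is the usual reason for reporting a weighted average.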
Copyright J.M.P. Alves 2003-2017 (alvesjmp@yahoo.com)
This software is licensed under the GNU General Public License v. 3.
Please see http://www.fsf.org/licensing/licenses/gpl.html for details.