selectseq version 1.4 - 2019-03-15
------------------------------------------
Extracts selected sequences (names stored in a file, one per line) from a sequence file and writes them to a
different file or to STDOUT.
If a name starts with the > character (FASTA headers) or the @ character (FASTQ headers), that character is
removed and the rest of the first word in the line is used as the sequence identifier. Empty lines and lines
starting with # are ignored.
To use a word other than the first one in each line as the identifier, use the -k option.
If you want just one sequence, give its name in place of the list file name: unless a file by that name exists,
the program will try to get just that sequence. You will be warned (unless -q is used) that no file by that name
was found, in case it was a typo or a similar mistake.
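The list-file rules described above can be sketched in Python as follows. This is only an illustration, not the program's own code: the function name read_id_list is hypothetical, and it assumes that words are separated by whitespace, that -k counts columns from 1, and that lines with too few words are skipped.

```python
def read_id_list(lines, column=1):
    """Collect identifiers from a list file, one per line.

    `column` is 1-based and mirrors the -k option (assumption: words
    are whitespace-separated and columns are counted from 1).
    """
    ids = []
    for raw in lines:
        line = raw.strip()
        # Empty lines and lines starting with # are ignored.
        if not line or line.startswith("#"):
            continue
        # Drop a leading FASTA (>) or FASTQ (@) header character.
        if line[0] in ">@":
            line = line[1:]
        words = line.split()
        if len(words) < column:
            continue  # assumption: lines too short for the column are skipped
        ids.append(words[column - 1])
    return ids


example = [
    ">seq1 some description",
    "@read7 extra text",
    "",
    "# a comment",
    "geneA\tprotA",
]
print(read_id_list(example))            # ['seq1', 'read7', 'geneA']
print(read_id_list(example, column=2))  # ['some', 'extra', 'protA']
```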
Examples:
selectseq -s sequences_file -o output_file -l selected_seqs [-c] [-e string] [-f fastq]
selectseq -n 10 < sequences_file > output_file
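As a rough picture of what the second example does, the sketch below picks the N-th record out of a FASTA stream. It is not the program's implementation: the function name nth_fasta_record is hypothetical, and it assumes that records are counted from 1 and that only FASTA input is handled.

```python
import io


def nth_fasta_record(handle, n):
    """Return the n-th FASTA record (header plus sequence lines), or None.

    Assumption: records are counted from 1, in file order.
    """
    if n < 1:
        return None
    count = 0
    record = []
    for line in handle:
        if line.startswith(">"):
            if count == n:
                break  # the wanted record is complete
            count += 1
        if count == n:
            record.append(line)
    return "".join(record) if record else None


fasta = io.StringIO(">a\nACGT\n>b\nGG\nTT\n>c\nAAAA\n")
print(nth_fasta_record(fasta, 2), end="")
```

Passing sys.stdin instead of the io.StringIO object would mimic reading the sequences from STDIN, which is the default source when -s is not given.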
Options:
-l File name of list of sequences to select -- if there is no file with this name,
this will be interpreted as a single sequence name (mandatory unless using -n N);
-s File with all sequences; compressed files, as supported by zcat, allowed (default: STDIN);
-f Format for file with all sequences, one of FASTA or FASTQ, case insensitive (default: FASTA);
-o Output file to store selected sequences (default: STDOUT);
-n N Get only sequence number N (integer) from file;
-c Complement mode: gets only sequences NOT present in the list file;
-k Which column in the sequence list file to use as the identifier (default: 1);
-e Add ending to IDs (e.g. "_1") before searching -- useful when extracting protein sequences but using
gene identifiers, and protein IDs differ by a suffix (e.g. as created by transeq);
-m N Use "matching" mode for sequence IDs, i.e. the whole ID does not need to match, but only one of the
parts between "|" characters -- useful for when one has a set of NCBI sequences but a GI list, for
example. N is an integer number selecting which part of the ID to match (default: no partial matching).
This option is ignored if the sequence format is FASTQ;
-q Quiet mode, do not print warnings to screen, only errors (default: not quiet);
-d Print debug information (default: no);
-V Verbose error messages listing all identifiers for sequences not found (default: no);
-h Display this help message;
-v Display program version.
Copyright J.M.P. Alves 2003-2019 (alvesjmp@yahoo.com)
This software is licensed under the GNU General Public License v. 3.
Please see http://www.fsf.org/licensing/licenses/gpl.html for details.