Name                 Modified    Size     Downloads / Week
selectseq            2019-03-19  7.2 kB
README               2019-03-19  2.7 kB
selectseq_old_1.3.6  2016-07-08  5.4 kB
selectseq_old_1.3.1  2016-05-03  4.7 kB
selectseq_old_1.1a   2015-01-30  5.1 kB
selectseq_old        2011-05-20  4.2 kB
Totals: 6 Items                  29.3 kB  0
selectseq version 1.4 -  2019-03-15
------------------------------------------
Program to get certain sequences (names stored in a file, one per line) from a sequence file and send them to a
different file or STDOUT. If a name starts with the > character (from FASTA headers) or the @ character (FASTQ
headers), that character is removed and the rest of the first word in the line is used as the sequence identifier.
Empty lines and lines starting with # are ignored.

Instead of the first word in each line, any other word can be used as the identifier by giving its position with the -k option.
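
The list-file rules above can be illustrated with a minimal Python sketch. This is not the program's own code: the function name read_id_list is hypothetical, "words" are assumed to be separated by whitespace, the leading > or @ is assumed to be stripped from whichever word -k selects, and lines with too few words are assumed to be skipped.

```python
def read_id_list(lines, column=1):
    """Return sequence identifiers from the lines of a list file.

    column is 1-based, mirroring the documented default of 1 for -k.
    """
    ids = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comment lines are ignored
        words = line.split()  # assumption: words are whitespace-separated
        if column > len(words):
            continue  # assumption: lines with too few words are skipped
        word = words[column - 1]
        if word[0] in ">@":
            word = word[1:]  # drop the FASTA/FASTQ header marker
        if word:
            ids.append(word)
    return ids
```

For instance, read_id_list([">seq1 some description", "# note", "seq3"]) gives ["seq1", "seq3"].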

If you want just one sequence, give its name in place of the list file name: unless there IS a file by that
name, the program will try to get just the given sequence. You will be warned (unless -q is used) that no file
by that name was found, in case it was a typo or some similar mistake.
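
The fallback from list file to single sequence name can be pictured roughly as follows. This is a sketch only: the function name resolve_selection and the warning text are made up for illustration, and the stripping of > and @ markers is left out to keep it short.

```python
import os
import sys


def resolve_selection(list_arg, quiet=False):
    """Return the names to select, given the value passed to -l."""
    if os.path.isfile(list_arg):
        names = []
        with open(list_arg) as handle:
            for line in handle:
                line = line.strip()
                if line and not line.startswith("#"):
                    names.append(line.split()[0])  # first word of the line
        return names
    # No such file: treat the argument itself as one sequence name.
    if not quiet:
        print(
            f"Warning: no file named '{list_arg}', using it as a sequence name",
            file=sys.stderr,
        )
    return [list_arg]
```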

Examples: selectseq -s sequences_file -o output_file -l selected_seqs [-c] [-e string] [-f fastq]
          selectseq -n 10 < sequences_file > output_file

Options:
-l 	 File name of list of sequences to select -- if there is no file with this name,
   	 this will be interpreted as a single sequence name (mandatory unless using -n N);
-s 	 File with all sequences; compressed files, as supported by zcat, allowed (default: STDIN);
-f 	 Format for file with all sequences, one of FASTA or FASTQ, case insensitive (default: FASTA);
-o 	 Output file to store selected sequences (default: STDOUT);
-n N	 Get only sequence number N (integer) from file;
-c 	 Complement mode: gets only sequences NOT present in the list file;
-k 	 Which column in the sequence list file to use as the identifier (default: 1);
-e 	 Add ending to IDs (e.g. "_1") before searching -- useful when extracting protein sequences but using 
   	 gene identifiers, and protein IDs differ by a suffix (e.g. as created by transeq);
-m N	 Use "matching" mode for sequence IDs, i.e. the whole ID does not need to match, but only one of the
   	 parts between "|" characters -- useful for when one has a set of NCBI sequences but a GI list, for
   	 example. N is an integer number selecting which part of the ID to match (default: no partial matching).
   	 This option is ignored if the sequence format is FASTQ;
-q 	 Quiet mode, do not print warnings to screen, only errors (default: not quiet);
-d 	 Print debug information (default: no);
-V 	 Verbose error messages listing all identifiers for sequences not found (default: no);
-h 	 Display this help message;
-v 	 Display program version.
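
To make the -c, -e and -m options more concrete, here is a small Python sketch of how such a selection could work on FASTA input. It is an illustration under assumptions, not the program's implementation: select_fasta and its parameter names are hypothetical, the part number for -m is assumed to be 1-based, and the -e ending is assumed to be applied to the list names in every mode.

```python
def select_fasta(lines, wanted, ending="", part=None, complement=False):
    """Yield the FASTA lines of the selected records.

    wanted     -- names from the list file (-l)
    ending     -- suffix added to each name before searching (-e)
    part       -- which "|"-separated part of the ID to match (-m), or None
    complement -- keep only records NOT in the list (-c)
    """
    targets = {name + ending for name in wanted}
    keep = False
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            seq_id = words[0] if words else ""
            if part is None:
                found = seq_id in targets  # whole ID must match
            else:
                pieces = seq_id.split("|")
                # assumption: parts are counted from 1
                found = 1 <= part <= len(pieces) and pieces[part - 1] in targets
            keep = found != complement  # -c inverts the decision
        if keep:
            yield line
```

With a header such as ">gi|123|ref|NP_1| protein one", calling select_fasta(lines, ["123"], part=2) would keep that record, since "123" is the second part of the ID.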

Copyright J.M.P. Alves 2003-2019 (alvesjmp@yahoo.com)
This software is licensed under the GNU General Public License v. 3.
Please see http://www.fsf.org/licensing/licenses/gpl.html for details.
Source: README, updated 2019-03-19