Name | Modified | Size | Downloads / Week |
---|---|---|---|
selectseq | 2019-03-19 | 7.2 kB | |
README | 2019-03-19 | 2.7 kB | |
selectseq_old_1.3.6 | 2016-07-08 | 5.4 kB | |
selectseq_old_1.3.1 | 2016-05-03 | 4.7 kB | |
selectseq_old_1.1a | 2015-01-30 | 5.1 kB | |
selectseq_old | 2011-05-20 | 4.2 kB | |
Totals: 6 Items | | 29.3 kB | 0 |
selectseq version 1.4 - 2019-03-15
------------------------------------------

Program to get certain sequences (names stored in a file, one per line) and
send them to a different file or STDOUT.

If names start with the > character (from FASTA headers) or the @ character
(FASTQ headers), that character is removed and the rest of the first word in
the line is used as the sequence identifier. Empty lines and lines starting
with # are ignored. A word other than the first one in each line can be used
as the identifier by means of the -k option.

If you want just one sequence, give its name in place of the list file name.
Unless there IS a file by that name, the program will try to get just the
given sequence. You will be warned (unless -q is used) that no file by that
name was found, just in case it was a typo or some other such mistake.

Examples:

    selectseq -s sequences_file -o output_file -l selected_seqs [-c] [-e string] [-f fastq]
    selectseq -n 10 < sequences_file > output_file

Options:

    -l      File name of list of sequences to select -- if there is no file
            with this name, this will be interpreted as a single sequence
            name (mandatory unless using -n N);
    -s      File with all sequences; compressed files, as supported by zcat,
            allowed (default: STDIN);
    -f      Format for file with all sequences, one of FASTA or FASTQ, case
            insensitive (default: FASTA);
    -o      Output file to store selected sequences (default: STDOUT);
    -n N    Get only sequence number N (integer) from file;
    -c      Complement mode: gets only sequences NOT present in the list file;
    -k      Which column in the sequence list file to use as the identifier
            (default: 1);
    -e      Add ending to IDs (e.g. "_1") before searching -- useful when
            extracting protein sequences but using gene identifiers, and
            protein IDs differ by a suffix (e.g. as created by transeq);
    -m N    Use "matching" mode for sequence IDs, i.e. the whole ID does not
            need to match, but only one of the parts between "|" characters
            -- useful for when one has a set of NCBI sequences but a GI list,
            for example. N is an integer number selecting which part of the
            ID to match (default: no partial matching). This option is
            ignored if sequence format is FASTQ;
    -q      Quiet mode, do not print warnings to screen, only errors
            (default: not quiet);
    -d      Print debug information (default: no);
    -V      Verbose error messages listing all identifiers for sequences not
            found (default: no);
    -h      Display this help message;
    -v      Display program version.

Copyright J.M.P. Alves 2003-2019 (alvesjmp@yahoo.com)

This software is licensed under the GNU General Public License v. 3.
Please see http://www.fsf.org/licensing/licenses/gpl.html for details.
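
The rules above for turning list-file lines into identifiers (marker removal, comment skipping, -k column choice, -e ending) and for -m partial matching can be illustrated with a short Python sketch. This is not selectseq's own code: the function names `read_ids` and `id_part` are hypothetical, and several details are assumptions, namely that the > or @ marker is stripped before the line is split into words, that lines with too few columns are skipped, and that the N given to -m counts parts from 1.

```python
from typing import Iterable, Optional, Set


def read_ids(lines: Iterable[str], column: int = 1, ending: str = "") -> Set[str]:
    """Collect sequence identifiers from list-file lines (illustrative only)."""
    ids = set()
    for line in lines:
        line = line.strip()
        # Empty lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        # A leading '>' (FASTA) or '@' (FASTQ) is removed.
        # Assumption: this happens before the line is split into words.
        if line[0] in ">@":
            line = line[1:]
        words = line.split()
        # Assumption: lines with fewer words than the chosen column are skipped.
        if column > len(words):
            continue
        # 'column' plays the role of -k, 'ending' the role of -e.
        ids.add(words[column - 1] + ending)
    return ids


def id_part(seq_id: str, part: int) -> Optional[str]:
    """Return one '|'-separated part of an ID, as in -m N (assumed 1-based)."""
    fields = seq_id.split("|")
    return fields[part - 1] if 1 <= part <= len(fields) else None
```

Under these assumptions, `read_ids([">seq1 description", "# note", "geneA"], ending="_1")` yields the set containing `seq1_1` and `geneA_1`, and `id_part("gi|12345|ref|NP_000001.1|", 2)` yields `12345`, the kind of GI number a GI list would contain.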