Name | Modified | Size | Downloads / Week |
---|---|---|---|
archive | 2014-05-08 | ||
sp2fasta_ansic-20140511.tar.gz | 2014-05-11 | 19.5 kB | |
README.sp2fasta | 2014-05-11 | 3.6 kB | |
sp2fasta_perl-20140509.tar.gz | 2014-05-09 | 10.9 kB | |
sp2fasta_java-20140509-sources.jar | 2014-05-09 | 7.1 kB | |
sp2fasta_java-20140509.jar | 2014-05-09 | 5.9 kB | |
Totals: 6 Items | | 47.0 kB | 0 |
Convert SWISSPROT / EMBL format sequence into fasta format
==========================================================

Implementations
---------------

* ANSI C: sp2fasta_ansic-<version>.tar.gz
  Suitable for any platform with an ANSI C compiler (e.g. cc). Development
  and testing have mainly been performed on Linux with gcc.

* Java: sp2fasta_java-<version>.jar
  An executable jar for use with Java. Compiled for Java 1.5; for earlier
  versions of the Java specification you will need to recompile from source.

* Perl: sp2fasta_perl-<version>.tar.gz
  An implementation for Perl 5 environments.

Usage
-----

Convert UniProtKB, SWISS-PROT, EMBL-Bank or EMBL-CDS formatted sequence into
fasta sequence format.

Usage:
  sp2fasta -h
  sp2fasta -V
  sp2fasta [-c case] [-g] [-l dbPrefix] [-s] [-u] [dataFileName ...]

  -h  This message.
  -V  Version information.
  -c  Specify the character case of the output sequence: original (o),
      lower (l) or upper (u). Default: o
  -g  Ignored. Accepted for compatibility with WU-BLAST sp2fasta.
  -l  Specify database label. Default: 'emb' for nucleotide, 'sp' for
      protein, 'tr' for "Unreviewed" protein, 'sp' for unidentified
      sequence data.
  -s  Simple fasta headers, entry 'ID' and description only.
  -u  UniProtKB style fasta headers, appends 'OS', 'GN', 'PE' and 'SV'
      data to description.

Input is read from STDIN unless file names are specified. To explicitly
specify STDIN as the input, use '-' as a file name.

Example Usage
-------------

A. Swiss-Prot

Converting SWISS-PROT 45 into the standard sp2fasta fasta format:

  > sp2fasta sprot45.dat > sprot45

This gives headers like:

  >sp|P15711|104K_THEPA 104 kDa microneme-rhoptry antigen.

The default database prefix is 'sp' for amino-acid entries. The database
prefix can be specified using the -l option.

B. UniProtKB

With UniProtKB release 14.0 the format of the DE lines changed to be more
structured:

  ID   104K_THEPA              Reviewed;         924 AA.
  AC   P15711; Q4N2B5;
  ...
  DE   RecName: Full=104 kDa microneme/rhoptry antigen;
  DE   AltName: Full=p104;
  DE   Flags: Precursor;
  ...

Using the default options this gives structured descriptions in the fasta
format headers:

  >sp|P15711|104K_THEPA RecName: Full=104 kDa microneme/rhoptry antigen; AltName: Full=p104; Flags: Precursor;

In some cases these descriptions can be very long and difficult to read.
The -u option trims the description to just the primary description (i.e.
the first one) and simulates the UniProtKB fasta format headers:

  > sp2fasta -u uniprot_sprot.dat > uniprot_sprot.fasta

This gives headers like:

  >sp|P15711|104K_THEPA 104 kDa microneme/rhoptry antigen OS=Theileria parva GN=TP04_0437 PE=2 SV=1

C. EMBL-Bank

Converting an EMBL-Bank data file into the standard sp2fasta fasta format:

  > sp2fasta pln01.dat > pln01

This gives headers like:

  >emb|AB000093|AB000093 Arabidopsis thaliana gene for inorganic phosphate...

The default database prefix is 'emb' for nucleotide entries. The database
prefix can be specified using the -l option.

D. Simple Format

To generate fasta formatted output with simple headers use the -s option:

  > sp2fasta -s sprot45.dat > sprot45

This gives headers like:

  >104K_THEPA 104 kDa microneme-rhoptry antigen.

E. Database Prefix

To specify an alternative database name to use in the output use the -l
option:

  > sp2fasta -l swissprot sprot45.dat > sprot45

This gives headers like:

  >swissprot|P15711|104K_THEPA 104 kDa microneme-rhoptry antigen.

Note: if a label is specified it will be used for all input sequences
regardless of their sequence type.
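
The header styles shown under Example Usage (default, -s and -l) and the -c
case option can be illustrated with a short Python sketch. This is not the
code of any of the sp2fasta implementations. The function names
(parse_entry, fasta_header, apply_case) and the sample entry are made up for
illustration, only protein entries with unstructured DE lines are handled,
and the -u header style is not covered.

```python
# Illustrative sketch only; not taken from the sp2fasta sources.
# The entry below is invented for the example (not a real database record),
# and its SQ line is abbreviated.
SAMPLE = "\n".join([
    "ID   DEMO_ENTRY              Reviewed;          20 AA.",
    "AC   X00000; X00001;",
    "DE   Example protein.",
    "SQ   SEQUENCE   20 AA;",
    "     MKTAYIAKQR QISFVKSHFS",
    "//",
])


def parse_entry(text):
    """Collect the fields needed for a fasta record from one flat-file entry."""
    entry_id, accession, description, sequence = "", "", [], []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines are space-separated blocks of residues.
            sequence.append("".join(line.split()))
        elif line.startswith("ID"):
            entry_id = line.split()[1]
        elif line.startswith("AC") and not accession:
            # Keep only the first (primary) accession number.
            accession = line[2:].split(";")[0].strip()
        elif line.startswith("DE"):
            description.append(line[2:].strip())
        elif line.startswith("SQ"):
            in_sequence = True
    return {
        "id": entry_id,
        "ac": accession,
        "de": " ".join(description),
        "seq": "".join(sequence),
    }


def fasta_header(entry, label="sp", simple=False):
    """Build a header: 'label|AC|ID description', or 'ID description' if simple."""
    if simple:
        return ">%s %s" % (entry["id"], entry["de"])
    return ">%s|%s|%s %s" % (label, entry["ac"], entry["id"], entry["de"])


def apply_case(seq, case="o"):
    """Mimic the case choices: original (o), lower (l) or upper (u)."""
    if case == "l":
        return seq.lower()
    if case == "u":
        return seq.upper()
    return seq


if __name__ == "__main__":
    entry = parse_entry(SAMPLE)
    print(fasta_header(entry))                     # default style
    print(fasta_header(entry, simple=True))        # like -s
    print(fasta_header(entry, label="swissprot"))  # like -l swissprot
    print(apply_case(entry["seq"], "l"))           # like -c l
```

In this sketch the label is a plain argument, so a caller-supplied label
applies to every entry, which matches the note under Database Prefix above.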