FASTA conversion to EXP

2011-02-25
2013-04-18
  • I just downloaded the Staden Package for Windows and attempted to load a fasta file containing many sequences into Pregap4.

    I clicked the Add Files button, browsed until I found my fasta file, unchecked everything except Initialise experiment files under Configure Modules, and clicked Run.

    I get the following error:

    - Report Production -
    Passed files:

    Failed files:
        {C:/Documents and Settings/melendrezmc/Desktop/KS07064.fa} (UNK) 'init: Unknown file type'

                           ***   Processing finished   ***

    I'd like to assemble the sequences in fasta files that I obtained from 454 sequencing. How do I convert fasta to exp? I've tried different file extensions (.fna, .fas, .fa, .fasta, .txt); none of them work.

    Thank you.

    Melanie (melendrezmc@afrims.org)

     
  • James Bonfield
    2011-02-25

    Realistically the assembly algorithms in Gap4 just aren't appropriate for NGS data, and that includes 454. While they can work (and someone even claimed they did a great job, but I suspect they had no repeats), the old notion of one file per sequence is ridiculous with modern data sets.

    That said, I'm not sure what failed for you. Pregap4 does have code to split fasta files into experiment files in the init_exp module. I haven't tested this for years as my main focus is gap5 now, but I just tried on a Linux box with a couple of manually made entries in "test.fasta" and it split the files up just fine. I can't test it on Windows right now though, as my Windows system isn't my work machine.

    The error you are seeing implies it hasn't worked out that the data is in fasta format, although I'm not sure why. Could you test it with a small manually entered snippet of just 1 or 2 entries first? If that works, then it's something in your data that is causing it to fail (but not necessarily a fault of the data). If it's data specific then I'd be glad to take a look at the data and try to fix the reading code.
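
When a small hand-made file is being tested, it can also help to look at the raw bytes of the file before blaming the reader. The following is only a rough Python sketch and is not part of the Staden Package: `inspect_fasta` is a hypothetical helper name, and the things it reports (a UTF-8 byte-order mark, the mix of line endings, whether the first non-blank line starts with ">") are guesses at what might confuse a file-type detector, not a confirmed cause of the "Unknown file type" error.

```python
def inspect_fasta(data: bytes) -> dict:
    """Report a few byte-level facts about a FASTA file (hypothetical helper)."""
    # A UTF-8 byte-order mark is sometimes added by Windows text editors.
    has_bom = data.startswith(b"\xef\xbb\xbf")
    if has_bom:
        data = data[3:]

    # Count each line-ending style separately.
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf

    # bytes.splitlines() splits on \n, \r and \r\n alike.
    non_blank = [ln.strip() for ln in data.splitlines() if ln.strip()]

    return {
        "has_bom": has_bom,
        "line_endings": {"CRLF": crlf, "LF": lf, "CR": cr},
        "starts_with_header": bool(non_blank) and non_blank[0].startswith(b">"),
        "records": sum(1 for ln in non_blank if ln.startswith(b">")),
    }


# Made-up sample with Windows line endings; for a real file, read it with
# open(path, "rb").read() and pass the bytes in.
sample = b">ABC\r\nAGCTGTCGAATGCCGTA\r\n>DEF\r\nAGCCCTTTGAGGAGTAG\r\n"
print(inspect_fasta(sample))
```

If the report shows nothing unusual, the file itself is probably not the problem and the failure lies elsewhere.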

     
  • Hello Me Again…thanks for the response. I did as you suggested and created a fastatest.fasta file (literally just a text file) with the following 3 sequences in fasta format:

    >ABC
    AGCTGTCGAATGCCGTA
    >DEF
    AGCCCTTTGAGGAGTAG
    >GHI
    AGGTTCCGGTAATGCAG

    The program gave me the same error no matter which extension I gave the file (.txt, .fas, .fa, .fasta, .fna…same problem as before). Unfortunately I cannot obtain the .abi files for the sequences; all they gave me was a CD with thousands to tens of thousands of reads per patient (viral sequences), and they are .fna files (fasta files with more than one sequence in them), similar to my example above.

    The reads assemble in the program Reference Mapper; however, our group doesn't have a license for that (nor can we afford it), and I was asked to try to find another package, so I found Staden as it's freeware available on Windows. My problem is basic but frustrating; perhaps this is a Windows problem? Do I need to find a way to run Staden (Pregap4) on Linux?

    Am I not clicking the right upload buttons?
    I opened Pregap4, clicked Add Files, and browsed to my file; in the window it shows up as a path (C:/Documents and Settings/melendrezmc/Desktop/fastatest.fasta).
    I then clicked over to Configure Modules and unchecked everything except Initialise experiment files, because I don't have quality data or anything else to add to the .fna data I was given.
    I clicked Run; that's when I get the error.

    Should I be checking any other modules? Does the .fasta file need to be in the same folder as Staden, or in some other folder that came with Staden? I tried placing it in the same folder as pregap4 but that didn't work either.

    Thanks again!
    Melanie

     
  • Ehsan
    2011-06-23

    Hi

    I am trying to convert a fasta file of 454 reads into a gap5 database using the latest release of Staden, with tg_index v. 1.2.11.
    I use the following command to convert: tg_index -o database_db 454AllContigs.fna
    In the end, all I need is to get gap5 to read the file.
    The program seems to run but no database is created. Does anyone know what's going on? I am reluctant to use pregap4 because of the large number of reads.

    Thanks a lot

    Ehsan

    Below is the output I have on the screen.

    === g idx_hash ===
    Nbuckets  = 131072
    Nused     = 0
    Avg chain = 0.000000
    Chain var.= 0.000000
    %age full = 0.000000
    max len   = 0
    cache_size= 131072
    N.cached  = 0
    N.locked  = 0
    Searches  = 0
    Cache hits= 0 (  -nan)%
    Chain  0   = 131072

    === g idx_hash ===
    Nbuckets  = 131072
    Nused     = 0
    Avg chain = 0.000000
    Chain var.= 0.000000
    %age full = 0.000000
    max len   = 0
    cache_size= 131072
    N.cached  = 0
    N.locked  = 0
    Searches  = 0
    Cache hits= 0 (  -nan)%
    Chain  0   = 131072
    gio_open: Success

     
  • Hello Ehsan,

    I'm not sure tg_index is going to do what you want it to do.  Fasta files do not have any alignment data, all you will end up with is one read per contig (and with a great many contigs).  If you want to import reads from 454 you can use the ace file that newbler produces (though you won't get any quality values).
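
To get a feel for what "one read per contig" would mean for a given file, it is enough to count the records in it. The sketch below is a minimal pure-Python FASTA reader written for illustration only; `read_fasta` is a hypothetical name, the sample records are made up, and the code says nothing about how tg_index itself parses FASTA.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # The record name is the first word after ">".
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)


# Made-up example records; for a real file, pass an open text file instead.
example = [">ABC", "AGCTGTCG", "AATGCCGTA", ">DEF", "AGCCCTTTGAGGAGTAG"]
records = list(read_fasta(example))
print(len(records), "records, so roughly that many single-read contigs")
```

For a file with tens of thousands of reads, that count is the number of unaligned contigs a direct FASTA import would be expected to leave behind.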

    Andrew

     
  • Ehsan
    2011-06-24

    Thanks Andrew, but then how could I use my 454 reads with gap5? I do not have access to ace files.
    The only thing I have is a bunch of contigs in a fasta file.
    I know pregap4 can read fasta, but because of the size of my file I guess it would not be advisable.
    Any suggestions?
    Thanks a lot

     
  • Hi Ehsan,

    The programs gap4 and gap5 are assembly editors and viewers, they work best when you want to see how assemblies fit together (and to fix any assembly mistakes).  The ability to import fasta files is mainly there so you can add new reads to already existing assemblies.

    It might help if you let me know what you want to do with your contigs.

    Andrew

     
  • Ehsan
    2011-06-27

    I have all these contigs in fasta format, and a bunch of Illumina and 454 reads also in fasta format. The idea is to do manual assembly by trying to assemble all the contigs into one large contig and filling the gaps with the short Illumina and 454 reads. Because of the presence of repeats, so far no automated assembler could do it.
    So, all I need is to transfer the fasta files into a format readable by gap5. Again, I do not have access to the other file formats (such as ace, etc).

    Thanks

     
  • Okay, so I tried to create a gap5 file from a fasta file and it reliably crashes.  I'll have to take a closer look to see what is going on.

     
  • I've fixed the problem in the source code.  If you are happy with compiling the Staden Package from Subversion then the fix is there for you.  If not, let me know what OS you are running on and I'll see what I can do.

    Andrew 

     
  • Ehsan
    2011-06-28

    Got the new version from Subversion using this command: svn co https://staden.svn.sourceforge.net/svnroot/staden staden
    Apparently no error was reported during the process, but when I try tg_index, it still gives me the same screen output and no file is created… Am I even using the correct command?
    This is the command I use: tg_index -o database_db 454AllContigs.fna

    Thanks
    E

     
  • The command is right but your output should look something like this:

    tg_index -o fred fred.fna
        g_index:    Short Read Alignment Indexer, version 1.2.13-rSVN_VERSION
        Author:     James Bonfield (jkb@sanger.ac.uk)
                    2007-2011, Wellcome Trust Sanger Institute
    Database version=2
    Processing FASTA file fred.fna
    Loading fred.fna...
    Loaded 4 sequences
    Sorting sequence name index
    buf=sort < /tmp/file11QMxd > /tmp/filev7FvXf
    done
    Building index
    Nbuckets  = 1024
    Nused     = 19
    Avg chain = 0.018555
    Chain var.= 0.020164
    %age full = 1.757812
    max len   = 2
    cache_size= 1024
    N.cached  = 18
    N.locked  = 0
    Searches  = 77
    Cache hits= 39 ( 50.65)%
    Chain  0   = 1006
    Chain  1   = 17
    Chain  2   = 1
    === btree_hash ===
    Nbuckets  = 1024
    Nused     = 1
    Avg chain = 0.000977
    Chain var.= 0.000976
    %age full = 0.097656
    max len   = 1
    cache_size= 1024
    N.cached  = 0
    N.locked  = 0
    Searches  = 1
    Cache hits= 1 (100.00)%
    Chain  0   = 1023
    Chain  1   = 1
    === btree_hash ===
    Nbuckets  = 1024
    Nused     = 1
    Avg chain = 0.000977
    Chain var.= 0.000976
    %age full = 0.097656
    max len   = 1
    cache_size= 1024
    N.cached  = 0
    N.locked  = 0
    Searches  = 1
    Cache hits= 1 (100.00)%
    Chain  0   = 1023
    Chain  1   = 1
    === g idx_hash ===
    Nbuckets  = 131072
    Nused     = 256
    Avg chain = 0.001953
    Chain var.= 0.001965
    %age full = 0.194550
    max len   = 2
    cache_size= 131072
    N.cached  = 0
    N.locked  = 0
    Searches  = 543
    Cache hits= 286 ( 52.67)%
    Chain  0   = 130817
    Chain  1   = 254
    Chain  2   = 1
    *** I/O stats (type, write count/size read count/size) ***
    GT_RecArray               1              7        2              0
    GT_Bin                   12            157        0              0
    GT_Range                  4            105        0              0
    GT_BTree                  2            136        0              0
    GT_Track                  0              0        0              0
    GT_Contig                 8            107        4             24
    GT_Seq                    0              0        0              0
    GT_Anno                   0              0        0              0
    GT_AnnoEle                0              0        0              0
    GT_SeqBlock               1          12546        1              0
    GT_AnnoEleBlock           0              0        0              0
    

    If you are not getting the full thing, including the IO stats at the end, then it is going wrong somewhere.  I suspect it is crashing but not telling you for some reason.
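
The check described above (did the run end normally, and did it reach the I/O stats block?) can be scripted. This is a hedged Python sketch, not a Staden tool: `diagnose` and `run_tg_index` are hypothetical names, the only tg_index option used is the `-o` shown in this thread, and the "I/O stats" marker is simply taken from the sample output above. On POSIX systems a negative return code from `subprocess` means the process was killed by that signal number, which is how a silent segmentation fault would show up.

```python
import signal
import subprocess


def diagnose(returncode: int, output: str) -> str:
    """Classify a finished run from its exit status and captured output."""
    if returncode < 0:
        # POSIX: negative value means "terminated by signal -returncode".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    if returncode != 0:
        return f"exited with status {returncode}"
    if "I/O stats" not in output:
        return "exit status 0 but no I/O stats block in the output"
    return "looks complete"


def run_tg_index(database: str, fasta: str) -> str:
    """Run tg_index as in the thread and report how it ended."""
    proc = subprocess.run(
        ["tg_index", "-o", database, fasta],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep error messages with normal output
        text=True,
    )
    return diagnose(proc.returncode, proc.stdout)
```

Calling `run_tg_index("database_db", "454AllContigs.fna")` would then give a one-line summary to paste into a bug report, assuming tg_index is on the PATH.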

    After you retrieved the source code from Subversion did it build and install properly?

    Andrew

     
  • Ehsan
    2011-06-29

    After retrieving the source code, it went through a long list of files, ending with "Checked out revision 2561."

    Yet I still get the same output:

            g_index:        Short Read Alignment Indexer, version 1.2.11

            Author:         James Bonfield (jkb@sanger.ac.uk)
                            2007-2011, Wellcome Trust Sanger Institute

    === g idx_hash ===
    Nbuckets  = 131072
    Nused     = 0
    Avg chain = 0.000000
    Chain var.= 0.000000
    %age full = 0.000000
    max len   = 0
    cache_size= 131072
    N.cached  = 0
    N.locked  = 0
    Searches  = 0
    Cache hits= 0 (  -nan)%
    Chain  0   = 131072

    === g idx_hash ===
    Nbuckets  = 131072
    Nused     = 0
    Avg chain = 0.000000
    Chain var.= 0.000000
    %age full = 0.000000
    max len   = 0
    cache_size= 131072
    N.cached  = 0
    N.locked  = 0
    Searches  = 0
    Cache hits= 0 (  -nan)%
    Chain  0   = 131072
    gio_open: Success

    Is it possible to give a logfile where we can see what's wrong?

     
  • "After retrieving the source code, it went through a long list of files, ending by Checked out revision 2561"

    You need to configure it, build it and install it.  Instructions are in the README file.  If you are unfamiliar with compiling software it can prove trying.

    If tg_index is crashing it should produce a core file or at the very least produce an error message about segmentation violations.  It would help me to know what operating system you are running on.

     

  • Anonymous
    2012-07-13

    Andrew,

    I just installed Staden 2.0.0b9 on my Mac (10.6) and have the same problem described above: I cannot open fasta files and don't have any other format. I tried to follow the link mentioned above for an update but didn't know what to do there… Sorry, I am not computer savvy at all. Can you please help me?
    Thanks,

    Meg

     
  • Hi Meg,

    As I wrote above, you really need more than fasta files to make proper use of gap5.

    "I tried to follow the link mentioned above for an update but didn't know what to do there"

    The link pre-dates v2.0.0b9 and is no longer relevant.

    What are you trying to do?  What data do you have available?

    regards,

    Andrew

     

  • Anonymous
    2012-07-13

    Andrew,

    Thank you for answering so fast.
    I have 454 sequencing data of different phage genomes. The data has been organized by phage into individual fasta files with 30,000 to 50,000 reads of 450 nt average length.
    I am trying to assemble the genomes, define a consensus and then do the usual: blast, annotate, predict ORFs and proteins etc.

    Meg

     
  • John Nash
    2012-07-13

    I wish genome centres would stop giving Newbler dumps of assembled data to customers without explaining what it is.

    Meg, if the data is already assembled by the genome centre, and you want to resolve the contigs, you should really convert the ace file (which should have been generated as part of the process) into a gap5 database with tg_index, and then edit each phage, one by one, in gap5.

    If all you have is a fasta file or files, it's probably more complex to process, but check to see if they can give you ace files for each genome.

     

  • Anonymous
    2012-07-13

    Andrew,

    I will ask them right now.

    Magali

     

  • Anonymous
    2012-07-13

    Andrew,

    However, the data is not assembled by the sequencing center; it is just sorted by phage (i.e. by tag) since we sequenced a dozen at once. Does that change anything regarding obtaining ace format files?

    Meg

     
  • Hi Meg,

    It was John who mentioned sequencing centres and the ace files.  If they haven't assembled it for you (and I would have expected them to do so) then you are going to have to assemble them yourself.

    Andrew 

     

  • Anonymous
    2012-07-13

    Andrew,

    Sorry for the confusion, I didn't see it was John who answered.
    I am willing and want to assemble them; it is still a matter of being able to use the only file format I have (.fna) with Gap5 (which can handle 454 sequences, if I understood correctly), which requires .aux, .g5d or .g5x format files. So how can I convert my fasta sequence files to one of these Gap5-friendly formats?

    Meg

     
  • John Nash
    2012-07-13

    Hi Meg (this is John)

    It is extremely inadvisable to assemble any DNA sequences without quality values (i.e. how accurately the machine thinks that the specific nucleotide has been called).  In the old days when we typed in sequence from a piece of film, we had to guess sometimes.  Then fancy-dancy machines used to generate sequence which came with quality values embedded.  For example, ABI files had quality values embedded, and we didn't really even notice.

    Fasta format has no quality values, and we started to deal with that in several ways. With finished sequences, we don't need quality scores, so fasta files are ok. For assembling raw reads, we do need them. Typically these days, we use fastq files, which have the quality scores embedded in them, or we use fasta/qual file pairs, e.g., "sequence.fasta" and its pair "sequence.fasta.qual". These can be fed into an assembler which uses the quality values during the assembly to be more accurate. Assembling sequence without quality values can be very dangerous.
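
To make the fasta/qual pairing concrete, here is a small Python sketch that merges a FASTA file and its matching QUAL file into FASTQ text. It is an illustration under stated assumptions, not a tool from this thread: the function names are hypothetical, the QUAL file is assumed to hold whitespace-separated integer Phred scores under the same ">name" headers and in the same order as the FASTA, and the output uses the common Sanger-style encoding (score + 33 as an ASCII character). The example read and scores are made up.

```python
def _records(lines):
    """Yield (name, payload_lines) for each '>'-headed record."""
    name, payload = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, payload
            fields = line[1:].split()
            name, payload = (fields[0] if fields else ""), []
        elif name is not None:
            payload.append(line)
    if name is not None:
        yield name, payload


def fasta_qual_to_fastq(fasta_lines, qual_lines):
    """Return FASTQ text built from paired FASTA and QUAL line iterables."""
    seqs = [(n, "".join(p)) for n, p in _records(fasta_lines)]
    quals = [(n, [int(q) for q in " ".join(p).split()])
             for n, p in _records(qual_lines)]
    if len(seqs) != len(quals):
        raise ValueError("FASTA and QUAL have different numbers of records")
    out = []
    for (sname, seq), (qname, scores) in zip(seqs, quals):
        if sname != qname:
            raise ValueError(f"record name mismatch: {sname} vs {qname}")
        if len(seq) != len(scores):
            raise ValueError(f"{sname}: {len(seq)} bases, {len(scores)} scores")
        if any(not 0 <= q <= 93 for q in scores):
            raise ValueError(f"{sname}: score outside 0..93")
        encoded = "".join(chr(q + 33) for q in scores)  # Phred+33
        out.append(f"@{sname}\n{seq}\n+\n{encoded}\n")
    return "".join(out)


# Made-up one-read example.
print(fasta_qual_to_fastq([">r1", "ACGT"], [">r1", "40 40 30 2"]), end="")
```

The strict name and length checks are deliberate: a silent mismatch between bases and scores would defeat the purpose of carrying quality values into the assembler.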

    Assemblers like Newbler (the 454 assembler, aka gsAssembler) and MIRA (my favourite) take sequence and quality values and spit out assemblies.  These are typically ace, sam or caf files, and contain each read, where they map to the consensus, the consensus itself, and quality values for the read AND the consensus.  Gap5 is not an assembler but an assembly editor and proofreader, and it takes in assembly files (ace, sam and caf files) via tg_index which converts these assemblies into a gap database (e.g. yourProject.g5d and yourProject.g5x).

    For each of your phage genomes you need to get the ace file (typically generated by the sequencing centre) and make a gap5 database. Or you need to ask the sequencing centre for the raw reads (or flowgrams), which are sff files, and feed them into an assembler like MIRA or Newbler (if you have a copy of either) to generate an assembly. MIRA is free but very complicated and can run on Unix or Mac OS X; Newbler comes with the 454 sequencer and is point-and-click but runs on Unix.

    J

     
  • John Nash
    2012-07-13

    Arrgh - I wish I could edit my previous comments…

    To clarify a point mentioned above: the raw flowgrams (sff files) are either read directly by the assembler (e.g. Newbler) or converted to fastq files using a pre-processor and then fed to an assembler (e.g. MIRA).

     
  • John Nash
    2012-07-13

    This is a rant and I have just had an espresso.  I should really put it on SEQanswers or something. It is not directed at Meg but directed at people who will use Google and find this thread in the future. I have to deal with this issue EVERY single WEEK of my professional life.

    If your sequencing centre just gives you a bunch of fasta files after you pay them money to generate sequence, berate them and complain bitterly, or ask nicely - your choice. They should ALWAYS provide you with the raw reads, even if you never use them. You paid for them. It is important data. If it's 454, ask for the flowgrams. If it's Illumina, ask for the fastq or bam files. If it's a new technology in the future, ask for the raw output.

    But in 2012, it's the 454 facilities who are the big culprits. The Illumina guys are pretty good at giving raw data - but then, they generally don't assemble for you.

    If they do assembly for you, you should ALWAYS get their assembled data too, not just the fasta list of contigs.

    Because one day, if you get stuck and ask a bioinformatician for help, he or she will ask for the raw data because that is the only way that mistakes can be checked out.

    Thank you

     