Re: [Inchworm-users] Max Kmer size
From: Brian J H. <bh...@br...> - 2011-02-25 17:50:10
Also, I forgot to mention, the Trinity paper (describing Inchworm and the
other tools to be released) is currently under review.

Best,

-b

On Fri, Feb 25, 2011 at 12:48 PM, Brian J Haas <bh...@br...> wrote:

> Hi Greg,
>
> I've responded to your questions below:
>
> On Fri, Feb 25, 2011 at 12:23 PM, Greg Concepcion <gco...@gm...> wrote:
>
>> Hi Brian,
>>
>> Thanks for the awesome resource!
>>
>> I've been using the previous version of inchworm (inchworm_01-20-2011
>> <http://sourceforge.net/projects/inchworm/files/OLD_VERSIONS/inchworm_01-20-2011.tgz/download>)
>> for de novo assembly of transcriptomic data from a non-model organism with
>> no reference genome available. So far my success has been great, both in
>> terms of transcript length and maximum memory requirements (which crippled
>> my Velvet/Oases assembly).
>>
>
> Excellent news!
>
>> My strategy so far has been to run inchworm on the raw data with a range
>> of Kmer sizes, from 25-37, concatenate those outputs and reassemble using
>> the '--reassembleIworm' option.
>> This brings me to my two questions so far:
>> 1. I had to modify the source to allow a max Kmer size >31. Is there a
>> particular reason for this limit?
>>
>
> The true maximal limit should be 32, since the kmers are stored as 64 bit
> unsigned integers (with 2 bits per base encoding). I made the max 31
> because I was hoping to reserve the last couple of bits to store additional
> info at some point... but I haven't used it. If you try to go beyond 32,
> you're still only storing a 32mer worth of sequence and tossing out the
> other bits. A different storage strategy would need to be built into
> inchworm to go higher than 32 bases, and that would require some serious
> reengineering. In my various tests, I've found that 25mers work very well
> for transcriptome data, and going higher (such as beyond 29mers) can end up
> fragmenting some otherwise nice full-length transcripts. Also, the strategy
> of assembling using a bunch of different kmer lengths and combining the data
> (which I borrowed conceptually from trans-ABySS, though it works very
> differently here) doesn't buy much, at least in the various tests I've done.
> In many cases, we end up getting an ever so slight increase in the number of
> full-length transcripts.
>
>> 2. I doubt the strategy I'm using is "the best" one; however, from one lane
>> of a flow cell (~5 Gb raw Illumina data (2x76bp * 34E06 reads)) I was able
>> to generate ~120 Mb of consensus sequence representing >500,000
>> contigs/transcripts (above a 100bp threshold). Should I be doing anything
>> differently to maximize Inchworm's potential to assemble transcripts? So far
>> the lengths are pretty good, with >8,000 transcripts longer than 1,000bps,
>> and a few in the 10,000bp range.
>>
>
> Is this strand-specific data?
>
> Note, I made some important improvements to inchworm recently: both
> removal of error-containing kmers and adjustments for non-strand-specific
> data. I definitely encourage you to give it a whirl: just run it once with
> the default settings (k=25) and see how it looks. The total number of
> contigs (and the amount of complete garbage that results) should be minimal,
> and the contigs that are reported should be heavily enriched for quality
> assemblies.
>
> Also note, I'm hoping to release the full Trinity package sometime next
> week (http://trinityrnaseq.sf.net) which takes Inchworm results to another
> level of utility, especially where alternative splicing is concerned.
>
> Best,
>
> -brian
>
>> Also, I'm hoping to submit this data for publication in the near future,
>> is there an ETA on a date for a publication that I can cite?
>>
>> Aloha!
>>
>> Gregory T. Concepcion, PhD
>> Cell Biology and Molecular Genetics
>> 2107 Biosciences Research Building
>> University of Maryland
>> College Park, MD 20742
>>
>> w:301.405.8300
>> c:301.828.8210
>>
>> _______________________________________________
>> Inchworm-users mailing list
>> Inc...@li...
>> https://lists.sourceforge.net/lists/listinfo/inchworm-users
>>
>
> --
> Brian J. Haas
> Manager, Bioinformatics Outreach, Genome Annotation and Analysis
> The Broad Institute
> http://broad.mit.edu/~bhaas

--
Brian J. Haas
Manager, Bioinformatics Outreach, Genome Annotation and Analysis
The Broad Institute
http://broad.mit.edu/~bhaas
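
For readers following the kmer-size discussion in this thread: the 32-base ceiling comes straight from the arithmetic, since 64 bits at 2 bits per base holds exactly 32 bases, and a cap of 31 leaves the top 2 bits free. Below is a minimal Python sketch of that kind of packing. It is not Inchworm's source code; the function names `encode_kmer` and `decode_kmer` and the A=0, C=1, G=2, T=3 mapping are assumptions made for illustration only.

```python
# Illustrative only: not Inchworm's source. The base-to-bits mapping and
# the function names are assumptions made for this sketch.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"
MASK_64 = (1 << 64) - 1  # emulate a 64-bit unsigned integer


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into a 64-bit word, 2 bits per base."""
    word = 0
    for base in kmer:
        # Shift left by one base; bits pushed past bit 63 are discarded.
        word = ((word << 2) | BASE_TO_BITS[base]) & MASK_64
    return word


def decode_kmer(word: int, k: int) -> str:
    """Unpack the lowest 2*k bits of a word back into a k-mer string."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[word & 3])
        word >>= 2
    return "".join(reversed(bases))


# A 25-mer (the default k mentioned in the thread) round-trips cleanly.
kmer25 = "ACGTACGTACGTACGTACGTACGTA"
assert decode_kmer(encode_kmer(kmer25), 25) == kmer25

# A 33-mer does not fit: only 32 bases' worth of bits survive the packing.
kmer33 = "G" + "ACGT" * 8
assert decode_kmer(encode_kmer(kmer33), 32) == kmer33[1:]
```

In this sketch the leading base of an oversized kmer is the one that falls off; which end is lost in a real implementation depends on how the word is filled. Going beyond 32 bases would need a wider or multi-word representation, which is the kind of reengineering the reply refers to.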