Thread: [Inchworm-users] Max Kmer size
From: Greg C. <gco...@gm...> - 2011-02-25 17:23:25
Hi Brian,

Thanks for the awesome resource!

I've been using the previous version of inchworm (inchworm_01-20-2011 <http://sourceforge.net/projects/inchworm/files/OLD_VERSIONS/inchworm_01-20-2011.tgz/download>) for de novo assembly of transcriptomic data from a non-model organism with no reference genome available. So far my success has been great, both in terms of transcript length and maximum memory requirements (which crippled my Velvet/Oases assembly).

My strategy so far has been to run inchworm on the raw data with a range of Kmer sizes, from 25-37, concatenate those outputs and reassemble using the '--reassembleIworm' option. This brings me to my two questions so far:

1. I had to modify the source to allow a max Kmer size >31. Is there a particular reason for this limit?

2. I doubt the strategy I'm using is "the best" one; however, from one lane of a flow cell (~5 Gb raw Illumina data (2x76bp * 34E06 reads)) I was able to generate ~120 Mb of consensus sequence representing >500,000 contigs/transcripts (above a 100bp threshold). Should I be doing anything differently to maximize Inchworm's potential to assemble transcripts? So far the lengths are pretty good, with >8,000 transcripts longer than 1,000bps, and a few in the 10,000bp range.

Also, I'm hoping to submit this data for publication in the near future. Is there an ETA on a date for a publication that I can cite?

Aloha!

Gregory T. Concepcion, PhD
Cell Biology and Molecular Genetics
2107 Biosciences Research Building
University of Maryland
College Park, MD 20742

w:301.405.8300
c:301.828.8210
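As a quick sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes "34E06 reads" counts read pairs (so each contributes 2 x 76 bases) and that "Gb" and "Mb" mean gigabases and megabases. The variable names are made up for this sketch.

```python
# Back-of-the-envelope check of the numbers in the message above.
READ_LENGTH_BP = 76        # from "2x76bp"
READS_PER_PAIR = 2         # paired-end
READ_PAIRS = 34e6          # assumption: "34E06 reads" counts read pairs

total_bases = READ_LENGTH_BP * READS_PER_PAIR * READ_PAIRS

CONSENSUS_BP = 120e6       # "~120 Mb of consensus sequence"
CONTIGS = 500_000          # ">500,000 contigs/transcripts"

# Because the contig count is a lower bound, this is an upper bound on the mean.
mean_contig_bp = CONSENSUS_BP / CONTIGS

print(f"raw data: {total_bases / 1e9:.2f} Gb")
print(f"mean contig length: at most about {mean_contig_bp:.0f} bp")
```

Under those assumptions the raw data comes to about 5.17 Gb, consistent with the "~5 Gb" quoted, and the mean contig length is at most roughly 240 bp, so most of the >500,000 contigs sit close to the 100bp threshold.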
From: Brian J H. <bh...@br...> - 2011-02-25 17:49:01
Hi Greg,

I've responded to your questions below:

On Fri, Feb 25, 2011 at 12:23 PM, Greg Concepcion <gco...@gm...> wrote:

> Hi Brian,
>
> Thanks for the awesome resource!
>
> I've been using the previous version of inchworm (inchworm_01-20-2011<http://sourceforge.net/projects/inchworm/files/OLD_VERSIONS/inchworm_01-20-2011.tgz/download>)
> for de novo assembly of transcriptomic data from a non-model organism with
> no reference genome available. So far my success has been great, both in
> terms of transcript length and maximum memory requirements (which crippled
> my Velvet/Oases assembly)

Excellent news!

> My strategy so far has been to run inchworm on the raw data with a range of
> Kmer sizes, from 25-37, concatenate those outputs and reassemble using the
> '--reassembleIworm' option.
> This brings me to my two questions so far:
> 1. I had to modify the source to allow a max Kmer size >31. Is there a
> particular reason for this limit?

The true maximal limit should be 32, since the kmers are stored as 64-bit unsigned integers (with 2 bits per base encoding). I made the max 31 because I was hoping to reserve the last couple of bits to store additional info at some point... but I haven't used it. If you try to go beyond 32, you're still only storing a 32mer worth of sequence and tossing out the other bits. Going higher than 32 bases (64 bits) would need a different storage strategy to be built into inchworm, which would require some serious reengineering.

In my various tests, I've found that 25mers work very well for transcriptome data, and going higher (such as beyond 29mers) can end up fragmenting some otherwise nice full-length transcripts. Also, the strategy of assembling using a bunch of different kmer lengths and combining the data (which I borrowed conceptually from trans-ABySS, though it works very differently here) doesn't buy much, at least in the various tests I've done. In many cases, we end up getting an ever so slight increase in the number of full-length transcripts.

> 2. I doubt the strategy i'm using is "the best" one, however from one lane
> of a flow cell (~5 Gb raw illumina data (2x76bp * 34E06 reads)) I was able
> to generate ~120 Mb of consensus sequence representing >500,000
> contigs/transcripts (above a 100bp threshold). Should I be doing anything
> differently to maximize Inchworm's potential to assemble transcripts? So far
> the lengths are pretty good, with >8,000 transcripts longer than 1,000bps,
> and a few in the 10,000bp range.

Is this strand-specific data?

Note, I made some important improvements to inchworm recently: both removing error-containing kmers and adjustments for non-strand-specific data. I definitely encourage you to give it a whirl. Just run it once with the default settings (k=25) and see how it looks. The total number of contigs (and the amount of complete garbage that results) should be minimal, and the contigs that are reported should be heavily enriched for quality assemblies.

Also note, I'm hoping to release the full Trinity package sometime next week (http://trinityrnaseq.sf.net), which takes Inchworm results to another level of utility, especially where alternative splicing is concerned.

Best,

-brian

> Also, I'm hoping to submit this data for publication in the near future, is
> there an ETA on a date for a publication that I can cite?
>
> Aloha!

--
Brian J. Haas
Manager, Bioinformatics Outreach, Genome Annotation and Analysis
The Broad Institute
http://broad.mit.edu/~bhaas
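The 32-base ceiling described in the reply above follows directly from the arithmetic: 64 bits / 2 bits per base = 32 bases. The sketch below illustrates that idea in Python. It is not Inchworm's actual code. The base-to-code mapping, the function names `encode_kmer` and `decode_kmer`, and the choice to drop the leading bases on overflow are all assumptions made for illustration. Python integers are unbounded, so a mask emulates the 64-bit unsigned word.

```python
# Illustrative 2-bit k-mer packing; NOT taken from the Inchworm source.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed mapping, 2 bits per base
BASES = "ACGT"
MASK64 = (1 << 64) - 1                    # emulates a 64-bit unsigned integer


def encode_kmer(kmer):
    """Pack a k-mer into an emulated 64-bit word, 2 bits per base."""
    value = 0
    for base in kmer:
        # Shift in the next base; anything pushed past bit 63 is discarded.
        value = ((value << 2) | CODE[base]) & MASK64
    return value


def decode_kmer(value, k):
    """Unpack the lowest 2*k bits of the word back into a k-mer string."""
    return "".join(BASES[(value >> (2 * i)) & 3] for i in reversed(range(k)))


kmer32 = "ACGT" * 8            # exactly fills the 64 bits
kmer37 = "GGGGG" + kmer32      # 5 bases too long for one word

print(decode_kmer(encode_kmer(kmer32), 32) == kmer32)   # round trip at k=32
print(encode_kmer(kmer37) == encode_kmer(kmer32))       # extra bases are lost
print(encode_kmer("T" * 31) >> 62)                      # top 2 bits free at k=31
```

In this sketch a 37-mer packs to the same word as its final 32 bases, so distinct 37-mers that share those bases become indistinguishable. That matches the description of "only storing a 32mer worth of sequence", although which end is discarded in the real implementation is not stated in the thread. The last line shows why a limit of 31 leaves two spare bits in the word.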
From: Brian J H. <bh...@br...> - 2011-02-25 17:50:10
Also, I forgot to mention, the Trinity paper (describing Inchworm and the other tools to be released) is currently under review.

Best,

-b