Re: [Inchworm-users] Max Kmer size

Also, I forgot to mention, the Trinity paper (describing Inchworm and the
other tools to be released) is currently under review.

Best,

-b

On Fri, Feb 25, 2011 at 12:48 PM, Brian J Haas <bh...@br...> wrote:

> Hi Greg,
>
> I've responded to your questions below:
>
> On Fri, Feb 25, 2011 at 12:23 PM, Greg Concepcion <gco...@gm...> wrote:
>
>> Hi Brian,
>>
>> Thanks for the awesome resource!
>>
>> I've been using the previous version of inchworm (inchworm_01-20-2011<http://sourceforge.net/projects/inchworm/files/OLD_VERSIONS/inchworm_01-20-2011.tgz/download>)
>> for de novo assembly of transcriptomic data from a non-model organism with
>> no reference genome available. So far my success has been great, both in
>> terms of transcript length and maximum memory requirements (which crippled
>> my Velvet/Oases assembly).
>>
>>
> Excellent news!
>
>
>
>>  My strategy so far has been to run inchworm on the raw data with a range
>> of Kmer sizes, from 25-37, concatenate those outputs and reassemble using
>> the '--reassembleIworm' option.
>> This brings me to my two questions so far:
>> 1. I had to modify the source to allow a max Kmer size >31. Is there a
>> particular reason for this limit?
>>
>
> The true maximum should be 32, since the kmers are stored as 64-bit
> unsigned integers (with a 2-bit-per-base encoding).  I set the max to 31
> because I was hoping to reserve the last couple of bits to store additional
> info at some point... but I haven't used them.  If you try to go beyond 32,
> you're still only storing a 32mer's worth of sequence and tossing out the
> other bits.  Going higher than 32 bases would require building a different
> storage strategy into inchworm, which means some serious reengineering.  In
> my various tests, I've found that 25mers work very well for transcriptome
> data, and going higher (such as beyond 29mers) can end up fragmenting some
> otherwise nice full-length transcripts.  Also, the strategy of assembling
> with a bunch of different kmer lengths and combining the data (which I
> borrowed conceptually from trans-ABySS, though it works very differently
> here) doesn't buy much, at least in the various tests I've done.  In many
> cases, we end up getting only an ever-so-slight increase in the number of
> full-length transcripts.
>
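[Editor's sketch] The 32-base ceiling described above follows directly from packing 2 bits per base into a 64-bit word. The Python below is only an illustration of that idea, not Inchworm's actual C++ code: the names `encode_kmer`, `decode_kmer`, and `MASK_64` are hypothetical, the A/C/G/T-to-bit mapping is an assumption, and which bases get discarded on overflow depends on the real implementation (here the leading ones are shifted out).

```python
# Illustrative sketch only; names and bit mapping are assumptions,
# not taken from the Inchworm source.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"
MASK_64 = (1 << 64) - 1          # emulate a fixed-width unsigned 64-bit integer
MAX_BASES = 64 // 2              # 2 bits per base -> at most 32 bases


def encode_kmer(kmer: str) -> int:
    """Pack a kmer into a 64-bit value, 2 bits per base."""
    value = 0
    for base in kmer:
        # Shift in one base; the mask drops anything pushed past bit 63,
        # which is what happens to bases beyond the 32nd.
        value = ((value << 2) | BASE_TO_BITS[base]) & MASK_64
    return value


def decode_kmer(value: int, k: int) -> str:
    """Unpack the k most recently encoded bases from a packed value."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 3])
        value >>= 2
    return "".join(reversed(bases))
```

In this sketch a 33mer and the 32mer formed by its last 32 bases encode to the same integer, so the extra base carries no information, which matches the "tossing out the other bits" behavior described in the reply.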
>
>> 2. I doubt the strategy I'm using is "the best" one; however, from one lane
>> of a flow cell (~5 Gb of raw Illumina data: 2x76bp * 34E06 reads) I was able
>> to generate ~120 Mb of consensus sequence representing >500,000
>> contigs/transcripts (above a 100bp threshold). Should I be doing anything
>> differently to maximize Inchworm's potential to assemble transcripts? So far
>> the lengths are pretty good, with >8,000 transcripts longer than 1,000bps,
>> and a few in the 10,000bp range.
>>
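[Editor's sketch] As a sanity check on the figures quoted above, the arithmetic can be worked through as follows. This assumes "34E06 reads" means 34 million read pairs (the reading that reproduces the stated ~5 Gb) and treats the ">500,000 contigs" figure as exactly 500,000, so the mean contig length is an upper estimate; the function names are hypothetical.

```python
# Back-of-the-envelope check of the numbers in the message.
# Assumption: 34e6 is the number of read PAIRS, each pair being 2 x 76 bp.

def total_bases(read_length_bp: int, reads_per_pair: int, read_pairs: float) -> float:
    """Total sequenced bases in the lane."""
    return read_length_bp * reads_per_pair * read_pairs


def mean_contig_length(consensus_bp: float, n_contigs: int) -> float:
    """Average contig length given total consensus sequence and contig count."""
    return consensus_bp / n_contigs


raw_bp = total_bases(76, 2, 34e6)              # about 5.2e9 bases, i.e. ~5 Gb
mean_len = mean_contig_length(120e6, 500_000)  # 240 bp if exactly 500,000 contigs
print(f"raw data: {raw_bp / 1e9:.2f} Gb, mean contig length: {mean_len:.0f} bp")
```

Because the message reports more than 500,000 contigs, the true mean would be somewhat below this estimate, which is short compared with the >1,000 bp transcripts mentioned, so most contigs in that set are small fragments.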
>>
> Is this strand-specific data?
>
> Note, I made some important improvements to inchworm recently: both
> removing error-containing kmers and adjustments for non-strand-specific
> data.  I definitely encourage you to give it a whirl: just run it once with
> the default settings (k=25) and see how it looks.  The total number of
> contigs (and the amount of complete garbage that results) should be minimal, and
> the contigs that are reported should be heavily enriched for quality
> assemblies.
>
> Also note, I'm hoping to release the full Trinity package sometime next
> week (http://trinityrnaseq.sf.net) which takes Inchworm results to another
> level of utility, especially where alternative splicing is concerned.
>
>
> Best,
>
> -brian
>
>
>
>> Also, I'm hoping to submit this data for publication in the near future.
>> Is there an ETA for a publication that I can cite?
>>
>> Aloha!
>>
>> Gregory T. Concepcion, PhD
>> Cell Biology and Molecular Genetics
>> 2107 Biosciences Research Building
>> University of Maryland
>> College Park, MD 20742
>>
>> w:301.405.8300
>> c:301.828.8210
>>
>>
>>
>>
>
>
> --
> --
> Brian J. Haas
> Manager, Bioinformatics Outreach, Genome Annotation and Analysis
> The Broad Institute
> http://broad.mit.edu/~bhaas
>
>
>
>
>

-- 
-- 
Brian J. Haas
Manager, Bioinformatics Outreach, Genome Annotation and Analysis
The Broad Institute
http://broad.mit.edu/~bhaas