Memory leak when combined 454/Illumina db

Brought to you by: awhitwham, jkbonfield

#82 Memory leak when combined 454/Illumina db

Status: open

Owner: James Bonfield

Labels: Gap5 (15)

Priority: 5

Updated: 2010-10-07

Created: 2010-10-07

Creator: Torsten Seemann

Private: No

I have a draft genome (~3 Mbp in 60 contigs), some paired 454 reads (~30x) and some paired Illumina reads (~200x).

When I use gap5 with just the 454 reads alone, or the Illumina reads alone, it all works fine. If I create a database of them both, I can load it, but when I "Edit contig" the CPU goes to 100% and memory use continually increases (I let it go to 48 GB and then killed it).

To generate the 454 database I use gsMapper to align the .SFF files to the draft contigs, and it produces a .ace file:
% tg_index -o EF.db -A EF.454.ace -t

To append the Illumina database I create a .bam using SHRiMP2+samtools and add it in:
% tg_index -o EF.db -b EF.bam -a -g -t

Am I doing anything wrong? Is gsMapper (Newbler) producing a dodgy .ace file? It has lots of pads in it, and gap5 seems to import the contig itself as a 'read' too. Is "-g" the correct thing to do? The 100% CPU/RAM++ looks like an infinite loop with memory allocation in it.

Thanks for any help!

Torsten.

Discussion

James Bonfield - 2010-10-07

I'm not sure if -g is appropriate or not! What did you align your illumina data against?

The plan for -g was that if you exported a consensus from gap5, depadded it in the process, and then used that as a reference to align against to produce a bam file, then when importing it would know which gaps in the BAM file matched already existing ones in Gap5's consensus, which were new and need adding to the existing gap5 data, and which are missing in the bam data (most likely if you started with 454) and need adding to the bam data during import.
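
To make the padded versus unpadded distinction concrete, here is a minimal sketch of the coordinate shift involved. It assumes '*' is the pad character (as in .ace files), and the helper names `depad` and `unpadded_to_padded` are hypothetical, made up for illustration; this is not how tg_index implements -g.

```shell
# Hypothetical helper: strip '*' pads from a padded consensus string.
depad() {
    printf '%s\n' "$1" | tr -d '*'
}

# Hypothetical helper: given a padded consensus ($1) and a 1-based position
# in its depadded form ($2), print the matching 1-based padded position.
unpadded_to_padded() {
    printf '%s\n' "$1" | awk -v target="$2" '{
        n = 0                                   # non-pad bases seen so far
        for (i = 1; i <= length($0); i++) {
            if (substr($0, i, 1) != "*") n++
            if (n == target) { print i; exit }
        }
    }'
}

depad 'AC*GT**A'                  # prints ACGTA
unpadded_to_padded 'AC*GT**A' 3   # prints 4
unpadded_to_padded 'AC*GT**A' 5   # prints 8
```

A read aligned at position 5 of the depadded reference therefore belongs at position 8 of the padded consensus, which is the kind of shift an import has to account for when the two inputs disagree about pads.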

It's all a bit ghastly. I'm currently working on trying to get gap5 to entirely work with unpadded data and BAM style cigar strings. It's proving more than a little bit complex unfortunately, but in time I hope it'll remove the need for these nasty coordinate transforms provided there's a consistent reference.

Either way, I wouldn't expect this behaviour. Something has clearly gone wrong somewhere, but I'm not sure what. Is it possible to export just one contig from gap5 as SAM and produce a new gap5 db from it using tg_index? Does that show the same problem? If so it means it may be possible to generate a small test set or to look manually at the SAM alignment strings to see if there's something strange going on. (Or possibly exporting will hang too if we're unlucky.)
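
One way to look over the SAM alignment strings without reading them all by eye is a small awk filter. The sketch below is only a generic sanity check, not a diagnosis of this bug: it flags records whose CIGAR string fails to parse or whose query-consuming operations (M, I, S, =, X) do not add up to the length of SEQ. The function name `check_cigar` is hypothetical.

```shell
# Hypothetical helper: report SAM records whose CIGAR is inconsistent with SEQ.
# Usage: check_cigar file.sam [more.sam ...]
check_cigar() {
    awk -F '\t' '
    /^@/ { next }                            # skip header lines
    $6 == "*" || $10 == "*" { next }         # skip records lacking CIGAR or SEQ
    {
        cigar = $6; qlen = 0
        # Consume one "<count><op>" element at a time from the front.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            n  = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op ~ /[MIS=X]/) qlen += n    # operations that consume query bases
            cigar = substr(cigar, RLENGTH + 1)
        }
        # Leftover text means an unparsable CIGAR; otherwise compare lengths.
        if (cigar != "" || qlen != length($10))
            print $1 "\t" $6 "\tcigar_len=" qlen "\tseq_len=" length($10)
    }' "$@"
}
```

Running it on the exported single-contig SAM prints one line per suspicious read and nothing at all if every record is self-consistent.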

Torsten Seemann - 2010-10-08

(Arrghh, I wrote a big response, and when I tried to upload a file with it, it rejected it for size, but I lost all my textbox state!)

In summary, I reproduced the bug: (1) again with the full data, (2) again by exporting the .sam of one contig and recreating a database, and (3) with a simple 1000bp example from the same readsets.

Here are the files for (3) excluding the raw reads:
http://dna.med.monash.edu.au/~torsten/tmp/staden-bug-3082717.tar.gz

Torsten

James Bonfield - 2010-10-08

Many thanks for the test data, although oddly it's working just fine here. I'll try compiling the latest released beta version rather than my current SVN checkout just in case it's something I've already fixed, although it doesn't ring any bells.

I should point out that "working fine" is misleading too: it views what it's given, but the data is out of alignment. I think the problem here is the order of steps in your example, but I need to experiment more on the cause of that. Mixing ace and bam is tricky though, as one is padded while the other is unpadded. Or rather, an overcall in the consensus of a 454 assembly will shift data. One solution to this is to import the ace first, output the consensus from gap5, then use this new sequence (likely subtly different to c.fa) as the reference for shrimp to produce your bam, and then merge in that bam file. Messy!
