A frg library was created from NCBI traces using the method described here:
http://sourceforge.net/p/wgs-assembler/wiki/TracearchiveToCA/
for wgs built from trunk on 2014_07_03. When gatekeeper is run as shown below, it segfaults in one of the "2" files. Changing the order of the input files lets it get through more or fewer entries, but it always segfaults.
/home/mathog/wgs_project/wgs_trunk_2014_07_03/Linux-amd64/bin/gatekeeper \
  -o /home/mathog/wgs_project/corrected_Spur_PACBIO/temppacbio_corrected/asm.gkpStore.BUILDING \
  -F \
  /home/mathog/wgs_project/NCBI_traces/strongylocentrotus_purpuratus.1.lib.frg \
  /home/mathog/wgs_project/NCBI_traces/strongylocentrotus_purpuratus.2.001.frg \
  /home/mathog/wgs_project/NCBI_traces/strongylocentrotus_purpuratus.2.002.frg
Using gdb I traced this down to what appears to be a basic memory allocation error in AS_GKP_checkFrag.C. The attached patch resolved that error and let the program run longer. It is currently grinding through the 26 "2" frg files, and I expect it will finish.
Perhaps there was a change recently in that file? It is hard to see how this bug could have been around for long.
I do not know yet if this patch results in correctly constructed output, just that it keeps gatekeeper from crashing on this input.
Yup, that looks broken, and your fix is correct. It only breaks after 2048 libraries are encountered, which is quite large. How many libraries are in your 1.lib.frg? There might be a problem with the conversion.
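One rough way to answer the library-count question is to count the LIB messages in the frg file. The sketch below assumes the version 2 frg layout, in which each library record opens with a line starting with `{LIB`. The helper name `count_lib_records` and the in-memory sample are illustrative only and are not part of wgs-assembler.

```python
import io


def count_lib_records(lines):
    """Count lines that open a LIB message (assumed to start with '{LIB')."""
    return sum(1 for line in lines if line.startswith("{LIB"))


# Tiny in-memory stand-in for a real .frg file; the contents are illustrative.
sample = io.StringIO(
    "{VER\nver:2\n}\n"
    "{LIB\nact:A\nacc:PQASP\n}\n"
    "{LIB\nact:A\nacc:SPWFP\n}\n"
)

n_libs = count_lib_records(sample)
print(n_libs, "libraries; more than 2048:", n_libs > 2048)
```

For a real file, passing an open file handle (for example `open("strongylocentrotus_purpuratus.1.lib.frg")`) in place of `sample` should work the same way, since the function only iterates over lines.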
wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/strongylocentrotus_purpuratus/fasta
wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/strongylocentrotus_purpuratus/qual
wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/strongylocentrotus_purpuratus/xml*
This is a collection of Sanger reads, some of which are BAC end sequences. The number of "active" entries in each library varies between 3 and 2325619. The library names are very peculiar: PQASP, PQCZP, SPWFP, SPWGQ. Looking in the xml, one finds trace entries that look like this (arrows added):
It appears that TracearchiveToCA is making the library names out of either the first 5 letters of PLATE_ID, or the first N letters before a digit or other delimiter.
Am I correct in assuming that TracearchiveToCA should be using SUR3 instead?
EDIT: Looked in TracearchiveToCA and found that it was using the SEQ_LIB_ID field.
Last edit: David Mathog 2014-07-18
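To see which values a field such as SEQ_LIB_ID actually takes across the xml files, a small tally can help. This is only a sketch: it assumes each field is a plain child element with no XML namespace, matches the tag name case-insensitively because the exact casing in the TraceDB dumps is not confirmed here, and uses made-up sample records. The function `tally_field` is a hypothetical helper, not something from tracearchiveToCA.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tally_field(xml_source, field="SEQ_LIB_ID"):
    """Count the distinct text values of one field, streaming through the XML."""
    wanted = field.upper()
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag.upper() == wanted:
            counts[(elem.text or "").strip()] += 1
        elem.clear()  # release memory; the value has already been read
    return counts


# Illustrative records only; real TraceDB entries carry many more fields.
sample = io.BytesIO(
    b"<trace_volume>"
    b"<trace><seq_lib_id>PQASP</seq_lib_id></trace>"
    b"<trace><seq_lib_id>PQASP</seq_lib_id></trace>"
    b"<trace><seq_lib_id>SPWFP</seq_lib_id></trace>"
    b"</trace_volume>"
)

counts = tally_field(sample)
for name, n in counts.most_common():
    print(name, n)
```

Running the same tally with a different `field` argument on the same file would show whether another field gives a more sensible set of library names.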
Ugh, this data has special cases written into tracearchiveToCA especially for it:
Now I'm thinking that the standard instructions for tracearchiveToCA may not apply to this data. Is there some easy way to figure out who added that comment?