#275 gatekeeper segfault on reading frg library

Group: gatekeeper
Status: open
Owner: Brian Walenz
Labels: None
Priority: 5
Updated: 2014-07-18
Created: 2014-07-17
Creator: David Mathog
Private: No

A frg library was created from NCBI traces using the method described here:

http://sourceforge.net/p/wgs-assembler/wiki/TracearchiveToCA/

for wgs built from trunk on 2014_07_03. When run like this it segfaults in one of the "2" files. Changing the order of the input files makes it get through more or fewer entries before crashing, but it always segfaults.

/home/mathog/wgs_project/wgs_trunk_2014_07_03/Linux-amd64/bin/gatekeeper \
-o /home/mathog/wgs_project/corrected_Spur_PACBIO/temppacbio_corrected/asm.gkpStore.BUILDING \
-F \
/home/mathog/wgs_project/NCBI_traces/strongylocentrotus_purpuratus.1.lib.frg \
/home/mathog/wgs_project/NCBI_traces/strongylocentrotus_purpuratus.2.001.frg \
/home/mathog/wgs_project/NCBI_traces/strongylocentrotus_purpuratus.2.002.frg

Using gdb I traced this down to what appears to be a basic memory allocation error in AS_GKP_checkFrag.C. The attached patch resolved that error and let the program run longer. It is currently grinding through the 26 "2" frg files, and I expect it will finish.

Perhaps there was a change recently in that file? It is hard to see how this bug could have been around for long.

I do not know yet if this patch results in correctly constructed output, just that it keeps gatekeeper from crashing on this input.

1 Attachment

Discussion

  • Brian Walenz
    2014-07-18

    • assigned_to: Brian Walenz
    • Group: consensus --> gatekeeper
     
  • Brian Walenz
    2014-07-18

    Yup, that looks broken, and your fix is correct. It only breaks after 2048 libraries are encountered, which is an unusually large number. How many libraries are in your 1.lib.frg? There might be a problem with the conversion.
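
One quick way to answer the "how many libraries" question is to count the message headers in the frg file. The sketch below is only an illustration, not part of wgs-assembler: the function name `count_frg_messages` is made up, and it assumes that each message in the frg file opens with a line consisting solely of `{` plus a three-letter type such as `{LIB` or `{FRG`.

```python
import re
from collections import Counter

# Assumption: every frg message starts with a line that is exactly
# "{" followed by three uppercase letters, e.g. "{LIB" or "{FRG".
MESSAGE_OPEN = re.compile(r"^\{([A-Z]{3})$")


def count_frg_messages(lines):
    """Count message types (LIB, FRG, ...) seen in an iterable of lines.

    Hypothetical helper for illustration; it does not validate the file.
    """
    counts = Counter()
    for line in lines:
        match = MESSAGE_OPEN.match(line.rstrip("\r\n"))
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle for the 1.lib.frg file and reading `counts["LIB"]` gives the number of library records, which can then be compared against the 2048 threshold mentioned above.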

     
  • David Mathog
    2014-07-18

    1. Yes, I think there is something strange going on with the TracearchiveToCA conversion. I have never done this before, so it wasn't clear to me what to expect. The source files are these (26 files):

    wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/strongylocentrotus_purpuratus/fasta
    wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/strongylocentrotus_purpuratus/qual

    wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/strongylocentrotus_purpuratus/xml*

    This is a collection of Sanger reads, some of which are BAC end sequences. The number of "active" entries in each library varies between 3 and 2325619. The library names are very peculiar, for example: PQASP, PQCZP, SPWFP, SPWGQ. Looking in the xml one finds trace entries that look like this (arrows added):

        <trace>
                <CENTER_NAME>BCM</CENTER_NAME>
                <CENTER_PROJECT>PQAQ</CENTER_PROJECT>
                <CLIP_LEFT>0</CLIP_LEFT>
                <CLIP_RIGHT>828</CLIP_RIGHT>
                <INSERT_SIZE>2000</INSERT_SIZE>
                <INSERT_STDEV>1000</INSERT_STDEV>
                <LIBRARY_ID>SUR3</LIBRARY_ID>             <------------1
                <PLATE_ID>PQAQP1D0100</PLATE_ID>          <------------2
                <RUN_DATE>Feb 14 2003 12:00AM</RUN_DATE>
                <RUN_GROUP_ID>641223</RUN_GROUP_ID>
                <RUN_MACHINE_ID>KNS</RUN_MACHINE_ID>
                <RUN_MACHINE_TYPE>3700</RUN_MACHINE_TYPE>
                <SEQ_LIB_ID>PQAQP</SEQ_LIB_ID>
                <SOURCE_TYPE>GENOMIC</SOURCE_TYPE>
                <SPECIES_CODE>STRONGYLOCENTROTUS PURPURATUS</SPECIES_CODE>
                <STRATEGY>POOLCLONE</STRATEGY>
                <SUBMISSION_TYPE>NEW</SUBMISSION_TYPE>
                <TEMPLATE_ID>PQAQP0101</TEMPLATE_ID>
                <TI>182797044</TI>
                <TRACE_END>F</TRACE_END>
                <TRACE_FORMAT>SCF</TRACE_FORMAT>
                <TRACE_NAME>47817235</TRACE_NAME>
                <TRACE_TYPE_CODE>SHOTGUN</TRACE_TYPE_CODE>
        </trace>
               <LIBRARY_ID>SUR3</LIBRARY_ID>
                <PLATE_ID>PQAQP1D0100</PLATE_ID>
    

    It appears that TracearchiveToCA is making the library names out of either the first 5 letters of PLATE_ID, or the first N letters before a digit or other delimiter.

    Am I correct in assuming that TracearchiveToCA should be using SUR3 instead?

    EDIT: Looked in TracearchiveToCA and found that it was using the SEQ_LIB_ID field.
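
To see how the choice of field changes the number of libraries, the distinct values of each candidate field can be tallied straight from the xml. This is a rough sketch under assumptions: the function name `tally_fields` is hypothetical, and it assumes every read is a `<trace>` element with the child tags shown in the example above. It is not how TracearchiveToCA itself does it.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tally_fields(xml_source, fields=("LIBRARY_ID", "SEQ_LIB_ID")):
    """Count how many <trace> records carry each value of each field.

    xml_source may be a filename or a binary file object.
    Returns {field_name: Counter({value: number_of_traces})}.
    """
    tallies = {field: Counter() for field in fields}
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "trace":
            continue
        for field in fields:
            value = elem.findtext(field, default="(missing)")
            tallies[field][value] += 1
        elem.clear()  # release the finished record to keep memory flat
    return tallies
```

Comparing `len(tallies["LIBRARY_ID"])` with `len(tallies["SEQ_LIB_ID"])` for one of the xml files shows how many libraries each naming scheme would produce.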

     
    Last edit: David Mathog 2014-07-18
  • David Mathog
    2014-07-18

    Ugh, this data has special cases written into tracearchiveToCA especially for it:

    #  Early Baylor SeaUrchin data has more than two fragments for
    #  each SEQ_LIB_ID, so we also include the RUN_GROUP_ID, which
    #  seems to differentiate them.
    #
    #  Except it also breaks lots of mates in later files.
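
The "breaks lots of mates" remark can be checked against the xml directly. The sketch below is an illustration under assumptions, not code from tracearchiveToCA: it assumes mated reads share a TEMPLATE_ID, that the library key is the pair (SEQ_LIB_ID, RUN_GROUP_ID) as the comment describes, and the function name `split_templates` is hypothetical. It reports templates whose reads would land in more than one library under that key.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_templates(xml_source):
    """Find templates whose reads fall under more than one library key.

    Assumed library key: (SEQ_LIB_ID, RUN_GROUP_ID).
    Assumed mate rule: reads with the same TEMPLATE_ID are mates.
    xml_source may be a filename or a binary file object.
    """
    keys_by_template = defaultdict(set)
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "trace":
            continue
        template = elem.findtext("TEMPLATE_ID")
        if template is not None:
            key = (elem.findtext("SEQ_LIB_ID"), elem.findtext("RUN_GROUP_ID"))
            keys_by_template[template].add(key)
        elem.clear()  # drop the finished record
    # Keep only templates that map to two or more distinct keys.
    return {t: keys for t, keys in keys_by_template.items() if len(keys) > 1}
```

An empty result for a file would suggest the composite key is harmless there; a large one would match the "breaks lots of mates" warning. Note that the dictionary holds one entry per template, so memory use grows with the size of the xml file.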
    

    Now I'm thinking that the standard instructions for tracearchiveToCA may not apply to this data. Is there some easy way to figure out who added that comment?