Gap5 Contig Data Corruption

Hugo Reyes
  • Hugo Reyes

    Hugo Reyes - 2011-03-02


    I'm currently running Gap5 v1.2.11.  There are some problems with a file that I've been working with.  All the contigs in the original files have sequence data, but after unmasking and joining many of the contigs, some of them no longer appear to contain any sequences.  Has anyone encountered this bug yet?  I'm hoping that I don't have to go back to using Gap4.


  • James Bonfield

    James Bonfield - 2011-03-02

    I've occasionally seen bugs where the start and end positions of contigs aren't where you expect (i.e. they are incorrect). So maybe the editor thinks bases 1 to 100000 are the contig range, with a scrollbar acting accordingly, but the data is in fact still at, say, 200000 to 300000. I think I've tracked down most of these bugs, but the break contig code is a horrendously complex beast so I wouldn't like to bet on it being 100% bug free yet!
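To make the symptom concrete, here is a minimal sketch of the mismatch described above: a recorded contig range that disagrees with where the reads actually sit. This is a toy model only. The `Read` class and both functions are hypothetical names for illustration and are not part of Gap5 or its API.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Toy read record (hypothetical, not a Gap5 structure)."""
    name: str
    start: int  # leftmost base, inclusive
    end: int    # rightmost base, inclusive


def actual_extent(reads):
    """Return (lowest start, highest end) over all reads."""
    return min(r.start for r in reads), max(r.end for r in reads)


def bounds_mismatch(recorded_start, recorded_end, reads):
    """True if the recorded contig range differs from the reads' extent."""
    return (recorded_start, recorded_end) != actual_extent(reads)


# The editor believes the contig spans 1..100000,
# but the reads still sit at 200000..300000.
reads = [Read("r1", 200_000, 250_000), Read("r2", 240_000, 300_000)]
print(actual_extent(reads))
print(bounds_mismatch(1, 100_000, reads))
```

In this model the scrollbar would follow the recorded range, so data outside it would look as if it had vanished even though it is still stored.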

    If you have any specific examples of cases where doing something causes data to vanish (either apparently or genuinely) then please do let me know. I'm keen to fix such things, but getting hold of examples can be hard.

  • Hugo Reyes

    Hugo Reyes - 2011-03-02

    I rechecked the problem contigs and it looks like you're exactly right.  The editor thinks the contig falls between position -300 and about 10k.  I can't go farther using the scroll bar, but there is data from about 16.3k to 26.8k. I also can't manipulate any of the data, because the editor sends me back to what it thinks is the end of the contig.  This has happened to about 30 contigs.  If I go back through all the unmasking and joining I'll make more backups and keep track of what might be causing the problem.

    Is there any way to shift the data around so that it falls within the window of the scroll bar?  Or is there some other way to save this data?

    We had an error when converting the files from the Gap4 format into Gap5 that may have something to do with the problem (contig00001 is one of the contigs with the scrollbar problem):

    The message was generated by the tg_index program while converting a GAP5 ACE file into a CAF file.

    Warning message:

    Contig contig00001
    Warning: A read with no Align_to_SCF record has been found.
    Some traces may not align correctly to the corresponding reads

  • James Bonfield

    James Bonfield - 2011-03-03

    I don't think the CAF error will have had a major impact, especially as right now Gap5 isn't correctly aligning edited reads back to traces anyway. (Sorry!)

    As for fixing your data, I can't say for sure what would cure it. Possibly positioning the editor cursor on the last read and using Control + Right Arrow to move it right one base (out of alignment), saving, then using Control + Left Arrow to move it back (and saving again) will cause it to re-evaluate the contig boundaries. Alternatively, you could try exporting as SAM format and then reloading using tg_index.

    When you say unmasking, I assume you mean adjusting the cutoff positions of sequences so that data that was once hidden as cutoff is now in use again? I don't see why it would cause problems, but it's not something I've tested as much in Gap5, as a lot of the work has been with Illumina data instead.

    Have you been running Break Contig on these contigs? That algorithm is one where I've had bugs before involving not moving data correctly. If you break a 1.5Mb contig at position 500Kb you're meant to get a 500Kb contig and a 1Mb contig, both starting at base 1. Internally though it starts off with a 1Mb contig starting at base 500Kb and then it has to apply a shift to move the data down. This is actually just one single value to edit, and is just what you need to do, but there's no user interface to do that manually at the moment. (If you're brave and like scripting/coding there are ways of programmatically doing such things, but it's a bit of a heroic method for fixing your DB.)
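The shift described above can be sketched with a toy model in which a contig is just a list of (start, end) read coordinates. This is an illustration under stated assumptions, not Gap5's implementation: `break_contig` is a hypothetical function, no read is assumed to span the break point, and the offset is applied to every read here for clarity, whereas the post says Gap5 itself stores it as a single value.

```python
def break_contig(reads, break_pos):
    """Split reads at break_pos and renumber the right-hand part from base 1.

    Toy model: reads are (start, end) tuples, inclusive, and no read
    is assumed to span break_pos.
    """
    left = [(s, e) for s, e in reads if s < break_pos]
    right = [(s, e) for s, e in reads if s >= break_pos]

    # Immediately after the split, the right-hand contig still uses the old
    # coordinates (it "starts" at break_pos). One offset moves it down so
    # that its first base becomes base 1.
    offset = min(s for s, _ in right) - 1
    right_shifted = [(s - offset, e - offset) for s, e in right]
    return left, right_shifted


# A 1.5Mb contig broken at 500Kb.
reads = [
    (1, 400_000),
    (300_000, 499_999),
    (500_000, 900_000),
    (800_000, 1_500_000),
]
left, right = break_contig(reads, 500_000)
print(left)
print(right)
```

If the offset step were skipped, the right-hand contig would keep its data at the old coordinates while its recorded range started at base 1, which matches the symptom reported earlier in the thread.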

