Deleting 200 kb within a contig, how?

  • Hiya,

    assume I have the following: a mapping assembly with a reference sequence (contig of 5 megabases) with quite some reads mapped to it. Imported in gap4 (or gap5, does not matter really).

    Certain parts of the reference sequence are not covered by reads (or only a read here and there). Say I'd like to remove from the contig (i.e., the reference sequence with the mapped reads) the stretches which have no or almost no coverage. What I do at the moment is this in contig_editor.tcl:

        bind Editor <Delete>        {
        for {set iii 0} {$iii <20390} {incr iii} {
            %W delete_key

    The above will delete 20390 positions from the contig if I place the cursor at a certain consensus position, allow gap4 to delete contig positions regardless of consensus and hit "Delete".

    Question: would there be a more intelligent, respectively faster way to do this?


  • James Bonfield
    James Bonfield

    Well I think you've demonstrated already intelligence there in coming up with a particularly devious solution to the problem.

    However I agree it's perhaps not the most elegant method, and likely slow (especially in gap4) as it's shifting data 20390 times. Starting at the lowest level - in Gap4 as least it could be possible to modify tkEditor.c implementation of "delete_key" to take an argument for how many bases to remove, and edDeleteKey/edDeleteKeyCommon in edInterface.c to call deleteBasesConsensus with more than 1 as num_bases (it already supports deleting multiples).  That's still a nasty code hack though - indeed nastier as it involves recompiling.

    Gap5 operates differently as the C parts of the editor are mainly display code, with the editing being done within the Tcl language instead. That'd make it easier for such hackery, but still unpleasant. The best way to see how this works in Gap5 is to look at the code in contig_editor.tcl, which no doubt you've already done, and src/tgap/tg_tcl.c for the current list of commands available for each type of Tcl object implemented. Eg contig_cmd() there responds to "delete_base" method which in turn then calls contig_delete_base() in tg_contig.c.  This also doesn't take an argument though for the number of bases to remove. Programmatically it's pretty easy though. A single base is removed via:

        set c [$io get_contig $rec]
        $c delete_base $pos
        $c delete; # Deallocate memory for $c - this doesn't remove the contig itself

    Within the confines of existing existing gap4/gap5 code though I'd say the optimal way to do this is using break contig and a manual forced join. If you've got a single sequence spanning the entire contig, eg a reference sequence, then this can cause problems with breaking. I'd say the best bet there is to duplicate it and then adjust the clip points; set left clip point on one and right clip point on the other.

    Then you've got two choices really to break the contig up. You can either do two break contig calls or you can try selecting the read names for the data in the middle and running disassemble readings on it. This is easier, but if you have any other "holes" in your contig then they'll also get broken apart at the same time. After that it's a matter of joining the two ends back together again.

    It's more work, but would be more scalable if you have huge regions to remove as it only involves shifting data twice rather than once per base removed.


  • James Bonfield
    James Bonfield

    I forgot to add, I'm a little bit unsure what happens to reads when you delete every single base within them! I'm a bit suprised Gap4 doesn't grumble at zero base readings, but I guess not. :-)

    This is why I'd say disassemble readings is a safer solution.

  • Hmmm … more work than hoped for.

    The problem I have is hopefully a one-time job with a couple of deletions, the biggest one being ~200k … it runs overnight, no big deal. In case I need that more often I will certainly give your solution a more thorough look.

    gap4 does indeed grumble (zero length gels), but apart from that it continues to work fine. Also, gap2caf generates slightly invalid reads of size 1 (first base) and having clipping point "2 1" (left clip > sequence size), but that's OK as I can catch and correct for that further downstream.

    Thanks a lot for your respone :-)


  • Then again … multiple break contigs might be a very clean solution:
    1) break at the end of the hole (into a separate contig)
    2) disassemble readings at the first stray mapped read within the whole (discarding those reads)
    3) clip back the reference, join, extend the reference back to normal and adapt … finished

    I think I'll try that. Thanks a lot for that idea.