Join Contigs w/ clipped reads and no overlap

  • Nathan S. Watson-Haigh

    I have a Newbler assembly which has generated scaffolds where many neighbouring contigs have no gaps between them. A quick look at the clipped read info at the ends of these contigs shows that there are reads which span the join. I'd like to merge these contigs; however, because there is zero overlap between them, gap5 will not allow me to do so.

    Is there a way to get gap5 to join contigs with no overlap, or to override the check for overlap, i.e. trust that what I'm doing is sensible? Also, I see in the README for the latest release that gap5 currently can't make use of clipped read information. Is this a feature that could be implemented soonish? It would be good to be able to move the clipping region manually and force an overlap to be detected.


  • John Nash

    John Nash - 2012-02-06

    What I would do is make an artificial read spanning the join as a FASTA file, and import it. It can then be used to span the two contigs that you wish to join. Join the left-hand contig to the spanning sequence, then join that new contig to the right-hand contig.
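    A minimal sketch of building such a spanning read, assuming the two contig consensus sequences are already available as plain strings. The function name, the `flank` length and the read name are hypothetical choices for illustration; nothing here is part of gap5 or Newbler.

```python
def make_spanning_read(left_contig, right_contig, flank=100, name="span_join"):
    """Build a FASTA record from the tail of one contig and the head of the next.

    Assumption: the contigs abut with zero gap, so the pseudo-read is simply
    the last `flank` bases of the left contig followed by the first `flank`
    bases of the right contig.
    """
    if flank <= 0:
        raise ValueError("flank must be a positive number of bases")
    left_tail = left_contig[-flank:]    # whole contig if shorter than flank
    right_head = right_contig[:flank]
    seq = left_tail + right_head
    # Wrap the sequence at 60 columns, a common FASTA convention.
    lines = [seq[i:i + 60] for i in range(0, len(seq), 60)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_spanning_read("AAAACCCC", "GGGGTTTT", flank=4, name="j1"), end="")
```

    The resulting record can be written to a file and imported as described above. Note that such a read asserts the join rather than providing evidence for it.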


  • Nathan S. Watson-Haigh

    First of all, sorry for not reading the manual more closely! It is actually possible to set/change the clipping point of reads in gap5 with the use of < and > in the contig editor!

    So my main issue is this:
    I have an assembly with 191k contigs in 18k scaffolds. However, I may be able to reduce the number of contigs down to 68k by performing 120k joins between neighbouring contigs that have zero gaps between them. I hope most of these joins will be supported by read information, which has been clipped by newbler, but spans between the contigs.

    I'd prefer to only use information already present in the reads rather than creating pseudo-reads - if possible. My workflow is like this:
    1) Open two contigs in the join editor
    2) Navigate to the 3' end of the first contig and the 5' end of the 2nd contig
    3) View the clipped information and move the clip point of 1 or more reads to the end of the respective read
    4) Ask gap5 to do an alignment based on the clipped info which is now visible to it for alignment purposes
    5) Manually edit the alignment and merge the two contigs if appropriate
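    The idea behind steps 3 and 4 can be approximated outside gap5 to triage candidate joins before opening the editor. This is only a sketch: it assumes the clipped tails of reads at the 3' end of the first contig have already been extracted as strings (how to export them is not shown), and it does a plain ungapped comparison rather than a real alignment, so any indel in a read would defeat it. All names and thresholds are hypothetical.

```python
def tail_supports_join(clipped_tail, right_contig, min_len=10, max_mismatch=1):
    """Return True if a read's clipped tail matches the start of the next contig.

    Assumption: the contigs abut with zero gap, so the first clipped base of
    the read should line up with the first base of the right-hand contig.
    """
    n = min(len(clipped_tail), len(right_contig))
    if n < min_len:
        return False  # too little sequence to be convincing
    mismatches = sum(
        1
        for a, b in zip(clipped_tail[:n], right_contig[:n])
        if a.upper() != b.upper()
    )
    return mismatches <= max_mismatch


def count_supporting_reads(clipped_tails, right_contig, **kwargs):
    """Count how many clipped tails are consistent with the proposed join."""
    return sum(tail_supports_join(t, right_contig, **kwargs) for t in clipped_tails)


if __name__ == "__main__":
    right = "ACGTACGTACGTTTGA"
    tails = ["ACGTACGTACGT", "ACGTACGTACCT", "TTTTTTTTTTTT", "ACG"]
    print(count_supporting_reads(tails, right))
```

    Joins with several supporting reads could then be prioritised for manual inspection in the join editor; the count is a hint, not a substitute for checking the alignment.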

    Clearly this is a burdensome task with 120k possible joins to perform. It appears that the gap5 features that would help are either:
    1) Not working for me, or
    2) Not yet implemented

    Firstly, "Find internal joins" has an option for searching "hidden" data, but this doesn't work for me. This feature would allow me to search for possible joins without the need to move clip points in reads and navigate from one to the other using the "next" button.

    Secondly, there appears to be an unimplemented feature that would allow joins to be made using the clipped data. Currently, however, there needs to be an overlap in the unclipped regions of the contigs. This means I have to mess around with lots of clipping info just to get the contigs to merge.

    Thirdly, there seems to be a new feature in the code repo that extends contigs using the clipped data. This might help me as I wouldn't have to manually edit the clip points. However, I don't know how to get Staden compiled from the code repo.

    I don't know tcl, but I'm having a quick look to see if anything is obvious/easy to add.

  • James Bonfield

    James Bonfield - 2012-02-13

    The newest code for extending contigs is specifically aimed at addressing some of the downfalls of NGS assemblies. These very often lose track of the individual reads during assembly, for memory efficiency, to produce a set of consensus sequences. As the last step of assembly the reads are then mapped back to the newly produced consensuses (consensii?).

    This mostly works OK, but sometimes we see that the original consensus was truncated, so we end up with, say, 50 reads all clipped at exactly the same absolute position in the contig (the contig end), yet with cutoff data that is in alignment with each other.

    The algorithm simply detects these and extends the cutoff data for as far as multiple sequences are still in alignment with each other. It can run across multiple contigs too, although it's not desperately fast. It's also very basic in operation as it performs no sequence alignment - simply extension if the data is in agreement.
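    As an illustration of extension by agreement (not the actual gap5 code), here is a minimal sketch. It assumes the cutoff portions of the reads have been collected as strings that all start at the contig end, and it treats "in alignment" as strict column-by-column identity with a hypothetical `min_depth` threshold; the real implementation may well use different rules.

```python
def extend_by_agreement(cutoffs, min_depth=2):
    """Return the bases by which a contig end could be extended.

    Assumptions: every string in `cutoffs` starts at the same contig end, and
    extension stops at the first column where fewer than `min_depth` reads
    remain or where the remaining reads disagree. No alignment is performed.
    """
    extension = []
    col = 0
    while True:
        # Bases contributed at this column by reads that are long enough.
        bases = [c[col].upper() for c in cutoffs if col < len(c)]
        if len(bases) < min_depth:
            break  # not enough reads left to trust the extension
        if len(set(bases)) != 1:
            break  # reads disagree, stop extending
        extension.append(bases[0])
        col += 1
    return "".join(extension)


if __name__ == "__main__":
    print(extend_by_agreement(["ACGTA", "ACGTT", "ACG"]))
```

    With the three example reads above, the sketch extends through the columns where at least two reads are present and identical, and stops at the fifth column where the two remaining reads differ.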

    I wasn't aware the FIJ with cutoff data was so broken. I'll investigate it. I can trivially reproduce this problem on a small data set.

    It'll help when fixed, but it's still a lot of work. What you really need is an automatic joiner (with all the danger and perils that it brings). I don't have such a thing yet though, sorry.

  • James Bonfield

    James Bonfield - 2012-02-13

    Looking at gap5/consen.c now I see that adding the cutoff/hidden data to the ends of contigs was something that got commented out when converting this from Gap4. Presumably it was a quick hack to get it up and running that I immediately forgot about. Sorry! I'll finish off the task now. :-)

