Menu

Using gap4 from c program without tcl dependency

2018-08-21
2018-08-28
  • Will Stokes

    Will Stokes - 2018-08-21

    Hi, I'm looking into the feasibility of using the gap4 algorithm from C code. I understand the exp format and can already generate such files myself. I'm not looking into gap4 and and finding documentation like this:

    http://staden.sourceforge.net/scripting_manual/scripting_111.html

    to be a bit out of date. E.g. there is no io.c (it now appears to be io1.c) and I cannot find a mk directory or a gap4_defs.mk file. I have been exploring setting up a project to build a subset of the gap4 (and related "library", aka g, Misc, etc.) code but am unsure how feasible this is going to be. While I have found the various IO related code, I have yet to find the entry point where a gap4 for alignment is computed. Furthermore I'm disappointed to find much of the the code contains includes for <tcl.h> and <tk.h>. I'd prefer my project to not have a tcl/tk dependency. Is it accurate to ssay that the gap4 algorithm is entirely implemented in C, or is part of it effectively in tcl as well? Has anyone tried to compile a subset of gap4 sufficent to generate a database and compute a contig without linking with tcl/tk?</tk.h></tcl.h>

     
  • James Bonfield

    James Bonfield - 2018-08-22

    Sadly the Staden Package is largely unsupported now, barring bug fixes perhaps if something major and simple cropped up. It's simply a matter of time, most of which is now spent maintaining samtools et al and the CRAM code.

    Regarding Gap4 (and gap5 for that matter), it is inherently tied to tcl/tk. The main GUI of Gap4 is Tcl/Tk with the actual functionality written in C and then providing new Tcl commands to execute those C functions. This provides an easy scripting environment while also maintaining speed. It would be tricky to disentangle this for everything, but not too bad if you just want one single algorithm.

    In your case you mention "the gap4 algorithm". Are you referring to the sequence assembly algorithm? If so you've hit on the most ancient part of Gap4 - be warned! This code is in FORTRAN, dating back some 30 years. See src/gap4/legacy.f (DBAUTO function) with an "FtoC" converted file provided for ease of compliation in src/gap4/legacy_f2c.c. The good news is neither have any Tcl dependency. :-)

    To be honest, if you're just looking for an old sequence assembly algorithm that works on low quality sequences with lots of indels, then Phrap was generally the best of that era. We usually used it ourselves infact and felt that Phrap for assembly + Gap4 for editing was the best combination of tools in that era. The are many newer tools out there of course, but I haven't kept track of the latest and greatest.

     
  • Will Stokes

    Will Stokes - 2018-08-24

    Thanks, I appreciate the frank assessment of the original staden codebase. In addition to Phrap I'm also looking into AMOS which looks promising. I was originally contemplating using Lucy to clean my data before passing it to an assembler and am not yet sure how to best (or if I even need to) do that with AMOS. I have had some success with AMOS and am also looking into using samtools or htslib to convert from SAM to BAM or CRAM. I'm presently not dealing with NGS data but would like to future proof my work so CRAM looks quite compelling. That said, I'm struggling to find documentation for how to use htslib. I can see there are extensive comments in the various .h files in the htslib directory. Is there a better way to get up to speed on using htslib other than studying this and the samtools code itself? I'm primary intersted in being able to do the fllowing:

    -get a list of contigs
    -for a given contig, get a list of the reads that successfully aligned
    -for a given read aligned to a contig, get the range where it aligned and random access for reading in substrings of the aligned read

    I can see how to open a CRAM file although I'm missing the equivalnet of in iolib where you just asked it to open the fie (be it SAM, BAM, or CRAM) and it "just works".

    I'm also curious if there are any provisions for editing contigs via htslib, if the user thinks the aligner got it wrong somewhere. If you can point me to any good examples or documentation I'd really appreciate it. Thanks!

     
  • James Bonfield

    James Bonfield - 2018-08-28

    If you're not looking to do interactive editing and finishing, then Gap4 (and indeed Gap5) aren't the ideal tool and I would thoroughly recommend BAM/CRAM instead as it's what most tools cope with now. The burden of using gap4/5 isn't worth it without the requirement of being able to do on-the-fly editing.

    SAM et al are really designed around references, but contigs are sufficient still to act in the same purpose provided you don't go over several million ish mark, at which point the 2Gb size limit of the header may come into play. Each contig will be an @SQ header line, so the list of them is just "samtools view -H". BAM is just a binary version of SAM and likely the best fit for you. CRAM may also work well, but with assemblies you'll probably want to enable the embedded reference or referenceless modes, which can make it a bit more cumbersome to work with.

    Once indexed, randomaccess is trivial - eg "samtools view file.bam contig10" for all of contig10 or use "contig10:1000-2000" for a portion. Programmtically sam_itr_querys is the function to use here, to get an iterator to permit reading all data within a single region.

    Yes htslib documentation is basically nonexistant bar the header files. :-( As far as opening any file (SAM, BAM or CRAM), use htslib's "sam_open" function. It's confusingly named! There is also sam_open_format which is an extension to this permitting additional arguments, mainly existing for CRAM which can need more data passed in (eg the reference). This is mainly what samtools itself uses. Try looking at something simple to start with such as samtools flagstat (in the bam_stat.c).

    As far as editing within samtools/htslib, it's not done interactively. A few tools attempt to do their own automatic edits by local realignment, etc, but the scale of NGS projects has meant the manual curation step has been removed so the tools don't really exist (bar gap5). Generally programmatic editing is done by streaming a file in, editing, and streaming it out again. It's not efficient, but can be part of a pipeline if the edit is simply an algorithmic thing.

     

Log in to post a comment.