Work at SourceForge, help us to make it a better place! We have an immediate need for a Support Technician in our San Francisco or Denver office.

Close

Contig Comparator, Read Pairing and Libraries

2012-01-27
2013-04-18
  • I have a large assembly of ~500k contigs from a 450Mbase organism generated by Newbler. I've converted the ACE file to the gap5 database format and run "find read pairs". However, in the Contig Comparator it seems that only the top left part of the plot has any dots and the rest of the contig comparator is completely void of dots. I find it hard to believe there are no pairs here, so I have a few questions:

    1) Is there an upper limit on the number of data points that gap5 will display in the contig comparator?
    2) How does tg_index actually identify read pairs from my Newbler generated ACE file?
    3) How does tg_index/gap5 know what my library insert sizes are?

    Since my contigs are ordered according to scaffolds generated by Newbler, I expect pairs to fall along parallel lines at my insert sizes apart (for intra-contig pairs/mates). This would be more accurate if the display also displayed gaps between contigs so that distances between inter-contig mates were closer to their expected distance apart. Is there any way to do this? I'm thinking of rewriting my contig ACE using information from my AGP scaffold file to output the scaffolds as an ACE file….any guidance on doing that?

    Cheers,
    Nathan

     
  • James Bonfield
    James Bonfield
    2012-01-27

    1) I cannot think of a specific limit, other than it slowing down and using more memory up. One solution to this is to use the contig selector to select a subset of the contigs (not those with matches) and then do find read pairs on those instead.

    2) In gap5 pairs are normally identified by a common sequence name, as we started off with MAQ and later SAM which only has template name data and not specific read names. Later when we added CAF support we allowed for the template name to be a separate field, but for brevity of storage we insist that a template name must be a prefix of the read name.

    Similarly for ACE we default to using read name matching (RD lines), but optionally can parse the DS lines to look for TEMPLATE:. Again if specified it must be a prefix of the read name.

    Once we find duplicate read/template names when loading a data set, they get paired together. Right now gap5 still doesn't cope with complex examples of 3 or more reads fon a template.

    3) The insert sizes are gleaned through assembly metrics. Whenever we're pairing reads during tg_index processing, when we find the second read of a read-pair we check the position of both reads and build up a histogram of the assembled pairs. This obviously doesn't include pairs that span contigs or singletons.

    Note this is absolute position in the padded assembly, so the stats will get skewed in deeply padded regions. However it's quick and simple to compute.

    As for a better display showing gaps between contigs, this currently doesn't exist. I have plans for a proper scaffold viewer, but it's a long term goal as there are still lots of other things to do too. I'm not sure on the best way of faking it right now. Some here seem to create fake sequences to join contigs together in a scaffold, but it causes more problems than it solves in my opinion. Note that contigs in gap5 can have holes, although it's non trivial to actually join them that way - only tg_index is happy with it.

     
  • Hi James,

    In response to my questions 1 and 3…

    1) I'll look into this a little further and see what my issue might be. In the mean time I'll try generate lists of contigs for each scaffold so I can work scaffold-by-scaffold to see if that helps.
    3) What happens when my ACE file contains pairs/mates from 3 different insert sizes? From your explanation, it sounds like they'll all get lumped together into the same distribution?

     
  • James Bonfield
    James Bonfield
    2012-02-13

    Correct on point 3.  I just looked again at the ACE format and I don't think it's possible to specify multiple libraries, so there is no choice.

    When all is said and done ACE, like it's cousin CAF, is an appalling format for next-gen sequence assemblies.

     
  • Is it possible to specify the libraries after the ACE conversion? Or could a script be written that allows reads to be assigned to a library using regular expressions? I quite like AMOS's way of doing this in their mates file.

    What I'd like to be able to do is use "find read pairs" with each library/insert size and colour them differently in the contig comparator. As it is at the moment if I join two contigs because they share overlapping ends and have read pairs at their ends I'll assuming those pairs are from my illumina 500bp inserts. However, there is a posibility that the overlap could be a repeat the the pairs are from a larger insert library.