[Denovoassembler-devel] Random thoughts about Ray
Ray -- Parallel genome assemblies for parallel DNA sequencing
From: David E. (gringer) <dav...@mp...> - 2011-07-18 14:13:20
On 15/07/11 14:40, Sébastien Boisvert wrote:
>> So, for real phiX data, I prepared the sequence by masking out
>> bases with quality scores less than 20, and then filtering out the
>> 'N' bases
> You don't need to filter anything with Ray.

You have a great confidence in Ray's ability to deal with any type of sequencing mistake, but I'm afraid I don't share that confidence. There is no way I would carry out a genome assembly on data that only had low quality scores, and I would expect that garbage input would only serve to confuse Ray and make it spend a bit more time processing that data.

On 13/07/11 17:50, Sébastien Boisvert wrote:
> Well, as I understand, you did not respect one simple rule of version
> control:
> * Never commit broken code.

I'm sorry I did that. I did try to keep to the rules, but I'm new to Git (mostly on the push/pull side), and to Ray, so occasionally I make mistakes in my commits.

> As Linus Torvalds would say: commit early, commit often.

This seems to contradict the statement you made in the same email. I have some experience doing bisects (particularly with Wine), so I'm aware of the pain involved when things don't compile, but I also know that sometimes it can't be helped [http://wiki.winehq.org/RegressionTesting#head-020f8ea312cecdbccdb0d7936f07b9a39b105791]. One typical example is library changes -- code may have worked fine at an earlier date, but you need to include a change made in some future build in order to get it working for the more modern libraries.

> Basically, you must search your own commits to find where you
> screwed. But you can not do that because you are committing broken
> code to your fork.

FWIW, I got into a state where I had 2-3 days worth of code to add, and I wanted to be able to commit at least once per day (so I could do additional work from home, if I had inspiration).
This meant that the code was broken, in parts, but I was aware of how to fix those parts, so it didn't make much difference to me when bisecting. Git's bisect command has a 'skip' subcommand that allows you to bypass certain commits, which still reduces the number of manual searches that need to be done on code changes.

> Furthermore, I am not sure that all the emails about you debugging
> the bugs you introduced in your forks are of general interest for
> denovoassembler-users. I therefore created a development mailing list
> (should be up in maximum 24 hours).

Thanks for doing this. I like being able to CC a list, but didn't feel entirely comfortable talking about code details on a -users list.

On 14/07/11 02:00, Sébastien Boisvert wrote:
>> So changing that is easy enough to do, it just means the Ray run
>> might take tens of seconds to finish, rather than seconds
> Yes, but beware that this simple test will fail to catch major bugs.

I'm aware that a full test would be a good idea, but would be interested to know what kinds of major bugs you would expect to crop up that are not evident in an attempted assembly of a small genome. I guess memory allocation bugs and message processing would be examples of things that may not be caught for a small genome.

On 13/07/11 17:42, Sébastien Boisvert wrote:
> The code you are trying to modify is very stable and should not be
> modified. This code manages the living workers.

As opposed to this other section of code, which also manages the living workers (line 21, sequencesIndexer.cpp)?

> /* TODO: find the memory leak in this file -- during the selection
> of optimal read markers, the memory goes up ? */

The bits I was looking at weren't quite the same area of code, but I had narrowed down my assembly problem to that specific section of code, which happened to be just after the "selection of optimal read markers".
I had to modify that section, because the alternative was broken code that wouldn't assemble correctly.

> Does your institution have a SOLiD 5500xl ?

No, we have a SOLiD 4 that is collecting dust and has yet to be used. I'm not particularly happy with colour-space myself, but if I'll be working with it in the future, then I need to be confident that I can process the data that comes off that machine. The SOLiD sequencing data that I'm currently working with was generated by another lab.

On the other hand, our Illumina sequencer (HiScan SQ) has had at least one successful run (possibly 3), and should get a fair bit of use over the next few months. This is nice for me, because the base-space output is much less confusing to handle, and there are many more applications that will work in base-space.

> Life Technologies should write a base caller software that transform
> these color-space files into fastq files. All major vendors
> (Illumina, 454, Pacific Biosciences, Ion Torrent, Helicos, and
> probably Complete Genomics) do that already.

It doesn't really make sense to do the colour-space conversion until after the assembly/mapping has been done. It's a fundamentally different technology which uses labelled dinucleotides to pair with DNA sequence. Here's a somewhat fabricated example that demonstrates how post-processing conversion can be useful:

|> 1_61_1487_F3
|T[3.03231010100100123]2112320311233112131121011303123
|> 1181_1125_1187_F3
|T2.0032320031212023231321020332[3203231010100100123]0

I've bracketed the similar sequences from both samples. If the conversion to base-space were done on the sequencer, both sequences would be effectively useless. However, using colour-space information from '1181_1125_1187_F3', you can fill in the second colour for '1_61_1487_F3', and hence work out the entire sequence.
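
To make the fill-in step concrete, here is a minimal Python sketch of it. It assumes the usual SOLiD two-base encoding (with A=0, C=1, G=2, T=3, each colour is the XOR of two adjacent base codes) and that the alignment of the bracketed overlap is already known. The function names decode_colourspace and fill_missing are made up for illustration and are not part of Ray or any vendor tool.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3, so a colour is the XOR of two base codes


def decode_colourspace(read):
    """Translate primer base + colours into bases; 'N' once the chain breaks."""
    code = BASES.index(read[0])  # primer base, not part of the sequenced DNA
    decoded = []
    for colour in read[1:]:
        if code is None or colour not in "0123":
            code = None  # every later base depends on this one
            decoded.append("N")
        else:
            code ^= int(colour)
            decoded.append(BASES[code])
    return "".join(decoded)


def fill_missing(colours, donor):
    """Replace each '.' with the colour at the same offset in an aligned donor."""
    return "".join(d if c == "." else c for c, d in zip(colours, donor))


# Read 1_61_1487_F3 from the example, with the brackets removed.
read = "T3.032310101001001232112320311233112131121011303123"
# Bracketed colours from 1181_1125_1187_F3 (assumed to align with read[1:20]).
overlap = "3203231010100100123"

print(decode_colourspace(read))     # decoded on its own: breaks at the '.'
patched = read[0] + fill_missing(read[1:20], overlap) + read[20:]
print(decode_colourspace(patched))  # decoded after borrowing the missing colour
```

Decoded on its own, the read gives one base and then nothing but 'N', because each base is derived from the one before it. With the single colour borrowed from the other read, the whole read decodes.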
You could say that Life Technologies should just write their own colour-space de-novo assembler in order to resolve base errors, but then it's just working on colour-space, which removes the error-reduction advantage of hybrid genome assemblers.

Thanks for your time,

--
David Eccles (gringer)