[Denovoassembler-devel] Random thoughts about Ray

On 15/07/11 14:40, Sébastien Boisvert wrote:

>> So, for real phiX data, I prepared the sequence by masking out
>> bases with quality scores less than 20, and then filtering out the
>> 'N' bases
> You don't need to filter anything with Ray.

You have great confidence in Ray's ability to deal with any type of
sequencing mistake, but I'm afraid I don't share it. There is no way I
would carry out a genome assembly on data that only had low quality
scores, and I would expect garbage input to do nothing except confuse
Ray and make it spend more time processing that data.
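
A minimal Python sketch of the preparation described above (mask bases
below quality 20, then drop the resulting 'N' bases). It assumes
Phred+33 encoded quality strings; `mask_and_filter` is a hypothetical
helper name for illustration, not part of Ray or any library.

```python
def mask_and_filter(seq, qual, min_q=20, offset=33):
    """Mask low-quality bases as 'N', then remove the 'N' bases.

    Assumes `qual` is a Phred+33 quality string of the same length as `seq`.
    Returns (masked_sequence, filtered_sequence).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    masked = "".join(
        base if ord(q) - offset >= min_q else "N"
        for base, q in zip(seq, qual)
    )
    # Dropping 'N' also removes any 'N' calls already present in the read.
    return masked, masked.replace("N", "")


if __name__ == "__main__":
    # Made-up read: '#' is quality 2, '5' is quality 20, 'I' is quality 40.
    print(mask_and_filter("ACGTAC", "II#I5I"))
```

Note that removing masked bases joins the flanking bases together, which
changes the read; whether that is acceptable depends on the assembler.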

On 13/07/11 17:50, Sébastien Boisvert wrote:
> Well, as I understand, you did not respect one simple rule of version
> control:
> * Never commit broken code.

I'm sorry I did that. I did try to keep to the rules, but I'm new to Git 
(mostly on the push/pull side), and to Ray, so occasionally I make 
mistakes in my commits.

> As Linus Torvalds would say: commit early, commit often.

This seems to contradict the statement you made in the same email. I
have some experience doing bisects (particularly with Wine), so I'm
aware of the pain involved when things don't compile, but I also know
that sometimes it can't be helped
[http://wiki.winehq.org/RegressionTesting#head-020f8ea312cecdbccdb0d7936f07b9a39b105791].
One typical example is library changes -- code may have worked fine at
an earlier date, but you need to include a change made in some future
build in order to get it working for the more modern libraries.

> Basically, you must search your own commits to find where you
> screwed. But you can not do that because you are committing broken
> code to your fork.

FWIW, I got into a state where I had 2-3 days' worth of code to add, and
I wanted to be able to commit at least once per day (so I could do
additional work from home, if I had inspiration). This meant that the
code was broken in parts, but I knew how to fix those parts, so it
didn't make much difference to me when bisecting. Git's bisect command
has a 'skip' subcommand that lets you bypass commits that can't be
tested, and the bisection still reduces the number of commits that need
to be checked manually.
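
To illustrate why a bisection with skipped commits still saves work,
here is a small Python simulation. It is only a sketch of the idea and
not git's actual algorithm; `bisect_with_skip` and the `status` callback
are hypothetical names made up for this example.

```python
def bisect_with_skip(n_commits, status):
    """Search for the first bad commit between a known-good commit 0
    and a known-bad commit n_commits - 1.

    status(i) returns "good", "bad", or "skip" (commit cannot be tested).
    Returns (candidates_for_first_bad, number_of_commits_tested).
    """
    good, bad = 0, n_commits - 1
    skipped = set()
    tests_run = 0
    while True:
        untested = [i for i in range(good + 1, bad) if i not in skipped]
        if not untested:
            break
        mid = (good + bad) // 2
        # Probe the testable commit closest to the midpoint.
        probe = min(untested, key=lambda i: abs(i - mid))
        tests_run += 1
        result = status(probe)
        if result == "skip":
            skipped.add(probe)
        elif result == "good":
            good = probe
        else:
            bad = probe
    # Everything left between good and bad was skipped, so any of those
    # commits (or `bad` itself) could be the first bad one.
    return list(range(good + 1, bad + 1)), tests_run


if __name__ == "__main__":
    def status(i):
        # Made-up history: commits 7-9 do not compile, breakage starts at 10.
        if i in (7, 8, 9):
            return "skip"
        return "bad" if i >= 10 else "good"

    print(bisect_with_skip(16, status))
```

When the breakage sits next to untestable commits, the search can only
narrow the answer down to a short list of suspects, but that list is
still much smaller than the full history.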

> Furthermore, I am not sure that all the emails about you debugging
> the bugs you introduced in your forks are of general interest for
> denovoassembler-users. I therefore created a development mailing list
> (should be up in maximum 24 hours).

Thanks for doing this. I like being able to CC a list, but didn't feel
entirely comfortable talking about code details on a -users list.

On 14/07/11 02:00, Sébastien Boisvert wrote:
>> So changing that is easy enough to do, it just means the Ray run
>> might take tens of seconds to finish, rather than seconds
> Yes, but beware that this simple test will fail to catch major bugs.

I'm aware that a full test would be a good idea, but I would be
interested to know what kinds of major bugs you expect to crop up that
would not be evident in an attempted assembly of a small genome. I guess
memory allocation bugs and message processing bugs are examples of
things that might not be caught with a small genome.

On 13/07/11 17:42, Sébastien Boisvert wrote:
> The code you are trying to modify is very stable and should not be
> modified. This code manages the living workers.

As opposed to this other section of code, which also manages the living
workers (line 21, sequencesIndexer.cpp)?

> /* TODO: find the memory leak in this file -- during the selection
> of optimal read markers, the memory goes up ? */

The bits I was looking at weren't quite the same area of code, but I
had narrowed down my assembly problem to that specific section, which
happened to come just after the "selection of optimal read markers". I
had to modify that section, because the alternative was broken code
that wouldn't assemble correctly.

> Does your institution have a SOLiD 5500xl ?

No, we have a SOLiD 4 that is collecting dust and has yet to be used.
I'm not particularly happy with colour-space myself, but if I'm going to
be working with it in the future, then I need to be confident that I can
process the data that comes off that machine. The SOLiD sequencing data
that I'm currently working with was generated by another lab.

On the other hand, our Illumina sequencer (HiScan SQ) has had at least
one successful run (possibly 3), and should get a fair bit of use over
the next few months. This is nice for me, because the base-space output
is much less confusing to handle, and there are many more applications
that will work in base-space.

> Life Technologies should write a base caller software that transform
> these color-space files into fastq files. All major vendors
> (Illumina, 454, Pacific Biosciences, Ion Torrent, Helicos, and
> probably Complete Genomics) do that already.

It doesn't really make sense to do the colour-space conversion until
after the assembly/mapping has been done. It's a fundamentally different 
technology which uses labelled dinucleotides to pair with DNA sequence. 
Here's a somewhat fabricated example that demonstrates how 
post-processing conversion can be useful:

|> 1_61_1487_F3
|T[3.03231010100100123]2112320311233112131121011303123
|> 1181_1125_1187_F3
|T2.0032320031212023231321020332[3203231010100100123]0

I've bracketed the similar sequences from both reads. If the
conversion to base-space were done on the sequencer, both sequences
would be effectively useless, because each colour encodes the
transition between two adjacent bases and a missing call ('.') near the
start of a read leaves every base after it undetermined. However, using
colour-space information from '1181_1125_1187_F3', you can fill in the
second colour for '1_61_1487_F3', and hence work out the entire
sequence. You could say that Life Technologies should just write their
own colour-space de-novo assembler in order to resolve base errors, but
then it's just working on colour-space, which removes the
error-reduction advantage of hybrid genome assemblers.
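
The example above can be sketched in Python as follows. This is a
simplified illustration, not how any particular assembler does it: the
function names are hypothetical, the two bracketed regions are assumed
to be aligned colour-for-colour, and the first colour of the bracketed
region is assumed to be a transition from the primer base 'T'.

```python
BASES = "ACGT"  # with A=0, C=1, G=2, T=3, a colour is the XOR of two base codes


def decode_colourspace(read):
    """Convert a primer base plus colour calls into bases.

    Once a colour call is missing ('.'), every later base is 'N',
    because each base depends on the base before it.
    """
    prev = read[0]
    out = []
    for colour in read[1:]:
        if prev == "N" or colour not in "0123":
            prev = "N"
        else:
            prev = BASES[BASES.index(prev) ^ int(colour)]
        out.append(prev)
    return "".join(out)


def fill_missing(colours, donor):
    """Replace '.' calls with the call at the same offset in an
    overlapping read (assumed to be aligned already)."""
    return "".join(d if c == "." else c for c, d in zip(colours, donor))


if __name__ == "__main__":
    region_1 = "3.03231010100100123"  # bracketed part of 1_61_1487_F3
    region_2 = "3203231010100100123"  # bracketed part of 1181_1125_1187_F3

    # Converted on its own, the first read is unknown after one base.
    print(decode_colourspace("T" + region_1))

    # With the missing colour filled in, the whole region decodes.
    print(decode_colourspace("T" + fill_missing(region_1, region_2)))
```

The same reasoning applies to the rest of each read: the colours after
the bracketed region are intact, so once the gap is filled they can be
decoded as well.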

Thanks for your time,
-- David Eccles (gringer)
