[Denovoassembler-devel] RE : Random thoughts about Ray
Ray -- Parallel genome assemblies for parallel DNA sequencing
From: Sébastien B. <seb...@ul...> - 2011-07-18 17:43:38
Hi David,

> ________________________________________
> From: David Eccles (gringer) [dav...@mp...]
> Sent: 18 July 2011 10:12
> To: Sébastien Boisvert; den...@li...
> Subject: Random thoughts about Ray
>
> On 15/07/11 14:40, Sébastien Boisvert wrote:
>
>>> So, for real phiX data, I prepared the sequence by masking out
>>> bases with quality scores less than 20, and then filtering out the
>>> 'N' bases
>
>> You don't need to filter anything with Ray.
>
> You have a great confidence in Ray's ability to deal with any type of
> sequencing mistake, but I'm afraid I don't share that confidence. There
> is no way I would carry out a genome assembly on data that only had low
> quality scores, and I would expect that garbage input would only serve
> to confuse Ray and make it spend a bit more time processing that data.

You are right: systematic errors confuse Ray, and sequencing errors
consume a lot of memory. It is therefore wise to filter errors. Q20
(86% good) is adequate, but I think using Q30 (95% good) is too
aggressive. Also, adapter sequences are sometimes messy and hard to
remove. But if sequencing errors are random, then erroneous k-mers will
not be redundant. (See our paper in the Journal of Computational
Biology.)

> On 13/07/11 17:50, Sébastien Boisvert wrote:
>
>> Well, as I understand, you did not respect one simple rule of version
>> control:
>
> * Never commit broken code.
>
> I'm sorry I did that. I did try to keep to the rules, but I'm new to Git
> (mostly on the push/pull side), and to Ray, so occasionally I make
> mistakes in my commits.

I started using git a few weeks ago, and I must say I am very satisfied
in comparison with my experience with Subversion. Initially, I committed
broken code too, and I still do occasionally. Finding regressions in my
own bugs can be hard, so I guess finding regressions in your code would
be even harder because of the different programming style.
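Coming back to the quality-filtering point above: masking low-quality
calls is a single pass over the FASTQ quality string. Here is a minimal
sketch, assuming Phred+33 ("Sanger") quality encoding; maskLowQuality is
a hypothetical helper for illustration, not something shipped with Ray:

```cpp
#include <cassert>
#include <string>

// Mask bases whose Phred quality is below a cutoff.
// Assumes Phred+33 quality encoding in the FASTQ file.
// NOTE: maskLowQuality is a hypothetical helper, not part of Ray.
std::string maskLowQuality(const std::string& bases,
                           const std::string& qualities,
                           int minimumQuality) {
    std::string masked(bases);
    for (size_t i = 0; i < masked.size() && i < qualities.size(); ++i) {
        int phred = static_cast<int>(qualities[i]) - 33; // Phred+33 offset
        if (phred < minimumQuality)
            masked[i] = 'N'; // low-confidence call becomes an 'N'
    }
    return masked;
}
```

With Q20 as the cutoff, any base whose estimated error probability is
worse than 1% becomes an 'N'; since random errors produce non-redundant
k-mers, this keeps most of the junk out of the k-mer table up front.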
Since I started working on Ray, I have done a lot of refactoring, and one
other important thing I learned is to work on only one particular thing
at a time. Trying to deal with many features or bugs at once usually does
not work well.

>> As Linus Torvalds would say: commit early, commit often.
>
> This seems to contradict the statement you made in the same email. I
> have some experience doing bisects (particularly with Wine), so I'm
> aware of the pain involved when things don't compile, but I also know
> that sometimes it can't be helped
> [http://wiki.winehq.org/RegressionTesting#head-020f8ea312cecdbccdb0d7936f07b9a39b105791].

I completely agree with you here. Sometimes bugs are just side effects of
other changes. Two months ago, I recall spending three weeks on a single
bug that caused a parallel infinite loop: one rank was looping endlessly
while all the other peers were waiting for a reply from the buggy peer. I
still think that 'commit early, commit often' is fine as long as the
committed code works.

> One typical example is library changes -- code may have worked fine at
> an earlier date, but you need to include a change made in some future
> build in order to get it working for the more modern libraries.

I agree. That is why the dependencies for Ray are basically only the C++
standard library (which includes the C library as well, I believe) and a
message-passing-interface (MPI) library.

>> Basically, you must search your own commits to find where you
>> screwed. But you can not do that because you are committing broken
>> code to your fork.
>
> FWIW, I got into a state where I had 2-3 days worth of code to add, and
> I wanted to be able to commit at least once per day (so I could do
> additional work from home, if I had inspiration).

I know what you mean; when I am stuck on buggy code and want to change
computers, I usually rsync my work. There is also the command
'git stash', but I have not used it yet.
> This meant that the code was broken, in parts, but I was aware of how
> to fix those parts, so it didn't make much difference to me when
> bisecting. The bisect function of git has a 'skip' function that allows
> you to bypass certain commits, which still reduces the number of manual
> searches that need to be done on code changes.

But I guess you must read somewhere (for instance, in the commit message)
whether a given commit must be skipped or not, right? This is a task I
don't want to do.

>> Furthermore, I am not sure that all the emails about you debugging
>> the bugs you introduced in your forks are of general interest for
>> denovoassembler-users. I therefore created a development mailing list
>> (should be up in maximum 24 hours).
>
> Thanks for doing this. I like being able to CC a list, but didn't feel
> entirely comfortable talking about code details on a -users list.

OK.

> On 14/07/11 02:00, Sébastien Boisvert wrote:
>
>>> So changing that is easy enough to do, it just means the Ray run
>>> might take tens of seconds to finish, rather than seconds
>
>> Yes, but beware that this simple test will fail to catch major bugs.
>
> I'm aware that a full test would be a good idea, but would be interested
> to know what kinds of major bugs you would expect would crop up that are
> not evident in an attempted assembly of a small genome. I guess memory
> allocation bugs and message processing would be examples of things that
> may not be caught for a small genome.

It depends on which part of the code you work on. For color-space, you
definitely want to write unit tests, but that is not always enough. When
I added the MAXKMERLENGTH option, I basically committed code that did not
work for k values above 32, although it still worked on classical values.
I also had a lot of things to modify in the communication layer, because
passing a 32-mer is not the same as passing a 64-mer. I remember this
part as being quite hard and quite time-consuming.
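To illustrate why a 64-mer cannot travel in the same message layout as a
32-mer: with 2 bits per nucleotide, one 64-bit word holds at most 32
bases, so longer k-mers spill into additional words, and every message
buffer must agree on how many words to pack. The sketch below is
illustrative only, assuming a simple 2-bit packing; it is not Ray's
actual encoding:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One 64-bit word stores up to 32 nucleotides at 2 bits each.
const int BASES_PER_WORD = 32;

// Number of 64-bit words needed to hold a k-mer of the given length.
int wordsNeeded(int kmerLength) {
    return (kmerLength + BASES_PER_WORD - 1) / BASES_PER_WORD;
}

// 2-bit code per nucleotide (A=0, C=1, G=2, T=3).
int encodeBase(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3; // 'T'
    }
}

// Pack a k-mer into as many 64-bit words as its length requires.
std::vector<uint64_t> encodeKmer(const std::string& kmer) {
    std::vector<uint64_t> words(wordsNeeded(kmer.size()), 0);
    for (size_t i = 0; i < kmer.size(); ++i) {
        size_t word = i / BASES_PER_WORD;
        int offset = 2 * (i % BASES_PER_WORD);
        words[word] |= static_cast<uint64_t>(encodeBase(kmer[i])) << offset;
    }
    return words;
}
```

A 32-mer fits in one word, but a 33-mer already needs two; code that
hard-codes one word per k-mer in its message buffers silently truncates
anything longer, which only a test with k above 32 will catch.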
This migration was test-driven. To learn where bugs can show up in Ray's
parallel approach, I suggest you read code I added recently; it is pretty
clean and mature. For example:

assembler/Partitioner.cpp
communication/NetworkTest.cpp
scaffolder/Scaffolder.cpp

For me, the hardest bugs are those that occur in parallel. Checkpoints
would help too, but I have not figured out an easy way to implement them
in Ray yet.

> On 13/07/11 17:42, Sébastien Boisvert wrote:
>
>> The code you are trying to modify is very stable and should not be
>> modified. This code manages the living workers.
>
> As opposed to this other section of code, which also manages the living
> workers (line 21, sequencesIndexer.cpp)?

I did some work on this bug, and I am pretty sure the "memory leak" is
related to the C++ standard library making too many memory allocation
calls. For instance, if you use a vector<T>, the allocator used will be
allocator<T>. But allocator<T> probably relies on malloc and free (or
something related), and those scale very badly when a lot of calls are
made. That causes intense memory fragmentation. I actually implemented a
memory compactor (see memory/DefragmentationGroup.cpp for a nice read).
For example, if I set the maximum number of workers to 1, I think I don't
get that problem. Anyway, the code works, and both the system tests and
the unit tests pass.

>> /* TODO: find the memory leak in this file -- during the selection
>> of optimal read markers, the memory goes up ? */
>
> The bits I was looking at weren't quite the same area of code, but I
> had narrowed down my assembly problem to that specific section of code,
> which just happened to be just after the "selection of optimal read
> markers". I had to modify that section, because the alternative was
> broken code that wouldn't assemble correctly.

OK. So you think there was some sort of interaction going on? How did
you fix it (can you provide the commit)?

>> Does your institution have a SOLiD 5500xl ?
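Back on the allocator point above: the usual remedy for malloc/free
pressure is to grab one large slab up front and hand out fixed-size
chunks from it by bumping a pointer, so the system allocator is called
once instead of thousands of times. ChunkPool below is a hypothetical
illustration of that idea only; it does not compact, which is what
DefragmentationGroup adds on top:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A minimal fixed-size chunk pool: one big allocation, then O(1)
// bump-pointer handouts. Illustrative sketch, not Ray's actual code.
class ChunkPool {
    std::vector<char> m_slab; // the single large allocation
    size_t m_chunkSize;
    size_t m_next;            // bump pointer into the slab
public:
    ChunkPool(size_t chunkSize, size_t chunkCount)
        : m_slab(chunkSize * chunkCount), m_chunkSize(chunkSize),
          m_next(0) {}

    // Hand out the next chunk, without touching malloc.
    void* allocate() {
        if (m_next + m_chunkSize > m_slab.size())
            return nullptr; // pool exhausted
        void* chunk = &m_slab[m_next];
        m_next += m_chunkSize;
        return chunk;
    }

    size_t chunksHandedOut() const { return m_next / m_chunkSize; }
};
```

Because all chunks live in one contiguous slab, the heap cannot fragment
around them, which is exactly the failure mode that many small
allocator<T> calls produce.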
> No, we have a SOLiD 4 that is collecting dust and has yet to be used.

I think people bought SOLiD-powered instruments because Applied
Biosystems (Life Technologies) produced the ABI 3730xl, the technology
used for the Human Genome Project. At least those instruments worked
properly!

> I'm not particularly happy with colour-space myself, but if I'll be
> working with it in the future, then I need to be confident that I can
> process the data that comes off that machine. The SOLiD sequencing data
> that I'm currently working with was done by another lab.

I presently work at the Ontario Institute for Cancer Research, at the
MaRS Discovery District in Toronto, and they have numerous SOLiD
sequencers; basically, the error rate of these SOLiD sequencers is higher
than expected... On the other hand, Illumina TruSeq technology actually
yields true sequences! (Sorry for the silly word play.)

> On the other hand, our Illumina sequencer (HiScan SQ) has had at least
> one successful run (possibly 3), and should get a fair bit of use over
> the next few months. This is nice for me, because the base-space output
> is much less confusing to handle, and there are many more applications
> that will work in base-space.

This is also my opinion. Furthermore, the error rate of SOLiD sequencers
is so high that it becomes very hard to analyze the resulting data. With
SOLiD technology, you also discard a lot of data when filtering on
quality values. This increases the cost per base, which is not desirable:
you basically double the cost when weeding out 50% of the reads. How many
gigabases are generated by one HiScan SQ run?

>> Life Technologies should write a base caller software that transforms
>> these color-space files into fastq files. All major vendors
>> (Illumina, 454, Pacific Biosciences, Ion Torrent, Helicos, and
>> probably Complete Genomics) do that already.
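For what it's worth, the color-to-base decoding itself is mechanically
simple once an anchor base is known: with the base order A=0, C=1, G=2,
T=3, the standard SOLiD dinucleotide transition table is exactly an XOR
of the 2-bit base codes. Here is a minimal sketch; decodeColorSpace is a
hypothetical helper, not any vendor's base caller, and it does not handle
'.' (a missing color call):

```cpp
#include <cassert>
#include <string>

// Decode one color transition: with A=0, C=1, G=2, T=3, the SOLiD
// transition table is an XOR of the 2-bit base codes.
char decodeOne(char previousBase, char color) {
    const std::string bases = "ACGT";
    int previousCode = static_cast<int>(bases.find(previousBase));
    int nextCode = previousCode ^ (color - '0'); // XOR gives the next base
    return bases[nextCode];
}

// Decode a color-space read given its primer (anchor) base.
// The primer base itself is not part of the reported sequence.
std::string decodeColorSpace(char primerBase, const std::string& colors) {
    std::string sequence;
    char current = primerBase;
    for (size_t i = 0; i < colors.size(); ++i) {
        current = decodeOne(current, colors[i]);
        sequence += current;
    }
    return sequence;
}
```

Note that a single wrong color flips every base downstream of it, which
is exactly why converting to base-space before assembly or mapping throws
away the error-correction opportunity described below.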
> It doesn't really make sense to do the colour-space conversion until
> after the assembly/mapping has been done. It's a fundamentally different
> technology which uses labelled dinucleotides to pair with DNA sequence.
> Here's a somewhat fabricated example that demonstrates how
> post-processing conversion can be useful:

I know, but regardless, the desired result is sequence data, not fancy
colored reads.

> |> 1_61_1487_F3
> |T[3.03231010100100123]2112320311233112131121011303123
> |> 1181_1125_1187_F3
> |T2.0032320031212023231321020332[3203231010100100123]0
>
> I've bracketed the similar sequences from both samples. If the
> conversion to base-space were done on the sequencer, both sequences
> would be effectively useless. However, using colour-space information
> from '1181_1125_1187_F3', you can fill in the second colour for
> '1_61_1487_F3', and hence work out the entire sequence. You could say
> that Life Technologies should just write their own colour-space de-novo
> assembler in order to resolve base errors, but then it's just working on

I don't think so. They should have developed something that enables them
to output base-space reads as well as color-space reads. And that is
presumably what they will do with their 5500xl:

"The 5500xl will feature exact call chemistry, new multiple base encoding
for improved accuracy, which will allow the data to be output in Base
Space or Color Space."

Source: https://www.genomics.upenn.edu/HTSTechnologies

In my opinion, sequencing by ligation sounds nice, but it must be
terribly hard to debug from an engineering point of view.

> colour-space, which removes the error-reduction advantage of hybrid
> genome assemblers.

But to do hybrid assemblies, you would have to convert everything to
color-space, right?

<Silly example>
So imagine you have two companies: Company A and Company B. Company A's
technology produces greek-letter-space reads. Company B's technology
produces heat-space reads.
Converting these reads to nucleotide-space is impossible because of the
error rates, but both companies say that software working directly in
their read spaces should do well. Now imagine companies A, B, C, ..., X,
Y, Z, each promoting a different read space. The problem becomes the same
as Linux driver development: you just need a common interface so that
each of them can be integrated. Userland software that sends packets over
a network is not aware of these numerous drivers, but it still does well
enough! The driver equivalent for sequencers would be a base caller that
works with color-space or any other underlying format. The only case
where people will adopt an exotic read space is when the new read space
provides a significant leap forward or a technological advance with
applications.
</Silly example>

I think the '.' characters in color-space reads make the problem harder.
Otherwise, you just have to look for enriched k-mers in the space
{0,1,2,3}^k, right? The option -write-kmers writes a file to disk
containing all the k-mers in color-space, along with their counts and
arcs.

> Thanks for your time,
>
> -- David Eccles (gringer)

Sébastien