Re: [Denovoassembler-devel] Random thoughts about Ray


On 18/07/11 19:43, Sébastien Boisvert wrote:
>> I had to modify that section, because the alternative was broken
>> code that wouldn't assemble correctly.
> OK. So you think there was some sort of interaction going on ? How
> did you fix it (provide commit) ?

https://github.com/gringer/ray/commit/075ea052a6949a0974ac7d12bcec238f2cf49b58

[SeedingData.cpp, not test_phiX.sh]

I still think that the problem of a "perfect" phiX assembly not working
was due to a parent node that wasn't considered for a seed start, with
all of the child nodes having 1 edge in, 1 edge out. My hypothesis is
that the seed start node had low coverage, so wasn't considered a good
choice. I understand why this is a reasonable decision, but think that a
better fix to this problem would be to also consider low coverage nodes
in SeedWorker.cpp when evaluating if a parent node is a better fit for
the start of a seed. For example, by adding a
m_SEEDING_coverage_test_done check as well (or integrating that with the
SEEDING_1_1_test).

> You are right, systematic errors confuse Ray and sequencing errors
> consume a lot of memory. It is therefore wise to filter errors. Q20
> (86% good) is adequate but I think using Q30 (95% good) is too
> aggressive.

Q10 (assuming Phred qualities) is a base call accuracy of 90% (i.e. a 1
in 10 chance of error), and Q20 is 99% (1 in 100). Every 10 is another
'nine' for quality -- 4 nines = 99.99% = Q40. For Q above 10, it's
essentially the same value for Phred and Solexa quality scores.

http://en.wikipedia.org/wiki/Phred_quality_score#Reliability
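
As a rough check of those numbers, here is a minimal Python sketch of
the Phred relationship (accuracy = 1 - 10^(-Q/10)). The function names
are made up for illustration. The second function assumes that the
other scale being compared is the older Solexa odds-based one,
Q = -10*log10(p/(1-p)); that is an assumption on my part, not something
stated above.

```python
import math


def phred_to_accuracy(q):
    """Base call accuracy implied by a Phred quality score."""
    return 1.0 - 10.0 ** (-q / 10.0)


def solexa_to_phred(q_solexa):
    """Convert an odds-based Solexa score to the equivalent Phred score.

    Assumes Q_solexa = -10*log10(p / (1 - p)), where p is the error
    probability, which gives Q_phred = 10*log10(10^(Q_solexa/10) + 1).
    """
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


for q in (10, 20, 30, 40):
    print(f"Q{q}: accuracy {phred_to_accuracy(q):.4%}, "
          f"Solexa {q} is about Phred {solexa_to_phred(q):.2f}")
```

Under that assumption the two scales differ by less than half a unit
at 10 and by a negligible amount from 20 upwards, but they diverge
noticeably for low (and negative) Solexa scores.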

>> One typical example is library changes -- code may have worked fine
>> at an earlier date, but you need to include a change made in some
>> future build in order to get it working for the more modern
>> libraries.
> I agree. That is why the dependencies for Ray are basically only the
> C++ standard library (which include the C library as well, I believe)
> and a message-passing-interface library.

Yes, almost any C program is a valid C++ program as well (there are
some quirks to make this work properly).

Given that you've used the boost random number generator for the read
simulator, one other option that I would like for the Ray code would be
the boost gzip and bzip2 filters (from Iostreams).

http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/bzip2.html#basic_bzip2_decompressor
http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/gzip.html#basic_gzip_decompressor

This would allow the use of the more C++-like iostream method of file
access, even for compressed files, which means being able to use strings
for all file access, rather than just for the uncompressed files (i.e.
no more concerns about null-terminated strings). However, after doing
many Debian updates, I am aware of how often the boost libraries change,
and would not suggest having boost libraries as a core dependency.

Once I'm done with the colour-space stuff, I'll see about making a more
generic file loader class that can handle all formats.

> There is also the command 'git stash', but I have not used it yet.

Thanks for pointing that out, I wasn't aware of this command. It
doesn't look like it works with remote repositories, so maybe I should
just set up a temporary 'broken' or 'stash' branch when I'm doing things
like that.

> For me, the hardest stuff are bugs occurring in parallel. Checkpoints
> would help also, but I have not figured an easy way to implement them
> in Ray yet.

Sorry, I don't think I understand. I was under the impression that Ray
already had stopping points, where it waits for all the other nodes to
finish before continuing. I guess this is a different kind of checkpoint?

> How many gigabases are generated with one HiScan SQ run ?

From the human RNA run we did, there were about 40-50M sequences per
lane, with each sequence having 50bp (paired end, so 100bp, I guess).
With 8 lanes, that's about 45M*8*100bp ~= 36 gigabases per run. The
specification is a bit higher than that (67-75Gb), but it's in the same
ball-park:

http://www.illumina.com/systems/hiscansq.ilmn#workflow_specs
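
The back-of-envelope estimate above, written out as a small Python
sketch. The function and parameter names are hypothetical, and it
assumes the 40-50M figure counts read pairs (2 x 50bp each) rather
than individual reads.

```python
def run_yield_gigabases(pairs_per_lane, lanes=8, read_length=50,
                        reads_per_pair=2):
    """Estimate the total bases in one run, in gigabases (1e9 bases)."""
    bases = pairs_per_lane * lanes * read_length * reads_per_pair
    return bases / 1e9


# Low, middle and high ends of the observed 40-50M pairs per lane.
for millions in (40, 45, 50):
    gigabases = run_yield_gigabases(millions * 1_000_000)
    print(f"{millions}M pairs/lane: about {gigabases:.0f} Gb per run")
```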

> But to do hybrid assemblies, you would have to convert everything to
> color-space, right?

Yeah, it's annoying that colour-space is the common denominator. But at
least the conversion from base-space to colour-space doesn't lose
information. For any given base-space sequence, there is only one valid
colour-space sequence (as long as you include a starting base in the
colour-space sequence).
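
To make that concrete, here is a minimal Python sketch of the
conversion in both directions. It assumes the usual two-base encoding
in which each colour is the XOR of the 2-bit codes of adjacent bases
(A=0, C=1, G=2, T=3). The function names are made up for illustration
and this is not code from Ray.

```python
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}


def to_colour_space(seq):
    """Encode a base-space sequence as its starting base plus colours."""
    if not seq:
        return ""
    # Each colour depends only on a pair of adjacent bases.
    colours = [str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colours)


def to_base_space(cs):
    """Decode a colour-space sequence that includes its starting base."""
    if not cs:
        return ""
    bases = [cs[0]]
    for colour in cs[1:]:
        # XOR the previous base with the colour to recover the next base.
        bases.append(BASES[CODE[bases[-1]] ^ int(colour)])
    return "".join(bases)


encoded = to_colour_space("ATGGC")
print(encoded, to_base_space(encoded))
```

The starting base matters: "ATGGC" and "CGTTA" produce the same
colours, so without it the colour string alone would be ambiguous
between four base-space sequences.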

> So imagine you have 2 companies: Company A and Company B. Company A
> technology produces greek-letter-space reads. Company B technology
> produces heat-space reads.

Ugh. I would hope that other companies are seeing the light (or
history), and will just all go back to base-space. As an aside, I'm glad
that Ion Torrent uses a base-space output. Unfortunately, there are still
people who demand that programs should work directly for these new space
sequences, often just because that's what comes off their machines:

http://seqanswers.com/forums/showpost.php?p=15245&postcount=7

> The driver equivalent for sequencers would be a base caller that
> works with color-space or other underlying formats.

A generic program that converts anything into base-space would be nice,
but I don't think it will be written by the sequencer companies -- they
would care too much about their specific technology. More likely, it'll
be written by a random programmer with too much time on their hands who
is frustrated by all the different file formats that are being produced.

-- David
