Re: [Denovoassembler-devel] Random thoughts about Ray
Ray -- Parallel genome assemblies for parallel DNA sequencing
From: David E. (gringer) <dav...@mp...> - 2011-07-19 08:33:01
On 18/07/11 19:43, Sébastien Boisvert wrote:
>> I had to modify that section, because the alternative was broken
>> code that wouldn't assemble correctly.
> OK. So you think there was some sort of interaction going on ? How
> did you fix it (provide commit) ?

https://github.com/gringer/ray/commit/075ea052a6949a0974ac7d12bcec238f2cf49b58
[SeedingData.cpp, not test_phiX.sh]

I still think that the problem of a "perfect" phiX assembly not working
was due to a parent node that wasn't considered for a seed start, with
all of the child nodes having 1 edge in, 1 edge out. My hypothesis is
that the seed start node had low coverage, so it wasn't considered a
good choice. I understand why this is a reasonable decision, but I think
a better fix would be to also consider low-coverage nodes in
SeedWorker.cpp when evaluating whether a parent node is a better fit for
the start of a seed. For example, by adding an
m_SEEDING_coverage_test_done check as well (or integrating that with
the SEEDING_1_1_test).

> You are right, systematic errors confuse Ray and sequencing errors
> consume a lot of memory. It is therefore wise to filter errors. Q20
> (86% good) is adequate but I think using Q30 (95% good) is too
> aggressive.

Q10 (assuming Phred qualities) is a base call accuracy of 90% (i.e. 1
error in 10), and Q20 is 99% (1 error in 100). Every 10 is another
'nine' for quality -- 4 nines = 99.99% = Q40. In other words, the error
probability is 10^(-Q/10). For Q above 10, Phred (Sanger) and Solexa
quality scores have essentially the same value.
http://en.wikipedia.org/wiki/Phred_quality_score#Reliability

>> One typical example is library changes -- code may have worked fine
>> at an earlier date, but you need to include a change made in some
>> future build in order to get it working for the more modern
>> libraries.
> I agree. That is why the dependencies for Ray are basically only the
> C++ standard library (which include the C library as well, I believe)
> and a message-passing-interface library.

Yes, almost any C program is a valid C++ program as well (there are
some quirks to make this work properly). Given that you've used the
boost random number generator for the read simulator, one other option
that I would like for the Ray code would be the boost gzip and bzip2
filters (from Iostreams).

http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/bzip2.html#basic_bzip2_decompressor
http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/gzip.html#basic_gzip_decompressor

This would allow the use of the more C++-like iostream method of file
access, even for compressed files, which means being able to use
strings for all file access, rather than just for the uncompressed
files (i.e. no more concerns about null-terminated strings). However,
after doing many Debian updates, I am aware of how often the boost
libraries change, and would not suggest having boost libraries as a
core dependency. Once I'm done with the colour-space stuff, I'll see
about making a more generic file loader class that can handle all
formats.

> There is also the command 'git stash', but I have not used it yet.

Thanks for pointing that out, I wasn't aware of this command. It looks
like it doesn't work with remote repositories, so maybe I should just
set up a temporary 'broken' or 'stash' branch when I'm doing things
like that.

> For me, the hardest stuff are bugs occurring in parallel. Checkpoints
> would help also, but I have not figured an easy way to implement them
> in Ray yet.

Sorry, I don't think I understand. I was under the impression that Ray
already had stopping points, where it waits for all the other nodes to
finish before continuing. I guess this is a different kind of
checkpoint?

> How many gigabases are generated with one HiScan SQ run ?

From the human RNA run we did, there were about 40-50M sequences per
lane, with each sequence having 50bp (paired end, so 100bp, I guess).
With 8 lanes, that's about 45M*8*100 ~= 36 gigabases per run.
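
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up for illustration, and 45M is simply the midpoint of the 40-50M sequences per lane quoted above.

```python
def run_yield_gigabases(reads_per_lane, lanes=8, bases_per_read=100):
    """Total bases for one run, in gigabases (1 gigabase = 1e9 bases).

    bases_per_read=100 assumes paired-end 2 x 50 bp, as in the estimate above.
    """
    return reads_per_lane * lanes * bases_per_read / 1e9


# Low end, midpoint and high end of the 40-50M sequences per lane range.
for reads_per_lane in (40_000_000, 45_000_000, 50_000_000):
    print(reads_per_lane, run_yield_gigabases(reads_per_lane))
```

Across the 40-50M range this gives 32 to 40 gigabases per run, with 36 at the midpoint.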
The specification is a bit higher than that (67-75Gb), but it's in the
same ball-park:
http://www.illumina.com/systems/hiscansq.ilmn#workflow_specs

> But to do hybrid assemblies, you would have to convert everything to
> color-space, right?

Yeah, it's annoying that colour-space is the common denominator. But at
least the conversion from base-space to colour-space doesn't lose
information. For any given base-space sequence, there is only one valid
colour-space sequence (as long as you include a starting base in the
colour-space sequence).

> So imagine you have 2 companies: Company A and Company B. Company A
> technology produces greek-letter-space reads. Company B technology
> produces heat-space reads.

Ugh. I would hope that other companies are seeing the light (or
history), and will just all go back to base-space. As an aside, I'm
glad that iontorrent uses a base-space output. Unfortunately, there are
still people who demand that programs should work directly for these
new space sequences, often just because that's what comes off their
machines:
http://seqanswers.com/forums/showpost.php?p=15245&postcount=7

> The driver equivalent for sequencers would be a base caller that
> works with color-space or other underlying formats.

A generic program that converts anything into base-space would be nice,
but I don't think it will be written by the sequencer companies -- they
would care too much about their specific technology. More likely, it'll
be written by a random programmer with too much time on their hands who
is frustrated by all the different file formats that are being
produced.

--
David
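
The point above about base-space to colour-space conversion being lossless can be illustrated with a short Python sketch. It assumes the usual two-base colour table (AA/CC/GG/TT = 0, AC/CA/GT/TG = 1, AG/GA/CT/TC = 2, AT/TA/CG/GC = 3) and, as a simplification, uses the first base of the sequence itself as the starting base. The function names are hypothetical and are not taken from Ray.

```python
# Two-bit base codes chosen so that colour = code(previous) XOR code(next),
# which reproduces the two-base colour table described above.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def to_colour_space(bases):
    """Encode a base-space string as its first base followed by colour digits."""
    if not bases:
        return ""
    codes = [BASE_TO_CODE[b] for b in bases.upper()]
    colours = [str(a ^ b) for a, b in zip(codes, codes[1:])]
    return bases[0].upper() + "".join(colours)


def to_base_space(colour_read):
    """Decode a colour-space string (starting base + colour digits) to bases."""
    if not colour_read:
        return ""
    code = BASE_TO_CODE[colour_read[0].upper()]
    bases = [CODE_TO_BASE[code]]
    for digit in colour_read[1:]:
        code ^= int(digit)  # each colour gives the step from the previous base
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


print(to_colour_space("ATGC"))   # A313
print(to_base_space("A313"))     # ATGC
```

Because each colour only describes the transition between neighbouring bases, the starting base is what makes the decoding unambiguous. Without it, the same colour digits would fit four different base-space sequences.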