[Denovoassembler-devel] RE : Random thoughts about Ray
Ray -- Parallel genome assemblies for parallel DNA sequencing
From: Sébastien B. <seb...@ul...> - 2011-07-21 17:31:35
> ________________________________________
> From: David Eccles (gringer) [dav...@mp...]
> Sent: 19 July 2011 04:32
> To: den...@li...
> Subject: Re: [Denovoassembler-devel] Random thoughts about Ray
>
> On 18/07/11 19:43, Sébastien Boisvert wrote:
>>> I had to modify that section, because the alternative was broken
>>> code that wouldn't assemble correctly.
>> OK. So you think there was some sort of interaction going on ? How
>> did you fix it (provide commit) ?
>
> https://github.com/gringer/ray/commit/075ea052a6949a0974ac7d12bcec238f2cf49b58
>

Interesting, k-mers with very low coverage are used in your case. I believe this threshold prevents seeds from having low-quality k-mers at their ends.

> [SeedingData.cpp, not test_phiX.sh]
>
> I still think that the problem of a "perfect" phiX assembly not working
> was due to a parent node that wasn't considered for a seed start, with
> all of the child nodes having 1 edge in, 1 edge out. My hypothesis is
> that the seed start node had low coverage, so wasn't considered a good
> choice. I understand why this is a reasonable decision, but think that a
> better fix to this problem would be to also consider low coverage nodes
> in SeedWorker.cpp when evaluating if a parent node is a better fit for
> the start of a seed. For example, by adding a
> m_SEEDING_coverage_test_done check as well (or integrating that with the
> SEEDING_1_1_test).

I totally agree with you. I added a TODO in the code for that. I will set the threshold to 1, and if it works for all system tests I will remove it.

>
>> You are right, systematic errors confuse Ray and sequencing errors
>> consume a lot of memory. It is therefore wise to filter errors. Q20
>> (86% good) is adequate but I think using Q30 (95% good) is too
>> aggressive.
>
> Q10 (assuming Phred qualities) is a base call accuracy of 90% (i.e. 1 in
> 10), Q20 is 1 in 100. Every 10 is another 'nine' for quality -- 4 nines
> = 99.99% = Q40.
> For Q above 10, it's essentially the same value for
> Phred and Sanger quality scores.
>
> http://en.wikipedia.org/wiki/Phred_quality_score#Reliability
>

Q20 86%
Q30 95%
Q44 98%
Q50 99%

Q = -10 log (P) / log(10)
P=probability(bad)

>>> One typical example is library changes -- code may have worked fine
>>> at an earlier date, but you need to include a change made in some
>>> future build in order to get it working for the more modern
>>> libraries.
>> I agree. That is why the dependencies for Ray are basically only the
>> C++ standard library (which include the C library as well, I believe)
>> and a message-passing-interface library.
>
> Yes, any C program is considered a valid C++ program as well (there are
> some quirks to make this work properly).
>
> Given that you've used the boost random number generator for the read
> simulator, one other option that I would like for the Ray code would be
> the boost gzip and bzip2 filters (from Iostreams).
>

But keep in mind that the read simulator is not part of the Ray assembler (the single executable called Ray). No way I am going to include boost stuff.

> http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/bzip2.html#basic_bzip2_decompressor
> http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/gzip.html#basic_gzip_decompressor
>
> This would allow the use of the more C++-like iostream method of file
> access, even for compressed files, which means being able to use strings
> for all file access, rather than just for the uncompressed files (i.e.
> no more concerns about null-terminated strings). However, after doing
> many Debian updates, I am aware of how often the boost libraries change,
> and would not suggest having boost libraries as a core dependency.
>
> Once I'm done with the colour-space stuff, I'll see about making a more
> generic file loader class that can handle all formats.
>

Yeah. Presently, assembler/Loader.cpp is the "interface".
All formats in formats have a standard method name like load(). Loaders use lazy-loading, by the way.

>> There is also the command 'git stash', but I have not used it yet.
>
> Thanks for pointing that out, I wasn't aware of this command. It looks
> like it doesn't seem to work for remote repositories, so maybe I should
> just set up a temporary 'broken' or 'stash' branch when I'm doing things
> like that.
>

git reset --hard

This resets your repository to the last commit; pretty useful for discarding changes to tracked files after testing rsynced files.

>> For me, the hardest stuff are bugs occurring in parallel. Checkpoints
>> would help also, but I have not figured an easy way to implement them
>> in Ray yet.
>
> Sorry, I don't think I understand. I was under the impression that Ray
> already had stopping points, where it waits for all the other nodes to
> finish before continuing. I guess this is a different kind of checkpoint?
>

You are referring to what are called barriers. Checkpoints are a specialization of barriers wherein the state of the computation is saved in a file (or in many files). This is useful, for instance, if you are developing something that runs after 10 hours of computation. With checkpoints, you can start right at that point.

>> How many gigabases are generated with one HiScan SQ run ?
>
> From the human RNA run we did, there were about 40-50M sequences per
> lane, with each sequence having 50bp (paired end, so 100bp, I guess).
> With 8 lanes, that's about 45*8*100 ~= 36 gigabases per run. The
> specification is a bit higher than that (67-75Gb), but it's in the same
> ball-park:
>
> http://www.illumina.com/systems/hiscansq.ilmn#workflow_specs
>
>> But to do hybrid assemblies, you would have to convert everything to
>> color-space, right?
>
> Yeah, it's annoying that colour-space is the common denominator. But at
> least the conversion from base-space to colour-space doesn't lose
> information.
> For any given base-space sequence, there is only one valid
> colour-space sequence (as long as you include a starting base in the
> colour-space sequence).
>

OK

>> So imagine you have 2 companies: Company A and Company B. Company A
>> technology produces greek-letter-space reads. Company B technology
>> produces heat-space reads.
>
> Ugh. I would hope that other companies are seeing the light (or
> history), and will just all go back to base-space. As an aside, I'm glad
> that iontorrent uses a base-space output. Unfortunately, there are still
> people who demand that programs should work directly for these new space
> sequences, often just because that's what comes off their machines:
>
> http://seqanswers.com/forums/showpost.php?p=15245&postcount=7
>

nilshomer is right in the sense that if a tool truly works in this native space, then the analysis will be better. However, developing these tools is not easy.

>> The driver equivalent for sequencers would be a base caller that
>> works with color-space or other underlying formats.
>
> A generic program that converts anything into base-space would be nice,
> but I don't think it will be written by the sequencer companies -- they
> would care too much about their specific technology. More likely, it'll
> be written by a random programmer with too much time on their hands who
> is frustrated by all the different file formats that are being produced.
>

I am not sure, because it would require a lot of knowledge of the underlying sequencing technologies. This information is proprietary and rarely available. And nobody "jailbreaks" their Illumina HiSeq, so that is out of the question.

We are having a fine discussion I think.

> -- David
>
> Denovoassembler-devel mailing list
> Den...@li...
> https://lists.sourceforge.net/lists/listinfo/denovoassembler-devel
>

Sébastien Boisvert
http://github.com/sebhtml/ray
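
A note on the Phred figures exchanged above: the following is a minimal sketch, not code from Ray, of the conversion between a quality score and base-call accuracy. It assumes the standard Phred definition Q = -10 log10(P), which is what the formula quoted in the thread reduces to. The function names are hypothetical. The second helper shows that figures such as 86% for Q20 and 95% for Q30 are what you get if the natural logarithm is used without dividing by log(10).

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Standard Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def error_probability_to_phred(p: float) -> float:
    """Inverse of the above: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def accuracy_percent(q: float) -> float:
    """Percentage of base calls expected to be correct at quality Q."""
    return 100.0 * (1.0 - phred_to_error_probability(q))


def accuracy_percent_natural_log(q: float) -> float:
    """Same computation with Q = -10 * ln(P), i.e. without the division by ln(10).

    This is NOT the Phred definition; it is shown only to illustrate where
    values like 86% (Q20) and 95% (Q30) can come from.
    """
    return 100.0 * (1.0 - math.exp(-q / 10.0))


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: {accuracy_percent(q):.2f}% (base 10), "
              f"{accuracy_percent_natural_log(q):.2f}% (natural log)")
```

Under the base-10 definition each extra 10 points adds one 'nine' of accuracy, so a Q20 filter already keeps only calls with a 1 in 100 error rate.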
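
On the point that base-space to colour-space conversion loses no information as long as the starting base is kept: here is a small sketch under the assumption of the usual two-base (SOLiD-style) colour encoding, where each colour is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function names are hypothetical and nothing here is taken from Ray's loaders; ambiguous bases such as N are not handled.

```python
# 2-bit codes assumed for the two-base colour encoding.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def base_to_colour(seq: str) -> str:
    """Encode a base-space sequence as a starting base followed by colours 0-3."""
    if not seq:
        return ""
    colours = [
        str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b])  # colour of each adjacent pair
        for a, b in zip(seq, seq[1:])
    ]
    return seq[0] + "".join(colours)


def colour_to_base(cs: str) -> str:
    """Decode a colour-space sequence (starting base + colours) back to bases."""
    if not cs:
        return ""
    current = BASE_TO_BITS[cs[0]]
    bases = [cs[0]]
    for colour in cs[1:]:
        current ^= int(colour)  # XOR with the colour gives the next base
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)


if __name__ == "__main__":
    read = "ATGCCGTA"
    encoded = base_to_colour(read)
    print(read, "->", encoded, "->", colour_to_base(encoded))
```

The round trip only works because the first base is stored. Without it, four different base-space sequences would share the same string of colours.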