denovoassembler-devel Mailing List for Ray: scalable assembly (Page 4)
Ray -- Parallel genome assemblies for parallel DNA sequencing
Brought to you by:
sebhtml
You can subscribe to this list here.
2011 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
(42) |
Aug
(4) |
Sep
|
Oct
|
Nov
|
Dec
|
---|---|---|---|---|---|---|---|---|---|---|---|---|
2012 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
(2) |
Nov
(8) |
Dec
(4) |
2013 |
Jan
(6) |
Feb
(21) |
Mar
|
Apr
|
May
(3) |
Jun
|
Jul
|
Aug
|
Sep
(1) |
Oct
(3) |
Nov
|
Dec
|
2014 |
Jan
|
Feb
(8) |
Mar
(10) |
Apr
|
May
(1) |
Jun
(5) |
Jul
|
Aug
(1) |
Sep
|
Oct
|
Nov
|
Dec
|
From: Sébastien B. <seb...@ul...> - 2011-08-02 18:26:24
|
Thanks ! I applied your patch. > When a sequence didn't produce any contigs, I ended up wth a segfault in > the make fusions method. gdb very quickly narrowed it down to the array > operator, so I've attached a patch that should fix this problem. > > David Sébastien Boisvert http://github.com/sebhtml/ray |
From: David E. (gringer) <dav...@mp...> - 2011-08-02 15:26:19
|
When a sequence didn't produce any contigs, I ended up wth a segfault in the make fusions method. gdb very quickly narrowed it down to the array operator, so I've attached a patch that should fix this problem. David |
From: David E. (gringer) <dav...@mp...> - 2011-07-27 12:33:21
|
For testing, here's the coverage distribution that was created by combining all the Smed RNA reads I have: http://user.interface.org.nz/~gringer/hacking/planaria_Ray_5_2011-07-26_RayOutput.CoverageDistribution.txt [~140kb] I had a bit of trouble with firefox freezing when I tried pastebin, which is why it's not there. For what it's worth, some of these RNA transcripts (the low-expression transcripts) only have about 50-80 hits for reads from a single platform. While I don't expect to be able to get the sequences from these transcripts -- for one I looked at, the longest stretch with coverage greater than 5 was about 70bp -- it would be nice to have some way to tune Ray to work with RNA data that has similar distributions of coverage. The analysis is as follows: k-mer length: 31 Lowest coverage observed: 1 MinimumCoverage: 215 PeakCoverage: 216 RepeatCoverage: 217 Number of k-mers with at least MinimumCoverage: 1944260 k-mers Estimated genome length: 972130 nucleotides Percentage of vertices with coverage 1: 64.0808 % DistributionFile: results/Ray_5_2011-07-26/RayOutput.CoverageDistribution.txt I'm not quite sure I agree with a Peak/Repeat difference of 1, but given that I'm not sure what the numbers are used for, I'll leave that strangeness for someone else to comment on. Note that it has a the big decrease in k-mers found for the first bit of the distribution. I'm not sure if this is because of the nature of RNA data or something else. -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-27 08:39:17
|
I introduced a bug in sebhtml/git by carrying out the following code change: diff --git a/code/structures/Kmer.cpp b/code/structures/Kmer.cpp index 83e145b..10e4871 100644 --- a/code/structures/Kmer.cpp +++ b/code/structures/Kmer.cpp @@ -155,8 +155,8 @@ uint8_t Kmer::getFirstSegmentFirstCode(int w){ int Kmer::vertexRank(int _size,int w,bool color){ Kmer b=complementVertex(w,color); if(isLower(&b)) - b=*this; - return b.hash_function_1()%(_size); + return b.hash_function_1()%(_size); + return hash_function_1()%(_size); } /** In other words, I changed vertexRank to the following: int Kmer::vertexRank(int _size,int w,bool color){ Kmer b=complementVertex(w,color); if(isLower(&b)) return b.hash_function_1()%(_size); return hash_function_1()%(_size); } This shouldn't be a problem, right? All I've done is remove the copy from *this to b, which (in the absence of a copy constructor) should do a shallow copy of the fields of (*this) [i.e. create a new uint64_t[KMER_U64_ARRAY_SIZE], and copy the bytes across]. I can't see how the function outcome can be changed by doing this. [there are a few other places in the code where a similar copy is done] However, I get segfaults after making this change, and I don't know why. In order to get my phiX data to work on sebhtml/git, I needed to adjust the minimumY for CoverageDistribution: diff --git a/code/graph/CoverageDistribution.cpp b/code/graph/CoverageDistribution.cpp index 82cae53..db6cc60 100644 --- a/code/graph/CoverageDistribution.cpp +++ b/code/graph/CoverageDistribution.cpp @@ -48,7 +48,7 @@ CoverageDistribution::CoverageDistribution(map<int,uint64_t>*distributionOfCover int windowSize=10; int minimumX=1; - uint64_t minimumY=2*4096; + uint64_t minimumY=1; uint64_t minimumY2=55000; int maximumX=65535-1; int safeThreshold=256; If you want the coverage distribution for this, you can find it here: http://pastebin.com/KG01Tepp If I run with one processor, everything is fine: $ mpirun -np 1 ../../../sebgit/ray/code/Ray -write-seeds -k 10 -p tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.fasta | grep -A 8 'Number of contigs' Number of contigs: 1 Total length of contigs: 5382 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5382 Number of scaffolds: 1 Total length of scaffolds: 5382 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5382 Rank 0 wrote RayOutput.Contigs.fasta Rank 0 wrote RayOutput.Scaffolds.fasta But when I run with more than one processor, there is a segfault: $ mpirun -np 2 ../../../sebgit/ray/code/Ray -write-seeds -k 10 -p tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.fasta | grep -A 8 'Number of contigs' [thaliana:23082] *** Process received signal *** [thaliana:23082] Signal: Segmentation fault (11) [thaliana:23082] Signal code: Address not mapped (1) [thaliana:23082] Failing at address: 0x8 [thaliana:23082] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf020) [0x7fc03c7aa020] [thaliana:23082] [ 1] ../../../sebgit/ray/code/Ray(_ZN16MessageProcessor30call_RAY_MPI_TAG_VERTICES_DATAEP7Message+0x29f) [0x43306f] [thaliana:23082] [ 2] ../../../sebgit/ray/code/Ray(_ZN7Machine10runVanillaEv+0x75) [0x445695] [thaliana:23082] [ 3] ../../../sebgit/ray/code/Ray(_ZN7Machine5startEv+0xd57) [0x447767] [thaliana:23082] [ 4] ../../../sebgit/ray/code/Ray(main+0x2b) [0x42608b] [thaliana:23082] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd) [0x7fc03c435ead] [thaliana:23082] [ 6] ../../../sebgit/ray/code/Ray() [0x4262e1] [thaliana:23082] *** End of error message *** -------------------------------------------------------------------------- mpirun noticed that process rank 0 with PID 23082 on node thaliana exited on signal 11 (Segmentation fault). -------------------------------------------------------------------------- So, I can turn on debug and get the same bug: diff --git a/Makefile b/Makefile index d1aca41..fcd7b7e 100644 --- a/Makefile +++ b/Makefile @@ -83,7 +83,7 @@ OPTIMIZE = y # profiling GPROF = n -DEBUG = n +DEBUG = y ifeq ($(GPROF),y) OPTIMIZE = n $ mpirun -np 2 xterm -e gdb --args ../../../sebgit/ray/code/Ray -write-seeds -k 10 -p tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.fasta | grep -A 8 'Number of contigs' [from window for rank 0] Program received signal SIGSEGV, Segmentation fault. 0x00000000004c5aa7 in MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA ( this=0x7b04c8, message=0x9ca280) at code/communication/MessageProcessor.cpp:465 465 if(candidate->m_count<m_parameters->getMinimumCoverageToStore()) (gdb) bt #0 0x00000000004c5aa7 in MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA ( this=0x7b04c8, message=0x9ca280) at code/communication/MessageProcessor.cpp:465 #1 0x00000000004c3e61 in MessageProcessor::processMessage (this=0x7b04c8, message=0x9ca280) at code/communication/MessageProcessor.cpp:65 #2 0x00000000004e27fe in Machine::processMessages (this=0x79e090) at code/core/Machine.cpp:614 #3 0x00000000004e24ba in Machine::runVanilla (this=0x79e090) at code/core/Machine.cpp:533 #4 0x00000000004e2492 in Machine::run (this=0x79e090) at code/core/Machine.cpp:517 #5 0x00000000004e1f24 in Machine::start (this=0x79e090) at code/core/Machine.cpp:462 #6 0x00000000005469a4 in main (argc=7, argv=0x7fffffffdf48) at code/assembler/ray_main.cpp:29 (gdb) print candidate $1 = (KmerCandidate *) 0x0 So that code change has caused a bug in the MessageProcessor, specifically the call_RAY_MPI_TAG_VERTICES_DATA function.... but I'm using DEBUG, which should activate ASSERT, which should verify canidate == NULL with an assert. I eventually worked out that I need to make another change to the makefile to remove an erroneous space (cat -A is used here to make the change obvious): $ git diff Makefile | cat -A diff --git a/Makefile b/Makefile$ index d1aca41..8ca7842 100644$ --- a/Makefile$ +++ b/Makefile$ @@ -83,7 +83,7 @@ OPTIMIZE = y$ $ # profiling$ GPROF = n$ -DEBUG = n$ +DEBUG = y$ $ ifeq ($(GPROF),y)$ ^IOPTIMIZE = n$ @@ -93,7 +93,7 @@ endif$ ifeq ($(DEBUG),y)$ ^IOPTIMIZE = n$ ^IFORCE_PACKING = n$ -^IASSERT = y $ +^IASSERT = y$ endif$ $ PEDANTIC = n$ So now I get the assert failing, which means the segfault doesn't get reached: $ mpirun -np 2 xterm -e gdb --args ../../../sebgit/ray/code/Ray -write-seeds -k 10 -p tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.fasta | grep -A 8 'Number of contigs [from window for rank 1] Ray: code/communication/MessageProcessor.cpp:462: void MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA(Message*): Assertion `candidate!=__null' failed. Program received signal SIGABRT, Aborted. 0x00007ffff5f3a405 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. in ../nptl/sysdeps/unix/sysv/linux/raise.c (gdb) bt #0 0x00007ffff5f3a405 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007ffff5f3d680 in abort () at abort.c:92 #2 0x00007ffff5f335b1 in __assert_fail ( assertion=0x563296 "candidate!=__null", file=<value optimized out>, line=462, function=0x563f40 "void MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA(Message*)") at assert.c:81 #3 0x00000000004c8d37 in MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA ( this=0x7c44c8, message=0x9de2c0) at code/communication/MessageProcessor.cpp:462 #4 0x00000000004c6f8d in MessageProcessor::processMessage (this=0x7c44c8, message=0x9de2c0) at code/communication/MessageProcessor.cpp:65 #5 0x00000000004e6872 in Machine::processMessages (this=0x7b2090) at code/core/Machine.cpp:614 #6 0x00000000004e64e8 in Machine::runVanilla (this=0x7b2090) at code/core/Machine.cpp:533 #7 0x00000000004e64c0 in Machine::run (this=0x7b2090) at code/core/Machine.cpp:517 #8 0x00000000004e5f52 in Machine::start (this=0x7b2090) at code/core/Machine.cpp:462 #9 0x0000000000550ac0 in main (argc=7, argv=0x7fffffffdf48) at code/assembler/ray_main.cpp:29 So, it's segfaulting because it can't find a k-mer (I've done additional tests to verify that, as expected, both the k-mer and its reverse complement cannot be found), which presumably means that the k-mer that it's looking for wasn't inserted into the graph. Just as a reminder, everything still works with a single processor: $ mpirun -np 1 ../../../sebgit/ray/code/Ray -write-seeds -k 10 -p tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.fasta | grep -A 8 'Number of contigs' Number of contigs: 1 Total length of contigs: 5382 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5382 Number of scaffolds: 1 Total length of scaffolds: 5382 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5382 Rank 0 wrote RayOutput.Contigs.fasta Rank 0 wrote RayOutput.Scaffolds.fasta I just don't understand why this removal of a k-mer copy breaks code. What's going on here? Any insight would be appreciated. -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-26 15:04:30
|
Send me your problematic phix CoverageDistribution.txt file and I will add it to my unit tests. If coverage decreases sharply from the starts, chances are that it won't work with Ray. > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 22 juillet 2011 07:13 > À : den...@li... > Objet : [Denovoassembler-devel] [PATCH] clean up / simplify CoverageDistribution function > > I've been trying to clean up the coverage distribution function to make > it easier for me to understand. > > Using this patch, it works with my simulated phiX data (because of the > fallback to max coverage with no votes -- it looked like this was in the > code before?). > > However, it fails for the S.med 454 data (transcriptome data, in which > coverage values decrease sharply from the start). I noticed that I can > 'fix' the peak finding code by making this change (about line 83): > > if((votes[i] > votes[largestPosition]) > -> > if((votes[i] >= votes[largestPosition]) > > But then the minimum position is set to the largest position, and the > assembler panics. I need to set the minimum coverage to the coverage at > the first position for things to work with that data: > > m_minimumCoverage=x[0]; // about line 98 > > I haven't put these additional changes in the patch, because I'm not > sure how they will affect other things... and [as I've just noticed] I > get a segfault on the 454 data with my code when doing this.... I'll try > to nail that down in the next day or so. > > -- David > Sébastien Boisvert http://github.com/sebhtml/ray |
From: Sébastien B. <seb...@ul...> - 2011-07-26 14:52:24
|
In old versions of Ray, only the MPI rank 0 read the arguments but now all the MPI ranks do so. This should not affect how things work. > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 25 juillet 2011 06:32 > À : Sébastien Boisvert > Cc : den...@li... > Objet : colour space mode [possibly] not sent from config to other nodes > > While going through the Ray code to remove the colorSpace > option/parameter (because it's not necessary for my code), I noticed > that in Machine.cpp::call_RAY_MASTER_MODE_LOAD_CONFIG() there were lines > that stored the word size and colour-space mode (about lines 680-690), > but no corresponding colourSpace storage function in > MessageProcessor.cpp::call_RAY_MPI_TAG_SET_WORD_SIZE(Message*message) > (about line 230). I'm not sure if this would affect how things work or not. > > -- David Sébastien Boisvert http://github.com/sebhtml/ray |
From: David E. (gringer) <dav...@mp...> - 2011-07-25 16:37:19
|
I noticed earlier today that my code was converting all the csfasta reads into base space before sending them to the Read object, which meant that the rest of the code only saw base-space things. So, my "colour-space" assemblies were only colour-space when they were in the input files. In other words, Ray was doing everything just as it had previously, except at the file reading stage. This has made me think again about the usefulness of colour-space as an internal storage format. Here's my current list of advantages: * Reverse complement is a simple reverse of the sequence * No conversion needed for colour-space sequences, as long as it's the start of the sequence * unknown first bases allow for more sequence recovery Here are a few disadvantages for using colour-space as an internal format: * k-mers need at least one extra base, ideally two to make reverse-complement into base-space a fast process. * conversion needed for base-space sequences * conversion needed for colour-space sequences if not at the start of the sequence (need to know first base) * numerical output for genetic data is confusing * unknown first bases make either Kmer Academy or assembly more difficult The ABI document, 'Color Space Analysis in the SOLiD System: the Theory, Advantages and Solutions', claims "some of the color space rules can be applied to detect error during de novo assembly". Given my understanding of how Ray works for assembly (e.g. incoming/outgoing edges), I just can't see how the colour-space assembly can improve matters. You would lose a bit of information in removing the colour-space data that doesn't transform perfectly into base-space, but there's a good chance that the bad transformation is an indicator of bad data. As such, I think I'll concentrate on converting colour-space to base-space at the Read stage, and forget about the unknown first base stuff for the Kmer academy. -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-25 10:33:08
|
While going through the Ray code to remove the colorSpace option/parameter (because it's not necessary for my code), I noticed that in Machine.cpp::call_RAY_MASTER_MODE_LOAD_CONFIG() there were lines that stored the word size and colour-space mode (about lines 680-690), but no corresponding colourSpace storage function in MessageProcessor.cpp::call_RAY_MPI_TAG_SET_WORD_SIZE(Message*message) (about line 230). I'm not sure if this would affect how things work or not. -- David |
From: Eccles, D. <dav...@mp...> - 2011-07-23 12:46:52
|
From: Eccles, David > ... and [as I've just noticed] I get a segfault on the 454 data with > my code when doing this.... I'll try to nail that down in the next day or so. Because there weren't already unit tests for Kmer a.compareTo(&b), I didn't think to put them in my code. I only discovered the bug from a segfault that happened because Ray was looking for a palindromic sequence with different starting bases, and couldn't find it (something like A300330303033003A vs the reverse-complement T300330303033003T). The problem was that it was clearing the first/last bases when it wasn't meant to (i.e. first bases known), and not clearing them when it was meant to (i.e. at least one first base unknown). https://github.com/gringer/ray/commit/451f766a5ce8b7da4ce6d340c234e7d4c5c7ca4 f ... I guess Kmer unit tests will be my next mini-project for Ray. I've been thinking again about unknown first bases and assembly, and think that probably the best thing to do would be to only store unknown first bases in the Kmer academy. In other words, graduation would require both a known first-base, and for the k-mer to occur more than once. It would mean that there's no messing round with unknown bases when doing the assembly. However, it would also mean that an assembly couldn't be done on sequences that all have a misread at their second position. -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-22 11:14:47
|
I've been trying to clean up the coverage distribution function to make it easier for me to understand. Using this patch, it works with my simulated phiX data (because of the fallback to max coverage with no votes -- it looked like this was in the code before?). However, it fails for the S.med 454 data (transcriptome data, in which coverage values decrease sharply from the start). I noticed that I can 'fix' the peak finding code by making this change (about line 83): if((votes[i] > votes[largestPosition]) -> if((votes[i] >= votes[largestPosition]) But then the minimum position is set to the largest position, and the assembler panics. I need to set the minimum coverage to the coverage at the first position for things to work with that data: m_minimumCoverage=x[0]; // about line 98 I haven't put these additional changes in the patch, because I'm not sure how they will affect other things... and [as I've just noticed] I get a segfault on the 454 data with my code when doing this.... I'll try to nail that down in the next day or so. -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-22 08:06:20
|
On 21/07/11 19:16, Sébastien Boisvert wrote: >> However, this is probably more interesting, using a colour-space >> sequence for one end, and a base-space sequence for the other end: > What do you mean ? It was a quick test of combining colour-space reads and base-space reads together in one run. While it's not expected that first and second reads will be in a different space, that is a side-effect of allowing combined spaces as input files. > But mine does not produce the contigs in nucleotide space. > And I believe yours does ! Yes, it does. The assembly for my fork is done in base-space, while the internal representation for reads and k-mers is colour-space. This means that the hash values for k-mers will differ from your code, but in other respects, other code doesn't need to know the difference. FWIW, the current behaviour of my code could be shoehorned into sebhtml/git by converting all reads into base-space. I don't yet use any k-mers with unknown first bases, but I expect to add that in once I've worked out an appropriate place to do it. I probably need to get the assembly (or at least seed extension) to happen in colour-space so that sequences with a different first base but the same colour-space sequence are lumped together. > So the next steps is to test your work on the system tests I guess. The only "system tests" I have at the moment aren't a particularly good representation of the data Ray has been designed to work on: phiX-simulated: small genome, synthetic data phiX-sequenced: small genome, circularised genome S. mediterranea: transcriptome, rather than genome ?E. coli: colour-space data with high error-rate At the moment, I've only really done any testing on the two phiX datasets. -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-21 17:31:35
|
> > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 19 juillet 2011 04:32 > À : den...@li... > Objet : Re: [Denovoassembler-devel] Random thoughts about Ray > > On 18/07/11 19:43, Sébastien Boisvert wrote: >>> I had to modify that section, because the alternative was broken >>> code that wouldn't assemble correctly. >> OK. So you think there was some sort of interaction going on ? How >> did you fix it (provide commit) ? > > https://github.com/gringer/ray/commit/075ea052a6949a0974ac7d12bcec238f2cf49b58 > Interesting, k-mers with very low coverage are used in your case. I believe this threshold prevent seeds from having low-quality k-mers at their ends. > [SeedingData.cpp, not test_phiX.sh] > > I still think that the problem of a "perfect" phiX assembly not working > was due to a parent node that wasn't considered for a seed start, with > all of the child nodes having 1 edge in, 1 edge out. My hypothesis is > that the seed start node had low coverage, so wasn't considered a good > choice. I understand why this is a reasonable decision, but think that a > better fix to this problem would be to also consider low coverage nodes > in SeedWorker.cpp when evaluating if a parent node is a better fit for > the start of a seed. For example, by adding a > m_SEEDING_coverage_test_done check as well (or integrating that with the > SEEDING_1_1_test). I totally agree with you. I added a TODO in the code for that. I will set the threshold to 1 and if it works for all system tests I will remove it. > >> You are right, systematic errors confuse Ray and sequencing errors >> consume a lot of memory. It is therefore wise to filter errors. Q20 >> (86% good) is adequate but I think using Q30 (95% good) is too >> aggressive. > > Q10 (assuming Phred qualities) is a base call accuracy of 90% (i.e. 1 in > 10), Q20 is 1 in 100. Every 10 is another 'nine' for quality -- 4 nines > = 99.99% = Q40. For Q above 10, it's essentially the same value for > Phred and Sanger quality scores. > > http://en.wikipedia.org/wiki/Phred_quality_score#Reliability > Q20 86% Q30 95% Q44 98% Q50 99% Q = -10 log (P) / log(10) P=probability(bad) >>> One typical example is library changes -- code may have worked fine >>> at an earlier date, but you need to include a change made in some >>> future build in order to get it working for the more modern >>> libraries. >> I agree. That is why the dependencies for Ray are basically only the >> C++ standard library (which include the C library as well, I believe) >> and a message-passing-interface library. > > Yes, any C program is considered a valid C++ program as well (there are > some quirks to make this work properly). > > Given that you've used the boost random number generator for the read > simulator, one other option that I would like for the Ray code would be > the boost gzip and bzip2 filters (from Iostreams). > But keep in mind that the read simulator is not part of the Ray assembler (the single executable called Ray). No way I am going to include boost stuff. > http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/bzip2.html#basic_bzip2_decompressor > http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/gzip.html#basic_gzip_decompressor > > This would allow the use of the more C++-like iostream method of file > access, even for compressed files, which means being able to use strings > for all file access, rather than just for the uncompressed files (i.e. > no more concerns about null-terminated strings). However, after doing > many Debian updates, I am aware of how often the boost libraries change, > and would not suggest having boost libraries as a core dependency. > > Once I'm done with the colour-space stuff, I'll see about making a more > generic file loader class that can handle all formats. > Yeah. Presently, assembler/Loader.cpp is the "interface". All formats in formats have standard method name like load(). Loaders use lazy-loading by the way. >> There is also the command 'git stash', but I have not used it yet. > > Thanks for pointing that out, I wasn't aware of this command. It looks > like it doesn't seem to work for remote repositories, so maybe I should > just set up a temporary 'broken' or 'stash' branch when I'm doing things > like that. > git reset --hard this resets your repository, pretty useful to remove files after testing rsynced files. >> For me, the hardest stuff are bugs occurring in parallel. Checkpoints >> would help also, but I have not figured an easy way to implement them >> in Ray yet. > > Sorry, I don't think I understand. I was under the impression that Ray > already had stopping points, where it waits for all the other nodes to > finish before continuing. I guess this is a different kind of checkpoint? > You are referring to what is call barriers. Checkpoints are a specialization of barriers wherein the state of the computation is saved in a file (or in many files). This is useful for instance if you are developping something that runs after 10 hours of computation. With checkpoints, you can start right at that point. >> How many gigabases are generated with one HiScan SQ run ? > > From the human RNA run we did, there were about 40-50M sequences per > lane, with each sequence having 50bp (paired end, so 100bp, I guess). > With 8 lanes, that's about 45*8*100 ~= 36 gigabases per run. The > specification is a bit higher than that (67-75Gb), but it's in the same > ball-park: > > http://www.illumina.com/systems/hiscansq.ilmn#workflow_specs > >> But to do hybrid assemblies, you would have to convert everything to >> color-space, right? > > Yeah, it's annoying that colour-space is the common denominator. But at > least the conversion from base-space to colour-space doesn't lose > information. For any given base-space sequence, there is only one valid > colour-space sequence (as long as you include a starting base in the > colour-space sequence). > OK >> So imagine you have 2 companies: Company A and Company B. Company A >> technology produces greek-letter-space reads. Company B technology >> produces heat-space reads. > > Ugh. I would hope that other companies are seeing the light (or > history), and will just all go back to base-space. As an aside, I'm glad > that iontorrent uses a base-space output. Unfortunately, there are still > people who demand that programs should work directly for these new space > sequences, often just because that's what comes off their machines: > > http://seqanswers.com/forums/showpost.php?p=15245&postcount=7 > nilshomer is right in the sense that if a tool truly work in this native space, then analysis will be better. However, developping these tools is not easy. >> The driver equivalent for sequencers would be a base caller that >> works with color-space or other underlying formats. > > A generic program that converts anything into base-space would be nice, > but I don't think it will be written by the sequencer companies -- they > would care too much about their specific technology. More likely, it'll > be written by a random programmer with too much time on their hands who > is frustrated by all the different file formats that are being produced. > I am not sure because it would require a lot of knowledge of the underlying sequencing technologies. This information is proprietary and rarely available. And nobody "jailbreak" their Illumina HiSeq so that out of the question. We are having a fine discussion I think. > -- David > > Denovoassembler-devel mailing list > Den...@li... > https://lists.sourceforge.net/lists/listinfo/denovoassembler-devel > Sébastien Boisvert http://github.com/sebhtml/ray |
From: Sébastien B. <seb...@ul...> - 2011-07-21 17:16:19
|
> > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 19 juillet 2011 10:34 > À : den...@li... > Objet : [Denovoassembler-devel] Colour-space loading / assembly now working (produces base-space output) > > So, things are finally looking like the colour-space stuff is going > somewhere. I've managed to get a good result from a Ray run with > colour-space input (no errors, first bases included), and it might be a > tad quicker than the sebhtml/ray colour-space assembler: > Nice ! So, the reads are in color-space and the contigs are in nucleotide-space ? If so, this is very cool ! > $ time ../../../sebgit/ray/code/Ray -p tests/phix/phix_5k_1.csfasta > tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:' > Number of contigs: 1 > Total length of contigs: 5380 > Number of contigs >= 500 nt: 1 > Total length of contigs >= 500 nt: 5380 > Number of scaffolds: 1 > Total length of scaffolds: 5380 > Number of scaffolds >= 500 nt: 1 > Total length of scaffolds >= 500: 5380 > > real 0m3.107s > user 0m2.432s > sys 0m0.652s > > $ time ../code/Ray -p tests/phix/phix_5k_1.csfasta > tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:' > Number of contigs: 1 > Total length of contigs: 5243 > Number of contigs >= 500 nt: 1 > Total length of contigs >= 500 nt: 5243 > Number of scaffolds: 1 > Total length of scaffolds: 5243 > Number of scaffolds >= 500 nt: 1 > Total length of scaffolds >= 500: 5243 > > real 0m2.894s > user 0m2.392s > sys 0m0.484s > > $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | > grep $(fasta_formatter -i RayOutput.Scaffolds.fasta | grep -v '^>') > > /dev/null && echo "success (match in reverse direction)" > success (match in reverse direction) > > There's nothing particularly impressive with this (apart from it > actually working...). For what it's worth, the sebhtml/ray also produces > a correct sequence, if you feed it the right starting base: > But mine does not produce the contigs in nucleotide space. And I believe yours does ! > $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | > grep $((echo -n 'G'; tail -n +2 RayOutput.Scaffolds.fasta) | > ~/scripts/cs2base.pl | fasta_formatter) > /dev/null && echo "success > (match in forward direction)" > success (match in forward direction) > > However, this is probably more interesting, using a colour-space > sequence for one end, and a base-space sequence for the other end: > What do you mean ? > $ time ../../../sebgit/ray/code/Ray -p tests/phix/phix_5k_1.fasta > tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:' > Number of contigs: 3 > Total length of contigs: 15713 > Number of contigs >= 500 nt: 3 > Total length of contigs >= 500 nt: 15713 > Number of scaffolds: 3 > Total length of scaffolds: 15713 > Number of scaffolds >= 500 nt: 3 > Total length of scaffolds >= 500: 15713 > > real 0m4.246s > user 0m3.428s > sys 0m0.796s > bioinf@thaliana:~/install/ray/git/ray/system-tests$ time ../code/Ray -p > tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.csfasta | grep -A 7 > 'Number of contigs:' > Number of contigs: 1 > Total length of contigs: 5243 > Number of contigs >= 500 nt: 1 > Total length of contigs >= 500 nt: 5243 > Number of scaffolds: 1 > Total length of scaffolds: 5243 > Number of scaffolds >= 500 nt: 1 > Total length of scaffolds >= 500: 5243 > > real 0m2.892s > user 0m2.412s > sys 0m0.456s > > $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | > grep $(fasta_formatter -i RayOutput.Scaffolds.fasta | grep -v '^>') > > /dev/null && echo "success (match in reverse direction)" > success (match in reverse direction) > > [I haven't tried doing a match with the sebhtml/ray sequence.] > > The commit that actually got my code working properly with colour-space > is here: > > https://github.com/gringer/ray/commit/fbd1efd4c3dbe31df0de820167a068736de648f5 > OK > [I forgot to make the same changes for the KmerAcademyBuilder (lines > 109-113) that I made for the VerticesExtractor (lines 110-114)] > So the next steps is to test your work on the system tests I guess. I see that your are 0 commit away from my branch. (Cool) > -- David > > _______________________________________________ > Denovoassembler-devel mailing list > Den...@li... > https://lists.sourceforge.net/lists/listinfo/denovoassembler-devel > Sébastien Boisvert http://github.com/sebhtml/ray |
From: Sébastien B. <seb...@ul...> - 2011-07-21 17:08:45
|
flush_twice_nocomplement_v2.patch This patch has 1 bug: rankToFlush must be computed with the lowest k-mer. There is 1 reverseComplement in the design I described in the other email. Basically, you don't call vertexRank, instead, you call the hash function on the lowest directly. vertexrank_nocopy_nocomplement.patch The method vertexRank is used elsewhere too, in places where reverseComplement is not called before this call. > > Attached. Note that this is actually 2 reverse-complements, not 1, > because a reverse-complement is needed in order to get the correct > vertex rank. I suppose you could avoid this by modifying the vertex rank > function to take a boolean "actually, don't bother with finding the > lowest k-mer, I've already done that." > > oh... and looking at the vertexRank function, you could eliminate a Kmer > copy by using an if statement (see attached patch, which does both of > these things, but doesn't modify the KmerAcademyBuilder.cpp function to > take advantage of that). > > FWIW, when I run this (both patches applied) on my phiX_5k data, it > produces different coverage distributions each time. Does this make sense? > > -- David Sébastien Boisvert http://github.com/sebhtml/ray |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 16:55:28
|
On 21/07/11 17:56, Sébastien Boisvert wrote: > So basically, if your patch is modified to flush the lowest k-mer among a and b, then it should work fine. Attached. Note that this is actually 2 reverse-complements, not 1, because a reverse-complement is needed in order to get the correct vertex rank. I suppose you could avoid this by modifying the vertex rank function to take a boolean "actually, don't bother with finding the lowest k-mer, I've already done that." oh... and looking at the vertexRank function, you could eliminate a Kmer copy by using an if statement (see attached patch, which does both of these things, but doesn't modify the KmerAcademyBuilder.cpp function to take advantage of that). FWIW, when I run this (both patches applied) on my phiX_5k data, it produces different coverage distributions each time. Does this make sense? -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-21 15:56:18
|
No need to be sorry ! Basically, this code extracts the k-mers from reads and send them to MPI peers. 2 reverse-complement k-mers will always go to the same destination. In fact, only the lowest k-mer needs to be appended as the other is not used anyway in MessageProcessor::call_RAY_MPI_TAG_KMER_ACADEMY_DATA (communication/MessageProcessor.cpp) call_RAY_MPI_TAG_KMER_ACADEMY_DATA processes messages with tag RAY_MPI_TAG_KMER_ACADEMY_DATA So basically, if your patch is modified to flush the lowest k-mer among a and b, then it should work fine. > On 21/07/11 17:39, Sébastien Boisvert wrote: >> I think there is a bug: >> >> In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice ! > > Ah, sorry, it does actually add the raw bits of a/b as well. I just > noticed that the rankToFlush would be the same, and didn't think about > copying the actual sequence. > > -- David Sébastien |
From: Sébastien B. <seb...@ul...> - 2011-07-21 15:45:05
|
Patch work should only be on -devel please ! > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 21 juillet 2011 11:35 > À : Sébastien Boisvert > Cc : den...@li... > Objet : Re: RE : [Denovoassembler-users] No observed peak in coverage distribution (454 test data) > > On 21/07/11 16:52, Sébastien Boisvert wrote: >> Can you send me (pastebin) the CoverageDistribution.txt file for each of your runs. >> I will add them to my unit tests. > > It gets more weird. With the flush patch, Ray is able to attempt an > assembly, presumably because coverage(0) < coverage(2): > > http://pastebin.com/mK6FGR2B > > Rank 0: the minimum coverage is 0 > Rank 0: the peak coverage is 2 > ... > Number of contigs: 153 > Total length of contigs: 49920 > Number of contigs >= 500 nt: 29 > Total length of contigs >= 500 nt: 19433 > Number of scaffolds: 153 > Total length of scaffolds: 49920 > Number of scaffolds >= 500 nt: 29 > Total length of scaffolds >= 500: 19433 > > Without the flush patch, Ray panics from a bad coverage distribution: > > http://pastebin.com/FqumBBFH > > Rank 0: the minimum coverage is 1 > Rank 0: the peak coverage is 1 > Rank 0: Assembler panic: no peak observed in the k-mer coverage > distribution. > Rank 0: to deal with the sequencing error rate, try to lower the k-mer > length (-k) > Rank 9: sent 203748 messages, received 203749 messages. > Rank 8: sent 203487 messages, received 203488 messages. > Rank 7: sent 204303 messages, received 204304 messages. > Rank 6: sent 204861 messages, received 204862 messages. > Rank 5: sent 203780 messages, received 203781 messages. > Rank 4: sent 203762 messages, received 203763 messages. > Rank 3: sent 203870 messages, received 203871 messages. > Rank 2: sent 204344 messages, received 204345 messages. > Rank 1: sent 204085 messages, received 204086 messages. > Rank 0: sent 203614 messages, received 203604 messages. > >> Also, do you have a place where I could upload you data-for-unit-tests and data-for-system-tests so that >> you can run tests ? > > Unfortunately no. I have plenty of storage space on our Illumina server > (and am expecting ~300TB sometime soon), can download as much as I want > from here, but the internal network is not accessible from external > sites (so I don't think I can set up an upload service). > > -- David > Sébastien |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:44:49
|
On 21/07/11 17:39, Sébastien Boisvert wrote: > I think there is a bug: > > In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice ! Ah, sorry, it does actually add the raw bits of a/b as well. I just noticed that the rankToFlush would be the same, and didn't think about copying the actual sequence. -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-21 15:43:12
|
Yes, it will double all the numbers. I saw that in your patch ! See my other email. If you have time, you can implement the pseudo-algorithm I devised in the other email. It should be safe. Also, I would need you to use git format-patch (see code/Submit-a-patch.txt) should you submit me a patch that I would merge because otherwise the Author field in the commit would not be properly populated. (You would be the author and I would be the committer). > On 21/07/11 17:07, David Eccles (gringer) wrote: >> Okay, I've attached a patch that just does the flush twice without any >> additional reverse complelements. > > Er, sorry, that patch seems to alter the coverage distribution. The unit > tests I was able to do were successful, but I guess the coverage was > important this time. > > phix_5k_1/2 with patch: > > http://pastebin.com/1nnezJrQ > > phix_5k_1/2 without patch: > > http://pastebin.com/d3PNUBfQ > > -- David Sébastien |
From: Sébastien B. <seb...@ul...> - 2011-07-21 15:39:40
|
I think there is a bug: In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice ! What I had in mind (pseudo C++): /* get the pair of k-mers */ Kmer a = kmerAtPosition() Kmer b=a.reverseComplement() /* compute the destination */ Kmer rankToFlush=-1; if(a<b) rankToFlush=a.hash_function_1()%m_parameters.getSize() else rankToFlush=b.hash_function_1()%m_parameters.getSize() /* push a to destination and flush if necessary */ push a to rankToFlush usin bufferedData.addAt bufferedData.flush() /* push b to destination and flush if necessary */ push b to rankToFlush usins bufferedData.addAt bufferedData.flush() This has only a call to reverseComplement and do the same as refs/master > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 21 juillet 2011 11:07 > À : Sébastien Boisvert > Cc : den...@li... > Objet : Re: Why flush both k-mer and its complement? > > On 21/07/11 16:40, Sébastien Boisvert wrote: >> Although both k-mers will go on the same MPI rank, >> the first flush may generate an actual MPI message on the network. >> As such, the second flush won't flush anything because the buffer was just reset. >> On the other hand, if the first k-mer does not generate a message when flush is called, >> then the second flush may do so. >> >> Feel free to send me a patch for this ! > > Okay, I've attached a patch that just does the flush twice without any > additional reverse complelements. > > -- David > Sébastien |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:23:07
|
On 21/07/11 17:07, David Eccles (gringer) wrote: > Okay, I've attached a patch that just does the flush twice without any > additional reverse complelements. Er, sorry, that patch seems to alter the coverage distribution. The unit tests I was able to do were successful, but I guess the coverage was important this time. phix_5k_1/2 with patch: http://pastebin.com/1nnezJrQ phix_5k_1/2 without patch: http://pastebin.com/d3PNUBfQ -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:08:28
|
On 21/07/11 16:40, Sébastien Boisvert wrote: > Although both k-mers will go on the same MPI rank, > the first flush may generate an actual MPI message on the network. > As such, the second flush won't flush anything because the buffer was just reset. > On the other hand, if the first k-mer does not generate a message when flush is called, > then the second flush may do so. > > Feel free to send me a patch for this ! Okay, I've attached a patch that just does the flush twice without any additional reverse complelements. -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-21 14:40:51
|
Short answer: no. Although both k-mers will go on the same MPI rank, the first flush may generate an actual MPI message on the network. As such, the second flush won't flush anything because the buffer was just reset. On the other hand, if the first k-mer does not generate a message when flush is called, then the second flush may do so. I think you bring a good point however. complementVertex being the function that consumes the most CPU cycles (according to GNU gprof), clearly, there is room here for improvement (i.e. as you underline using 1 call instead of 3 would be better). I added a TODO. Feel free to send me a patch for this ! see code/Submit-a-patch.txt > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 21 juillet 2011 05:46 > À : Sébastien Boisvert; den...@li... > Objet : Why flush both k-mer and its complement? > > I notice that in assembler/KmerAcademyBuilder.cpp:120-139 you have: > > 120 rankToFlush=m_parameters->_vertexRank(&a); > 121 [add <rankToFlush> to pending flush messages] > 130 Kmer b = > complementVertex(&a,wordSize,m_parameters->getColorSpaceMode()); > 132 rankToFlush=m_parameters->_vertexRank(&b); > 133 [add <rankToFlush> to pending flush messages] > > but the vertexRank function returns the hash of the lowest k-mer: > > core/Parameters.cpp:1200 (_vertexRank) > return vertexRank(a,m_size,m_wordSize,m_colorSpaceMode); > > core/common_functions.cpp (vertexRank) > 134 Kmer b=complementVertex(a,w,color); > 135 if(a->isLower(&b)) > 136 b=*a; > 137 return hash_function_1(&b)%(_size); > > So the code appears to be doing 3 complement vertex operations > (KmerAcademyBuilder.cpp:130, common_functions.cpp:134 * 2) for each > flush. Could the second flush (i.e. KmerAcademyBuilder.cpp:130-139) be > removed without loss of function? > > -- David > Sébastien |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 09:47:50
|
I notice that in assembler/KmerAcademyBuilder.cpp:120-139 you have: 120 rankToFlush=m_parameters->_vertexRank(&a); 121 [add <rankToFlush> to pending flush messages] 130 Kmer b = complementVertex(&a,wordSize,m_parameters->getColorSpaceMode()); 132 rankToFlush=m_parameters->_vertexRank(&b); 133 [add <rankToFlush> to pending flush messages] but the vertexRank function returns the hash of the lowest k-mer: core/Parameters.cpp:1200 (_vertexRank) return vertexRank(a,m_size,m_wordSize,m_colorSpaceMode); core/common_functions.cpp (vertexRank) 134 Kmer b=complementVertex(a,w,color); 135 if(a->isLower(&b)) 136 b=*a; 137 return hash_function_1(&b)%(_size); So the code appears to be doing 3 complement vertex operations (KmerAcademyBuilder.cpp:130, common_functions.cpp:134 * 2) for each flush. Could the second flush (i.e. KmerAcademyBuilder.cpp:130-139) be removed without loss of function? -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-19 14:35:23
|
So, things are finally looking like the colour-space stuff is going somewhere. I've managed to get a good result from a Ray run with colour-space input (no errors, first bases included), and it might be a tad quicker than the sebhtml/ray colour-space assembler: $ time ../../../sebgit/ray/code/Ray -p tests/phix/phix_5k_1.csfasta tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:' Number of contigs: 1 Total length of contigs: 5380 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5380 Number of scaffolds: 1 Total length of scaffolds: 5380 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5380 real 0m3.107s user 0m2.432s sys 0m0.652s $ time ../code/Ray -p tests/phix/phix_5k_1.csfasta tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:' Number of contigs: 1 Total length of contigs: 5243 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5243 Number of scaffolds: 1 Total length of scaffolds: 5243 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5243 real 0m2.894s user 0m2.392s sys 0m0.484s $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | grep $(fasta_formatter -i RayOutput.Scaffolds.fasta | grep -v '^>') > /dev/null && echo "success (match in reverse direction)" success (match in reverse direction) There's nothing particularly impressive with this (apart from it actually working...). For what it's worth, the sebhtml/ray also produces a correct sequence, if you feed it the right starting base: $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | grep $((echo -n 'G'; tail -n +2 RayOutput.Scaffolds.fasta) | ~/scripts/cs2base.pl | fasta_formatter) > /dev/null && echo "success (match in forward direction)" success (match in forward direction) However, this is probably more interesting, using a colour-space sequence for one end, and a base-space sequence for the other end: $ time ../../../sebgit/ray/code/Ray -p tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:' Number of contigs: 3 Total length of contigs: 15713 Number of contigs >= 500 nt: 3 Total length of contigs >= 500 nt: 15713 Number of scaffolds: 3 Total length of scaffolds: 15713 Number of scaffolds >= 500 nt: 3 Total length of scaffolds >= 500: 15713 real 0m4.246s user 0m3.428s sys 0m0.796s bioinf@thaliana:~/install/ray/git/ray/system-tests$ time ../code/Ray -p tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:' Number of contigs: 1 Total length of contigs: 5243 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5243 Number of scaffolds: 1 Total length of scaffolds: 5243 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5243 real 0m2.892s user 0m2.412s sys 0m0.456s $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | grep $(fasta_formatter -i RayOutput.Scaffolds.fasta | grep -v '^>') > /dev/null && echo "success (match in reverse direction)" success (match in reverse direction) [I haven't tried doing a match with the sebhtml/ray sequence.] The commit that actually got my code working properly with colour-space is here: https://github.com/gringer/ray/commit/fbd1efd4c3dbe31df0de820167a068736de648f5 [I forgot to make the same changes for the KmerAcademyBuilder (lines 109-113) that I made for the VerticesExtractor (lines 110-114)] -- David |