Thread: [Denovoassembler-devel] Why flush both k-mer and its complement?
Ray -- Parallel genome assemblies for parallel DNA sequencing
Brought to you by:
sebhtml
From: David E. (gringer) <dav...@mp...> - 2011-07-21 09:47:50
|
I notice that in assembler/KmerAcademyBuilder.cpp:120-139 you have: 120 rankToFlush=m_parameters->_vertexRank(&a); 121 [add <rankToFlush> to pending flush messages] 130 Kmer b = complementVertex(&a,wordSize,m_parameters->getColorSpaceMode()); 132 rankToFlush=m_parameters->_vertexRank(&b); 133 [add <rankToFlush> to pending flush messages] but the vertexRank function returns the hash of the lowest k-mer: core/Parameters.cpp:1200 (_vertexRank) return vertexRank(a,m_size,m_wordSize,m_colorSpaceMode); core/common_functions.cpp (vertexRank) 134 Kmer b=complementVertex(a,w,color); 135 if(a->isLower(&b)) 136 b=*a; 137 return hash_function_1(&b)%(_size); So the code appears to be doing 3 complement vertex operations (KmerAcademyBuilder.cpp:130, common_functions.cpp:134 * 2) for each flush. Could the second flush (i.e. KmerAcademyBuilder.cpp:130-139) be removed without loss of function? -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-21 14:40:51
|
Short answer: no. Although both k-mers will go on the same MPI rank, the first flush may generate an actual MPI message on the network. As such, the second flush won't flush anything because the buffer was just reset. On the other hand, if the first k-mer does not generate a message when flush is called, then the second flush may do so. I think you bring a good point however. complementVertex being the function that consumes the most CPU cycles (according to GNU gprof), clearly, there is room here for improvement (i.e. as you underline using 1 call instead of 3 would be better). I added a TODO. Feel free to send me a patch for this ! see code/Submit-a-patch.txt > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 21 juillet 2011 05:46 > À : Sébastien Boisvert; den...@li... > Objet : Why flush both k-mer and its complement? > > I notice that in assembler/KmerAcademyBuilder.cpp:120-139 you have: > > 120 rankToFlush=m_parameters->_vertexRank(&a); > 121 [add <rankToFlush> to pending flush messages] > 130 Kmer b = > complementVertex(&a,wordSize,m_parameters->getColorSpaceMode()); > 132 rankToFlush=m_parameters->_vertexRank(&b); > 133 [add <rankToFlush> to pending flush messages] > > but the vertexRank function returns the hash of the lowest k-mer: > > core/Parameters.cpp:1200 (_vertexRank) > return vertexRank(a,m_size,m_wordSize,m_colorSpaceMode); > > core/common_functions.cpp (vertexRank) > 134 Kmer b=complementVertex(a,w,color); > 135 if(a->isLower(&b)) > 136 b=*a; > 137 return hash_function_1(&b)%(_size); > > So the code appears to be doing 3 complement vertex operations > (KmerAcademyBuilder.cpp:130, common_functions.cpp:134 * 2) for each > flush. Could the second flush (i.e. KmerAcademyBuilder.cpp:130-139) be > removed without loss of function? > > -- David > Sébastien |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:08:28
Attachments:
flush_twice_nocomplement.patch
|
On 21/07/11 16:40, Sébastien Boisvert wrote: > Although both k-mers will go on the same MPI rank, > the first flush may generate an actual MPI message on the network. > As such, the second flush won't flush anything because the buffer was just reset. > On the other hand, if the first k-mer does not generate a message when flush is called, > then the second flush may do so. > > Feel free to send me a patch for this ! Okay, I've attached a patch that just does the flush twice without any additional reverse complelements. -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:23:07
|
On 21/07/11 17:07, David Eccles (gringer) wrote: > Okay, I've attached a patch that just does the flush twice without any > additional reverse complelements. Er, sorry, that patch seems to alter the coverage distribution. The unit tests I was able to do were successful, but I guess the coverage was important this time. phix_5k_1/2 with patch: http://pastebin.com/1nnezJrQ phix_5k_1/2 without patch: http://pastebin.com/d3PNUBfQ -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-21 15:43:12
|
Yes, it will double all the numbers. I saw that in your patch ! See my other email. If you have time, you can implement the pseudo-algorithm I devised in the other email. It should be safe. Also, I would need you to use git format-patch (see code/Submit-a-patch.txt) should you submit me a patch that I would merge because otherwise the Author field in the commit would not be properly populated. (You would be the author and I would be the committer). > On 21/07/11 17:07, David Eccles (gringer) wrote: >> Okay, I've attached a patch that just does the flush twice without any >> additional reverse complelements. > > Er, sorry, that patch seems to alter the coverage distribution. The unit > tests I was able to do were successful, but I guess the coverage was > important this time. > > phix_5k_1/2 with patch: > > http://pastebin.com/1nnezJrQ > > phix_5k_1/2 without patch: > > http://pastebin.com/d3PNUBfQ > > -- David Sébastien |
From: Sébastien B. <seb...@ul...> - 2011-07-21 15:39:40
|
I think there is a bug: In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice ! What I had in mind (pseudo C++): /* get the pair of k-mers */ Kmer a = kmerAtPosition() Kmer b=a.reverseComplement() /* compute the destination */ Kmer rankToFlush=-1; if(a<b) rankToFlush=a.hash_function_1()%m_parameters.getSize() else rankToFlush=b.hash_function_1()%m_parameters.getSize() /* push a to destination and flush if necessary */ push a to rankToFlush usin bufferedData.addAt bufferedData.flush() /* push b to destination and flush if necessary */ push b to rankToFlush usins bufferedData.addAt bufferedData.flush() This has only a call to reverseComplement and do the same as refs/master > ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 21 juillet 2011 11:07 > À : Sébastien Boisvert > Cc : den...@li... > Objet : Re: Why flush both k-mer and its complement? > > On 21/07/11 16:40, Sébastien Boisvert wrote: >> Although both k-mers will go on the same MPI rank, >> the first flush may generate an actual MPI message on the network. >> As such, the second flush won't flush anything because the buffer was just reset. >> On the other hand, if the first k-mer does not generate a message when flush is called, >> then the second flush may do so. >> >> Feel free to send me a patch for this ! > > Okay, I've attached a patch that just does the flush twice without any > additional reverse complelements. > > -- David > Sébastien |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:44:49
|
On 21/07/11 17:39, Sébastien Boisvert wrote: > I think there is a bug: > > In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice ! Ah, sorry, it does actually add the raw bits of a/b as well. I just noticed that the rankToFlush would be the same, and didn't think about copying the actual sequence. -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-21 15:56:18
|
No need to be sorry ! Basically, this code extracts the k-mers from reads and send them to MPI peers. 2 reverse-complement k-mers will always go to the same destination. In fact, only the lowest k-mer needs to be appended as the other is not used anyway in MessageProcessor::call_RAY_MPI_TAG_KMER_ACADEMY_DATA (communication/MessageProcessor.cpp) call_RAY_MPI_TAG_KMER_ACADEMY_DATA processes messages with tag RAY_MPI_TAG_KMER_ACADEMY_DATA So basically, if your patch is modified to flush the lowest k-mer among a and b, then it should work fine. > On 21/07/11 17:39, Sébastien Boisvert wrote: >> I think there is a bug: >> >> In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice ! > > Ah, sorry, it does actually add the raw bits of a/b as well. I just > noticed that the rankToFlush would be the same, and didn't think about > copying the actual sequence. > > -- David Sébastien |
From: David E. (gringer) <dav...@mp...> - 2011-07-21 16:55:28
|
On 21/07/11 17:56, Sébastien Boisvert wrote: > So basically, if your patch is modified to flush the lowest k-mer among a and b, then it should work fine. Attached. Note that this is actually 2 reverse-complements, not 1, because a reverse-complement is needed in order to get the correct vertex rank. I suppose you could avoid this by modifying the vertex rank function to take a boolean "actually, don't bother with finding the lowest k-mer, I've already done that." oh... and looking at the vertexRank function, you could eliminate a Kmer copy by using an if statement (see attached patch, which does both of these things, but doesn't modify the KmerAcademyBuilder.cpp function to take advantage of that). FWIW, when I run this (both patches applied) on my phiX_5k data, it produces different coverage distributions each time. Does this make sense? -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-21 17:08:45
|
flush_twice_nocomplement_v2.patch This patch has 1 bug: rankToFlush must be computed with the lowest k-mer. There is 1 reverseComplement in the design I described in the other email. Basically, you don't call vertexRank, instead, you call the hash function on the lowest directly. vertexrank_nocopy_nocomplement.patch The method vertexRank is used elsewhere too, in places where reverseComplement is not called before this call. > > Attached. Note that this is actually 2 reverse-complements, not 1, > because a reverse-complement is needed in order to get the correct > vertex rank. I suppose you could avoid this by modifying the vertex rank > function to take a boolean "actually, don't bother with finding the > lowest k-mer, I've already done that." > > oh... and looking at the vertexRank function, you could eliminate a Kmer > copy by using an if statement (see attached patch, which does both of > these things, but doesn't modify the KmerAcademyBuilder.cpp function to > take advantage of that). > > FWIW, when I run this (both patches applied) on my phiX_5k data, it > produces different coverage distributions each time. Does this make sense? > > -- David Sébastien Boisvert http://github.com/sebhtml/ray |