Thread: [Denovoassembler-devel] Why flush both k-mer and its complement?

Ray -- Parallel genome assemblies for parallel DNA sequencing

Brought to you by: sebhtml

denovoassembler-devel

[Denovoassembler-devel] Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 09:47:50

I notice that in assembler/KmerAcademyBuilder.cpp:120-139 you have:

120 rankToFlush=m_parameters->_vertexRank(&a);
121 [add <rankToFlush> to pending flush messages]
130 Kmer b = 
complementVertex(&a,wordSize,m_parameters->getColorSpaceMode());
132 rankToFlush=m_parameters->_vertexRank(&b);
133 [add <rankToFlush> to pending flush messages]

but the vertexRank function returns the hash of the lowest k-mer:

core/Parameters.cpp:1200 (_vertexRank)
return vertexRank(a,m_size,m_wordSize,m_colorSpaceMode);

core/common_functions.cpp (vertexRank)
134 Kmer b=complementVertex(a,w,color);
135 if(a->isLower(&b))
136     b=*a;
137 return hash_function_1(&b)%(_size);

So the code appears to be doing 3 complement vertex operations 
(KmerAcademyBuilder.cpp:130, common_functions.cpp:134 * 2) for each 
flush. Could the second flush (i.e. KmerAcademyBuilder.cpp:130-139) be 
removed without loss of function?

-- David

[Denovoassembler-devel] RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 14:40:51

Short answer: no.

Although both k-mers will go on the same MPI rank,
the first flush may generate an actual MPI message on the network.

As such, the second flush won't flush anything because the buffer was just reset.

On the other hand, if the first k-mer does not generate a message when flush is called,
then the second flush may do so.



I think you bring a good point however.

complementVertex being the function that consumes the most CPU cycles (according to GNU gprof), clearly,
there is room here for improvement (i.e. as you underline using 1 call instead of 3 would be better).

I added a TODO.

Feel free to send me a patch for this !

see code/Submit-a-patch.txt


> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 21 juillet 2011 05:46
> À : Sébastien Boisvert; den...@li...
> Objet : Why flush both k-mer and its complement?
> 
> I notice that in assembler/KmerAcademyBuilder.cpp:120-139 you have:
> 
> 120 rankToFlush=m_parameters->_vertexRank(&a);
> 121 [add <rankToFlush> to pending flush messages]
> 130 Kmer b =
> complementVertex(&a,wordSize,m_parameters->getColorSpaceMode());
> 132 rankToFlush=m_parameters->_vertexRank(&b);
> 133 [add <rankToFlush> to pending flush messages]
> 
> but the vertexRank function returns the hash of the lowest k-mer:
> 
> core/Parameters.cpp:1200 (_vertexRank)
> return vertexRank(a,m_size,m_wordSize,m_colorSpaceMode);
> 
> core/common_functions.cpp (vertexRank)
> 134 Kmer b=complementVertex(a,w,color);
> 135 if(a->isLower(&b))
> 136     b=*a;
> 137 return hash_function_1(&b)%(_size);
> 
> So the code appears to be doing 3 complement vertex operations
> (KmerAcademyBuilder.cpp:130, common_functions.cpp:134 * 2) for each
> flush. Could the second flush (i.e. KmerAcademyBuilder.cpp:130-139) be
> removed without loss of function?
> 
> -- David
> 

                                                     Sébastien

Re: [Denovoassembler-devel] Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:08:28

Attachments: flush_twice_nocomplement.patch

On 21/07/11 16:40, Sébastien Boisvert wrote:
> Although both k-mers will go on the same MPI rank,
> the first flush may generate an actual MPI message on the network.
> As such, the second flush won't flush anything because the buffer was just reset.
> On the other hand, if the first k-mer does not generate a message when flush is called,
> then the second flush may do so.
>
> Feel free to send me a patch for this !

Okay, I've attached a patch that just does the flush twice without any 
additional reverse complelements.

-- David

Re: [Denovoassembler-devel] Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:23:07

On 21/07/11 17:07, David Eccles (gringer) wrote:
> Okay, I've attached a patch that just does the flush twice without any
> additional reverse complelements.

Er, sorry, that patch seems to alter the coverage distribution. The unit 
tests I was able to do were successful, but I guess the coverage was 
important this time.

phix_5k_1/2 with patch:

http://pastebin.com/1nnezJrQ

phix_5k_1/2 without patch:

http://pastebin.com/d3PNUBfQ

-- David

[Denovoassembler-devel] RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 15:43:12

Yes, it will double all the numbers.

I saw that in your patch !

See my other email.

If you have time, you can implement the pseudo-algorithm I devised in the other email.

It should be safe.

Also, I would need you to use git format-patch (see code/Submit-a-patch.txt) should
you submit me a patch that I would merge because otherwise the Author field in the commit 
would not be properly populated. (You would be the author and I would be the committer).

> On 21/07/11 17:07, David Eccles (gringer) wrote:
>> Okay, I've attached a patch that just does the flush twice without any
>> additional reverse complelements.
> 
> Er, sorry, that patch seems to alter the coverage distribution. The unit
> tests I was able to do were successful, but I guess the coverage was
> important this time.
> 
> phix_5k_1/2 with patch:
> 
> http://pastebin.com/1nnezJrQ
> 
> phix_5k_1/2 without patch:
> 
> http://pastebin.com/d3PNUBfQ
> 
> -- David

                                                     Sébastien

[Denovoassembler-devel] RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 15:39:40

I think there is a bug:

In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice !



What I had in mind (pseudo C++):

/* get the pair of k-mers */
Kmer a = kmerAtPosition()
Kmer b=a.reverseComplement()

/* compute the destination */
Kmer rankToFlush=-1;

if(a<b)
	rankToFlush=a.hash_function_1()%m_parameters.getSize()
else
	rankToFlush=b.hash_function_1()%m_parameters.getSize()


/* push a to destination and flush if necessary */
push a to rankToFlush usin bufferedData.addAt
bufferedData.flush()

/* push b to destination and flush if necessary */
push b to rankToFlush usins bufferedData.addAt
bufferedData.flush()


This has only a call to reverseComplement and do the same as refs/master


> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 21 juillet 2011 11:07
> À : Sébastien Boisvert
> Cc : den...@li...
> Objet : Re: Why flush both k-mer and its complement?
> 
> On 21/07/11 16:40, Sébastien Boisvert wrote:
>> Although both k-mers will go on the same MPI rank,
>> the first flush may generate an actual MPI message on the network.
>> As such, the second flush won't flush anything because the buffer was just reset.
>> On the other hand, if the first k-mer does not generate a message when flush is called,
>> then the second flush may do so.
>>
>> Feel free to send me a patch for this !
> 
> Okay, I've attached a patch that just does the flush twice without any
> additional reverse complelements.
> 
> -- David
> 

                                                     Sébastien

Re: [Denovoassembler-devel] RE : Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:44:49

On 21/07/11 17:39, Sébastien Boisvert wrote:
> I think there is a bug:
>
> In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice !

Ah, sorry, it does actually add the raw bits of a/b as well. I just 
noticed that the rankToFlush would be the same, and didn't think about 
copying the actual sequence.

-- David

[Denovoassembler-devel] RE : RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 15:56:18

No need to be sorry !

Basically, this code extracts the k-mers from reads and send them to MPI peers.

2 reverse-complement k-mers will always go to the same destination.


In fact, only the lowest k-mer needs to be appended as the other is not used anyway in 

MessageProcessor::call_RAY_MPI_TAG_KMER_ACADEMY_DATA (communication/MessageProcessor.cpp)



call_RAY_MPI_TAG_KMER_ACADEMY_DATA processes messages with tag RAY_MPI_TAG_KMER_ACADEMY_DATA


So basically, if your patch is modified to flush the lowest k-mer among a and b, then it should work fine.




> On 21/07/11 17:39, Sébastien Boisvert wrote:
>> I think there is a bug:
>>
>> In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice !
> 
> Ah, sorry, it does actually add the raw bits of a/b as well. I just
> noticed that the rankToFlush would be the same, and didn't think about
> copying the actual sequence.
> 
> -- David

                                                     Sébastien

Re: [Denovoassembler-devel] RE : RE : Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 16:55:28

Attachments: flush_twice_nocomplement_v2.patch vertexrank_nocopy_nocomplement.patch

On 21/07/11 17:56, Sébastien Boisvert wrote:
> So basically, if your patch is modified to flush the lowest k-mer among a and b, then it should work fine.

Attached. Note that this is actually 2 reverse-complements, not 1, 
because a reverse-complement is needed in order to get the correct 
vertex rank. I suppose you could avoid this by modifying the vertex rank 
function to take a boolean "actually, don't bother with finding the 
lowest k-mer, I've already done that."

oh... and looking at the vertexRank function, you could eliminate a Kmer 
copy by using an if statement (see attached patch, which does both of 
these things, but doesn't modify the KmerAcademyBuilder.cpp function to 
take advantage of that).

FWIW, when I run this (both patches applied) on my phiX_5k data, it 
produces different coverage distributions each time. Does this make sense?

-- David

[Denovoassembler-devel] RE : RE : RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 17:08:45

flush_twice_nocomplement_v2.patch


This patch has 1 bug: rankToFlush must be computed with the lowest k-mer.

There is 1 reverseComplement in the design I described in the other email.

Basically, you don't call vertexRank, instead, you call the hash function on the lowest directly.



vertexrank_nocopy_nocomplement.patch


The method vertexRank is used elsewhere too, in places where reverseComplement is
not called before this call. 


> 
> Attached. Note that this is actually 2 reverse-complements, not 1,
> because a reverse-complement is needed in order to get the correct
> vertex rank. I suppose you could avoid this by modifying the vertex rank
> function to take a boolean "actually, don't bother with finding the
> lowest k-mer, I've already done that."
> 
> oh... and looking at the vertexRank function, you could eliminate a Kmer
> copy by using an if statement (see attached patch, which does both of
> these things, but doesn't modify the KmerAcademyBuilder.cpp function to
> take advantage of that).
> 
> FWIW, when I run this (both patches applied) on my phiX_5k data, it
> produces different coverage distributions each time. Does this make sense?
> 
> -- David
                  Sébastien Boisvert
                  http://github.com/sebhtml/ray