denovoassembler-devel Mailing List for Ray: scalable assembly (Page 4)

Ray -- Parallel genome assemblies for parallel DNA sequencing

Brought to you by: sebhtml

denovoassembler-devel — Development mailing list

You can subscribe to this list here.

2011	_Jan	_Feb	_Mar	_Apr	_May	_Jun	_Jul (42)	_Aug (4)	_Sep	_Oct	_Nov	_Dec
2012	_Jan	_Feb	_Mar	_Apr	_May	_Jun	_Jul	_Aug	_Sep	_Oct (2)	_Nov (8)	_Dec (4)
2013	_Jan (6)	_Feb (21)	_Mar	_Apr	_May (3)	_Jun	_Jul	_Aug	_Sep (1)	_Oct (3)	_Nov	_Dec
2014	_Jan	_Feb (8)	_Mar (10)	_Apr	_May (1)	_Jun (5)	_Jul	_Aug (1)	_Sep	_Oct	_Nov	_Dec

Flat | Threaded

<< < 1 2 3 4 5 > >> (Page 4 of 5)

[Denovoassembler-devel] RE : [PATCH] fixing a segfault when no contigs are found

From: Sébastien B. <seb...@ul...> - 2011-08-02 18:26:24

Thanks !

I applied your patch.

> When a sequence didn't produce any contigs, I ended up wth a segfault in
> the make fusions method. gdb very quickly narrowed it down to the array
> operator, so I've attached a patch that should fix this problem.
> 
> David
                  Sébastien Boisvert
                  http://github.com/sebhtml/ray

[Denovoassembler-devel] [PATCH] fixing a segfault when no contigs are found

From: David E. (gringer) <dav...@mp...> - 2011-08-02 15:26:19

Attachments: 0001-fixing-a-segfault-when-no-contigs-are-found.patch

When a sequence didn't produce any contigs, I ended up wth a segfault in 
the make fusions method. gdb very quickly narrowed it down to the array 
operator, so I've attached a patch that should fix this problem.

David

[Denovoassembler-devel] Planaria transcriptome coverage distribution

From: David E. (gringer) <dav...@mp...> - 2011-07-27 12:33:21

For testing, here's the coverage distribution that was created by 
combining all the Smed RNA reads I have:

http://user.interface.org.nz/~gringer/hacking/planaria_Ray_5_2011-07-26_RayOutput.CoverageDistribution.txt 
[~140kb]

I had a bit of trouble with firefox freezing when I tried pastebin, 
which is why it's not there.

For what it's worth, some of these RNA transcripts (the low-expression 
transcripts) only have about 50-80 hits for reads from a single 
platform. While I don't expect to be able to get the sequences from 
these transcripts -- for one I looked at, the longest stretch with 
coverage greater than 5 was about 70bp -- it would be nice to have some 
way to tune Ray to work with RNA data that has similar distributions of 
coverage.

The analysis is as follows:

k-mer length:   31
Lowest coverage observed:       1
MinimumCoverage:        215
PeakCoverage:   216
RepeatCoverage: 217
Number of k-mers with at least MinimumCoverage: 1944260 k-mers
Estimated genome length:        972130 nucleotides
Percentage of vertices with coverage 1: 64.0808 %
DistributionFile: 
results/Ray_5_2011-07-26/RayOutput.CoverageDistribution.txt

I'm not quite sure I agree with a Peak/Repeat difference of 1, but given 
that I'm not sure what the numbers are used for, I'll leave that 
strangeness for someone else to comment on.

Note that it has a the big decrease in k-mers found for the first bit of 
the distribution. I'm not sure if this is because of the nature of RNA 
data or something else.

-- David

[Denovoassembler-devel] Segfault from removing ?unnecessary copy, but only when multiple processors are used

From: David E. (gringer) <dav...@mp...> - 2011-07-27 08:39:17

I introduced a bug in sebhtml/git by carrying out the following code change:

diff --git a/code/structures/Kmer.cpp b/code/structures/Kmer.cpp
index 83e145b..10e4871 100644
--- a/code/structures/Kmer.cpp
+++ b/code/structures/Kmer.cpp
@@ -155,8 +155,8 @@ uint8_t Kmer::getFirstSegmentFirstCode(int w){
  int Kmer::vertexRank(int _size,int w,bool color){
         Kmer b=complementVertex(w,color);
         if(isLower(&b))
-               b=*this;
-       return b.hash_function_1()%(_size);
+               return b.hash_function_1()%(_size);
+       return hash_function_1()%(_size);
  }

  /**

In other words, I changed vertexRank to the following:

int Kmer::vertexRank(int _size,int w,bool color){
        Kmer b=complementVertex(w,color);
        if(isLower(&b))
                return b.hash_function_1()%(_size);
        return hash_function_1()%(_size);
}

This shouldn't be a problem, right? All I've done is remove the copy 
from *this to b, which (in the absence of a copy constructor) should do 
a shallow copy of the fields of (*this) [i.e. create a new 
uint64_t[KMER_U64_ARRAY_SIZE], and copy the bytes across]. I can't see 
how the function outcome can be changed by doing this.

[there are a few other places in the code where a similar copy is done]

However, I get segfaults after making this change, and I don't know why. 
In order to get my phiX data to work on sebhtml/git, I needed to adjust 
the minimumY for CoverageDistribution:

diff --git a/code/graph/CoverageDistribution.cpp 
b/code/graph/CoverageDistribution.cpp
index 82cae53..db6cc60 100644
--- a/code/graph/CoverageDistribution.cpp
+++ b/code/graph/CoverageDistribution.cpp
@@ -48,7 +48,7 @@ 
CoverageDistribution::CoverageDistribution(map<int,uint64_t>*distributionOfCover

         int windowSize=10;
         int minimumX=1;
-       uint64_t minimumY=2*4096;
+       uint64_t minimumY=1;
         uint64_t minimumY2=55000;
         int maximumX=65535-1;
         int safeThreshold=256;

If you want the coverage distribution for this, you can find it here:

http://pastebin.com/KG01Tepp

If I run with one processor, everything is fine:

$ mpirun -np 1 ../../../sebgit/ray/code/Ray -write-seeds -k 10 -p 
tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.fasta | grep -A 8 
'Number of contigs'
Number of contigs: 1
Total length of contigs: 5382
Number of contigs >= 500 nt: 1
Total length of contigs >= 500 nt: 5382
Number of scaffolds: 1
Total length of scaffolds: 5382
Number of scaffolds >= 500 nt: 1
Total length of scaffolds >= 500: 5382

Rank 0 wrote RayOutput.Contigs.fasta
Rank 0 wrote RayOutput.Scaffolds.fasta

But when I run with more than one processor, there is a segfault:

$ mpirun -np 2 ../../../sebgit/ray/code/Ray -write-seeds -k 10 -p 
tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.fasta | grep -A 8 
'Number of contigs'
[thaliana:23082] *** Process received signal ***
[thaliana:23082] Signal: Segmentation fault (11)
[thaliana:23082] Signal code: Address not mapped (1)
[thaliana:23082] Failing at address: 0x8
[thaliana:23082] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf020) 
[0x7fc03c7aa020]
[thaliana:23082] [ 1] 
../../../sebgit/ray/code/Ray(_ZN16MessageProcessor30call_RAY_MPI_TAG_VERTICES_DATAEP7Message+0x29f) 
[0x43306f]
[thaliana:23082] [ 2] 
../../../sebgit/ray/code/Ray(_ZN7Machine10runVanillaEv+0x75) [0x445695]
[thaliana:23082] [ 3] 
../../../sebgit/ray/code/Ray(_ZN7Machine5startEv+0xd57) [0x447767]
[thaliana:23082] [ 4] ../../../sebgit/ray/code/Ray(main+0x2b) [0x42608b]
[thaliana:23082] [ 5] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd) [0x7fc03c435ead]
[thaliana:23082] [ 6] ../../../sebgit/ray/code/Ray() [0x4262e1]
[thaliana:23082] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 23082 on node thaliana 
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

So, I can turn on debug and get the same bug:

diff --git a/Makefile b/Makefile
index d1aca41..fcd7b7e 100644
--- a/Makefile
+++ b/Makefile
@@ -83,7 +83,7 @@ OPTIMIZE = y

  # profiling
  GPROF = n
-DEBUG = n
+DEBUG = y

  ifeq ($(GPROF),y)
         OPTIMIZE = n

$ mpirun -np 2 xterm -e gdb --args ../../../sebgit/ray/code/Ray 
-write-seeds -k 10 -p tests/phix/phix_5k_1.fasta 
tests/phix/phix_5k_2.fasta | grep -A 8 'Number of contigs'

[from window for rank 0]

Program received signal SIGSEGV, Segmentation fault.
0x00000000004c5aa7 in MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA (
     this=0x7b04c8, message=0x9ca280)
     at code/communication/MessageProcessor.cpp:465
465 
if(candidate->m_count<m_parameters->getMinimumCoverageToStore())
(gdb) bt
#0  0x00000000004c5aa7 in MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA (
     this=0x7b04c8, message=0x9ca280)
     at code/communication/MessageProcessor.cpp:465
#1  0x00000000004c3e61 in MessageProcessor::processMessage (this=0x7b04c8,
     message=0x9ca280) at code/communication/MessageProcessor.cpp:65
#2  0x00000000004e27fe in Machine::processMessages (this=0x79e090)
     at code/core/Machine.cpp:614
#3  0x00000000004e24ba in Machine::runVanilla (this=0x79e090)
     at code/core/Machine.cpp:533
#4  0x00000000004e2492 in Machine::run (this=0x79e090)
     at code/core/Machine.cpp:517
#5  0x00000000004e1f24 in Machine::start (this=0x79e090)
     at code/core/Machine.cpp:462
#6  0x00000000005469a4 in main (argc=7, argv=0x7fffffffdf48)
     at code/assembler/ray_main.cpp:29
(gdb) print candidate
$1 = (KmerCandidate *) 0x0

So that code change has caused a bug in the MessageProcessor, 
specifically the call_RAY_MPI_TAG_VERTICES_DATA function.... but I'm 
using DEBUG, which should activate ASSERT, which should verify canidate 
== NULL with an assert. I eventually worked out that I need to make 
another change to the makefile to remove an erroneous space (cat -A is 
used here to make the change obvious):

$ git diff Makefile | cat -A
diff --git a/Makefile b/Makefile$
index d1aca41..8ca7842 100644$
--- a/Makefile$
+++ b/Makefile$
@@ -83,7 +83,7 @@ OPTIMIZE = y$
  $
  # profiling$
  GPROF = n$
-DEBUG = n$
+DEBUG = y$
  $
  ifeq ($(GPROF),y)$
  ^IOPTIMIZE = n$
@@ -93,7 +93,7 @@ endif$
  ifeq ($(DEBUG),y)$
  ^IOPTIMIZE = n$
  ^IFORCE_PACKING = n$
-^IASSERT = y $
+^IASSERT = y$
  endif$
  $
  PEDANTIC = n$

So now I get the assert failing, which means the segfault doesn't get 
reached:

$ mpirun -np 2 xterm -e gdb --args ../../../sebgit/ray/code/Ray 
-write-seeds -k 10 -p tests/phix/phix_5k_1.fasta 
tests/phix/phix_5k_2.fasta | grep -A 8 'Number of contigs

[from window for rank 1]

Ray: code/communication/MessageProcessor.cpp:462: void 
MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA(Message*): Assertion 
`candidate!=__null' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff5f3a405 in raise (sig=<value optimized out>)
     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
         in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0  0x00007ffff5f3a405 in raise (sig=<value optimized out>)
     at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ffff5f3d680 in abort () at abort.c:92
#2  0x00007ffff5f335b1 in __assert_fail (
     assertion=0x563296 "candidate!=__null", file=<value optimized out>,
     line=462,
     function=0x563f40 "void 
MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA(Message*)") at assert.c:81
#3  0x00000000004c8d37 in MessageProcessor::call_RAY_MPI_TAG_VERTICES_DATA (
     this=0x7c44c8, message=0x9de2c0)
     at code/communication/MessageProcessor.cpp:462
#4  0x00000000004c6f8d in MessageProcessor::processMessage (this=0x7c44c8,
     message=0x9de2c0) at code/communication/MessageProcessor.cpp:65
#5  0x00000000004e6872 in Machine::processMessages (this=0x7b2090)
     at code/core/Machine.cpp:614
#6  0x00000000004e64e8 in Machine::runVanilla (this=0x7b2090)
     at code/core/Machine.cpp:533
#7  0x00000000004e64c0 in Machine::run (this=0x7b2090)
     at code/core/Machine.cpp:517
#8  0x00000000004e5f52 in Machine::start (this=0x7b2090)
     at code/core/Machine.cpp:462
#9  0x0000000000550ac0 in main (argc=7, argv=0x7fffffffdf48)
     at code/assembler/ray_main.cpp:29

So, it's segfaulting because it can't find a k-mer (I've done additional 
tests to verify that, as expected, both the k-mer and its reverse 
complement cannot be found), which presumably means that the k-mer that 
it's looking for wasn't inserted into the graph.

Just as a reminder, everything still works with a single processor:

$ mpirun -np 1 ../../../sebgit/ray/code/Ray -write-seeds -k 10 -p 
tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.fasta | grep -A 8 
'Number of contigs'
Number of contigs: 1
Total length of contigs: 5382
Number of contigs >= 500 nt: 1
Total length of contigs >= 500 nt: 5382
Number of scaffolds: 1
Total length of scaffolds: 5382
Number of scaffolds >= 500 nt: 1
Total length of scaffolds >= 500: 5382

Rank 0 wrote RayOutput.Contigs.fasta
Rank 0 wrote RayOutput.Scaffolds.fasta

I just don't understand why this removal of a k-mer copy breaks code. 
What's going on here? Any insight would be appreciated.

-- David

[Denovoassembler-devel] RE : [PATCH] clean up / simplify CoverageDistribution function

From: Sébastien B. <seb...@ul...> - 2011-07-26 15:04:30

Send me your problematic phix CoverageDistribution.txt file and I will add it to my unit tests.

If coverage decreases sharply from the starts, chances are that it won't work with Ray.



> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 22 juillet 2011 07:13
> À : den...@li...
> Objet : [Denovoassembler-devel] [PATCH] clean up / simplify     CoverageDistribution function
> 
> I've been trying to clean up the coverage distribution function to make
> it easier for me to understand.
> 
> Using this patch, it works with my simulated phiX data (because of the
> fallback to max coverage with no votes -- it looked like this was in the
> code before?).
> 
> However, it fails for the S.med 454 data (transcriptome data, in which
> coverage values decrease sharply from the start). I noticed that I can
> 'fix' the peak finding code by making this change (about line 83):
> 
> if((votes[i] > votes[largestPosition])
> ->
> if((votes[i] >= votes[largestPosition])
> 
> But then the minimum position is set to the largest position, and the
> assembler panics. I need to set the minimum coverage to the coverage at
> the first position for things to work with that data:
> 
> m_minimumCoverage=x[0]; // about line 98
> 
> I haven't put these additional changes in the patch, because I'm not
> sure how they will affect other things... and [as I've just noticed] I
> get a segfault on the 454 data with my code when doing this.... I'll try
> to nail that down in the next day or so.
> 
> -- David
> 
                  Sébastien Boisvert
                  http://github.com/sebhtml/ray

[Denovoassembler-devel] RE : colour space mode [possibly] not sent from config to other nodes

From: Sébastien B. <seb...@ul...> - 2011-07-26 14:52:24

In old versions of Ray, only the MPI rank 0 read the arguments but now all the MPI ranks do so.

This should not affect how things work.


> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 25 juillet 2011 06:32
> À : Sébastien Boisvert
> Cc : den...@li...
> Objet : colour space mode [possibly] not sent from config to other nodes
> 
> While going through the Ray code to remove the colorSpace
> option/parameter (because it's not necessary for my code), I noticed
> that in Machine.cpp::call_RAY_MASTER_MODE_LOAD_CONFIG() there were lines
> that stored the word size and colour-space mode (about lines 680-690),
> but no corresponding colourSpace storage function in
> MessageProcessor.cpp::call_RAY_MPI_TAG_SET_WORD_SIZE(Message*message)
> (about line 230). I'm not sure if this would affect how things work or not.
> 
> -- David
                  Sébastien Boisvert
                  http://github.com/sebhtml/ray

[Denovoassembler-devel] Reading colour-space -- a reversal of opinion through frustration

From: David E. (gringer) <dav...@mp...> - 2011-07-25 16:37:19

I noticed earlier today that my code was converting all the csfasta 
reads into base space before sending them to the Read object, which 
meant that the rest of the code only saw base-space things. So, my 
"colour-space" assemblies were only colour-space when they were in the 
input files. In other words, Ray was doing everything just as it had 
previously, except at the file reading stage.

This has made me think again about the usefulness of colour-space as an 
internal storage format. Here's my current list of advantages:

* Reverse complement is a simple reverse of the sequence
* No conversion needed for colour-space sequences, as long as it's the 
start of the sequence
* unknown first bases allow for more sequence recovery

Here are a few disadvantages for using colour-space as an internal format:

* k-mers need at least one extra base, ideally two to make 
reverse-complement into base-space a fast process.
* conversion needed for base-space sequences
* conversion needed for colour-space sequences if not at the start of 
the sequence (need to know first base)
* numerical output for genetic data is confusing
* unknown first bases make either Kmer Academy or assembly more difficult

The ABI document, 'Color Space Analysis in the SOLiD System: the Theory, 
Advantages and Solutions', claims "some of the color space rules can be 
applied to detect error during de novo assembly". Given my understanding 
of how Ray works for assembly (e.g. incoming/outgoing edges), I just 
can't see how the colour-space assembly can improve matters. You would 
lose a bit of information in removing the colour-space data that doesn't 
transform perfectly into base-space, but there's a good chance that the 
bad transformation is an indicator of bad data.

As such, I think I'll concentrate on converting colour-space to 
base-space at the Read stage, and forget about the unknown first base 
stuff for the Kmer academy.

  -- David

[Denovoassembler-devel] colour space mode [possibly] not sent from config to other nodes

From: David E. (gringer) <dav...@mp...> - 2011-07-25 10:33:08

While going through the Ray code to remove the colorSpace 
option/parameter (because it's not necessary for my code), I noticed 
that in Machine.cpp::call_RAY_MASTER_MODE_LOAD_CONFIG() there were lines 
that stored the word size and colour-space mode (about lines 680-690), 
but no corresponding colourSpace storage function in 
MessageProcessor.cpp::call_RAY_MPI_TAG_SET_WORD_SIZE(Message*message) 
(about line 230). I'm not sure if this would affect how things work or not.

  -- David

Re: [Denovoassembler-devel] clean up / simplify CoverageDistribution function

From: Eccles, D. <dav...@mp...> - 2011-07-23 12:46:52

From: Eccles, David
> ... and [as I've just noticed] I get a segfault on the 454 data with 
> my code when doing this.... I'll try to nail that down in the next day or
so.

Because there weren't already unit tests for Kmer a.compareTo(&b), I didn't
think to put them in my code. I only discovered the bug from a segfault that
happened because Ray was looking for a palindromic sequence with different
starting bases, and couldn't find it (something like A300330303033003A vs the
reverse-complement T300330303033003T). The problem was that it was clearing
the first/last bases when it wasn't meant to (i.e. first bases known), and
not clearing them when it was meant to (i.e. at least one first base
unknown).

https://github.com/gringer/ray/commit/451f766a5ce8b7da4ce6d340c234e7d4c5c7ca4
f

... I guess Kmer unit tests will be my next mini-project for Ray.

I've been thinking again about unknown first bases and assembly, and think
that probably the best thing to do would be to only store unknown first bases
in the Kmer academy. In other words, graduation would require both a known
first-base, and for the k-mer to occur more than once. It would mean that
there's no messing round with unknown bases when doing the assembly. However,
it would also mean that an assembly couldn't be done on sequences that all
have a misread at their second position.

-- David

[Denovoassembler-devel] [PATCH] clean up / simplify CoverageDistribution function

From: David E. (gringer) <dav...@mp...> - 2011-07-22 11:14:47

Attachments: CoverageDistribution_cleanup.patch

I've been trying to clean up the coverage distribution function to make 
it easier for me to understand.

Using this patch, it works with my simulated phiX data (because of the 
fallback to max coverage with no votes -- it looked like this was in the 
code before?).

However, it fails for the S.med 454 data (transcriptome data, in which 
coverage values decrease sharply from the start). I noticed that I can 
'fix' the peak finding code by making this change (about line 83):

if((votes[i] > votes[largestPosition])
->
if((votes[i] >= votes[largestPosition])

But then the minimum position is set to the largest position, and the 
assembler panics. I need to set the minimum coverage to the coverage at 
the first position for things to work with that data:

m_minimumCoverage=x[0]; // about line 98

I haven't put these additional changes in the patch, because I'm not 
sure how they will affect other things... and [as I've just noticed] I 
get a segfault on the 454 data with my code when doing this.... I'll try 
to nail that down in the next day or so.

-- David

Re: [Denovoassembler-devel] Colour-space loading / assembly now working (produces base-space output)

From: David E. (gringer) <dav...@mp...> - 2011-07-22 08:06:20

On 21/07/11 19:16, Sébastien Boisvert wrote:
>> However, this is probably more interesting, using a colour-space
>> sequence for one end, and a base-space sequence for the other end:
> What do you mean ?

It was a quick test of combining colour-space reads and base-space reads 
together in one run. While it's not expected that first and second reads 
will be in a different space, that is a side-effect of allowing combined 
spaces as input files.

> But mine does not produce the contigs in nucleotide space.
> And I believe yours does !

Yes, it does. The assembly for my fork is done in base-space, while the 
internal representation for reads and k-mers is colour-space. This means 
that the hash values for k-mers will differ from your code, but in other 
respects, other code doesn't need to know the difference.

FWIW, the current behaviour of my code could be shoehorned into 
sebhtml/git by converting all reads into base-space.

I don't yet use any k-mers with unknown first bases, but I expect to add 
that in once I've worked out an appropriate place to do it. I probably 
need to get the assembly (or at least seed extension) to happen in 
colour-space so that sequences with a different first base but the same 
colour-space sequence are lumped together.

> So the next steps is to test your work on the system tests I guess.

The only "system tests" I have at the moment aren't a particularly good 
representation of the data Ray has been designed to work on:

phiX-simulated: small genome, synthetic data
phiX-sequenced: small genome, circularised genome
S. mediterranea: transcriptome, rather than genome
?E. coli: colour-space data with high error-rate

At the moment, I've only really done any testing on the two phiX datasets.

-- David

[Denovoassembler-devel] RE : Random thoughts about Ray

From: Sébastien B. <seb...@ul...> - 2011-07-21 17:31:35

> 
> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 19 juillet 2011 04:32
> À : den...@li...
> Objet : Re: [Denovoassembler-devel] Random thoughts about Ray
> 
> On 18/07/11 19:43, Sébastien Boisvert wrote:
>>> I had to modify that section, because the alternative was broken
>>> code that wouldn't assemble correctly.
>> OK. So you think there was some sort of interaction going on ? How
>> did you fix it (provide commit) ?
> 
> https://github.com/gringer/ray/commit/075ea052a6949a0974ac7d12bcec238f2cf49b58
> 

Interesting, k-mers with very low coverage are used in your case.

I believe this threshold prevent seeds from having low-quality k-mers at their ends.


> [SeedingData.cpp, not test_phiX.sh]
> 
> I still think that the problem of a "perfect" phiX assembly not working
> was due to a parent node that wasn't considered for a seed start, with
> all of the child nodes having 1 edge in, 1 edge out. My hypothesis is
> that the seed start node had low coverage, so wasn't considered a good
> choice. I understand why this is a reasonable decision, but think that a
> better fix to this problem would be to also consider low coverage nodes
> in SeedWorker.cpp when evaluating if a parent node is a better fit for
> the start of a seed. For example, by adding a
> m_SEEDING_coverage_test_done check as well (or integrating that with the
> SEEDING_1_1_test).

I totally agree with you. I added a TODO in the code for that.
I will set the threshold to 1 and if it works for all system tests I will remove it.

> 
>> You are right, systematic errors confuse Ray and sequencing errors
>> consume a lot of memory. It is therefore wise to filter errors. Q20
>> (86% good) is adequate but I think using Q30 (95% good) is too
>> aggressive.
> 
> Q10 (assuming Phred qualities) is a base call accuracy of 90% (i.e. 1 in
> 10), Q20 is 1 in 100. Every 10 is another 'nine' for quality -- 4 nines
> = 99.99% = Q40. For Q above 10, it's essentially the same value for
> Phred and Sanger quality scores.
> 
> http://en.wikipedia.org/wiki/Phred_quality_score#Reliability
> 

Q20 86%
Q30 95%
Q44 98%
Q50 99%

Q = -10 log (P) / log(10)

P=probability(bad)


>>> One typical example is library changes -- code may have worked fine
>>> at an earlier date, but you need to include a change made in some
>>> future build in order to get it working for the more modern
>>> libraries.
>> I agree. That is why the dependencies for Ray are basically only the
>> C++ standard library (which include the C library as well, I believe)
>> and a message-passing-interface library.
> 
> Yes, any C program is considered a valid C++ program as well (there are
> some quirks to make this work properly).
> 
> Given that you've used the boost random number generator for the read
> simulator, one other option that I would like for the Ray code would be
> the boost gzip and bzip2 filters (from Iostreams).
>

But keep in mind that the read simulator is not part of the Ray assembler (the single executable called Ray).

No way I am going to include boost stuff.
 
> http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/bzip2.html#basic_bzip2_decompressor
> http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/gzip.html#basic_gzip_decompressor
> 
> This would allow the use of the more C++-like iostream method of file
> access, even for compressed files, which means being able to use strings
> for all file access, rather than just for the uncompressed files (i.e.
> no more concerns about null-terminated strings). However, after doing
> many Debian updates, I am aware of how often the boost libraries change,
> and would not suggest having boost libraries as a core dependency.
> 
> Once I'm done with the colour-space stuff, I'll see about making a more
> generic file loader class that can handle all formats.
> 

Yeah.

Presently, assembler/Loader.cpp is the "interface".

All formats in formats have standard method name like load().

Loaders use lazy-loading by the way.

>> There is also the command 'git stash', but I have not used it yet.
> 
> Thanks for pointing that out, I wasn't aware of this command. It looks
> like it doesn't seem to work for remote repositories, so maybe I should
> just set up a temporary 'broken' or 'stash' branch when I'm doing things
> like that.
> 

git reset --hard 

this resets your repository, pretty useful to remove files after testing rsynced files.

>> For me, the hardest stuff are bugs occurring in parallel. Checkpoints
>> would help also, but I have not figured an easy way to implement them
>> in Ray yet.
> 
> Sorry, I don't think I understand. I was under the impression that Ray
> already had stopping points, where it waits for all the other nodes to
> finish before continuing. I guess this is a different kind of checkpoint?
> 

You are referring to what is call barriers.

Checkpoints are a specialization of barriers wherein the state of the computation is saved in
a file (or in many files). 

This is useful for instance if you are developping something that runs after 10 hours of computation.

With checkpoints, you can start right at that point.

>> How many gigabases are generated with one HiScan SQ run ?
> 
> From the human RNA run we did, there were about 40-50M sequences per
> lane, with each sequence having 50bp (paired end, so 100bp, I guess).
> With 8 lanes, that's about 45*8*100 ~= 36 gigabases per run. The
> specification is a bit higher than that (67-75Gb), but it's in the same
> ball-park:
> 
> http://www.illumina.com/systems/hiscansq.ilmn#workflow_specs
> 
>> But to do hybrid assemblies, you would have to convert everything to
>> color-space, right?
> 
> Yeah, it's annoying that colour-space is the common denominator. But at
> least the conversion from base-space to colour-space doesn't lose
> information. For any given base-space sequence, there is only one valid
> colour-space sequence (as long as you include a starting base in the
> colour-space sequence).
> 

OK

>> So imagine you have 2 companies: Company A and Company B. Company A
>> technology produces greek-letter-space reads. Company B technology
>> produces heat-space reads.
> 
> Ugh. I would hope that other companies are seeing the light (or
> history), and will just all go back to base-space. As an aside, I'm glad
> that iontorrent uses a base-space output. Unfortunately, there are still
> people who demand that programs should work directly for these new space
> sequences, often just because that's what comes off their machines:
> 
> http://seqanswers.com/forums/showpost.php?p=15245&postcount=7
> 

nilshomer is right in the sense that if a tool truly work in this native space,
then analysis will be better. However, developping these tools is not easy.

>> The driver equivalent for sequencers would be a base caller that
>> works with color-space or other underlying formats.
> 
> A generic program that converts anything into base-space would be nice,
> but I don't think it will be written by the sequencer companies -- they
> would care too much about their specific technology. More likely, it'll
> be written by a random programmer with too much time on their hands who
> is frustrated by all the different file formats that are being produced.
> 

I am not sure because it would require a lot of knowledge of the underlying sequencing technologies.
This information is proprietary and rarely available.

And nobody "jailbreak" their Illumina HiSeq so that out of the question.


We are having a fine discussion I think.

> -- David
> 
> Denovoassembler-devel mailing list
> Den...@li...
> https://lists.sourceforge.net/lists/listinfo/denovoassembler-devel
> 
                  Sébastien Boisvert
                  http://github.com/sebhtml/ray

[Denovoassembler-devel] RE : Colour-space loading / assembly now working (produces base-space output)

From: Sébastien B. <seb...@ul...> - 2011-07-21 17:16:19

> 
> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 19 juillet 2011 10:34
> À : den...@li...
> Objet : [Denovoassembler-devel] Colour-space loading / assembly now working (produces base-space output)
> 
> So, things are finally looking like the colour-space stuff is going
> somewhere. I've managed to get a good result from a Ray run with
> colour-space input (no errors, first bases included), and it might be a
> tad quicker than the sebhtml/ray colour-space assembler:
> 

Nice !

So, the reads are in color-space and the contigs are in nucleotide-space ?

If so, this is very cool !

> $ time ../../../sebgit/ray/code/Ray -p tests/phix/phix_5k_1.csfasta
> tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:'
> Number of contigs: 1
> Total length of contigs: 5380
> Number of contigs >= 500 nt: 1
> Total length of contigs >= 500 nt: 5380
> Number of scaffolds: 1
> Total length of scaffolds: 5380
> Number of scaffolds >= 500 nt: 1
> Total length of scaffolds >= 500: 5380
> 
> real    0m3.107s
> user    0m2.432s
> sys     0m0.652s
> 
> $ time ../code/Ray -p tests/phix/phix_5k_1.csfasta
> tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:'
> Number of contigs: 1
> Total length of contigs: 5243
> Number of contigs >= 500 nt: 1
> Total length of contigs >= 500 nt: 5243
> Number of scaffolds: 1
> Total length of scaffolds: 5243
> Number of scaffolds >= 500 nt: 1
> Total length of scaffolds >= 500: 5243
> 
> real    0m2.894s
> user    0m2.392s
> sys     0m0.484s
> 
> $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement |
> grep $(fasta_formatter -i RayOutput.Scaffolds.fasta | grep -v '^>') >
> /dev/null && echo "success (match in reverse direction)"
> success (match in reverse direction)
> 
> There's nothing particularly impressive with this (apart from it
> actually working...). For what it's worth, the sebhtml/ray also produces
> a correct sequence, if you feed it the right starting base:
> 

But mine does not produce the contigs in nucleotide space.

And I believe yours does !




> $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement |
> grep $((echo -n 'G'; tail -n +2 RayOutput.Scaffolds.fasta) |
> ~/scripts/cs2base.pl | fasta_formatter) > /dev/null && echo "success
> (match in forward direction)"
> success (match in forward direction)
> 
> However, this is probably more interesting, using a colour-space
> sequence for one end, and a base-space sequence for the other end:
>

What do you mean ?
 
> $ time ../../../sebgit/ray/code/Ray -p tests/phix/phix_5k_1.fasta
> tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:'
> Number of contigs: 3
> Total length of contigs: 15713
> Number of contigs >= 500 nt: 3
> Total length of contigs >= 500 nt: 15713
> Number of scaffolds: 3
> Total length of scaffolds: 15713
> Number of scaffolds >= 500 nt: 3
> Total length of scaffolds >= 500: 15713
> 
> real    0m4.246s
> user    0m3.428s
> sys     0m0.796s
> bioinf@thaliana:~/install/ray/git/ray/system-tests$ time ../code/Ray -p
> tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.csfasta | grep -A 7
> 'Number of contigs:'
> Number of contigs: 1
> Total length of contigs: 5243
> Number of contigs >= 500 nt: 1
> Total length of contigs >= 500 nt: 5243
> Number of scaffolds: 1
> Total length of scaffolds: 5243
> Number of scaffolds >= 500 nt: 1
> Total length of scaffolds >= 500: 5243
> 
> real    0m2.892s
> user    0m2.412s
> sys     0m0.456s
> 
> $ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement |
> grep $(fasta_formatter -i RayOutput.Scaffolds.fasta | grep -v '^>') >
> /dev/null && echo "success (match in reverse direction)"
> success (match in reverse direction)
> 
> [I haven't tried doing a match with the sebhtml/ray sequence.]
> 
> The commit that actually got my code working properly with colour-space
> is here:
> 
> https://github.com/gringer/ray/commit/fbd1efd4c3dbe31df0de820167a068736de648f5
> 

OK

> [I forgot to make the same changes for the KmerAcademyBuilder (lines
> 109-113) that I made for the VerticesExtractor (lines 110-114)]
> 

So the next steps is to test your work on the system tests I guess.

I see that your are 0 commit away from my branch. (Cool)

> -- David
> 
> _______________________________________________
> Denovoassembler-devel mailing list
> Den...@li...
> https://lists.sourceforge.net/lists/listinfo/denovoassembler-devel
> 
                  Sébastien Boisvert
                  http://github.com/sebhtml/ray

[Denovoassembler-devel] RE : RE : RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 17:08:45

flush_twice_nocomplement_v2.patch


This patch has 1 bug: rankToFlush must be computed with the lowest k-mer.

There is 1 reverseComplement in the design I described in the other email.

Basically, you don't call vertexRank, instead, you call the hash function on the lowest directly.



vertexrank_nocopy_nocomplement.patch


The method vertexRank is used elsewhere too, in places where reverseComplement is
not called before this call. 


> 
> Attached. Note that this is actually 2 reverse-complements, not 1,
> because a reverse-complement is needed in order to get the correct
> vertex rank. I suppose you could avoid this by modifying the vertex rank
> function to take a boolean "actually, don't bother with finding the
> lowest k-mer, I've already done that."
> 
> oh... and looking at the vertexRank function, you could eliminate a Kmer
> copy by using an if statement (see attached patch, which does both of
> these things, but doesn't modify the KmerAcademyBuilder.cpp function to
> take advantage of that).
> 
> FWIW, when I run this (both patches applied) on my phiX_5k data, it
> produces different coverage distributions each time. Does this make sense?
> 
> -- David
                  Sébastien Boisvert
                  http://github.com/sebhtml/ray

Re: [Denovoassembler-devel] RE : RE : Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 16:55:28

Attachments: flush_twice_nocomplement_v2.patch vertexrank_nocopy_nocomplement.patch

On 21/07/11 17:56, Sébastien Boisvert wrote:
> So basically, if your patch is modified to flush the lowest k-mer among a and b, then it should work fine.

Attached. Note that this is actually 2 reverse-complements, not 1, 
because a reverse-complement is needed in order to get the correct 
vertex rank. I suppose you could avoid this by modifying the vertex rank 
function to take a boolean "actually, don't bother with finding the 
lowest k-mer, I've already done that."

oh... and looking at the vertexRank function, you could eliminate a Kmer 
copy by using an if statement (see attached patch, which does both of 
these things, but doesn't modify the KmerAcademyBuilder.cpp function to 
take advantage of that).

FWIW, when I run this (both patches applied) on my phiX_5k data, it 
produces different coverage distributions each time. Does this make sense?

-- David

[Denovoassembler-devel] RE : RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 15:56:18

No need to be sorry !

Basically, this code extracts the k-mers from reads and send them to MPI peers.

2 reverse-complement k-mers will always go to the same destination.


In fact, only the lowest k-mer needs to be appended as the other is not used anyway in 

MessageProcessor::call_RAY_MPI_TAG_KMER_ACADEMY_DATA (communication/MessageProcessor.cpp)



call_RAY_MPI_TAG_KMER_ACADEMY_DATA processes messages with tag RAY_MPI_TAG_KMER_ACADEMY_DATA


So basically, if your patch is modified to flush the lowest k-mer among a and b, then it should work fine.




> On 21/07/11 17:39, Sébastien Boisvert wrote:
>> I think there is a bug:
>>
>> In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice !
> 
> Ah, sorry, it does actually add the raw bits of a/b as well. I just
> noticed that the rankToFlush would be the same, and didn't think about
> copying the actual sequence.
> 
> -- David

                                                     Sébastien

[Denovoassembler-devel] RE : RE : [Denovoassembler-users] No observed peak in coverage distribution (454 test data)

From: Sébastien B. <seb...@ul...> - 2011-07-21 15:45:05

Patch work should only be on -devel please !


> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 21 juillet 2011 11:35
> À : Sébastien Boisvert
> Cc : den...@li...
> Objet : Re: RE : [Denovoassembler-users] No observed peak in coverage distribution (454 test data)
> 
> On 21/07/11 16:52, Sébastien Boisvert wrote:
>> Can you send me (pastebin) the CoverageDistribution.txt file for each of your runs.
>> I will add them to my unit tests.
> 
> It gets more weird. With the flush patch, Ray is able to attempt an
> assembly, presumably because coverage(0) < coverage(2):
> 
> http://pastebin.com/mK6FGR2B
> 
> Rank 0: the minimum coverage is 0
> Rank 0: the peak coverage is 2
> ...
> Number of contigs: 153
> Total length of contigs: 49920
> Number of contigs >= 500 nt: 29
> Total length of contigs >= 500 nt: 19433
> Number of scaffolds: 153
> Total length of scaffolds: 49920
> Number of scaffolds >= 500 nt: 29
> Total length of scaffolds >= 500: 19433
> 
> Without the flush patch, Ray panics from a bad coverage distribution:
> 
> http://pastebin.com/FqumBBFH
> 
> Rank 0: the minimum coverage is 1
> Rank 0: the peak coverage is 1
> Rank 0: Assembler panic: no peak observed in the k-mer coverage
> distribution.
> Rank 0: to deal with the sequencing error rate, try to lower the k-mer
> length (-k)
> Rank 9: sent 203748 messages, received 203749 messages.
> Rank 8: sent 203487 messages, received 203488 messages.
> Rank 7: sent 204303 messages, received 204304 messages.
> Rank 6: sent 204861 messages, received 204862 messages.
> Rank 5: sent 203780 messages, received 203781 messages.
> Rank 4: sent 203762 messages, received 203763 messages.
> Rank 3: sent 203870 messages, received 203871 messages.
> Rank 2: sent 204344 messages, received 204345 messages.
> Rank 1: sent 204085 messages, received 204086 messages.
> Rank 0: sent 203614 messages, received 203604 messages.
> 
>> Also, do you have a place where I could upload you data-for-unit-tests and data-for-system-tests so that
>> you can run tests ?
> 
> Unfortunately no. I have plenty of storage space on our Illumina server
> (and am expecting ~300TB sometime soon), can download as much as I want
> from here, but the internal network is not accessible from external
> sites (so I don't think I can set up an upload service).
> 
> -- David
> 

                                                     Sébastien

Re: [Denovoassembler-devel] RE : Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:44:49

On 21/07/11 17:39, Sébastien Boisvert wrote:
> I think there is a bug:
>
> In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice !

Ah, sorry, it does actually add the raw bits of a/b as well. I just 
noticed that the rankToFlush would be the same, and didn't think about 
copying the actual sequence.

-- David

[Denovoassembler-devel] RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 15:43:12

Yes, it will double all the numbers.

I saw that in your patch !

See my other email.

If you have time, you can implement the pseudo-algorithm I devised in the other email.

It should be safe.

Also, I would need you to use git format-patch (see code/Submit-a-patch.txt) should
you submit me a patch that I would merge because otherwise the Author field in the commit 
would not be properly populated. (You would be the author and I would be the committer).

> On 21/07/11 17:07, David Eccles (gringer) wrote:
>> Okay, I've attached a patch that just does the flush twice without any
>> additional reverse complelements.
> 
> Er, sorry, that patch seems to alter the coverage distribution. The unit
> tests I was able to do were successful, but I guess the coverage was
> important this time.
> 
> phix_5k_1/2 with patch:
> 
> http://pastebin.com/1nnezJrQ
> 
> phix_5k_1/2 without patch:
> 
> http://pastebin.com/d3PNUBfQ
> 
> -- David

                                                     Sébastien

[Denovoassembler-devel] RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 15:39:40

I think there is a bug:

In your for loop, the second iteration (flush=1) should add b in the m_bufferedData, not a twice !



What I had in mind (pseudo C++):

/* get the pair of k-mers */
Kmer a = kmerAtPosition()
Kmer b=a.reverseComplement()

/* compute the destination */
Kmer rankToFlush=-1;

if(a<b)
	rankToFlush=a.hash_function_1()%m_parameters.getSize()
else
	rankToFlush=b.hash_function_1()%m_parameters.getSize()


/* push a to destination and flush if necessary */
push a to rankToFlush usin bufferedData.addAt
bufferedData.flush()

/* push b to destination and flush if necessary */
push b to rankToFlush usins bufferedData.addAt
bufferedData.flush()


This has only a call to reverseComplement and do the same as refs/master


> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 21 juillet 2011 11:07
> À : Sébastien Boisvert
> Cc : den...@li...
> Objet : Re: Why flush both k-mer and its complement?
> 
> On 21/07/11 16:40, Sébastien Boisvert wrote:
>> Although both k-mers will go on the same MPI rank,
>> the first flush may generate an actual MPI message on the network.
>> As such, the second flush won't flush anything because the buffer was just reset.
>> On the other hand, if the first k-mer does not generate a message when flush is called,
>> then the second flush may do so.
>>
>> Feel free to send me a patch for this !
> 
> Okay, I've attached a patch that just does the flush twice without any
> additional reverse complelements.
> 
> -- David
> 

                                                     Sébastien

Re: [Denovoassembler-devel] Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:23:07

On 21/07/11 17:07, David Eccles (gringer) wrote:
> Okay, I've attached a patch that just does the flush twice without any
> additional reverse complelements.

Er, sorry, that patch seems to alter the coverage distribution. The unit 
tests I was able to do were successful, but I guess the coverage was 
important this time.

phix_5k_1/2 with patch:

http://pastebin.com/1nnezJrQ

phix_5k_1/2 without patch:

http://pastebin.com/d3PNUBfQ

-- David

Re: [Denovoassembler-devel] Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 15:08:28

Attachments: flush_twice_nocomplement.patch

On 21/07/11 16:40, Sébastien Boisvert wrote:
> Although both k-mers will go on the same MPI rank,
> the first flush may generate an actual MPI message on the network.
> As such, the second flush won't flush anything because the buffer was just reset.
> On the other hand, if the first k-mer does not generate a message when flush is called,
> then the second flush may do so.
>
> Feel free to send me a patch for this !

Okay, I've attached a patch that just does the flush twice without any 
additional reverse complelements.

-- David

[Denovoassembler-devel] RE : Why flush both k-mer and its complement?

From: Sébastien B. <seb...@ul...> - 2011-07-21 14:40:51

Short answer: no.

Although both k-mers will go on the same MPI rank,
the first flush may generate an actual MPI message on the network.

As such, the second flush won't flush anything because the buffer was just reset.

On the other hand, if the first k-mer does not generate a message when flush is called,
then the second flush may do so.



I think you bring a good point however.

complementVertex being the function that consumes the most CPU cycles (according to GNU gprof), clearly,
there is room here for improvement (i.e. as you underline using 1 call instead of 3 would be better).

I added a TODO.

Feel free to send me a patch for this !

see code/Submit-a-patch.txt


> ________________________________________
> De : David Eccles (gringer) [dav...@mp...]
> Date d'envoi : 21 juillet 2011 05:46
> À : Sébastien Boisvert; den...@li...
> Objet : Why flush both k-mer and its complement?
> 
> I notice that in assembler/KmerAcademyBuilder.cpp:120-139 you have:
> 
> 120 rankToFlush=m_parameters->_vertexRank(&a);
> 121 [add <rankToFlush> to pending flush messages]
> 130 Kmer b =
> complementVertex(&a,wordSize,m_parameters->getColorSpaceMode());
> 132 rankToFlush=m_parameters->_vertexRank(&b);
> 133 [add <rankToFlush> to pending flush messages]
> 
> but the vertexRank function returns the hash of the lowest k-mer:
> 
> core/Parameters.cpp:1200 (_vertexRank)
> return vertexRank(a,m_size,m_wordSize,m_colorSpaceMode);
> 
> core/common_functions.cpp (vertexRank)
> 134 Kmer b=complementVertex(a,w,color);
> 135 if(a->isLower(&b))
> 136     b=*a;
> 137 return hash_function_1(&b)%(_size);
> 
> So the code appears to be doing 3 complement vertex operations
> (KmerAcademyBuilder.cpp:130, common_functions.cpp:134 * 2) for each
> flush. Could the second flush (i.e. KmerAcademyBuilder.cpp:130-139) be
> removed without loss of function?
> 
> -- David
> 

                                                     Sébastien

[Denovoassembler-devel] Why flush both k-mer and its complement?

From: David E. (gringer) <dav...@mp...> - 2011-07-21 09:47:50

I notice that in assembler/KmerAcademyBuilder.cpp:120-139 you have:

120 rankToFlush=m_parameters->_vertexRank(&a);
121 [add <rankToFlush> to pending flush messages]
130 Kmer b = 
complementVertex(&a,wordSize,m_parameters->getColorSpaceMode());
132 rankToFlush=m_parameters->_vertexRank(&b);
133 [add <rankToFlush> to pending flush messages]

but the vertexRank function returns the hash of the lowest k-mer:

core/Parameters.cpp:1200 (_vertexRank)
return vertexRank(a,m_size,m_wordSize,m_colorSpaceMode);

core/common_functions.cpp (vertexRank)
134 Kmer b=complementVertex(a,w,color);
135 if(a->isLower(&b))
136     b=*a;
137 return hash_function_1(&b)%(_size);

So the code appears to be doing 3 complement vertex operations 
(KmerAcademyBuilder.cpp:130, common_functions.cpp:134 * 2) for each 
flush. Could the second flush (i.e. KmerAcademyBuilder.cpp:130-139) be 
removed without loss of function?

-- David

[Denovoassembler-devel] Colour-space loading / assembly now working (produces base-space output)

From: David E. (gringer) <dav...@mp...> - 2011-07-19 14:35:23

So, things are finally looking like the colour-space stuff is going 
somewhere. I've managed to get a good result from a Ray run with 
colour-space input (no errors, first bases included), and it might be a 
tad quicker than the sebhtml/ray colour-space assembler:

$ time ../../../sebgit/ray/code/Ray -p tests/phix/phix_5k_1.csfasta 
tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:'
Number of contigs: 1
Total length of contigs: 5380
Number of contigs >= 500 nt: 1
Total length of contigs >= 500 nt: 5380
Number of scaffolds: 1
Total length of scaffolds: 5380
Number of scaffolds >= 500 nt: 1
Total length of scaffolds >= 500: 5380

real	0m3.107s
user	0m2.432s
sys	0m0.652s

$ time ../code/Ray -p tests/phix/phix_5k_1.csfasta 
tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:'
Number of contigs: 1
Total length of contigs: 5243
Number of contigs >= 500 nt: 1
Total length of contigs >= 500 nt: 5243
Number of scaffolds: 1
Total length of scaffolds: 5243
Number of scaffolds >= 500 nt: 1
Total length of scaffolds >= 500: 5243

real	0m2.894s
user	0m2.392s
sys	0m0.484s

$ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | 
grep $(fasta_formatter -i RayOutput.Scaffolds.fasta | grep -v '^>') > 
/dev/null && echo "success (match in reverse direction)"
success (match in reverse direction)

There's nothing particularly impressive with this (apart from it 
actually working...). For what it's worth, the sebhtml/ray also produces 
a correct sequence, if you feed it the right starting base:

$ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | 
grep $((echo -n 'G'; tail -n +2 RayOutput.Scaffolds.fasta) | 
~/scripts/cs2base.pl | fasta_formatter) > /dev/null && echo "success 
(match in forward direction)"
success (match in forward direction)

However, this is probably more interesting, using a colour-space 
sequence for one end, and a base-space sequence for the other end:

$ time ../../../sebgit/ray/code/Ray -p tests/phix/phix_5k_1.fasta 
tests/phix/phix_5k_2.csfasta | grep -A 7 'Number of contigs:'
Number of contigs: 3
Total length of contigs: 15713
Number of contigs >= 500 nt: 3
Total length of contigs >= 500 nt: 15713
Number of scaffolds: 3
Total length of scaffolds: 15713
Number of scaffolds >= 500 nt: 3
Total length of scaffolds >= 500: 15713

real	0m4.246s
user	0m3.428s
sys	0m0.796s
bioinf@thaliana:~/install/ray/git/ray/system-tests$ time ../code/Ray -p 
tests/phix/phix_5k_1.fasta tests/phix/phix_5k_2.csfasta | grep -A 7 
'Number of contigs:'
Number of contigs: 1
Total length of contigs: 5243
Number of contigs >= 500 nt: 1
Total length of contigs >= 500 nt: 5243
Number of scaffolds: 1
Total length of scaffolds: 5243
Number of scaffolds >= 500 nt: 1
Total length of scaffolds >= 500: 5243

real	0m2.892s
user	0m2.412s
sys	0m0.456s

$ fasta_formatter -i tests/phix/phix.fasta | fastx_reverse_complement | 
grep $(fasta_formatter -i RayOutput.Scaffolds.fasta | grep -v '^>') > 
/dev/null && echo "success (match in reverse direction)"
success (match in reverse direction)

[I haven't tried doing a match with the sebhtml/ray sequence.]

The commit that actually got my code working properly with colour-space 
is here:

https://github.com/gringer/ray/commit/fbd1efd4c3dbe31df0de820167a068736de648f5

[I forgot to make the same changes for the KmerAcademyBuilder (lines 
109-113) that I made for the VerticesExtractor (lines 110-114)]

-- David

Flat | Threaded

<< < 1 2 3 4 5 > >> (Page 4 of 5)

2011	Jan	Feb	Mar	Apr	May	Jun	Jul (42)	Aug (4)	Sep	Oct	Nov	Dec
2012	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct (2)	Nov (8)	Dec (4)
2013	Jan (6)	Feb (21)	Mar	Apr	May (3)	Jun	Jul	Aug	Sep (1)	Oct (3)	Nov	Dec
2014	Jan	Feb (8)	Mar (10)	Apr	May (1)	Jun (5)	Jul	Aug (1)	Sep	Oct	Nov	Dec