denovoassembler-devel Mailing List for Ray: scalable assembly (Page 5)
Ray -- Parallel genome assemblies for parallel DNA sequencing
Brought to you by:
sebhtml
From: David E. (gringer) <dav...@mp...> - 2011-07-19 08:33:01
On 18/07/11 19:43, Sébastien Boisvert wrote:
>> I had to modify that section, because the alternative was broken
>> code that wouldn't assemble correctly.
> OK. So you think there was some sort of interaction going on? How
> did you fix it (provide commit)?

https://github.com/gringer/ray/commit/075ea052a6949a0974ac7d12bcec238f2cf49b58
[SeedingData.cpp, not test_phiX.sh]

I still think that the problem of a "perfect" phiX assembly not working was due to a parent node that wasn't considered for a seed start, with all of the child nodes having 1 edge in, 1 edge out. My hypothesis is that the seed start node had low coverage, so wasn't considered a good choice. I understand why this is a reasonable decision, but think that a better fix to this problem would be to also consider low-coverage nodes in SeedWorker.cpp when evaluating if a parent node is a better fit for the start of a seed -- for example, by adding an m_SEEDING_coverage_test_done check as well (or integrating that with the SEEDING_1_1_test).

> You are right, systematic errors confuse Ray and sequencing errors
> consume a lot of memory. It is therefore wise to filter errors. Q20
> (86% good) is adequate but I think using Q30 (95% good) is too
> aggressive.

Q10 (assuming Phred qualities) is a base call accuracy of 90% (i.e. 1 error in 10), Q20 is 1 in 100. Every 10 is another 'nine' for quality -- 4 nines = 99.99% = Q40. For Q above 10, it's essentially the same value for Phred and Sanger quality scores.

http://en.wikipedia.org/wiki/Phred_quality_score#Reliability

>> One typical example is library changes -- code may have worked fine
>> at an earlier date, but you need to include a change made in some
>> future build in order to get it working for the more modern
>> libraries.
> I agree. That is why the dependencies for Ray are basically only the
> C++ standard library (which includes the C library as well, I believe)
> and a message-passing-interface library.
Yes, any C program is considered a valid C++ program as well (there are some quirks to make this work properly). Given that you've used the boost random number generator for the read simulator, one other option that I would like for the Ray code would be the boost gzip and bzip2 filters (from Iostreams):

http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/bzip2.html#basic_bzip2_decompressor
http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/gzip.html#basic_gzip_decompressor

This would allow the use of the more C++-like iostream method of file access, even for compressed files, which means being able to use strings for all file access, rather than just for the uncompressed files (i.e. no more concerns about null-terminated strings). However, after doing many Debian updates, I am aware of how often the boost libraries change, and would not suggest having boost libraries as a core dependency. Once I'm done with the colour-space stuff, I'll see about making a more generic file loader class that can handle all formats.

> There is also the command 'git stash', but I have not used it yet.

Thanks for pointing that out, I wasn't aware of this command. It looks like it doesn't work for remote repositories, so maybe I should just set up a temporary 'broken' or 'stash' branch when I'm doing things like that.

> For me, the hardest stuff are bugs occurring in parallel. Checkpoints
> would help also, but I have not figured an easy way to implement them
> in Ray yet.

Sorry, I don't think I understand. I was under the impression that Ray already had stopping points, where it waits for all the other nodes to finish before continuing. I guess this is a different kind of checkpoint?

> How many gigabases are generated with one HiScan SQ run ?

From the human RNA run we did, there were about 40-50M sequences per lane, with each sequence having 50bp (paired end, so 100bp, I guess). With 8 lanes, that's about 45*8*100 ~= 36 gigabases per run.
The specification is a bit higher than that (67-75 Gb), but it's in the same ball-park:

http://www.illumina.com/systems/hiscansq.ilmn#workflow_specs

> But to do hybrid assemblies, you would have to convert everything to
> color-space, right?

Yeah, it's annoying that colour-space is the common denominator. But at least the conversion from base-space to colour-space doesn't lose information. For any given base-space sequence, there is only one valid colour-space sequence (as long as you include a starting base in the colour-space sequence).

> So imagine you have 2 companies: Company A and Company B. Company A
> technology produces greek-letter-space reads. Company B technology
> produces heat-space reads.

Ugh. I would hope that other companies are seeing the light (or history), and will just all go back to base-space. As an aside, I'm glad that Ion Torrent uses a base-space output. Unfortunately, there are still people who demand that programs should work directly with these new space sequences, often just because that's what comes off their machines:

http://seqanswers.com/forums/showpost.php?p=15245&postcount=7

> The driver equivalent for sequencers would be a base caller that
> works with color-space or other underlying formats.

A generic program that converts anything into base-space would be nice, but I don't think it will be written by the sequencer companies -- they would care too much about their specific technology. More likely, it'll be written by a random programmer with too much time on their hands who is frustrated by all the different file formats that are being produced.

-- David
From: Sébastien B. <seb...@ul...> - 2011-07-18 17:43:38
Hi David,

> ________________________________________
> From: David Eccles (gringer) [dav...@mp...]
> Sent: 18 July 2011 10:12
> To: Sébastien Boisvert; den...@li...
> Subject: Random thoughts about Ray
>
> On 15/07/11 14:40, Sébastien Boisvert wrote:
>>> So, for real phiX data, I prepared the sequence by masking out
>>> bases with quality scores less than 20, and then filtering out the
>>> 'N' bases
>> You don't need to filter anything with Ray.
>
> You have a great confidence in Ray's ability to deal with any type of
> sequencing mistake, but I'm afraid I don't share that confidence. There
> is no way I would carry out a genome assembly on data that only had low
> quality scores, and I would expect that garbage input would only serve
> to confuse Ray and make it spend a bit more time processing that data.

You are right, systematic errors confuse Ray and sequencing errors consume a lot of memory. It is therefore wise to filter errors. Q20 (86% good) is adequate but I think using Q30 (95% good) is too aggressive. Also, sometimes adapter sequences are messy and hard to remove. But if sequencing errors are random, then erroneous k-mers will not be redundant. (See our paper in the Journal of Computational Biology.)

> On 13/07/11 17:50, Sébastien Boisvert wrote:
>> Well, as I understand, you did not respect one simple rule of version
>> control:
>>
>> * Never commit broken code.
>
> I'm sorry I did that. I did try to keep to the rules, but I'm new to Git
> (mostly on the push/pull side), and to Ray, so occasionally I make
> mistakes in my commits.

I started using git a few weeks ago, and I must say I am very satisfied in comparison with my experience with subversion. Initially, I committed broken code too, and I still do occasionally. Finding regressions in my own bugs can be hard, so I guess finding regressions in your code would be harder because of the different programming style.
Since I started working on Ray, I have done a lot of refactoring, and one other important thing I learned is to only work on one particular thing at once. Trying to deal with many features or bugs at the same time usually does not work well.

>> As Linus Torvalds would say: commit early, commit often.
>
> This seems to contradict the statement you made in the same email. I
> have some experience doing bisects (particularly with Wine), so I'm
> aware of the pain involved when things don't compile, but I also know
> that sometimes it can't be helped
> [http://wiki.winehq.org/RegressionTesting#head-020f8ea312cecdbccdb0d7936f07b9a39b105791].

I completely agree with you here. Sometimes bugs are just side effects of other changes. Two months ago, I recall spending 3 weeks on a single bug that caused a parallel infinite loop. One rank was looping in an infinite fashion while all the other peers were waiting for a reply from the buggy peer. I still think that 'commit early, commit often' is fine as long as the committed code works.

> One typical example is library changes -- code may have worked fine at
> an earlier date, but you need to include a change made in some future
> build in order to get it working for the more modern libraries.

I agree. That is why the dependencies for Ray are basically only the C++ standard library (which includes the C library as well, I believe) and a message-passing-interface library.

>> Basically, you must search your own commits to find where you
>> screwed. But you can not do that because you are committing broken
>> code to your fork.
>
> FWIW, I got into a state where I had 2-3 days' worth of code to add, and
> I wanted to be able to commit at least once per day (so I could do
> additional work from home, if I had inspiration).

I know what you mean; when I am stuck on buggy code and want to change computer, I usually rsync my work. There is also the command 'git stash', but I have not used it yet.
> This meant that the
> code was broken, in parts, but I was aware of how to fix those parts, so
> it didn't make much difference to me when bisecting. The bisect function
> of git has a 'skip' function that allows you to bypass certain commits,
> which still reduces the number of manual searches that need to be done
> on code changes.

But I guess you must read somewhere (for instance in the commit message) whether a given commit must be skipped or not. Right? This is a task I don't want to do.

>> Furthermore, I am not sure that all the emails about you debugging
>> the bugs you introduced in your forks are of general interest for
>> denovoassembler-users. I therefore created a development mailing list
>> (should be up in maximum 24 hours).
>
> Thanks for doing this. I like being able to CC a list, but didn't feel
> entirely comfortable talking about code details on a -users list.

OK.

> On 14/07/11 02:00, Sébastien Boisvert wrote:
>>> So changing that is easy enough to do, it just means the Ray run
>>> might take tens of seconds to finish, rather than seconds
>> Yes, but beware that this simple test will fail to catch major bugs.
>
> I'm aware that a full test would be a good idea, but would be interested
> to know what kinds of major bugs you would expect would crop up that are
> not evident in an attempted assembly of a small genome. I guess memory
> allocation bugs and message processing would be examples of things that
> may not be caught for a small genome.

It depends on which part of the code you work on. For color-space, you definitely want to write unit tests. But that is not always enough. When I added the MAXKMERLENGTH option, I basically committed code that did not work for k values above 32, but the code still worked for classical values. I had a lot of things to modify in the communication layer, because passing a 32-mer is not the same as passing a 64-mer. I remember this part as being quite hard to do and quite time-consuming.
This migration was test-driven.

For the parallel approach of Ray (to learn where bugs can show up), I suggest you read recent code I added; it is pretty clean and mature. For example:

assembler/Partitioner.cpp
communication/NetworkTest.cpp
scaffolder/Scaffolder.cpp

For me, the hardest stuff are bugs occurring in parallel. Checkpoints would help also, but I have not figured out an easy way to implement them in Ray yet.

> On 13/07/11 17:42, Sébastien Boisvert wrote:
>> The code you are trying to modify is very stable and should not be
>> modified. This code manages the living workers.
>
> As opposed to this other section of code, which also manages the living
> workers (line 21, sequencesIndexer.cpp)?

I did some work on this bug, and I am pretty sure the "memory leak" is related to the C++ standard library doing too many memory allocation calls. For instance, if you use a vector<T>, the used allocator will be allocator<T>. But allocator<T> probably utilises malloc and free (or something related), and these scale very badly when a lot of calls are made. That causes intense memory fragmentation. I actually implemented a memory compactor (see memory/DefragmentationGroup.cpp for a nice read). For example, if I set the maximum number of workers to 1, I think I don't get that problem. Anyway, the code works, system tests pass and unit tests pass too.

>>> /* TODO: find the memory leak in this file -- during the selection
>>> of optimal read markers, the memory goes up ? */
>
> The bits I was looking at weren't quite the same area of code, but I
> had narrowed down my assembly problem to that specific section of code,
> which just happened to be just after the "selection of optimal read
> markers". I had to modify that section, because the alternative was
> broken code that wouldn't assemble correctly.

OK. So you think there was some sort of interaction going on? How did you fix it (provide commit)?

>> Does your institution have a SOLiD 5500xl?
> No, we have a SOLiD 4 that is collecting dust and has yet to be used.

I think people bought SOLiD-powered instruments because Applied Biosystems (Life Technologies) produced the ABI 3730xl. The 3730xl was the technology used for the human genome project. At least those were working properly!

> I'm not particularly happy with colour-space myself, but if I'll be
> working with it in the future, then I need to be confident that I can
> process the data that comes off that machine. The SOLiD sequencing data
> that I'm currently working with was done by another lab.

I presently work at the Ontario Institute for Cancer Research at the MaRS Discovery District in Toronto, and they have numerous SOLiD sequencers, but basically the error rate of these SOLiD sequencers is higher than expected... On the other hand, Illumina TruSeq technology actually yields true sequences! (Sorry for the silly word play.)

> On the other hand, our Illumina sequencer (HiScan SQ) has had at least
> one successful run (possibly 3), and should get a fair bit of use over
> the next few months. This is nice for me, because the base-space output
> is much less confusing to handle, and there are many more applications
> that will work in base-space.

This is also my opinion. Furthermore, the error rate of SOLiD sequencers is so high that it becomes very hard to analyze the related data. With SOLiD technology, you also eliminate a lot of data when filtering with quality values. This increases the cost per base, which is not desirable. You basically double the cost when weeding out 50% of the reads.

How many gigabases are generated with one HiScan SQ run?

>> Life Technologies should write a base caller software that transforms
>> these color-space files into fastq files. All major vendors
>> (Illumina, 454, Pacific Biosciences, Ion Torrent, Helicos, and
>> probably Complete Genomics) do that already.
> It doesn't really make sense to do the colour-space conversion until
> after the assembly/mapping has been done. It's a fundamentally different
> technology which uses labelled dinucleotides to pair with DNA sequence.
> Here's a somewhat fabricated example that demonstrates how
> post-processing conversion can be useful:

I know, but regardless, the desired result is sequence data, not fancy colored reads.

> |> 1_61_1487_F3
> |T[3.03231010100100123]2112320311233112131121011303123
> |> 1181_1125_1187_F3
> |T2.0032320031212023231321020332[3203231010100100123]0
>
> I've bracketed the similar sequences from both samples. If the
> conversion to base-space were done on the sequencer, both sequences
> would be effectively useless. However, using colour-space information
> from '1181_1125_1187_F3', you can fill in the second colour for
> '1_61_1487_F3', and hence work out the entire sequence. You could say
> that Life Technologies should just write their own colour-space de-novo
> assembler in order to resolve base errors, but then it's just working on

I don't think so. They should have developed something that enables them to output base-space reads as well as color-space reads. And that is presumably what they will do with their 5500xl: "The 5500xl will feature exact call chemistry, new multiple base encoding for improved accuracy, which will allow the data to be output in Base Space or Color Space." Source: https://www.genomics.upenn.edu/HTSTechnologies

In my opinion, sequencing by ligation sounds nice but must be terribly hard to debug from an engineering point of view.

> colour-space, which removes the error-reduction advantage of hybrid
> genome assemblers.

But to do hybrid assemblies, you would have to convert everything to color-space, right?

<Silly example>
So imagine you have 2 companies: Company A and Company B. Company A technology produces greek-letter-space reads. Company B technology produces heat-space reads.
Converting these reads to nucleotide-space is impossible because of error rates, but both companies say that software working directly with their read spaces should do well. Now imagine companies A, B, C, ..., X, Y, Z, each promoting a different read space. The problem becomes the same as Linux driver development: you just need a common interface so each of them can be integrated. Then, userland software that sends packets over a network is not aware of these numerous drivers, but it still does well enough! The driver equivalent for sequencers would be a base caller that works with color-space or other underlying formats. The only case where people will adopt an exotic read space is when this new read space provides a significant leap forward or a technological advance with applications.
</Silly example>

I think '.' in color-space reads makes the problem harder. Otherwise, you just have to look for enriched k-mers in the space of {0,1,2,3}^k, right? The option -write-kmers writes a file on the disk containing all the k-mers in color-space along with their counts and arcs.

> Thanks for your time,
> -- David Eccles (gringer)

Sébastien
From: David E. (gringer) <dav...@mp...> - 2011-07-18 14:13:20
On 15/07/11 14:40, Sébastien Boisvert wrote:
>> So, for real phiX data, I prepared the sequence by masking out
>> bases with quality scores less than 20, and then filtering out the
>> 'N' bases
> You don't need to filter anything with Ray.

You have a great confidence in Ray's ability to deal with any type of sequencing mistake, but I'm afraid I don't share that confidence. There is no way I would carry out a genome assembly on data that only had low quality scores, and I would expect that garbage input would only serve to confuse Ray and make it spend a bit more time processing that data.

On 13/07/11 17:50, Sébastien Boisvert wrote:
> Well, as I understand, you did not respect one simple rule of version
> control:
>
> * Never commit broken code.

I'm sorry I did that. I did try to keep to the rules, but I'm new to Git (mostly on the push/pull side), and to Ray, so occasionally I make mistakes in my commits.

> As Linus Torvalds would say: commit early, commit often.

This seems to contradict the statement you made in the same email. I have some experience doing bisects (particularly with Wine), so I'm aware of the pain involved when things don't compile, but I also know that sometimes it can't be helped [http://wiki.winehq.org/RegressionTesting#head-020f8ea312cecdbccdb0d7936f07b9a39b105791]. One typical example is library changes -- code may have worked fine at an earlier date, but you need to include a change made in some future build in order to get it working for the more modern libraries.

> Basically, you must search your own commits to find where you
> screwed. But you can not do that because you are committing broken
> code to your fork.

FWIW, I got into a state where I had 2-3 days' worth of code to add, and I wanted to be able to commit at least once per day (so I could do additional work from home, if I had inspiration).
This meant that the code was broken, in parts, but I was aware of how to fix those parts, so it didn't make much difference to me when bisecting. The bisect function of git has a 'skip' function that allows you to bypass certain commits, which still reduces the number of manual searches that need to be done on code changes.

> Furthermore, I am not sure that all the emails about you debugging
> the bugs you introduced in your forks are of general interest for
> denovoassembler-users. I therefore created a development mailing list
> (should be up in maximum 24 hours).

Thanks for doing this. I like being able to CC a list, but didn't feel entirely comfortable talking about code details on a -users list.

On 14/07/11 02:00, Sébastien Boisvert wrote:
>> So changing that is easy enough to do, it just means the Ray run
>> might take tens of seconds to finish, rather than seconds
> Yes, but beware that this simple test will fail to catch major bugs.

I'm aware that a full test would be a good idea, but would be interested to know what kinds of major bugs you would expect would crop up that are not evident in an attempted assembly of a small genome. I guess memory allocation bugs and message processing would be examples of things that may not be caught for a small genome.

On 13/07/11 17:42, Sébastien Boisvert wrote:
> The code you are trying to modify is very stable and should not be
> modified. This code manages the living workers.

As opposed to this other section of code, which also manages the living workers (line 21, sequencesIndexer.cpp)?

> /* TODO: find the memory leak in this file -- during the selection
> of optimal read markers, the memory goes up ? */

The bits I was looking at weren't quite the same area of code, but I had narrowed down my assembly problem to that specific section of code, which just happened to be just after the "selection of optimal read markers".
I had to modify that section, because the alternative was broken code that wouldn't assemble correctly.

> Does your institution have a SOLiD 5500xl ?

No, we have a SOLiD 4 that is collecting dust and has yet to be used. I'm not particularly happy with colour-space myself, but if I'll be working with it in the future, then I need to be confident that I can process the data that comes off that machine. The SOLiD sequencing data that I'm currently working with was done by another lab.

On the other hand, our Illumina sequencer (HiScan SQ) has had at least one successful run (possibly 3), and should get a fair bit of use over the next few months. This is nice for me, because the base-space output is much less confusing to handle, and there are many more applications that will work in base-space.

> Life Technologies should write a base caller software that transforms
> these color-space files into fastq files. All major vendors
> (Illumina, 454, Pacific Biosciences, Ion Torrent, Helicos, and
> probably Complete Genomics) do that already.

It doesn't really make sense to do the colour-space conversion until after the assembly/mapping has been done. It's a fundamentally different technology which uses labelled dinucleotides to pair with DNA sequence. Here's a somewhat fabricated example that demonstrates how post-processing conversion can be useful:

|> 1_61_1487_F3
|T[3.03231010100100123]2112320311233112131121011303123
|> 1181_1125_1187_F3
|T2.0032320031212023231321020332[3203231010100100123]0

I've bracketed the similar sequences from both samples. If the conversion to base-space were done on the sequencer, both sequences would be effectively useless. However, using colour-space information from '1181_1125_1187_F3', you can fill in the second colour for '1_61_1487_F3', and hence work out the entire sequence.
You could say that Life Technologies should just write their own colour-space de-novo assembler in order to resolve base errors, but then it's just working on colour-space, which removes the error-reduction advantage of hybrid genome assemblers.

Thanks for your time,
-- David Eccles (gringer)
From: Sébastien B. <seb...@ul...> - 2011-07-15 12:45:30
> ________________________________________
> From: David Eccles (gringer) [dav...@mp...]
> Sent: 15 July 2011 07:27
> To: Sébastien Boisvert; den...@li...
> Subject: Re: Confused about coding -- completed seeds without distributions
>
> On 15/07/11 12:47, David Eccles (gringer) wrote:
>> I'll have a look at what comes out if I break the circularity of this
>> DNA by removing sequence that includes the start and end pieces together.
>
> Done, sort of. Sorry, I think I forgot to link in the circular reads
> sequence in my previous email, so here are both of them:
>
> http://user.interface.org.nz/~gringer/hacking/head100k_interleaved_noN_phiX_region1101.fasta
> http://user.interface.org.nz/~gringer/hacking/linearised_head100k_interleaved_noN_phiX_region1101.fasta
>
> Curiously, Ray finds more seeds in another rank (rank 6) by removing
> this connecting sequence:

Simply because Ray is based on peer-to-peer computation -- no rank will ever do the same computation.

> $ mpirun -np 10 ../../code/Ray -i ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta
>
> Rank 0 is extending seeds [0/0] (completed)
> Rank 5 is extending seeds [0/0] (completed)
> Rank 5 extended 0 seeds out of 0 (-nan%)
> Rank 4 is extending seeds [0/0] (completed)
> Rank 4 extended 0 seeds out of 0 (-nan%)
> Rank 7 is extending seeds [0/0] (completed)
> Rank 7 extended 0 seeds out of 0 (-nan%)
> Rank 2 is extending seeds [0/0] (completed)
> Rank 2 extended 0 seeds out of 0 (-nan%)
> Rank 3 is extending seeds [0/0] (completed)
> Rank 3 extended 0 seeds out of 0 (-nan%)
> Rank 1 is extending seeds [0/0] (completed)
> Rank 1 extended 0 seeds out of 0 (-nan%)
> Rank 8 is extending seeds [0/0] (completed)
> Rank 8 extended 0 seeds out of 0 (-nan%)
> Rank 6 is extending seeds [1/1]
> Rank 6 starts on a seed, length is 4063 [0/1]
> Rank 6 reached 0 vertices (GCAGGTTGGATACGCCAATCA) from seed 1
> Rank 9 is extending seeds [1/1]
> Rank 9 starts on a seed, length is 1277 [0/1]
> Rank 9 reached 0 vertices (GTGTTAACAGTCGGGAGAGGA) from seed 1
> Rank 0 extended 0 seeds out of 0 (-nan%)
> Rank 9 reached 1277 vertices from seed 1 (changing direction)
> Rank 9 reached 0 vertices (AGTTTTATCGCTTCCATGACG) from seed 1
> Rank 9 reached 1279 vertices (CCTCTCCCGACTGTTAACACT) from seed 1
> Rank 9 (extension done)
> Rank 9 is extending seeds [1/1] (completed)
> Rank 9 extended 1 seeds out of 1 (100.00%)
> Rank 6 reached 4063 vertices from seed 1 (changing direction)
> Rank 6 reached 0 vertices (CTGGTTATATTGACCATGCCG) from seed 1
> Rank 6 reached 4064 vertices (TGATTGGCGTATCCAACCTGC) from seed 1
> Rank 6 (extension done)
> Rank 6 is extending seeds [1/1] (completed)
> Rank 6 extended 1 seeds out of 1 (100.00%)
> ...
> Number of contigs: 2
> Total length of contigs: 5383
> Number of contigs >= 500 nt: 2
> Total length of contigs >= 500 nt: 5383
> Number of scaffolds: 1
> Total length of scaffolds: 5652
> Number of scaffolds >= 500 nt: 1
> Total length of scaffolds >= 500: 5652
>
> Why don't the other ranks find seeds?
>
> It's still successful at getting a sequence, but it's split into 2
> contigs. This doesn't much surprise me, because I eliminated the joining
> segment, but kept end pairs that spanned either side of the joining
> segment. It's a little strange that the sequence of Ns that has been
> filled in (about 274 Ns) is not the same as what I would expect from an
> assembly of the circular genome (i.e. 2 Ns), but when a lot of repeated
> Ns appear in an assembled sequence, it's expected that the distance is
> unknown.
>
> -- David

I think you are right. Basically, to find the start of a seed, Ray looks for a vertex that doesn't have one strong parent and doesn't have one strong child. Because your genome is so tiny and because there are no repeats, Ray fails to find the start of any seed because of the above. I have to think about how to solve this carefully.

Sébastien
From: Sébastien B. <seb...@ul...> - 2011-07-15 12:40:15
> ________________________________________
> From: David Eccles (gringer) [dav...@mp...]
> Sent: 15 July 2011 06:47
> To: Sébastien Boisvert
> Cc: den...@li...
> Subject: Re: Confused about coding -- completed seeds without distributions
>
> On 15/07/11 10:12, David Eccles (gringer) wrote:
>> I'll fetch the input data that failed once I get back home, and also
>> see what happens with real phiX data off our sequencer.
>
> So, for real phiX data, I prepared the sequence by masking out bases
> with quality scores less than 20, and then filtering out the 'N' bases,
> using the attached script (with 'proportion=1'), which produces an
> interleaved FASTA file. Despite all that filtering, I still have 4GB of
> >Q20 phiX data that came off the sequencer.

You don't need to filter anything with Ray.

> Based on the presence of the sequence 'ATCCAACCTGCAGAGTTTTATCGCTT' in
> these data, it looks like a circularised phiX genome was used. I put
> the top 100k lines from this interleaved file into Ray, and it seems to
> work fine:
>
> $ mpirun -np 10 ../../code/Ray -i ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/head100k_interleaved_noN_phiX_region1101.fasta
> ...
> Rank 0 is creating seeds [1/1072] > Rank 9 is creating seeds [1/1102] > Rank 7 is creating seeds [1/1094] > Rank 3 is creating seeds [1/1182] > Rank 5 is creating seeds [1/1102] > Rank 4 is creating seeds [1/1168] > Rank 2 is creating seeds [1/1090] > Rank 8 is creating seeds [1/1206] > Rank 1 is creating seeds [1/1124] > Rank 6 is creating seeds [1/1164] > Rank 0 has 0 seeds > Rank 0 is creating seeds [1072/1072] (completed) > Rank 0: peak number of workers: 1032, maximum: 30000 > Rank 0: VirtualCommunicator: 172441 pushed messages generated 918 > virtual messages (0.532356%) > Rank 2 has 0 seeds > Rank 2 is creating seeds [1090/1090] (completed) > Rank 2: peak number of workers: 1040, maximum: 30000 > Rank 2: VirtualCommunicator: 172482 pushed messages generated 876 > virtual messages (0.507879%) > Rank 7 has 0 seeds > Rank 7 is creating seeds [1094/1094] (completed) > Rank 7: peak number of workers: 1056, maximum: 30000 > Rank 7: VirtualCommunicator: 172598 pushed messages generated 860 > virtual messages (0.498268%) > Rank 8 has 0 seeds > Rank 8 is creating seeds [1206/1206] (completed) > Rank 8: peak number of workers: 1128, maximum: 30000 > Rank 8: VirtualCommunicator: 173161 pushed messages generated 863 > virtual messages (0.498380%) > Rank 3 has 0 seeds > Rank 3 is creating seeds [1182/1182] (completed) > Rank 3: peak number of workers: 1136, maximum: 30000 > Rank 3: VirtualCommunicator: 173111 pushed messages generated 867 > virtual messages (0.500835%) > Rank 5 has 0 seeds > Rank 5 is creating seeds [1102/1102] (completed) > Rank 5: peak number of workers: 1054, maximum: 30000 > Rank 5: VirtualCommunicator: 172651 pushed messages generated 936 > virtual messages (0.542134%) > Rank 6 has 0 seeds > Rank 6 is creating seeds [1164/1164] (completed) > Rank 6: peak number of workers: 1114, maximum: 30000 > Rank 6: VirtualCommunicator: 173032 pushed messages generated 862 > virtual messages (0.498174%) > Rank 4 has 0 seeds > Rank 4 is creating seeds 
[1168/1168] (completed) > Rank 4: peak number of workers: 1116, maximum: 30000 > Rank 4: VirtualCommunicator: 173056 pushed messages generated 852 > virtual messages (0.492326%) > Rank 1 has 0 seeds > Rank 1 is creating seeds [1124/1124] (completed) > Rank 1: peak number of workers: 1088, maximum: 30000 > Rank 1: VirtualCommunicator: 183555 pushed messages generated 11665 > virtual messages (6.355043%) > Rank 9 discovered a seed with 5363 vertices > Rank 9 has 1 seeds > Rank 9 is creating seeds [1102/1102] (completed) > Rank 9: peak number of workers: 1052, maximum: 30000 > Rank 9: VirtualCommunicator: 183307 pushed messages generated 11631 > virtual messages (6.345093%) > ... > Rank 1 is extending seeds [0/0] (completed) > Rank 1 extended 0 seeds out of 0 (-nan%) > Rank 6 is extending seeds [0/0] (completed) > Rank 6 extended 0 seeds out of 0 (-nan%) > Rank 4 is extending seeds [0/0] (completed) > Rank 4 extended 0 seeds out of 0 (-nan%) > Rank 5 is extending seeds [0/0] (completed) > Rank 5 extended 0 seeds out of 0 (-nan%) > Rank 3 is extending seeds [0/0] (completed) > Rank 3 extended 0 seeds out of 0 (-nan%) > Rank 0 is extending seeds [0/0] (completed) > Rank 0 extended 0 seeds out of 0 (-nan%) > Rank 8 is extending seeds [0/0] (completed) > Rank 8 extended 0 seeds out of 0 (-nan%) > Rank 2 is extending seeds [0/0] (completed) > Rank 2 extended 0 seeds out of 0 (-nan%) > Rank 7 is extending seeds [0/0] (completed) > Rank 7 extended 0 seeds out of 0 (-nan%) > Rank 9 is extending seeds [1/1] > Rank 9 starts on a seed, length is 5363 [0/1] > Rank 9 reached 0 vertices (GTGTTAACAGTCGGGAGAGGA) from seed 1 > Rank 9 reached 5363 vertices from seed 1 (changing direction) > Rank 9 reached 0 vertices (CTGGTTATATTGACCATGCCG) from seed 1 > Rank 9 reached 5365 vertices (CCTCTCCCGACTGTTAACACT) from seed 1 > Rank 9 (extension done) > Rank 9 is extending seeds [1/1] (completed) > Rank 9 extended 1 seeds out of 1 (100.00%) > ... 
> Number of contigs: 1 > Total length of contigs: 5385 > Number of contigs >= 500 nt: 1 > Total length of contigs >= 500 nt: 5385 > Number of scaffolds: 1 > Total length of scaffolds: 5385 > Number of scaffolds >= 500 nt: 1 > Total length of scaffolds >= 500: 5385 > > I'll have a look at what comes out if I break the circularity of this > DNA by removing sequence that includes the start and end pieces together. > > -- David So your hypothesis is that when you use all the reads, you get a k-mer graph containing a loop. But if that were the case, you would get a parallel infinite loop, because SeedWorker.cpp does not check for circular seeds. Sébastien |
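Sébastien's objection above -- that a loop in the k-mer graph would send seed extension into an infinite walk, because SeedWorker.cpp has no circularity check -- can be pictured with a toy walk. This is an illustrative Python sketch with hypothetical names (`DictGraph`, `walk_seed`), not Ray's actual data structure or algorithm:

```python
class DictGraph:
    """Tiny adjacency-list stand-in for a distributed k-mer graph."""
    def __init__(self, adj):
        self.adj = adj

    def successors(self, v):
        return self.adj.get(v, [])


def walk_seed(graph, start, max_steps=100000):
    """Follow unique successors from `start`, as a seed walk does.

    Stops on a branch or dead end -- and, crucially, on returning to an
    already-visited vertex.  That second condition is the circular-seed
    check this thread says is missing; without it, a loop in the graph
    (e.g. from a circularised phiX genome) would make the walk run
    forever.  Returns (seed, is_circular).
    """
    seed, visited, v = [start], {start}, start
    for _ in range(max_steps):
        nxt = graph.successors(v)
        if len(nxt) != 1:        # branch or dead end: normal termination
            break
        v = nxt[0]
        if v in visited:         # came back around: the seed is circular
            return seed, True
        visited.add(v)
        seed.append(v)
    return seed, False


linear = DictGraph({"A": ["B"], "B": ["C"]})
circular = DictGraph({"A": ["B"], "B": ["C"], "C": ["A"]})
print(walk_seed(linear, "A"))    # (['A', 'B', 'C'], False)
print(walk_seed(circular, "A"))  # (['A', 'B', 'C'], True)
```

The visited-set check turns a would-be infinite loop into an explicit "circular" result that a caller could then report as a circular contig.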
From: David E. (gringer) <dav...@mp...> - 2011-07-15 12:33:57
|
On 15/07/11 13:27, David Eccles (gringer) wrote: > Number of contigs: 2 > Total length of contigs: 5383 > Number of contigs >= 500 nt: 2 > Total length of contigs >= 500 nt: 5383 > Number of scaffolds: 1 > Total length of scaffolds: 5652 > Number of scaffolds >= 500 nt: 1 > Total length of scaffolds >= 500: 5652 If I tell Ray to treat that interleaved sequence as single reads, it still spits out two contigs, despite what I consider to be reasonable evidence that the sequence that joins the contigs exists in reasonable numbers in the input data: $ mpirun -np 10 ../../code/Ray -s ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta Number of contigs: 2 Total length of contigs: 5383 Number of contigs >= 500 nt: 2 Total length of contigs >= 500 nt: 5383 Number of scaffolds: 2 Total length of scaffolds: 5383 Number of scaffolds >= 500 nt: 2 Total length of scaffolds >= 500: 5383 $ head -n 3 RayOutput.Contigs.fasta >contig-6 4084 nucleotides CTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCC CTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGA $ tail -n 3 RayOutput.Contigs.fasta TTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAG GATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACT $ grep 'GTTAACACT[ACTG]CTGGTTATATTGACC' ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta | wc -l 120 $ grep 'GTTAACACT[A]CTGGTTATATTGACC' ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta | wc -l 75 $ grep 'GTTAACACT[C]CTGGTTATATTGACC' ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta | wc -l 0 $ grep 'GTTAACACT[T]CTGGTTATATTGACC' 
~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta | wc -l 0 $ grep 'GTTAACACT[G]CTGGTTATATTGACC' ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta | wc -l 45 Ah, of course... it's a bubble. So if I remove that... $ perl -pe 's/GTTAACACT[ACTG]CTGGTTATATTGACC/GTTAACACTACTGGTTATATTGACC/' linearised_head100k_interleaved_noN_phiX_region1101.fasta > debubbled_linearised_head100k_interleaved_noN_phiX_region1101.fasta $ mpirun -np 10 ../../code/Ray -s ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/debubbled_linearised_head100k_interleaved_noN_phiX_region1101.fasta ... Rank 1 is extending seeds [0/0] (completed) Rank 1 extended 0 seeds out of 0 (-nan%) Rank 5 is extending seeds [0/0] (completed) Rank 5 extended 0 seeds out of 0 (-nan%) Rank 8 is extending seeds [0/0] (completed) Rank 8 extended 0 seeds out of 0 (-nan%) Rank 2 is extending seeds [0/0] (completed) Rank 2 extended 0 seeds out of 0 (-nan%) Rank 0 is extending seeds [0/0] (completed) Rank 0 extended 0 seeds out of 0 (-nan%) Rank 9 is extending seeds [1/1] Rank 9 starts on a seed, length is 1277 [0/1] Rank 9 reached 0 vertices (GTGTTAACAGTCGGGAGAGGA) from seed 1 Rank 4 is extending seeds [0/0] (completed) Rank 4 extended 0 seeds out of 0 (-nan%) Rank 3 is extending seeds [0/0] (completed) Rank 3 extended 0 seeds out of 0 (-nan%) Rank 7 is extending seeds [0/0] (completed) Rank 7 extended 0 seeds out of 0 (-nan%) Rank 6 is extending seeds [1/1] Rank 6 starts on a seed, length is 4063 [0/1] Rank 6 reached 0 vertices (GCAGGTTGGATACGCCAATCA) from seed 1 Rank 9 reached 1277 vertices from seed 1 (changing direction) Rank 9 reached 0 vertices (AGTTTTATCGCTTCCATGACG) from seed 1 Rank 6 reached 4063 vertices from seed 1 (changing direction) Rank 6 reached 0 vertices (AGTTTTATCGCTTCCATGACG) from seed 1 Rank 9 
reached 5364 vertices (TGATTGGCGTATCCAACCTGC) from seed 1 Rank 9 (extension done) Rank 9 is extending seeds [1/1] (completed) Rank 9 extended 1 seeds out of 1 (100.00%) Rank 6 reached 5364 vertices (TGATTGGCGTATCCAACCTGC) from seed 1 Rank 6 (extension done) Rank 6 is extending seeds [1/1] (completed) Rank 6 extended 1 seeds out of 1 (100.00%) ... Number of contigs: 1 Total length of contigs: 5384 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5384 Number of scaffolds: 1 Total length of scaffolds: 5384 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5384 And this time, the contig appears to match the genome (http://www.ncbi.nlm.nih.gov/nuccore/HM753704.1 -- slightly different from the other phiX sequence). There are a few base errors, which means that a simple grep doesn't work properly: $ diffseq RayOutput.Scaffolds.fasta phiX174_isolate.fasta -wordsize 10 -outfile report.txt -aoutfeat afeat.txt -boutfeat bfeat.txt $ cat report.txt ... #======================================= # # Sequence: scaffold-0 from: 1 to: 5384 # HitCount: 11 # # Compare: HM753704.1 from: 1 to: 5386 # # scaffold-0 overlap starts at 1 # HM753704.1 overlap starts at 2 # # #======================================= ... #--------------------------------------- # # Overlap_end: 5384 in scaffold-0 # Overlap_end: 5385 in HM753704.1 # # SNP_count: 10 # Transitions: 8 # Transversions: 2 # #--------------------------------------- ... -- David |
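The series of grep commands in this message tallies which base sits between the two flanks (75 A, 45 G out of 120 matches) -- a split between two well-supported bases is what identifies the site as a bubble rather than scattered sequencing errors. The same tally can be made programmatically; this is an illustrative Python sketch (hypothetical function name, not part of Ray):

```python
import re
from collections import Counter


def bubble_base_counts(reads, left_flank, right_flank):
    """Count the single base observed between two fixed flanks.

    A split between two well-supported bases (here 75 A versus 45 G)
    points to a bubble in the k-mer graph -- two variants of the
    genome in the read set -- rather than random sequencing errors.
    """
    pattern = re.compile(re.escape(left_flank) + "([ACGT])" + re.escape(right_flank))
    counts = Counter()
    for read in reads:
        for match in pattern.finditer(read):
            counts[match.group(1)] += 1
    return counts


# Toy reads mimicking the grep result above (75 with A, 45 with G):
reads = (["GTTAACACTACTGGTTATATTGACC"] * 75
         + ["GTTAACACTGCTGGTTATATTGACC"] * 45)
print(bubble_base_counts(reads, "GTTAACACT", "CTGGTTATATTGACC"))
# Counter({'A': 75, 'G': 45})
```

One call replaces the four separate greps, and generalises to any putative bubble site once the flanking sequences are known.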
From: David E. (gringer) <dav...@mp...> - 2011-07-15 11:28:17
|
On 15/07/11 12:47, David Eccles (gringer) wrote: > I'll have a look at what comes out if I break the circularity of this > DNA by removing sequence that includes the start and end pieces together. Done, sort of. Sorry, I think I forgot to link in the circular reads sequence in my previous email, so here are both of them: http://user.interface.org.nz/~gringer/hacking/head100k_interleaved_noN_phiX_region1101.fasta http://user.interface.org.nz/~gringer/hacking/linearised_head100k_interleaved_noN_phiX_region1101.fasta Curiously, Ray finds more seeds in another rank (rank 6) by removing this connecting sequence: $ mpirun -np 10 ../../code/Ray -i ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/linearised_head100k_interleaved_noN_phiX_region1101.fasta Rank 0 is extending seeds [0/0] (completed) Rank 5 is extending seeds [0/0] (completed) Rank 5 extended 0 seeds out of 0 (-nan%) Rank 4 is extending seeds [0/0] (completed) Rank 4 extended 0 seeds out of 0 (-nan%) Rank 7 is extending seeds [0/0] (completed) Rank 7 extended 0 seeds out of 0 (-nan%) Rank 2 is extending seeds [0/0] (completed) Rank 2 extended 0 seeds out of 0 (-nan%) Rank 3 is extending seeds [0/0] (completed) Rank 3 extended 0 seeds out of 0 (-nan%) Rank 1 is extending seeds [0/0] (completed) Rank 1 extended 0 seeds out of 0 (-nan%) Rank 8 is extending seeds [0/0] (completed) Rank 8 extended 0 seeds out of 0 (-nan%) Rank 6 is extending seeds [1/1] Rank 6 starts on a seed, length is 4063 [0/1] Rank 6 reached 0 vertices (GCAGGTTGGATACGCCAATCA) from seed 1 Rank 9 is extending seeds [1/1] Rank 9 starts on a seed, length is 1277 [0/1] Rank 9 reached 0 vertices (GTGTTAACAGTCGGGAGAGGA) from seed 1 Rank 0 extended 0 seeds out of 0 (-nan%) Rank 9 reached 1277 vertices from seed 1 (changing direction) Rank 9 reached 0 vertices (AGTTTTATCGCTTCCATGACG) from seed 1 Rank 9 reached 1279 vertices (CCTCTCCCGACTGTTAACACT) from seed 1 Rank 9 (extension done) Rank 9 is extending seeds [1/1] 
(completed) Rank 9 extended 1 seeds out of 1 (100.00%) Rank 6 reached 4063 vertices from seed 1 (changing direction) Rank 6 reached 0 vertices (CTGGTTATATTGACCATGCCG) from seed 1 Rank 6 reached 4064 vertices (TGATTGGCGTATCCAACCTGC) from seed 1 Rank 6 (extension done) Rank 6 is extending seeds [1/1] (completed) Rank 6 extended 1 seeds out of 1 (100.00%) ... Number of contigs: 2 Total length of contigs: 5383 Number of contigs >= 500 nt: 2 Total length of contigs >= 500 nt: 5383 Number of scaffolds: 1 Total length of scaffolds: 5652 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5652 Why don't the other ranks find seeds? It's still successful at getting a sequence, but it's split into 2 contigs. This doesn't much surprise me, because I eliminated the joining segment, but kept end pairs that spanned either side of the joining segment. It's a little strange that the stretch of Ns that has been filled in (about 274 Ns) is not the length I would expect from an assembly of the circular genome (i.e. 2 Ns), but when a long run of Ns appears in an assembled sequence, it is understood that the gap distance is unknown. -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-15 10:48:42
|
On 15/07/11 10:12, David Eccles (gringer) wrote: > I'll fetch the input data that failed once I get back home, and also > see what happens with real phiX data off our sequencer. So, for real phiX data, I prepared the sequence by masking out bases with quality scores less than 20, and then filtering out the 'N' bases, using the attached script (with 'proportion=1'), which produces an interleaved FASTA file. Despite all that filtering, I still have 4GB of >Q20 phiX data that came off the sequencer. Based on the presence of the sequence 'ATCCAACCTGCAGAGTTTTATCGCTT' in these data, it looks like a circularised phiX genome was used. I put the top 100k lines from this interleaved file into Ray, and it seems to work fine: $ mpirun -np 10 ../../code/Ray -i ~/illuminadata/110517_H134_0062_A816HKABXX/Data/Intensities/BaseCalls/fastq/ignore/head100k_interleaved_noN_phiX_region1101.fasta ... Rank 0 is creating seeds [1/1072] Rank 9 is creating seeds [1/1102] Rank 7 is creating seeds [1/1094] Rank 3 is creating seeds [1/1182] Rank 5 is creating seeds [1/1102] Rank 4 is creating seeds [1/1168] Rank 2 is creating seeds [1/1090] Rank 8 is creating seeds [1/1206] Rank 1 is creating seeds [1/1124] Rank 6 is creating seeds [1/1164] Rank 0 has 0 seeds Rank 0 is creating seeds [1072/1072] (completed) Rank 0: peak number of workers: 1032, maximum: 30000 Rank 0: VirtualCommunicator: 172441 pushed messages generated 918 virtual messages (0.532356%) Rank 2 has 0 seeds Rank 2 is creating seeds [1090/1090] (completed) Rank 2: peak number of workers: 1040, maximum: 30000 Rank 2: VirtualCommunicator: 172482 pushed messages generated 876 virtual messages (0.507879%) Rank 7 has 0 seeds Rank 7 is creating seeds [1094/1094] (completed) Rank 7: peak number of workers: 1056, maximum: 30000 Rank 7: VirtualCommunicator: 172598 pushed messages generated 860 virtual messages (0.498268%) Rank 8 has 0 seeds Rank 8 is creating seeds [1206/1206] (completed) Rank 8: peak number of workers: 1128, maximum: 
30000 Rank 8: VirtualCommunicator: 173161 pushed messages generated 863 virtual messages (0.498380%) Rank 3 has 0 seeds Rank 3 is creating seeds [1182/1182] (completed) Rank 3: peak number of workers: 1136, maximum: 30000 Rank 3: VirtualCommunicator: 173111 pushed messages generated 867 virtual messages (0.500835%) Rank 5 has 0 seeds Rank 5 is creating seeds [1102/1102] (completed) Rank 5: peak number of workers: 1054, maximum: 30000 Rank 5: VirtualCommunicator: 172651 pushed messages generated 936 virtual messages (0.542134%) Rank 6 has 0 seeds Rank 6 is creating seeds [1164/1164] (completed) Rank 6: peak number of workers: 1114, maximum: 30000 Rank 6: VirtualCommunicator: 173032 pushed messages generated 862 virtual messages (0.498174%) Rank 4 has 0 seeds Rank 4 is creating seeds [1168/1168] (completed) Rank 4: peak number of workers: 1116, maximum: 30000 Rank 4: VirtualCommunicator: 173056 pushed messages generated 852 virtual messages (0.492326%) Rank 1 has 0 seeds Rank 1 is creating seeds [1124/1124] (completed) Rank 1: peak number of workers: 1088, maximum: 30000 Rank 1: VirtualCommunicator: 183555 pushed messages generated 11665 virtual messages (6.355043%) Rank 9 discovered a seed with 5363 vertices Rank 9 has 1 seeds Rank 9 is creating seeds [1102/1102] (completed) Rank 9: peak number of workers: 1052, maximum: 30000 Rank 9: VirtualCommunicator: 183307 pushed messages generated 11631 virtual messages (6.345093%) ... 
Rank 1 is extending seeds [0/0] (completed) Rank 1 extended 0 seeds out of 0 (-nan%) Rank 6 is extending seeds [0/0] (completed) Rank 6 extended 0 seeds out of 0 (-nan%) Rank 4 is extending seeds [0/0] (completed) Rank 4 extended 0 seeds out of 0 (-nan%) Rank 5 is extending seeds [0/0] (completed) Rank 5 extended 0 seeds out of 0 (-nan%) Rank 3 is extending seeds [0/0] (completed) Rank 3 extended 0 seeds out of 0 (-nan%) Rank 0 is extending seeds [0/0] (completed) Rank 0 extended 0 seeds out of 0 (-nan%) Rank 8 is extending seeds [0/0] (completed) Rank 8 extended 0 seeds out of 0 (-nan%) Rank 2 is extending seeds [0/0] (completed) Rank 2 extended 0 seeds out of 0 (-nan%) Rank 7 is extending seeds [0/0] (completed) Rank 7 extended 0 seeds out of 0 (-nan%) Rank 9 is extending seeds [1/1] Rank 9 starts on a seed, length is 5363 [0/1] Rank 9 reached 0 vertices (GTGTTAACAGTCGGGAGAGGA) from seed 1 Rank 9 reached 5363 vertices from seed 1 (changing direction) Rank 9 reached 0 vertices (CTGGTTATATTGACCATGCCG) from seed 1 Rank 9 reached 5365 vertices (CCTCTCCCGACTGTTAACACT) from seed 1 Rank 9 (extension done) Rank 9 is extending seeds [1/1] (completed) Rank 9 extended 1 seeds out of 1 (100.00%) ... Number of contigs: 1 Total length of contigs: 5385 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5385 Number of scaffolds: 1 Total length of scaffolds: 5385 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5385 I'll have a look at what comes out if I break the circularity of this DNA by removing sequence that includes the start and end pieces together. -- David |
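The "attached script" mentioned in this message did not survive in the archive. A minimal sketch of the preparation it describes might look like the following -- assumed behaviour only (bases below Q20 masked to 'N', any pair still containing an 'N' dropped, survivors emitted in interleaved order); the real script's details, including its 'proportion' option, are unknown:

```python
def mask_and_filter(pairs, min_q=20, offset=33):
    """Yield (name, seq) records for read pairs surviving Q-masking.

    pairs: iterable of ((name1, seq1, qual1), (name2, seq2, qual2)).
    Bases with Phred quality below min_q are masked to 'N'; a pair is
    kept only if neither masked read contains an 'N'.  Survivors come
    out in interleaved order (read 1, then read 2), ready to be written
    as an interleaved FASTA file.
    """
    def mask(seq, qual):
        return "".join(b if ord(q) - offset >= min_q else "N"
                       for b, q in zip(seq, qual))

    for r1, r2 in pairs:
        m1, m2 = mask(r1[1], r1[2]), mask(r2[1], r2[2])
        if "N" not in m1 and "N" not in m2:
            yield (r1[0], m1)
            yield (r2[0], m2)


# 'I' is Phred+33 Q40 (kept); '#' is Q2 (masked to 'N', pair dropped):
pairs = [(("read1/1", "ACGT", "IIII"), ("read1/2", "TGCA", "IIII")),
         (("read2/1", "ACGT", "II#I"), ("read2/2", "TGCA", "IIII"))]
for name, seq in mask_and_filter(pairs):
    print(">%s\n%s" % (name, seq))
```

Dropping a whole pair when either mate fails keeps the output strictly interleaved, which is what Ray's `-i` option expects.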
From: David E. (gringer) <dav...@mp...> - 2011-07-15 08:43:58
|
On 15/07/11 10:12, David Eccles (gringer) wrote: >> Works fine on my branch master. > And also on my local copy of your master on my MPI desktop, even when > running with just 1 processor, carrying out the same steps that failed a > day ago on my laptop: > > $ readSimulator/VirtualNextGenSequencer > ~/data-for-system-tests/phix/phix.fasta 0 200 10 500000 50 > phix_500k_1.fasta phix_500k_2.fasta > > $ ./code/Ray -p phix_500k_1.fasta phix_500k_2.fasta > > [i.e. only one processor] > > Number of contigs: 1 > Total length of contigs: 5384 > Number of contigs >= 500 nt: 1 > Total length of contigs >= 500 nt: 5384 > Number of scaffolds: 1 > Total length of scaffolds: 5384 > Number of scaffolds >= 500 nt: 1 > Total length of scaffolds >= 500: 5384 > >> Your distribution has 2 peaks. > > Yes, I did notice that. Not quite sure how that happened given that I'm > working off the same code. I'll fetch the input data that failed once I > get back home, and also see what happens with real phiX data off our > sequencer. ... and I just re-ran with a 5k run, got one successful run, and then tried again and got an unsuccessful run: $ readSimulator/VirtualNextGenSequencer ~/data-for-system-tests/phix/phix.fasta 0 200 10 5000 50 phix_5k_1.fasta phix_5k_2.fasta $ mpirun -np ./code/Ray -p phix_5k_1.fasta phix_5k_2.fasta [so same as your steps, except a smaller genome size] These inputs seem to have a single peak: http://user.interface.org.nz/~gringer/hacking/phix_5k_1.fasta http://user.interface.org.nz/~gringer/hacking/phix_5k_2.fasta http://user.interface.org.nz/~gringer/hacking/coverage_5k_bad.pdf -- David |
From: David E. (gringer) <dav...@mp...> - 2011-07-15 08:12:58
|
On 14/07/11 16:21, Sébastien Boisvert wrote: > Works fine on my branch master. And also on my local copy of your master on my MPI desktop, even when running with just 1 processor, carrying out the same steps that failed a day ago on my laptop: $ readSimulator/VirtualNextGenSequencer ~/data-for-system-tests/phix/phix.fasta 0 200 10 500000 50 phix_500k_1.fasta phix_500k_2.fasta $ ./code/Ray -p phix_500k_1.fasta phix_500k_2.fasta [i.e. only one processor] Number of contigs: 1 Total length of contigs: 5384 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5384 Number of scaffolds: 1 Total length of scaffolds: 5384 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5384 > Your distribution has 2 peaks. Yes, I did notice that. Not quite sure how that happened given that I'm working off the same code. I'll fetch the input data that failed once I get back home, and also see what happens with real phiX data off our sequencer. Thanks for trying this out, -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-14 14:21:13
|
Works fine on my branch master. Commands: ~/git-clones/ray/readSimulator/VirtualNextGenSequencer ~/data-for-system-tests/phix/phix.fasta 0 200 10 500000 50 phix_500k_1.fasta phix_500k_2.fasta mpirun -np 3 ~/Ray -p phix_500k_1.fasta phix_500k_2.fasta |tee 1 Result: Number of contigs: 1 Total length of contigs: 5384 Number of contigs >= 500 nt: 1 Total length of contigs >= 500 nt: 5384 Number of scaffolds: 1 Total length of scaffolds: 5384 Number of scaffolds >= 500 nt: 1 Total length of scaffolds >= 500: 5384 Your distribution has 2 peaks. http://imgur.com/WoksH Should be like: http://imgur.com/lm84R Distribution: http://pastebin.com/qmJtWqj9 k-mer length: 21 Lowest coverage observed: 98 MinimumCoverage: 98 PeakCoverage: 5779 RepeatCoverage: 11460 Number of k-mers with at least MinimumCoverage: 10732 k-mers Estimated genome length: 5366 nucleotides Percentage of vertices with coverage 98: 0.0186359 % DistributionFile: RayOutput.CoverageDistribution.txt Did you provide the good input files ? Sébastien > ________________________________________ > De : Eccles, David [dav...@mp...] > Date d'envoi : 14 juillet 2011 02:48 > À : Sébastien Boisvert; den...@li... > Objet : RE: RE : Confused about coding -- completed seeds without distributions > > From: Sébastien Boisvert [mailto:seb...@ul...] >> VirtualNextGenSequencer dumps pairs of reads so you should provide both. >> Can you rerun with -p tests/phix/phix_500k_1.fasta > tests/phix/phix_500k_2.fasta >> Otherwise I think your graph won't be connected enough because of the > random number generator used to simulate reads. > > Okay, done. Coverage: > > http://pastebin.com/p6v7m1jA > > And output: > > http://pastebin.com/rEjqveCz > > Still failing to assemble. > > I must say that these results surprised me, because the minimum coverage from > Coverage Distribution has bumped up to 96 now. > >> Also, what is the read length ? 
> > Reads were generated using this command: > ../readSimulator/VirtualNextGenSequencer tests/phix/phix_genome.fasta 0 200 > 10 500000 50 phix_500k_1.fasta phix_500k_2.fasta > > In other words, no sequence errors, 200bp outer distance (SD = 10bp), 50bp > read length. I chose those outer distance / read length values (but SD was a > guess) because they are the same as the parameters used on a human genome > sequence that was done at MPI a month or so ago (which included a phiX lane). > I guess I could choose the reads from that as input, but then I'd be testing > a different thing, because those reads would likely have errors. > >> I added a system test for phiX and it works just fine on my branch master. >> see system-tests/tests/phix > > You're simulating errors / SNPs (substitution rate = 0.005), which is a > different test from the one I'm doing. > > -- David > |
From: Eccles, D. <dav...@mp...> - 2011-07-14 06:49:38
|
From: Sébastien Boisvert [mailto:seb...@ul...] > VirtualNextGenSequencer dumps pairs of reads so you should provide both. > Can you rerun with -p tests/phix/phix_500k_1.fasta tests/phix/phix_500k_2.fasta > Otherwise I think your graph won't be connected enough because of the random number generator used to simulate reads. Okay, done. Coverage: http://pastebin.com/p6v7m1jA And output: http://pastebin.com/rEjqveCz Still failing to assemble. I must say that these results surprised me, because the minimum coverage from Coverage Distribution has bumped up to 96 now. > Also, what is the read length ? Reads were generated using this command: ../readSimulator/VirtualNextGenSequencer tests/phix/phix_genome.fasta 0 200 10 500000 50 phix_500k_1.fasta phix_500k_2.fasta In other words, no sequence errors, 200bp outer distance (SD = 10bp), 50bp read length. I chose those outer distance / read length values (but SD was a guess) because they are the same as the parameters used on a human genome sequence that was done at MPI a month or so ago (which included a phiX lane). I guess I could choose the reads from that as input, but then I'd be testing a different thing, because those reads would likely have errors. > I added a system test for phiX and it works just fine on my branch master. > see system-tests/tests/phix You're simulating errors / SNPs (substitution rate = 0.005), which is a different test from the one I'm doing. -- David |
From: Sébastien B. <seb...@ul...> - 2011-07-14 03:07:08
|
Hello, Can you rerun with -p tests/phix/phix_500k_1.fasta tests/phix/phix_500k_2.fasta Otherwise I think your graph won't be connected enough because of the random number generator used to simulate reads. Also, what is the read length ? I added a system test for phiX and it works just fine on my branch master. see system-tests/tests/phix seb@fault:~/git-clones/ray/system-tests$ ./run-test-smp.sh phix|tee 1 I ran the assembly on my Samsung netbook with an Intel Atom processor. [1,0]<stdout>:Number of contigs: 1 [1,0]<stdout>:Total length of contigs: 5384 [1,0]<stdout>:Number of contigs >= 500 nt: 1 [1,0]<stdout>:Total length of contigs >= 500 nt: 5384 [1,0]<stdout>:Number of scaffolds: 1 [1,0]<stdout>:Total length of scaffolds: 5384 [1,0]<stdout>:Number of scaffolds >= 500 nt: 1 [1,0]<stdout>:Total length of scaffolds >= 500: 5384 [1,0]<stdout>: [1,0]<stdout>:Rank 0 wrote phix.Contigs.fasta [1,0]<stdout>:Rank 0 wrote phix.Scaffolds.fasta [1,0]<stdout>:Check for phix.* Sébastien > ________________________________________ > De : Eccles, David [dav...@mp...] > Date d'envoi : 13 juillet 2011 18:40 > À : Sébastien Boisvert; den...@li... > Objet : RE: Confused about coding -- 500k phiX simulation still fails > > From: Sébastien Boisvert [mailto:seb...@ul...] >> I think you generated too few reads. >> That is why it fails. >> You need to get the peak coverage at least above 20. > > Okay, so I used the VirtualNextGenSequencer to create 500,000 reads, which > has a substantially higher coverage: > > http://pastebin.com/18SwshYE > > It actually registers as having a peak now, which is nice, because it can > then proceed with the assembly. The statistics are still a little odd, > finding a peak at 1415 (neighbourhood counts 58,52,70,74*,56,72,68) when 2862 > has a higher count (70,54,72,82*,56,50,52), as does 2873 > (56,58,60,76*,64,76,64), but I'll put that down to smoothing -- it's close > enough. 
> > So now that I've muscled my way past the coverage, the assembly should be a > breeze: > > http://pastebin.com/xJJ9yycT > > ... or maybe not. It [still] looks like it's failing on seed creation. > > Would it make sense to include a coverage calculation in the seeding tests > done by the workers (SeedWorker.cpp)? What happens if a vertex (with >5 > coverage) has a parent that has 1 edge in, one edge out, but the coverage of > that parent is below 5? > > -- David > |
From: Eccles, D. <dav...@mp...> - 2011-07-13 22:44:14
|
From: Sébastien Boisvert [mailto:seb...@ul...] > I think you generated too few reads. > That is why it fails. > You need to get the peak coverage at least above 20. Okay, so I used the VirtualNextGenSequencer to create 500,000 reads, which has a substantially higher coverage: http://pastebin.com/18SwshYE It actually registers as having a peak now, which is nice, because it can then proceed with the assembly. The statistics are still a little odd, finding a peak at 1415 (neighbourhood counts 58,52,70,74*,56,72,68) when 2862 has a higher count (70,54,72,82*,56,50,52), as does 2873 (56,58,60,76*,64,76,64), but I'll put that down to smoothing -- it's close enough. So now that I've muscled my way past the coverage, the assembly should be a breeze: http://pastebin.com/xJJ9yycT ... or maybe not. It [still] looks like it's failing on seed creation. Would it make sense to include a coverage calculation in the seeding tests done by the workers (SeedWorker.cpp)? What happens if a vertex (with >5 coverage) has a parent that has 1 edge in, one edge out, but the coverage of that parent is below 5? -- David |
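The question at the end of this message amounts to letting the seed-start test also consult the parent's coverage, so that a low-coverage parent with one edge in and one edge out does not block an otherwise valid seed start. A minimal Python sketch of that rule (hypothetical names and a toy graph class; Ray's actual SeedWorker.cpp logic differs in detail):

```python
class TinyGraph:
    """Minimal graph with per-vertex coverage (illustration only)."""
    def __init__(self, edges, coverage):
        self.out_edges, self.in_edges, self.cov = {}, {}, coverage
        for a, b in edges:
            self.out_edges.setdefault(a, []).append(b)
            self.in_edges.setdefault(b, []).append(a)

    def in_degree(self, v):
        return len(self.in_edges.get(v, []))

    def out_degree(self, v):
        return len(self.out_edges.get(v, []))

    def parent(self, v):
        return self.in_edges[v][0]

    def coverage(self, v):
        return self.cov[v]


def is_seed_start(vertex, graph, min_coverage=5):
    """A vertex starts a seed if it is 1-in/1-out and has no eligible
    parent that would be a better start -- where 'eligible' here also
    requires the parent's coverage to reach min_coverage, so a
    low-coverage parent no longer blocks the seed."""
    if graph.in_degree(vertex) != 1 or graph.out_degree(vertex) != 1:
        return False
    parent = graph.parent(vertex)
    parent_ok = (graph.coverage(parent) >= min_coverage
                 and graph.in_degree(parent) == 1
                 and graph.out_degree(parent) == 1)
    return not parent_ok   # start here only if the parent cannot


# Chain Z->A->B->C->D where A has coverage 2 (below the threshold):
g = TinyGraph([("Z", "A"), ("A", "B"), ("B", "C"), ("C", "D")],
              {"Z": 10, "A": 2, "B": 10, "C": 10, "D": 10})
print(is_seed_start("B", g), is_seed_start("C", g))  # True False
```

Without the coverage clause, A (1-in/1-out) would block B from starting a seed even though A itself is too low-coverage to be a trustworthy start, which is the failure mode being described.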
From: Eccles, D. <dav...@mp...> - 2011-07-13 21:51:15
|
> You need to get the peak coverage at least above 20. > How did you generate the reads ? Using Ray's VirtualNextGenSequencer, without errors, 1000 reads. This is actually the first end of a paired-end simulation with outer distance 200bp, read length 50bp. So changing that is easy enough to do, it just means the Ray run might take tens of seconds to finish, rather than seconds (i.e. still well within the limits of what would be appropriate for a quick test to make sure the code works before committing to git). |
From: Sébastien B. <seb...@ul...> - 2011-07-13 20:29:58
|
I think you generated too few reads. That is why it fails. 1 302 2 422 3 982 4 1550 5 1804 ********** peakCoverage 6 1596 7 1220 8 1004 9 666 10 344 11 202 12 126 13 84 14 20 You need to get the peak coverage at least above 20. How did you generate the reads ? Estimated genome length: 5161 nucleotides is correct > Okay, so I've downloaded sebhtml/ray 1.6.1-rc3, and have got as far as the > first smoothing bug I found. It doesn't seem to work properly with any Ray > that I have tested. > > Coverage: http://pastebin.com/egRf1dzj > > Note that the analysis doesn't match the coverage, and doesn't agree with > itself (min/max coverage is 1, but Percentage of vertices with coverage 1: > 2.92579 %). > > Input (exact reads generated from the phiX genome): > http://pastebin.com/yae0xNKC > > Output: http://pastebin.com/tjrj5W2q > > -- David Eccles (gringer) Sébastien |
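The `peakCoverage` marker in the distribution above can be reproduced with a smoothed argmax over the coverage histogram. The sketch below is illustrative only -- it is not Ray's coverage-distribution code, whose smoothing details are behind the 1415-versus-2862 oddity mentioned earlier in the thread:

```python
def find_peak(distribution, window=1):
    """Return the coverage value whose smoothed count is maximal.

    distribution maps coverage -> k-mer count; each count is replaced
    by the mean over [coverage - window, coverage + window] before
    taking the argmax, which damps single-point noise in the histogram.
    """
    def smoothed(c):
        vals = [distribution.get(c + d, 0) for d in range(-window, window + 1)]
        return sum(vals) / len(vals)
    return max(distribution, key=smoothed)


# The distribution from the 1000-read simulation quoted above:
dist = {1: 302, 2: 422, 3: 982, 4: 1550, 5: 1804, 6: 1596, 7: 1220,
        8: 1004, 9: 666, 10: 344, 11: 202, 12: 126, 13: 84, 14: 20}
print(find_peak(dist))  # 5, matching the peakCoverage marker
```

Widening `window` trades sensitivity for robustness: on noisy high-coverage histograms like the 500k-read one, a window of 1 can pick a locally lucky bin over the true mode.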
From: Eccles, D. <dav...@mp...> - 2011-07-13 20:12:24
|
Okay, so I've downloaded sebhtml/ray 1.6.1-rc3, and have got as far as the first smoothing bug I found. It doesn't seem to work properly with any Ray that I have tested. Coverage: http://pastebin.com/egRf1dzj Note that the analysis doesn't match the coverage, and doesn't agree with itself (min/max coverage is 1, but Percentage of vertices with coverage 1: 2.92579 %). Input (exact reads generated from the phiX genome): http://pastebin.com/yae0xNKC Output: http://pastebin.com/tjrj5W2q -- David Eccles (gringer) |
From: Sébastien B. <seb...@ul...> - 2011-07-13 16:28:53
|
> ________________________________________ > De : David Eccles (gringer) [dav...@mp...] > Date d'envoi : 11 juillet 2011 07:21 > À : Sébastien Boisvert > Cc : den...@li... > Objet : Re: RE : Kmer formats and colour-space / bit logic > > On 07/07/11 00:08, Sébastien Boisvert wrote: >>>> Color-space is not necessary I think, >>>> m_parameters->getColorSpaceMode does that already. >>> But then you can't do nifty tricks like matching colour-space to >>> base-space, which *can* be done by using a different k-mer format. >> What do you mean exactly here ? > > I was thinking of a situation where you might have k-mers stored as both > colour-space and base-space. Upon reflection, I realised that there's > really no point in this, and everything should just be stored as > colour-space. If you want a strict comparison (i.e. the same as matching > in base-space), then you enforce a first-base for every k-mer. > I also believe color-space assembly should be done in color-space. >> I don't see the point of doing checksums for k-mers because the only data that >> are communicated transit with the message-passing interface. And I believe the underlying >> bit transfer layers (TCP, Infiniband, or another one) already verify data integrity. > > Either a checksum or a 'this sequence is invalid' bit would be useful, I > think. This would allow functions that return k-mers to indicate that a > mis-translation has occurred (e.g. adding edges to something with an > unknown first base). My main reason for using a checksum was for > ferreting out areas in the code that assumed a base-space format. It is > also useful for finding code errors caused by writing outside the > expected range, pointer problems, etc. I think I've dealt with most of > those now, so perhaps the processor overhead is not necessary. > Worthless if the code base is not buggy, then. Checksums are mostly useful when data corruption occurs in a way that is out of control. 
>>> So in positions 60-63 when using 1 64-bit number, positions 125-128 >>> when using 2, etc.? That means the location of the flags is less easy >>> to determine. I suppose you could put them always in positions 60-63 >>> (i.e. at the end of the first array entry), but that's pretty much the >>> same as positions 0-3. >> The location is easy to locate -- its starting bit is basically 2*kmerLength, >> assuming kmerLength+2<=MAXKMERLENGTH. > > This assumption is dangerous, or not appropriate, because the code > allows for a k-mer length different from MAXKMERLENGTH, and for > different k-mer lengths for different k-mers (e.g. kMerAtPosition in > common_functions.cpp doesn't check to see if w matches a static width > variable). > It is perfectly safe. You just have to enforce the simple rule kmerLength+2 <= MAXKMERLENGTH so you can store your tracking information. Anyway, you have to store this information somewhere; it is either in the array of uint64_t of a Kmer or another attribute of a Kmer. >> I know that doing it this way would not break the code, I think you would just need to change the hashing functions >> to reset (set to 0) all the fields starting at 2*kmerLength in a Kmer. > > Yes, that should work. There needs to be some thought about how to treat > colour-space sequences with unknown first bases, though. Should they > hash to the same position (possibly getting changed when/if the first > base is known)? If they hash to different positions, what happens when > you would be able to find out with high reliability what the first base > of a k-mer should be? > Does your institution have a SOLiD 5500xl? My opinion on the matter is: Life Technologies should write base-caller software that transforms these color-space files into fastq files. All major vendors (Illumina, 454, Pacific Biosciences, Ion Torrent, Helicos, and probably Complete Genomics) do that already. fastq and sam/bam are the de facto standards. 
End users don't care about the internal file formats of a sequencing
pipeline (Illumina's export, 454's SFF, PacBio's movies, or Ion Torrent's
SFF). I just don't see the point of pushing color-space into user land;
this should stay in sequencer pipeline land. Who cares that it is not a
DNA polymerase?

That is my opinion, of course.

> Thanks for your help,
>
> David

Best.

Sébastien
From: Sébastien B. <seb...@ul...> - 2011-07-13 16:17:43
Can you report this to denovoassembler-devel?
https://lists.sourceforge.net/lists/listinfo/denovoassembler-devel
(should be up in a few)

When you post to denovoassembler-devel, can you paste
CoverageDistribution.txt on pastebin and link it in your email?

By the way: the maximum message length on the mailing list is 65536 bytes;
longer messages are rejected.

Also: which data are you testing on?

This may help you:

(1) For developing new code in Ray, I use assertions. So when you see
#ifdef ASSERT, the code inside these blocks is not for production; it is
just used for development. For instance, if you read
structures/MyHashTable.h or memory/DefragmentationGroup.cpp, you will see
a lot of them.

(2) Unit tests are in unit-tests; bash main.sh runs the unit tests.

(3) System tests are in system-tests. data-for-system-tests is not in the
git repository; I can upload these bits to a system of yours if you want.
Then you must create a symbolic link as follows:

/rap/nne-790-ab/data-for-system-tests -> actual location of data-for-system-tests

To run system tests on an SMP computer:

cd system-tests
./main-smp.sh

To run system tests with Sun Grid Engine:

cd system-tests
./main-sge.sh

Sébastien

> ________________________________________
> From: David Eccles (gringer) [dav...@mp...]
> Sent: 13 July 2011 07:27
> To: den...@li...
> Subject: Re: [Denovoassembler-users] Confused about coding -- completed seeds without distributions
>
> On 12/07/11 16:49, David Eccles (gringer) wrote:
>> However, the code I have doesn't seem to have any success in finding seeds.
>
> I added a bit of debugging output, and came up with this:
>
> Rank 0 is selecting optimal read markers [1/1000]
> inserting sequence (0): T0312332212212032313211021010112303320130220032131
> inserting sequence (1): T0230102032123100220032121302331231100332020012031
> inserting sequence (2): T2102202120200310123200323013323110122001100012320
> inserting sequence (3): C1312102002001322010320202013210121223010123122033
> inserting sequence (4): A2121223331021020132132300201223202232300220310201
> ...
> Rank 0 is creating seeds [1/10020]
> Vertex 1-1 test (4): testing C21012121330003101031
> Vertex 1-1 test (5): testing C13010130003312121012
> Vertex 1-1 test (6): testing A33102002200001000130
> Vertex 1-1 test (7): testing T03100010000220020133
> Vertex 1-1 test (8): testing G23300202002023312231
> Vertex 1-1 test (9): testing A13221332020020200332
> ...
> Vertex 1-1 test (8168): testing T31233221221203231321
> ...
> Continuing (8168): vertex has 1 in, 1 out: T31233221221203231321
> Vertex 1-1 test (8168): testing T03123322122120323132
> ...
> Finished (8168): parent has 1 in, 1 out: T03123322122120323132
> worker 8168 done: found a seed with 20 nucleotides
>
> [workerIds are in brackets]
>
> So this seed (T31233221221203231321) is rejected, because there's a
> parent vertex (T03123322122120323132) that is closer to the start of the
> sequence. Except that parent vertex is never processed; it's nowhere in
> my output. Note that the first workerId is 4, not 0. I have a suspicion
> that the wonderful "ultimate first seed" is something that should be
> managed by worker 0, who decided to take a holiday.
>
> This might not be a problem for imperfect reads, because you'd
> expect breaks in the chain somewhere (so at least *some* sequence would
> map properly, even if the first workers weren't processing). I think the
> problem here is that all vertices are linked to this first one.
>
> I still need to hunt more, but I think I'm getting close to the problem.
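The seed-start rule visible in the log above (a vertex defers to a parent that also has exactly one edge in and one edge out) can be illustrated on a toy graph. This is a hedged sketch of the decision logic only, not Ray's SeedWorker code; the names ToyGraph, ToyVertex, and canStartSeed are invented for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct ToyVertex {
    std::vector<std::string> parents;
    std::vector<std::string> children;
};

typedef std::map<std::string, ToyVertex> ToyGraph;

// A vertex can start a seed only if it has 1 edge in and 1 edge out, and
// its parent does NOT also have 1 in / 1 out -- otherwise the parent is
// closer to the start of the chain and the seed is deferred to it, which
// is exactly why the seed at worker 8168 above was rejected.
bool canStartSeed(const ToyGraph& g, const std::string& v) {
    ToyGraph::const_iterator it = g.find(v);
    if (it == g.end())
        return false;
    const ToyVertex& vertex = it->second;
    if (vertex.parents.size() != 1 || vertex.children.size() != 1)
        return false;
    ToyGraph::const_iterator p = g.find(vertex.parents[0]);
    if (p != g.end()
        && p->second.parents.size() == 1
        && p->second.children.size() == 1)
        return false; // defer to the parent
    return true;
}
```

On a perfect (error-free) chain, only the vertex just after the chain head can start a seed; if the worker responsible for that vertex never runs, no seed is ever produced, which is consistent with the symptom David reports.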
>
> --- David
>
> _______________________________________________
> Denovoassembler-users mailing list
> Den...@li...
> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users