From: Brian W. <th...@gm...> - 2015-03-06 22:37:27

Restatement: you want to assemble three BACs whose ends share something artificial (the BAC vector) that shouldn't be assembled across BACs. Correct?

Add the kmers in the vector that shouldn't be assembled to the nmers.fasta in 0-mercounts. The overlapper will not seed overlaps with these kmers, but will extend overlaps into them. For two reads [vector][seq] and [vector][seq], the overlap will be seeded from a kmer in [seq], and the overlap will cover both reads entirely. Reads [vector] and [vector][seq] will share only kmers in nmers.fasta, which will be ignored, so no overlap is seeded between them.

To get the kmers, build a fasta of all the vector sequences from all the reads, and run meryl as the assembler does (IMPORTANT: with the -C flag). Append these to the nmers.fasta, or use only these kmers (with option ovlFrequentMers) and seed off of all kmers in the BAC sequences (if your pool is small).

An alternative, though a pita to do, would be to filter the overlaps to remove any vector-vector overlaps you don't want to assemble together. To do this, the ovlStore needs to be dumped, then filtered, then rebuilt. We can't edit an overlap store to mark overlaps as 'don't use'. The filtering can probably be done based only on read id, so it should be easy to do from the dumps.

b

On Fri, Mar 6, 2015 at 4:49 PM, mathog <ma...@ca...> wrote:
> We have some data that consists of reads (Sanger) from pooled BACs. Let's say for the sake of illustration that there are 3 BACs in each pool and let's look at the 5' end of the insert. There will be 3 classes of reads that look like:
>
> [vector][seq1]
> [vector][seq2]
> [vector][seq3]
>
> where vector is the BAC vector, not the sequencing vector, and where of course the amount of sequence on each side of the junction will vary from read to read.
>
> It is important to keep track of these end sequences. Is that possible with this assembler?
>
> One option is to note in a file somewhere that these reads are ends, and cut off the vector ahead of time.
> A problem with that is that there isn't a huge amount of data in hand and some of the remaining pieces will be small, so they will be dropped from the assembly. That is, it may cause an "edge effect" which would most likely cause many bases to be lost from each end, even if the rest of the assembly works. One would also need to tell the assembler somehow that these are ends, so it doesn't mistakenly assemble things on the other side if the sequence at the junction happens to be repetitive. (Is there a way to mark an input sequence like that?) Finally, one would need to be able to map whatever name the assembler uses internally for the reads back to the ones in the saved file.
>
> The other option is to leave the vector in, but that will result in a forked structure when the vector sequences line up during overlap, and the assembler will cut off the vector at the base of the fork anyway. Which goes right back to the first case. Unless there is some way to give the assembler the BAC vector and then have it "do the right thing" by not cutting the forked structure at the junction, but instead splitting it into the 3 classes. Is there a way to tell wgs to do that?
>
> Thanks,
>
> David Mathog
> ma...@ca...
> Manager, Sequence Analysis Facility, Biology Division, Caltech
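
For the dump/filter/rebuild alternative mentioned in the reply, the filtering step might look roughly like the sketch below. It is only an illustration under stated assumptions: it assumes the overlap dump is plain text with one overlap per line and the two read ids in the first two whitespace-separated columns (check your own dump before relying on this), and `end_group` is a hypothetical table you would build yourself that maps each vector-containing end read to the BAC it belongs to. Dumping and rebuilding the store are not shown.

```python
# Sketch only: drop overlaps between end reads assigned to different BACs.
# Assumption: each dump line starts with the two read ids, whitespace-separated.
# `end_group` is a hypothetical dict {read_id: bac_label} for vector-containing reads.

def keep_overlap(line, end_group):
    """Return True if this dumped overlap line should be kept."""
    fields = line.split()
    if len(fields) < 2:
        return True  # blank or malformed lines are passed through untouched
    group_a = end_group.get(fields[0])
    group_b = end_group.get(fields[1])
    # Keep the overlap unless both reads are known end reads from different BACs.
    return group_a is None or group_b is None or group_a == group_b


def filter_overlaps(lines, end_group):
    """Return only the dump lines that pass keep_overlap, in original order."""
    return [line for line in lines if keep_overlap(line, end_group)]
```

To use it, read the dump line by line, pass the lines and your id-to-BAC table to `filter_overlaps`, write the result out, and rebuild the store from the filtered file. Overlaps involving reads that are not in the table are left alone, so interior BAC reads are unaffected.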