From: mathog <ma...@ca...> - 2015-03-06 21:49:43
We have some data that consists of reads (Sanger) from pooled BACs. Let's say for the sake of illustration that there are 3 BACs in each pool, and let's look at the 5' end of the insert. There will be 3 classes of reads that look like:

  [vector][seq1]
  [vector][seq2]
  [vector][seq3]

where vector is the BAC vector, not the sequencing vector, and where of course the amount of sequence on each side of the junction will vary from read to read.

It is important to keep track of these end sequences. Is that possible with this assembler?

One option is to note in a file somewhere that these reads are ends, and cut off the vector ahead of time. A problem with that is that there isn't a huge amount of data in hand and some of the remaining pieces will be small, so they will be dropped from the assembly. That is, it may cause an "edge effect" which would most likely cause many bases to be lost from each end, even if the rest of the assembly works. One would also need to tell the assembler somehow that these are ends, so it doesn't mistakenly assemble things on the other side if the sequence at the junction happens to be repetitive. (Is there a way to mark an input sequence like that?) Finally, one would need to be able to map whatever name the assembler uses internally for the reads back to the ones in the saved file.

The other option is to leave the vector in, but that will result in a forked structure when the vector sequences line up during overlap, and the assembler will cut off the vector at the base of the fork anyway. That leads right back to the first case, unless there is some way to give the assembler the BAC vector and then have it "do the right thing" by not cutting the forked structure at the junction, but instead splitting it into the 3 classes. Is there a way to tell wgs to do that?

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech
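
To make the first option concrete, here is a minimal sketch of trimming the BAC vector ahead of time while recording which reads are ends. It is not part of wgs. The function names, the `vector_tail` input (the BAC vector sequence immediately upstream of the cloning site), and the exact-match search are all assumptions made for illustration; real Sanger reads would likely need an error-tolerant alignment instead of `str.find`.

```python
def trim_5prime_vector(read, vector_tail, min_match=12):
    """Return (insert_seq, cut_pos) if the read crosses the vector/insert
    junction, else None.

    vector_tail is a hypothetical input: the BAC vector sequence that sits
    immediately upstream of the insert. Only its last min_match bases are
    used as an exact-match probe (an assumption; no sequencing errors allowed).
    """
    if not vector_tail:
        raise ValueError("vector_tail must not be empty")
    probe = vector_tail[-min_match:]
    pos = read.find(probe)
    if pos == -1:
        return None
    cut = pos + len(probe)          # first base of the insert
    return read[cut:], cut


def split_end_reads(reads, vector_tail, min_match=12):
    """Trim vector from every read that contains the junction.

    reads maps read name -> sequence. Returns (trimmed, end_reads), where
    trimmed maps every read name to its (possibly trimmed) sequence and
    end_reads maps the names of junction reads to the cut position, so the
    original names can be looked up again after assembly.
    """
    trimmed = {}
    end_reads = {}
    for name, seq in reads.items():
        hit = trim_5prime_vector(seq, vector_tail, min_match)
        if hit is None:
            trimmed[name] = seq     # internal read, left untouched
        else:
            insert_seq, cut = hit
            trimmed[name] = insert_seq
            end_reads[name] = cut   # remember that this read is an end
    return trimmed, end_reads
```

The `end_reads` table is the "file somewhere" mentioned above. It does not address the edge effect or the question of how to tell the assembler that a read is an end.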