[wgs-assembler-users] retaining vector sequence as an end marker?


We have some data that consists of Sanger reads from pooled BACs.  
Let's say for the sake of illustration that there are 3 BACs in each 
pool and let's look at the 5' end of the insert.  There will be 3 
classes of reads that look like:

   [vector][seq1]
   [vector][seq2]
   [vector][seq3]

where vector is the BAC vector, not the sequencing vector, and where of 
course the amount of sequence on each side of the junction will vary 
from read to read.

It is important to keep track of these end sequences.  Is that possible 
with this assembler?

One option is to note in a file somewhere that these reads are ends, and 
cut off the vector ahead of time.  A problem with that is that there 
isn't a huge amount of data in hand and some of the remaining pieces 
will be small, so they will be dropped from the assembly.  That is, it 
may cause an "edge effect" which would most likely cause many bases to 
be lost from each end, even if the rest of the assembly works.  One 
would also need to tell the assembler somehow that these are ends, so it 
doesn't mistakenly assemble things on the other side if the sequence at 
the junction happens to be repetitive.  (Is there a way to mark an input 
sequence like that?)  Finally, one would need to be able to map whatever 
name the assembler uses internally for the reads back to the ones in the 
saved file.
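
As a rough sketch of the bookkeeping in this first option (done outside 
the assembler, not a wgs feature), the vector could be cut off and the 
end reads recorded under their original names before assembly.  The 
function names, the exact-match anchor, and the record layout below are 
assumptions for illustration only; real Sanger reads contain errors, so 
a real version would need an alignment-based search rather than an exact 
string match.

```python
def split_at_junction(read, vector_end, anchor_len=12):
    """Split a read at the vector/insert junction.

    Looks for the last `anchor_len` bases of the BAC vector (exact match,
    an assumption) and returns (vector_part, insert_part), or None if the
    anchor is not found in the read.
    """
    anchor = vector_end[-anchor_len:]
    idx = read.find(anchor)
    if idx == -1:
        return None
    junction = idx + len(anchor)
    return read[:junction], read[junction:]


def trim_end_reads(reads, vector_end, anchor_len=12):
    """Trim vector from reads and record which reads were ends.

    `reads` maps read name -> sequence.  Returns (trimmed, end_reads),
    where `trimmed` maps the same names to vector-free sequence and
    `end_reads` lists (name, bases_removed) for each read that crossed
    the junction, so the names can be mapped back later.
    """
    trimmed = {}
    end_reads = []
    for name, seq in reads.items():
        parts = split_at_junction(seq, vector_end, anchor_len)
        if parts is None:
            trimmed[name] = seq  # no junction found: keep read unchanged
        else:
            vector_part, insert_part = parts
            trimmed[name] = insert_part
            end_reads.append((name, len(vector_part)))
    return trimmed, end_reads
```

Because the original read names are kept as keys, the saved list of end 
reads can be matched against the assembler's output as long as the 
assembler preserves or reports those names.  This does nothing about 
the edge effect or the repeat problem described above.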

The other option is to leave the vector in, but that will result in a 
forked structure when the vector sequences line up during overlap, and 
the assembler will cut off the vector at the base of the fork anyway, 
which goes right back to the first case.  The exception would be if 
there is some way to give the assembler the BAC vector and then have it 
"do the right thing" by not cutting the forked structure at the 
junction, but instead splitting it into the 3 classes.  Is there a way 
to tell wgs to do that?
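
To make "splitting it into the 3 classes" concrete, here is a minimal 
sketch of what that grouping might look like if done as a separate 
preprocessing step in Python.  It is not something wgs is known to do; 
the function name, the exact-match vector anchor, and the choice of 
keying each class on the first few insert bases are all assumptions.

```python
from collections import defaultdict


def classify_by_insert_start(reads, vector_anchor, key_len=10):
    """Group end reads by the first `key_len` bases after the junction.

    `reads` maps read name -> sequence; `vector_anchor` is the last few
    bases of the BAC vector (matched exactly, an assumption).  Returns
    (classes, unclassified): `classes` maps an insert-start key to the
    names of reads sharing it, and `unclassified` lists end reads whose
    insert portion is shorter than `key_len`.
    """
    classes = defaultdict(list)
    unclassified = []
    for name, seq in reads.items():
        idx = seq.find(vector_anchor)
        if idx == -1:
            continue  # not an end read
        insert = seq[idx + len(vector_anchor):]
        if len(insert) < key_len:
            unclassified.append(name)
        else:
            classes[insert[:key_len]].append(name)
    return dict(classes), unclassified
```

With 3 BACs in a pool, one would hope to see 3 keys per insert end.  
Sequencing errors near the junction would split a class under exact 
matching, so the keys would need some tolerance in practice.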

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech