Re: [Denovoassembler-users] colour space reads
Ray -- Parallel genome assemblies for parallel DNA sequencing
From: Eccles, D. <dav...@mp...> - 2011-06-27 21:50:37
From: Sébastien Boisvert [mailto:Seb...@ul...]
>> You mentioned yourself that there's no need to store the reverse complement
>> when in colour-space. To get the reverse complement in base space, you
>> reverse complement the first base, convert to base space, then reverse the
>> sequence.
> Complement the first base and reverse the color -- this is the recipe to
> "reverse-complement" a color-space read.
> I think I am starting to get it.

Be careful with this. Order of processing matters a lot with colour space.
You need to reverse the resulting *base-space* sequence, rather than
reversing the colour-space sequence and then working it out in base space
(e.g. the reverse complement of A3200233 is reverse(T3200233), i.e. the
base-space conversion of T3200233 read backwards, not T3320023).

>> If there is a good chance of a match between two reads, and one read has an
>> unknown first base, then you can infer that base from the other read.
> Yes, but keep in mind that Ray never computes pairwise similarity.

Sure. In the scenario I described, both sequences would have exactly the
same colour-space representation (excluding first base) -- no pairwise
differences necessary. The only difference is that one can be converted
unambiguously to a base-space sequence (known first base), and the other has
up to 4 base-space representations (unknown first base).

> Like in Velvet, Ray uses 2 bits per symbol.

And also a flag for whether or not the kmer is in colour-space (or all kmers
in colour space), I presume. For each kmer (assuming you want to be able to
output in base-space), Ray will also need to record a first base, preferably
in a separate structure, but it could just be the first 2-bit symbol in the
sequence.

> a path can obviously start in the middle of a read -- thus in that case
> the first base would remain unknown. (right?)

From each read, you can generate putative first bases for any subsequence of
an uninterrupted <first base>[0123]+ sequence.
This requires converting the sequence to base space, and inserting the
converted base at the appropriate position. I'll try to demonstrate this
starting with a colour-space sequence:

A2112322311010133121320003202203201302321

This has starting base A; a repeated base has colour 0, complementary
transitions have colour 3, and non-complementary transitions are 1 or 2
depending on how far apart the bases are in the alphabet [just FYI, that's
how I remember it]:

AGTGATCTACAACCATACTGCTTTTAGGAGGCTTGCCTAGT

[or something like that -- hopefully I converted it correctly]

If I start with the colour-space sequence, I can work out the 'starting
base' at any position by converting to base-space. For example, before the
string of 3 0s, you can insert a T:

<A>211232231101013312132<T>0003202203201302321

I'll try working through a scenario. Let's say I want the sequence split up
into groups of 10-mers:

2112322311 0101331213 2000320220 3201302321

I know the first base for the first group:

<A>2112322311 0101331213 2000320220 3201302321

I can convert that first group to base space, and the last base of that
converted group is the first base for the next group:

(<A>2112322311 / AGTGATCTACA) <A>0101331213 2000320220 3201302321

and so on:

<A>2112322311 <A>0101331213 <C>2000320220 <G>3201302321

If there's a misread somewhere, any sequences past the misread will have
ambiguous colour-space -> base-space translations:

<A>2112322311 <A>01013X1213 <N>2000320220 <N>3201302321

The problem is that for a sufficiently large dataset (or error-containing
dataset), you'll get disagreements about the starting base for a given
sequence. If Ray were to record the counts for each observed starting base,
it might be possible to reduce this error (e.g. pick the most frequently
occurring starting base), bearing in mind that the starting base for a
sequence closer to the start of a read will be more reliable than the
calculated starting bases at the end of a read.

Hope this helps,

David Eccles (gringer)
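
P.S. In case it makes the conversions above easier to follow, here is a
rough Python sketch of them. It is only an illustration, not how Ray stores
or processes anything: the function names are made up, and it assumes the
usual 2-bit coding A=0, C=1, G=2, T=3, so that a colour is the XOR of the two
adjacent base codes. Anything other than 0-3 is treated as a misread that
makes every later base unknown (N).

```python
from collections import Counter

BASES = "ACGT"  # assumed 2-bit coding: A=0, C=1, G=2, T=3


def next_base(base, colour):
    """Apply one colour transition (hypothetical helper)."""
    if base == "N" or colour not in "0123":
        return "N"  # a misread makes all following bases ambiguous
    return BASES[BASES.index(base) ^ int(colour)]


def decode(read):
    """Convert a colour-space read (first base + colours) to base space."""
    seq = read[0]
    for colour in read[1:]:
        seq += next_base(seq[-1], colour)
    return seq


def reverse_complement(read):
    """Complement the first base, decode, THEN reverse the base-space result."""
    comp = BASES[3 - BASES.index(read[0])]  # A<->T, C<->G
    return decode(comp + read[1:])[::-1]


def kmer_first_bases(read, k):
    """Split the colours into consecutive k-mers, each paired with the
    base-space base that immediately precedes it."""
    bases = decode(read)
    colours = read[1:]
    return [(bases[i], colours[i:i + k])
            for i in range(0, len(colours) - k + 1, k)]


def vote_first_base(observed):
    """Pick the most frequently observed first base for one k-mer."""
    return Counter(observed).most_common(1)[0][0]


read = "A2112322311010133121320003202203201302321"
print(decode(read))
# AGTGATCTACAACCATACTGCTTTTAGGAGGCTTGCCTAGT
print(reverse_complement("A3200233"))
# ATAGGGAT
print(kmer_first_bases(read, 10))
# [('A', '2112322311'), ('A', '0101331213'), ('C', '2000320220'), ('G', '3201302321')]
```

The vote is deliberately naive: it counts every observation equally, whereas
a first base worked out near the start of a read should probably count for
more than one worked out near the end.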