[Denovoassembler-devel] RE : RE : Kmer formats and colour-space / bit logic
Ray -- Parallel genome assemblies for parallel DNA sequencing
From: Sébastien B. <seb...@ul...> - 2011-07-13 16:28:53
> ________________________________________
> From: David Eccles (gringer) [dav...@mp...]
> Sent: 11 July 2011 07:21
> To: Sébastien Boisvert
> Cc: den...@li...
> Subject: Re: RE : Kmer formats and colour-space / bit logic
>
> On 07/07/11 00:08, Sébastien Boisvert wrote:
>>>> Color-space is not necessary I think,
>>>> m_parameters->getColorSpaceMode does that already.
>>> But then you can't do nifty tricks like matching colour-space to
>>> base-space, which *can* be done by using a different k-mer format.
>> What do you mean exactly here ?
>
> I was thinking of a situation where you might have k-mers stored as both
> colour-space and base-space. Upon reflection, I realised that there's
> really no point in this, and everything should just be stored as
> colour-space. If you want a strict comparison (i.e. the same as matching
> in base-space), then you enforce a first-base for every k-mer.
>

I also believe color-space assembly should be done in color-space.

>> I don't see the point of doing checksums for k-mers because the only data that
>> are communicated transit with the message-passing interface. And I believe the underlying
>> bit transfer layers (TCP, Infiniband, or another one) already verify data integrity.
>
> Either a checksum or a 'this sequence is invalid' bit would be useful, I
> think. This would allow functions that return k-mers to indicate that a
> mis-translation has occurred (e.g. adding edges to something with an
> unknown first base). My main reason for using a checksum was for
> ferreting out areas in the code that assumed a base-space format. It is
> also useful for finding code errors caused by writing outside the
> expected range, pointer problems, etc. I think I've dealt with most of
> those now, so perhaps the processor overhead is not necessary.
>

Then checksums are worthless if the code base is not buggy. Checksums are mostly useful when data corruption occurs in a way that is outside your control.
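On the "enforce a first-base" point quoted above, here is a minimal C++ sketch of why the first base matters. It assumes the usual di-base colour encoding, where each colour is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names baseToCode, toColourSpace and toBaseSpace are made up for this illustration and are not taken from the Ray code base.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: 2-bit code of a nucleotide (A=0, C=1, G=2, T=3).
// Returns -1 for any other character.
int baseToCode(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    return -1;
}

// Hypothetical helper: encode a base-space sequence as colours '0'..'3'.
// Each colour is the XOR of the codes of two adjacent bases, so a sequence
// of n bases gives n-1 colours and the first base itself is lost.
// Assumes the input contains only A, C, G and T.
std::string toColourSpace(const std::string& bases) {
    std::string colours;
    for (std::size_t i = 1; i < bases.size(); ++i) {
        int colour = baseToCode(bases[i - 1]) ^ baseToCode(bases[i]);
        colours += static_cast<char>('0' + colour);
    }
    return colours;
}

// Hypothetical helper: decode colours back to bases. This is only possible
// once a first base is supplied; each colour is XORed onto the running code.
std::string toBaseSpace(char firstBase, const std::string& colours) {
    const char codeToBase[] = "ACGT";
    std::string bases(1, firstBase);
    int code = baseToCode(firstBase);
    for (std::size_t i = 0; i < colours.size(); ++i) {
        code ^= colours[i] - '0';
        bases += codeToBase[code];
    }
    return bases;
}
```

With this encoding, ACGT and CATG both map to the colours 131, so two colour-space k-mers can be equal while their base-space sequences differ. Fixing the first base (A or C here) removes the ambiguity, which is the strict comparison David describes.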
>>> So in positions 60-63 when using 1 64-bit number, positions 125-128
>>> when using 2, etc.? That means the location of the flags is less easy
>>> to determine. I suppose you could put them always in positions 60-63
>>> (i.e. at the end of the first array entry), but that's pretty much the
>>> same as positions 0-3.
>> The location is easy to locate -- its starting bit is basically 2*kmerLength,
>> assuming kmerLength+2<=MAXKMERLENGTH.
>
> This assumption is dangerous, or not appropriate, because the code
> allows for a k-mer length different from MAXKMERLENGTH, and for
> different k-mer lengths for different k-mers (e.g. kMerAtPosition in
> common_functions.cpp doesn't check to see if w matches a static width
> variable).
>

It is perfectly safe. You just have to enforce the simple rule kmerLength+2 <= MAXKMERLENGTH so that you have room to store your tracking information. Anyway, you have to store this information somewhere: either in the uint64_t array of a Kmer or as another attribute of a Kmer.

>> I know that doing it this way would not break the code, I think you would just need to change the hashing functions
>> to reset (set to 0) all the fields starting at 2*kmerLength in a Kmer.
>
> Yes, that should work. There needs to be some thought about how to treat
> colour-space sequences with unknown first bases, though. Should they
> hash to the same position (possibly getting changed when/if the first
> base is known)? If they hash to different positions, what happens when
> you would be able to find out with high reliability what the first base
> of a k-mer should be?
>

Does your institution have a SOLiD 5500xl ?

My opinion on the matter is that Life Technologies should write base-calling software that transforms these color-space files into fastq files. All major vendors (Illumina, 454, Pacific Biosciences, Ion Torrent, Helicos, and probably Complete Genomics) do that already. fastq and sam/bam are the de facto standards.
End users don't care about the internal file formats of a sequencing pipeline (Illumina's export, 454's SFF, PacBio's movies or Ion Torrent's SFF). I just don't see the point of pushing color-space into user land; it should stay in sequencer pipeline land. Who cares that it is not a DNA polymerase? That is my opinion, of course.

> Thanks for your help,
>
> David
>

Best.

Sébastien
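P.S. To make the flag-placement rule discussed above concrete, here is a rough C++ sketch. It assumes 2 bits per nucleotide packed from bit 0 upward in an array of uint64_t, and 4 flag bits starting at bit 2*kmerLength. The 4-bit width is my reading of "kmerLength+2" (two spare nucleotide slots). SketchKmer, getFlags, setFlags and sequenceBitsOnly are hypothetical names, not the actual Kmer class, and the MAXKMERLENGTH value is chosen only for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for this sketch only.
const int MAXKMERLENGTH = 64;
const int KMER_WORDS = (2 * MAXKMERLENGTH + 63) / 64;  // 2 words of 64 bits

// Hypothetical stand-in for a k-mer: 2 bits per nucleotide from bit 0 up.
struct SketchKmer {
    uint64_t words[KMER_WORDS];
};

// Read the 4 flag bits stored just after the sequence, at bit 2*kmerLength.
unsigned getFlags(const SketchKmer& kmer, int kmerLength) {
    assert(kmerLength + 2 <= MAXKMERLENGTH);  // the rule from the thread
    unsigned flags = 0;
    for (int i = 0; i < 4; ++i) {
        int bit = 2 * kmerLength + i;
        uint64_t value = (kmer.words[bit / 64] >> (bit % 64)) & 1;
        flags |= static_cast<unsigned>(value) << i;
    }
    return flags;
}

// Write the 4 flag bits; they may straddle two array entries.
void setFlags(SketchKmer& kmer, int kmerLength, unsigned flags) {
    assert(kmerLength + 2 <= MAXKMERLENGTH);
    for (int i = 0; i < 4; ++i) {
        int bit = 2 * kmerLength + i;
        uint64_t mask = uint64_t(1) << (bit % 64);
        if ((flags >> i) & 1) {
            kmer.words[bit / 64] |= mask;
        } else {
            kmer.words[bit / 64] &= ~mask;
        }
    }
}

// Return a copy with every bit from 2*kmerLength upward set to 0, so that
// a hashing function applied to the copy sees only the sequence.
SketchKmer sequenceBitsOnly(const SketchKmer& kmer, int kmerLength) {
    SketchKmer out = kmer;
    int firstFlagBit = 2 * kmerLength;
    for (int w = 0; w < KMER_WORDS; ++w) {
        int lowestBit = w * 64;
        if (firstFlagBit <= lowestBit) {
            out.words[w] = 0;  // entirely above the sequence
        } else if (firstFlagBit < lowestBit + 64) {
            out.words[w] &= (uint64_t(1) << (firstFlagBit - lowestBit)) - 1;
        }
    }
    return out;
}
```

The point of sequenceBitsOnly is that two k-mers with the same sequence but different flags give the same input to the hash, so the existing hashing code only needs that one masking step in front of it.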