#121 Retain Illumina/Solexa read UID

Gatekeeper
closed
Brian Walenz
Feature (48)
5
2011-01-03
2010-05-21
Jason Miller
No

Celera Assembler 6.1 does not preserve the read UID from the fastq file of Illumina reads. That was an understandable design choice, since the read UID can be very long. However, the current behavior makes it nearly impossible to compare fastq and asm files. Suppose we'd like to compare the clear range of a trimmed read (AFG message in the ASM file) to the original read in the fastq file. This is hard or impossible because CA assigns an arbitrary UID to each read. Furthermore, it assigns the same UID to both reads of a pair. (Aside: for unpaired Illumina reads, are there DST messages to link each read to its library?)

Here are some suggestions. Generate a file mapping UID to IID during gatekeeper and copy it to the 9-terminator directory. Alternatively, preserve the Illumina read ID in the gkpStore, even though it is long. Alternatively, extract the variable portion of the Illumina read ID and use that, omitting the portion of the read name that encodes the run ID.
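To illustrate the last suggestion, here is a minimal Python sketch of splitting a read name into its run-specific prefix and its variable portion. It assumes the older Illumina naming layout `instrument:lane:tile:x:y#index/mate` (for example `HWUSI-EAS100R:6:73:941:1973#0/1`); the function name `variable_portion` and the choice of which fields to keep are hypothetical, and nothing here reflects how gatekeeper actually handles names.

```python
import re

# Assumed layout of an older Illumina read name (not taken from the CA source):
#   instrument:lane:tile:x:y#index/mate
READ_NAME = re.compile(
    r"^@?(?P<instrument>[^:\s]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>-?\d+):(?P<y>-?\d+)"
    r"(?:#(?P<index>[^/\s]+))?"
    r"(?:/(?P<mate>[12]))?$"
)


def variable_portion(read_name):
    """Return lane:tile:x:y (plus /mate if present), or None if the name
    does not follow the assumed layout."""
    match = READ_NAME.match(read_name.strip())
    if match is None:
        return None
    short = "{lane}:{tile}:{x}:{y}".format(**match.groupdict())
    if match.group("mate"):
        short += "/" + match.group("mate")
    return short
```

Dropping the instrument prefix is only safe when all reads in a library come from one run and one instrument; otherwise the shortened names could collide.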

Discussion

  • Jason Miller
    2010-05-24

    (Translated from Bri) The UIDs of the two reads in a pair should differ by one digit. If they don't, that's a bug in gatekeeper and terminator. Like the idea of gatekeeper building a UID <-> IID mapping, then modifying terminator to replace IIDs with UIDs from this mapping. This keeps UIDs out of the pipeline. However, UIDs are convenient when debugging consensus since they indicate read type (platform, mate), but a table of those would serve the same purpose.

     
  • Brian Walenz
    2011-01-03

    • assigned_to: nobody --> brianwalenz
    • status: open --> closed
     
  • Brian Walenz
    2011-01-03

    Gatekeeper now reports the mapping from CA UID to Illumina read name. With runCA, this is in (assemblydirectory)/(gkpStore).illuminaUIDmap. The file is created automatically; no runCA or gatekeeper command-line option is needed.
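    For anyone wanting to use this file to go from an ASM-file UID back to the fastq record, a rough Python sketch follows. The column layout of the map file is not described in this ticket, so the sketch assumes whitespace-separated columns and makes the UID and read-name positions parameters (`uid_col`, `name_col`); check those against an actual file before relying on it. The function name `load_uid_map` is hypothetical.

```python
def load_uid_map(path, uid_col=0, name_col=-1):
    """Build a dict of CA UID -> Illumina read name from a map file.

    ASSUMPTION: one mapping per line, whitespace-separated, with the UID in
    column uid_col and the read name in column name_col. The real layout of
    the illuminaUIDmap file should be verified first.
    """
    uid_to_name = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank, comment, or too-short lines.
            if len(fields) < 2 or fields[0].startswith("#"):
                continue
            uid_to_name[fields[uid_col]] = fields[name_col]
    return uid_to_name
```

    With the dict in hand, the UID from an AFG message can be looked up to get the original read name, which can then be matched against the fastq headers.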