From: Walenz, B. <bw...@jc...> - 2012-04-24 20:22:16
Hi-

I was fearing the day someone would ask about this. We had a choice of
either doing lots of engineering to optimize directly saving names of
fastq reads, or an inelegant - and only partially completed - solution
of stripping the names when the reads are loaded into the gatekeeper
store, and adding them back as a post process.

The names and mapping are saved in the *.gkpStore.fastqUIDmap. The
format is:

    UID IID Name                  (for unpaired reads)
    UID IID Name UID IID Name     (for paired reads)

IIDs are used internal to the assembler. Most logs refer to reads
(unitigs, contigs and scaffolds) using these. There is an implicit
'type' with each IID. "1" is a valid IID for four objects: a fragment,
a unitig, a contig and a scaffold.

UIDs appear in the outputs - posmap and asm. These are guaranteed to be
unique within the assembly. For reads loaded as .frg, the UID is the
read name.

The iidtouid file gives a mapping from IID to UID, for every object in
the assembly, not just reads.

Sorry for the pain. We're a bit short on engineering time at the
moment, and as this wasn't an issue critical to getting a good
assembly, we only made it 'not break' for an assembly with > 1 billion
reads.

b

On 4/24/12 1:52 PM, "Arjun Prasad" <ap...@ma...> wrote:

>
> Hi,
>
> I need to get a read-mapping with the actual read-names for an assembly
> that was created based on FASTQ input sequences. I noticed the iidtouid
> file in the 9-terminator directory, but it has numbers for fragments
> rather than read names.
>
> Looking at the reads from the 9-terminator/.frg file I matched up some by
> sequence, and it looks like the FRG numbers are alternating reads from
> each of the paired ends.
>
> e.g.,
>
> FRG 1 110000000001 - first entry from read 1
> No FRG 2
> FRG 3 110000000003 - 2nd entry from read 1
> FRG 4 120000000003 - 2nd entry from read 2
> FRG 5 110000000005 - 3rd entry from read 1
> FRG 6 120000000005 - 3rd entry from read 2
> FRG 100000 120000099999 - Entry 50,000 from read 2
>
> I'm guessing that I can figure out the read name to iid translation by
> counting into the fastq files by FRG # / 2
>
> Has anyone else done this? Did I correctly interpret what the FRG numbers
> mean? Are there any gotchas at input file boundaries?
>
> Thanks,
> Arjun
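
The fastqUIDmap layout described in the reply (UID IID Name, once per line for an unpaired read and twice per line for a pair) can be read back into an IID-to-name table with a few lines of Python. This is only a minimal sketch, not part of the assembler: the function name `parse_uid_map` is made up for illustration, and it assumes fields are whitespace-separated and that read names contain no whitespace, neither of which is confirmed above. The sample lines reuse UIDs from the quoted message, but the read names in them are invented.

```python
def parse_uid_map(lines):
    """Map fragment IID -> (UID, read name) from fastqUIDmap-style lines.

    Assumed layout (from the description in this thread):
      UID IID Name                 unpaired read
      UID IID Name UID IID Name    paired reads
    Assumption: fields are whitespace-separated and names have no spaces.
    """
    mapping = {}
    for lineno, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (3, 6):
            raise ValueError(
                f"line {lineno}: expected 3 or 6 fields, got {len(fields)}"
            )
        # Walk the line in (UID, IID, Name) triples: one or two per line.
        for i in range(0, len(fields), 3):
            uid, iid, name = fields[i:i + 3]
            mapping[int(iid)] = (uid, name)
    return mapping


# Hypothetical sample data: the read names here are invented.
sample = [
    "110000000001 1 readA",
    "110000000003 3 readB/1 120000000003 4 readB/2",
]
iid_to_read = parse_uid_map(sample)
print(iid_to_read[4][1])
```

On a real assembly the same function could be fed an open file handle for the *.gkpStore.fastqUIDmap file instead of the sample list. Only fragment IIDs should be looked up in the result, since the same IID number can also refer to a unitig, contig or scaffold.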