From: Arjun P. <ap...@ma...> - 2012-04-25 18:31:50
Hi,

Thanks, Brian, for the detailed explanation. The gkpStore.fastqUIDmap file is easy enough to parse. From what you said, it seems like the generated UID in the output may be something you guys fix at some point, right?

I wrote a little Perl script to convert the UIDs to read names in the posmap files. I didn't do it for the .asm file because posmaps are all I need for now. I posted it at http://arjunprasad.net/scripts/fixReadnamesInPosmap in case it's helpful for someone else. It took about 1.5 GB of RAM for 7 million reads with fairly long names.

It just occurred to me that fixReadnamesInPosmap doesn't handle the case where you have an assembly with some FASTQ files and some .frg files for input. That's easy to fix if it's useful to anyone.

Arjun

On Tue, 24 Apr 2012, Walenz, Brian wrote:

> Hi-
>
> I was fearing the day someone would ask about this. We had a choice of
> either doing lots of engineering to optimize directly saving names of fastq
> reads, or an inelegant - and only partially completed - solution of
> stripping the names when the reads are loaded into the gatekeeper store, and
> adding them back as a post process.
>
> The names and mapping are saved in the *.gkpStore.fastqUIDmap. The format
> is:
>
> UID IID Name (for unpaired reads)
> UID IID Name UID IID Name (for paired reads)
>
> IIDs are used internal to the assembler. Most logs refer to reads (unitigs,
> contigs and scaffolds) using these. There is an implicit 'type' with each
> IID. "1" is a valid IID for four objects: a fragment, a unitig, a contig
> and a scaffold.
>
> UIDs appear in the outputs - posmap and asm. These are guaranteed to be
> unique within the assembly. For reads loaded as .frg, the UID is the read
> name.
>
> The iidtouid file gives a mapping from IID to UID, for every object in the
> assembly, not just reads.
>
> Sorry for the pain. We're a bit short on engineering time at the moment,
> and as this wasn't an issue critical to getting a good assembly, we only
> made it 'not break' for an assembly with > 1 billion reads.
>
> b
>
> On 4/24/12 1:52 PM, "Arjun Prasad" <ap...@ma...> wrote:
>
>>
>> Hi,
>>
>> I need to get a read-mapping with the actual read-names for an assembly
>> that was created based on FASTQ input sequences. I noticed the iidtouid
>> file in the 9-terminator directory, but it has numbers for fragments
>> rather than read names.
>>
>> Looking at the reads from the 9-terminator/.frg file I matched up some by
>> sequence, and it looks like the FRG numbers are alternating reads from
>> each of the paired ends.
>>
>> e.g.,
>>
>> FRG 1 110000000001 - first entry from read 1
>> No FRG 2
>> FRG 3 110000000003 - 2nd entry from read 1
>> FRG 4 120000000003 - 2nd entry from read 2
>> FRG 5 110000000005 - 3rd entry from read 1
>> FRG 6 120000000005 - 3rd entry from read 2
>> FRG 100000 120000099999 - Entry 50,000 from read 2
>>
>> I'm guessing that I can figure out the read name to iid translation by
>> counting into the fastq files by FRG # / 2
>>
>> Has anyone else done this? Did I correctly interpret what the FRG numbers
>> mean? Are there any gotchas at input file boundaries?
>>
>> Thanks,
>> Arjun
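
For anyone who wants to see the idea without fetching the script: the UID-to-read-name substitution discussed in this thread can be sketched in a few lines of Python. This is not the fixReadnamesInPosmap script itself, just a minimal illustration of the fastqUIDmap format Brian describes (three columns per read, six per pair). The function names and the `convert` helper are hypothetical, and the sketch assumes read names contain no whitespace and that posmap files are whitespace-delimited text.

```python
import re


def parse_uid_map(lines):
    """Build a UID -> read name dict from fastqUIDmap lines.

    Assumes the format described in the thread:
      UID IID Name                 (unpaired reads)
      UID IID Name UID IID Name    (paired reads)
    and that read names contain no whitespace (an assumption).
    """
    names = {}
    for line in lines:
        fields = line.split()
        if len(fields) not in (3, 6):
            continue  # skip blank or unexpected lines
        # Each read occupies three consecutive columns: UID, IID, name.
        for i in range(0, len(fields), 3):
            uid, _iid, name = fields[i:i + 3]
            names[uid] = name
    return names


def rename_fields(line, names):
    """Replace every whitespace-separated field that is a known UID.

    Fields with no entry in the map are left untouched, and the
    original whitespace (tabs, newline) is preserved.
    """
    return re.sub(r"\S+", lambda m: names.get(m.group(0), m.group(0)), line)


def convert(map_path, posmap_in, posmap_out):
    """Hypothetical driver: rewrite one posmap file using one UID map."""
    with open(map_path) as fh:
        names = parse_uid_map(fh)
    with open(posmap_in) as src, open(posmap_out, "w") as dst:
        for line in src:
            dst.write(rename_fields(line, names))
```

Because unknown fields pass through unchanged, reads loaded from .frg files (whose UID is already the read name, per Brian's note) would be left as they are. Two caveats: the whole map is held in memory, and any non-read column that happens to equal a read UID would also be rewritten, so restricting the substitution to the read column of a given posmap file would be safer.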