From: Alec W. <al...@br...> - 2010-12-09 19:43:01
hi ilya,

regarding both this proposal and IntegerTagFilter, is there a reason why the code needs to go into picard rather than being in your application? i ask because in general we are reluctant to add to the code base unless either what needs to be done can't be done in client code, or the new functionality would be broadly useful and difficult for individual developers to implement. we are reluctant because it adds to our support burden. i'm not saying no, just asking if you could explain a little more why it ought to go in the library.

thanks, alec

On 12/9/10 12:14 PM, Goldin, Ilya wrote:
> Alec,
>
> Yes, that is what I'm trying to do. I'd like to implement it and submit it for inclusion in Picard. Would you consider such a patch?
>
> It seems that one would want to do something like the following:
>
> Define an object that defines a group of SAMRecords, call it SAMQuery.
>
> Define QueryIterator implementing CloseableIterator<SAMQuery>.
>
> Then QueryIterator is somewhat modeled on SamRecordIntervalIterator. There's a private SAMFileReader and a private PeekableIterator<SAMRecord>. QueryIterator accumulates SAMRecords from the SAMFileReader in a private SAMQuery, and keeps peeking at the PeekableIterator to see when the read name changes.
>
> Best,
> Ilya
>
> ________________________________________
> From: Alec Wysoker [al...@br...]
> Sent: Tuesday, December 07, 2010 3:51 PM
> To: Goldin, Ilya
> Cc: sam...@li...
> Subject: Re: [Samtools-devel] iterator over batches of samrecords
>
> Hi Ilya,
>
> I want to make sure I'm understanding your question: the idea is that
> you will have multiple SAMRecords for the same query, i.e. with the same
> value for SAMRecord.getReadName, and the same pair flag value (i.e.
> either multiple SAMRecords with same read name that are unpaired, or
> multiple SAMRecords that are the same end of a pair), right? Note that
> we haven't done a lot of work in Picard to deal with multiple SAMRecords
> for the same query. It should work, however, but there isn't anything
> to do what you ask. It doesn't sound too hard, however, to accumulate
> SAMRecords so long as the read name is the same, and when a different
> read name arrives, do whatever processing is needed for the SAMRecords
> that have been accumulated.
>
> -Alec
>
> On 12/7/10 2:53 PM, Goldin, Ilya wrote:
>
>> In Picard, I want to implement an iterator over batches of SAMRecords rather than processing just one SAMRecord at a time. Is there something similar in Picard already? If not, is there a recommended way of doing it?
>>
>> In particular, I want to filter out all queries that align to too many locations on the reference. I was thinking that I would
>>
>> 1. sort SAM file by query name (e.g., with samtools)
>> 2. iterate over queries (each of which could be a batch of SAMRecords)
>> 3. if the batch size was greater than some threshold, skip this query
>> 4. write out a SAM file of the remaining queries
>>
>> Advice welcome.
>>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel
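
For reference, the accumulate-until-the-read-name-changes approach discussed in this thread could be sketched in client code roughly as follows. This is only an illustration under stated assumptions: BatchingIterator and filterByBatchSize are hypothetical names, not Picard classes, and the sketch is generic over the record type so that it does not depend on any Picard API. It assumes the input is already sorted so that records with the same key are adjacent.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Objects;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Picard): turns an iterator over records
 * into an iterator over batches of consecutive records that share a key.
 * Assumes the source is sorted so that equal keys are adjacent.
 */
class BatchingIterator<T, K> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final Function<T, K> keyOf;
    private T lookahead;           // next record not yet placed in a batch
    private boolean hasLookahead;  // false once the source is exhausted

    BatchingIterator(Iterator<T> source, Function<T, K> keyOf) {
        this.source = source;
        this.keyOf = keyOf;
        advance();
    }

    // Pull one record from the source into the one-element lookahead buffer.
    private void advance() {
        hasLookahead = source.hasNext();
        lookahead = hasLookahead ? source.next() : null;
    }

    @Override
    public boolean hasNext() {
        return hasLookahead;
    }

    @Override
    public List<T> next() {
        if (!hasLookahead) {
            throw new NoSuchElementException();
        }
        K key = keyOf.apply(lookahead);
        List<T> batch = new ArrayList<>();
        // Accumulate records until the key changes or the input ends.
        while (hasLookahead && Objects.equals(key, keyOf.apply(lookahead))) {
            batch.add(lookahead);
            advance();
        }
        return batch;
    }

    /** Keeps only the records whose batch has at most maxHits members. */
    static <T, K> List<T> filterByBatchSize(Iterator<T> sorted,
                                            Function<T, K> keyOf,
                                            int maxHits) {
        List<T> kept = new ArrayList<>();
        BatchingIterator<T, K> batches = new BatchingIterator<>(sorted, keyOf);
        while (batches.hasNext()) {
            List<T> batch = batches.next();
            if (batch.size() <= maxHits) {
                kept.addAll(batch);
            }
        }
        return kept;
    }
}
```

With Picard, the record type would presumably be SAMRecord and the key would be built from SAMRecord.getReadName, combined with which end of the pair the record represents, per Alec's note above. Reading the input and writing the surviving records back out as a SAM file are left out of the sketch.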