From: Alec W. <al...@br...> - 2009-09-25 17:18:15
|
When Picard sorts a file, it uses a stable sort as follows: 1. order by sequence index, with reads without coordinate coming last; 2. order by alignment start; 3. order by strand, with positive strand sorting earlier than negative strand; 4. order by read name. Since we have lots of files that are sorted this way, requiring some other order would be problematic for us. Note that our algorithms that expect sorted inputs do not expect the inputs to be stably sorted. The only require sorting according to tests 1 & 2 above. -Alec John Marshall wrote: > Probably I am being naive because it is a long time since I worried > about these things too formally, but anyway... > > On 25 Sep 2009, at 10:24, Heng Li wrote: > >> Here is a question for other developers in this mailing list. >> Suppose we >> have m sorted lists of integers with each list having n elements. The >> most straightforward algorithm to merge the m lists is to use a heap >> data structure. The time complexity is O(n*m*log(m)). However, heap >> does >> not maintain stability. Is there a merge algorithm to achieve the same >> time complexity while stable? I know it is easy to implement a stable >> merge in O(n*m^2). >> > > Since we don't have a single input list, it's not entirely obvious > what we might mean by "stability" here. And in fact this model is a > slight oversimplification of our situation. > > The m sorted lists have been produced by chopping the huge original > input file into contiguous blocks (and then stably sorting each one). > So to preserve stability in terms of the original file, we need the > merge step, when looking at several equivalent records at the heads of > some subset of the m sorted lists, to pick the one that has come from > the earlier block in the original file, i.e., the one from the > leftmost list in that subset. (So with the lists' provenance added to > the model, we now have a meaning for "stability".) > > Thus by adding the list number (1..m) to the heap data structure's > sort criterion, the heap merge can be stabilised while maintaining its > existing time complexity. > > Alternatively: > > Questions of sort stability only arise when the sort criterion says > some non-identical records are equivalent; or, more judgementally, > when the sort criterion fails to distinguish between records that we > might like it to distinguish. We can define the stability problem out > of existence and continue to use an unstable sort by tightening up the > sort criterion. > > At the moment, bam_sort.c's heap_lt() compares records by chromosome/ > position/strand. Would it not suffice to simply add the read name to > the mix? If we sort by chromosome/position/readname/strand or > something similar, unmapped reads will appear alongside their mates, > and as the comparison will almost always short-circuit before the > strcmp() there's no real speed lost. > > John > > > |