Currently percentile (including median) operators completely sort the
data. This is not necessary for these operations. Instead, the data
need only be partitioned:
Partitioning: reorder the array so that all elements with values less
than the pivot come before the pivot, while all elements with values
greater than the pivot come after it (equal values can go either
way).
(From the wikipedia quicksort article.)
Partitioning (the basis for quicksort, which performs it recursively) is much faster, and can be done incrementally. This can speed up generating percentiles, generating partitions optimized for the percentiles requested (e.g. using the same approach to dividing the data as used by the quicksort or binary search algorithms).
I should note that this is a bit more complicated than simply using partitions, as the pivot value needs to be passed to the partition routine. However, it should be possible to come up with an iterative approach with a worst case performance equivalent to quicksort's.
I need improve my web search skills. Quickselect is what I was thinking of.
Ticket moved from /p/pdl/bugs/428/
Patches and code contributions are always welcome. Drop-in but more performant functions are particularly easy to verify correctness of...