Menu

Big Q Sort

Csaba Skrabák

A variation of Quicksort algorithm for in-place sorting of data that does not completely fit into RAM.

It is an external sort on a file that is organized into blocks, each of which hold equal number of records. The
blocks can be accessed randomly using their indices.

Transforming Internal Comparison Sorts Into External Ones

A step of a comparison algorithm takes two records to compare and conditionally swaps them. The corresponding
external algorithm on the other hand, will take two blocks of records to fill one continuous buffer in memory.
Instead of swapping, it runs a full internal sorting algorithm on the buffer and then blocks are written back
to the file before the corresponding block pointer is moved.

Quicksort

For an example implementation, we take the widely known Quicksort as the internal sorting algorithm to externalize. To recall the steps and clarify the terms used in this document, here is a short description of the original sorting algorithm.

  1. Start with partitioning on the whole input array.
  2. Partitioning starts with choosing one record of the array to partition, which is called pivot.
  3. Set pointer called i to the low end of the array, j to the high end of it. The empty partition below position i is called the left partition, the part over j is the right partition.
  4. Loop starts here. Raise i and lower j while i points to records preceding pivot, and those pointed by j succeed it, according to the wanted ordering. If both i and j is unable to move closer keeping this pivot condition, an inversion is found, so swap both records at i and j.
  5. Repeat the loop from step 4 until i and j become equal.
  6. Note that although the records outside the pivot are not yet necessarily ordered correctly, still the final place of the pivot is found where the pointers collide: all preceding records are in the left partition while all successors in the right partition. And the position of any record is the number of records preceding it. Put the pivot record there by swapping the position of both pointers i and j with pivot's original place. Partitioning ends here.
  7. Recursively do partitioning from step 2 over the left partition unless it is shorter than 2 records.
  8. Recursively do partitioning from step 2 over the right partition unless it is shorter than 2 records.

Externalizing Quicksort

Applying the transformation on Quicksort, the two pointers of the partitioning become block indices but pivot stays one record.
An idea specific for the Quicksort is that partitioning can be started without selecting the final pivot.
The pivot can be changed while partitioning, it just needs to stay in an interval to meet pivot conditions (i.e. all keys in the left partition <= pivot <= all keys in the right partition.)

This Implementation

Partitioning starts with reading the lowest and the highest block into the buffer.

Pivot selection saves block reads by selecting a key that is anyway in the memory. Although this may not be an
optimal pivot, the implementation constantly optimizes pivot while partitioning. It is non-trivial to construct
an input file for the worst case when the algorithm would always choose extreme pivots and run for N square time.


MongoDB Logo MongoDB