I am maintaining a UNIX-based program that sorts large files using the native UNIX sort utility (i.e. /usr/bin/sort). It actually writes unsorted data records to an instance of the sort utility that has been opened via a pipe (and the sorted output is written to another file). At first, this seemed quite strange to me but from what I now understand, /usr/bin/sort will sort on disk if needed and that is attractive given the file sizes involved.
The /usr/bin/sort solution works in its current configuration but I am finding it limited now mainly because I don't have very flexible control over the key fields that get used in the sort (there is also a Windows port coming up and so portability of the OS-specific sort utility is a concern). I'm interested in perhaps replacing the current sort method with the stxxl library as this would be portable and allow me to easily adjust the sort criteria using simple C++ features, e.g. operator< or predicate.
I have attempted to compile the stxxl library using g++ on Solaris but the build failed... I suspect I will need to have the g++ version updated (currently 3.3.2). Before I go to the trouble of requesting that update from our UNIX administrators, I'm wondering if you have any performance information about how the UNIX /usr/bin/sort utility compares to stxxl. After all, when /usr/bin/sort sorts on disk, it must use a similar algorithm to stxxl. Have you ever compared your library to /usr/bin/sort? Can I expect a performance gain or drop?
The is a difference between UNIX sort and stxxl sort: stxxl sort can handle only records with a fixed size (e.g. PODs, no pointers, etc.). If this is not a problem for you, then I would recommend you to use Stxxl as it should be more efficient in this case (no input text parsing needed, better algorithms used).
I recommend you to use a pipelined version of stxxl sorting, it is similar to the pipe interface of UNIX sort:
Some stxxl users have reported that stxxl sort is much faster than UNIX sort (tested on data sets of about 200 GBytes).
Log in to post a comment.