SourceForge has been redesigned. Learn more.
Close

Performance -- stxxl vs /usr/bin/sort

2008-10-09
2013-04-25
  • Nobody/Anonymous

    Hi,

    I am maintaining a UNIX-based program that sorts large files using the native UNIX sort utility (i.e. /usr/bin/sort).  It actually writes unsorted data records to an instance of the sort utility that has been opened via a pipe (and the sorted output is written to another file).  At first, this seemed quite strange to me but from what I now understand, /usr/bin/sort will sort on disk if needed and that is attractive given the file sizes involved.

    The /usr/bin/sort solution works in its current configuration but I am finding it limited now mainly because I don't have very flexible control over the key fields that get used in the sort (there is also a Windows port coming up and so portability of the OS-specific sort utility is a concern).  I'm interested in perhaps replacing the current sort method with the stxxl library as this would be portable and allow me to easily adjust the sort criteria using simple C++ features, e.g. operator< or predicate.

    I have attempted to compile the stxxl library using g++ on Solaris but the build failed... I suspect I will need to have the g++ version updated (currently 3.3.2).  Before I go to the trouble of requesting that update from our UNIX administrators, I'm wondering if you have any performance information about how the UNIX /usr/bin/sort utility compares to stxxl.  After all, when /usr/bin/sort sorts on disk, it must use a similar algorithm to stxxl.  Have you ever compared your library to /usr/bin/sort?  Can I expect a performance gain or drop?

    Regards,
    Jason

     
    • Nobody/Anonymous

      Hi Jason,

      The is a difference between UNIX sort and stxxl sort: stxxl sort can handle only records with a fixed size (e.g. PODs, no pointers, etc.). If this is not a problem for you, then I would recommend you to use Stxxl as it should be more efficient in this case (no input text parsing needed, better algorithms used).

      I recommend you to use a pipelined version of stxxl sorting, it is similar to the pipe interface of UNIX sort:
      http://algo2.iti.uni-karlsruhe.de/dementiev/stxxl/tags/1.2.1/stream_2test__push__sort_8cpp-example.html

      Some stxxl users have reported that stxxl sort is much faster than UNIX sort (tested on data sets of about 200 GBytes).

      Best regards,
      Roman

       

Log in to post a comment.