Vector sorting improvement.

Brought to you by: cristeab, ediap, tonyottosson

Vector sorting improvement.

Forum: Open Discussion

Creator: Mark Dobossy

Created: 2007-06-19

Updated: 2012-09-15

Mark Dobossy - 2007-06-19

I have recently been working on a project that is using large vectors that must be sorted (along with a sorted index). These vectors are 200,000+ elements, and the sorting is painfully slow on my system (Mac Pro 3ghz, using itpp compiled with the intel compiler + MKL libs). Sorting a 200,000 element vector takes upwards of a minute.

I suspect this is because of the amount of recursion required when using the Quicksort algorithm. For very large vectors, it occurred to me that a heap sort algorithm might be more efficient. I wrote up a quick and sloppy function that takes as input a vec, turns it into a double[], sorts/indexes the double[] using a heap sort code I had lying around (thus the conversion to a double[]), turns the double[] back into the vec, and returns the index (which is an ivec). With this highly inefficient code, I am able to sort/index the same 200,000 element vector in around 0.02s.

So, my questions are: 1) is the it++ sort algorithm really that inefficient on large vectors, or am I doing something wrong? 2) if it turns out it++'s sort is inefficient, I am going to code up a better (and it++ built in) function to do this. However, I am not running svn. Can I submit a patch on the 3.10.11 code?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mark Dobossy - 2007-07-05
  
  Ok.. Dumb question time.
  
  What make files do I need to modify to include my new sort.cpp file (which contains the sort class implementation)? I have modified itpp/base/sources.mk to include $(top_srcdir)/itpp/base/sort.cpp . I did a make distclean, re-ran automake.sh, configure, make , etc..
  
  When I try to compile my test program, I now get a bunch of undefined symbols (the symbols are the functions/class members located in sort.cpp). Is there another make related file I need to edit to make sure my new symbols are included when building/linking the library?
  
  -Mark
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mark Dobossy - 2007-07-05
    
    nevermind.. got it working.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mark Dobossy - 2007-06-19
  
  Just one more note-
  
  I just went back and looked at the vector I was sorting, and it is somewhat sparse (~half of the values are zero), which makes it very bad for sorting via a QS algorithm. If I simply create a 200,000 element vector, the sort time is fast (0.03 seconds) using the default sorting algorithm. If however, I create a 50% sparse vector (can be done via: vecToSort=URNG(200000); vecToSort=elem_mult(floor(vecToSort+0.5),vecToSort); where URNG is a uniform random number generator object), the sort time goes through the roof. One way around this is to use something like the Introsort algorithm which starts with a quicksort, but if the recursion gets too deep, switches to a heap sort.
  
  So, my question #1 was answered- no I wasn't doing something wrong, but my particular vector was not suited well to a quicksort.
  
  And just for good measure, I ran some code which increases the sparseness of the vector, from 0% zeros to 80% zeros, and sorts using the default QS as well as my sloppy heap sort. Here are the results (time in seconds):
  
  % QS Heap
  0 0.0564 0.0351
  10 1.8074 0.0344
  20 6.9793 0.0308
  30 15.789 0.0280
  40 27.765 0.0257
  50 43.532 0.0223
  60 62.582 0.0195
  70 85.573 0.0166
  80 112.046 0.0135
  
  As you can see, the heap sort can work much faster. For smaller vectors, that are not sparse, the QS becomes the winner, but whenever there is an amount of sparseness, the heap sort is more efficient.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Adam Piątyszek - 2007-06-19
    
    Hi Mark,
    
    Thanks for your interesting proposal and tests performed! For sure we are open for new features and improvements of the IT++ classes and function. As for your question about patches and IT++ versions, 3.10.x releases are not intended to include any new functionality. Only bug fixes are committed to this branch. So definitely, you should consider switching to at least development releases (3.99.x) or better to SVN trunk sources, and then prepare the patch. BTW, we are planning to release 4.0.0-rc? (release candidates) within the next 2-3 months which will replace the current stable branch (3.10.x).
    
    As for your tests, could you please say something more about the performance of QS and heap-sort for shorter vectors? Besides, to make the comparison more fair, could you reimplement your heap-sort using a templated interface (using Vec<T>)?
    
    For the development version, we might consider redesigning the sort() and sort_index() interfaces, so it would be possible to employ different sorting algorithm, e.g.:
    
    void sort(Vec<T> &data, const std::string &method = "QS");
    int sort_index(Vec<T> &data, const std::string &method = "QS");
    
    or even better using an enum:
    
    enum SORTING_METHOD { QS = 0, QuickSort = 0, HS = 1, HeapSort = 1 }
    void sort(Vec<T> &data, SORTING_METHOD method = QS);
    int sort_index(Vec<T> &data, SORTING_METHOD method = QS);
    
    What do you think?
    
    BR,
    /ediap
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Mark Dobossy - 2007-06-20
      
      Adam-
      
      I was able to get the head of SVN all set up and running, and created a templated heap sort function (I have not yet created a heap sort index function). You were right about using the templated interface to make the comparison fair- the time to perform the sort doubled with the templated interface.
      
      With the new templated function, I did some testing, and found some very interesting results.
      
      1) Given a vector of any size, where all elements are different (i.e.- a vector created with the uniform RNG), QS is almost always faster. From 2-20 elements the heap sort is faster, but then the QS is quicker. I went out to 2 million elements, and the QS time was about half of the HS.
      
      2) That being said, if there is any kind of duplication in your vector (i.e.- a vector created with the uniform integer RNG), the HS quickly becomes the more efficient algorithm. For smaller vectors (less than 500 elements) if ~20% or more elements are duplicates (i.e.- if you had a 100 element vector, with random integers from 1 to 5) the HS is more efficient. When you go to a 5000 element vector, this drops to 7%, and at a 100000 element vector, this drops to less than 2%. In my case, I had a somewhat sparse vector (almost 50% of the elements were zero, while the rest were uniformly distributed). Because nearly 50% of the elements had duplicates, the heap sort was an enormous improvement.
      
      3) A word of warning (looking over my old old old CS notes)- the heap sort is not a stable sort. By stable, I mean, it does not preserve the relative order of indexes with the same value. In many cases this does not matter, but it can cause problems. There is another sort called the merge sort, which should have the same advantages as the heap sort, but is stable (however, it has a larger memory requirement- you cant get anything for free ;) ).
      
      I have some old code from that CS course which implements a merge sort in C (dont worry, I got an "A" on that lab ;) ). I'm going to take a look at that code, and make a merge sort for it++ as well.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Adam Piątyszek - 2007-06-20
        
        It's great news that you managed to switch to SVN version smoothly ;)
        
        A bit off-topic now:
        If you are familiar with Git (the content tracking system devised by Linus), you might import the IT++ SVN trunk sources (or even the whole repository) into a local Git repository (by e.g. using the following command: "git-svn clone https://itpp.svn.sourceforge.net/svnroot/itpp/itpp/trunk itpp-trunk"). Then you could work locally committing your changes, creating branches, preparing patches, sending them by email, etc. You could also send the patches directly to me, since I have been using Git for maintaining IT++ remote SVN repository for some time as well.
        BTW, I am quite impressed what this tool can do, as compared to SVN ;)
        
        Besides, if you are interested in becoming an IT++ developer, we are open for your contributions, and after some training period might give you an RW access to the repository.
        
        /ediap
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Mark Dobossy - 2007-06-20
        
        Thanks for the tip on git- I'll see if I can get it up and running on my development box (it is OS X, so it might not be totally straightforward.)
        
        For now, if I have a patch to the sort.h file, could I just either email it to you, or post it to the tracker?
        
        -Mark
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Adam Piątyszek - 2007-06-20
        
        It would be better to attach your patch to the FR tracker, since it will be recorded within the project as your contribution :)
        
        As for git, there shouldn't be any problems with using it under Mac OS X.
        
        /ediap
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Mark Dobossy - 2007-06-19
      
      Adam-
      
      My comment about performance with shorter vectors was just a generalization of what I know about QS vs. Heap. I can definitely run some tests, and let you know how the performance changes based on vector size.
      
      As for using the templated interface- definitely. That is actually what I was planning on doing next, though it may take me a day or two to get to it.
      
      As for specifying a sorting method, I think it is a great idea, as certain problems lend themselves well to different types of sorting. In fact, it may be beneficial to create a sorting class, that has many different types of sorting algorithms (e.g.: heap, quick, merge, bubble, insertion), as each has its own advantages and disadvantages. Furthermore, if all of the basic sorting routines are implemented, hybrid sorts, which tend to be the most efficient, could be used. For example, the introsort, which begins as a quicksort, but if the problem becomes too recursive, changes to a heap sort, and finally to an insertion sort.
      
      Taking sorting a step further (and since I need it for the program I am currently working on ;) ), I was thinking of adding matrix sort features as well, where you would pass a matrix, and specify a column to sort on.
      
      I'll start with templating my current heap sort, and run a few tests on it, to show speed on various size vectors.
      
      -Mark
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Adam Piątyszek - 2007-06-20
        
        Hi,
        
        I agree that it is a great idea to create a new sorting class, which could be used for both Vec<> and Mat<> (and maybe Array<> when possible). If you will find some time to experiment with such a design, we would be glad to see the patches. But of course there is no hurry. You can start with something basic, e.g. a templated Sort class with two algorithms (QS and Heap) for vectors first.
        
        As soon as you have something to test, please open a new Feature Request ticket and attach your patches there.
        
        As for the performance results, please keep the notes. It will be a valuable knowledge to include in the class documentation.
        
        /ediap
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mark Dobossy - 2007-06-30
  
  Adam-
  
  Just wanted to give you a quick update on the sorting endeavor. I currently have an in-place heap sort implemented for both the normal sort, and the sort that just returns a sorted index vector. This in-place heap sort routine is almost as efficient as the quick sort in almost all cases (it is 5-10% slower than the implemented QS in the quickest QS cases).
  
  To further optimize things, I am almost finished with an introsort routine. An introsort routine will do the quick-sort to a recursion depth of 16, then switch to the heap sort. This gives the best of both worlds- problems that are suited well to QS stay in QS, while problems that are better suited to heap sorting, quickly change to it. The introsort would be the ideal "default" sorting algorithm, and is what the C++ STL uses for its "sort" routine. A true introsort will then switch to an insertion sort once the sub-arrays get small enough, but this could be added later. I should have introsort (sans insertion sort for small sub-arrays) working this afternoon.
  
  Both QS and Heap Sort are "non-stable" sorting algorithms. For some problems this may cause an issue (non-stable doesn't mean you crash, or cant sort, it just means that the relative order of indices for elements with the same value, may not retain their order), thus I am implementing a "stable_sort" function, which uses the merge sort algorithm. So, if it is necessary for the user to implement a stable sort, they can use the stable_sort routine rather than sort.
  
  Thus, for vector sorting, the user will have several options:
  1) sort:
  1a) using introsort (default)
  1b) using QS
  1c) using in place heap sort
  
  2) stable_sort, which uses merge sort
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Adam Piątyszek - 2007-07-01
    
    Thanks for great news!
    
    As for the default sorting algorithm, I am pretty sure that we can switch to the introsort one. We just need to discuss this with a few active developers.
    
    Besides, I wonder if it is the best solution to introduce a separate stable_sort method for the merge sort algorithm... In my opinion it would be enough to put it under the sort() interface as the fourth alternative, and just mention the important difference (stability) in the documentation.
    
    BTW, have you used class inheritance to implement a base class Sort and its descendent classes for each algorithm? Or you just implemented the sorting algorithms as a set of methods in one class Sort?
    I wonder which approach is better for this case.
    
    BR,
    /ediap
    
    PS. When your contribution is ready, I am going to add your full name to the list of contributors (AUTHORS, doc/local/authors.doc and Authors www page). Please let me know wheather you accept this fact or not. :)
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mark Dobossy - 2007-07-02
  
  Adam-
  
  Warning: long message ahead. :)
  
  Summary: I have found that 1) Introsort is probably the best general sorting routine to use, as it will sort vectors well suited to QS, as fast as the QS routine (sometimes faster), and vectors not well suited to QS as fast as the heap sort routine. 2) If we use the _data pointer, sorting is ~2x faster.
  
  The long version:
  
  I tried putting together an introsort (along with a heap and quicksort) that use the pointer to the vector's data ( _data() ). Amazingly this cut the QS sort time in half, and the introsort to 1/4 of its original time.
  
  Would you be opposed to having sort functions that use the pointer? We could make the functions private, and wrap them with a public function which would check the initial high and low inputs, and now allow them to be outside the range of the vector. It would take some work, but I think we could make the functions safe, and greatly impact the sort efficiency.
  
  Attached is the output from a code I wrote which would create a random data set of a given size, apply the Introsort(_data), Introsort(vec), inplaceHeap(_data), inplaceHeap(vec), QS(_data) and QS(vec) to these data sets (the data is randomized between each sort). It cycles through vectors of size 10-500000, and sorts 300 times per vector, and outputs the average time for each sort. As you can see, since this is a true random sample, QS is usually fastest, followed by introsort, and finally the heapsort (for small vectors, introsort is fastest, because it uses an insertion sort for them). But notice that the _data sort is always about double the speed of the vec sort. I think the results show that it would be beneficial to use the _data pointer for sorting.
  
  Vector Size: 10
  
  Introsort (_data): 2.89281e-07
  Introsort (Vec): 4.69685e-07
  inplaceHeap (_data): 3.00407e-07
  inplaceHeap (Vec): 6.41346e-07
  QS (_data): 3.88622e-07
  QS (Vec): 5.17368e-07
  
  Vector Size: 50
  
  Introsort (_data): 1.70549e-06
  Introsort (Vec): 3.75748e-06
  inplaceHeap (_data): 2.22921e-06
  inplaceHeap (Vec): 4.64598e-06
  QS (_data): 2.10206e-06
  QS (Vec): 4.45286e-06
  
  Vector Size: 100
  
  Introsort (_data): 4.06027e-06
  Introsort (Vec): 8.78334e-06
  inplaceHeap (_data): 5.44548e-06
  inplaceHeap (Vec): 1.09895e-05
  QS (_data): 4.5975e-06
  QS (Vec): 8.53856e-06
  
  Vector Size: 500
  
  Introsort (_data): 3.0237e-05
  Introsort (Vec): 6.50072e-05
  inplaceHeap (_data): 3.60473e-05
  inplaceHeap (Vec): 7.0587e-05
  QS (_data): 2.96219e-05
  QS (Vec): 5.90078e-05
  
  Vector Size: 1000
  
  Introsort (_data): 6.4315e-05
  Introsort (Vec): 0.000142167
  inplaceHeap (_data): 8.01659e-05
  inplaceHeap (Vec): 0.000158134
  QS (_data): 6.50994e-05
  QS (Vec): 0.000132353
  
  Vector Size: 5000
  
  Introsort (_data): 0.000393809
  Introsort (Vec): 0.000876443
  inplaceHeap (_data): 0.000492495
  inplaceHeap (Vec): 0.000963068
  QS (_data): 0.0003915
  QS (Vec): 0.000810726
  
  Vector Size: 10000
  
  Introsort (_data): 0.00085743
  Introsort (Vec): 0.00190342
  inplaceHeap (_data): 0.00108499
  inplaceHeap (Vec): 0.00209439
  QS (_data): 0.000842934
  QS (Vec): 0.00175925
  
  Vector Size: 50000
  
  Introsort (_data): 0.00499921
  Introsort (Vec): 0.0111738
  inplaceHeap (_data): 0.00665189
  inplaceHeap (Vec): 0.0123509
  QS (_data): 0.00487621
  QS (Vec): 0.0103599
  
  Vector Size: 100000
  
  Introsort (_data): 0.010662
  Introsort (Vec): 0.0238135
  inplaceHeap (_data): 0.0143524
  inplaceHeap (Vec): 0.0263256
  QS (_data): 0.0104147
  QS (Vec): 0.0222037
  
  Vector Size: 500000
  
  Introsort (_data): 0.0616069
  Introsort (Vec): 0.136997
  inplaceHeap (_data): 0.0951638
  inplaceHeap (Vec): 0.162112
  QS (_data): 0.0590362
  QS (Vec): 0.127107
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Adam Piątyszek - 2007-07-02
    
    Hi Mark,
    
    What a great job you have been doing with these sorting algorithms! Thanks! I wish you will be able to work on IT++ internals in future! We are open for recruiting new developers who are willing to help!
    
    There is no problem with using _data() pointer in performance critical functions. I think that is why it was introduced before. It is just not recommended for end users, which are not always experienced enough to perform all the error checking by hand. So you can stick to the implementation with _data pointers.
    
    And please make the introsort the default sorting algorithm in your patch.
    
    BR,
    /ediap
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mark Dobossy - 2007-07-03
  
  Adam-
  
  I have now uploaded a new version of the sort.h file to the tracker. I submitted the entire file rather than a patch, as it has been almost completely revamped. If you would rather I submit this as a patch, please let me know how you would like me to make the patch.
  
  Sorting is going much quicker now (anywhere from 2-10x faster). I have set the Introsort as the default sorting algorithm, but you can easily use the enum to change the method used.
  
  This patch contains a Quick Sort, Heap Sort, Insertion Sort and Intro Sort algorithm for both a normal sort and an index sort. Later this week, I plan to add a merge sort (to offer a "Fast" stable sorting routine).
  
  For each sorting method, I have two functions, a "public" (documented) function, which takes as inputs a min, max, max_depth in the case of the introsort, and the data vector. This function ensures that the min and max are within the bounds of the vector, then calls the "private" function with the pointer to the vector's data. I did this in separate functions to prevent the checks being performed on each call of the sort during a recursion. For example, when a program calls QS(int low, int high, Vec<T> &data), the QS function checks low and high to be in the bounds of data, then passes low, high and data._data() to _QS(int low, int high, T data[]). Upon recursion within _QS, it calls itself. Since we know any recursed call will be within the bounds, we dont have to repeat the checks which are performed in the "public" QS function.
  
  I have two requests when you have a moment- 1) look through my doxygen documentation. I tried to do it correct, but may not have. 2) Hammer a bit on the sorting functions. I did a great deal of testing myself, and could not break anything, but to make sure it would be great if you could try to find any problems.
  
  Please let me know if you have any questions/comments/suggestions!
  
  -Mark
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Adam Piątyszek - 2007-07-04
    
    > I have two requests when you have a moment-
    > 1) look through my doxygen documentation. I tried to do it correct, but may not have.
    
    The prefixed functions are not documented, but Doxygen parses them and they are listed in the sort.h file documentation. This could be nicely fixed by wrapping the functions in a private section of a Sort class. Then document the Sort class and each public method of it as a new module "Sorting methods" available in the list of modules.
    
    > 2) Hammer a bit on the sorting functions. I did a great deal of testing myself, and could not
    > break anything, but to make sure it would be great if you could try to find any problems.
    
    Unfortunately my time is a bit constraint. So, I will count here on your tests and possible test program. Besides, once the code will be included in public release, we might expect bug reports for any serious problem with sorting functions ;-)
    
    /ediap
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Mark Dobossy - 2007-07-04
      
      Good news- I got git all up and running on my OS X machine (I had a problem with mac ports- didnt realize I had to specify a +svn on the install to get git-svn, but all is good now). I'll apply your patch against my local version, and work on creating a sort class as you mentioned in your post on the tracker.
      
      > Unfortunately my time is a bit constraint. So, I will count
      > here on your tests and possible test program. Besides, once
      > the code will be included in public release, we might expect
      > bug reports for any serious problem with sorting functions ;-)
      
      No problem at all. I'll definitely clean up/expand the program I was using for testing, and take a look at the current it++ test programs, and submit a testing program.
      
      -Mark
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Log in to post a comment.