I recently flew over your work related to LCP. Unfortunately I did not understand the details on the fly properly. Apparently it is about optimized sorting strings with external memory. Can I conclude that a future version of STXXL will support sorting of variable-length strings and in general non-pod types? Or is this feature out of the scope of STXXL?
Well, it's not directly about sorting string. But one of the subroutines implemented in eSAIS does do sorting of short, variable length object in EM. The implementation in eSAIS is very ad-hoc and not going to be integrated in STXXL.
However, the general idea of sorting and processing variable length objects is not out of scope of STXXL: the stream sorting methods will work well for sorting short serialization of objects (e.g. strings). Long objects are a more difficult case, but probably also not an issue in practical applications. What does not fit well is the stxxl::sort interface, or more generally the stxxl::vector.
Currently I dont know about anyone working on serialization of variable length objects. If you want to do some work, we can talk more.
I see no way how stxxl::vector<> could be used in order to perform an efficient sort of serialized objects without additional paging. In my view those objects can only reside in vectors through sone kind of smart pointers. The actual sorting (merging) should be done by a specialized stream-like implementation similar to the already presented run_creator <> and runs_merger<> for pod's. Additional support for vectors could be implemented by a transition from a stream based representation of serialized objects to a representation with some kind of smart pointers, but at the momement I think that this is another story.
Yes, that's how I see things, as well.
The stream sorting could be adapted to objects easily using a serialization mechanism. How to actually design this mechanism is open.
The eSAIS design for this was ad-hoc and specialized, but also shows some of the issues at hand.
In a project some time ago I've already experimented with a customized STXXL 1.3.1 to support serialized objects. For this purpose I've added a special version of typed_block<>, which had support for allocating arbitrary sized objects (within the specialized typed_block<>, no overflow permitted). It turned out, that the existing implementation for fixed sized pods could be reused to some extent for serialized objects as well. The main difference is, that the existing implementations assume a fix capacity of typed_block<> elements computed during compiletime, whereas the implementation for serialized objects computes the (block-) capacities during runtime. As I can recall, with a slight redesign of the run_merger<> and losertree<> using iterators, those implementation could be reused for both pods and serialized objects.
Your ideas sound interesting.
I'm currently postponing everything until 1.4.0 is released.
After that I'm going to have a look at the "parallel pipeline" sorting branch, and try to merge it. That branch with "async" nodes will lead to merge errors with your code.
We can talk about your ideas then, okay?
Okay. Do you have a timescale?
Andreas should finish the git repo any day. 1.4.0 is done as is.
Then I'll try to merge some of the branches, probably in December.
Log in to post a comment.