Data corruption with stxxl::sort

tcb
2012-10-16
2013-04-25
  • tcb
    tcb
    2012-10-16

    hi,

    I have an stxxl vector backed by a syscall_file of a custom object and after sorting it with stxxl::sort some of the entries are corrupted- all the fields are set to 0.

    Before going to the trouble of making a test case, is this a problem anyone has encountered before? I have checked using boost/type_traits that my struct is a POD type. Is it likely something to do with the parameters I am using? or are there known bugs with the sorting code? I am running on osx 10.8, compiled with gcc-4.7

    My vector type is defined like this:

      static const unsigned int memory_to_use = 512 * 1024 * 1024;
      static const unsigned int block_size = static_cast<unsigned int>(sizeof(MyStruct) * 1024 * 32);
      typedef stxxl::vector<MyStruct, 1, stxxl::lru_pager<1>, block_size> vector_type;

    thanks,

     
  • I have an stxxl vector backed by a syscall_file of a custom object and after sorting it with stxxl::sort some of the entries are corrupted- all the fields are set to 0.

    Do you have elements in the vector that are equal to MyStruct.min_value() according to your comparator?
    In that case, your struct and the comparison functor do not conform to the "Comparison concept" as described in http://algo2.iti.kit.edu/stxxl/tags/1.3.1/group__stlalgo.html

    Otherwise I need a working code example (and neccessary parameters and disk configuration) to reproduce the problem.,
    Do you use the latest stxxl from SVN trunk?

    Andreas

     
  • Hi,
    Calling flush immediately before sort  solved similar issue for me.

     
  • tcb
    tcb
    2012-10-16

    Thanks for the replies.

    I am using the svn trunk version (STXXL v1.4.0 (prerelease)). I have made sure that the min_value() and max_value() do not occur in the data- this is not the problem.

    I didn't know it was necessary to call flush- I'll try that. But I also find the problem with existing files, where presumably calling flush() is not necessary.

    It seems the problem is related to the use of syscall files. I am creating a vector like this:

      stxxl::syscall_file f(filename, stxxl::file::CREAT | stxxl::file::RDWR | stxxl::file::DIRECT);
      vector_type data(&f);
      .. fill data
      stxxl::sort(data.begin(), data.end(), cmp(), memory_to_use); 

    and my .stxxl is something like this:

      disk=/mnt/usb/stxxl/files,0,syscall

    So I fill the vector and then call sort- then some fraction of the data will be zero'ed out.

    However, I have noticed that without the file backing, the sort completes just perfectly:

      vector_type data;
      .. fill data
      stxxl::sort(data.begin(), data.end(), cmp(), memory_to_use);

    So I guess I can work around the problem then- but having a persistent store is really useful for me.

    I've adapted the simple sort test from algo/test_sort.cpp, and I've tried on osx 10.8.2 with g++ 4.7.2 (MacPorts gcc47 4.7.2_2)

    #include <iostream>
    #include <random>
    #include <stxxl/io>
    #include <stxxl/sort>
    #include <stxxl/vector>

    using namespace std;

    // very simple type with two integer fields, sorting on field1 then field2
    struct my_type
    {
      typedef int64_t key_type;
     
      key_type field1;
      key_type field2;

      my_type() { }
      my_type(key_type key1, key_type key2) : field1(key1), field2(key2) { }
      friend std::ostream& operator<<(std::ostream &o, const my_type &obj) {
        return o << "";
      }

      bool operator<(const my_type& other) const {
        if(this->field1 < other.field1) {
          return true;
        }
        else if(this->field1 > other.field1) {
          return false;
        }
        if(this->field2 < other.field2) {
          return true;
        }
        else if(this->field2 > other.field2) {
          return false;
        }
        return false;
      } 
    };

    my_type my_type_min = {std::numeric_limits<my_type::key_type>::min(), std::numeric_limits<my_type::key_type>::min()};
    my_type my_type_max = {std::numeric_limits<my_type::key_type>::max(), std::numeric_limits<my_type::key_type>::max()};

    struct Cmp {
      bool operator () (const my_type & a, const my_type & b) const {
        return a < b;
      }
      static my_type min_value() {
        return my_type_min;
      }

      static my_type max_value() {
        return my_type_max;
      }
    };

    int main(int argc, char ** argv)
    {
      std::mt19937 gen;
      std::uniform_int_distribution<> dis(1, 1000);
      const unsigned int block_size = sizeof(my_type) * 4096;
      unsigned memory_to_use = 50 * 1024 * 1024;
      typedef stxxl::vector<my_type, 1, stxxl::lru_pager<8>, block_size> vector_type;

    #ifdef _DEBUG_
      stxxl::syscall_file f("syscall_file.dat", stxxl::file::DIRECT | stxxl::file::RDWR | stxxl::file::CREAT);
      vector_type v(&f);
    #else
      vector_type v;
    #endif

      // adjust this to be big enough to avoid in memory sorting
      int numberRecords = 100000000;
      std::generate_n(std::back_inserter(v), numberRecords,
      () { return my_type(dis(gen), dis(gen)); });
     
      stxxl::sort(v.begin(), v.end(), Cmp(), memory_to_use);

      STXXL_MSG((stxxl::is_sorted(v.begin(), v.end()) ? "OK" : "WRONG"));

      ofstream of("test.dat");
      std::for_each(v.begin(), v.end(),
    (const my_type& val) { of << val << endl; });

      return 0;
    }

     
  • Hi,

    thanks for the example.  Unfortunately I cannot reproduce the problem (on Linux, amd64). I thought I had reproduced it when I saw that test.dat started with a lot of  pairs … until I realized that syscall_file.dat was way too big - there is a v.clear() missing and therefore more data gets appended with each run. And you can easily create a partially written final block in that file by interrupting your program with Ctrl-C.

    Manually flushing the vector should not help as stxxl::sort() will issue a flush() itself …

    Can you try to analyze which part of the input "gets lost"? I would expect it's the last (partial) block.

    int64_t i = 0;
    v.clear();
    std::generate_n(std::back_inserter(v), numberRecords,
        [&dis, &gen, &i]() { return my_type(dis(gen), ++i); });
    
    cut -d ' ' -f 3 test.dat | sort | uniq -c | sort -n -k 1 -k 2 | tail
    

    might output the interesting part

    Does syscall_file.dat get written to the same device that contains the scratch space configured in .stxxl?

    Have you tried removing the stxxl::file::DIRECT flag?

    Have you tried an internal disk? (To rule out some bug in the USB stack.)

    Have you tried a different platform? (Linux)

    BTW, stxxl::generate() might be more efficient (I/O-wise due to overlapping), although AFAIK nobody ever tried it with lambdas.

    Andreas

    PS: I don't have access to any Mac, so I can't try "your platform".

     
  • tcb
    tcb
    2012-10-17

    Thanks for the reply.

    I removed the stxxl::file::DIRECT flag and made sure there were no stray files lying around and the sorting appears to be correct now. I changed a couple of other minor things in the example before posting it which seemed insignificant (just cleanups) and now I can't reproduce my own problem! I'll try sorting my full dataset and if I run into further problems I'll let you know.

    I was seeing the original issue on Linux too- but I had the same file setup there- so that could be the same issue.

    I've seen the stxxl::generate which looks good- must try it with a lambda expression….

    I noticed on the bug tracker an issue related to sentinels for sorting. This would allow you to remove the requirement for min_value(), max_value(). It did cause a bit of an issue for me at first, and while it was simple enough to work around it would certainly make things easier to remove this requirement.

    many thanks!

     
  • The sentinels are historically grown … we just need to find someone with enthusiasm and time to get rid of them :-)

    if stxxl::file::DIRECT really makes the difference, the O_DIRECT implementation in libc and/or kernel may be buggy … or the filesystem amy not support O_DIRECT.