Menu

Status of on-disk hash tables

2011-02-09
2013-04-25
<< < 1 2 (Page 2 of 2)
  • Daniel Godás

    Daniel Godás - 2011-02-23

    Here is the patch.

    Index: containers/test_external_shared_ptr.cpp
    ===================================================================
    --- containers/test_external_shared_ptr.cpp (revision 2979)
    +++ containers/test_external_shared_ptr.cpp (working copy)
    @@ -89,13 +89,6 @@
             vector_type::const_iterator c_it = v.begin();
             STXXL_UNUSED(c_it);
    
    -        unsigned int big_size = 1024 * 1024 * 2 * 16 * 16;
    -        typedef stxxl::vector<double> vec_big;
    -        vec_big my_vec(big_size);
    -
    -        vec_big::iterator big_it = my_vec.begin();
    -        big_it += 6;
    -
             test_const_iterator(v);
    
             stxxl::random_number32 rnd;
    @@ -125,6 +118,7 @@
                 aep->key = rnd();
                 element e(aep);
    
    +            v[i].unwrap();
                 v[i] = e;
    
                 assert(v[i].get()->key == aep->key);
    @@ -146,6 +140,9 @@
             // check again
             STXXL_MSG("clear");
    
    +        for (vector_type::iterator it = v.begin(); it != v.end(); ++it)
    +       it->unwrap();
    +
             v.clear();
    
             stxxl::ran32State = 0xdeadbeef + 10;
    @@ -179,6 +176,12 @@
             vector_type v_copy1;
             v_copy1 = v;
             assert(v == v_copy1);
    +
    +        while (v.size() != 0) {
    +       element e = v.back();
    +       v.pop_back();
    +            e.unwrap();
    +        }
         }
         catch (const std::exception & ex)
         {
    
     
  • Johannes Singler

    Applied.

    Johannes

     
  • Andreas Beckmann

    While the patch removes the memory leak, it clearly demonstrates that the concept is flawed if you need a "cleanup loop".

    And the most surprising thing is that you need to manually unwrap() before overwriting:

                v[i].unwrap();
                v[i] = e;
    

    This doesn't look at all like the idea that stands behind a shared_ptr, i.e. you don't have to worry about these things.

    The only real solution I would see is to implement a variant (would specialization work?) of the stxxl containers that honours constructors/destructors/assignment operators of the elements. Penalty would be linear number of I/Os (N/B) for creating/resizing/clearing/destroying such a container. And it would probably break all the other algorithms that access the underlying blocks directly, like sorting, streamify, … so the blocks used must be aware of this, too.
    I'm not sure this road would lead to something practical.

    Andreas

     
  • Daniel Godás

    Daniel Godás - 2011-02-23

    > While the patch removes the memory leak, it clearly demonstrates that the concept is flawed if you need a "cleanup loop".

    I agree this looks ugly but I did not want to touch the containers for the first version of the wrapper. However I don't think the concept is flawed. When we are happy with the basics we can integrate the wrapper with the containers so that insert-like functions on shared pointers automatically wrap them and get-like functions automatically unwrap them. With this approach we can also handle overwriting without the explicit unwrap. The drawback is we will need to overload some functions for each type of shared pointer we want to use but being realistic I don't think most people would use something else than boost::shared_ptr and std::shared_ptr.

    To sum up, we can move all the unwraps into the containers so that users never need to call them explicitly.

    > The only real solution I would see is to implement a variant (would specialization work?) of the stxxl containers that honours constructors/destructors/assignment operators of the elements.

    After the shared pointer has been wrapped the wrapper works exactly as any object without pointers to internal memory. We only need to take care of releasing it when it is removed from the container.

    > Penalty would be linear number of I/Os (N/B) for creating/resizing/clearing/destroying such a container.

    The penalty with my approach would be an O(1) block of code for insertions, accesses and removals.

    > And it would probably break all the other algorithms that access the underlying blocks directly, like sorting, streamify, … so the blocks used must be aware of this, too.

    I don't think the other algorithms should be affected. They would work on the wrapper, not directly on the pointer.

     
  • Daniel Godás

    Daniel Godás - 2011-03-06

    Hi there! I modified the wrapper a bit and derived the vector and map classes to remove all the explicit calls to unwrap. I wanted to share this patch before moving on but my plan was to write some templates to wrap the containers and keep the pointer wrapper inside them. As you can see I initialise all the elements in the vector on creation. This could be avoided by adding 1 bit of metadata to each object stored on disk but I didn't want to go that way until after discussing this patch.

    Looking forward for your comments,
    Dan

    Index: include/stxxl/bits/utils/external_shared_ptr.h
    ===================================================================
    --- include/stxxl/bits/utils/external_shared_ptr.h  (revision 3014)
    +++ include/stxxl/bits/utils/external_shared_ptr.h  (working copy)
    @@ -57,8 +57,13 @@
          */
         char data[sizeof(P)];
    
    +    /*
    +     * Is there anything in the wrapper?
    +     */
    +    bool stored;
    +
     public:
    -    external_shared_ptr()
    +    external_shared_ptr() : stored(false)
         {
             /*
              * This constructor needs to be defined so that the [] operator
    @@ -67,7 +72,7 @@
              */
         }
    
    -    external_shared_ptr(P ptr)
    +    external_shared_ptr(P ptr) : stored(true)
         {
             /*
              * Copy the pointer to internal storage and increment
    @@ -76,32 +81,58 @@
             new(data) P(ptr);
         }
    
    +    void init()
    +    {
    +        stored = false;
    +    }
    +
         void unwrap()
         {
    +        assert(stored == true);
    +
             /*
    -         * Call the destructor to decrement the refcount. If this is
    -         * called more than once the results are undefined.
    +         * Call the destructor to decrement the refcount.
              */
             P * p = reinterpret_cast<P *>((void *)data);
             p->~P();
    +
    +        stored = false;
         }
    
         P get() const
         {
    -        /*
    -         * If this is called after unwrap() the behaviour is undefined
    -         */
    +        assert(stored == true);
    +
             P * p = reinterpret_cast<P *>((void *)data);
             return *p;
         }
    
         bool operator == (const external_shared_ptr & x) const
         {
    +        assert(stored == true);
    +
             P * p1 = reinterpret_cast<P *>((void *)data);
             P * p2 = reinterpret_cast<P *>((void *)x.data);
    
             return *p1 == *p2;
         }
    +
    +    const external_shared_ptr & operator = (const external_shared_ptr & x)
    +    {
    +        /*
    +         * Release the current object and replace it with the one in x.
    +         */
    +
    +        if (stored == true)
    +            unwrap();
    +
    +        if (x.stored) {
    +            new(data) P(x.get());
    +            stored = true;
    +   }
    +
    +        return *this;
    +    }
     };
    
     __STXXL_END_NAMESPACE
    Index: containers/test_external_shared_ptr.cpp
    ===================================================================
    --- containers/test_external_shared_ptr.cpp (revision 3014)
    +++ containers/test_external_shared_ptr.cpp (working copy)
    @@ -76,13 +76,64 @@
         *i;
     }
    
    +// use non-randomized striping to avoid side effects on random generator
    +typedef stxxl::VECTOR_GENERATOR<element, 2, 2, (2 * 1024 * 1024), stxxl::striping>::result _vector_type;
    
    +class vector_type : public _vector_type
    +{
    +public:
    +    vector_type(size_type n = 0) : _vector_type(n)
    +    {
    +        iterator it;
    +
    +        for (it = begin(); it != end(); ++it)
    +            it->init();
    +
    +        flush();
    +    }
    +
    +    void clear()
    +    {
    +        iterator it;
    +
    +        for (it = begin(); it != end(); ++it)
    +            it->unwrap();
    +
    +        _vector_type::clear();
    +    }
    +
    +    vector_type & operator = (const vector_type & obj)
    +    {
    +        if (&obj != this) {
    +            resize(obj.size());
    +
    +            for (size_type i = 0; i < obj.size(); i++)
    +                (*this)[i] = obj[i];
    +        }
    +
    +        return *this;
    +    }
    +
    +    void resize(size_type n)
    +    {
    +        size_type old = size();
    +
    +        _vector_type::resize(n);
    +
    +        for (size_type i = old; i < n; ++i)
    +            (*this)[i].init();
    +    }
    +
    +    ~vector_type()
    +    {
    +        clear();
    +    }
    +};
    +
     void test_vector()
     {
         try
         {
    -        // use non-randomized striping to avoid side effects on random generator
    -        typedef stxxl::VECTOR_GENERATOR<element, 2, 2, (2 * 1024 * 1024), stxxl::striping>::result vector_type;
             vector_type v(64 * 1024 * 1024 / sizeof(element));
    
             // test assignment const_iterator = iterator
    @@ -118,7 +169,6 @@
                 aep->key = rnd();
                 element e(aep);
    
    -            v[i].unwrap();
                 v[i] = e;
    
                 assert(v[i].get()->key == aep->key);
    @@ -140,9 +190,6 @@
             // check again
             STXXL_MSG("clear");
    
    -        for (vector_type::iterator it = v.begin(); it != v.end(); ++it)
    -            it->unwrap();
    -
             v.clear();
    
             stxxl::ran32State = 0xdeadbeef + 10;
    @@ -176,12 +223,6 @@
             vector_type v_copy1;
             v_copy1 = v;
             assert(v == v_copy1);
    -
    -        while (v.size() != 0) {
    -            element e = v.back();
    -            v.pop_back();
    -            e.unwrap();
    -        }
         }
         catch (const std::exception & ex)
         {
    @@ -221,8 +262,92 @@
    
     #define CACHE_ELEMENTS (BLOCK_SIZE * CACHE_SIZE / (sizeof(key_type) + sizeof(data_type)))
    
    -typedef stxxl::map<key_type, data_type, cmp, BLOCK_SIZE, BLOCK_SIZE> map_type;
    +typedef stxxl::map<key_type, data_type, cmp, BLOCK_SIZE, BLOCK_SIZE> _map_type;
    
    +class map_type : public _map_type
    +{
    +public:
    +    map_type(stxxl::unsigned_type node_cache_size_in_bytes,
    +        stxxl::unsigned_type leaf_cache_size_in_bytes
    +        ) : _map_type(node_cache_size_in_bytes, leaf_cache_size_in_bytes)
    +    { }
    +
    +    map_type(const key_compare & c_,
    +        stxxl::unsigned_type node_cache_size_in_bytes,
    +        stxxl::unsigned_type leaf_cache_size_in_bytes
    +        ) : _map_type(c_, node_cache_size_in_bytes, leaf_cache_size_in_bytes)
    +    { }
    +
    +    template <class InputIterator>
    +    map_type(InputIterator b,
    +        InputIterator e,
    +        stxxl::unsigned_type node_cache_size_in_bytes,
    +        stxxl::unsigned_type leaf_cache_size_in_bytes,
    +        bool range_sorted = false,
    +        double node_fill_factor = 0.75,
    +        double leaf_fill_factor = 0.6
    +        ) : _map_type(b, e, node_cache_size_in_bytes, leaf_cache_size_in_bytes,
    +                 range_sorted, node_fill_factor, leaf_fill_factor)
    +    { }
    +
    +    template <class InputIterator>
    +    map_type(InputIterator b,
    +        InputIterator e,
    +        const key_compare & c_,
    +        stxxl::unsigned_type node_cache_size_in_bytes,
    +        stxxl::unsigned_type leaf_cache_size_in_bytes,
    +        bool range_sorted = false,
    +        double node_fill_factor = 0.75,
    +        double leaf_fill_factor = 0.6
    +        ) : _map_type(b, e, c_, node_cache_size_in_bytes, leaf_cache_size_in_bytes,
    +                 range_sorted, node_fill_factor, leaf_fill_factor)
    +    { }
    +
    +    void clear()
    +    {
    +        iterator it;
    +
    +        for (it = begin(); it != end(); ++it)
    +            it->second.unwrap();
    +
    +        _map_type::clear();
    +    }
    +
    +    map_type & operator = (const map_type & obj)
    +    {
    +        if (&obj != this) {
    +            clear();
    +
    +            const_iterator it;
    +
    +            for (it = obj.begin(); it != obj.end(); ++it) {
    +                value_type v(it->first, it->second);
    +                insert(v);
    +            }
    +        }
    +
    +        return *this;
    +    }
    +
    +    data_type & operator [] (const key_type & k)
    +    {
    +        iterator it = find(k);
    +   
    +   if (it == end()) {
    +       data_type new_data;
    +       value_type v(k, new_data);
    +            it = insert(v).first;
    +   }
    +
    +        return it->second;
    +    }
    +
    +    ~map_type()
    +    {
    +        clear();
    +    }
    +};
    +
     void test_map()
     {
         stxxl::stats_data stats_begin(*stxxl::stats::get_instance());
    @@ -282,13 +407,6 @@
             STXXL_MSG("reads/logel " << reads << " readsperq " << readsperq);
             STXXL_MSG(stats_elapsed);
    
    -        while (Map.size() != 0) {
    -            map_type::iterator it = Map.begin();
    -            data_type data = (*it).second;
    -            Map.erase(it);
    -            data.unwrap();
    -        }
    -
             delete DMap;
         }
     }
    
     
  • Johannes Singler

    As you can see I initialise all the elements in the vector on creation. This could be avoided by adding 1 bit of metadata to each object stored on disk but I didn't want to go that way until after discussing this patch.

    Well, haven't you added a boolean called "stored" already?  This one will go to disk as well.  Couldn't you set the shared_ptr to NULL instead of !stored?  Initializing will in any case cause a lot of I/O, which usually is undesirable…  we would need something like lazy allocation.

    Apart from that, please drop " == true", and use more descriptive identifiers for

    element
    vector_type
    _vector_type
    map_type
    _map_type

    Copying _vector_type is also not overlapped.  However, we have to admit that it is not much better for a standard vector…

    Johannes

     
  • Daniel Godás

    Daniel Godás - 2011-03-07

    The reason for initilizing the fields is that when they are not initialized and I get a reference to one of them the constructor does not get called. That is why the field I added won't do the trick. On creation of an empty wrapper that field is set to 0 but with the stxxl vector the constructor never gets called leaving field undefined. The problem is in this bit of code (vector.h):

        void read_page(int_type page_no, int_type cache_slot) const
        {
            if (_page_status[page_no] == uninitialized)
                return;
    

    If we added the equivalent to the "stored" field at this level then we could call the constructor just before the return for uninitialized objects. I didn't want to be that intrusive without asking for better ideas.

    Apart from that, please drop " == true", and use more descriptive identifiers for

    Will do.

    Copying _vector_type is also not overlapped. However, we have to admit that it is not much better for a standard vector…

    I'd rather not worry about this until everything else is solved.

    Dan

     
  • Johannes Singler

    I understand.

    If we added the equivalent to the "stored" field at this level then we could call the constructor just before the return for uninitialized objects. I didn't want to be that intrusive without asking for better ideas.
    

    Yes, that would be quite intrusive.  If introducing such a flag, it should be at most per block, not per element.  Then, we could also keep it in internal memory.
    But I shy away from altering the STXXL vector standard behavior.
    Andreas, what do you think?

    Johannes

     
<< < 1 2 (Page 2 of 2)

Log in to post a comment.