From: Radosław K. <rad...@ko...> - 2011-01-05 10:37:48
2011/1/5 Kern Sibbald <ke...@si...>

> > I hope that by keeping hashes on the Director you mean actually
> > keeping them on both?
>
> We are considering every possibility -- each solution has its own
> advantages and disadvantages, so it is very hard to say that one way of
> doing this is the correct or right way.
>
> For example, it is faster to deduplicate if the hashes are stored on the
> client machine than if they are stored on a server such as the Director,
> but not every client machine has enough disk space to store them. Most
> estimates indicate that about 30% more disk space is required to keep
> the hash codes. In addition, your deduplication ratio will drop
> significantly (be very poor) if you are only deduping a single client
> machine and do not use a deduplication "pool" of hashes from multiple
> machines.
>
> Unless you run tests, which may vary from machine to machine, it is very
> difficult to know which algorithm is best. One major factor is that the
> machine might be connected to a server by a very slow 100Mb Internet
> connection or a fast 10Gb LAN.
>
> We will probably start with something very simple and add to it over time.

A question: how will we store the data-block hashes? An SQL database seems
very easy to implement, but it has a lot of disadvantages. Another option
is to use an open-source key/value database such as Redis
(http://redis.io/), which has very good performance. The last option is to
implement our own solution, which requires a lot of work and testing (all
a matter of time).

What do you think about it?

--
Radosław Korzeniewski
rad...@ko...
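To make the "pool" of hashes discussed above concrete, here is a minimal
sketch of the lookup-before-store step. A plain Python dict stands in for
whatever key/value store is chosen (SQL, Redis, or a custom index); the
function and variable names are illustrative only, not part of any Bacula
design.

```python
import hashlib

# Shared deduplication pool: block hash -> location of the stored copy.
# In a real deployment this would live in the chosen key/value store
# (e.g. Redis) rather than in process memory.
hash_index = {}

def store_block(client, offset, data):
    """Store a block only if its hash is not already in the pool.

    Returns True if the block was actually written, False if it was
    deduplicated against an existing copy.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest in hash_index:
        return False          # duplicate: record a reference, skip the write
    hash_index[digest] = (client, offset)
    return True

# Two clients backing up the same block: only the first write is kept,
# which is why a pool shared across machines dedupes better than a
# per-client index.
print(store_block("client-a", 0, b"same payload"))   # True  (new block)
print(store_block("client-b", 0, b"same payload"))   # False (deduplicated)
```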