Re: [Dibs-discussion] Congrats & some problems

Christian Stork <cs...@ic...> writes:

> First of all I want to say that I really like your approach to backup
> and, from what I can tell, your design and implementation seem to be of
> very high quality.

Thank you for the kind words.

> So, let me start chronologically.  The documentation is quite good, but
> there were a few things that weren't really clear on first reading and
> some others which still aren't clear to me. :-)

You're very diplomatic.  I agree that the documentation needs work. :)

> It wasn't immediately clear to me how I have to set the --talk/listen
> options to add_peer.  I think it might be a good idea to simply say that
> --listen is essentially always the same, i.e., the communication mode
> (active/passive/mail) of my local box, whereas --talk always refers to the
> mode of my peer.  (Maybe the option could even be renamed to
> --local/remote-mode or something.)

You are correct that --listen is the reverse of --talk.  I agree with
your suggestion that this should be more clearly documented or
renamed.  I'll log a feature request for this.

> Actually, it is not clear to me why the distinction between active
> and passive is needed at all.  Can't DIBS figure this out on its own
> and operate on a best effort basis?

Basically, if A is behind a firewall and peering with B, then you
don't want B constantly trying (and failing) to connect to A.  The
reason is that B has to wait for some time on every failed attempt to
connect to A, and this waiting creates a bottleneck.
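
To illustrate the bottleneck, here is a minimal simulation. It is not DIBS code: the function name, the dial_out flag, and the timeout and send durations are all made up for the example. It only shows how repeated failed connection attempts add up when a peer behind a firewall is dialed anyway.

```python
def time_spent_sending(queue, reachable, dial_out, timeout_s=30.0, send_s=1.0):
    """Return the simulated seconds spent working through a queue of peers.

    queue     -- peer names, one entry per pending message
    reachable -- peer name -> True if an outgoing connection would succeed
    dial_out  -- peer name -> True if we are configured to connect to it
    """
    total = 0.0
    for peer in queue:
        if not dial_out[peer]:
            # We never dial this peer, so no time is lost on it here.
            continue
        # A failed attempt costs the full timeout; a success costs send_s.
        total += send_s if reachable[peer] else timeout_s
    return total


queue = ["A", "C", "A", "C"]
reachable = {"A": False, "C": True}   # A sits behind a firewall
print(time_spent_sending(queue, reachable, {"A": True, "C": True}))
print(time_spent_sending(queue, reachable, {"A": False, "C": True}))
```

In this toy model almost all of the time in the first call goes to waiting on A, which is the situation the active/passive setting is meant to avoid.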

> The next thing, which is still not clear to me from the docs,
> is how DIBS decides to split up the files into pieces and to which peers
> they'll be sent.  First of all there is the "redundantPieces" user
> option, of which the documentation says:
>
>     This variable specifies how many redundant pieces of a file will be
>     created. If a file is chopped into k pieces (see kbPerFile), this
>     many extra pieces will be added using a Reed-Solomon code. For
>     example, if a file is chopped into 5 pieces and redundantPieces is
>     2, then 7 pieces will be sent such that the original file can be
>     recovered from any 5 of those 7 pieces.
>
> Is a file only chopped into pieces if it is above the kbPerFile limit?
> (I'm sure the answer is no, but it's not clear from the documentation.)
> Also, the default value of redundantPieces is not specified?   

I believe that a file is split into pieces based only on kbPerFile.
If not, there is a bug in either the code or the documentation.

> Now, what happened to me was that I wanted to slowly grow my network of
> backup peers.  First I hooked up with my friend Peter Froehlich and we
> decided to mutually provide 1GB.  When I designated 40MB of data to be
> backed up, Peter received 120MB of data.  That makes me assume that the
> default for redundantPieces is 2, and that DIBS put three copies of my
> data on Peter's box (1 original + 2 redundant copies).

Yes, the default for redundantPieces is 2.  As an aside, you can see
the default for various options by looking at the dibs_options.py file
that comes with your DIBS distribution.  Of course, ideally the
default values should be listed in the documentation.

> Of course, that does not make much sense, especially if it keeps
> Peter's box busy for more than half a day verifying all signatures
> of my data (more about this (hypothesis) later).

By this I'm guessing you mean that it doesn't make sense to put 3
copies of the same thing on Peter's box.  I would agree with you but
claim that the problem is not with DIBS but with the fact that you
only have one trading partner.  If you only want to have a small
number of trading partners, you could reduce redundantPieces
appropriately. 
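
As a rough illustration of how redundantPieces affects the space used, here is a small sketch. It is not DIBS code: the function name is made up, and it assumes that each redundant piece is as large as one data piece, which is the usual behaviour of a Reed-Solomon erasure code but is not confirmed here for DIBS.

```python
import math


def storage_estimate(file_kb, kb_per_file, redundant_pieces):
    """Return (data_pieces, total_pieces, stored_kb) for a single file.

    Assumes every redundant piece has the same size as a data piece.
    """
    # A file at or below the limit stays in one piece.
    data_pieces = max(1, math.ceil(file_kb / kb_per_file))
    total_pieces = data_pieces + redundant_pieces
    piece_kb = file_kb / data_pieces
    return data_pieces, total_pieces, piece_kb * total_pieces


# A small file, below a made-up 1000 KB limit, with 2 redundant pieces.
print(storage_estimate(100, 1000, 2))
# The documentation's example: 5 data pieces plus 2 redundant pieces.
print(storage_estimate(5000, 1000, 2))
```

Under these assumptions a file that fits in a single piece is stored three times over when redundantPieces is 2, which would be consistent with 40MB of small files turning into 120MB on the peer. A smaller redundantPieces lowers that factor accordingly.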

> This experience also makes me believe that DIBS tries to distribute a
> file over all available peers.  So, if I cooperate with 10 peers then my
> files are split in 10 including the 2 redundant copies.  Is that
> right?

DIBS first splits a file so that no piece is greater than the
kbPerFile limit.  Then it adds redundant pieces based on the
redundantPieces variable.  Then it tries to give each piece to a
separate peer if possible.  The reasoning behind this is as follows.

1.  Very large files are annoying for a number of reasons (take longer
    to transmit, take longer to verify, can make things run out of
    memory, etc.).  The solution is to have the kbPerFile variable
    cause large files to be split up.  This is independent of how 
    many peers you have and how much redundancy you want.

2.  The whole point of DIBS is robustness to failures, i.e., you want
    to be able to recover your data even if X peers fail.  The
    solution is to generate X redundant pieces which DIBS attempts to
    store with different peers.  This is independent of the number of
    peers you have and issue 1 above.
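
The placement step and the resulting robustness can be sketched as follows. This is not DIBS code: the function names are hypothetical, and round-robin assignment is only an assumed way of "trying to give each piece to a separate peer". The recovery rule (any k of the k + X pieces suffice) is the one quoted from the documentation above.

```python
def place_pieces(data_pieces, redundant_pieces, peers):
    """Map each piece index to a peer, round-robin (an assumed policy)."""
    total = data_pieces + redundant_pieces
    return {piece: peers[piece % len(peers)] for piece in range(total)}


def recoverable(placement, data_pieces, failed_peers):
    """True if at least data_pieces pieces sit on peers that did not fail."""
    surviving = sum(1 for peer in placement.values() if peer not in failed_peers)
    return surviving >= data_pieces


seven_peers = [f"peer{i}" for i in range(7)]
spread = place_pieces(5, 2, seven_peers)        # one piece per peer
print(recoverable(spread, 5, {"peer0", "peer1"}))
print(recoverable(spread, 5, {"peer0", "peer1", "peer2"}))

single = place_pieces(1, 2, ["peter"])          # only one trading partner
print(recoverable(single, 1, {"peter"}))
```

With seven peers the file survives any two failures but not three. With a single peer all three pieces land on the same machine, so the redundancy costs space without protecting against that peer failing.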

There are many other ways to think about splitting files and
generating redundant pieces and I don't claim that the current one is
best.  Rather it seemed like a reasonably simple starting point.
Proposals for alternate ways of handling this are welcome on the
discussion list or as requests for enhancement.

> Now what happens if several of the peers provide more space than
> others?  Let's say 5 of my peers provide 2GB instead of 1GB.  Are some
> of my files split in 5 pieces then?  How does DIBS make the decision
> which files to choose for the 5-way split?

The splitting decisions are described above.  Your question
illustrates some of the difficulties with more sophisticated splitting
methods.

> [Side track issue:  Given the above scenario maybe the best option is to
> split files into 7.5 pieces on average.  This would achieve the most
> homogeneous distribution, it seems.  And just to complicate things
> further, what if some of the shared amounts of space change now?]
>
> What's DIBS backup distribution strategy and is there an easy way to
> describe it so that no surprises happen as above?

Described above.

> As hinted at before there is one other major issue which we (or actually
> Peter) experienced:  Once the remote peer's data has been received DIBS
> seems to consume lots of computing resources.  My guess is that this is
> due to the gpg subprocesses used to verify the authenticity (and
> integrity) of the just received data.  Is that correct?  

The last time I checked, the major bottlenecks (in no particular order)
were GPG, network transmission, accessing the database, and generating
redundant pieces.  You can investigate this yourself by using the
Python profiler (http://docs.python.org/lib/profile.html).
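
For anyone who wants to try this, here is a minimal sketch of profiling a single call with the standard library's cProfile and pstats modules in current Python. The helper profile_call and the stand-in function work are made up for the example; you would pass in whichever function you suspect of being slow.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under the profiler; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def work(n):
    """Stand-in for an expensive operation."""
    return sum(i * i for i in range(n))


result, report = profile_call(work, 100000)
print(report)
```

The report lists call counts and cumulative time per function, which should make it clear whether the time goes to GPG subprocesses, the network, the database, or the redundancy coding.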

> And is there possibly a way to speed this up?  

Almost certainly.  Right now I'm still trying to get essential
features in DIBS.  After the 1.0 release I plan to focus more on
issues such as the UI, documentation, bug fixes, speed, etc.

> Even not verifying at all seems to be a better option! 

Yes, that is something worth considering.  The issue with that, however,
is how you would prevent untrusted people from storing stuff with you.
Perhaps you would check signatures only on recovery and unstore
requests but not on store requests.  That way, if someone stores stuff
with you when they shouldn't, they can never get it back.  If you
have a suggestion for how you think unverified or partially unverified
backup should work, please submit an RFE.
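
The partially unverified idea could be expressed as a small policy function. This is only a sketch of the proposal above, not how DIBS currently behaves, and the function name, the request type strings, and the verify_stores switch are all hypothetical.

```python
def must_verify(request_type, verify_stores=False):
    """Decide whether a request's signature should be checked.

    Recovery and unstore requests are always checked, so data stored by
    an untrusted party can never be retrieved or removed by them.
    Store requests are checked only if verify_stores is True.
    """
    if request_type in ("recover", "unstore"):
        return True
    if request_type == "store":
        return verify_stores
    raise ValueError(f"unknown request type: {request_type!r}")


for kind in ("store", "recover", "unstore"):
    print(kind, must_verify(kind))
```

The trade-off is that an untrusted sender can still consume disk space with store requests, so such a policy would probably need some other limit on who may store how much.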

> Despite all the problems mentioned above, let me reiterate that I really
> enjoy DIBS and it's this kind of software that makes open source
> great!

Thank you!

> It's just that UI or, in this case especially, performance issues like
> the last one can make or break the adoption of great ideas like DIBS.

Yes, I agree.  Unfortunately these are often a fundamental part of
open source projects for some very basic reasons.  First, open source
software such as DIBS is often developed as a hobby or in the free
time of the developer(s).  As a result, development takes a long
time and so there are two choices:  don't release anything until
everything looks "professional" or release gradually improved versions
of the software.

As a software developer, I find releasing less than perfect code
somewhat disturbing.  Ideally, I would like to have the time to make
things really solid before letting it out.  Realistically, though, if
I tried to develop DIBS in isolation I would miss out on a lot of the
feedback and encouragement which is essential to producing good
software. 

To summarize, let me say that I understand your frustration with poor
documentation and features that don't work as you want them to.  I get
frustrated with it too.  I hope to fix these issues as soon as
possible, but life is busy.

-Emin

> PS:  I crossposted this to the backup mailing list so that all my backup
> peers are in the loop too.

OK, I cc'd your list in the reply.