From: Roy K. <rk...@ps...> - 2005-07-07 11:42:22
Comments inline below.

Roy Keene
Planning Systems Inc.

On Thu, 7 Jul 2005, Craig Barratt wrote:

> Roy Keene writes:
>
>> I have the BackupPC Daemon stable enough to temporarily be
>> deployed (it has a semi-functional auto-update, so I can update it.. the
>> first update will likely be to the auto-update).
>
> Impressive start!
>
>> I don't know enough Perl to implement the BackupPC side of things, so I
>> wrote a client that outputs USTAR formatted streams.
>
> That could be a great way to get started - just pipe that
> into BackupPC_tarExtract...

I just changed the TarCmd variable for each host using it. The USTAR format is extremely limited, so this is only a short-term solution. I might look into using GNU tar extensions later.

>
> I'm sorry I've been too busy lately to really think carefully
> about this.  There are a lot of non-trivial issues here that
> could either allow us to do something great, or could really
> trip us up.
>
> Here are some pros/cons about rsync:
>
>  - Pro: pipelining design allows good performance even with
>    high-latency links.  Your protocol could allow pipelining
>    by using a similar approach to rsync: get the complete file
>    list, then the receiver forks into two processes.  The goal
>    is to avoid one or more round-trips (request/responses) per
>    file.
>
>  - Con: Rsync requires the entire file list to be sent and stored
>    at each end of the connection.  That takes about 100 bytes per
>    file and makes it difficult to scale rsync to very large numbers
>    of files.
>
>    It would be great to design a protocol that avoided the need
>    to send and store the entire file list.  Maybe the sending
>    side can fork into two threads: one sends the file list and
>    the other receives files checksums and sends file changes.

This is what I was going for. The way it's written, it's cheap for a client to stall the list and send data on a separate connection. For maximum portability, it's written as a single-threaded, single-process application.
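
To make the "don't store the entire file list" idea concrete, here is a minimal sketch in Python. It is not the daemon's code, and the function name `stream_file_list` and the `(path, size)` record shape are made up for illustration. It only shows that a single-threaded walker can hand out file-list entries one at a time, holding just the entries of the directories currently being traversed instead of the whole tree.

```python
import os
from typing import Iterator, Tuple


def stream_file_list(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (path, size) for each non-directory entry under root.

    Hypothetical helper, not part of BackupPC or the daemon. Entries are
    produced lazily, so the caller can send each one over the wire (or
    pause the listing) without the full file list ever being in memory.
    """
    # Only the current directory's entries are held while it is processed;
    # sorting them makes the output order deterministic.
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)

    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # Recurse lazily: nothing below this directory is read until
            # the consumer asks for the next record.
            yield from stream_file_list(entry.path)
        else:
            size = entry.stat(follow_symlinks=False).st_size
            yield (entry.path, size)
```

A consumer would simply iterate, for example `for path, size in stream_file_list("/home"): ...`, and could stop pulling records whenever it wants to stall the list and do something else.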
>
> Here's what rsync does currently.  First the file list is
> sent from the Client/Sender to the Server/Receiver
>
>     Server/Receiver <--- file list --- Client/Sender
>
> Then the Receiver forks.  One thread sends block checksums
> to the client/sender.  The client/sender sends matching
> blocks and raw data to the receiver:
>
>     Server/Receiver ---> block checksum -->\
>                                             >---> Client/Sender
>     Server/Receiver <--- file deltas    <--/
>
> Perhaps each side could fork at each end and do a continuous
> pipeline something like this:
>
>                      /--- file list <-------- Client Sender
>     Server/Receiver -<
>                      \---> block checksums -->\
>                                                >---> Client/Sender
>     Server/Receiver <--- file deltas       <--/
>
> This way there is a continuous pipeline of file listing,
> block checksums and file deltas flowing, with no need to
> store the entire file list at all.
>
> There are complications here since the second Server/Receiver
> needs to know the file path.  Also, if the short checksums that
> rsync uses on the first pass result in a failure, it is tricky
> making sure the file is repeated correctly.
>
> Also, it would also be helpful at the BackupPC end to know when
> a directory was complete (ie: all the files in a directory were
> done).
>
>  - Con: combined with cygwin, rsync cannot access locked files
>    on WinXX, and also is slower because of the cygwin overhead.
>    It sounds like your daemon will be faster.  Can your daemon
>    work with something like VSS to allow access to locked files?

The code is designed to allow this in the future pretty easily. The definition for "backuppc_pathmangle()" and "backuppc_pathunmangle()" will need to be changed to allow it to re-write the entire buffer rather than just change drive letters into paths, but that is trivial. Opening files and directories is completely wrapped in a layer that could implement VSS. I looked at VSS and it was very confusing so I stayed away from it for now.
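
For readers wondering what "change drive letters into paths" might look like, here is a rough Python sketch. The actual mapping used by "backuppc_pathmangle()" is not described in this message, so the `C:\dir\file` to `/C/dir/file` convention below, and the names `pathmangle_sketch` and `pathunmangle_sketch`, are assumptions made purely for illustration.

```python
import re

# Matches a leading drive letter such as "C:" plus any separators after it.
_DRIVE_RE = re.compile(r"^([A-Za-z]):[\\/]*")
# Matches a mangled path whose first component is a single drive letter.
_MANGLED_RE = re.compile(r"^/([A-Za-z])(?:/(.*))?$", re.S)


def pathmangle_sketch(native: str) -> str:
    """Assumed mapping: 'C:\\Users\\roy' -> '/C/Users/roy'."""
    m = _DRIVE_RE.match(native)
    if not m:
        # No drive letter: only normalise the separators.
        return native.replace("\\", "/")
    rest = native[m.end():].replace("\\", "/")
    return "/" + m.group(1).upper() + ("/" + rest if rest else "")


def pathunmangle_sketch(wire: str) -> str:
    """Inverse of the assumed mapping: '/C/Users/roy' -> 'C:\\Users\\roy'.

    Assumes every wire path starts with a one-letter drive component;
    anything else is returned with separators converted only.
    """
    m = _MANGLED_RE.match(wire)
    if not m:
        return wire.replace("/", "\\")
    rest = (m.group(2) or "").replace("/", "\\")
    return m.group(1).upper() + ":\\" + rest
```

Under this assumed convention a path round-trips, for example `pathunmangle_sketch(pathmangle_sketch("C:\\Users\\roy\\file.txt"))` gives back the original string. A VSS-aware version would presumably rewrite more of the path than just the drive prefix, which is the change described above.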
>
>  - Con: rsync doesn't provide consistent support for extended
>    file attributes.  Your protocol will support these, which
>    is a significant improvement.
>

Yes, they'll be implemented as "BPC_ATTRID_NTACL". I've not actually written the code to support them (or "BPC_ATTRID_PACL" for POSIX ACLs) yet.

Some other interesting attributes sent:
    MD5 hash (if requested)
    SHA1 hash (if requested)

>  - Pro: rsync uses some run-length encoding of the file paths,
>    and flags for encoding unchanged unix attributes to reduce
>    network traffic when sending the file list.  However, I think
>    the benefit is relatively small, so sending the full file path
>    for every file is probably fine.  Just FYI.

Adding compression through LZO or Zlib would be easy enough.

>
>  - Comment: we need to clearly understand file name character
>    encoding.  Should this be reported by the daemon, or, better
>    yet, should the wire format always be utf8?

The wire format should always be UTF-8. I hadn't thought of this and I don't have much experience with locales, so I'm not sure how it's being handled now. Either way, this is the easiest solution.

>
> For the rdiff file part of the protocol (get and reply) it would
> be extremely helpful if this exactly matched the rsync algorithm,
> rather than rdiff.  That would significantly reduce the coding
> effort on the BackupPC side.

The "RDIFF" stuff was more of an experiment and has no code behind it so we can change it at will.

>
> Craig
>

I'm going to talk with my boss and supervisor this morning about releasing the code.

Thanks for all your good insight.