Hi,

a few observations of my own, not to be taken too seriously:

1) Block level deduplication

There are already plenty of filesystems and FUSE-based filesystem layers (such as ZFS, lessfs, ...) which do this. These are usually more efficient than rolling your own solution, and the deduplication is well abstracted away.
In my opinion it does not make sense to do block-level deduplication in the application layer, except if you do it on the client side to save bandwidth (see the sketch below).
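A minimal sketch of that client-side variant, just to make the idea concrete: split the stream into blocks, hash each block, and only transfer blocks the server does not already have. The server_has_chunk and upload_chunk callbacks are hypothetical placeholders, not anything that exists in BackupPC:

import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB fixed-size blocks; content-defined chunking would handle shifted data better

def backup_file(path, server_has_chunk, upload_chunk):
    """Send only those blocks of `path` that the server does not already store."""
    manifest = []  # ordered block digests; enough to reconstruct the file later
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if not server_has_chunk(digest):   # one round-trip per block in this naive sketch
                upload_chunk(digest, block)    # bandwidth is only spent on blocks the server lacks
            manifest.append(digest)
    return manifest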

2) Database

I would suggest not abusing the file system as a database and instead using something like SQLite. That gives you transactions, atomic operations, etc. for free and also improves speed; a rough sketch of what that could look like is below.
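For example (the schema and names here are purely illustrative assumptions, not an existing BackupPC layout), chunk reference counts could live in a small SQLite table and be updated transactionally:

import sqlite3

# Illustrative only: track chunk reference counts in SQLite instead of encoding
# that state in the filesystem (schema and names are made up for this example).
db = sqlite3.connect("pool.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        digest   TEXT PRIMARY KEY,   -- content hash of the pooled chunk
        refcount INTEGER NOT NULL    -- number of backups referencing it
    )
""")

def add_reference(digest):
    # 'with db:' wraps both statements in one transaction: sqlite3 commits on
    # success and rolls back on an exception, so the counter is never half-updated.
    with db:
        db.execute("INSERT OR IGNORE INTO chunks (digest, refcount) VALUES (?, 0)", (digest,))
        db.execute("UPDATE chunks SET refcount = refcount + 1 WHERE digest = ?", (digest,))

def unreferenced_chunks():
    # Candidates for garbage collection.
    return [row[0] for row in db.execute("SELECT digest FROM chunks WHERE refcount = 0")]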

3) v4

Is v4 published anywhere? If there are major changes in v4 and you are working from v3, what you are doing looks more like a fork.

Regards

On 07.08.2012 10:36, Wessel Dankers wrote:
> Hi Les,

> On 2012‒08‒06 13:05:53-0500, Les Mikesell wrote:
>> On Mon, Aug 6, 2012 at 9:46 AM, Wessel Dankers
>> <wsl-backuppc-devel@fruit.je> wrote:
>>> The ideas overlap to a limited extent with the ideas[0] that Craig posted
>>> to this list. For instance, no more hardlinks, and garbage collection is
>>> done using flat-file databases. Some things are quite different. I'll try
>>> to explain my ideas here.
>> Personally I think the hardlink scheme works pretty well up to about
>> the scale that I'd want on a single machine and you get a badly needed
>> atomic operation with links more or less for free.
> Adding a chunk can be done atomically using a simple rename(). Removing a
> chunk can be done atomically using unlink(). The only danger lies in
> removing a chunk that is still being used (because there's a backup still
> in progress whose chunks aren't being counted yet by the gc procedure). The
> simplest way to prevent that is to grant exclusive access to the gc
> process. Note that hardlinks do not prevent this race either. It's a
> problem we need to solve anyway.
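
To make the rename()/unlink() approach described above concrete, here is a minimal sketch; the flat pool layout and function names are just my own assumptions, not part of any existing code:

import hashlib
import os
import tempfile

def add_chunk(pool_dir, data):
    """Publish a chunk so readers never see a partial file.

    Write to a temporary file in the same directory, then rename() it into
    place; on POSIX filesystems the rename is atomic. Sketch only."""
    digest = hashlib.sha256(data).hexdigest()
    final = os.path.join(pool_dir, digest)
    if os.path.exists(final):              # already pooled, nothing to do
        return final
    fd, tmp = tempfile.mkstemp(dir=pool_dir)
    try:
        os.write(fd, data)
        os.fsync(fd)                       # make sure the data hits the disk first
    finally:
        os.close(fd)
    os.rename(tmp, final)                  # atomic: the chunk appears fully formed
    return final

def remove_chunk(pool_dir, digest):
    """unlink() is atomic too, but only safe while the gc process has exclusive
    access, so that no in-progress backup is still counting on this chunk."""
    os.unlink(os.path.join(pool_dir, digest))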

> Craig lists a couple of reasons to abandon the hardlink scheme in
> http://sourceforge.net/mailarchive/message.php?msg_id=27140176

>> If you are going to do things differently, wouldn't it make sense to use
>> one of the naturally distributed scalable databases (bigcouch, riak,
>> etc.) for storage from the start since anything you do is going to
>> involve re-inventing the atomic operation of updating a link or replacing
>> it and the big win would be making this permit concurrent writes from
>> multiple servers?
> Using something like ceph/rados for storing the chunks could be interesting
> at some point but I'd like to keep things simple for now.

> cheers,


