[S3tools-general] Feature idea: sqlite caches

As I've mentioned before, I'm using s3cmd for Fedora Infrastructure to sync
~1TB and >1M files out to S3 in each region as mirrors for Fedora instances
running in EC2.  Most of the feature enhancements I've written thus far
have been in support of this use case.

I am still having a couple of significant problems that I expect will
require some "better thinking" to resolve.

1) Hitting MemoryError when trying to sync this many files on 32-bit
Python.  We keep dicts holding a lot of data about the local and remote
object lists.  At >900k objects, we run out of address space.  Yes, running
a 64-bit OS and Python with 12-16GB RAM would resolve this, but it seems
like overkill to me.

2) It takes >24 hours just to do an "incremental sync" of the directories
that have changed locally, mostly because we're doing S3 directory listings
on the whole blessed tree first.  As I've got a good sense of what files
may have changed within subtrees, I'd like to be able to cache the remote
directory listings and use them between runs, updating the cache when we
upload or delete content.  That alone could save 20+ hours.
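
A rough sketch of the caching idea in (2), assuming SQLite as the store (as
the subject line suggests).  The table layout and the function names
(open_cache, record_upload, record_delete, list_subtree) are hypothetical
and are not part of s3cmd; the point is only that the listing lives in a
database that is updated on upload/delete and read back row by row rather
than held in one big dict.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the listing cache. Pass a file path to persist it
    between runs; the default keeps it in memory only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS remote_objects ("
        " key TEXT PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " md5 TEXT NOT NULL)"
    )
    return conn


def record_upload(conn, key, size, md5):
    """Insert or overwrite the cached entry after a successful upload."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO remote_objects (key, size, md5)"
            " VALUES (?, ?, ?)",
            (key, size, md5),
        )


def record_delete(conn, key):
    """Drop the cached entry after a successful remote delete."""
    with conn:
        conn.execute("DELETE FROM remote_objects WHERE key = ?", (key,))


def list_subtree(conn, prefix):
    """Yield (key, size, md5) for cached objects under a prefix, one row
    at a time. substr() is used instead of LIKE so that '%' and '_' in
    the prefix need no escaping; it scans the table rather than using
    the index."""
    cursor = conn.execute(
        "SELECT key, size, md5 FROM remote_objects"
        " WHERE substr(key, 1, length(?)) = ? ORDER BY key",
        (prefix, prefix),
    )
    yield from cursor
```

An incremental sync could then call list_subtree() only for the subtrees
known to have changed, instead of listing the whole bucket first.  The cache
is only as accurate as the updates made through it, so anything that changes
the bucket by other means would still call for an occasional full re-listing.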

I think both problems could be solved by shifting away from using in-memory
dicts for everything, to using an on-disk (or in-memory if persistence
isn't needed for some use cases) a
