As I've mentioned before, I'm using s3cmd for Fedora Infrastructure to sync
~1TB and >1M files out to S3 in each region as mirrors for Fedora instances
running in EC2. Most of the feature enhancements I've written thus far
have been in support of this use case.
I am still having a couple of significant problems that I expect will require
some "better thinking" to resolve.
1) Hitting MemoryError when trying to sync this many files on 32-bit
Python. We keep dicts holding a lot of data about the local and remote
object lists, and at >900k objects we run out of address space. Yes,
running a 64-bit OS and Python with 12-16GB of RAM would resolve this, but
it seems like overkill to me.
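For a sense of scale, here's a back-of-envelope sketch in Python. The field
names and value sizes are purely illustrative (not s3cmd's actual internal
structure), but they show why ~900k per-file dicts strain a 32-bit address
space long before the machine itself runs out of RAM:

```python
import sys

# Illustrative per-file metadata dict -- keys and values are my guesses
# at the kind of data kept per object, not s3cmd's real structure.
entry = {
    'size': 1048576,
    'mtime': 1234567890,
    'md5': 'd41d8cd98f00b204e9800998ecf8427e',
    'uri': 's3://bucket/pub/fedora/linux/releases/example-package.rpm',
}

# Shallow size of the dict plus the sizes of its keys and values.
per_entry = sys.getsizeof(entry) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in entry.items()
)
total_mb = per_entry * 900_000 / 1024 / 1024
print(f"~{per_entry} bytes/entry, ~{total_mb:.0f} MB for 900k entries")
```

Even this conservative estimate lands in the hundreds of megabytes for a
single tree; with both local and remote trees resident, plus interpreter
overhead and heap fragmentation, a 32-bit process's ~2-3GB of usable
address space gets exhausted quickly.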
2) It takes >24 hours just to do an "incremental sync" of the directories
that have changed locally, mostly because we're doing S3 directory listings
on the whole blessed tree first. Since I have a good sense of which files
may have changed within each subtree, I'd like to be able to cache the
remote directory listings and reuse them between runs, updating the cache
when we upload or delete content. That alone could save 20+ hours. Obviously this
wouldn't work if you have multiple sources updating the S3 trees
independently, but for the single writer case, would be fine.
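A minimal sketch of what such a persistent listing cache could look like.
The file format, paths, and function names here are all my invention for
illustration, not existing s3cmd code:

```python
import json
import os
import time

def load_cache(path):
    """Return the cached remote listing ({key: {'size': ..., 'md5': ...}}),
    or an empty dict if no cache exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(path, listing):
    """Write the listing atomically so a crash mid-write can't corrupt it."""
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump(listing, f)
    os.rename(tmp, path)  # atomic on POSIX filesystems

def record_upload(listing, key, size, md5):
    """Call after a successful PUT so the cache stays current
    without re-listing the remote tree."""
    listing[key] = {'size': size, 'md5': md5, 'mtime': time.time()}

def record_delete(listing, key):
    """Call after a successful DELETE."""
    listing.pop(key, None)
```

The key property is that the cache is only ever mutated alongside a
successful S3 operation, which is exactly why it's safe for the
single-writer case and unsafe with multiple independent writers.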
I think both problems could be solved by shifting away from using in-memory
dicts for everything, to using an on-disk (or in-memory if persistence
isn't needed for some use cases) sqlite database. I haven't thought a lot
about the schema yet, but would start by modelling it on the dicts used
today to store the local and remote file trees. For my use case, this could
potentially reduce the time to crawl the remote trees by tens of hours.
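To make the idea concrete, here's a rough sqlite sketch. The schema and
column names are my guesses at what the current dicts hold, not a
worked-out design:

```python
import sqlite3

# One table for both trees, modelled loosely on the local/remote dicts.
SCHEMA = """
CREATE TABLE IF NOT EXISTS objects (
    tree   TEXT NOT NULL,   -- 'local' or 'remote'
    key    TEXT NOT NULL,   -- local path or S3 key
    size   INTEGER,
    mtime  INTEGER,
    md5    TEXT,
    PRIMARY KEY (tree, key)
);
CREATE INDEX IF NOT EXISTS idx_objects_md5 ON objects (md5);
"""

def open_db(path=':memory:'):
    """Use ':memory:' when persistence isn't needed; a file path when it is."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

# Finding files present locally but missing remotely becomes a single
# query instead of a walk over two huge in-memory dicts:
MISSING_REMOTE = """
SELECT l.key FROM objects l
LEFT JOIN objects r ON r.tree = 'remote' AND r.key = l.key
WHERE l.tree = 'local' AND r.key IS NULL
"""
```

With both trees in one table, the compare step turns into SQL joins, and
sqlite keeps its working set on disk instead of exhausting the 32-bit
heap, which would address both problems at once.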
Has anyone looked into doing this before? Is there a philosophical
objection to doing so?