As I've mentioned before, I'm using s3cmd for Fedora Infrastructure to sync ~1TB and >1M files out to S3 in each region as mirrors for Fedora instances running in EC2. Most of the feature enhancements I've written thus far have been in support of this use case.
I am still having a couple of significant problems that I expect will require some "better thinking" to resolve.
1) Hitting MemoryError when trying to sync this many files on 32-bit Python. We keep dicts holding a lot of data about the local and remote object lists, and at >900k objects we run out of address space. Yes, running a 64-bit OS and Python with 12-16GB RAM would resolve this, but it seems like overkill to me.
2) It takes >24 hours just to do an "incremental sync" of the directories that have changed locally, mostly because we're doing S3 directory listings on the whole blessed tree first. As I've got a good sense of what files may have changed within subtrees, I'd like to be able to cache the remote directory listings and use them between runs, updating the cache when we upload or delete content. That alone could save 20+ hours.
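To make the caching idea in 2) a little more concrete, here is a minimal sketch. It is not s3cmd code: the class name RemoteListingCache, its method names, and the SQLite table layout are all hypothetical, chosen only for illustration. It assumes the cache is updated right after each successful upload or delete, so that a later run can read a subtree's listing from disk instead of asking S3 for it.

```python
import sqlite3


class RemoteListingCache:
    """Hypothetical on-disk cache of a remote S3 listing (not part of s3cmd)."""

    def __init__(self, path):
        # One row per remote object; the key is the full object name.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS objects ("
            "key TEXT PRIMARY KEY, size INTEGER, md5 TEXT)"
        )
        self.db.commit()

    def record_upload(self, key, size, md5):
        # Call after a successful upload so the cache matches the bucket.
        self.db.execute(
            "INSERT OR REPLACE INTO objects (key, size, md5) VALUES (?, ?, ?)",
            (key, size, md5),
        )
        self.db.commit()

    def record_delete(self, key):
        # Call after a successful remote delete.
        self.db.execute("DELETE FROM objects WHERE key = ?", (key,))
        self.db.commit()

    def list_prefix(self, prefix):
        # Yield (key, size, md5) for every cached object under a prefix,
        # in key order. This scans the whole table, which is fine for a
        # sketch but would want a smarter query at >1M rows.
        cursor = self.db.execute(
            "SELECT key, size, md5 FROM objects "
            "WHERE substr(key, 1, ?) = ? ORDER BY key",
            (len(prefix), prefix),
        )
        for row in cursor:
            yield row
```

A sync run would then call list_prefix() for only the subtrees known to have changed locally, and fall back to a real S3 listing when the cache is missing or suspected stale. Because rows are read from disk one at a time, this layout might also ease the address-space pressure described in 1), though that is untested here.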