My testing shows that 's3cmd --encrypt sync mydir s3://mybucket' does not encrypt the sync'd files. Is this a bug, or was s3cmd purposely implemented this way? If so, why?
My understanding is that you are correct.
Encrypt is reported as NOT an available function within sync because sync relies on md5sum signature comparison, and the md5sum of each stored object is evidently available from S3 for that comparison. Hence, until someone (maybe you?) writes a patch to place an encryption filter in front of sync's md5sum computation, you would have to encrypt the data locally and then upload it for the md5sum signatures to match.
I haven't looked at the code to confirm where the md5sum data held at S3 is derived from; however, from the developer's comments elsewhere I am led to believe that the md5sum signature is not uploaded by s3cmd. (If it were, that would not ensure data integrity.)
That said, keep in mind I am just a sys admin using this cool tool!
I've looked at the code, and it seems quite possible to generate the md5sum for the encrypted object, which could then be compared with what was stored on S3. The downside of the most obvious patch is that it would introduce considerable CPU overhead when sync'ing encrypted objects, because during the "verifying checksums..." stage s3cmd would have to encrypt each file before computing the md5sum that would be compared to what is stored in the ETag for the object in S3.
A better approach would be to store the plain-text md5sum in S3 metadata:
1) md5 - The md5sum of the object, available as ETag (regardless of whether it is encrypted or plain text)
2) md5plaintext - The md5sum of the plain text object stored as S3 metadata
When encryption is turned off, the ETag "md5" and S3 metadata "md5plaintext" would be the same value. However, when encryption is turned on, the ETag "md5" would not be the same as "md5plaintext".
Using this approach, 's3cmd sync' could be just as fast at determining the files that need to be sync'd when using --encrypt as when not using --encrypt. This is because s3cmd can checksum plain text local files and compare the checksum to "md5plaintext" to determine whether or not the local file has changed since the last sync.
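To make the proposed comparison concrete, here is a rough Python sketch. It is not s3cmd code: md5_of_file and needs_upload are hypothetical names, and a plain dict stands in for the S3 metadata of one object, which is assumed to carry an "md5plaintext" entry.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a local file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path, remote_metadata):
    """Hypothetical check: upload when the plaintext checksum is missing or differs.

    remote_metadata is a dict standing in for one object's S3 metadata.
    """
    return md5_of_file(path) != remote_metadata.get("md5plaintext")


# Small demonstration with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as fh:
        fh.write(b"hello")
    unchanged = needs_upload(path, {"md5plaintext": hashlib.md5(b"hello").hexdigest()})
    changed = needs_upload(path, {"md5plaintext": hashlib.md5(b"old").hexdigest()})
    missing = needs_upload(path, {})
```

Note that no encryption happens during the comparison; only files flagged by needs_upload would have to be encrypted and sent.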
It's true that encryption is not (yet) honored for sync. Sync takes the remote objects' MD5 sums from a bucket listing, which gives you object names AND their MD5 sums (i.e. it takes only one S3 request to get the MD5 sums of all stored objects).
There are basically two approaches to this problem. The first is to store the plaintext MD5 sum in an object header. The checksum itself has to be encrypted, otherwise you'd leak sensitive information about your plaintext data, but that's just a minor issue. The major problem is that you'd have to query the headers of all remote objects when syncing, which could be time consuming.
The other approach is to keep a "metaobject" in each bucket that will keep track of all plaintext checksums and possibly other information (timestamps, filemode, local owner/group, etc). Sync will then retrieve this "metaobject", compare local files against the information found in there, upload or delete changed files and finally upload the updated "metaobject". That's obviously much faster than querying the headers of each object, but due to the inability to lock the bucket it may lead to inconsistencies (for instance if s3cmd is interrupted prior to updating the metaobject, or if two s3cmd instances update the bucket at the same time).
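A minimal sketch of how the "metaobject" comparison might look, assuming the metaobject is a JSON document mapping object names to plaintext MD5 sums. That layout and the plan_sync name are assumptions made for illustration, not something s3cmd defines.

```python
import json


def plan_sync(local_checksums, manifest):
    """Decide what to upload and what to delete remotely.

    local_checksums: dict of relative path -> plaintext MD5 of the local file.
    manifest: dict of object name -> plaintext MD5, as read from the metaobject.
    """
    # New or changed files: absent from the manifest, or recorded with another MD5.
    uploads = sorted(p for p, md5 in local_checksums.items() if manifest.get(p) != md5)
    # Objects recorded in the manifest that no longer exist locally.
    deletes = sorted(p for p in manifest if p not in local_checksums)
    return uploads, deletes


# The manifest would come from a single download of the metaobject.
manifest = json.loads('{"a.txt": "111", "b.txt": "222", "old.txt": "333"}')
local = {"a.txt": "111", "b.txt": "999", "new.txt": "444"}
uploads, deletes = plan_sync(local, manifest)
print(uploads)  # ['b.txt', 'new.txt']
print(deletes)  # ['old.txt']
```

The sketch ignores the consistency problem: if the updated manifest is never written back, or two clients write it concurrently, it no longer describes the bucket.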
A combination of both is probably the best way to go. I'll just have to work on some decent "transaction" mechanism to protect against inconsistencies (or at least be able to detect when things went wrong).
BTW, your suggestion to encrypt all local files on every sync and compare the checksums of the encrypted data wouldn't work, as GPG uses random encryption keys: even if you encrypt one file twice with the same passphrase, you'll get two different encrypted files. Therefore comparing the checksums of encrypted files is not possible.
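The effect can be illustrated with a toy example. The cipher below is NOT GPG and is not secure; toy_encrypt and toy_decrypt are made-up helpers whose only purpose is to show that mixing a fresh random value into every encryption (here a random nonce, standing in for GPG's random session key) makes the ciphertext, and hence its MD5, differ on every run.

```python
import hashlib
import os


def _keystream(key, length):
    """Derive `length` pseudo-random bytes from `key` (toy construction)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_encrypt(plaintext, passphrase):
    """XOR the plaintext with a keystream keyed by passphrase + random nonce."""
    nonce = os.urandom(16)  # fresh randomness on every call
    key = hashlib.sha256(passphrase + nonce).digest()
    body = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, len(plaintext))))
    return nonce + body


def toy_decrypt(ciphertext, passphrase):
    """Reverse toy_encrypt using the nonce stored in the first 16 bytes."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    key = hashlib.sha256(passphrase + nonce).digest()
    return bytes(a ^ b for a, b in zip(body, _keystream(key, len(body))))


data = b"same file contents"
first = toy_encrypt(data, b"secret")
second = toy_encrypt(data, b"secret")
first_md5 = hashlib.md5(first).hexdigest()
second_md5 = hashlib.md5(second).hexdigest()
print(first_md5 == second_md5)  # False (except with negligible probability)
```

Both ciphertexts decrypt to the same data, yet their checksums differ, so a checksum of a freshly encrypted local file can never be expected to match the ETag of the copy stored earlier.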
Hope that explains it.
Very good explanation. Thank you!
... Only I'm not sure what you mean by: "The checksum itself has to be encrypted, otherwise you'd leak sensitive information about your plaintext data ..." What, exactly, would I be leaking?