Hi Shachar,
Thanks for the feedback. I agree that the tests are not ideal (I am a bit pressed for time and I couldn't use the ideal files for this test). However, in my opinion, the results put my mind more at ease regarding the levels of rsyncability.
Now to my three questions:
1) You said:
>> Lowering the gzip overhead means hurting compression ratio
>> (about 5% loss in relation to running zlib without rsyncable at the
>> current levels).
This could be interesting for me. If space is not really an issue, then it might be worth compromising compression ratio in return for less rsync overhead. How could this be done?
 
2) When re-encrypting a file, how does rsyncrypto compare the original
plaintext copy to the compressed ciphertext in order to find matching
blocks?

3) How shall we proceed in the "speeding-up performance" project?

Regards
Julian

On 08/11/06, Shachar Shemesh <rsyncrypto@shemesh.biz> wrote:
Julian Pace Ross wrote:
> Hi Shachar,
>
> I am sending my results regarding a few tests to the list.
> I did not try with larger files, in order to speed up the process, but
> I hope you can get a good idea from the following.
Yes, but small files have a higher overhead (percentage wise).
> *test.doc - 20 MB Microsoft Word document with images and text
> test.txt - 16 MB plain ASCII text file with garbled text.*
Apparently, not garbled enough, as the file compressed 1:1000.
> *CASE 1:
> No zip or encryption:*
>
> *a) Small change at beginning of files*
> test.doc
>     23842816 100%   14.84Mbits/s    0:00:12 (xfer#1, to-check=1/3)
> test.txt
>     16777232 100%   33.19Mbits/s    0:00:03 (xfer#2, to-check=0/3)
>
> Number of files: 3
> Number of files transferred: 2
> Total file size: 40620048 bytes
> Total transferred file size: 40620048 bytes
> _Literal data: 287057 bytes
> Matched data: 40332991 bytes
> _
>
But you don't have separate statistics for the doc and the text file.
The doc file has (apparently) much more than a simple change for each
change you make, thus skewing the results.
>
> *CASE 2:
> gzip --rsyncable only, followed by rsync:
> textfile compresses to 80K (!)
> *
>
I'm sorry. Only 1:250 compression. Your text wasn't nearly random enough.

80K is a tiny size. As such, the overheads rsyncrypto introduces
per change (about 16K) are huge in comparison. A 16K change out of a
16MB file seems like nothing, while a 16K change out of an 80K file seems
like a lot, but in both cases, only 16K needs to be synced.
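
To put rough numbers on this, here is a minimal Python sketch of the arithmetic. It assumes a flat overhead of about 16K per change (the approximate figure above); the constant and function names are made up for illustration.

```python
PER_CHANGE_OVERHEAD = 16 * 1024  # assumed ~16K re-synced per change (approximate)

def overhead_fraction(file_size: int, overhead: int = PER_CHANGE_OVERHEAD) -> float:
    """Fraction of the file that must be re-sent for a single small change."""
    # Sending the whole file is the worst case, so cap the cost at 100%.
    return min(overhead, file_size) / file_size

for label, size in [("80K file", 80 * 1024), ("16MB file", 16 * 1024 * 1024)]:
    print(f"{label}: {overhead_fraction(size):.2%} of the file re-sent")
```

The absolute cost is the same in both cases; only the percentage differs.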
>
> *CASE 3:
> full rsyncrypto before rsync.
> Total encryption time: 1 minute for 1.5 MB (almost all the time is
> taken up by the .doc)*
>
That's because encrypting 80K of data, even with rsyncrypto's current
abysmal throughput, is nothing.
> *CASE 4:
> full rsyncrypto before rsync WITH NULL GZIP:*
> *Total time for encryption: 25 minutes for around 36 MB.*
That's because rsyncrypto's current performance is abysmal... :-)
>
> *2) From Cases 1 and 4, it seems that rsyncrypto adds a margin of
> literal data to transmit, but the overall rsyncability is maintained.*
>
It would be better to say that rsyncrypto has a minimum amount of data
that will change across encryptions for any change. This is deliberate,
and, if anything, I'm thinking of making it bigger, as the minimum amount
of data changed affects the encryption's strength.

If you want, you can experiment with it using the
"--roll-win=num, --roll-min=num, --roll-sensitivity=num" parameters, but
I really would have to recommend against it. The overheads per change
are fairly bounded (they will rarely go above 20K), which is, under all
normal circumstances, negligible. Normal, of course, does not include
the case where the entire file is 80K, but in such a case the cost of
sending the entire file over is also negligible.
>
> *3) From test cases 2 and 3, it seems that a high proportion of
> literal data is added to the gzip overhead by rsyncrypto.*
>
Let's just say that both gzip --rsyncable and rsyncrypto add a certain
overhead per change. Since the two work independently, you get both
overheads. I do not consider this a bug, for the same reasons
listed above.

Notice that you can lower both overheads almost arbitrarily, but doing
so has costs. Lowering the gzip overhead means hurting compression ratio
(about 5% loss in relation to running zlib without rsyncable at the
current levels), while lowering the rsyncrypto overhead makes the change
stand out more in the encrypted file, hurting security.
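
As a rough illustration of how a per-change overhead arises, the Python sketch below cuts data into blocks wherever a rolling sum over a small window hits a trigger value. This is not rsyncrypto's or gzip's actual algorithm; `split_blocks`, `window`, `divisor` and `min_len` are hypothetical names, only loosely analogous to the roll parameters mentioned above.

```python
import random

def split_blocks(data, window=32, divisor=512, min_len=64):
    """Cut data where the sum of the last `window` bytes is divisible by `divisor`."""
    blocks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        # Only cut once the current block has reached the minimum length.
        if i + 1 - start >= min_len and rolling % divisor == 0:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])  # trailing partial block
    return blocks

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(200_000))
modified = original[:10] + b"X" + original[10:]  # one byte inserted near the start

old_blocks = split_blocks(original)
new_blocks = split_blocks(modified)
unchanged = len(set(old_blocks) & set(new_blocks))
print(f"{unchanged} of {len(new_blocks)} blocks are unchanged after the edit")
```

In this toy model a smaller `divisor` gives more frequent boundaries, so less data changes per edit, which mirrors the trade-off described above without reproducing the real implementation.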
>
> *4) It is to be seen whether the ratio of test case 3 would approach
> the more ideal ratio of test case 4 when larger files are
> gzipped/encrypted. I did not have time to conduct such a test. This is
> the crux of it all I guess. Maybe Shachar or someone else can confirm
> what will happen in the meantime? *
>
Yes, making a one byte change in a 16MB file will incur the same ~16KB
change penalty it does today, making the penalty percentage much smaller.

Shachar