xz rsync-friendly

2009-05-13
2013-05-30
  • Hi

    Is there any chance of adding an option to xz to make it rsync-friendly, like gzip has?

    gzip -h
    ......
    ......
    -1 --fast        compress faster
    -9 --best        compress better
    --rsyncable   Make rsync-friendly archive

    Thanks in advance

    roberto

     
    • Lasse Collin
      2009-05-13

      It should be possible. The problem is probably that the LZMA2 compression code is likely to keep changing so that the algorithm can be made better (better speed or better compression ratio). Such changes don't affect the file format, but they will make a new version of the encoder produce different output, thus breaking rsyncability with files created with a different version of xz.

      One solution would be to freeze some version of the encoder code, and make improvements in another copy of the same code. That way when rsyncability is wanted, the old frozen code would be used for encoding. I don't like this solution though.

      It is quite likely that there will be rsyncability in xz some day. But it probably won't be there this year.

       
    • Ok, thanks for your reply

      roberto

       
  • Ole Tange
    2009-12-27

    I use --rsyncable often. However, I control both ends of the transmission, so it would not be a problem for me if different versions of xz had different implementations of --rsyncable, as I would simply make sure to have the same version.

    I would think that most use of --rsyncable today is by people transferring files between their own machines, not between two different people.

    I actually have a hard time coming up with scenarios where having different implementations of --rsyncable will work worse than xz works now.

    From this I would say: If you can make a version with --rsyncable then do it. Even if you have to change it later.

     
  • Lasse Collin
    2009-12-31

    OK, that's good to know. Maybe I will look into it once multithreaded compression has been implemented, which will be one of the most important things once 5.0.0 is out.

     
  • Robert de Bath
    2012-09-26

    I've just come across this comment about rsyncable … changes in the algorithm between versions have absolutely nothing to do with the rsyncable flag. For gzip the decoder doesn't even know that rsyncable was used.

    One could add a somewhat inefficient rsyncable wrapper around the current xz implementation as follows:

    1) Use a content-sensitive algorithm (like gzip does) to chop the input stream into blocks.
    2) Compress, using xz, each block completely independently.
    3) Concatenate the xz files (of each block) to make a single file.

    The result is an rsyncable xz file.
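
    A minimal sketch of the three steps above, in Python with the standard `lzma` module. The rolling sum over a 4096-byte window and the "sum is a multiple of 4096" boundary rule are assumptions chosen for illustration (loosely in the spirit of gzip's rsyncable mode), and the names `split_blocks` and `compress_rsyncable` are hypothetical. A real implementation would probably also enforce minimum and maximum block sizes.

```python
import lzma

WINDOW = 4096   # rolling-window length in bytes (assumed value)
MODULUS = 4096  # cut a block when the window sum is a multiple of this (assumed value)


def split_blocks(data, window=WINDOW, modulus=MODULUS):
    """Step 1: chop the input at boundaries that depend only on nearby content."""
    blocks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if rolling % modulus == 0:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])  # trailing block without a boundary
    return blocks


def compress_rsyncable(data, preset=6):
    """Step 2: compress every block as a completely independent .xz stream."""
    return [lzma.compress(block, format=lzma.FORMAT_XZ, preset=preset)
            for block in split_blocks(data)]


def join_streams(streams):
    """Step 3: concatenate the streams into one file."""
    return b"".join(streams)
```

    The joined output can be decoded in one go with `lzma.decompress()`, which accepts concatenated streams. Because each boundary depends only on the preceding 4096 bytes, a small edit changes only the compressed blocks near it; the blocks before and well after the edit come out byte-for-byte identical.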

    It doesn't matter that the remote is using a different version because the xz file is still valid for all versions.
    If an executable upgrade makes for a very different compressed file, that is an irritation, but only a very minor one; it's not as if the program will be upgraded every day.

    Obviously it'd be best if the decoder doesn't know that 'rsyncable' has been used, but if a 'stream reset' token needs to be added, even that is only a one-time upgrade.

    So what's the problem?

     
  • Lasse Collin
    2012-09-27

    You are right that the decoder doesn't need to know about rsyncability. There's also no need to change the file format to make rsyncability possible. LZMA2 supports reset markers already.

    The problem is that I haven't been able to work on xz as much as I would have hoped. For example, there is still no stable release with a threaded encoder.

     
  • Robert de Bath
    2012-10-03

    Ah, the old 'got a life' problem :-)

    It's okay, the thing that prompted me to write was the (wrong) comment in the man page. I'm not even sure it would give enough of a boost over gzip --rsyncable.

    These three files illustrate the issue.

    total 415028
    -rw-r--r-- 1 robert robert 142979072 Sep 29 09:22 linux.sfs
    -rw-r--r-- 1 robert robert 171764281 Sep 29 09:22 linux.tgz
    -rw-r--r-- 1 robert robert 109805895 Sep 29 09:22 linux.tlz

    The tlz is a tar.lzma (LZMA1 at -9) and the tgz is an rsyncable gzip -6; these are the controls.
    The sfs is a squashfs filesystem with a 128k block size using xz as the compressor. In effect it's an rsyncable xz, though that's more of a side effect of it being 'seekable'.

    As you can see, there's a substantial increase in size between the tlz and the sfs, probably an effect of the very small block size/dictionary size. For my application it's not enough to make up for the disadvantages, so I'm normally using gzip as the sfs compressor, and the file size comes out a tiny bit smaller than the tgz file.

     
  • Lasse Collin
    2012-10-03

    No, it's very much the opposite of "getting a life".

    I fixed the comment on the man page. I think it wasn't completely wrong though, because changing the encoder means that rsyncability is temporarily broken. Sometimes that is just a minor annoyance but sometimes it can mean that people will need to stick to an old version of xz. But in any case, having some form of rsyncability support is always better than none.

    Splitting the data into individual blocks minimizes the amount of data that rsync needs to transfer if very small changes are made to the file. Like you observed, it's not good for compression ratio though.

    The compression ratio will be less affected if the dictionary is not forgotten at every rsyncability point. That is, flush the LZMA2 encoder and reset the encoder state except the match finder state (which contains the dictionary). The downside is that a tiny change in the uncompressed data can affect even several megabytes of compressed data depending on the dictionary size and compression ratio.

    Even when the dictionary isn't reset, the frequency of the state resets may affect the compression ratio quite a bit. Maybe it needs to be adjustable and the default be based on the dictionary size. This is just guessing, it needs to be tested.

     
  • Robert de Bath
    2012-10-03

    Ah, that makes the manpage much clearer.

    But as for not forgetting the dictionary at the reset mark, I don't think that's gonna work.
    While initially the bytes after a partial reset might start the same as last time, as soon as you get a reference to something before the delta (say the change was deleting one space), the tokens for that reference from the LZ match finder will be different, which I expect will change all the following output bits (assuming this "range encoding" has similar properties to Huffman or arithmetic encoding). You're then lost until the next partial reset, and whatever compression you lose from doing the partial reset isn't going to be recoverable by rsync until you go past the LZ dictionary distance. At least if you do a full reset you know it isn't going to break again within 'a few' bytes.

    Changing the expected length of an rsync block does sound like a good idea though, probably by adjusting the number of significant bits on the running hash. And matching it to the dictionary size as you suggest sounds good too.
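
    A rough sketch of how the number of significant bits could set the expected block length. It assumes a plain rolling sum over a 4096-byte window as the running hash, with a boundary wherever the low `bits` bits of the sum are all zero; the function name `block_lengths` and the window size are hypothetical, not taken from gzip or xz. For well-mixed input this gives roughly one boundary per 2**bits bytes.

```python
def block_lengths(data, bits, window=4096):
    """Return the block lengths produced when `bits` low hash bits must be zero."""
    mask = (1 << bits) - 1
    lengths = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # keep the sum limited to the window
        if (rolling & mask) == 0:
            lengths.append(i + 1 - start)
            start = i + 1
    if start < len(data):
        lengths.append(len(data) - start)  # trailing partial block
    return lengths


def mean_block_length(data, bits):
    """Average block length for a given number of significant bits."""
    lengths = block_lengths(data, bits)
    return sum(lengths) / len(lengths)
```

    Each extra bit roughly doubles the average block, so matching the blocks to a dictionary of size D would mean choosing `bits` near log2(D). A rolling sum over a 4096-byte window only spreads well over about 12 bits, so a stronger hash would be needed for dictionary-sized blocks.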

    How about some -0r .. -6r .. -9r presets that divide the dictionary by four (unless extreme is set too) and turn on the reset markers to match (on average) the dictionary size? (--lzma2=dict=999 would still work of course.)

    That should be nice and simple and give us enough rope to do some testing … or get ourselves in trouble …

     
  • Lasse Collin
    2012-10-04

    You have understood it correctly, but I'm not so ready to say that it won't be useful. One will need to compromise between the compression ratio and the level of rsyncability. If the dictionary size is 8 MiB and the file being compressed is several gigabytes, it's not necessarily too bad if changing a few bytes in the uncompressed file requires transferring like ten megabytes with rsync. But sometimes one might prefer to keep the transfer sizes smaller and have lower compression ratio, which is when forgetting the dictionary at reset points is better.

    I don't remember if gzip's --rsyncable resets the dictionary at the rsyncability points. Not resetting it should be fine because the dictionary size in Deflate is only 32 KiB, so changing one uncompressed byte won't affect a large amount of data.

    Thanks for the ideas. I don't have any comments about them right now.

    I don't promise any schedule for any implementation. If you want to help (even if just to discuss without coding), it could be nice to chat on IRC (#tukaani on Freenode) some day.