#127 Support pigz along with gzip

Status: closed
Milestone: none
Labels: None
Progression: abandoned
Priority: 5
Updated: 2018-12-16
Created: 2009-12-04
Creator: funtoos
Private: No

pigz works about 2-3 times faster than gzip on my quad-core machines. Can you please add an option to use any program (with restrictions on arguments, like stdin/stdout and compression level) for compression? If not any user-specified program, can you please add pigz as an option when dar finds pigz present? Or, without a new option, prefer pigz if it's present?

Discussion

  • funtoos

    funtoos - 2009-12-19

    Any comments on this? This is holding me back on Dar compared to tar, which is what I use along with pigz. Dar takes up to 2-3 times longer than tar for the same backup. I have a fast machine with fast storage (SSD), but dar can't use any of those extra CPU cores. Tar is able to compress in parallel using pigz.

    PS: while I was writing this, I had an idea: why not link gzip to pigz and see how dar behaves.

     
  • funtoos

    funtoos - 2009-12-19

    Well, that didn't help! Seems like you are not calling the gzip binary... :(

     
  • Denis Corbin

    Denis Corbin - 2009-12-23

    Dar does not call gzip because, first, gzip can only compress/decompress files in a filesystem, while dar needs to compress and decompress an arbitrary-length stream of bytes in memory. This way, dar does not rely on temporary files, nor does it need to load a complete file into memory to compress it (which would fail for large files, due to the lack of virtual memory).

    Second, having dar call gzip for each file to compress would lead to very poor performance...

    Instead, dar uses zlib, which is the underlying library used by gzip and pigz. As such, dar is linked with this library (and optionally with some others: libbz2, openssl, ...) to form the dar binary. Using a compression routine is then just as simple as calling a function (a minimal sketch follows below).

    It would be nice if the pigz maintainer could provide a library implementing the algorithm used by pigz with the same API as zlib.

    However, the compression method used by pigz seems to be less efficient (in terms of compression ratio) than the usual gzip method, as the compression engine is reset every 128 KB and pigz has to add some bytes of garbage between each chunk of compressed data. Last, using all the CPUs of a system is possible only for compression, not for decompression.
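
    For context, a minimal sketch of the kind of in-memory zlib call this
    relies on -- deflateInit/deflate/deflateEnd are zlib's real streaming
    API; the single-buffer handling here is simplified for brevity:

        #include <zlib.h>
        #include <string.h>

        /* Compress a buffer entirely in memory with zlib's streaming API:
           no temporary file is needed, and a real caller would feed the
           stream in small pieces rather than one buffer, as libdar-style
           code does. */
        int compress_chunk(const unsigned char *in, size_t in_len,
                           unsigned char *out, size_t out_cap, size_t *out_len)
        {
            z_stream zs;
            memset(&zs, 0, sizeof(zs));
            if (deflateInit(&zs, 5) != Z_OK)     /* level 5, as with dar -z5 */
                return -1;

            zs.next_in   = (Bytef *)in;
            zs.avail_in  = (uInt)in_len;
            zs.next_out  = out;
            zs.avail_out = (uInt)out_cap;

            if (deflate(&zs, Z_FINISH) != Z_STREAM_END) { /* flush all output */
                deflateEnd(&zs);
                return -1;
            }
            *out_len = zs.total_out;
            deflateEnd(&zs);
            return 0;
        }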

     
  • funtoos

    funtoos - 2009-12-31

    I don't know the details, man. All I know is that for my root filesystem, dar creates a full archive that is 200 MB bigger than tar's (5.4 GB vs 5.2 GB) and takes 6 minutes to create that backup, whereas tar with pigz takes 2 minutes. I am using -z5 for compression with dar, and pigz's default level (which is 5 also, I think) with tar.

    I don't mind slowdowns during restore, because hopefully I will never need to. Restoring a specific file or files is much, much faster in DAR compared to TAR anyway. That has saved my ass on many big backups, many a time. And I love DAR for it!

    But full backups cost downtime man! I can use the speedup. See if you can help me! This won't stop me from using DAR...:-)

     
  • Denis Corbin

    Denis Corbin - 2010-01-03

    What makes dar faster than tar when it comes to restoring a file is that dar does not compress the whole archive at once, but compresses it file by file, so it does not have to uncompress the whole archive just to restore a file from the middle of it. However, this has a drawback: the compression ratio is not as good as what can be achieved with tar, especially when compressing a lot of small files. There is already the -m option that lets you skip compression for small files below a threshold size (which defaults to 100 bytes). You probably also know that there are -Z and -Y options to define which files to compress and which files to not even try to compress (.gz, .bz2 and some .mp3 files do not become smaller when compressed); this saves CPU cycles (an illustrative invocation follows below).

    However, as you probably already make use of these options to speed up compression time, I leave your feature request open, expecting/looking for a way to interact with pigz by means of a pigz library.
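
    For readers landing here later, an illustrative invocation combining
    these options might look like the following (the threshold and masks
    are examples, not recommendations -- see dar's man page for the exact
    semantics of -m, -Z and -Y):

        dar -c backup -R / -z5 -m 256 -Z "*.gz" -Z "*.bz2" -Z "*.mp3"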

     
  • funtoos

    funtoos - 2010-01-03

    If we are doing file-by-file compression, it would be easier to provide a way to invoke an external program, like tar does. It would be a trivial fix. Otherwise, we will have to get hold of the pigz maintainer.

    Also, if we are doing file by file, isn't it easier to compress ${NUMCPU}+1 files together?

    Yeah, I have put in filters for every compressed extension that I know of...:-)

     
  • funtoos

    funtoos - 2010-01-04

    I was looking at the function parallel_compress in pigz.c. Although it uses globals, I think it can be adapted for use in the compressor. Let me see if I can get a working patch for you. You wouldn't mind including the whole of pigz.c in another cpp module if needed, would you?

     
  • funtoos

    funtoos - 2010-01-04

    And ignore my comments about an external program and multiple files... I hadn't looked at your or pigz's code at that time... :-)

     
  • Denis Corbin

    Denis Corbin - 2010-01-04

    Well,

    the first problem in having pigz code copied into libdar is that it does not follow the programming principle of "one source code, one maintainer". In particular, if pigz receives bug fixes in the future (most probable), these will not be propagated to libdar's copy of that same code. The second point is that libdar is a thread-safe library; having global symbols would break this major feature (relied upon by external GUIs). libdar is used by dar, kdar, darGUI, sarab and several other external programs. Fortunately, none of them has a copy of libdar within its source code; this lets me bring bug fixes and new features to libdar with minimal changes for these external programs.

    The correct approach is to have the pigz team provide a thread-safe library with a zlib-like interface (an API), and to have libdar use it to compress/uncompress a stream of bytes (a purely hypothetical sketch of such an interface follows below). This is what dar needs in order to put several files in an archive without relying on any temporary file (another feature that lets dar work on partially read-only systems, or on systems with little or no free space at all).

    Regards,
    Denis.
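
    To make the shape of such an interface concrete, here is a purely
    hypothetical sketch -- none of these names exist in pigz or zlib;
    they only illustrate what a thread-safe, zlib-like parallel API
    could look like:

        #include <stddef.h>

        /* Hypothetical "libpigz" interface: all state lives behind an
           opaque per-stream handle, so there are no global symbols and
           several streams can run concurrently. */
        typedef struct pz_stream_s pz_stream;

        /* Create a deflate stream using up to n_threads worker threads. */
        pz_stream *pz_deflate_init(int level, int n_threads);

        /* Consume in[0..in_len) and append compressed bytes to out;
           pass finish != 0 on the last call, mirroring zlib's Z_FINISH. */
        int pz_deflate(pz_stream *s,
                       const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap, size_t *out_len,
                       int finish);

        void pz_deflate_end(pz_stream *s);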

     
  • funtoos

    funtoos - 2010-01-04

    I agree with you on that being the cleaner solution. But there is no way on the website to get hold of a dev mailing list.

    I will see if I can get a response from madler@alumni.caltech.edu

    The globals should be fine as long as only one thread accesses them. I think he used them instead of passing arguments around to local functions. Not a good programming style, because it's prone to errors and hard to maintain: every time you add something, you need to remember not to touch any of these variables outside of that single thread. The code is not a large number of LOCs, so I thought maybe we could just roll it into libdar itself and maintain it. But I think you are right. It should really be a separate library maintained by Mark. That way his work gets more credit and possibly spreads to other dar-like programs as well.

    Let me see what response I get from him.

     
  • Mark Adler

    Mark Adler - 2010-01-04

    I would consider developing a parallel compression library. Having an application helps, since it is much easier to develop a useful interface if there's a real user to bounce it off of.

    The inefficiency of breaking the deflate stream is negligible. The compression engine is not reset -- at each boundary the compressor still has the previous 32K of history to use, just like it would without the boundary. Adding four or five bytes into the compressed data every 128K bytes of uncompressed data is also negligible. (A sketch of this scheme follows below.)

    There is a development mailing list noted on the zlib.net page. Anyone can join it at http://zlib.net/mailman/listinfo/zlib-devel_madler.net .
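
    What Mark describes can be sketched with zlib's documented API (this
    is an illustration of the scheme, not pigz's actual code): each 128 KB
    chunk gets its own raw-deflate stream, primed with the previous
    chunk's last 32 KB via deflateSetDictionary, and is terminated with
    Z_SYNC_FLUSH, which emits the few extra bytes that join the pieces.

        #include <zlib.h>
        #include <string.h>

        /* Compress one chunk as an independent raw-deflate stream. Since
           each chunk has its own z_stream, chunks can be compressed by
           different threads and the outputs concatenated in order. */
        static int deflate_chunk(const unsigned char *prev_tail, size_t tail_len,
                                 const unsigned char *in, size_t in_len, int last,
                                 unsigned char *out, size_t out_cap,
                                 size_t *out_len)
        {
            z_stream zs;
            memset(&zs, 0, sizeof(zs));
            /* windowBits = -15: raw deflate, so chunks concatenate cleanly */
            if (deflateInit2(&zs, 6, Z_DEFLATED, -15, 8,
                             Z_DEFAULT_STRATEGY) != Z_OK)
                return -1;
            if (prev_tail && tail_len > 0)      /* previous 32K of history */
                deflateSetDictionary(&zs, prev_tail, (uInt)tail_len);

            zs.next_in   = (Bytef *)in;
            zs.avail_in  = (uInt)in_len;
            zs.next_out  = out;
            zs.avail_out = (uInt)out_cap;

            /* Z_SYNC_FLUSH ends the chunk on a byte boundary with an empty
               stored block -- the "four or five bytes" mentioned above. */
            int ret = deflate(&zs, last ? Z_FINISH : Z_SYNC_FLUSH);
            int ok  = last ? (ret == Z_STREAM_END) : (ret == Z_OK);
            *out_len = zs.total_out;
            deflateEnd(&zs);
            return ok ? 0 : -1;
        }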

     
  • funtoos

    funtoos - 2010-01-04

    Mark, thanks for responding here. I appreciate it. So, how do we go about starting this off? We will need some exchange of API definitions between the dar and pigz projects.

    Do we take this discussion over at zlib-devel_madler or keep it here? I think we definitely need to agree on the interface, so Denis can specify what he needs. I think we keep it here until we do that.

     
  • Mark Adler

    Mark Adler - 2010-01-04

    It would be best to discuss on the zlib-devel list. Denis can suggest something, and we take it from there.

     
  • funtoos

    funtoos - 2010-01-05

    I subscribed to the list but it doesn't let me view archives with my login and password. What's up with that?

    I thought, since we all had an account here, it would be much simpler.

    Denis, are u on that list?

    Mark, I did not realize you were the zlib guy...:-)

     
  • Mark Adler

    Mark Adler - 2010-01-05

    I hadn't approved you yet. Now you're approved.

    Having the discussion here wouldn't be convenient for me, and since I'd be writing the library, I get to dictate the terms. :-)

    Also the other people on that list could contribute very constructively to such a discussion.

     
  • funtoos

    funtoos - 2010-01-05

    Hehe...yeah, you definitely do.

    We need to get Denis over there as well. He will be the consumer, so he will need to start with what he needs from the API.

     
  • Denis Corbin

    Denis Corbin - 2010-09-27

    Hello,

    Normally I get notified when someone adds comments to a tracker, but I did not. It is a question on the dar-discussion mailing list that brought me here again, and only now do I notice that there have been very interesting exchanges here, some months ago.

    Well, I will go subscribe to the zlib-devel list.

    Sorry for the delay,

    Regards,
    Denis.

     
  • Torsten Bronger

    Torsten Bronger - 2013-06-18

    Has there been any progress on this? I cannot even find follow-ups to your (Denis) post on the pigz mailing list. This doesn't look promising.

     
  • Denis Corbin

    Denis Corbin - 2015-10-18

    I think there is another path to achieve parallelism in compression with dar. Simply put: compress by blocks, so the additional memory requirement stays acceptable. This would be just another implementation, with no additional library requirement... to be considered and tested to see whether the speed gain is worthwhile. (A sketch of the idea follows below.)
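
    A minimal sketch of that idea, assuming plain zlib plus POSIX threads
    (not dar's actual implementation): cut the stream into fixed-size
    blocks, hand each block to a worker with its own private zlib state,
    then write the compressed blocks out in order, each prefixed with its
    length so decompression can find the boundaries. Unlike the pigz
    scheme sketched earlier, fully independent blocks lose the 32K of
    history at each boundary, which costs a little compression ratio.

        #include <zlib.h>
        #include <pthread.h>
        #include <stdlib.h>

        /* One independently compressed block; each worker owns its zlib
           state, so nothing is shared between threads. */
        struct job {
            const unsigned char *in;
            size_t in_len;
            unsigned char *out;   /* caller allocates compressBound(in_len) */
            size_t out_len;
            int ok;
        };

        static void *worker(void *arg)
        {
            struct job *j = arg;
            uLongf cap = compressBound(j->in_len);
            /* compress2() wraps a private z_stream: thread-safe by design.
               Each block becomes a standalone zlib stream, so the reader
               can apply uncompress() block by block. */
            j->ok = (compress2(j->out, &cap, j->in, j->in_len, 5) == Z_OK);
            j->out_len = cap;
            return NULL;
        }

        /* Compress n blocks in parallel and wait for all of them. */
        int compress_blocks(struct job *jobs, size_t n)
        {
            pthread_t *tid = malloc(n * sizeof(*tid));
            if (!tid)
                return -1;
            for (size_t i = 0; i < n; i++)
                pthread_create(&tid[i], NULL, worker, &jobs[i]);
            int ok = 1;
            for (size_t i = 0; i < n; i++) {
                pthread_join(tid[i], NULL);
                ok = ok && jobs[i].ok;
            }
            free(tid);
            return ok ? 0 : -1;
        }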

     
  • Denis Corbin

    Denis Corbin - 2015-10-18
    • Priority: 3 --> 5
• Milestone: --> none
     
  • Denis Corbin

    Denis Corbin - 2018-12-16
    • status: open --> closed
    • Progression: requested --> abandoned
     
