Dictionary and Word Size

2006-07-04
2012-12-08
  • Nobody/Anonymous

    Hello there. I am an early user of 7-Zip and I don't know what the "Dictionary Size" and "Word Size" fields mean. Could someone please assist?

     
    • Nobody/Anonymous

      Hi!

      The main algorithm used in 7-Zip is LZMA, which is derived from the Lempel-Ziv scheme.

      Algorithms following this scheme build a dictionary of previously seen data so that later identical matches can be replaced by a reference (deflate, LZMA) or an index (LZW).

      The bigger the dictionary, the better the compression you can expect, because more matches will be considered. Keep in mind that it will require more memory too.
      The word size sets the maximum length of the patterns that can be matched. A larger value may result in a better compression ratio, but compression is slightly slower (afaik, it has no impact on decompression speed).
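
      As a rough illustration (not the actual LZMA code, just a toy sketch of the idea), here is how a Lempel-Ziv match finder could use those two settings: the dictionary size limits how far back it may look for a match, and the word size caps how long a match may be.

          # Toy LZ77-style match finder (illustrative only, not LZMA itself).
          # dict_size = how far back in the already-seen data we may search
          # word_size = maximum length of a match we are willing to emit
          def lz_tokens(data, dict_size=1 << 16, word_size=64):
              tokens = []                            # (offset, length) pairs or single literal bytes
              pos = 0
              while pos < len(data):
                  best_len, best_off = 0, 0
                  start = max(0, pos - dict_size)    # the "dictionary" window
                  for cand in range(start, pos):     # naive scan; real coders use hash chains or trees
                      length = 0
                      while (length < word_size and
                             pos + length < len(data) and
                             data[cand + length] == data[pos + length]):
                          length += 1
                      if length > best_len:
                          best_len, best_off = length, pos - cand
                  if best_len >= 3:                  # emit a back-reference
                      tokens.append((best_off, best_len))
                      pos += best_len
                  else:                              # emit a literal byte
                      tokens.append(data[pos])
                      pos += 1
              return tokens

      A bigger dict_size gives the search more candidates (hence more memory and time for the compressor), while a bigger word_size only raises the cap on the match length.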

      Try a Google search for "Lempel-Ziv" to get more info.

      --

      I hope I was clear; English is not my mother tongue.

      --

      Arkanosis

       
    • Nobody/Anonymous

      You say word size is the size of the pattern, but what units is it measured in? None are given in the program.

      BB

       
    • Nobody/Anonymous

      I suppose that the word size is in bytes.

      In a text file, for example, the algorithm will match <word_size_or_less>-character patterns.

      Finding identical 255-character patterns is very unlikely in most files, so compression has little chance of being improved, but since it has no impact on decompression speed, I always set the word size to 255.
      Afaik, memory usage is not affected by the word size, so you should not really worry about it.

      I think LZMA uses a special case to compress long runs of a single symbol, but I'm not sure about this. It may correspond to the number of fast bits (not available in the 7-Zip GUI).

      --

      Arkanosis

       
    • Guillermo Gabrielli

      >I think LZMA uses a special case to compress long runs of a single symbol, but I'm not sure about this. It may correspond to the number of fast bits (not available in the 7-Zip GUI).

      Word Size is the number of "Fast Bytes"(fb) for LZMA/Deflate and the "Model Order"(o) for PPMd.
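
      For what it's worth, the same two knobs are exposed by Python's built-in lzma module (which uses liblzma rather than the 7-Zip GUI): dict_size corresponds to the Dictionary Size, and nice_len is liblzma's rough equivalent of the fast-bytes / Word Size setting. A minimal sketch; the file name and the values are only placeholders:

          import lzma

          data = open("example.bin", "rb").read()   # any file you want to experiment on

          # Filter chain roughly matching GUI settings of
          # Dictionary Size = 64 MB and Word Size (fast bytes) = 273.
          filters = [{
              "id": lzma.FILTER_LZMA2,
              "dict_size": 64 * 1024 * 1024,   # dictionary size, in bytes
              "nice_len": 273,                 # liblzma's counterpart of the fast-bytes setting
          }]

          packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
          print(len(data), "->", len(packed), "bytes")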

       
    • Nobody/Anonymous

      Word size matters a lot for decompression speed. For Lempel-Ziv-based algorithms, longer matches mean fewer codewords and less copying, so decompression speed is higher. For PPM-based algorithms, higher orders mean more statistics to check, so decompression speed is lower. Look at http://uclc.info/decompression_speed.htm , especially at RAR: options m1, m2 and m3 use Lempel-Ziv, while options m4 and m5 use LZ or PPM - in this case it's PPM. You can see that decompression speed for the particular modes ranks in this order: m3 > m2 > m1 > m5 > m4 (that means: long-pattern LZ > medium-pattern LZ > short-pattern LZ > low-order PPM > high-order PPM).

      (In addition) better compression of a file usually means faster decompression than for other files (with the same algorithm). Try compressing and then decompressing two huge files: file A is a bunch of random bytes, file B is filled with zeroes. You'll see that file B decompresses much faster (at least with LZ- or PPM-based algorithms).

      Too big a memory limit also decreases decompression speed. A window size that fits in the processor cache gives the best decompression speed.

      And in LZ-based algorithms (especially in LZMA), more bytes encoded as literals (non-matching bytes) means slower decompression. The difference can sometimes be enormous - one file may decompress twenty times faster than another. Various filters can help the compressor by preprocessing the file to produce more repeating patterns (or make them appear closer together) and, as a result, improve both compression ratio and decompression speed.

       
    • Nobody/Anonymous

      >Word Size is the number of "Fast Bytes"(fb) for LZMA/Deflate and the "Model Order"(o) for PPMd

      Thanks, now I know what fb is :p
      So for LZMA the unit is "bytes", but for PPMd it must be "bits" (I rarely use PPMd, but I believe the word size can go up to 32... the model order must be the word size divided by 8... to be verified).

      >For Lempel-Ziv-based algorithms, longer matches mean fewer codewords and less copying, so decompression speed is higher

      --

      Sure, I didn't think of that :)
      So, in fact the time actually spent by the algorithm is exactly the same (I mean without taking I/O into consideration, am I right?), and the time spent writing the decompressed files is exactly the same, but we save some time reading from a smaller archive.

      But it's another reason to use a word size of 255 :)

      --

      Arkanosis

       
    • Nobody/Anonymous

      The word size specified in the Add to Archive dialog means the _maximum_ pattern length or model order. The model order for every PPM scheme I know of is measured in bytes. So if PPMd can't find any context of length max, it tries to find an appropriate context of length max-1, and so on down to order -1 (a special order with a flat distribution - bytes are transferred uncompressed). And if LZMA can't find a repeating pattern of length max, it tries to find a pattern of length max-1. By the way: in the real world this is done in reverse, i.e. PPMd starts from order 0 and LZMA starts from patterns of length min (I don't know how big min is, I guess it's 2 or 3 bytes).
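
      A very rough sketch of that fallback idea (just the context lookup, with no escape probabilities and no arithmetic coding; the function names and the max_order value are only placeholders):

          from collections import defaultdict

          # Toy illustration of PPM-style order fallback: try the longest context
          # first, then progressively shorter ones, ending at order -1 (flat).
          def predict(history, stats, max_order=8):
              for order in range(max_order, 0, -1):        # max, max-1, ..., 1
                  ctx = history[-order:]
                  if len(ctx) == order and ctx in stats:
                      return stats[ctx]                    # counts seen after this context
              if b"" in stats:
                  return stats[b""]                        # order 0: global symbol counts
              return {s: 1 for s in range(256)}            # order -1: flat distribution

          def update(history, symbol, stats, max_order=8):
              # Record the symbol under every context length up to max_order.
              for order in range(0, min(max_order, len(history)) + 1):
                  ctx = history[-order:] if order else b""
                  stats.setdefault(ctx, defaultdict(int))[symbol] += 1

      Real PPM implementations blend the orders with escape symbols instead of simply taking the first context that exists.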

       
    • Nobody/Anonymous

      So I can safely increase the fast bytes without risking a reduction in compression ratio?

       
    • Nobody/Anonymous

      In most cases yes, but there is no way to predict the effect of using longer matches on an arbitrary file: you must try it and see.
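
      In that "try it and see" spirit, here is a small sketch using Python's lzma module to compare a few word-size (nice_len / fast bytes) values; the file name and the candidate values are just placeholders:

          import lzma

          data = open("testfile.bin", "rb").read()     # substitute any file you care about

          for fb in (16, 32, 64, 128, 273):            # candidate Word Size / fast-bytes values
              filters = [{"id": lzma.FILTER_LZMA2, "dict_size": 1 << 24, "nice_len": fb}]
              size = len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))
              print(f"nice_len={fb:3d}: {size} bytes")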

      Ddot

       
    • xephael - 2006-07-09

      I've concluded that it takes more memory to use a larger dictionary for compression. Do you need the same memory requirements for decompression?

      I would presume you wouldn't.

       
    • Nobody/Anonymous

      No, not if you use LZMA, which is asymmetric.
      Memory is mainly used by the match finder, which isn't needed for decompression.
      LZ77 algorithms (like LZMA) only require memory for the "sliding window" history, far less than for the whole match finder.
      The size of the window increases with the size of the dictionary, but remains acceptable (66 MB for decompression versus 900 MB for compression, for example).
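
      As a sketch of why that is: to decode the back-references, the decompressor only ever copies from the output it has already produced inside the window, so there is no match finder at all. A toy decoder for the (offset, length) / literal tokens of the match-finder sketch earlier in the thread:

          # Toy decoder for the (offset, length) / literal tokens produced above.
          # It needs no match finder: the already-decoded output is the only
          # "dictionary" it ever touches, which is why decompression is cheap.
          def lz_decode(tokens):
              out = bytearray()
              for tok in tokens:
                  if isinstance(tok, tuple):       # (offset, length) back-reference
                      offset, length = tok
                      for _ in range(length):      # byte-by-byte so overlapping copies work
                          out.append(out[-offset])
                  else:                            # literal byte
                      out.append(tok)
              return bytes(out)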

      If you use PPMdH, you'll need the same amount of memory for both compression and decompression, because arithmetic coders are symmetric.
      I think you shouldn't use PPMdH, as LZMA is generally the better algorithm (I mean, in terms of efficiency). It's far faster for compression, and much faster still for decompression, and even if PPMdH sometimes compresses better (generally on text files), the time cost should dissuade you from using it.

      --

      @nobody >
      I checked the maximum word size for PPMdH, which is 32.
      I cannot imagine PPMdH using an order-32 context model. PAQ itself doesn't use such a model, because there is no reason to (to stay in memory, big contexts have to be reduced to a hash, which means less precision and less compression).
      Moreover, PPMdH is one of the fastest arithmetic coders, so I believe it must be an order-8,4,3,2,1,0 model or something like that.
      To be verified; I really don't have time to read the badly written PPMdH code (with all the respect I have for Mr Shkarin).

      So, I maintain that the word size is in bytes for LZMA and in bits for PPMdH.
      Discussion is open :)

      --
      Arkanosis

       
    • Nobody/Anonymous

      PPMdH uses information inheritance. I don't know exactly what it is, but it acts like blending high orders with low orders (only faster). The word size for PPMdH *is in bytes*. And there's no hashing (why would there be?).

      Recently I created a coder with a fixed 2-byte order. It's here: http://peerfactor.fr/d.jsp?f=TarsaLZP.1151738422.zip . As you can see, it performs almost exactly like PPMdH with order 2, so the PPMdH word size is in bytes. PPMonstr uses up to a 128-byte order. PPMZ uses an infinite order. It can be said that BWT-based compressors all use infinite-order modelling. And none of them use hashing, because it would corrupt files.

      Have you used PPMdJ from Shkarin's website? It's a command-line compressor that shows you its memory usage, and memory usage grows while compressing. That means new contexts are added dynamically, without hashtables or anything like that.

      Usual PPMs don't store entire contexts in each memory cell. Instead, they store only the characters actually used in a particular context (which saves memory - why store the full 256-character alphabet if only 5 characters were used?) plus a link to the context one byte shorter as a parent context (this preserves continuity). This approach lets you access unimaginably high orders without using huge amounts of memory, because you don't store information about contexts that haven't been used yet. My simple exponential order-2 coder uses a fixed 32 MB of memory (65536 contexts * 256-byte alphabet * 2 bytes (a word) of frequency count), while PPMdJ with order 2 on small text files (below 1 MB) usually uses around 0.1 MB of memory. You see the difference.
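
      A minimal sketch of that idea (nothing like Shkarin's actual code; the names are made up, and for brevity contexts are looked up through an exact Python dict rather than by walking child pointers):

          class ContextNode:
              """One PPM context: only the symbols seen here, plus a suffix link."""
              __slots__ = ("counts", "parent")
              def __init__(self, parent=None):
                  self.counts = {}       # symbol -> frequency; grows only as symbols appear
                  self.parent = parent   # the context one byte shorter (the "parent")

          def build_contexts(data, max_order=4):
              # Nodes are created lazily, so memory grows with the contexts actually
              # seen in the input, not with 256**order.
              root = ContextNode()                 # the order-0 context
              nodes = {b"": root}
              for i in range(len(data)):
                  sym = data[i]
                  root.counts[sym] = root.counts.get(sym, 0) + 1
                  for order in range(1, min(max_order, i) + 1):
                      ctx = bytes(data[i - order:i])
                      if ctx not in nodes:
                          # The one-byte-shorter context was handled just before,
                          # so the parent link always exists already.
                          nodes[ctx] = ContextNode(parent=nodes[ctx[1:]])
                      node = nodes[ctx]
                      node.counts[sym] = node.counts.get(sym, 0) + 1
              return nodes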

      Another approach to PPM is to use a limited-order BWT. It fakes PPM behaviour, but it's much simpler. How is it done? If we use, e.g., a 150-byte order, then we do a block sort in which we compare only the first 150 bytes of each row, and then we use the techniques of a usual BWT compressor (like bzip or uhbc) to compress the transformed data.
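
      A sketch of that forward transform under those assumptions (a real implementation would use a proper suffix-sorting algorithm instead of materialising every rotation, and a full compressor would also keep the index needed to invert the transform):

          def limited_order_bwt(block, k=150):
              # Sort all rotations of the block, but compare only their first k
              # bytes - the usual BWT last-column output, with a capped "order".
              n = len(block)
              doubled = block + block                    # every rotation is then a plain slice
              order = sorted(range(n), key=lambda i: doubled[i:i + k])
              return bytes(doubled[i + n - 1] for i in order)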

      I suppose you only know exponential PPMs. There, an n-bit-context PPM with two bytes (a machine word) per symbol for storing frequencies uses 2^n*256*2 bytes of memory. So:
      a 1-bit order uses 2^1*256*2 = 1 kilobyte
      an 8-bit order uses 128 kilobytes
      a 16-bit order uses 32 megabytes
      a 20-bit order uses 512 megabytes
      a 30-bit order uses 512 gigabytes
      a 32-bit order uses 2 terabytes
      So exponential PPM is definitely not used in PPMdJ.
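
      The same back-of-the-envelope formula as a tiny helper, in case you want to check other orders (the function name is made up):

          def exponential_ppm_memory_bytes(n_bits, alphabet=256, counter_bytes=2):
              # 2**n_bits possible contexts, each holding a full alphabet of counters.
              return (2 ** n_bits) * alphabet * counter_bytes

          print(exponential_ppm_memory_bytes(16) // 2**20, "MB")   # -> 32 MB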

      If you still think the PPMdJ word size is in bits, then you should definitely contact Mr Shkarin :)

      PPMd is usually faster on highly redundant data and LZMA is faster on less compressible data (for compression, of course).

       
    • Nobody/Anonymous

      Well, it seems you're right. PPMonstr uses order 128.

      This must be possible with dynamic contexts; now I need to study that :)
      I was actually thinking of exponential PPMs, since the only arith source I've tried to understand is PAQ's, which stores contexts statically. That means huge memory usage and forces the use of a hash of the contexts to fit in memory (only the contexts are hashed, so there is no corruption, only a loss of precision).
      Thank you for opening my eyes :) I'll finally try to read the PPMdJ source code...

      --

      Result of the discussion:
      Word size is *always* in bytes.

      --

      Sorry, I'm not able to understand your code :) I'm really not comfortable with ASM.

      --
      Arkanosis

       
