Dictionary and Word Size

2006-07-04
2012-12-08
  • Nobody/Anonymous

    Hello there. I am an early user of 7-Zip and I don't know what the "Dictionary Size" and "Word Size" fields mean. Could someone please assist?

     
    • Nobody/Anonymous

      Hi!

      The main algorithm used in 7-Zip is LZMA, which is derived from the Lempel-Ziv scheme.

      Algorithms following this scheme build a dictionary of previously seen data so that later identical matches can be replaced by a reference (deflate, LZMA) or an index (LZW).

      The bigger the dictionary, the better the compression you can expect, because more matches will be considered. Keep in mind that it will require more memory too.
      The word size sets the maximum length of the patterns that can be matched. A larger value may result in a better compression ratio, but compression is slightly slower (afaik, it has no impact on decompression speed).
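
      As a rough illustration (not the actual LZMA code, just a toy sketch of the idea), here is how a Lempel-Ziv match finder could use those two settings: the dictionary size limits how far back it may look for a match, and the word size caps how long a match may be.

          # Toy LZ77-style match finder (illustrative only, not LZMA itself).
          # dict_size = how far back in the already-seen data we may search
          # word_size = maximum length of a match we are willing to emit
          def lz_tokens(data, dict_size=1 << 16, word_size=64):
              tokens = []                            # (offset, length) pairs or single literal bytes
              pos = 0
              while pos < len(data):
                  best_len, best_off = 0, 0
                  start = max(0, pos - dict_size)    # the "dictionary" window
                  for cand in range(start, pos):     # naive scan; real coders use hash chains or trees
                      length = 0
                      while (length < word_size and
                             pos + length < len(data) and
                             data[cand + length] == data[pos + length]):
                          length += 1
                      if length > best_len:
                          best_len, best_off = length, pos - cand
                  if best_len >= 3:                  # emit a back-reference
                      tokens.append((best_off, best_len))
                      pos += best_len
                  else:                              # emit a literal byte
                      tokens.append(data[pos])
                      pos += 1
              return tokens

      A bigger dict_size gives the search more candidates (hence more memory and time for the compressor), while a bigger word_size only raises the cap on the match length.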

      Try a Google search for "Lempel-Ziv" to get more info.

      --

      I hope I was clear; English is not my mother tongue.

      --

      Arkanosis

       
    • Nobody/Anonymous

      You say word size is the size of the pattern, but what units is it measured in? None are given in the program.

      BB

       
    • Nobody/Anonymous

      I suppose that the word size is in bytes.

      In a text file, for example, the algorithm will match <word_size_or_less>-character patterns.

      Finding identical 255-character patterns is very unlikely in most files, so compression has little chance of being improved, but since it has no impact on decompression speed, I always set the word size to 255.
      Afaik, memory usage is not affected by the word size, so you should not really worry about it.

      I think LZMA uses a special case to compress long runs of a single symbol, but I'm not sure about this. It may correspond to the number of fast bits (not available in the 7-Zip GUI).

      --

      Arkanosis

       
    • Guillermo Gabrielli

      >I think LZMA uses a special case to compress long runs of a single symbol, but I'm not sure about this. It may correspond to the number of fast bits (not available in the 7-Zip GUI).

      Word Size is the number of "Fast Bytes"(fb) for LZMA/Deflate and the "Model Order"(o) for PPMd.
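
      For what it's worth, the same two knobs are exposed by Python's built-in lzma module (which uses liblzma rather than the 7-Zip GUI): dict_size corresponds to the Dictionary Size, and nice_len is liblzma's rough equivalent of the fast-bytes / Word Size setting. A minimal sketch; the file name and the values are only placeholders:

          import lzma

          data = open("example.bin", "rb").read()   # any file you want to experiment on

          # Filter chain roughly matching GUI settings of
          # Dictionary Size = 64 MB and Word Size (fast bytes) = 273.
          filters = [{
              "id": lzma.FILTER_LZMA2,
              "dict_size": 64 * 1024 * 1024,   # dictionary size, in bytes
              "nice_len": 273,                 # liblzma's counterpart of the fast-bytes setting
          }]

          packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
          print(len(data), "->", len(packed), "bytes")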

       
    • Nobody/Anonymous

      Word size matters a lot for decompression speed. For Lempel-Ziv-based algorithms, longer matches mean fewer codewords and less copying, so decompression speed is higher. For PPM-based algorithms, higher orders mean more statistics to check, so decompression speed is lower. Look at http://uclc.info/decompression_speed.htm , especially at RAR: options m1, m2 and m3 use Lempel-Ziv, while options m4 and m5 use LZ or PPM - in this case it's PPM. You can see that decompression speed for the particular modes ranks in this order: m3 > m2 > m1 > m5 > m4 (that means: long-pattern LZ > medium-pattern LZ > short-pattern LZ > low-order PPM > high-order PPM).

      (In addition) better compression of a file usually means faster decompression than for other files (with the same algorithm). Try compressing and then decompressing two huge files: file A is a bunch of random bytes, file B is filled with zeroes. You'll see that file B decompresses much faster (at least with LZ- or PPM-based algorithms).

      Too big a memory limit also decreases decompression speed. A window size that fits in the processor cache gives the best decompression speed.

      And in LZ-based algorithms (especially in LZMA), more bytes encoded as literals (non-matching bytes) means slower decompression. The difference can sometimes be enormous - one file may decompress twenty times faster than another. Various filters can help the compressor by preprocessing the file to produce more repeating patterns (or make them appear closer together) and, as a result, improve both compression ratio and decompression speed.

       
    • Nobody/Anonymous

      >Word Size is the number of "Fast Bytes"(fb) for LZMA/Deflate and the "Model Order"(o) for PPMd

      Thanks, now I know what fb is :p
      So for LZMA the unit is "bytes", but for PPMd it must be "bits" (I rarely use PPMd, but I believe the word size can go up to 32... the model order must be the word size divided by 8... to be verified).

      >For Lempel-Ziv-based algorithms, longer matches mean fewer codewords and less copying, so decompression speed is higher

      --

      Sure, I didn't think of that :)
      So, in fact the time actually spent by the algorithm is exactly the same (I mean without taking I/O into consideration, am I right?), and the time spent writing the decompressed files is exactly the same, but we save some time reading from a smaller archive.

      But it's another reason to use a word size of 255 :)

      --

      Arkanosis

       
    • Nobody/Anonymous

      The word size specified in the Add to Archive dialog means the _maximum_ pattern length or model order. The model order for every PPM scheme I know of is measured in bytes. So if PPMd can't find any context of length max, it tries to find an appropriate context of length max-1, and so on down to order -1 (a special order with a flat distribution - bytes are transferred uncompressed). And if LZMA can't find a repeating pattern of length max, it tries to find a pattern of length max-1. By the way: in the real world this is done in reverse, i.e. PPMd starts from order 0 and LZMA starts from patterns of length min (I don't know how big min is, I guess it's 2 or 3 bytes).
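
      A very rough sketch of that fallback idea (just the context lookup, with no escape probabilities and no arithmetic coding; the function names and the max_order value are only placeholders):

          from collections import defaultdict

          # Toy illustration of PPM-style order fallback: try the longest context
          # first, then progressively shorter ones, ending at order -1 (flat).
          def predict(history, stats, max_order=8):
              for order in range(max_order, 0, -1):        # max, max-1, ..., 1
                  ctx = history[-order:]
                  if len(ctx) == order and ctx in stats:
                      return stats[ctx]                    # counts seen after this context
              if b"" in stats:
                  return stats[b""]                        # order 0: global symbol counts
              return {s: 1 for s in range(256)}            # order -1: flat distribution

          def update(history, symbol, stats, max_order=8):
              # Record the symbol under every context length up to max_order.
              for order in range(0, min(max_order, len(history)) + 1):
                  ctx = history[-order:] if order else b""
                  stats.setdefault(ctx, defaultdict(int))[symbol] += 1

      Real PPM implementations blend the orders with escape symbols instead of simply taking the first context that exists.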

       
    • Nobody/Anonymous

      So I can safely increase the fast bytes without risking a reduction in compression ratio?

       
    • Nobody/Anonymous

      In most cases yes, but there is no way to predict the effect of using longer matches on an arbitrary file: you must try it and see.
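
      In that "try it and see" spirit, here is a small sketch using Python's lzma module to compare a few word-size (nice_len / fast bytes) values; the file name and the candidate values are just placeholders:

          import lzma

          data = open("testfile.bin", "rb").read()     # substitute any file you care about

          for fb in (16, 32, 64, 128, 273):            # candidate Word Size / fast-bytes values
              filters = [{"id": lzma.FILTER_LZMA2, "dict_size": 1 << 24, "nice_len": fb}]
              size = len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))
              print(f"nice_len={fb:3d}: {size} bytes")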

      Ddot

       
    • xephael - 2006-07-09

      I've concluded that it takes more memory to use a larger dictionary for compression. Do you need the same memory requirements for decompression?

      I would presume you wouldn't.

       
    • Nobody/Anonymous

      No, not if you use LZMA, which is asymmetric.
      Memory is mainly used by the match finder, which isn't needed for decompression.
      LZ77 algorithms (like LZMA) only require memory for the "sliding window" history, far less than for the whole match finder.
      The size of the window increases with the size of the dictionary, but remains acceptable (66 MB for decompression versus 900 MB for compression, for example).
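
      As a sketch of why that is: to decode the back-references, the decompressor only ever copies from the output it has already produced inside the window, so there is no match finder at all. A toy decoder for the (offset, length) / literal tokens of the match-finder sketch earlier in the thread:

          # Toy decoder for the (offset, length) / literal tokens produced above.
          # It needs no match finder: the already-decoded output is the only
          # "dictionary" it ever touches, which is why decompression is cheap.
          def lz_decode(tokens):
              out = bytearray()
              for tok in tokens:
                  if isinstance(tok, tuple):       # (offset, length) back-reference
                      offset, length = tok
                      for _ in range(length):      # byte-by-byte so overlapping copies work
                          out.append(out[-offset])
                  else:                            # literal byte
                      out.append(tok)
              return bytes(out)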

      If you use PPMdH, you'll need the same amount of memory for both compression and decompression, because arithmetic coders are symmetric.
      I think you shouldn't use PPMdH, as LZMA is generally the better algorithm (I mean, in terms of efficiency). It's far faster for compression, and much faster still for decompression, and even if PPMdH sometimes compresses better (generally on text files), the time cost should dissuade you from using it.

      --

      @nobody >
      I checked the maximum word size for PPMdH, which is 32.
      I cannot imagine PPMdH using an order-32 context model. PAQ itself doesn't use such a model, because there is no reason to (to stay in memory, big contexts have to be reduced to a hash, which means less precision and less compression).
      Moreover, PPMdH is one of the fastest arithmetic coders, so I believe it must be an order-8,4,3,2,1,0 model or something like that.
      To be verified; I really don't have time to read the badly written PPMdH code (with all the respect I have for Mr Shkarin).

      So, I maintain that the word size is in bytes for LZMA and in bits for PPMdH.
      Discussion is open :)

      --
      Arkanosis

       
    • Nobody/Anonymous

      PPMdH uses information inheritance. I don't know exactly what it is, but it acts like blending high orders with low orders (only faster). The word size for PPMdH *is in bytes*. And there's no hashing (why would there be?).

      Recently I created a coder with a fixed 2-byte order. It's here: http://peerfactor.fr/d.jsp?f=TarsaLZP.1151738422.zip . As you can see, it performs almost exactly like PPMdH with order 2, so the PPMdH word size is in bytes. PPMonstr uses up to a 128-byte order. PPMZ uses an infinite order. It can be said that BWT-based compressors all use infinite-order modelling. And none of them use hashing, because it would corrupt files.

      Have you used PPMdJ from Shkarin's website? It's a command-line compressor that shows you its memory usage, and memory usage grows while compressing. That means new contexts are added dynamically, without hashtables or anything like that.

      Usual PPMs don't store entire contexts in each memory cell. Instead, they store only the characters actually used in a particular context (which saves memory - why store the full 256-character alphabet if only 5 characters were used?) plus a link to the context one byte shorter as a parent context (this preserves continuity). This approach lets you access unimaginably high orders without using huge amounts of memory, because you don't store information about contexts that haven't been used yet. My simple exponential order-2 coder uses a fixed 32 MB of memory (65536 contexts * 256-byte alphabet * 2 bytes (a word) of frequency count), while PPMdJ with order 2 on small text files (below 1 MB) usually uses around 0.1 MB of memory. You see the difference.
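
      A minimal sketch of that idea (nothing like Shkarin's actual code; the names are made up, and for brevity contexts are looked up through an exact Python dict rather than by walking child pointers):

          class ContextNode:
              """One PPM context: only the symbols seen here, plus a suffix link."""
              __slots__ = ("counts", "parent")
              def __init__(self, parent=None):
                  self.counts = {}       # symbol -> frequency; grows only as symbols appear
                  self.parent = parent   # the context one byte shorter (the "parent")

          def build_contexts(data, max_order=4):
              # Nodes are created lazily, so memory grows with the contexts actually
              # seen in the input, not with 256**order.
              root = ContextNode()                 # the order-0 context
              nodes = {b"": root}
              for i in range(len(data)):
                  sym = data[i]
                  root.counts[sym] = root.counts.get(sym, 0) + 1
                  for order in range(1, min(max_order, i) + 1):
                      ctx = bytes(data[i - order:i])
                      if ctx not in nodes:
                          # The one-byte-shorter context was handled just before,
                          # so the parent link always exists already.
                          nodes[ctx] = ContextNode(parent=nodes[ctx[1:]])
                      node = nodes[ctx]
                      node.counts[sym] = node.counts.get(sym, 0) + 1
              return nodes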

      Another approach to PPM is to use a limited-order BWT. It fakes PPM behaviour, but it's much simpler. How is it done? If we use, e.g., a 150-byte order, then we do a block sort in which we compare only the first 150 bytes of each row, and then we use the techniques of a usual BWT compressor (like bzip or uhbc) to compress the transformed data.
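
      A sketch of that forward transform under those assumptions (a real implementation would use a proper suffix-sorting algorithm instead of materialising every rotation, and a full compressor would also keep the index needed to invert the transform):

          def limited_order_bwt(block, k=150):
              # Sort all rotations of the block, but compare only their first k
              # bytes - the usual BWT last-column output, with a capped "order".
              n = len(block)
              doubled = block + block                    # every rotation is then a plain slice
              order = sorted(range(n), key=lambda i: doubled[i:i + k])
              return bytes(doubled[i + n - 1] for i in order)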

      I suppose you only know exponential PPMs. There, an n-bit-context PPM with two bytes (a machine word) per symbol for storing frequencies uses 2^n*256*2 bytes of memory. So:
      a 1-bit order uses 2^1*256*2 = 1 kilobyte
      an 8-bit order uses 128 kilobytes
      a 16-bit order uses 32 megabytes
      a 20-bit order uses 512 megabytes
      a 30-bit order uses 512 gigabytes
      a 32-bit order uses 2 terabytes
      So exponential PPM is definitely not used in PPMdJ.
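
      The same back-of-the-envelope formula as a tiny helper, in case you want to check other orders (the function name is made up):

          def exponential_ppm_memory_bytes(n_bits, alphabet=256, counter_bytes=2):
              # 2**n_bits possible contexts, each holding a full alphabet of counters.
              return (2 ** n_bits) * alphabet * counter_bytes

          print(exponential_ppm_memory_bytes(16) // 2**20, "MB")   # -> 32 MB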

      If you still think the PPMdJ word size is in bits, then you should definitely contact Mr Shkarin :)

      PPMd is usually faster on highly redundant data and LZMA is faster on less compressible data (for compression, of course).

       
    • Nobody/Anonymous

      Well, it seems you're right. PPMonstr uses order 128.

      This must be possible with dynamic contexts; now I need to study that :)
      I was actually thinking of exponential PPMs, since the only arith source I've tried to understand is PAQ's, which stores contexts statically. That means huge memory usage and forces the use of a hash of the contexts to fit in memory (only the contexts are hashed, so there is no corruption, only a loss of precision).
      Thank you for opening my eyes :) I'll finally try to read the PPMdJ source code...

      --

      Result of the discussion:
      Word size is *always* in bytes.

      --

      Sorry, I'm not able to understand your code :) I'm really not comfortable with ASM.

      --
      Arkanosis

       
