Re: [lc-devel] Compressed caching
From: Nitin G. <nit...@gm...> - 2006-06-20 16:16:47
John Moser wrote:
> Nitin Gupta wrote:
>> John Moser wrote:
>>> It may even be possible to find an algorithm that does strict analysis
>>> but not compression (like a hash) that spits out a value correlating to
>>> the redundancy and encodability of the stream; but that would take a lot
>>> of research to come up with. A really fast, really light compressor
>>> would be easier.
>> your hash idea is a real nice topic for MS...but I've not yet planned
>> for that :)
>> (btw...I searched for something like this on citeseer but no luck)
>>
>> WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;)
>> ..and still WKdm can't give you an idea of how well others (LZO) will
>> compress it.
>>
>
> In general there is a correlation. Remember, compression is a
> characteristic of the algorithm and the data. It is computationally
> impossible to compress truly random data beyond a specific average
> ratio; however, small isolated blocks may encode well.
>
> In general we say one algorithm performs better than another because of
> this. 7zip gets smaller output than RAR or ZIP, consistently, on
> various input data; this is not a coincidence, the LZMA algorithm is
> actually better at redundancy elimination. The fact that an algorithm
> compresses a set of data more than the next shows that algorithm is
> better at eliminating redundancy; while the fact that a given algorithm
> compresses one piece of data better than the next shows that that data
> is more redundant.
>
> Think of it in terms of a Punnett square, with the crossed values giving
> the output data size of compressing a 32K block of data:
>
> Algorithms:               Data sets:
> X = weak compression      A = minimal redundancy
> Y = strong compression    B = much redundancy
>
>        A    B
>   X   30 | 25
>      ----+----
>   Y   25 | 15
>
> A weak algorithm with non-redundant data will produce a block only
> slightly smaller (if not bigger) than the input, while the same
> algorithm on more redundant data will produce a block that's a little
> smaller than that. In contrast, a strong algorithm on non-redundant
> data will produce a block that's somewhat smaller, and on redundant data
> will produce a block that's much smaller.
>
> You are correct in assuming that it's not a mathematical fact that any
> given algorithm can be used to predict how other algorithms will perform
> with the data; however, compression algorithms are basically analyses of
> data, and as a side effect of their nature they happen to expose a
> property of the data. We can use this to conjecture that in most cases
> a compression algorithm will effectively describe how well other
> algorithms will perform, just not in a 1:1 manner; it will produce an
> approximate general trend.

This seems very logical, and we can confirm it at least for the particular
cases of WKdm, WK4x4 and LZO.

If you have some time, it would be awesome if you could take the
'compress-test' kernel module:

http://linux-mm.org/CompressedCaching/Code?action=AttachFile&do=get&target=compress-test.tar.gz

and feed all three algos (included with this module) data from /lib
(binary files), normal text files and any other kind of data you can
think of, and compare how effectively each compresses it. This can easily
be automated, and lots of data on this will let us see whether the
performance of WKdm (the lightest) really describes the performance of
the other two. If you are interested in doing this, I can send you all
the details for this module.

In fact, all you need to do is feed some data (in single pages) to
/proc/compress_test/compress and then read this same file back to see
the compressed size. You can dynamically switch algorithms by writing
0, 1 or 2 to /proc/compress_test/algo_idx.
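Something along these lines would be enough to automate it (an untested
sketch just to give the idea - it assumes a 4096-byte page, that reading
/proc/compress_test/compress back returns a single text line with the
compressed size, and that indices 0/1/2 map to WKdm/WK4x4/LZO in that
order):

/*
 * Rough user-space driver for the compress-test module.
 * Assumptions (not checked against the actual module): a page is 4096
 * bytes, reading /proc/compress_test/compress gives one text line with
 * the compressed size of the last page written, and algo_idx 0/1/2
 * select WKdm/WK4x4/LZO.
 */
#include <stdio.h>

#define PAGE_SIZE 4096

static const char *algo_names[] = { "WKdm", "WK4x4", "LZO" };

/* Write a short string (e.g. the algorithm index) to a proc file. */
static int write_proc(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	char page[PAGE_SIZE];
	char result[128];
	FILE *in, *proc;
	int algo;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <data-file>\n", argv[0]);
		return 1;
	}

	in = fopen(argv[1], "rb");
	if (!in) {
		perror("fopen data file");
		return 1;
	}

	/* Feed the file one full page at a time (a trailing partial
	 * page is simply skipped) and try each algorithm on it. */
	while (fread(page, 1, PAGE_SIZE, in) == PAGE_SIZE) {
		for (algo = 0; algo < 3; algo++) {
			char idx[2] = { '0' + algo, '\0' };

			if (write_proc("/proc/compress_test/algo_idx", idx) < 0) {
				perror("algo_idx");
				return 1;
			}

			/* Hand one page to the module... */
			proc = fopen("/proc/compress_test/compress", "w");
			if (!proc) {
				perror("compress (write)");
				return 1;
			}
			fwrite(page, 1, PAGE_SIZE, proc);
			fclose(proc);

			/* ...and read back the compressed size it reports. */
			proc = fopen("/proc/compress_test/compress", "r");
			if (!proc) {
				perror("compress (read)");
				return 1;
			}
			if (fgets(result, sizeof(result), proc))
				printf("%s: %s", algo_names[algo], result);
			fclose(proc);
		}
	}

	fclose(in);
	return 0;
}

Running it over everything under /lib plus some large text files should
already show how closely the three algorithms track each other.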
---------------------------
[couldn't now think of a reply for rest of your mail]
---------------------------

Also, I have some doubts about the page grouping you mentioned earlier:
AFAIK, Rodrigo's patch never compressed pages in groups - all pages were
compressed individually. The page grouping you were referring to was the
size of the 'cells'. So, what is the meaning of grouping pages w.r.t. the
'chunked' structure? We could compress more than one page together and
then store the result in chunks, but that is not what you were referring
to...right?

Cheers,
Nitin Gupta