Re: [lc-devel] Compressed caching
From: John M. <joh...@gm...> - 2006-06-20 00:01:47
Nitin Gupta wrote:
> John Moser wrote:
>>> I don't think its good to kick out least compressible pages first. What
>>> ultimately matters is to have working set of running programs to stay
>>> in memory -- compressed or uncompressed. But then, its hard to guess
>>> working set of each running program...Its a big work in itself to come
>>> up with a good heuristic for which (and how many) pages to kick out
>>> from ccache when memory gets real low.
>>
>> Eh, that's what letting the existing swap logic figure it out is for.
>> Can't make it any LESS efficient than it already is, right? :)  Plus
>> the swap cache is looking out for us.
>
> Ah..there is no swap logic for which pages to discard when swap goes
> (almost) full (there's one: OOM killer :) ). After all we are swapping
> to just another swap device which we call virtual swap.
> I mean for which pages to emit from ccache to disk.
> So, it is our work to develop this logic to handle the case of ccache
> reaching its size limit (or when we want to shrink it for other
> reasons).
>
>> Some light LRU heuristics may be useful.  Possibly time based ("If
>> this page has been used in the last 40,000 jiffies, try to kill
>> something else").
>
> So, you are saying to maintain jiffies counter for every page in
> ccache? Or if you store jiffies val then how will you lookup page that
> has jiffies val in some particular (40,000) window? This 'LRU' scheme
> won't work.

We could just throw a jiffies count (the number of jiffies since boot
should be available) into the entry when it's requested from ccache, and
before sending a page to disk check something like

  page->jiffies < JIFFIES_WE_THINK_MEAN_THIS_WAS_RECENTLY_WANTED

I don't really know, or feel like thinking it out too far.  This kind of
thing can be deferred until ccache is really, actually working.

>> It may even be possible to find an algorithm that does strict analysis
>> but not compression (like a hash) that spits out a value correlating
>> to the redundancy and encodability of the stream; but that would take
>> a lot of research to come up with.  A really fast, really light
>> compressor would be easier.
>
> your hash idea is a real nice topic for MS...but I've not yet planned
> for that :)
> (btw...I searched for something like this on citeseer but no luck)
>
> WKdm is lightest I know...something lighter is maybe dear ol' memcpy()
> ;) ..and still WKdm can't give you an idea of how well others (LZO)
> will compress it.

In general there is a correlation.  Remember that compression is a
characteristic of both the algorithm and the data.  It is computationally
impossible to compress truly random data beyond a specific average ratio;
however, small isolated blocks may encode well.

In general we say one algorithm performs better than another because of
this.  7zip gets smaller output than RAR or ZIP, consistently, on various
input data; this is not a coincidence -- the LZMA algorithm really is
better at redundancy elimination.

The fact that one algorithm compresses a set of data more than the next
shows that the algorithm is better at eliminating redundancy, while the
fact that a given algorithm compresses one piece of data better than
another shows that that data is more redundant.

Think of it in terms of a Punnett square, with the crossed values giving
the output size of compressing a 32K block of data:

  Algorithms:               Data sets:
  X = weak compression      A = minimal redundancy
  Y = strong compression    B = much redundancy

        A    B
    X  30 | 25
      ----+----
    Y  25 | 15

A weak algorithm on non-redundant data will produce a block only slightly
smaller (if not bigger) than the input, while the same algorithm on more
redundant data will produce a block that's a little smaller than that.
In contrast, a strong algorithm on non-redundant data will produce a
block that's somewhat smaller, and on redundant data will produce a block
that's much smaller.

You are correct that it's not a mathematical fact that any given
algorithm can be used to predict how other algorithms will perform on the
data; but compression algorithms are basically analysis of the data, and
as a side effect of their nature they expose a property of it.  We can
use this to conjecture that in most cases one compression algorithm will
effectively describe how well the others will perform -- not in a 1:1
manner, but as an approximate general trend.
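Here's a quick userspace illustration of that claim.  Neither WKdm nor
LZO is assumed to be available here; zlib at level 1 and level 9 just
stands in for a "weak" and a "strong" compressor, and the two page
contents are made up.  The only point is that the cheap pass ranks the
pages the same way the expensive pass does:

/* Userspace sketch (not kernel code): probe how compressible two
 * different "pages" are with a cheap pass and a heavy pass.
 * Build with:  cc probe.c -lz
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define PAGE_SZ 4096

/* Compress one page at the given zlib level and return the output size. */
static unsigned long packed_size(const unsigned char *page, int level)
{
	unsigned char out[8192];	/* comfortably > compressBound(4096) */
	uLongf out_len = sizeof(out);

	if (compress2(out, &out_len, page, PAGE_SZ, level) != Z_OK)
		return PAGE_SZ;		/* treat failure as incompressible */
	return out_len;
}

int main(void)
{
	unsigned char redundant[PAGE_SZ], mixed[PAGE_SZ];
	unsigned int i;

	memset(redundant, 0x2a, PAGE_SZ);	/* "much redundancy" */
	for (i = 0; i < PAGE_SZ; i++)		/* "minimal redundancy" */
		mixed[i] = (unsigned char)((i * 2654435761u) >> 13);

	/* The cheap pass ranks the two pages the same way the heavy pass
	 * does, even though the absolute sizes differ -- that's the whole
	 * claim. */
	printf("redundant page: weak=%lu strong=%lu\n",
	       packed_size(redundant, 1), packed_size(redundant, 9));
	printf("mixed page:     weak=%lu strong=%lu\n",
	       packed_size(mixed, 1), packed_size(mixed, 9));
	return 0;
}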
>> The algorithm CI reference would be computed when the algorithm is
>> first loaded, before it was put into use.  This is a one-time cost for
>> each algorithm, and can be ignored; it likely operates in under a
>> millisecond, but it wouldn't hurt anything if it ran for 100-200 ms
>> really.  This value would become persistent for the life of the
>> algorithm (i.e. until someone rmmod's it), so it wouldn't have to be
>> recomputed again.
>
> CI only makes sense when it tells how well _that_ algo compresses
> _this_ page. If you just want a sane initial CI value for every algo,
> just hard-code a value you can derive from many previous papers that
> compare these algos (typically WKdm, WK4x4 and LZO).

See above :P
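To make the load-time piece concrete, I mean no more than something like
this.  Userspace sketch, not the actual patch: struct ccache_algo,
ccache_register_algo() and the reference buffer are all invented for
illustration, and zlib again stands in for WKdm/WK4x4/LZO:

/* Build with:  cc ci_ref.c -lz */
#include <stdio.h>
#include <zlib.h>

#define REF_SIZE 4096

struct ccache_algo {
	const char *name;
	int level;		/* which stand-in compressor to use */
	unsigned int ci_ref;	/* bytes out per REF_SIZE bytes in; set once */
};

/* One-time cost at registration: compress a fixed reference buffer and
 * remember the result.  The value then lives until the algorithm is
 * unregistered (i.e. the module is rmmod'd), so it never needs
 * recomputing. */
static void ccache_register_algo(struct ccache_algo *algo)
{
	static unsigned char ref[REF_SIZE];
	unsigned char out[2 * REF_SIZE];
	uLongf out_len = sizeof(out);
	unsigned int i;

	for (i = 0; i < REF_SIZE; i++)	/* half runs, half varying bytes */
		ref[i] = (i & 0x40) ? (unsigned char)i : 0;

	if (compress2(out, &out_len, ref, REF_SIZE, algo->level) == Z_OK)
		algo->ci_ref = (unsigned int)out_len;
	else
		algo->ci_ref = REF_SIZE;	/* worst case: incompressible */
}

int main(void)
{
	struct ccache_algo light = { "light (zlib -1)", 1, 0 };
	struct ccache_algo heavy = { "heavy (zlib -9)", 9, 0 };

	ccache_register_algo(&light);
	ccache_register_algo(&heavy);

	printf("%s: ci_ref = %u bytes\n", light.name, light.ci_ref);
	printf("%s: ci_ref = %u bytes\n", heavy.name, heavy.ci_ref);
	return 0;
}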
>> Note that given these times there could be a negative impact on
>> performance; on faster systems, of course, the impact quickly becomes
>> negligible.  It is still significant for embedded systems.
>>
>> It is also notable that we can amortize the costs by conjecturing that
>> under normal circumstances the system won't be swapping excessively,
>> and thus won't be using compressed cache excessively; so the CI
>> calculation can be largely ignored if it is sufficiently small.
>
> ccache is not even invoked in 'normal' conditions. It drops in when
> system is choking for air...err..RAM :)

Under normal conditions, the system bumps the top of RAM and swaps a
handful of pages out.  Under abnormal conditions, the system has a
working set 10% bigger than physical RAM and begins swapping data in and
out every few milliseconds because it can't keep EVERYTHING it needs NOW
in memory.  Boot with 'mem=32M' and start KDE to see what I'm talking
about.

>> We can further amortize the costs by conjecturing that when ccache
>> thrashing occurs (much paging in and out of ccache), pages being sent
>> back and forth between memory and ccache are usually not going to be
>> dirty.  In this case, we may be able to store the CI of the page and
>> only re-calculate it if the page is dirty (i.e. altered).  Pages that
>> were decompressed, read, and recompressed would be unchanged and have
>> the same CI, so we could skip recalculating it.
>
> Yes, ccache (as I expect) will be loaded with clean page cache pages
> (if you choose to compress such pages) but then where will you store
> per-page CI values and maintain them even after a clean page is
> decompressed and freed from ccache?
> maybe a field somewhere...but this doesn't seem interesting..

*shrug*  Don't worry about it for now anyway.  It's just an optimization;
worry about it in the future, once ccache actually works.

>>>> This would allow for the following useful elements:
>>>>
>>>>  - Groups may be given a CI == sum of the CIs of each page in the
>>>>    group
>>>>  - Pages each have a CI that doesn't change based on the algorithm
>>>>    used
>>>>  - Algorithms each have a CI to describe their efficiency
>>>>
>>>> Which can be used for:
>>>>
>>>>  - It is possible to quickly take a short incremental step to try to
>>>>    get the harder-to-compress pages into the same groups (i.e. find
>>>>    the first group from the hardest-to-compress end that is 'wrong',
>>>>    find all groups containing the next N hardest-to-compress pages,
>>>>    break those groups apart, reorder them, the group is now 'proper')
>>>>  - This allows idle time to be used to sort pages and groups in such
>>>>    a way that our first candidates for writing to disk are pages that
>>>>    are 'hard to compress' (taking more memory)
>>>
>>> Hard to compress factor should be combined with factor of having
>>> working set of apps within memory (and maybe others too - I don't
>>> know).
>>
>> Yes.. perhaps time based?  ("This page was used some time within the
>> last 4000 jiffies, it's probably part of a working set").
>
> I don't think this is the way to determine app working sets.
> This paper: I. Chihaia and T. Gross. Adaptive Main Memory Compression
> (http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf)
> mentions an adaptive resizing scheme that keeps the working set of an
> app in memory via a user space process that 'monitors' that app. But
> the worst thing is:
> -- Where is that code? Where is the description of the method used?
>    googled that for hours...YMMV
> -- They seem to assume that only a single heavyweight app is running,
>    while systems like OLPC are generic ones (albeit slower ones).

It's still the best way I can think of.  Say an app has 4 segments of
memory set up:

  [main()][hash_md5()][DATA:db_index][user_add()]

main() is going to call some other stuff and probably never get returned
into until the program exits; user_add() will happen when someone is
administrating the program.  Whenever the database is accessed, some of
DATA:db_index is read through (among other chunks of the database); and
hash_md5() is used to check for database corruption.

In the normal course of the program you'll probably find that, when the
program is active, user_add() is rarely used; main() is pretty much NEVER
used; hash_md5() is used every time the program does ANYTHING; and
DATA:db_index is read through every time someone needs to find something
in the database.

In these cases, we can say that the program's current working set
probably excludes main() and user_add(), purely because they haven't been
used recently; and probably includes hash_md5() and DATA:db_index,
because they've been used very recently.

If the program is asleep and rarely has any activity, we can probably
page it out without anyone caring much, in favor of some other program
that's currently hammering the CPU.
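And that recency test is all the jiffies idea earlier in this mail
amounts to.  Sketch only: struct cc_page and its last_used field are
invented here, not anything in the current ccache code; jiffies, HZ and
time_after() are the stock kernel interfaces:

#include <linux/jiffies.h>

#define CCACHE_RECENT_WINDOW	(10 * HZ)	/* "recently wanted" = last 10s; tunable */

struct cc_page {
	unsigned long last_used;	/* jiffies when last pulled out of ccache */
	/* ... compressed data, size, algorithm, etc. ... */
};

/* Call this whenever a page is decompressed from ccache because someone
 * actually wanted it. */
static inline void cc_page_touch(struct cc_page *cp)
{
	cp->last_used = jiffies;
}

/*
 * When ccache hits its size limit and something must go out to the real
 * swap device, prefer pages for which this returns 0 -- they have not
 * been touched within the window and are probably not part of anyone's
 * working set.
 */
static inline int cc_page_recently_wanted(const struct cc_page *cp)
{
	return time_after(cp->last_used + CCACHE_RECENT_WINDOW, jiffies);
}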
>>>> - It is possible to use a "Light" algorithm (such as WKdm)
>>>>   initially, and move to a "Heavy" algorithm by decompressing the
>>>>   block and recompressing it during idle CPU time
>>>> - Maximize efficiency without adding unnecessary overhead while the
>>>>   system is busy
>>>
>>> -- compress a page with WKdm
>>> [CPU is idle]
>>> -- extract these pages compressed with WKdm from ccache
>>> [CPU is now busy]
>>> -- compress them with LZO
>>> -- put them again in ccache
>>>
>>> There can be lot of potential problems with this thing. We need to
>>> look at it carefully in detail.
>>
>> Yeah, like systems that are busy running a large number of programs
>> (i.e. building a kernel) where the CPU is working on and off in quick
>> spurts.  You can get 80-150 ms idle periods; if you try to do 200 ms
>> of work there, you just slowed the system down.
>>
>>> I was just wondering how the swap prefetching patch senses the CPU as
>>> 'idle' - what are the things it considers? how does it handle the
>>> problem of 'spike' loads? Will look at all this later.
>>
>> When the CPU is idle, Linux activates the "Idle Loop."  The Idle Loop
>> pretty much uses HLT to halt the CPU for a little bit when the system
>> has nothing to do.  Otherwise the CPU would be going nuts all the
>> time ;)
>>
>> At any rate, there's a mechanism in there somewhere for it.
>
> Andreas Mohr provided these links (to know when CPU is 'idle'):
> -- http://lkml.org/lkml/2006/6/18/39
> -- http://lkml.org/lkml/2006/6/18/40
> -- http://threebit.net/mail-archive/linux-kernel/msg05303.html

cool.  The CPU 'idles', by the way, by entering a loop that does
absolutely nothing; the kernel then uses the HLT instruction to actually
halt the CPU, so it really processes nothing for a while.

> Cheers,
> Nitin Gupta
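P.S. For what it's worth, the shape I imagine for the idle-time
recompression pass is roughly the following.  Pure sketch:
ccache_recompress_one_page() is a placeholder for whatever ccache ends up
exposing (pull one WKdm page, recompress it with LZO, put it back; return
0 when there is nothing left to do), and the batch size is a guess aimed
at staying well inside those 80-150 ms idle windows.  The kthread,
set_user_nice() and cond_resched() calls are the normal kernel interfaces:

#include <linux/kthread.h>
#include <linux/sched.h>

#define RECOMPRESS_BATCH	8

extern int ccache_recompress_one_page(void);	/* placeholder, see above */

static int ccache_recompress_thread(void *unused)
{
	set_user_nice(current, 19);	/* only run when nobody else wants the CPU */

	while (!kthread_should_stop()) {
		int done = 0;

		/* Do a small batch, then give the CPU back; a sudden load
		 * spike only ever waits on one page's worth of work. */
		while (done < RECOMPRESS_BATCH && ccache_recompress_one_page())
			done++;

		if (!done)
			schedule_timeout_interruptible(HZ);	/* nothing to upgrade */
		else
			cond_resched();
	}
	return 0;
}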