Re: [lc-devel] Compressed caching
From: Nitin G. <nit...@gm...> - 2006-06-19 20:34:52
John Moser wrote:
>> I don't think it's good to kick out the least compressible pages first. What
>> ultimately matters is to have the working set of running programs stay in
>> memory -- compressed or uncompressed. But then, it's hard to guess the
>> working set of each running program... It's a big work in itself to come
>> up with a good heuristic for which (and how many) pages to kick out from
>> ccache when memory gets really low.
>>
>
> Eh, that's what letting the existing swap logic figure it out is for.
> Can't make it any LESS efficient than it already is right? :)  Plus the
> swap cache is looking out for us.
>

Ah... there is no swap logic for which pages to discard when swap goes
(almost) full (well, there is one: the OOM killer :) ). After all, we are
swapping to just another swap device, which we call the virtual swap. So
it is our work to develop the logic that handles the case of ccache
reaching its size limit (or when we want to shrink it for other reasons).

> Some light LRU heuristics may be useful.  Possibly time based ("If this
> page has been used in the last 40,000 jiffies, try to kill something else")
>

So, you are saying to maintain a jiffies counter for every page in
ccache? And if you store a jiffies value, how will you look up the pages
whose jiffies value falls within some particular (40,000) window? This
'LRU' scheme won't work.

>
> It may even be possible to find an algorithm that does strict analysis
> but not compression (like a hash) that spits out a value correlating to
> the redundancy and encodability of the stream; but that would take a lot
> of research to come up with.  A really fast, really light compressor
> would be easier.

Your hash idea is a really nice topic for an MS... but I've not yet
planned for that :) (btw... I searched for something like this on
CiteSeer but no luck.) WKdm is the lightest I know of... something
lighter is maybe dear ol' memcpy() ;) ... and still, WKdm can't give you
an idea of how well the others (LZO) will compress a page.

>
> The algorithm CI reference would be computed when the algorithm is first
> loaded, before it was put into use.  This is a one-time cost for each
> algorithm, and can be ignored; it likely operates under a millisecond
> but it wouldn't hurt anything if it ran for 100-200mS really.  This
> value would become persistent for the life of the algorithm (i.e. until
> someone rmmod's it), so it wouldn't have to be recomputed again.
>

A CI only makes sense when it tells how well _that_ algo compresses
_this_ page. If you just want a sane initial CI value for every algo,
just hard-code a value you can derive from the many previous papers that
compare these algos (typically WKdm, WK4x4 and LZO).
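If I understand the per-algorithm CI correctly, it would be roughly
something like this -- just a sketch to make sure we mean the same thing;
the cc_algorithm struct, the compress() callback and the sample buffer
are all made-up names, not anything from the patch:

/*
 * Sketch only (not from the patch): a one-time, per-algorithm CI
 * computed when the algorithm is registered.  All names are invented
 * for illustration.
 */
#include <linux/types.h>

#define CC_SAMPLE_SIZE 4096     /* one page of representative sample data */

struct cc_algorithm {
        const char *name;
        /* returns compressed length in bytes, or < 0 on error */
        int (*compress)(const void *src, size_t src_len,
                        void *dst, size_t dst_len);
        unsigned int algo_ci;   /* 0..100, higher = compresses better */
};

static int cc_register_algorithm(struct cc_algorithm *algo,
                                 const void *sample, void *workbuf)
{
        int clen;

        /*
         * One-time cost when the algorithm is loaded (forgotten on
         * rmmod): compress a fixed sample page and remember how well
         * this algorithm did on it.
         */
        clen = algo->compress(sample, CC_SAMPLE_SIZE, workbuf, CC_SAMPLE_SIZE);
        if (clen < 0)
                return clen;

        algo->algo_ci = 100 - (clen * 100 / CC_SAMPLE_SIZE);
        return 0;
}

Hard-coding the value from the published WKdm/WK4x4/LZO comparisons would
simply replace the compress-a-sample step above with a constant.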
> Note that given these times there could be a negative impact on
> performance; on faster systems, of course, the impact becomes quickly
> negligible.  It is still significant for embedded systems.
>
> It is also notable that we can amortize the costs by conjecturing that
> under normal circumstances, the system won't be swapping excessively,
> and thus wouldn't be using compressed cache excessively; thus the CI
> calculation can be largely ignored if it is sufficiently small.

ccache is not even invoked under 'normal' conditions. It drops in when
the system is choking for air... err... RAM :)

>
> We can further amortize the costs by conjecturing that when ccache
> thrashing occurs (much paging in and out of ccache), pages being sent
> back and forth between memory and ccache are usually not going to be
> dirty.  In this case, we may be able to store the CI of the page and
> only re-calculate it if the page is dirty (i.e. altered).  Pages that
> were decompressed, read, and recompressed would be unchanged and have
> the same CI, thus we could skip recalculating the CI.
>

Yes, ccache (as I expect) will mostly be loaded with clean page cache
pages (if you choose to compress such pages), but then where will you
store per-page CI values and maintain them even after a clean page is
decompressed and freed from ccache? Maybe a field somewhere... but this
doesn't seem interesting.

>>> This would allow for the following useful elements:
>>>
>>> - Groups may be given a CI == sum of the CI's of each page in the group
>>> - Pages each have a CI that doesn't change based on algorithm used
>>> - Algorithms each have a CI to describe their efficiency
>>>
>>> Which can be used for:
>>>
>>> - It is possible to quickly take a short incremental step to try to get
>>> the harder-to-compress pages into the same groups (i.e. find the first
>>> group from the hardest-to-compress end that is 'wrong', find all groups
>>> containing the next N hardest-to-compress pages, break those groups
>>> apart, reorder them, the group is now 'proper')
>>> - This allows for idle time to be used to sort pages and groups in
>>> such a way that our first candidates for writing to disk are pages that
>>> are 'hard to compress' (taking more memory)
>>>
>> The hard-to-compress factor should be combined with the factor of keeping
>> the working set of apps within memory (and maybe others too - I don't know).
>
> Yes.. perhaps time based?  ("This page was used some time within the
> last 4000 jiffies, it's probably part of a working set").

I don't think this is the way to determine app working sets.
This paper:

  I. Chihaia and T. Gross. Adaptive Main Memory Compression.
  (http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf)

mentions an adaptive resizing scheme that keeps the working set of an app
in memory via a user-space process that 'monitors' the app. But the worst
parts are:
-- Where is that code? Where is the description of the method used? I
googled for it for hours... YMMV.
-- They seem to assume that only a single heavyweight app is running,
while systems like the OLPC are generic ones (albeit slower ones).

>
>>> - It is possible to use a "Light" algorithm (such as WKdm) initially,
>>> and move to a "Heavy" algorithm by decompressing the block and
>>> recompressing it during idle CPU time
>>> - Maximize efficiency without adding unnecessary overhead while the
>>> system is busy
>>>
>> -- compress a page with WKdm
>> [CPU is idle]
>> -- extract these pages compressed with WKdm from ccache
>> [CPU is now busy]
>> -- compress them with LZO
>> -- put them again in ccache
>>
>> There can be a lot of potential problems with this thing. We need to look
>> at it carefully in detail.
>>
>
> Yeah, like systems that are busy running a large number of programs
> (i.e. building a kernel) where the CPU is working on and off in quick
> spurts.  You can get 80-150mS idle periods, if you try to do 200mS of
> work there you just slowed the system down.
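Right. If we try this at all, it will probably have to work one chunk at
a time so it can bail out the moment the CPU stops being idle. Something
roughly along these lines -- just a sketch, and every identifier here is
invented for illustration, none of this exists in the patch:

/*
 * Sketch only, not working code: one possible shape for "recompress
 * WKdm-compressed chunks with LZO while the CPU is idle".
 */
struct cc_chunk;                                /* one entry in ccache */

/* placeholder interfaces -- all invented for this sketch */
struct cc_chunk *cc_grab_wkdm_chunk(void);      /* oldest WKdm entry, or NULL */
void cc_put_back(struct cc_chunk *chunk);       /* return it unchanged */
void cc_store_lzo(struct cc_chunk *chunk, const void *buf, int len);
int  cc_chunk_len(const struct cc_chunk *chunk);
void wkdm_decompress(const struct cc_chunk *chunk, void *page_out);
int  lzo_compress_page(const void *page, void *out); /* length, < 0 on error */
int  cpu_looks_idle(void);

static void cc_idle_recompress(void *page_buf, void *lzo_buf)
{
        struct cc_chunk *chunk;
        int lzo_len;

        /*
         * Work one chunk at a time so we can stop as soon as real work
         * shows up -- an 80-150ms idle window is easy to overrun if we
         * batch too much.
         */
        while (cpu_looks_idle()) {
                chunk = cc_grab_wkdm_chunk();
                if (!chunk)
                        break;

                wkdm_decompress(chunk, page_buf);   /* back to a 4K page */
                lzo_len = lzo_compress_page(page_buf, lzo_buf);

                /* Only keep the LZO copy if it actually saved space. */
                if (lzo_len > 0 && lzo_len < cc_chunk_len(chunk))
                        cc_store_lzo(chunk, lzo_buf, lzo_len);
                else
                        cc_put_back(chunk);
        }
}

How cpu_looks_idle() should actually be implemented is exactly the open
question below.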
>
>> I was just wondering how the swap prefetching patch senses the CPU as
>> 'idle' - what are the things it considers? How does it handle the
>> problem of 'spike' loads? Will look at all this later.
>>
>
> When the CPU is idle, linux activates the "Idle Loop."  The Idle Loop
> pretty much uses HLT to halt CPU use for a little bit when the system
> has nothing to do.  Otherwise the CPU would be going nuts all the time ;)
>
> At any rate, there's a mechanism in there somewhere for it.

Andreas Mohr provided these links (on how to know when the CPU is 'idle'):
-- http://lkml.org/lkml/2006/6/18/39
-- http://lkml.org/lkml/2006/6/18/40
-- http://threebit.net/mail-archive/linux-kernel/msg05303.html

Cheers,
Nitin Gupta