Thread: Re: [lc-devel] Compressed caching
From: Nitin G. <nit...@gm...> - 2006-06-18 15:28:58
John Moser wrote: > Nitin Gupta wrote: >> John Moser wrote: >>> Nitin Gupta wrote: >>>> On 6/17/06, John Moser <joh...@gm...> wrote: >>>> * Storing pages: >>>> -- First, it creates a 'virtual swap' which is a swap with its own >>>> swap_map. This swap is given highest priority. So, when pages are to >>>> be swapped out, existing code path naturally/transparently uses this >>>> swap device (until it's full) for swapping out. (prelim patches for >>>> this to be posted very soon on lc-devel). >>> (See I don't understand the swap system that well :) >>> >>> I was thinking eventually I'd have to convince the kernel to *try* to >>> swap even if there was no swap device.. >>> >>> Once this swap device is full it needs to start pushing its pages to a >>> REAL swap device, i.e. on disk. >>> >> This is exactly what this virtual swap is meant for :) >> Pages assigned to this virtual swap will eventually go to ccache and >> when it's full, pages are sent to real swap disk -- all this with >> minimal code intrusion. This approach works even if there are no real >> swap disks -- only virtual swap will be used in that case. >> > > Awesome. And the swap cache and existing LRU logic is all used, right? > (These are important to get memory usage patterns right; we should > theoretically be able to send pages physically to disk in any order we > want without being less efficient than general Linux...) > Yes! >>>> -- in shrink_page_list() it calls should_add_to_ccache() to determine >>>> if page can really be added to ccache. It can consider things like no. >>>> of pages already in ccache, CPU load, type of page (anon or fs-backed) >>>> and some heuristics that tell if page is even worth mem space it will >>>> take (no such heurustic yet worked out but it is really important >>>> thing). >>>> >>> I was thinking putting that code in a module. The page would be >>> compressed with a fast reference algorithm to be used to get a >>> Compression Index CI = PAGE_SIZE - COMPRESSED_PAGE_SIZE. Squeezing down >>> a single page is almost instant anyway. >>> >>> As for "type of page," certain FS-backed pages could be freed, others >>> would have to be persistent. I was thinking more on letting the calling >>> code mark pages that could be freed as "transient" and just freeing them >>> if the compressed memory area got too big, rather than swapping. >>> >> Again, this is exactly what's happening. Pages guessed to be swapped out >> by should_add_to_ccache() are marked by PG_will_compress flag. But, this >> does not guarantee that they will really be compressed. If finally, >> compression fails (lack of mem, CPU load high, poor compressibility >> etc.) this flag is cleared and everything else goes as usual. If it >> really gets compressed, its marked with PG_compressed. Thus it goes with >> your two step 'transient' approach...I think. >> > > Transient pages == come from the disk cache system. If the disk cache > decides it wants some page back and we don't have it, it can go read it > back from disk (i.e. /lib/*, executables, stuff in /usr...) > >>>> -- then pages filtered by should_add_to_ccache() go to pageout() as >>>> usual where they are sent to add_to_ccache() where they are compressed >>>> and added to compression structure. If add_to_ccache() fails for some >>>> reason (e.g. insuff. RAM), the page falls back to normal swap disks >>>> (rest of pageout() code). 
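A minimal sketch of what that filter step might look like; should_add_to_ccache(), the PG_will_compress flag and max_anon_cc_size come straight from the discussion, while helpers such as ccache_cpu_too_busy() and ccache_page_looks_compressible() are placeholders for illustration, not the patch's actual code:

static int should_add_to_ccache(struct page *page)
{
	/* Respect the sysctl-style limit on anonymous pages held in ccache. */
	if (PageAnon(page) && ccache_nr_pages >= max_anon_cc_size)
		return 0;

	/* Skip ccache entirely when the CPU is already saturated. */
	if (ccache_cpu_too_busy())
		return 0;

	/* Placeholder for the "is this page even worth the memory?" guess. */
	if (!ccache_page_looks_compressible(page))
		return 0;

	/* Mark it; the flag is cleared again if compression later fails. */
	SetPageWillCompress(page);
	return 1;
}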
>>> I was thinking having the Swap Zone code supply the code to write to >>> physical disk, and just having stuff pass through it: >>> >>> RAM -> Swap Zone -> Compress -> NOPE DOESN'T SHRINK -> Swap File >>> >>> This was also mainly for when the compressed area got too big and needed >>> shrinkage (by, of course, real swapping). Of course, compression itself >>> was going to be a module, so normal operation was: >>> >>> RAM -> Swap Zone -> Swap File >>> >>> until such time as the compress module was loaded. :) >>> >> As I understand you are really after an approach that can easily (and >> with minimum code intrusion in kernel) fall back to real swap disks when >> compression can't be done. This is being done well in current approach. >> > > Yeah. I was looking at a different use of the same code later though > but :) So far your design looks technically about the same as mine. > hmm...somewhat different: -- I am not looking to have multi-stage compression i.e. first compress with quickest then later compress them with heavyweights instead when CPU is idle. -- Not (yet) considering grouping of pages for compression. -- And I've not yet started work on a heuristic for adaptive ccache size. >>>> * Restoring pages: >>>> Whenever swap-cache/page-cache is looked-up for a page >>>> (find_get_page(), find_get_pages() etc.), we check PG_compressed flag. >>>> If its set we decompress() page and return it. >>>> >>> I skipped pages compressed in place because compressing individual pages >>> is non-interesting. I noticed 15-20% compression with 4K pages on >>> Castro's patch; pushing this up to 8KiB, 16KiB, 32KiB, and 64KiB, I >>> noticed gains up to 50-60% when we hit 32KiB and nothing more beyond. >>> >>> I'm pretty sure that there's an asymptotic limit at 20, 24, 28, or 32KiB >>> (the lower the better); I want to compress pages in groups of whatever >>> that limit falls on for maximum performance. >>> >>> This makes the approach I had in mind a lot more complex. Pages would >>> be grouped using a grouping algorithm; idle time would be used to >>> disrupt groups and reconstruct more optimal grouping; and in the end we >>> would also attempt to swap out more difficult to compress pages first. >>> >>> (In fact this is why I want to find the smallest group for the >>> asymptotic limit-- 32KiB, 8 4KiB pages, my regrouping algorithm has to >>> decompress, rearrange, and recompress up to 9 groups, 288KiB! If it >>> were i.e. 20, that'd be 6 x 5 = 120KiB) >>> >> This is a great point! I really didn't even consider grouping pages >> together for compression -- because I was unaware of its effect on >> compression ratios and due to additional complexity it involves. This >> grouping of pages is definitely worth a genuine try. But first, to keep >> it simple, I will implement for single page compression. With this thing >> stable we can do some algo tweaking to have this grouping. >> > > Yeah, the page grouping and balancing code would be rather complex. > Just the idea is complex. And thats why I will look at it later or maybe just make it flexible enough for now. > >>>> De/Compression codes are dumb. They just compress what they have in >>>> input buffer and output a compressed buffer. >>>> >>>> Also, these algos are pluggable as separate kernel modules. They can >>>> de/register themselves with ccahe (as is done by JFFS2). 
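A rough sketch of what a pluggable de/compression interface along those lines could look like; the structure and function names here are illustrative assumptions, not the patch's (or JFFS2's) actual API:

#include <linux/list.h>
#include <linux/types.h>

struct ccache_compressor {
	const char *name;
	/* Both return the output length, or a negative errno on failure. */
	int (*compress)(const void *src, size_t src_len, void *dst, size_t dst_len);
	int (*decompress)(const void *src, size_t src_len, void *dst, size_t dst_len);
	struct list_head list;	/* linked into ccache's list of registered algorithms */
};

int ccache_register_compressor(struct ccache_compressor *c);
void ccache_unregister_compressor(struct ccache_compressor *c);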
But, for now >>>> I will just place them within kernel itself but overall approach makes >>>> it easy to get de/compression algos as pluggable modules :) >>>> >>> :) And the algorithm to use them? I was considering using a reference >>> block to compress to get a CI for the algorithm itself. >>> >> I was thinking of an alternate (simpler) approach. This is because we >> have only 3 principal de/compression algos -- WKdm, WK4x4 and LZO. Of >> these WKdm and WK4x4 are designed to compress typical anonymous memory >> pages (see >> http://www.cs.amherst.edu/~sfkaplan/papers/sfkaplan-dissertation.ps.gz >> for details of WKdm and WK4x4) while LZO is much better at typical >> filesystem (textual) data. Also speeds (in general): WKdm > WK4x4 > LZO >> >> So, considering this, I think a simple algo selection scheme like >> WKdm/WK4x4 if page is anonymous and LZO if fs-data. >> Or, depending on kind of file, WKdm/WK4x4 can give comparable >> compression with much better speeds. To handle cases, we can record >> 'last compression ratio' for each mapping (recorded in corres. >> address_space struct itself). This can be used to have separate algo >> selection per open file. > > Using a reference algorithm would be simpler if you wanted to do > multi-layer compression. Think about the following events: > > - Page 0x01 compressed (WKdm) > - Page 0x02 compressed (WKdm) > - Page 0x03 compressed (WKdm) > - CPU is idle > - Page 0x01 recompressed (WK4x4) > - Page 0x02 recompressed (WK4x4) > - Page 0x04 compressed (WKdm) > > Current state: > > 0x01 WK4x4 > 0x02 WK4x4 > 0x03 WKdm > 0x04 WKdm > > - CPU is idle > - Rearrange pages based on compressibility > > PROBLEM: How do you index compressibility? > > - Memory pressure is mounting > - Get pages out of memory -- Least compressible? I don't think its good to kick out least compressible pages first. What ultimately matters is to have working set of running programs to stay in memory -- compressed or uncompressed. But then, its hard to guess working set of each running program...Its a big work in itself to come up with a good heuristic for which (and how many) pages to kick out from ccache when memory gets real low. > > PROBLEM: Need to solve previous problem.. > > This is why I was considering multiple, pluggable algorithms and > reference instrumentation. The references would be: > > - Each new PAGE is compressed with a fast, extremely light algorithm > that doesn't produce useful ratios but runs almost straight through the > code (i.e. a streaming dictionary compressor rather than a windowed one, > which immediately is passed to a streaming huffman encoder) > - PAGE_SIZE - compressed_page_size == Compression Index (CI) > > - Each new algorithm is given a reference block to compress to test its > compression efficiency > - block_size - compressed_block_size == CI for the algorithm > - We could have 'anonymous memory' and 'file data' blocks > - Could produce an anonmem_CI and filemem_CI > This is just too much work to be done before _every_ compression. It can affect performance badly. > This would allow for the following useful elements: > > - Groups may be given a CI == sum of the CI's of each page in the group > - Pages each have a CI that doesn't change based on algorithm used > - Algorithms each have a CI to describe their efficiency > > Which can be used for: > > - It is possible to quickly take a short incremental step to try to get > the harder-to-compress pages into the same groups (i.e. 
find the first > group from the hardest-to-compress end that is 'wrong', find all groups > containing the next N hardest-to-compress pages, break those groups > apart, reorder them, the group is now 'proper') > - This allows for idle time to be used to sort pages and groups in > such a way that our first candidates for writing to disk are pages that > are 'hard to compress' (taking more memory) > Hard to compress factor should be combined with factor of having working set of apps within memory (and maybe other too - I don't know). > - It is possible to use a "Light" algorithm (such as WKdm) initially, > and move to a "Heavy" algorithm by decompressing the block and > recompressing it during idle CPU time > - Maximize efficiency without adding unnecessary overhead while the > system is busy > -- compress a page with WKdm [CPU is idle] -- extract these pages compressed with WKdm from ccache [CPU is now busy] -- compress them with LZO -- put them again in ccache There can be lot of potential problems with this thing. We need to look at it carefully in detail. I was just wondering how swap prefetching patch senses CPU as 'idle' - what are the things it considers?, how it handles problem of 'spike' loads?. Will look at all this later. > Possibly other things too. > > Anyway the point is I was more thinking on how to get it so when you > initially compress pages, you use very light algorithms; and when you > have idle CPU time, you burn it to order and condense the compressed > cache. This avoids CPU usage during initial entry into the cache. > >>>> Many of the setting are also tunable via sysctl interface. (e.g. >>>> /proc/sys/vm/max_anon_cc_size). >>>> >>> I was considering being conservative with the maximum cache size in >>> general; but more specifically, allowing the cache to grow until it >>> notices a lot decompression is happening per time space (compression we >>> do NOT care about; decompression means the system wants something back). >>> If we suddenly see 300 decompressions a second, we're not being very >>> efficient; time to start shrinking the compressed cache. (Under real >>> memory pressure, this shrinkage will involve swapping; otherwise, just >>> expanding requested pages back into memory.) >>> >>> Increasing the size of compressed cache during idle time can be done >>> conservatively, but I currently can't think of a way to figure out how >>> big to make it when not under load. Increasing passively (i.e. swapping >>> LRU data into compressed cache) would get some memory into (cheap) >>> compressed cache before the system came under load (decompression is >>> cheaper than compression). >> There are _many_ things to consider when developing heuristic for >> adaptive ccache resizing scheme. I have touched this area for now. I >> first aim to have a static sized ccache working first. >> >> This said, it's always good to keep thinking over this...it's >> interesting and is going to be useful :) >> >>>> Apart from some kernel hooks (as in lookup functions, >>>> shrink_page_list(), pageout()), all ccache code is pretty much in >>>> separate .[ch] files or as separte kernel modules. So, it's also very >>>> less intrusive. >>> Very nice. >>> >>>>> Another advantage is that no matter what I do, I can't swap pages to >>>>> disk in any worse order than the kernel normally would, since I'm being >>>>> handed pages based on existing LRU logic. To this end I've decided to >>>>> take the "Maximize Compression" approach and try to swap out pages that >>>>> are "hard to compress" first. 
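For the "shrink when decompressions spike" idea mentioned above (e.g. 300 decompressions a second meaning the cache is holding things the system wants back), a sampling heuristic could look roughly like the sketch below; every name in it is an assumption:

#define CCACHE_SAMPLE_INTERVAL	HZ	/* sample roughly once a second */
#define CCACHE_DECOMP_THRESHOLD	300	/* decompressions per sample */

static unsigned long ccache_decompressions;	/* bumped on every decompress */

/* Called once per CCACHE_SAMPLE_INTERVAL, e.g. from a timer. */
static void ccache_sample_decompress_rate(void)
{
	static unsigned long last_count;
	unsigned long rate = ccache_decompressions - last_count;

	last_count = ccache_decompressions;

	/* Many pages being faulted back out: ccache is hurting, not helping. */
	if (rate > CCACHE_DECOMP_THRESHOLD)
		ccache_shrink_target();		/* assumed helper */
}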
>>>>> >>>>> Actual compression will be done by a module; although actual >>>>> compression >>>>> algorithms will be loaded as slave modules to this module. :) >>>>> >>>>> I've also made note that due to the nature of this design, large >>>>> amounts >>>>> of flexibility are presented. We can talk to the swap device in any >>>>> way >>>>> we want; the Swap Zone would supply a mechanism for modules to access >>>>> the swap device to swap in and out anything, but would always ask the >>>>> modules themselves to get pages for it. In theory, we can compress >>>>> whatever we like and swap it out; >>>> Approach I described above is flexible enough to handle all kinds of >>>> pages: anon pages, clean and dirty page cache pages. >>>> >>> Compressed pages on swap? Other uses of the code (i.e. implementing >>> something besides compression)? >>> >> maybe I didn't understand you here. I meant flexibility in terms of >> what kind of pages we can keep in compression structure, a compression >> structure that should give good performance for almost all operations >> (lookup, expanding, shrinking, deleting a page, minimize fragmentation >> etc.) >> > > Mmm.. I was looking at toying with the infra later to 1) Compress data > that's in swap (and keep in memory a ccache swp_entry_t to real > swp_entry_t mapping to find it later); and 2) Implement a toy > distributed, encrypted storage backing (i.e. pages can be encrypted and > swapped over network to 2-3 servers for redundant storage off-device, > think wireless PDAs in a corporate network...) > Interesting. Many ideas keep flying in my mind but when I look at my code...lol..not even static cccache is working yet...LOL >>>>> of course, we need to relate swap file >>>>> indexes to real page indexes, and deal with suspend-to-disk et al. >>>> With approach described, we automatically use exisiting mech. of >>>> kernel for our use. This virtual swap is just another swap disk to >>>> other kernel parts. >>> What happens when you suspend to disk? Does the virtual swap get >>> force-flushed to real swap? >> It has to -- after all RAM can retain stuff in suspend mode. >> >>> Can you compress pages in swap and have the >>> kernel not wonder wtf you're smoking when it tries to come back up? >>> >> Yes, it should be easy to swap out pages as compressed form itself. This >> was done in Castro's patch too -- this is not to save swap disk space >> (rest of space was padded) but to reduce unnecessary decompression until >> they are _really_ required. >> > > If you work in multi-page blocks and get better than 1:((N-1)/N) > compression anywhere, you'll squeeze out a page. In groups of 5 pages > it's entirely feasible to get better than 20% compression, squeezing out > a whole page of swap usage. > > Of course, this in no way interferes with the original goal of reducing > unnecessary decompression; it's a happy side-effect. :) > This grouping of pages will also reduce metadata required for ccache. Each group will have a single chunk_head and coressponding chunks i.e. no. of cunk_heads will be reduced. >>> (this would all just require extra infrastructure, mainly APIs to say >>> 'Don't compress these pages' and 'stop storing shit in memory until I >>> say so', nothing majorly intrusive) >>> >>>> ----------------------- >>>> I have a Wiki setup for this project linux-mm.org/CompressedCaching >>>> Here, I have detailed the approach I'm taking (at least for page-cache >>>> pages). 
It's slightly out-dated now --- It doesn't detail approach for >>>> anon pages in detail.....I will update it soon. >>>> Feel free to edit it. >>>> >>>> Since I'm doing this project as part of Google SoC, I have many >>>> 'deadlines' to consider. So, now bringing major changes to approach >>>> will overwhelm me with work. It will be really nice if you can provide >>>> feedback on my approach :) >>> Your approach sounds at least to be minimally intrusive, which is good >>> because it lets you keep the code separate and do interesting things. >>> You also use the swapping mechanism, so you should be able to take >>> advantage of page cache implicitly (a goal of my own design). >>> >>> You need to consider page grouping; Experiment with different sized >>> groups, especially around 16K, 20K, 24K, 28K, and 32K. Try to make the >>> code flexible enough to handle variable sized page groups too; it would >>> be interesting to have multiple group sizes at run, but possibly more >>> complex than needed. I found page grouping to present gains of 10-15% >>> when moving between 4K-8K pages, and then about 5% between 16K-32K, and >>> then 0.1% between 32K-64K on Rodrigo's patch; this tells me there's >>> nothing interesting about any group bigger than 32K, and maybe you can >>> get away with something smaller and still maximize gains. >>> >> I will definitely lookup into how flexible my current approach is w.r.t >> grouping pages and will try to get some flexibility here if it shows up >> rigid. This grouping of pages seems really important. >> > > Run some simple tests. Feed the modules a group of test data; have them > compress it in chunks of 4K, 8K, 12K, 16K, 20K, 24K, 28K, 32K, 36K; and > then output the average compression ratio for each group size. The > output would look similar to: > > > 4K 16% > 8K 23% > 12K 30% > 16K 35% > 20K 39% > 24K 42% <-- This would be a good place to stop... > 28K 43% <-- Not worth the gains > 32K 43% > 36K 44% > > (Numbers faked) > > And you'd basically be looking for where the numbers stopped going up by > useful degrees (in this example, 42% -> 43% we'll say 1% isn't useful > for the added overhead). If you keep going and going you'll find a > point where no matter how much bigger the block gets you can't gain any > more compression; it's likely good to stop a bit before that, knocking > off 1% or 2%. > Your analysis of grouping pages interests me most :) I will definitely look into having this grouping supported by my design.....again, a bell rings -- nitin! get that static ccache with simplest of things you can do first!! ;) >>> One part of my design was going to be to let the kernel's existing swap >>> code make ALL the LRU decisions and base my decisions entirely on >>> compressibility of the pages. Pages would be regrouped during idle time >>> and groups would be kept in order of compressibility. The idea here was >>> to be able to throw hard-to-compress pages to swap first during memory >>> pressure. This would help maximize the in-memory compression ratio. >>> >>> As for multiple algorithms, I would say to always compress with your >>> lightest, fastest algorithm FIRST (this would be WKdm). During idle >>> time you can go back and decompress/recompress pages with something >>> heavier (WK4x4). There is no shame in abusing the CPU when it's idle ;) >>> >> For desktop systems, this makes sense. But what about OLPC systems? 
You >> really can't abuse idle CPU, you got to conserve power (I'm not sure of >> its effect on power requirements but it should be like more we run more >> we eat ;) ) >> > > True :) > Cheers, Nitin Gupta |
From: John M. <joh...@gm...> - 2006-06-18 17:07:20
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nitin Gupta wrote: > John Moser wrote: >> Nitin Gupta wrote: >>> John Moser wrote: >>>> Nitin Gupta wrote: >>>>> On 6/17/06, John Moser <joh...@gm...> wrote: >>>>> * Storing pages: >>>>> -- First, it creates a 'virtual swap' which is a swap with its own >>>>> swap_map. This swap is given highest priority. So, when pages are to >>>>> be swapped out, existing code path naturally/transparently uses this >>>>> swap device (until it's full) for swapping out. (prelim patches for >>>>> this to be posted very soon on lc-devel). >>>> (See I don't understand the swap system that well :) >>>> >>>> I was thinking eventually I'd have to convince the kernel to *try* to >>>> swap even if there was no swap device.. >>>> >>>> Once this swap device is full it needs to start pushing its pages to a >>>> REAL swap device, i.e. on disk. >>>> >>> This is exactly what this virtual swap is meant for :) >>> Pages assigned to this virtual swap will eventually go to ccache and >>> when it's full, pages are sent to real swap disk -- all this with >>> minimal code intrusion. This approach works even if there are no real >>> swap disks -- only virtual swap will be used in that case. >>> >> >> Awesome. And the swap cache and existing LRU logic is all used, right? >> (These are important to get memory usage patterns right; we should >> theoretically be able to send pages physically to disk in any order we >> want without being less efficient than general Linux...) >> > > Yes! > >>>>> -- in shrink_page_list() it calls should_add_to_ccache() to determine >>>>> if page can really be added to ccache. It can consider things like no. >>>>> of pages already in ccache, CPU load, type of page (anon or fs-backed) >>>>> and some heuristics that tell if page is even worth mem space it will >>>>> take (no such heurustic yet worked out but it is really important >>>>> thing). >>>>> >>>> I was thinking putting that code in a module. The page would be >>>> compressed with a fast reference algorithm to be used to get a >>>> Compression Index CI = PAGE_SIZE - COMPRESSED_PAGE_SIZE. Squeezing >>>> down >>>> a single page is almost instant anyway. >>>> >>>> As for "type of page," certain FS-backed pages could be freed, others >>>> would have to be persistent. I was thinking more on letting the >>>> calling >>>> code mark pages that could be freed as "transient" and just freeing >>>> them >>>> if the compressed memory area got too big, rather than swapping. >>>> >>> Again, this is exactly what's happening. Pages guessed to be swapped out >>> by should_add_to_ccache() are marked by PG_will_compress flag. But, this >>> does not guarantee that they will really be compressed. If finally, >>> compression fails (lack of mem, CPU load high, poor compressibility >>> etc.) this flag is cleared and everything else goes as usual. If it >>> really gets compressed, its marked with PG_compressed. Thus it goes with >>> your two step 'transient' approach...I think. >>> >> >> Transient pages == come from the disk cache system. If the disk cache >> decides it wants some page back and we don't have it, it can go read it >> back from disk (i.e. /lib/*, executables, stuff in /usr...) >> >>>>> -- then pages filtered by should_add_to_ccache() go to pageout() as >>>>> usual where they are sent to add_to_ccache() where they are compressed >>>>> and added to compression structure. If add_to_ccache() fails for some >>>>> reason (e.g. insuff. RAM), the page falls back to normal swap disks >>>>> (rest of pageout() code). 
>>>> I was thinking having the Swap Zone code supply the code to write to >>>> physical disk, and just having stuff pass through it: >>>> >>>> RAM -> Swap Zone -> Compress -> NOPE DOESN'T SHRINK -> Swap File >>>> >>>> This was also mainly for when the compressed area got too big and >>>> needed >>>> shrinkage (by, of course, real swapping). Of course, compression >>>> itself >>>> was going to be a module, so normal operation was: >>>> >>>> RAM -> Swap Zone -> Swap File >>>> >>>> until such time as the compress module was loaded. :) >>>> >>> As I understand you are really after an approach that can easily (and >>> with minimum code intrusion in kernel) fall back to real swap disks when >>> compression can't be done. This is being done well in current approach. >>> >> >> Yeah. I was looking at a different use of the same code later though >> but :) So far your design looks technically about the same as mine. >> > > hmm...somewhat different: > -- I am not looking to have multi-stage compression i.e. first compress > with quickest then later compress them with heavyweights instead when > CPU is idle. > -- Not (yet) considering grouping of pages for compression. > -- And I've not yet started work on a heuristic for adaptive ccache size. > >>>>> * Restoring pages: >>>>> Whenever swap-cache/page-cache is looked-up for a page >>>>> (find_get_page(), find_get_pages() etc.), we check PG_compressed flag. >>>>> If its set we decompress() page and return it. >>>>> >>>> I skipped pages compressed in place because compressing individual >>>> pages >>>> is non-interesting. I noticed 15-20% compression with 4K pages on >>>> Castro's patch; pushing this up to 8KiB, 16KiB, 32KiB, and 64KiB, I >>>> noticed gains up to 50-60% when we hit 32KiB and nothing more beyond. >>>> >>>> I'm pretty sure that there's an asymptotic limit at 20, 24, 28, or >>>> 32KiB >>>> (the lower the better); I want to compress pages in groups of whatever >>>> that limit falls on for maximum performance. >>>> >>>> This makes the approach I had in mind a lot more complex. Pages would >>>> be grouped using a grouping algorithm; idle time would be used to >>>> disrupt groups and reconstruct more optimal grouping; and in the end we >>>> would also attempt to swap out more difficult to compress pages first. >>>> >>>> (In fact this is why I want to find the smallest group for the >>>> asymptotic limit-- 32KiB, 8 4KiB pages, my regrouping algorithm has to >>>> decompress, rearrange, and recompress up to 9 groups, 288KiB! If it >>>> were i.e. 20, that'd be 6 x 5 = 120KiB) >>>> >>> This is a great point! I really didn't even consider grouping pages >>> together for compression -- because I was unaware of its effect on >>> compression ratios and due to additional complexity it involves. This >>> grouping of pages is definitely worth a genuine try. But first, to keep >>> it simple, I will implement for single page compression. With this thing >>> stable we can do some algo tweaking to have this grouping. >>> >> >> Yeah, the page grouping and balancing code would be rather complex. >> Just the idea is complex. > > And thats why I will look at it later or maybe just make it flexible > enough for now. > >> >>>>> De/Compression codes are dumb. They just compress what they have in >>>>> input buffer and output a compressed buffer. >>>>> >>>>> Also, these algos are pluggable as separate kernel modules. They can >>>>> de/register themselves with ccahe (as is done by JFFS2). 
But, for now >>>>> I will just place them within kernel itself but overall approach makes >>>>> it easy to get de/compression algos as pluggable modules :) >>>>> >>>> :) And the algorithm to use them? I was considering using a reference >>>> block to compress to get a CI for the algorithm itself. >>>> >>> I was thinking of an alternate (simpler) approach. This is because we >>> have only 3 principal de/compression algos -- WKdm, WK4x4 and LZO. Of >>> these WKdm and WK4x4 are designed to compress typical anonymous memory >>> pages (see >>> http://www.cs.amherst.edu/~sfkaplan/papers/sfkaplan-dissertation.ps.gz >>> for details of WKdm and WK4x4) while LZO is much better at typical >>> filesystem (textual) data. Also speeds (in general): WKdm > WK4x4 > LZO >>> >>> So, considering this, I think a simple algo selection scheme like >>> WKdm/WK4x4 if page is anonymous and LZO if fs-data. >>> Or, depending on kind of file, WKdm/WK4x4 can give comparable >>> compression with much better speeds. To handle cases, we can record >>> 'last compression ratio' for each mapping (recorded in corres. >>> address_space struct itself). This can be used to have separate algo >>> selection per open file. >> >> Using a reference algorithm would be simpler if you wanted to do >> multi-layer compression. Think about the following events: >> >> - Page 0x01 compressed (WKdm) >> - Page 0x02 compressed (WKdm) >> - Page 0x03 compressed (WKdm) >> - CPU is idle >> - Page 0x01 recompressed (WK4x4) >> - Page 0x02 recompressed (WK4x4) >> - Page 0x04 compressed (WKdm) >> >> Current state: >> >> 0x01 WK4x4 >> 0x02 WK4x4 >> 0x03 WKdm >> 0x04 WKdm >> >> - CPU is idle >> - Rearrange pages based on compressibility >> >> PROBLEM: How do you index compressibility? >> >> - Memory pressure is mounting >> - Get pages out of memory -- Least compressible? > > I don't think its good to kick out least compressible pages first. What > ultimately matters is to have working set of running programs to stay in > memory -- compressed or uncompressed. But then, its hard to guess > working set of each running program...Its a big work in itself to come > up with a good heuristic for which (and how many) pages to kick out from > ccache when memory gets real low. > Eh, that's what letting the existing swap logic figure it out is for. Can't make it any LESS efficient than it already is right? :) Plus the swap cache is looking out for us. Some light LRU heuristics may be useful. Possibly time based ("If this page has been used in the last 40,000 jiffies, try to kill something else") >> >> PROBLEM: Need to solve previous problem.. >> >> This is why I was considering multiple, pluggable algorithms and >> reference instrumentation. The references would be: >> >> - Each new PAGE is compressed with a fast, extremely light algorithm >> that doesn't produce useful ratios but runs almost straight through the >> code (i.e. a streaming dictionary compressor rather than a windowed one, >> which immediately is passed to a streaming huffman encoder) >> - PAGE_SIZE - compressed_page_size == Compression Index (CI) >> >> - Each new algorithm is given a reference block to compress to test its >> compression efficiency >> - block_size - compressed_block_size == CI for the algorithm >> - We could have 'anonymous memory' and 'file data' blocks >> - Could produce an anonmem_CI and filemem_CI >> > > This is just too much work to be done before _every_ compression. It can > affect performance badly. 
The page compression indexing would happen when a page was sent to the ccache, and occurs using a very light algorithm on a 4KiB page. This is not likely to affect much of anything, maybe cost 50,000 cycles (WKdm). Estimated times: CPU CLK SEC MSEC EXAMPLE 1MHz 1/20 50 8-bit Nintendo (6502) 200MHz 1/4000 0.25 iPaq 3600 PDA (206MHz) 400MHz 1/8000 0.125 iPaq 5400 PDA (Xscale) 800MHz 1/16000 0.0625 Old Dell 1.6GHz 1/32000 0.03125 Centrino Laptop 2.0GHz 1/40000 0.025 Desktop It may even be possible to find an algorithm that does strict analysis but not compression (like a hash) that spits out a value correlating to the redundancy and encodability of the stream; but that would take a lot of research to come up with. A really fast, really light compressor would be easier. The algorithm CI reference would be computed when the algorithm is first loaded, before it was put into use. This is a one-time cost for each algorithm, and can be ignored; it likely operates under a millisecond but it wouldn't hurt anything if it ran for 100-200mS really. This value would become persistent for the life of the algorithm (i.e. until someone rmmod's it), so it wouldn't have to be recomputed again. Note that given these times there could be a negative impact on performance; on faster systems, of course, the impact becomes quickly negligible. It is still significant for embedded systems. It is also notable that we can amortize the costs by conjecturing that under normal circumstances, the system won't be swapping excessively, and thus wouldn't be using compressed cache excessively; thus the CI calculation can be largely ignored if it is sufficiently small. We can further amortize the costs by conjecturing that when ccache thrashing occurs (much paging in and out of ccache), pages being sent back and forth between memory and ccache are usually not going to be dirty. In this case, we may be able to store the CI of the page and only re-calculate it if the page is dirty (i.e. altered). Pages that were decompressed, read, and recompressed would be unchanged and have the same CI, thus we could skip recalculating the CI. > >> This would allow for the following useful elements: >> >> - Groups may be given a CI == sum of the CI's of each page in the group >> - Pages each have a CI that doesn't change based on algorithm used >> - Algorithms each have a CI to describe their efficiency >> >> Which can be used for: >> >> - It is possible to quickly take a short incremental step to try to get >> the harder-to-compress pages into the same groups (i.e. find the first >> group from the hardest-to-compress end that is 'wrong', find all groups >> containing the next N hardest-to-compress pages, break those groups >> apart, reorder them, the group is now 'proper') >> - This allows for idle time to be used to sort pages and groups in >> such a way that our first candidates for writing to disk are pages that >> are 'hard to compress' (taking more memory) >> > > Hard to compress factor should be combined with factor of having working > set of apps within memory (and maybe other too - I don't know). Yes.. perhaps time based? ("This page was used some time within the last 4000 jiffies, it's probably part of a working set"). 
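A sketch of that amortization: cache the CI alongside ccache's per-page metadata and recompute it only when the page is dirty. The ccache_page_info structure and the light_compress() reference compressor are assumptions, not existing code:

struct ccache_page_info {
	unsigned short ci;		/* PAGE_SIZE - compressed size */
	unsigned short ci_valid:1;
};

static unsigned int page_ci(struct page *page, struct ccache_page_info *info)
{
	/* Clean page: the cached value still describes its contents. */
	if (info->ci_valid && !PageDirty(page))
		return info->ci;

	info->ci = PAGE_SIZE - light_compress(page);	/* assumed reference compressor */
	info->ci_valid = 1;
	return info->ci;
}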
> >> - It is possible to use a "Light" algorithm (such as WKdm) initially, >> and move to a "Heavy" algorithm by decompressing the block and >> recompressing it during idle CPU time >> - Maximize efficiency without adding unnecessary overhead while the >> system is busy >> > > -- compress a page with WKdm > [CPU is idle] > -- extract these pages compressed with WKdm from ccache > [CPU is now busy] > -- compress them with LZO > -- put them again in ccache > > There can be lot of potential problems with this thing. We need to look > at it carefully in detail. > Yeah, like systems that are busy running a large number of programs (i.e. building a kernel) where the CPU is working on and off in quick spurts. You can get 80-150mS idle periods, if you try to do 200mS of work there you just slowed the system down. > I was just wondering how swap prefetching patch senses CPU as 'idle' - > what are the things it considers?, how it handles problem of 'spike' > loads?. Will look at all this later. > When the CPU is idle, linux activates the "Idle Loop." The Idle Loop pretty much uses HLT to halt CPU use for a little bit when the system has nothing to do. Otherwise the CPU would be going nuts all the time ;) At any rate, there's a mechanism in there somewhere for it. > >> Possibly other things too. >> >> Anyway the point is I was more thinking on how to get it so when you >> initially compress pages, you use very light algorithms; and when you >> have idle CPU time, you burn it to order and condense the compressed >> cache. This avoids CPU usage during initial entry into the cache. >> >>>>> Many of the setting are also tunable via sysctl interface. (e.g. >>>>> /proc/sys/vm/max_anon_cc_size). >>>>> >>>> I was considering being conservative with the maximum cache size in >>>> general; but more specifically, allowing the cache to grow until it >>>> notices a lot decompression is happening per time space (compression we >>>> do NOT care about; decompression means the system wants something >>>> back). >>>> If we suddenly see 300 decompressions a second, we're not being very >>>> efficient; time to start shrinking the compressed cache. (Under real >>>> memory pressure, this shrinkage will involve swapping; otherwise, just >>>> expanding requested pages back into memory.) >>>> >>>> Increasing the size of compressed cache during idle time can be done >>>> conservatively, but I currently can't think of a way to figure out how >>>> big to make it when not under load. Increasing passively (i.e. >>>> swapping >>>> LRU data into compressed cache) would get some memory into (cheap) >>>> compressed cache before the system came under load (decompression is >>>> cheaper than compression). >>> There are _many_ things to consider when developing heuristic for >>> adaptive ccache resizing scheme. I have touched this area for now. I >>> first aim to have a static sized ccache working first. >>> >>> This said, it's always good to keep thinking over this...it's >>> interesting and is going to be useful :) >>> >>>>> Apart from some kernel hooks (as in lookup functions, >>>>> shrink_page_list(), pageout()), all ccache code is pretty much in >>>>> separate .[ch] files or as separte kernel modules. So, it's also very >>>>> less intrusive. >>>> Very nice. >>>> >>>>>> Another advantage is that no matter what I do, I can't swap pages to >>>>>> disk in any worse order than the kernel normally would, since I'm >>>>>> being >>>>>> handed pages based on existing LRU logic. 
To this end I've >>>>>> decided to >>>>>> take the "Maximize Compression" approach and try to swap out pages >>>>>> that >>>>>> are "hard to compress" first. >>>>>> >>>>>> Actual compression will be done by a module; although actual >>>>>> compression >>>>>> algorithms will be loaded as slave modules to this module. :) >>>>>> >>>>>> I've also made note that due to the nature of this design, large >>>>>> amounts >>>>>> of flexibility are presented. We can talk to the swap device in any >>>>>> way >>>>>> we want; the Swap Zone would supply a mechanism for modules to access >>>>>> the swap device to swap in and out anything, but would always ask the >>>>>> modules themselves to get pages for it. In theory, we can compress >>>>>> whatever we like and swap it out; >>>>> Approach I described above is flexible enough to handle all kinds of >>>>> pages: anon pages, clean and dirty page cache pages. >>>>> >>>> Compressed pages on swap? Other uses of the code (i.e. implementing >>>> something besides compression)? >>>> >>> maybe I didn't understand you here. I meant flexibility in terms of >>> what kind of pages we can keep in compression structure, a compression >>> structure that should give good performance for almost all operations >>> (lookup, expanding, shrinking, deleting a page, minimize fragmentation >>> etc.) >>> >> >> Mmm.. I was looking at toying with the infra later to 1) Compress data >> that's in swap (and keep in memory a ccache swp_entry_t to real >> swp_entry_t mapping to find it later); and 2) Implement a toy >> distributed, encrypted storage backing (i.e. pages can be encrypted and >> swapped over network to 2-3 servers for redundant storage off-device, >> think wireless PDAs in a corporate network...) >> > > Interesting. > > Many ideas keep flying in my mind but when I look at my code...lol..not > even static cccache is working yet...LOL Yeah, just ignore that one. :) I'm always thinking really, really far ahead and in different areas. > >>>>>> of course, we need to relate swap file >>>>>> indexes to real page indexes, and deal with suspend-to-disk et al. >>>>> With approach described, we automatically use exisiting mech. of >>>>> kernel for our use. This virtual swap is just another swap disk to >>>>> other kernel parts. >>>> What happens when you suspend to disk? Does the virtual swap get >>>> force-flushed to real swap? >>> It has to -- after all RAM can retain stuff in suspend mode. >>> >>>> Can you compress pages in swap and have the >>>> kernel not wonder wtf you're smoking when it tries to come back up? >>>> >>> Yes, it should be easy to swap out pages as compressed form itself. This >>> was done in Castro's patch too -- this is not to save swap disk space >>> (rest of space was padded) but to reduce unnecessary decompression until >>> they are _really_ required. >>> >> >> If you work in multi-page blocks and get better than 1:((N-1)/N) >> compression anywhere, you'll squeeze out a page. In groups of 5 pages >> it's entirely feasible to get better than 20% compression, squeezing out >> a whole page of swap usage. >> >> Of course, this in no way interferes with the original goal of reducing >> unnecessary decompression; it's a happy side-effect. :) >> > > This grouping of pages will also reduce metadata required for ccache. > Each group will have a single chunk_head and coressponding chunks i.e. > no. of cunk_heads will be reduced. Interesting. 
That's another thing about getting a compression index for a page-- if it's too low, you can immediately decide the page is too hard to compress because the metadata to control it is going to be bigger than the space savings. > >>>> (this would all just require extra infrastructure, mainly APIs to say >>>> 'Don't compress these pages' and 'stop storing shit in memory until I >>>> say so', nothing majorly intrusive) >>>> >>>>> ----------------------- >>>>> I have a Wiki setup for this project linux-mm.org/CompressedCaching >>>>> Here, I have detailed the approach I'm taking (at least for page-cache >>>>> pages). It's slightly out-dated now --- It doesn't detail approach for >>>>> anon pages in detail.....I will update it soon. >>>>> Feel free to edit it. >>>>> >>>>> Since I'm doing this project as part of Google SoC, I have many >>>>> 'deadlines' to consider. So, now bringing major changes to approach >>>>> will overwhelm me with work. It will be really nice if you can provide >>>>> feedback on my approach :) >>>> Your approach sounds at least to be minimally intrusive, which is good >>>> because it lets you keep the code separate and do interesting things. >>>> You also use the swapping mechanism, so you should be able to take >>>> advantage of page cache implicitly (a goal of my own design). >>>> >>>> You need to consider page grouping; Experiment with different sized >>>> groups, especially around 16K, 20K, 24K, 28K, and 32K. Try to make the >>>> code flexible enough to handle variable sized page groups too; it would >>>> be interesting to have multiple group sizes at run, but possibly more >>>> complex than needed. I found page grouping to present gains of 10-15% >>>> when moving between 4K-8K pages, and then about 5% between 16K-32K, and >>>> then 0.1% between 32K-64K on Rodrigo's patch; this tells me there's >>>> nothing interesting about any group bigger than 32K, and maybe you can >>>> get away with something smaller and still maximize gains. >>>> >>> I will definitely lookup into how flexible my current approach is w.r.t >>> grouping pages and will try to get some flexibility here if it shows up >>> rigid. This grouping of pages seems really important. >>> >> >> Run some simple tests. Feed the modules a group of test data; have them >> compress it in chunks of 4K, 8K, 12K, 16K, 20K, 24K, 28K, 32K, 36K; and >> then output the average compression ratio for each group size. The >> output would look similar to: >> >> >> 4K 16% >> 8K 23% >> 12K 30% >> 16K 35% >> 20K 39% >> 24K 42% <-- This would be a good place to stop... >> 28K 43% <-- Not worth the gains >> 32K 43% >> 36K 44% >> >> (Numbers faked) >> >> And you'd basically be looking for where the numbers stopped going up by >> useful degrees (in this example, 42% -> 43% we'll say 1% isn't useful >> for the added overhead). If you keep going and going you'll find a >> point where no matter how much bigger the block gets you can't gain any >> more compression; it's likely good to stop a bit before that, knocking >> off 1% or 2%. >> > > Your analysis of grouping pages interests me most :) I will definitely > look into having this grouping supported by my design.....again, a bell > rings -- nitin! get that static ccache with simplest of things you can > do first!! ;) > Yeah, get it working first. Just don't box yourself in a corner ;) You can try writing a test for various page sizes once the cache is actually working. 
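A userspace sketch of such a test, using zlib's compress2() purely as a stand-in for WKdm/WK4x4/LZO; it compresses the given sample file in chunks of 4K..36K and prints the average space saved per chunk size:

#include <stdio.h>
#include <zlib.h>

int main(int argc, char **argv)
{
	static unsigned char in[36 * 1024], out[72 * 1024];
	size_t sizes[] = { 4096, 8192, 12288, 16384, 20480,
			   24576, 28672, 32768, 36864 };
	FILE *f;

	if (argc < 2 || !(f = fopen(argv[1], "rb")))
		return 1;

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		unsigned long total_in = 0, total_out = 0;
		size_t n;

		rewind(f);
		while ((n = fread(in, 1, sizes[i], f)) == sizes[i]) {
			uLongf outlen = sizeof(out);

			/* Level 1 == fastest setting, closest to a "light" algorithm. */
			if (compress2(out, &outlen, in, n, 1) != Z_OK)
				continue;
			total_in += n;
			total_out += outlen;
		}
		if (total_in)
			printf("%2zuK  %5.1f%%\n", sizes[i] / 1024,
			       100.0 * ((double)total_in - (double)total_out) / total_in);
	}
	fclose(f);
	return 0;
}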
>>>> One part of my design was going to be to let the kernel's existing swap >>>> code make ALL the LRU decisions and base my decisions entirely on >>>> compressibility of the pages. Pages would be regrouped during idle >>>> time >>>> and groups would be kept in order of compressibility. The idea here >>>> was >>>> to be able to throw hard-to-compress pages to swap first during memory >>>> pressure. This would help maximize the in-memory compression ratio. >>>> >>>> As for multiple algorithms, I would say to always compress with your >>>> lightest, fastest algorithm FIRST (this would be WKdm). During idle >>>> time you can go back and decompress/recompress pages with something >>>> heavier (WK4x4). There is no shame in abusing the CPU when it's >>>> idle ;) >>>> >>> For desktop systems, this makes sense. But what about OLPC systems? You >>> really can't abuse idle CPU, you got to conserve power (I'm not sure of >>> its effect on power requirements but it should be like more we run more >>> we eat ;) ) >>> >> >> True :) >> > > > Cheers, > Nitin Gupta > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQIVAwUBRJWGugs1xW0HCTEFAQLpKBAAjbnj4W/mLvcudtAcH2DLeXpnL6/n1EHt gJ2c8bb9kdwRdDj6O6drCBPn4F2i2c0qiAI+6AIEtspFsTHDw3tHQi+9QxEChMf8 gMKLQ5fZI9SL/rXUbG6o6pF+pCvPtWjdfs+Qy1CDa9MIinh5DSFDQ1DGV1agBHtC Ow22apMS3p4xtNvnbRZ9l5d5GkS35nFJTnGDNVA2N6eKRk+UM2wlEqjcjFhMQyH7 XWQVTu4GiC9tDGDPUplGIfKIOHFfC7UqqLrciSQzgbr3gbIKfs1Qa/v5qNtWJLzr VeGKPCypK/74blA2+PjISH29aJ+vC6jJHQIFSy3fi9abt4esZ3JgxwyqNO+uIo1+ RKdvHMLKf71DNBf2QVV3BzBP7ZBYelRZFNCaSdfrGd5NFgIQR06JW5Nf14adpi+D hXK44YpkD1TVFuHoitVURODNBmwdrNF+Bqay/fY9hi31RSadylGT91U2OpFrSKyC 3cjfXB36HwDHNWtRxb2WzlETJvA4oMLWNvLbZAn6MqXZnBq6E4VwJNP+z3y22PNl dLOadtu+3qWbt0UyCHQp4sbpHgCgiG0wwDRMAJCNHhnXH7Tww93R84RkgtF8GkRC n1FtIH0mv3zWWSbFuei88KSqk8unH9xq9tKGKboHmJiDF3/AyZd/De8F+icTCIn9 0IfPKD+e6qQ= =v+xT -----END PGP SIGNATURE----- |
From: Andreas M. <an...@us...> - 2006-06-19 08:45:28
Hi, On Sun, Jun 18, 2006 at 08:58:54PM +0530, Nitin Gupta wrote: > I was just wondering how swap prefetching patch senses CPU as 'idle' - what are > the things it considers?, how it handles problem of 'spike' loads?. Will look at > all this later. I think that would be this one: http://lkml.org/lkml/2006/6/18/39 used by http://lkml.org/lkml/2006/6/18/40 Andreas Mohr
From: Nitin G. <nit...@gm...> - 2006-06-19 09:21:41
Hi, > > On Sun, Jun 18, 2006 at 08:58:54PM +0530, Nitin Gupta wrote: >> I was just wondering how swap prefetching patch senses CPU as 'idle' - what are >> the things it considers?, how it handles problem of 'spike' loads?. Will look at >> all this later. > > I think that would be this one: > http://lkml.org/lkml/2006/6/18/39 > used by > http://lkml.org/lkml/2006/6/18/40 > I have a strange problem: I haven't been able to access any page on lkml.org for a long time (no firewalls). Google's cached copy of lkml.org suggests the site has been having some trouble, but that looks recent, while I've had this problem for months.... -- Nitin
From: Nitin G. <nit...@gm...> - 2006-06-19 20:34:52
John Moser wrote: >> I don't think its good to kick out least compressible pages first. What >> ultimately matters is to have working set of running programs to stay in >> memory -- compressed or uncompressed. But then, its hard to guess >> working set of each running program...Its a big work in itself to come >> up with a good heuristic for which (and how many) pages to kick out from >> ccache when memory gets real low. >> > > Eh, that's what letting the existing swap logic figure it out is for. > Can't make it any LESS efficient than it already is right? :) Plus the > swap cache is looking out for us. > Ah..there is no swap logic for which pages to discard when swap goes (almost) full (there's one: OOM killer :) ). After all we are swapping to just another swap device which we call virtual swap. So, it is our work to develop this logic to handle case of ccache reaching its size limit (or when we want to shrink it for other reasons). > Some light LRU heuristics may be useful. Possibly time based ("If this > page has been used in the last 40,000 jiffies, try to kill something else") > So, you are saying to maintain jiffies counter for every page in ccache? Or if you store jiffies val then how will you lookup page that has jiffies val in some particular(40,000) window? This 'LRU' scheme won't work. > > It may even be possible to find an algorithm that does strict analysis > but not compression (like a hash) that spits out a value correlating to > the redundancy and encodability of the stream; but that would take a lot > of research to come up with. A really fast, really light compressor > would be easier. your hash idea is a real nice topic for MS...but I've not yet planned for that :) (btw...I searched for something like this on citeseer but no luck) WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) ..and still WKdm can't give you an idea of how well others(LZO) will compress it. > > The algorithm CI reference would be computed when the algorithm is first > loaded, before it was put into use. This is a one-time cost for each > algorithm, and can be ignored; it likely operates under a millisecond > but it wouldn't hurt anything if it ran for 100-200mS really. This > value would become persistent for the life of the algorithm (i.e. until > someone rmmod's it), so it wouldn't have to be recomputed again. > CI only makes sense when it tells how well _that_ algo compresses _this_ page. If you just want a sane initial CI value for every algo, just hard-code a value you can derive from many previous papers that compare these algos (typically WKdm, WK4x4 and LZO). > Note that given these times there could be a negative impact on > performance; on faster systems, of course, the impact becomes quickly > negligible. It is still significant for embedded systems. > > It is also notable that we can amortize the costs by conjecturing that > under normal circumstances, the system won't be swapping excessively, > and thus wouldn't be using compressed cache excessively; thus the CI > calculation can be largely ignored if it is sufficiently small. ccache is not even invoked in 'normal' conditions. It drops in when system is choking for air...err..RAM :) > > We can further amortize the costs by conjecturing that when ccache > thrashing occurs (much paging in and out of ccache), pages being sent > back and forth between memory and ccache are usually not going to be > dirty. In this case, we may be able to store the CI of the page and > only re-calculate it if the page is dirty (i.e. 
altered). Pages that > were decompressed, read, and recompressed would be unchanged and have > the same CI, thus we could skip recalculating the CI. > Yes, ccache (as I expect) will be loaded with clean page cache pages (if you choose to compress such pages) but then where will you store per-page CI values and maintain them even after a clean page is decompressed and freed from ccache? maybe a field somewhere...but this doesn't seems interesting.. >>> This would allow for the following useful elements: >>> >>> - Groups may be given a CI == sum of the CI's of each page in the group >>> - Pages each have a CI that doesn't change based on algorithm used >>> - Algorithms each have a CI to describe their efficiency >>> >>> Which can be used for: >>> >>> - It is possible to quickly take a short incremental step to try to get >>> the harder-to-compress pages into the same groups (i.e. find the first >>> group from the hardest-to-compress end that is 'wrong', find all groups >>> containing the next N hardest-to-compress pages, break those groups >>> apart, reorder them, the group is now 'proper') >>> - This allows for idle time to be used to sort pages and groups in >>> such a way that our first candidates for writing to disk are pages that >>> are 'hard to compress' (taking more memory) >>> >> Hard to compress factor should be combined with factor of having working >> set of apps within memory (and maybe other too - I don't know). > > Yes.. perhaps time based? ("This page was used some time within the > last 4000 jiffies, it's probably part of a working set"). I don't think this is the way to determine app working sets. This paper: I. Chihaia and T. Gross. Adaptive Main Memory Compression (http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf) mentions of a adaptive resizing scheme that keeps working sets of app in memory by a user space process that 'monitors' this app. But the worst thing is: -- Where is that code? Where is description of method used? googled that for hours...YMMV -- They seem to assume that only a single heavyweight app is running. While systems like OLPC are generic ones (albeit slower ones). > >>> - It is possible to use a "Light" algorithm (such as WKdm) initially, >>> and move to a "Heavy" algorithm by decompressing the block and >>> recompressing it during idle CPU time >>> - Maximize efficiency without adding unnecessary overhead while the >>> system is busy >>> >> -- compress a page with WKdm >> [CPU is idle] >> -- extract these pages compressed with WKdm from ccache >> [CPU is now busy] >> -- compress them with LZO >> -- put them again in ccache >> >> There can be lot of potential problems with this thing. We need to look >> at it carefully in detail. >> > > Yeah, like systems that are busy running a large number of programs > (i.e. building a kernel) where the CPU is working on and off in quick > spurts. You can get 80-150mS idle periods, if you try to do 200mS of > work there you just slowed the system down. > >> I was just wondering how swap prefetching patch senses CPU as 'idle' - >> what are the things it considers?, how it handles problem of 'spike' >> loads?. Will look at all this later. >> > > When the CPU is idle, linux activates the "Idle Loop." The Idle Loop > pretty much uses HLT to halt CPU use for a little bit when the system > has nothing to do. Otherwise the CPU would be going nuts all the time ;) > > At any rate, there's a mechanism in there somewhere for it. 
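One possible shape for the recompress-with-a-heavier-algorithm pass sketched above, with an explicit bail-out as soon as the system stops looking idle; the chunk_head fields, lists and every helper named here are assumptions for illustration only:

static void ccache_recompress_pass(void)
{
	struct chunk_head *ch, *tmp;

	/* Safe variant, in case a recompressed chunk moves off this list. */
	list_for_each_entry_safe(ch, tmp, &ccache_light_chunks, list) {
		/* Spike load: stop immediately and leave the rest for later. */
		if (!ccache_system_idle())
			break;

		/* Inflate the lightly-compressed chunk, then redo it harder. */
		ccache_decompress_chunk(ch, scratch_buf);
		if (ccache_compress_chunk(ch, scratch_buf, &ccache_lzo) < 0)
			ccache_compress_chunk(ch, scratch_buf, &ccache_wkdm);
	}
}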
Andreas Mohr provided these links (to know when CPU is 'idle') -- http://lkml.org/lkml/2006/6/18/39 -- http://lkml.org/lkml/2006/6/18/40 -- http://threebit.net/mail-archive/linux-kernel/msg05303.html Cheers, Nitin Gupta
From: John M. <joh...@gm...> - 2006-06-20 00:01:47
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nitin Gupta wrote: > John Moser wrote: >>> I don't think its good to kick out least compressible pages first. What >>> ultimately matters is to have working set of running programs to stay in >>> memory -- compressed or uncompressed. But then, its hard to guess >>> working set of each running program...Its a big work in itself to come >>> up with a good heuristic for which (and how many) pages to kick out from >>> ccache when memory gets real low. >>> >> >> Eh, that's what letting the existing swap logic figure it out is for. >> Can't make it any LESS efficient than it already is right? :) Plus the >> swap cache is looking out for us. >> > > Ah..there is no swap logic for which pages to discard when swap goes > (almost) full (there's one: OOM killer :) ). After all we are swapping > to just another swap device which we call virtual swap. > I mean for which pages to emit from ccache to disk. > So, it is our work to develop this logic to handle case of ccache > reaching its size limit (or when we want to shrink it for other reasons). > >> Some light LRU heuristics may be useful. Possibly time based ("If this >> page has been used in the last 40,000 jiffies, try to kill something >> else") >> > > So, you are saying to maintain jiffies counter for every page in ccache? > Or if you store jiffies val then how will you lookup page that has > jiffies val in some particular(40,000) window? This 'LRU' scheme won't > work. > could just throw a jiffies count (number of jiffies since boot time should be available) into the thing when it's requested from ccache; and before sending it to disk, check to see page->jiffies < JIFFIES_WE_THINK_MEAN_THIS_WAS_RECENTLY_WANTED I don't really know, or feel like thinking it out too far. This kind of thing can be deferred until ccache is really actually working. >> >> It may even be possible to find an algorithm that does strict analysis >> but not compression (like a hash) that spits out a value correlating to >> the redundancy and encodability of the stream; but that would take a lot >> of research to come up with. A really fast, really light compressor >> would be easier. > > your hash idea is a real nice topic for MS...but I've not yet planned > for that :) > (btw...I searched for something like this on citeseer but no luck) > > WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) > ..and still WKdm can't give you an idea of how well others(LZO) will > compress it. > In general there is a correlation. Remember compression is a characteristic of the algorithm and the data. It is computationally impossible to compress truly random data beyond a specific average ratio; however, small isolated blocks may encode well. In general we say one algorithm performs better than another because of this. 7zip gets smaller output than RAR or ZIP, consistantly, on various input datas; this is not a coincidence, the LZMA algorithm is actually better at redundancy elimination. The fact that an algorithm compresses a set of data more than the next shows that algorithm is better at eliminating redundancy; while the fact that a given algorithm compresses one piece of data better than the next shows that that data is more redundant. 
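The time-based check suggested earlier in this message could be expressed roughly as follows; last_used and CCACHE_RECENT are assumed fields/constants, while time_before() is the usual kernel helper for jiffies comparisons:

#include <linux/jiffies.h>

#define CCACHE_RECENT	40000	/* jiffies: tune, or derive from HZ */

/* Stamp on every lookup that hits ccache:  ch->last_used = jiffies; */

static int ccache_recently_used(struct chunk_head *ch)
{
	return time_before(jiffies, ch->last_used + CCACHE_RECENT);
}

/* When picking pages to push to real swap, prefer chunks for which
 * ccache_recently_used() returns false. */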
Think of it in terms of a punett square with the crossed values giving the output data size of compressing a 32K block of data: Algorithms: Data sets: X = weak compression A = minimal redundancy Y = strong compression B = much redundancy A B X 30 | 25 ----+---- Y 25 | 15 A weak algorithm with non-redundant data will produce a block only slightly smaller (if not bigger) than the input; while the same algorithm on more redundant data will produce a block that's a little smaller than that. In contrast, a strong algorithm on non-redundant data will produce a block that's somewhat smaller, and on redundant data will produce a block that's much smaller. You are correct in assuming that it's not a mathematical fact that any given algorithm can be used to predict how other algorithms will perform with the data; however, compression algorithms are basically analysis of data, and they do happen to as a side effect of their nature expose a property of the data. We can use this to conjecture that in most cases a compression algorithm will effectively describe how well other algorithms will perform, just not in a 1:1 manner; it will produce an approximate general trend. > >> >> The algorithm CI reference would be computed when the algorithm is first >> loaded, before it was put into use. This is a one-time cost for each >> algorithm, and can be ignored; it likely operates under a millisecond >> but it wouldn't hurt anything if it ran for 100-200mS really. This >> value would become persistent for the life of the algorithm (i.e. until >> someone rmmod's it), so it wouldn't have to be recomputed again. >> > > CI only makes sense when it tells how well _that_ algo compresses _this_ > page. If you just want a sane initial CI value for every algo, just > hard-code a value you can derive from many previous papers that compare > these algos (typically WKdm, WK4x4 and LZO). > See above :P >> Note that given these times there could be a negative impact on >> performance; on faster systems, of course, the impact becomes quickly >> negligible. It is still significant for embedded systems. >> >> It is also notable that we can amortize the costs by conjecturing that >> under normal circumstances, the system won't be swapping excessively, >> and thus wouldn't be using compressed cache excessively; thus the CI >> calculation can be largely ignored if it is sufficiently small. > > ccache is not even invoked in 'normal' conditions. It drops in when > system is choking for air...err..RAM :) > Under normal conditions, the system bumps the top of RAM and swaps a handful of pages out. Under abnormal conditions, the system has a working set 10% bigger than physical RAM and begins swapping data in and out every few milliseconds because it can't keep EVERYTHING it needs NOW in memory. Boot with 'mem=32M' and start KDE to see what I'm talking about. >> >> We can further amortize the costs by conjecturing that when ccache >> thrashing occurs (much paging in and out of ccache), pages being sent >> back and forth between memory and ccache are usually not going to be >> dirty. In this case, we may be able to store the CI of the page and >> only re-calculate it if the page is dirty (i.e. altered). Pages that >> were decompressed, read, and recompressed would be unchanged and have >> the same CI, thus we could skip recalculating the CI. 
>> > > Yes, ccache (as I expect) will be loaded with clean page cache pages (if > you choose to compress such pages) but then where will you store > per-page CI values and maintain them even after a clean page is > decompressed and freed from ccache? > maybe a field somewhere...but this doesn't seems interesting.. > *shrug* Don't worry about it for now anyway. It's just an optimization algorithm, worry about it in the future when ccache actually works first. >>>> This would allow for the following useful elements: >>>> >>>> - Groups may be given a CI == sum of the CI's of each page in the >>>> group >>>> - Pages each have a CI that doesn't change based on algorithm used >>>> - Algorithms each have a CI to describe their efficiency >>>> >>>> Which can be used for: >>>> >>>> - It is possible to quickly take a short incremental step to try to >>>> get >>>> the harder-to-compress pages into the same groups (i.e. find the first >>>> group from the hardest-to-compress end that is 'wrong', find all groups >>>> containing the next N hardest-to-compress pages, break those groups >>>> apart, reorder them, the group is now 'proper') >>>> - This allows for idle time to be used to sort pages and groups in >>>> such a way that our first candidates for writing to disk are pages that >>>> are 'hard to compress' (taking more memory) >>>> >>> Hard to compress factor should be combined with factor of having working >>> set of apps within memory (and maybe other too - I don't know). >> >> Yes.. perhaps time based? ("This page was used some time within the >> last 4000 jiffies, it's probably part of a working set"). > > I don't think this is the way to determine app working sets. > This paper: I. Chihaia and T. Gross. Adaptive Main Memory Compression > (http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf) > > mentions of a adaptive resizing scheme that keeps working sets of app in > memory by a user space process that 'monitors' this app. But the worst > thing is: > -- Where is that code? Where is description of method used? googled that > for hours...YMMV > -- They seem to assume that only a single heavyweight app is running. > While systems like OLPC are generic ones (albeit slower ones). > It's still the best way I can think of. An app that's got 4 segments of memory set up: [main()][hash_md5()][DATA:db_index][user_add()] main() is going to call some other stuff and probably never get returned into until the program exits; user_add() will happen when someone is administrating the program. Whenever the database is accessed, some of DATA:db_index is read through (among other chunks of the database); and hash_md5() is used to check for database corruption. In the normal course of the program, you'll probably find that, when the program is active, user_add() is rarely used; main() is pretty much NEVER used; hash_md5() is used every time the program does ANYTHING; and DATA:db_index is read through every time someone needs to find something in the database. In these cases, we can say that the program's current working set probably excludes main() and user_add() purely because they haven't been used recently; and probably includes hash_md5() and DATA:db_index because they've been used very recently. If the program is asleep, and rarely has any activity, we can probably page it out without anyone caring much in favor of some other program that's currently hammering the CPU. 
>> >>>> - It is possible to use a "Light" algorithm (such as WKdm) initially, >>>> and move to a "Heavy" algorithm by decompressing the block and >>>> recompressing it during idle CPU time >>>> - Maximize efficiency without adding unnecessary overhead while the >>>> system is busy >>>> >>> -- compress a page with WKdm >>> [CPU is idle] >>> -- extract these pages compressed with WKdm from ccache >>> [CPU is now busy] >>> -- compress them with LZO >>> -- put them again in ccache >>> >>> There can be lot of potential problems with this thing. We need to look >>> at it carefully in detail. >>> >> >> Yeah, like systems that are busy running a large number of programs >> (i.e. building a kernel) where the CPU is working on and off in quick >> spurts. You can get 80-150mS idle periods, if you try to do 200mS of >> work there you just slowed the system down. >> >>> I was just wondering how swap prefetching patch senses CPU as 'idle' - >>> what are the things it considers?, how it handles problem of 'spike' >>> loads?. Will look at all this later. >>> >> >> When the CPU is idle, linux activates the "Idle Loop." The Idle Loop >> pretty much uses HLT to halt CPU use for a little bit when the system >> has nothing to do. Otherwise the CPU would be going nuts all the time ;) >> >> At any rate, there's a mechanism in there somewhere for it. > > Andreas Mohr provided these links (to know when CPU is 'idle') > -- http://lkml.org/lkml/2006/6/18/39 > -- http://lkml.org/lkml/2006/6/18/40 > -- http://threebit.net/mail-archive/linux-kernel/msg05303.html > cool. CPU 'idles' btw by entering a loop that does absolutely nothing; however the kernel uses the HLT instruction to actually halt the CPU so it actually processes nothing for a time. > > > Cheers, > Nitin Gupta > > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQIVAwUBRJc5Wws1xW0HCTEFAQLD2g/+LdhieH0T6OPCveThzBfPtZOI/1ZuL9Hm zqGLdMoZUk83OWOGVMkTmpn4pZEa/QzSAoZd97SkgFrBjfdLBUxZi0tp3i++2x58 bW33wuS/NH0OyCu9Up+lGBQ5LPfVdMk7/9CiuGdOFSvX2t1Xt5K7S1hd/QRpJE50 6jlZpcvnStkEHHwMSi8nGDw2ntUIQwnXAptjXUq/fFEbPJaWNGDB+p3I9qfwg2gY PfMLgkdobhWcbtp6pcHdpyUcFkSYA4FFIggWpcDQw+TpWpMuGVbrPH9sMHjf7jim XN+uUM7tqPtCg1GzDcWthoEF9u+DBxD39maOT/3PD4VMFMIX5bYjyKbohuuf0VIv +5AvfFfNAjOwlpbVZsKbsCt2NXOi2Z+3+3JobRjds7PZ2XJPAKiA4SpOY99vCu5p aZw7ERfp/elFGI8DRMeRSXGVVf0ZUnQCPDqNH+Nze8VBRecdK+jd8sp068dtXw5z 6YOwZARlEfVxj2Hx+4+iefEOWtzpztPBD/NhxpiHOpSWmsIEhj3XrX5YLtYWEDHl UUgcNUZ7BFV9n3nnjW8mYU94VvJJLRbSP2YM08/HGPV5J7CPWJKBISd+k7umUV3h kkqmTeHFnjNB8CoGWmEHatkUpVKTKGDttEK7MVAChAy4jRypHX/tIqdfohalZzxr ImTpgaFc5vw= =+mxu -----END PGP SIGNATURE----- |
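A minimal sketch of the jiffies check John floats above; the struct, field, and threshold are made-up names, while jiffies and time_before() are the real kernel primitives (time_before() also handles jiffies wrap-around):

#include <linux/jiffies.h>

#define CC_RECENT_WINDOW	40000UL	/* "recently wanted" threshold, in jiffies */

struct cc_page {			/* hypothetical per-page bookkeeping */
	unsigned long last_used;	/* jiffies value stamped on ccache lookup */
	/* ... compressed data, size, flags ... */
};

/* called whenever the page is handed back out of ccache */
static inline void cc_page_touch(struct cc_page *cp)
{
	cp->last_used = jiffies;
}

/* called when picking victims to push out to the real swap disk:
 * skip pages that look like part of someone's working set */
static inline int cc_page_recently_used(const struct cc_page *cp)
{
	return time_before(jiffies, cp->last_used + CC_RECENT_WINDOW);
}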
From: Nitin G. <nit...@gm...> - 2006-06-20 16:16:47
|
John Moser wrote: > Nitin Gupta wrote: >> John Moser wrote: >>> It may even be possible to find an algorithm that does strict analysis >>> but not compression (like a hash) that spits out a value correlating to >>> the redundancy and encodability of the stream; but that would take a lot >>> of research to come up with. A really fast, really light compressor >>> would be easier. >> your hash idea is a real nice topic for MS...but I've not yet planned >> for that :) >> (btw...I searched for something like this on citeseer but no luck) >> >> WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) >> ..and still WKdm can't give you an idea of how well others(LZO) will >> compress it. >> > > In general there is a correlation. Remember compression is a > characteristic of the algorithm and the data. It is computationally > impossible to compress truly random data beyond a specific average > ratio; however, small isolated blocks may encode well. > > In general we say one algorithm performs better than another because of > this. 7zip gets smaller output than RAR or ZIP, consistantly, on > various input datas; this is not a coincidence, the LZMA algorithm is > actually better at redundancy elimination. The fact that an algorithm > compresses a set of data more than the next shows that algorithm is > better at eliminating redundancy; while the fact that a given algorithm > compresses one piece of data better than the next shows that that data > is more redundant. > > Think of it in terms of a punett square with the crossed values giving > the output data size of compressing a 32K block of data: > > Algorithms: Data sets: > X = weak compression A = minimal redundancy > Y = strong compression B = much redundancy > > A B > X 30 | 25 > ----+---- > Y 25 | 15 > > A weak algorithm with non-redundant data will produce a block only > slightly smaller (if not bigger) than the input; while the same > algorithm on more redundant data will produce a block that's a little > smaller than that. In contrast, a strong algorithm on non-redundant > data will produce a block that's somewhat smaller, and on redundant data > will produce a block that's much smaller. > > You are correct in assuming that it's not a mathematical fact that any > given algorithm can be used to predict how other algorithms will perform > with the data; however, compression algorithms are basically analysis of > data, and they do happen to as a side effect of their nature expose a > property of the data. We can use this to conjecture that in most cases > a compression algorithm will effectively describe how well other > algorithms will perform, just not in a 1:1 manner; it will produce an > approximate general trend. This seems very logical and we can confirm this at least for particular cases of WKdm, WK4x4 and LZO. If you have some time, it would be awesome if you take 'compress-test' kernel module: http://linux-mm.org/CompressedCaching/Code?action=AttachFile&do=get&target=compress-test.tar.gz and feed all three algos (included with this module), data from /lib (binary files), normal text files and other kind of data you can think of, and compare how effectively each compresses the data. This thing can be easily automated and lots of data on this will allow us to see if we can really conjecture how well performance of WKdm (lightest) describes performance of other two. If you are interested in doing this, I can send you all the details for this module. 
In fact, all you need to do is feed some data (in single pages) to /proc/compress_test/compress and then read this same file to see the compressed size. You can dynamically switch algos by writing 0, 1, or 2 to /proc/compress_test/algo_idx. --------------------------- [couldn't now think of a reply for rest of your mail] --------------------------- Also, I have some doubts on the page grouping you mentioned earlier: AFAIK, Rodrigo's patch never compressed pages in groups - all pages were compressed individually. The page grouping you were mentioning was the size of 'cells'. So, what is the meaning of grouping of pages w.r.t the 'chunked' structure? Though we can compress more than one page together and then store them in chunks, this was not what you were referring to...right? Cheers, Nitin Gupta |
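For reference, a small user-space sketch of driving the compress-test /proc interface described above; only the write-page / read-back / algo_idx usage comes from the mail, and the exact text returned by reading /proc/compress_test/compress is an assumption here:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SZ 4096

static void select_algo(int idx)	/* 0 = WKdm, 1 = WK4x4, 2 = LZO */
{
	FILE *f = fopen("/proc/compress_test/algo_idx", "w");

	if (!f) { perror("algo_idx"); exit(1); }
	fprintf(f, "%d", idx);
	fclose(f);
}

int main(int argc, char **argv)
{
	unsigned char page[PAGE_SZ];
	char result[256];
	FILE *in, *proc;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <file> <algo_idx>\n", argv[0]);
		return 1;
	}

	in = fopen(argv[1], "rb");
	if (!in || fread(page, 1, PAGE_SZ, in) != PAGE_SZ) {
		fprintf(stderr, "could not read one full page from %s\n", argv[1]);
		return 1;
	}
	fclose(in);

	select_algo(atoi(argv[2]));

	/* feed one page of data to the module ... */
	proc = fopen("/proc/compress_test/compress", "wb");
	if (!proc) { perror("compress (write)"); return 1; }
	fwrite(page, 1, PAGE_SZ, proc);
	fclose(proc);

	/* ... then read the same file back to see the compressed size */
	proc = fopen("/proc/compress_test/compress", "r");
	if (!proc) { perror("compress (read)"); return 1; }
	if (fgets(result, sizeof(result), proc))
		fputs(result, stdout);
	fclose(proc);

	return 0;
}

Looping this over every 4096-byte block of a few files from /lib, some text files, and other data would give the WKdm-vs-WK4x4-vs-LZO comparison being asked for.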
From: John M. <joh...@gm...> - 2006-06-20 18:42:13
|
Nitin Gupta wrote: > John Moser wrote: >> Nitin Gupta wrote: >>> John Moser wrote: >>>> It may even be possible to find an algorithm that does strict analysis >>>> but not compression (like a hash) that spits out a value correlating to >>>> the redundancy and encodability of the stream; but that would take a >>>> lot >>>> of research to come up with. A really fast, really light compressor >>>> would be easier. >>> your hash idea is a real nice topic for MS...but I've not yet planned >>> for that :) >>> (btw...I searched for something like this on citeseer but no luck) >>> >>> WKdm is lightest I know...something lighter is maybe dear ol' >>> memcpy() ;) >>> ..and still WKdm can't give you an idea of how well others(LZO) will >>> compress it. >>> >> >> In general there is a correlation. Remember compression is a >> characteristic of the algorithm and the data. It is computationally >> impossible to compress truly random data beyond a specific average >> ratio; however, small isolated blocks may encode well. >> >> In general we say one algorithm performs better than another because of >> this. 7zip gets smaller output than RAR or ZIP, consistantly, on >> various input datas; this is not a coincidence, the LZMA algorithm is >> actually better at redundancy elimination. The fact that an algorithm >> compresses a set of data more than the next shows that algorithm is >> better at eliminating redundancy; while the fact that a given algorithm >> compresses one piece of data better than the next shows that that data >> is more redundant. >> >> Think of it in terms of a punett square with the crossed values giving >> the output data size of compressing a 32K block of data: >> >> Algorithms: Data sets: >> X = weak compression A = minimal redundancy >> Y = strong compression B = much redundancy >> >> A B >> X 30 | 25 >> ----+---- >> Y 25 | 15 >> >> A weak algorithm with non-redundant data will produce a block only >> slightly smaller (if not bigger) than the input; while the same >> algorithm on more redundant data will produce a block that's a little >> smaller than that. In contrast, a strong algorithm on non-redundant >> data will produce a block that's somewhat smaller, and on redundant data >> will produce a block that's much smaller. >> >> You are correct in assuming that it's not a mathematical fact that any >> given algorithm can be used to predict how other algorithms will perform >> with the data; however, compression algorithms are basically analysis of >> data, and they do happen to as a side effect of their nature expose a >> property of the data. We can use this to conjecture that in most cases >> a compression algorithm will effectively describe how well other >> algorithms will perform, just not in a 1:1 manner; it will produce an >> approximate general trend. > > This seems very logical and we can confirm this at least for particular > cases of WKdm, WK4x4 and LZO. 
> > If you have some time, it would be awesome if you take 'compress-test' > kernel module: > > http://linux-mm.org/CompressedCaching/Code?action=AttachFile&do=get&target=compress-test.tar.gz > bluefox@iceslab:~/compress-test$ sudo ./testscript.sh ==== LIBRARY FILE TEST ==== Block 1 of /lib/libc.so.6 (4096 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000241 seconds, 17.0 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1000 (0 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000443 seconds, 9.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 952 (0 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000432 seconds, 9.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1593 (1 kB) Block 2 of /lib/libc.so.6 (8192 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 6.1e-05 seconds, 67.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1676 (1 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000197 seconds, 20.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1592 (1 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000177 seconds, 23.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2620 (2 kB) Block 3 of /lib/libc.so.6 (12288 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 9.6e-05 seconds, 42.7 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2512 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000162 seconds, 25.3 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2328 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000241 seconds, 17.0 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2877 (2 kB) Block 4 of /lib/libc.so.6 (16384 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000104 seconds, 39.4 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2692 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000158 seconds, 25.9 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2528 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000223 seconds, 18.4 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2854 (2 kB) Block 5 of /lib/libc.so.6 (20480 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.0001 seconds, 41 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2764 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000257 seconds, 15.9 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2580 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00028 seconds, 14.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2854 (2 kB) ==== RANDOM DATA TEST ==== Block 1 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001142 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 
records out 4096 bytes (4.1 kB) copied, 0.001151 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001132 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 2 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001079 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00115 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001254 seconds, 3.3 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 3 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001086 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001187 seconds, 3.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001262 seconds, 3.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 4 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001074 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00117 seconds, 3.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001276 seconds, 3.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 5 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000997 seconds, 4.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00101 seconds, 4.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001118 seconds, 3.7 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) (interesting results with random data) It looks like on libc, WKdm and WK4x4 do well but LZO fails to perform; while more random data squeezes down tighter with LZO. It's interesting that the random data sizes are consistent; I used /dev/urandom to calculate pi once, it's VERY good at producing random data, I am assuming thus that we are probably looking at the theoretical limits of compression on 4096 bytes of data (i.e. they get bigger). It does appear that there is a correlation here. For libc, WKdm and WK4x4 compress better than LZO; but in general, if a piece of data A compresses smaller than a piece of data B under one algorithm, then it also compresses smaller under the other algorithms. Side by side: /lib/libc.so.6 Block WKdm WK4x4 LZO 1 1000 952 1593 2 1676 1592 2620 3 2512 2328 2877 Notice as the output size of WKdm grows, so does the output size of WK4x4 and LZO. 
I have made note that for library data it seems in terms of output size LZO > WKdm > WK4x4; whereas for random data it seems to progress WKdm > WK4x4 > LZO. It is probably possible to write a small period analysis program to determine which to use; a simple monte-carlo could detect more random data, but counting the 0 bytes is also significant (WKdm and WK4x4 treat 0 as a special case, due to C memory alignment and unused bytes of memory pages tending to be filled with 0's). A quick test on the first 4096 bytes of each of /dev/urandom and /lib/libc.so.6 gives: bluefox@icebox:/tmp/x$ ./mcarlo hits = 1607, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.423745e+02 <--- Very random (high quality entropy) Nulls: 15 hits = 1595, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.793874e+01 Nulls: 13 hits = 1593, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.304198e+01 Nulls: 16 hits = 1592, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.103888e+01 Nulls: 17 hits = 1606, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 2.051743e+02 Nulls: 18 hits = 1570, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.330028e+01 Nulls: 19 hits = 1612, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.460953e+02 Nulls: 9 hits = 1578, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.678940e+01 Nulls: 13 hits = 1600, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 6.026764e+01 Nulls: 14 hits = 1621, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 4.094506e+01 Nulls: 19 hits = 2038, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.192071e+00 <--- Very not random (low quality) Nulls: 2341 hits = 2046, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.170274e+00 Nulls: 3448 <--- Lots of '\0' (WK-friendly!) hits = 2047, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.167605e+00 Nulls: 2646 hits = 2011, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.272035e+00 Nulls: 2194 hits = 1994, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.328130e+00 Nulls: 2022 hits = 1989, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.345582e+00 Nulls: 2032 hits = 1990, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.342055e+00 Nulls: 2019 hits = 1990, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.342055e+00 Nulls: 2029 hits = 1980, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.378180e+00 Nulls: 2025 hits = 1988, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.349127e+00 Nulls: 2008 > > and feed all three algos (included with this module), data from /lib > (binary files), normal text files and other kind of data you can think > of, and compare how effectively each compresses the data. This thing can > be easily automated and lots of data on this will allow us to see if we > can really conjecture how well performance of WKdm (lightest) describes > performance of other two. > > If you are interested in doing this, I can send you all the details for > this module. In fact, all you need to do is feed some data (in single > pages) to /proc/compress_test/compress and then read this same file to > see compressed size. You can dynamically switch algos by writing 0, 1, > or 2 to /proc/compress_test/algo_idx. 
> > --------------------------- > [couldn't now think of a reply for rest of your mail] > --------------------------- > > Also, I have some doubts on page grouping you mentioned earlier: > > AFAIK, Rodrigo's patch never compressed pages in groups - all pages were > compressed individually. The page grouping you were mentioning about was > the size of 'cells'. So, what is the meaning of grouping of pages w.r.t > 'chunked' structure? +Double Page Size +CONFIG_COMP_DOUBLE_PAGE + + Select this option to make compressed cache use double sized pages + (two pages contiguos) instead of single memory pages. That increases + the effect of the compressed cache use even when the compression + ratio isn't very good. On i386, compressed cache pages will have + 8KiB instead of the regular 4KiB. + + The size of the compressed cache page is listed in the output of + /proc/comp_cache_stat. + + If unsure, say Y here. Not sure. I can't tell at a glance how the logic works (I have difficulty figuring out the mm/ code), you may be right. > > Though we can compression more than one page together and then store > them in chunks but this was not you were mentioning about...right? > Yeah I was thinking crunch together multiple pages at once... > > Cheers, > Nitin Gupta > |
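A sketch of what "crunching together multiple pages at once" could look like on the compression side; the group size and the ccache_compress() hook are placeholders, the point is only that the compressor sees one contiguous buffer so redundancy across page boundaries can be exploited:

#include <stddef.h>
#include <string.h>

#define GRP_PAGES	8			/* e.g. 8 x 4KiB = 32KiB groups */
#define GRP_BYTES	(GRP_PAGES * 4096)

/* hypothetical compressor hook: returns number of bytes written to 'out' */
size_t ccache_compress(const void *in, size_t in_len, void *out);

static size_t compress_group(void *pages[GRP_PAGES], void *out)
{
	static unsigned char buf[GRP_BYTES];	/* scratch buffer; not reentrant */
	int i;

	/* gather the group into one contiguous input buffer ... */
	for (i = 0; i < GRP_PAGES; i++)
		memcpy(buf + i * 4096, pages[i], 4096);

	/* ... and compress it as a single unit */
	return ccache_compress(buf, GRP_BYTES, out);
}

The flip side is that restoring any single page means decompressing the whole group, which is exactly the cost/benefit trade-off being debated in this thread.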
From: John M. <joh...@gm...> - 2006-06-21 00:00:13
Attachments:
signature.asc
mcarlo_optimal.c
|
John Moser wrote: > > A quick test on the first 4096 bytes of each of /dev/urandom and > /lib/libc.so.6 gives: > A bunch of numbers for libc that are only a few digits away. I've written a more optimized version that processes 2 longs at a time to save cycle count by about 8 times on 32-bit and 16 times on 64-bit. Output below: bluefox@icebox:/tmp/x$ ./mcarlo_optimal hits = 401, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.138932e+02 Nulls: 22 hits = 391, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.150680e+01 Nulls: 14 hits = 384, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 7.062513e+00 Nulls: 20 hits = 413, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.176888e+01 Nulls: 21 hits = 394, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.575606e+01 Nulls: 11 hits = 392, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.264340e+01 Nulls: 18 hits = 416, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 9.224467e+00 Nulls: 11 hits = 407, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 2.625027e+01 Nulls: 31 hits = 397, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 2.498117e+01 Nulls: 13 hits = 396, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 2.090185e+01 Nulls: 18 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2341 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 3448 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2646 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2194 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2022 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2032 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2019 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2029 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2025 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2008 Notice /dev/urandom still gets over 7 and usually over 10, 20, or even over 100 for a Quality Index; while libc6 still never touches 2. Instead of comparing 2 bytes, this code compares longs, which (in this example) groups 4 bytes together and analyzes them to determine a hit or miss. As such, the slight variations in libc6 smooth out and we basically hit all the time. In case you're wondering, what we're 'hitting' is a circle. This algorithm mathematically inscribes a circle in a square; cuts the square down to the top-right corner; and tests where random values fall. ________ | ..--.. |<-- square |/ \| | X | X = Inside Circle | | |\ /| |_--__--_| ____ |-.. |<-- Top-right quadrant of square | X \| |____| X = inside circle If they fall a distance of 1.0 or less from bottom-left of this quadrant, they're in the circle; if not, they're outside the circle. The number of hits inside the circle over the total shots gives the value of pi; we can approximate this by calculating hits/shots for the quadrant and multiplying by 4. Simple monte-carlo algorithm. The optimized version processes 8 or 16 bytes at a time using CPU-native words; and uses one pass of AND comparisons and bit shifts to test how many NULL bytes are processed. 
There are a lot of pointer dereferences; the compiler will cache these for us so they won't cost cycles. Note that in practice we can probably discard the entire monte-carlo calculation, since having a lot of NULs is probably enough to justify using the Wilson-Kaplan family of algorithms over Lempel-Ziv. Using the monte-carlo based quality index is not good as a general compression index; compressed data has a high Q.Idx despite not being compressible. |
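For what it's worth, the quarter-circle test described above can be expressed with integers only; this is an illustrative sketch, not the attached mcarlo_optimal.c:

#include <stddef.h>

struct mc_result {
	unsigned int hits;
	unsigned int shots;
	unsigned int nulls;
};

/* Treat consecutive byte pairs as (x, y) shots into a 256x256 quadrant
 * and count the ones that land inside the inscribed quarter circle.
 * For random data hits/shots tends toward pi/4 (~0.78); for structured
 * data full of small values and zeros it tends toward 1. */
static void mc_scan(const unsigned char *buf, size_t len, struct mc_result *r)
{
	size_t i;

	r->hits = r->shots = r->nulls = 0;

	for (i = 0; i + 1 < len; i += 2) {
		unsigned int x = buf[i], y = buf[i + 1];

		if (!x)
			r->nulls++;
		if (!y)
			r->nulls++;

		r->shots++;
		if (x * x + y * y <= 255u * 255u)
			r->hits++;
	}
}

The numbers above fit that pattern: /dev/urandom lands around 384-416 hits out of 512 shots (roughly pi/4), while libc hits 512 out of 512 every time.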
From: Nitin G. <nit...@gm...> - 2006-06-21 22:14:47
|
John Moser wrote: >> Just a quick note: We can't have floating point in kernel. So, I doubt >> how will we have monte-carlo algorithm ported to kernel? >> > > Nods. I don't know if it can be rewritten in fixed point; however, it > is notable that mcarlo is probably not as useful as counting null bytes. > Wilson-Kaplan algorithms treat 0 as a special case; think about > anonymous memory: > > ADDR DATA > 0 char a > \0 (pad) > \0 (pad) > 4 int i > int i > int i > int i > 8 short j > short j > \0 (pad) > \0 (pad) > 12 char b > \0 (pad) > \0 (pad) > \0 (pad) > 16 .... > > Data values will often be aligned to 4 bytes; stack frames tend to be > aligned to 16 bytes I guess (STACK_ROUND() rounds to 16 bytes). > Executable memory is even worse; loops and small functions and other > code blocks tend to be aligned to page boundaries, so you have things > that look like: > > > ...; > ...; > jmp @a; > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > @a: /* Aligned to 4096 byte page boundary */ > ...; > loop; > ...; > ret; /*return*/ > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > function foo() { /* Function aligned to a page boundary */ > ........ > > And so on. You get small clumps of zeros here and there that don't > really compress any better than small clumps of A's or 7's or such; but > WKdm and WK4x4 expect these so they have a special way to compress them > with lots more efficiency than LZO. > /* --------------- Frame This One ---------------------------------------- */ > At any rate, just running through with a NULL byte counter is probably > enough to decide whether to use WK or LZ. > /* ------- Simple, Fast and Reliable ------------------------------------- */ Great suggestion John Moser :-) And..I think, this NULL byte count can be really optimized to a level that time to perform this _each_ time when compressing pages should be far compensated by time and space we'll save by having a good hit at selecting right compression algorithm. Cheers, Nitin Gupta |
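A minimal sketch of that NULL-byte count, walking the page one unsigned long at a time; the threshold and names are placeholders, only the word-at-a-time scan itself is the point:

#define CC_PAGE_BYTES		4096
#define CC_WK_NUL_THRESHOLD	512	/* hypothetical: 1/8 of the page is NUL */

static unsigned int cc_count_nul_bytes(const void *page)
{
	const unsigned long *p = page;
	unsigned int i, b, nuls = 0;

	for (i = 0; i < CC_PAGE_BYTES / sizeof(unsigned long); i++) {
		unsigned long w = p[i];

		if (!w) {			/* all-zero word: the common case */
			nuls += sizeof(unsigned long);
			continue;
		}
		for (b = 0; b < sizeof(unsigned long); b++)
			if (!((w >> (b * 8)) & 0xff))
				nuls++;
	}
	return nuls;
}

/* 1 => prefer WKdm/WK4x4, 0 => prefer LZO */
static inline int cc_prefer_wk(const void *page)
{
	return cc_count_nul_bytes(page) >= CC_WK_NUL_THRESHOLD;
}

Whether 512 NULs per page is the right cut-off is exactly the kind of question the compress-test data can answer.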
From: John M. <joh...@gm...> - 2006-06-22 00:04:15
|
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nitin Gupta wrote: > > John Moser wrote: >>> Just a quick note: We can't have floating point in kernel. So, I doubt >>> how will we have monte-carlo algorithm ported to kernel? >>> >> >> Nods. I don't know if it can be rewritten in fixed point; however, it >> is notable that mcarlo is probably not as useful as counting null bytes. >> Wilson-Kaplan algorithms treat 0 as a special case; think about >> anonymous memory: [......] > /* --------------- Frame This One > ---------------------------------------- */ >> At any rate, just running through with a NULL byte counter is probably >> enough to decide whether to use WK or LZ. >> > /* ------- Simple, Fast and Reliable > ------------------------------------- */ > > Great suggestion John Moser :-) > > And..I think, this NULL byte count can be really optimized to a level > that time to perform this _each_ time when compressing pages should be > far compensated by time and space we'll save by having a good hit at > selecting right compression algorithm. > Optimizing, you can probably have it escape out when it hits a certain number of NULLs; but you still want to avoid running it every time. Mostly a page of memory is going to stay what it is; the stack, for example, will always have loads of padding in it except for the case of large on-stack buffers full of text and such (typically such buffers are short and negligible; large ones would mean more than say half a page?). At the very least, there's a lot of memory that won't change, and we can avoid re-working this when we see that stuff come back. This is especially true of mmap()'d segments of executables and libraries, although those aren't normally swapped anyway (they get invalidated and cleared from memory, re-read from disk later; in the future compressed caching could compress that stuff and just invalidate/free it when memory pressure is high rather than sending it to swap). There's other data that behaves this way, though; for example, most of the control information for GUIs like GTK won't change once the window is initially built and drawn.. > > > Cheers, > Nitin Gupta > > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQIVAwUBRJnc5gs1xW0HCTEFAQJ5mRAAgfK7ZJFGudGjZpW9yqtXeCUGMCg8E2ss 8KLEl+bT+Jzja8+Qh+DYk1N6mMx4G7l0w/wR+M4JdnDEC2v/EzGMsbQ2IGWsEEl0 7eKpCUqvJUNiwaiNhsdoUy2BrC/6H3EXt4EEoqJtkFGAY1XTqT3kzHzopFlcFDPC XOQaLMvnDGIIVLarH+CY6SfWvBv6OcEYTLYFg43FWhZyOhmZboQGAy6CAp9XYELq 0MsmSeM1FA4cHQOAqd68HCY23ZubS1sy/IrnfFaxp3z5M/YQr797sMLceyPjM2Ao E/ZI3eVbtG+Qm+ZKqjpjGB+d/4MlD5brr/MlEpQqVXmGNkNTCMiwLW29r3/6or28 BIwkZ08kLOU/plqa5jum9nvb7IfQVK9ITUTRC39J9+xETCRAIV5mgR7PSCaIo7jm bbA2L6dizHZIfr7uwYIQB8dCT2+qKZipOCPHyeDtVfacJ1xPwch3hImjh0Jdg6KK lePMN52MLVlRp6fL7n3eEkMYXb9q2r3BzJ1jT+XBtF4xgfZXFxWpPEvUlhuFcZtD bxSS/cslgyHDajxMl4N9jffTd6wYKBMwsFKllNQO6YVSSS6T7KGpPRdwIAOj9Wq4 6/3HFmDQRsDjgsnwkFZx1toViSTTsqk518uJPgqFnT2x+6stC8hk8T0GvMCCAmQi kXZK9r5RaYQ= =l3kb -----END PGP SIGNATURE----- |
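One possible shape for the "avoid re-working this when it comes back" point, purely illustrative: cache the algorithm choice next to the compressed entry and only redo the NUL scan when the page comes back dirty. The struct, the hint field, and cc_prefer_wk_page() are invented here; PageDirty() is the real page-flag test:

#include <linux/mm.h>
#include <linux/page-flags.h>

int cc_prefer_wk_page(struct page *page);	/* hypothetical: kmap the page and
						 * run the NUL-count check on it */

struct cc_entry {				/* hypothetical per-entry bookkeeping */
	unsigned char algo_hint;		/* 0 = unknown, 1 = WK, 2 = LZO */
	/* ... */
};

static int cc_pick_algo(struct page *page, struct cc_entry *e)
{
	/* clean page seen before: its contents have not changed,
	 * so the old decision is still valid */
	if (e->algo_hint && !PageDirty(page))
		return e->algo_hint;

	e->algo_hint = cc_prefer_wk_page(page) ? 1 : 2;
	return e->algo_hint;
}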