Re: [lc-devel] Compressed caching
From: John M. <joh...@gm...> - 2006-06-18 17:07:20
Nitin Gupta wrote:
> John Moser wrote:
>> Nitin Gupta wrote:
>>> John Moser wrote:
>>>> Nitin Gupta wrote:
>>>>> On 6/17/06, John Moser <joh...@gm...> wrote:
>>>>> * Storing pages:
>>>>> -- First, it creates a 'virtual swap', which is a swap with its own swap_map. This swap is given the highest priority, so when pages are to be swapped out, the existing code path naturally/transparently uses this swap device (until it's full) for swapping out. (Preliminary patches for this will be posted very soon on lc-devel.)
>>>>
>>>> (See, I don't understand the swap system that well :)
>>>>
>>>> I was thinking eventually I'd have to convince the kernel to *try* to swap even if there was no swap device.
>>>>
>>>> Once this swap device is full it needs to start pushing its pages to a REAL swap device, i.e. on disk.
>>>>
>>> This is exactly what this virtual swap is meant for :)
>>> Pages assigned to this virtual swap will eventually go to ccache and, when it's full, pages are sent to the real swap disk -- all this with minimal code intrusion. This approach works even if there are no real swap disks -- only the virtual swap will be used in that case.
>>>
>>
>> Awesome. And the swap cache and existing LRU logic is all used, right? (These are important to get memory usage patterns right; we should theoretically be able to send pages physically to disk in any order we want without being less efficient than general Linux...)
>>
>
> Yes!
>
>>>>> -- In shrink_page_list() it calls should_add_to_ccache() to determine whether the page can really be added to ccache. It can consider things like the number of pages already in ccache, CPU load, the type of page (anon or fs-backed), and some heuristics that tell whether the page is even worth the memory space it will take (no such heuristic is worked out yet, but it is a really important piece).
>>>>>
>>>> I was thinking of putting that code in a module. The page would be compressed with a fast reference algorithm to get a Compression Index, CI = PAGE_SIZE - COMPRESSED_PAGE_SIZE. Squeezing down a single page is almost instant anyway.
>>>>
>>>> As for "type of page," certain FS-backed pages could be freed, others would have to be persistent. I was thinking more of letting the calling code mark pages that could be freed as "transient" and just freeing them if the compressed memory area got too big, rather than swapping.
>>>>
>>> Again, this is exactly what's happening. Pages guessed to be swapped out by should_add_to_ccache() are marked with the PG_will_compress flag. But this does not guarantee that they will really be compressed. If compression finally fails (lack of memory, high CPU load, poor compressibility, etc.), this flag is cleared and everything else goes on as usual. If it really gets compressed, it's marked with PG_compressed. Thus it goes with your two-step 'transient' approach... I think.
>>>
>>
>> Transient pages == come from the disk cache system. If the disk cache decides it wants some page back and we don't have it, it can go read it back from disk (i.e. /lib/*, executables, stuff in /usr...)
>>
>>>>> -- Then pages filtered by should_add_to_ccache() go to pageout() as usual, where they are sent to add_to_ccache(), which compresses them and adds them to the compression structure. If add_to_ccache() fails for some reason (e.g. insufficient RAM), the page falls back to normal swap disks (the rest of the pageout() code).
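
Just to check that I'm following the filter you describe, here's roughly the shape I picture for should_add_to_ccache() -- a stand-alone sketch with made-up names, a made-up limit, and a stub page type, not your actual code:

/* Hypothetical sketch only -- the real should_add_to_ccache() lives in
 * the kernel and works on struct page; this stand-alone version just
 * shows the kind of checks described above (ccache occupancy, page
 * type), with a placeholder where the compressibility/CPU-load
 * heuristic would plug in. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct fake_page {                 /* stand-in for struct page */
    bool anon;                     /* anonymous vs. fs-backed  */
    bool dirty;
};

#define MAX_ANON_CC_PAGES 16384    /* illustrative limit, cf. max_anon_cc_size */

static size_t ccache_pages;        /* pages currently held in ccache */

static bool should_add_to_ccache(const struct fake_page *page)
{
    if (ccache_pages >= MAX_ANON_CC_PAGES)
        return false;              /* ccache already at its limit */

    if (!page->anon && !page->dirty)
        return false;              /* clean fs-backed page: cheaper to just drop it */

    /* An "is this page worth the space?" heuristic and a CPU-load check
     * would go here; neither is worked out yet, as noted above. */
    return true;
}

int main(void)
{
    struct fake_page p = { .anon = true, .dirty = true };

    printf("add to ccache? %s\n", should_add_to_ccache(&p) ? "yes" : "no");
    return 0;
}

The real thing would of course sit inside shrink_page_list(); this only shows the occupancy and page-type checks plus the spot where the heuristic goes.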
>>>> I was thinking of having the Swap Zone code supply the code to write to physical disk, and just having stuff pass through it:
>>>>
>>>> RAM -> Swap Zone -> Compress -> NOPE, DOESN'T SHRINK -> Swap File
>>>>
>>>> This was also mainly for when the compressed area got too big and needed shrinkage (by, of course, real swapping). Of course, compression itself was going to be a module, so normal operation was:
>>>>
>>>> RAM -> Swap Zone -> Swap File
>>>>
>>>> until such time as the compress module was loaded. :)
>>>>
>>> As I understand it, you are really after an approach that can easily (and with minimum code intrusion in the kernel) fall back to real swap disks when compression can't be done. This is handled well in the current approach.
>>>
>>
>> Yeah. I was looking at a different use of the same code later, though :) So far your design looks technically about the same as mine.
>>
>
> Hmm... somewhat different:
> -- I am not looking to have multi-stage compression, i.e. first compress with the quickest algorithm, then later recompress with heavyweights when the CPU is idle.
> -- Not (yet) considering grouping of pages for compression.
> -- And I've not yet started work on a heuristic for adaptive ccache size.
>
>>>>> * Restoring pages:
>>>>> Whenever the swap-cache/page-cache is looked up for a page (find_get_page(), find_get_pages(), etc.), we check the PG_compressed flag. If it's set, we decompress() the page and return it.
>>>>>
>>>> I skipped pages compressed in place because compressing individual pages is non-interesting. I noticed 15-20% compression with 4K pages on Castro's patch; pushing this up to 8KiB, 16KiB, 32KiB, and 64KiB, I noticed gains up to 50-60% when we hit 32KiB and nothing more beyond.
>>>>
>>>> I'm pretty sure that there's an asymptotic limit at 20, 24, 28, or 32KiB (the lower the better); I want to compress pages in groups of whatever that limit falls on for maximum performance.
>>>>
>>>> This makes the approach I had in mind a lot more complex. Pages would be grouped using a grouping algorithm; idle time would be used to disrupt groups and reconstruct more optimal groupings; and in the end we would also attempt to swap out more difficult-to-compress pages first.
>>>>
>>>> (In fact this is why I want to find the smallest group for the asymptotic limit -- at 32KiB, 8 4KiB pages, my regrouping algorithm has to decompress, rearrange, and recompress up to 9 groups, 288KiB! If it were, e.g., 20KiB, that'd be 6 x 5 = 120KiB.)
>>>>
>>> This is a great point! I really didn't even consider grouping pages together for compression -- because I was unaware of its effect on compression ratios and because of the additional complexity it involves. This grouping of pages is definitely worth a genuine try. But first, to keep it simple, I will implement single-page compression. With that stable we can do some algorithm tweaking to get this grouping.
>>>
>>
>> Yeah, the page grouping and balancing code would be rather complex. Just the idea is complex.
>
> And that's why I will look at it later, or maybe just make it flexible enough for now.
>
>>
>>>>> De/compression codes are dumb. They just compress what they have in the input buffer and output a compressed buffer.
>>>>>
>>>>> Also, these algos are pluggable as separate kernel modules. They can de/register themselves with ccache (as is done by JFFS2).
>>>>> But for now I will just place them within the kernel itself; the overall approach makes it easy to have de/compression algos as pluggable modules :)
>>>>>
>>>> :) And the algorithm to use them? I was considering using a reference block to compress to get a CI for the algorithm itself.
>>>>
>>> I was thinking of an alternate (simpler) approach. This is because we have only 3 principal de/compression algos -- WKdm, WK4x4 and LZO. Of these, WKdm and WK4x4 are designed to compress typical anonymous memory pages (see http://www.cs.amherst.edu/~sfkaplan/papers/sfkaplan-dissertation.ps.gz for details of WKdm and WK4x4) while LZO is much better at typical filesystem (textual) data. Also, speeds (in general): WKdm > WK4x4 > LZO.
>>>
>>> So, considering this, I think a simple algorithm selection scheme would do: WKdm/WK4x4 if the page is anonymous, and LZO if it's fs-data. Or, depending on the kind of file, WKdm/WK4x4 can give comparable compression with much better speed. To handle such cases, we can record the 'last compression ratio' for each mapping (recorded in the corresponding address_space struct itself). This can be used to have separate algorithm selection per open file.
>>
>> Using a reference algorithm would be simpler if you wanted to do multi-layer compression. Think about the following events:
>>
>> - Page 0x01 compressed (WKdm)
>> - Page 0x02 compressed (WKdm)
>> - Page 0x03 compressed (WKdm)
>> - CPU is idle
>> - Page 0x01 recompressed (WK4x4)
>> - Page 0x02 recompressed (WK4x4)
>> - Page 0x04 compressed (WKdm)
>>
>> Current state:
>>
>> 0x01 WK4x4
>> 0x02 WK4x4
>> 0x03 WKdm
>> 0x04 WKdm
>>
>> - CPU is idle
>> - Rearrange pages based on compressibility
>>
>> PROBLEM: How do you index compressibility?
>>
>> - Memory pressure is mounting
>> - Get pages out of memory -- least compressible first?
>
> I don't think it's good to kick out the least compressible pages first. What ultimately matters is to have the working set of running programs stay in memory -- compressed or uncompressed. But then, it's hard to guess the working set of each running program... It's a big job in itself to come up with a good heuristic for which (and how many) pages to kick out of ccache when memory gets really low.
>

Eh, that's what letting the existing swap logic figure it out is for. Can't make it any LESS efficient than it already is, right? :) Plus the swap cache is looking out for us.

Some light LRU heuristics may be useful. Possibly time-based ("If this page has been used in the last 40,000 jiffies, try to kill something else").

>>
>> PROBLEM: Need to solve the previous problem...
>>
>> This is why I was considering multiple, pluggable algorithms and reference instrumentation. The references would be:
>>
>> - Each new PAGE is compressed with a fast, extremely light algorithm that doesn't produce useful ratios but runs almost straight through the code (i.e. a streaming dictionary compressor rather than a windowed one, which immediately is passed to a streaming Huffman encoder)
>> - PAGE_SIZE - compressed_page_size == Compression Index (CI)
>>
>> - Each new algorithm is given a reference block to compress to test its compression efficiency
>> - block_size - compressed_block_size == CI for the algorithm
>> - We could have 'anonymous memory' and 'file data' blocks
>> - Could produce an anonmem_CI and filemem_CI
>>
>
> This is just too much work to be done before _every_ compression. It can affect performance badly.
The page compression indexing would happen when a page was sent to the ccache, and occurs using a very light algorithm on a 4KiB page. This is not likely to affect much of anything; maybe it costs 50,000 cycles (WKdm). Estimated times:

CPU CLK    SEC        MSEC       EXAMPLE
1MHz       1/20       50         8-bit Nintendo (6502)
200MHz     1/4000     0.25       iPaq 3600 PDA (206MHz)
400MHz     1/8000     0.125      iPaq 5400 PDA (Xscale)
800MHz     1/16000    0.0625     Old Dell
1.6GHz     1/32000    0.03125    Centrino laptop
2.0GHz     1/40000    0.025      Desktop

It may even be possible to find an algorithm that does strict analysis but not compression (like a hash) and spits out a value correlating to the redundancy and encodability of the stream; but that would take a lot of research to come up with. A really fast, really light compressor would be easier.

The algorithm CI reference would be computed when the algorithm is first loaded, before it was put into use. This is a one-time cost for each algorithm and can be ignored; it likely runs in under a millisecond, but it wouldn't hurt anything if it ran for 100-200 ms really. This value would stay persistent for the life of the algorithm (i.e. until someone rmmod's it), so it wouldn't have to be recomputed again.

Note that given these times there could be a negative impact on performance; on faster systems, of course, the impact quickly becomes negligible. It is still significant for embedded systems.

It is also notable that we can amortize the costs by conjecturing that under normal circumstances the system won't be swapping excessively, and thus wouldn't be using compressed cache excessively; so the CI calculation can be largely ignored if it is sufficiently small.

We can further amortize the costs by conjecturing that when ccache thrashing occurs (much paging in and out of ccache), pages being sent back and forth between memory and ccache are usually not going to be dirty. In this case, we may be able to store the CI of the page and only recalculate it if the page is dirty (i.e. altered). Pages that were decompressed, read, and recompressed would be unchanged and have the same CI, so we could skip recalculating it.

>
>> This would allow for the following useful elements:
>>
>> - Groups may be given a CI == sum of the CIs of each page in the group
>> - Pages each have a CI that doesn't change based on the algorithm used
>> - Algorithms each have a CI to describe their efficiency
>>
>> Which can be used for:
>>
>> - It is possible to quickly take a short incremental step to try to get the harder-to-compress pages into the same groups (i.e. find the first group from the hardest-to-compress end that is 'wrong', find all groups containing the next N hardest-to-compress pages, break those groups apart, reorder them, and the group is now 'proper')
>> - This allows idle time to be used to sort pages and groups in such a way that our first candidates for writing to disk are pages that are 'hard to compress' (taking more memory)
>>
>
> The hard-to-compress factor should be combined with the factor of keeping the working set of apps within memory (and maybe others too - I don't know).

Yes... perhaps time-based? ("This page was used some time within the last 4000 jiffies; it's probably part of a working set.")
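
For the CI itself, a user-space sketch of the idea could be as small as this -- zlib at level 1 is only a stand-in for a light reference compressor like WKdm, and the all-'A' page is a toy input (build with -lz):

/* Rough sketch of the per-page Compression Index (CI): compress one
 * 4 KiB page with a cheap, fast compressor and take the number of
 * bytes saved.  zlib level 1 is only a stand-in for a light reference
 * algorithm such as WKdm. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define PAGE_SIZE 4096

static long page_ci(const unsigned char *page)
{
    unsigned char out[PAGE_SIZE * 2];
    uLongf out_len = sizeof(out);

    /* level 1 == fastest, weakest setting */
    if (compress2(out, &out_len, page, PAGE_SIZE, 1) != Z_OK)
        return 0;                             /* treat failure as "incompressible" */

    return (long)PAGE_SIZE - (long)out_len;   /* CI = bytes saved */
}

int main(void)
{
    unsigned char page[PAGE_SIZE];

    memset(page, 'A', sizeof(page));          /* trivially compressible page */
    printf("CI of an all-'A' page: %ld\n", page_ci(page));
    return 0;
}

Pages (or groups) could then be ordered by CI, with the lowest-CI ones being the first candidates for real swap.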
>
>> - It is possible to use a "Light" algorithm (such as WKdm) initially, and move to a "Heavy" algorithm by decompressing the block and recompressing it during idle CPU time
>> - Maximize efficiency without adding unnecessary overhead while the system is busy
>>
>
> -- compress a page with WKdm
> [CPU is idle]
> -- extract these pages compressed with WKdm from ccache
> [CPU is now busy]
> -- compress them with LZO
> -- put them in ccache again
>
> There can be a lot of potential problems with this thing. We need to look at it carefully, in detail.
>

Yeah, like systems that are busy running a large number of programs (i.e. building a kernel) where the CPU is working on and off in quick spurts. You can get 80-150 ms idle periods; if you try to do 200 ms of work there, you've just slowed the system down.

> I was just wondering how the swap prefetching patch senses the CPU as 'idle' -- what are the things it considers? How does it handle the problem of 'spike' loads? Will look at all this later.
>

When the CPU is idle, Linux activates the "idle loop." The idle loop pretty much uses HLT to halt CPU use for a little bit when the system has nothing to do. Otherwise the CPU would be going nuts all the time ;) At any rate, there's a mechanism in there somewhere for it.

>
>> Possibly other things too.
>>
>> Anyway, the point is I was thinking more about how to get it so that when you initially compress pages, you use very light algorithms; and when you have idle CPU time, you burn it to order and condense the compressed cache. This avoids CPU usage during initial entry into the cache.
>>
>>>>> Many of the settings are also tunable via the sysctl interface (e.g. /proc/sys/vm/max_anon_cc_size).
>>>>>
>>>> I was considering being conservative with the maximum cache size in general; but more specifically, allowing the cache to grow until it notices a lot of decompression happening per time slice (compression we do NOT care about; decompression means the system wants something back). If we suddenly see 300 decompressions a second, we're not being very efficient; time to start shrinking the compressed cache. (Under real memory pressure, this shrinkage will involve swapping; otherwise, just expanding requested pages back into memory.)
>>>>
>>>> Increasing the size of the compressed cache during idle time can be done conservatively, but I currently can't think of a way to figure out how big to make it when not under load. Increasing passively (i.e. swapping LRU data into compressed cache) would get some memory into (cheap) compressed cache before the system came under load (decompression is cheaper than compression).
>>>
>>> There are _many_ things to consider when developing a heuristic for an adaptive ccache resizing scheme. I haven't touched this area for now; I aim to have a static-sized ccache working first.
>>>
>>> That said, it's always good to keep thinking over this... it's interesting and is going to be useful :)
>>>
>>>>> Apart from some kernel hooks (as in the lookup functions, shrink_page_list(), pageout()), all ccache code is pretty much in separate .[ch] files or in separate kernel modules. So, it's also much less intrusive.
>>>>
>>>> Very nice.
>>>>
>>>>>> Another advantage is that no matter what I do, I can't swap pages to disk in any worse order than the kernel normally would, since I'm being handed pages based on existing LRU logic.
>>>>>> To this end I've decided to take the "Maximize Compression" approach and try to swap out pages that are "hard to compress" first.
>>>>>>
>>>>>> Actual compression will be done by a module, although the actual compression algorithms will be loaded as slave modules to this module. :)
>>>>>>
>>>>>> I've also made note that due to the nature of this design, large amounts of flexibility are presented. We can talk to the swap device in any way we want; the Swap Zone would supply a mechanism for modules to access the swap device to swap in and out anything, but would always ask the modules themselves to get pages for it. In theory, we can compress whatever we like and swap it out;
>>>>> The approach I described above is flexible enough to handle all kinds of pages: anon pages, clean and dirty page cache pages.
>>>>>
>>>> Compressed pages on swap? Other uses of the code (i.e. implementing something besides compression)?
>>>>
>>> Maybe I didn't understand you here. I meant flexibility in terms of what kind of pages we can keep in the compression structure, a compression structure that should give good performance for almost all operations (lookup, expanding, shrinking, deleting a page, minimizing fragmentation, etc.)
>>>
>>
>> Mmm... I was looking at toying with the infrastructure later to 1) compress data that's in swap (and keep in memory a ccache swp_entry_t to real swp_entry_t mapping to find it later); and 2) implement a toy distributed, encrypted storage backing (i.e. pages can be encrypted and swapped over the network to 2-3 servers for redundant storage off-device -- think wireless PDAs in a corporate network...)
>>
>
> Interesting.
>
> Many ideas keep flying around in my mind, but when I look at my code... lol... not even a static ccache is working yet... LOL

Yeah, just ignore that one. :) I'm always thinking really, really far ahead and in different areas.

>
>>>>>> of course, we need to relate swap file indexes to real page indexes, and deal with suspend-to-disk et al.
>>>>> With the approach described, we automatically use the existing mechanisms of the kernel for our use. This virtual swap is just another swap disk to the other kernel parts.
>>>>
>>>> What happens when you suspend to disk? Does the virtual swap get force-flushed to real swap?
>>>
>>> It has to -- after all, RAM can't retain stuff in suspend-to-disk mode.
>>>
>>>> Can you compress pages in swap and have the kernel not wonder wtf you're smoking when it tries to come back up?
>>>>
>>> Yes, it should be easy to swap out pages in compressed form itself. This was done in Castro's patch too -- not to save swap disk space (the rest of the space was padded) but to reduce unnecessary decompression until the pages are _really_ required.
>>>
>>
>> If you work in multi-page blocks and get better than 1:((N-1)/N) compression anywhere, you'll squeeze out a page. In groups of 5 pages it's entirely feasible to get better than 20% compression, squeezing out a whole page of swap usage.
>>
>> Of course, this in no way interferes with the original goal of reducing unnecessary decompression; it's a happy side-effect. :)
>>
>
> This grouping of pages will also reduce the metadata required for ccache. Each group will have a single chunk_head and corresponding chunks, i.e. the number of chunk_heads will be reduced.

Interesting.
That's another thing about getting a compression index for a page -- if it's too low, you can immediately decide the page is too hard to compress, because the metadata to track it is going to be bigger than the space savings.

>
>>>> (this would all just require extra infrastructure, mainly APIs to say 'Don't compress these pages' and 'stop storing shit in memory until I say so', nothing majorly intrusive)
>>>>
>>>>> -----------------------
>>>>> I have a Wiki set up for this project: linux-mm.org/CompressedCaching
>>>>> Here, I have detailed the approach I'm taking (at least for page-cache pages). It's slightly out-dated now --- it doesn't detail the approach for anon pages... I will update it soon.
>>>>> Feel free to edit it.
>>>>>
>>>>> Since I'm doing this project as part of Google SoC, I have many 'deadlines' to consider. So, bringing major changes to the approach now will overwhelm me with work. It will be really nice if you can provide feedback on my approach :)
>>>>
>>>> Your approach sounds at least to be minimally intrusive, which is good because it lets you keep the code separate and do interesting things. You also use the swapping mechanism, so you should be able to take advantage of the page cache implicitly (a goal of my own design).
>>>>
>>>> You need to consider page grouping; experiment with different sized groups, especially around 16K, 20K, 24K, 28K, and 32K. Try to make the code flexible enough to handle variable-sized page groups too; it would be interesting to have multiple group sizes at runtime, but possibly more complex than needed. I found page grouping to present gains of 10-15% when moving between 4K-8K pages, then about 5% between 16K-32K, and then 0.1% between 32K-64K on Rodrigo's patch; this tells me there's nothing interesting about any group bigger than 32K, and maybe you can get away with something smaller and still maximize gains.
>>>>
>>> I will definitely look into how flexible my current approach is w.r.t. grouping pages and will try to get some flexibility here if it turns out rigid. This grouping of pages seems really important.
>>>
>>
>> Run some simple tests. Feed the modules a group of test data; have them compress it in chunks of 4K, 8K, 12K, 16K, 20K, 24K, 28K, 32K, 36K; and then output the average compression ratio for each group size. The output would look similar to:
>>
>> 4K   16%
>> 8K   23%
>> 12K  30%
>> 16K  35%
>> 20K  39%
>> 24K  42%  <-- This would be a good place to stop...
>> 28K  43%  <-- Not worth the gains
>> 32K  43%
>> 36K  44%
>>
>> (Numbers faked)
>>
>> And you'd basically be looking for where the numbers stopped going up by useful degrees (in this example, at 42% -> 43% we'll say 1% isn't useful for the added overhead). If you keep going and going you'll find a point where no matter how much bigger the block gets you can't gain any more compression; it's likely good to stop a bit before that, knocking off 1% or 2%.
>>
>
> Your analysis of grouping pages interests me most :) I will definitely look into having this grouping supported by my design... again, a bell rings -- nitin! get that static ccache working with the simplest of things you can do first!! ;)
>

Yeah, get it working first. Just don't box yourself in a corner ;) You can try writing a test for various page sizes once the cache is actually working.
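
For reference, a quick user-space version of that chunk-size test could look something like the following -- zlib stands in for the kernel-side WKdm/WK4x4/LZO here, and the sample file is whatever test data you feed it, so the percentages it prints are only illustrative (build with -lz):

/* Compress a sample file in chunks of 4K..36K and print the average
 * space saving for each chunk size, as described above.  zlib is only
 * a stand-in for the real ccache algorithms. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <sample-file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f)
        return 1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    unsigned char *data = malloc((size_t)size);
    if (!data || fread(data, 1, (size_t)size, f) != (size_t)size) {
        fclose(f);
        return 1;
    }
    fclose(f);

    static unsigned char out[2 * 36864];      /* big enough for any chunk */

    for (long chunk = 4096; chunk <= 36864; chunk += 4096) {
        long in_total = 0, out_total = 0;

        for (long off = 0; off + chunk <= size; off += chunk) {
            uLongf out_len = sizeof(out);

            if (compress2(out, &out_len, data + off, (uLong)chunk, 6) != Z_OK)
                continue;
            in_total  += chunk;
            out_total += (long)out_len;
        }
        if (in_total)
            printf("%2ldK  %5.1f%% saved on average\n", chunk / 1024,
                   100.0 * (double)(in_total - out_total) / (double)in_total);
    }

    free(data);
    return 0;
}

Feeding it a dump of typical anonymous memory versus a typical file should also show where the knee in the curve sits for each kind of data.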
>>>> One part of my design was going to be to let the kernel's existing swap code make ALL the LRU decisions and base my decisions entirely on the compressibility of the pages. Pages would be regrouped during idle time and groups would be kept in order of compressibility. The idea here was to be able to throw hard-to-compress pages to swap first during memory pressure. This would help maximize the in-memory compression ratio.
>>>>
>>>> As for multiple algorithms, I would say to always compress with your lightest, fastest algorithm FIRST (this would be WKdm). During idle time you can go back and decompress/recompress pages with something heavier (WK4x4). There is no shame in abusing the CPU when it's idle ;)
>>>>
>>> For desktop systems, this makes sense. But what about OLPC systems? You really can't abuse idle CPU there; you've got to conserve power (I'm not sure of its effect on power requirements, but it should be something like: the more we run, the more we eat ;) )
>>>
>>
>> True :)
>>
>
>
> Cheers,
> Nitin Gupta