Thread: Re: [lc-devel] Compressed caching
From: Nitin G. <nit...@gm...> - 2006-06-18 15:28:58
John Moser wrote: > Nitin Gupta wrote: >> John Moser wrote: >>> Nitin Gupta wrote: >>>> On 6/17/06, John Moser <joh...@gm...> wrote: >>>> * Storing pages: >>>> -- First, it creates a 'virtual swap' which is a swap with its own >>>> swap_map. This swap is given highest priority. So, when pages are to >>>> be swapped out, existing code path naturally/transparently uses this >>>> swap device (until it's full) for swapping out. (prelim patches for >>>> this to be posted very soon on lc-devel). >>> (See I don't understand the swap system that well :) >>> >>> I was thinking eventually I'd have to convince the kernel to *try* to >>> swap even if there was no swap device.. >>> >>> Once this swap device is full it needs to start pushing its pages to a >>> REAL swap device, i.e. on disk. >>> >> This is exactly what this virtual swap is meant for :) >> Pages assigned to this virtual swap will eventually go to ccache and >> when it's full, pages are sent to real swap disk -- all this with >> minimal code intrusion. This approach works even if there are no real >> swap disks -- only virtual swap will be used in that case. >> > > Awesome. And the swap cache and existing LRU logic is all used, right? > (These are important to get memory usage patterns right; we should > theoretically be able to send pages physically to disk in any order we > want without being less efficient than general Linux...) > Yes! >>>> -- in shrink_page_list() it calls should_add_to_ccache() to determine >>>> if page can really be added to ccache. It can consider things like no. >>>> of pages already in ccache, CPU load, type of page (anon or fs-backed) >>>> and some heuristics that tell if page is even worth mem space it will >>>> take (no such heurustic yet worked out but it is really important >>>> thing). >>>> >>> I was thinking putting that code in a module. The page would be >>> compressed with a fast reference algorithm to be used to get a >>> Compression Index CI = PAGE_SIZE - COMPRESSED_PAGE_SIZE. Squeezing down >>> a single page is almost instant anyway. >>> >>> As for "type of page," certain FS-backed pages could be freed, others >>> would have to be persistent. I was thinking more on letting the calling >>> code mark pages that could be freed as "transient" and just freeing them >>> if the compressed memory area got too big, rather than swapping. >>> >> Again, this is exactly what's happening. Pages guessed to be swapped out >> by should_add_to_ccache() are marked by PG_will_compress flag. But, this >> does not guarantee that they will really be compressed. If finally, >> compression fails (lack of mem, CPU load high, poor compressibility >> etc.) this flag is cleared and everything else goes as usual. If it >> really gets compressed, its marked with PG_compressed. Thus it goes with >> your two step 'transient' approach...I think. >> > > Transient pages == come from the disk cache system. If the disk cache > decides it wants some page back and we don't have it, it can go read it > back from disk (i.e. /lib/*, executables, stuff in /usr...) > >>>> -- then pages filtered by should_add_to_ccache() go to pageout() as >>>> usual where they are sent to add_to_ccache() where they are compressed >>>> and added to compression structure. If add_to_ccache() fails for some >>>> reason (e.g. insuff. RAM), the page falls back to normal swap disks >>>> (rest of pageout() code). 
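A minimal sketch of what that filter step might look like; should_add_to_ccache(), the PG_will_compress flag and max_anon_cc_size come straight from the discussion, while helpers such as ccache_cpu_too_busy() and ccache_page_looks_compressible() are placeholders for illustration, not the patch's actual code:

static int should_add_to_ccache(struct page *page)
{
	/* Respect the sysctl-style limit on anonymous pages held in ccache. */
	if (PageAnon(page) && ccache_nr_pages >= max_anon_cc_size)
		return 0;

	/* Skip ccache entirely when the CPU is already saturated. */
	if (ccache_cpu_too_busy())
		return 0;

	/* Placeholder for the "is this page even worth the memory?" guess. */
	if (!ccache_page_looks_compressible(page))
		return 0;

	/* Mark it; the flag is cleared again if compression later fails. */
	SetPageWillCompress(page);
	return 1;
}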
>>> I was thinking having the Swap Zone code supply the code to write to >>> physical disk, and just having stuff pass through it: >>> >>> RAM -> Swap Zone -> Compress -> NOPE DOESN'T SHRINK -> Swap File >>> >>> This was also mainly for when the compressed area got too big and needed >>> shrinkage (by, of course, real swapping). Of course, compression itself >>> was going to be a module, so normal operation was: >>> >>> RAM -> Swap Zone -> Swap File >>> >>> until such time as the compress module was loaded. :) >>> >> As I understand you are really after an approach that can easily (and >> with minimum code intrusion in kernel) fall back to real swap disks when >> compression can't be done. This is being done well in current approach. >> > > Yeah. I was looking at a different use of the same code later though > but :) So far your design looks technically about the same as mine. > hmm...somewhat different: -- I am not looking to have multi-stage compression i.e. first compress with quickest then later compress them with heavyweights instead when CPU is idle. -- Not (yet) considering grouping of pages for compression. -- And I've not yet started work on a heuristic for adaptive ccache size. >>>> * Restoring pages: >>>> Whenever swap-cache/page-cache is looked-up for a page >>>> (find_get_page(), find_get_pages() etc.), we check PG_compressed flag. >>>> If its set we decompress() page and return it. >>>> >>> I skipped pages compressed in place because compressing individual pages >>> is non-interesting. I noticed 15-20% compression with 4K pages on >>> Castro's patch; pushing this up to 8KiB, 16KiB, 32KiB, and 64KiB, I >>> noticed gains up to 50-60% when we hit 32KiB and nothing more beyond. >>> >>> I'm pretty sure that there's an asymptotic limit at 20, 24, 28, or 32KiB >>> (the lower the better); I want to compress pages in groups of whatever >>> that limit falls on for maximum performance. >>> >>> This makes the approach I had in mind a lot more complex. Pages would >>> be grouped using a grouping algorithm; idle time would be used to >>> disrupt groups and reconstruct more optimal grouping; and in the end we >>> would also attempt to swap out more difficult to compress pages first. >>> >>> (In fact this is why I want to find the smallest group for the >>> asymptotic limit-- 32KiB, 8 4KiB pages, my regrouping algorithm has to >>> decompress, rearrange, and recompress up to 9 groups, 288KiB! If it >>> were i.e. 20, that'd be 6 x 5 = 120KiB) >>> >> This is a great point! I really didn't even consider grouping pages >> together for compression -- because I was unaware of its effect on >> compression ratios and due to additional complexity it involves. This >> grouping of pages is definitely worth a genuine try. But first, to keep >> it simple, I will implement for single page compression. With this thing >> stable we can do some algo tweaking to have this grouping. >> > > Yeah, the page grouping and balancing code would be rather complex. > Just the idea is complex. And thats why I will look at it later or maybe just make it flexible enough for now. > >>>> De/Compression codes are dumb. They just compress what they have in >>>> input buffer and output a compressed buffer. >>>> >>>> Also, these algos are pluggable as separate kernel modules. They can >>>> de/register themselves with ccahe (as is done by JFFS2). 
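A rough sketch of what a pluggable de/compression interface along those lines could look like; the structure and function names here are illustrative assumptions, not the patch's (or JFFS2's) actual API:

#include <linux/list.h>
#include <linux/types.h>

struct ccache_compressor {
	const char *name;
	/* Both return the output length, or a negative errno on failure. */
	int (*compress)(const void *src, size_t src_len, void *dst, size_t dst_len);
	int (*decompress)(const void *src, size_t src_len, void *dst, size_t dst_len);
	struct list_head list;	/* linked into ccache's list of registered algorithms */
};

int ccache_register_compressor(struct ccache_compressor *c);
void ccache_unregister_compressor(struct ccache_compressor *c);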
But, for now >>>> I will just place them within kernel itself but overall approach makes >>>> it easy to get de/compression algos as pluggable modules :) >>>> >>> :) And the algorithm to use them? I was considering using a reference >>> block to compress to get a CI for the algorithm itself. >>> >> I was thinking of an alternate (simpler) approach. This is because we >> have only 3 principal de/compression algos -- WKdm, WK4x4 and LZO. Of >> these WKdm and WK4x4 are designed to compress typical anonymous memory >> pages (see >> http://www.cs.amherst.edu/~sfkaplan/papers/sfkaplan-dissertation.ps.gz >> for details of WKdm and WK4x4) while LZO is much better at typical >> filesystem (textual) data. Also speeds (in general): WKdm > WK4x4 > LZO >> >> So, considering this, I think a simple algo selection scheme like >> WKdm/WK4x4 if page is anonymous and LZO if fs-data. >> Or, depending on kind of file, WKdm/WK4x4 can give comparable >> compression with much better speeds. To handle cases, we can record >> 'last compression ratio' for each mapping (recorded in corres. >> address_space struct itself). This can be used to have separate algo >> selection per open file. > > Using a reference algorithm would be simpler if you wanted to do > multi-layer compression. Think about the following events: > > - Page 0x01 compressed (WKdm) > - Page 0x02 compressed (WKdm) > - Page 0x03 compressed (WKdm) > - CPU is idle > - Page 0x01 recompressed (WK4x4) > - Page 0x02 recompressed (WK4x4) > - Page 0x04 compressed (WKdm) > > Current state: > > 0x01 WK4x4 > 0x02 WK4x4 > 0x03 WKdm > 0x04 WKdm > > - CPU is idle > - Rearrange pages based on compressibility > > PROBLEM: How do you index compressibility? > > - Memory pressure is mounting > - Get pages out of memory -- Least compressible? I don't think its good to kick out least compressible pages first. What ultimately matters is to have working set of running programs to stay in memory -- compressed or uncompressed. But then, its hard to guess working set of each running program...Its a big work in itself to come up with a good heuristic for which (and how many) pages to kick out from ccache when memory gets real low. > > PROBLEM: Need to solve previous problem.. > > This is why I was considering multiple, pluggable algorithms and > reference instrumentation. The references would be: > > - Each new PAGE is compressed with a fast, extremely light algorithm > that doesn't produce useful ratios but runs almost straight through the > code (i.e. a streaming dictionary compressor rather than a windowed one, > which immediately is passed to a streaming huffman encoder) > - PAGE_SIZE - compressed_page_size == Compression Index (CI) > > - Each new algorithm is given a reference block to compress to test its > compression efficiency > - block_size - compressed_block_size == CI for the algorithm > - We could have 'anonymous memory' and 'file data' blocks > - Could produce an anonmem_CI and filemem_CI > This is just too much work to be done before _every_ compression. It can affect performance badly. > This would allow for the following useful elements: > > - Groups may be given a CI == sum of the CI's of each page in the group > - Pages each have a CI that doesn't change based on algorithm used > - Algorithms each have a CI to describe their efficiency > > Which can be used for: > > - It is possible to quickly take a short incremental step to try to get > the harder-to-compress pages into the same groups (i.e. 
find the first > group from the hardest-to-compress end that is 'wrong', find all groups > containing the next N hardest-to-compress pages, break those groups > apart, reorder them, the group is now 'proper') > - This allows for idle time to be used to sort pages and groups in > such a way that our first candidates for writing to disk are pages that > are 'hard to compress' (taking more memory) > Hard to compress factor should be combined with factor of having working set of apps within memory (and maybe other too - I don't know). > - It is possible to use a "Light" algorithm (such as WKdm) initially, > and move to a "Heavy" algorithm by decompressing the block and > recompressing it during idle CPU time > - Maximize efficiency without adding unnecessary overhead while the > system is busy > -- compress a page with WKdm [CPU is idle] -- extract these pages compressed with WKdm from ccache [CPU is now busy] -- compress them with LZO -- put them again in ccache There can be lot of potential problems with this thing. We need to look at it carefully in detail. I was just wondering how swap prefetching patch senses CPU as 'idle' - what are the things it considers?, how it handles problem of 'spike' loads?. Will look at all this later. > Possibly other things too. > > Anyway the point is I was more thinking on how to get it so when you > initially compress pages, you use very light algorithms; and when you > have idle CPU time, you burn it to order and condense the compressed > cache. This avoids CPU usage during initial entry into the cache. > >>>> Many of the setting are also tunable via sysctl interface. (e.g. >>>> /proc/sys/vm/max_anon_cc_size). >>>> >>> I was considering being conservative with the maximum cache size in >>> general; but more specifically, allowing the cache to grow until it >>> notices a lot decompression is happening per time space (compression we >>> do NOT care about; decompression means the system wants something back). >>> If we suddenly see 300 decompressions a second, we're not being very >>> efficient; time to start shrinking the compressed cache. (Under real >>> memory pressure, this shrinkage will involve swapping; otherwise, just >>> expanding requested pages back into memory.) >>> >>> Increasing the size of compressed cache during idle time can be done >>> conservatively, but I currently can't think of a way to figure out how >>> big to make it when not under load. Increasing passively (i.e. swapping >>> LRU data into compressed cache) would get some memory into (cheap) >>> compressed cache before the system came under load (decompression is >>> cheaper than compression). >> There are _many_ things to consider when developing heuristic for >> adaptive ccache resizing scheme. I have touched this area for now. I >> first aim to have a static sized ccache working first. >> >> This said, it's always good to keep thinking over this...it's >> interesting and is going to be useful :) >> >>>> Apart from some kernel hooks (as in lookup functions, >>>> shrink_page_list(), pageout()), all ccache code is pretty much in >>>> separate .[ch] files or as separte kernel modules. So, it's also very >>>> less intrusive. >>> Very nice. >>> >>>>> Another advantage is that no matter what I do, I can't swap pages to >>>>> disk in any worse order than the kernel normally would, since I'm being >>>>> handed pages based on existing LRU logic. To this end I've decided to >>>>> take the "Maximize Compression" approach and try to swap out pages that >>>>> are "hard to compress" first. 
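For the "shrink when decompressions spike" idea mentioned above (e.g. 300 decompressions a second meaning the cache is holding things the system wants back), a sampling heuristic could look roughly like the sketch below; every name in it is an assumption:

#define CCACHE_SAMPLE_INTERVAL	HZ	/* sample roughly once a second */
#define CCACHE_DECOMP_THRESHOLD	300	/* decompressions per sample */

static unsigned long ccache_decompressions;	/* bumped on every decompress */

/* Called once per CCACHE_SAMPLE_INTERVAL, e.g. from a timer. */
static void ccache_sample_decompress_rate(void)
{
	static unsigned long last_count;
	unsigned long rate = ccache_decompressions - last_count;

	last_count = ccache_decompressions;

	/* Many pages being faulted back out: ccache is hurting, not helping. */
	if (rate > CCACHE_DECOMP_THRESHOLD)
		ccache_shrink_target();		/* assumed helper */
}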
>>>>> >>>>> Actual compression will be done by a module; although actual >>>>> compression >>>>> algorithms will be loaded as slave modules to this module. :) >>>>> >>>>> I've also made note that due to the nature of this design, large >>>>> amounts >>>>> of flexibility are presented. We can talk to the swap device in any >>>>> way >>>>> we want; the Swap Zone would supply a mechanism for modules to access >>>>> the swap device to swap in and out anything, but would always ask the >>>>> modules themselves to get pages for it. In theory, we can compress >>>>> whatever we like and swap it out; >>>> Approach I described above is flexible enough to handle all kinds of >>>> pages: anon pages, clean and dirty page cache pages. >>>> >>> Compressed pages on swap? Other uses of the code (i.e. implementing >>> something besides compression)? >>> >> maybe I didn't understand you here. I meant flexibility in terms of >> what kind of pages we can keep in compression structure, a compression >> structure that should give good performance for almost all operations >> (lookup, expanding, shrinking, deleting a page, minimize fragmentation >> etc.) >> > > Mmm.. I was looking at toying with the infra later to 1) Compress data > that's in swap (and keep in memory a ccache swp_entry_t to real > swp_entry_t mapping to find it later); and 2) Implement a toy > distributed, encrypted storage backing (i.e. pages can be encrypted and > swapped over network to 2-3 servers for redundant storage off-device, > think wireless PDAs in a corporate network...) > Interesting. Many ideas keep flying in my mind but when I look at my code...lol..not even static cccache is working yet...LOL >>>>> of course, we need to relate swap file >>>>> indexes to real page indexes, and deal with suspend-to-disk et al. >>>> With approach described, we automatically use exisiting mech. of >>>> kernel for our use. This virtual swap is just another swap disk to >>>> other kernel parts. >>> What happens when you suspend to disk? Does the virtual swap get >>> force-flushed to real swap? >> It has to -- after all RAM can retain stuff in suspend mode. >> >>> Can you compress pages in swap and have the >>> kernel not wonder wtf you're smoking when it tries to come back up? >>> >> Yes, it should be easy to swap out pages as compressed form itself. This >> was done in Castro's patch too -- this is not to save swap disk space >> (rest of space was padded) but to reduce unnecessary decompression until >> they are _really_ required. >> > > If you work in multi-page blocks and get better than 1:((N-1)/N) > compression anywhere, you'll squeeze out a page. In groups of 5 pages > it's entirely feasible to get better than 20% compression, squeezing out > a whole page of swap usage. > > Of course, this in no way interferes with the original goal of reducing > unnecessary decompression; it's a happy side-effect. :) > This grouping of pages will also reduce metadata required for ccache. Each group will have a single chunk_head and coressponding chunks i.e. no. of cunk_heads will be reduced. >>> (this would all just require extra infrastructure, mainly APIs to say >>> 'Don't compress these pages' and 'stop storing shit in memory until I >>> say so', nothing majorly intrusive) >>> >>>> ----------------------- >>>> I have a Wiki setup for this project linux-mm.org/CompressedCaching >>>> Here, I have detailed the approach I'm taking (at least for page-cache >>>> pages). 
It's slightly out-dated now --- It doesn't detail approach for >>>> anon pages in detail.....I will update it soon. >>>> Feel free to edit it. >>>> >>>> Since I'm doing this project as part of Google SoC, I have many >>>> 'deadlines' to consider. So, now bringing major changes to approach >>>> will overwhelm me with work. It will be really nice if you can provide >>>> feedback on my approach :) >>> Your approach sounds at least to be minimally intrusive, which is good >>> because it lets you keep the code separate and do interesting things. >>> You also use the swapping mechanism, so you should be able to take >>> advantage of page cache implicitly (a goal of my own design). >>> >>> You need to consider page grouping; Experiment with different sized >>> groups, especially around 16K, 20K, 24K, 28K, and 32K. Try to make the >>> code flexible enough to handle variable sized page groups too; it would >>> be interesting to have multiple group sizes at run, but possibly more >>> complex than needed. I found page grouping to present gains of 10-15% >>> when moving between 4K-8K pages, and then about 5% between 16K-32K, and >>> then 0.1% between 32K-64K on Rodrigo's patch; this tells me there's >>> nothing interesting about any group bigger than 32K, and maybe you can >>> get away with something smaller and still maximize gains. >>> >> I will definitely lookup into how flexible my current approach is w.r.t >> grouping pages and will try to get some flexibility here if it shows up >> rigid. This grouping of pages seems really important. >> > > Run some simple tests. Feed the modules a group of test data; have them > compress it in chunks of 4K, 8K, 12K, 16K, 20K, 24K, 28K, 32K, 36K; and > then output the average compression ratio for each group size. The > output would look similar to: > > > 4K 16% > 8K 23% > 12K 30% > 16K 35% > 20K 39% > 24K 42% <-- This would be a good place to stop... > 28K 43% <-- Not worth the gains > 32K 43% > 36K 44% > > (Numbers faked) > > And you'd basically be looking for where the numbers stopped going up by > useful degrees (in this example, 42% -> 43% we'll say 1% isn't useful > for the added overhead). If you keep going and going you'll find a > point where no matter how much bigger the block gets you can't gain any > more compression; it's likely good to stop a bit before that, knocking > off 1% or 2%. > Your analysis of grouping pages interests me most :) I will definitely look into having this grouping supported by my design.....again, a bell rings -- nitin! get that static ccache with simplest of things you can do first!! ;) >>> One part of my design was going to be to let the kernel's existing swap >>> code make ALL the LRU decisions and base my decisions entirely on >>> compressibility of the pages. Pages would be regrouped during idle time >>> and groups would be kept in order of compressibility. The idea here was >>> to be able to throw hard-to-compress pages to swap first during memory >>> pressure. This would help maximize the in-memory compression ratio. >>> >>> As for multiple algorithms, I would say to always compress with your >>> lightest, fastest algorithm FIRST (this would be WKdm). During idle >>> time you can go back and decompress/recompress pages with something >>> heavier (WK4x4). There is no shame in abusing the CPU when it's idle ;) >>> >> For desktop systems, this makes sense. But what about OLPC systems? 
You >> really can't abuse idle CPU, you got to conserve power (I'm not sure of >> its effect on power requirements but it should be like more we run more >> we eat ;) ) >> > > True :) > Cheers, Nitin Gupta |
From: John M. <joh...@gm...> - 2006-06-18 17:07:20
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nitin Gupta wrote: > John Moser wrote: >> Nitin Gupta wrote: >>> John Moser wrote: >>>> Nitin Gupta wrote: >>>>> On 6/17/06, John Moser <joh...@gm...> wrote: >>>>> * Storing pages: >>>>> -- First, it creates a 'virtual swap' which is a swap with its own >>>>> swap_map. This swap is given highest priority. So, when pages are to >>>>> be swapped out, existing code path naturally/transparently uses this >>>>> swap device (until it's full) for swapping out. (prelim patches for >>>>> this to be posted very soon on lc-devel). >>>> (See I don't understand the swap system that well :) >>>> >>>> I was thinking eventually I'd have to convince the kernel to *try* to >>>> swap even if there was no swap device.. >>>> >>>> Once this swap device is full it needs to start pushing its pages to a >>>> REAL swap device, i.e. on disk. >>>> >>> This is exactly what this virtual swap is meant for :) >>> Pages assigned to this virtual swap will eventually go to ccache and >>> when it's full, pages are sent to real swap disk -- all this with >>> minimal code intrusion. This approach works even if there are no real >>> swap disks -- only virtual swap will be used in that case. >>> >> >> Awesome. And the swap cache and existing LRU logic is all used, right? >> (These are important to get memory usage patterns right; we should >> theoretically be able to send pages physically to disk in any order we >> want without being less efficient than general Linux...) >> > > Yes! > >>>>> -- in shrink_page_list() it calls should_add_to_ccache() to determine >>>>> if page can really be added to ccache. It can consider things like no. >>>>> of pages already in ccache, CPU load, type of page (anon or fs-backed) >>>>> and some heuristics that tell if page is even worth mem space it will >>>>> take (no such heurustic yet worked out but it is really important >>>>> thing). >>>>> >>>> I was thinking putting that code in a module. The page would be >>>> compressed with a fast reference algorithm to be used to get a >>>> Compression Index CI = PAGE_SIZE - COMPRESSED_PAGE_SIZE. Squeezing >>>> down >>>> a single page is almost instant anyway. >>>> >>>> As for "type of page," certain FS-backed pages could be freed, others >>>> would have to be persistent. I was thinking more on letting the >>>> calling >>>> code mark pages that could be freed as "transient" and just freeing >>>> them >>>> if the compressed memory area got too big, rather than swapping. >>>> >>> Again, this is exactly what's happening. Pages guessed to be swapped out >>> by should_add_to_ccache() are marked by PG_will_compress flag. But, this >>> does not guarantee that they will really be compressed. If finally, >>> compression fails (lack of mem, CPU load high, poor compressibility >>> etc.) this flag is cleared and everything else goes as usual. If it >>> really gets compressed, its marked with PG_compressed. Thus it goes with >>> your two step 'transient' approach...I think. >>> >> >> Transient pages == come from the disk cache system. If the disk cache >> decides it wants some page back and we don't have it, it can go read it >> back from disk (i.e. /lib/*, executables, stuff in /usr...) >> >>>>> -- then pages filtered by should_add_to_ccache() go to pageout() as >>>>> usual where they are sent to add_to_ccache() where they are compressed >>>>> and added to compression structure. If add_to_ccache() fails for some >>>>> reason (e.g. insuff. RAM), the page falls back to normal swap disks >>>>> (rest of pageout() code). 
>>>> I was thinking having the Swap Zone code supply the code to write to >>>> physical disk, and just having stuff pass through it: >>>> >>>> RAM -> Swap Zone -> Compress -> NOPE DOESN'T SHRINK -> Swap File >>>> >>>> This was also mainly for when the compressed area got too big and >>>> needed >>>> shrinkage (by, of course, real swapping). Of course, compression >>>> itself >>>> was going to be a module, so normal operation was: >>>> >>>> RAM -> Swap Zone -> Swap File >>>> >>>> until such time as the compress module was loaded. :) >>>> >>> As I understand you are really after an approach that can easily (and >>> with minimum code intrusion in kernel) fall back to real swap disks when >>> compression can't be done. This is being done well in current approach. >>> >> >> Yeah. I was looking at a different use of the same code later though >> but :) So far your design looks technically about the same as mine. >> > > hmm...somewhat different: > -- I am not looking to have multi-stage compression i.e. first compress > with quickest then later compress them with heavyweights instead when > CPU is idle. > -- Not (yet) considering grouping of pages for compression. > -- And I've not yet started work on a heuristic for adaptive ccache size. > >>>>> * Restoring pages: >>>>> Whenever swap-cache/page-cache is looked-up for a page >>>>> (find_get_page(), find_get_pages() etc.), we check PG_compressed flag. >>>>> If its set we decompress() page and return it. >>>>> >>>> I skipped pages compressed in place because compressing individual >>>> pages >>>> is non-interesting. I noticed 15-20% compression with 4K pages on >>>> Castro's patch; pushing this up to 8KiB, 16KiB, 32KiB, and 64KiB, I >>>> noticed gains up to 50-60% when we hit 32KiB and nothing more beyond. >>>> >>>> I'm pretty sure that there's an asymptotic limit at 20, 24, 28, or >>>> 32KiB >>>> (the lower the better); I want to compress pages in groups of whatever >>>> that limit falls on for maximum performance. >>>> >>>> This makes the approach I had in mind a lot more complex. Pages would >>>> be grouped using a grouping algorithm; idle time would be used to >>>> disrupt groups and reconstruct more optimal grouping; and in the end we >>>> would also attempt to swap out more difficult to compress pages first. >>>> >>>> (In fact this is why I want to find the smallest group for the >>>> asymptotic limit-- 32KiB, 8 4KiB pages, my regrouping algorithm has to >>>> decompress, rearrange, and recompress up to 9 groups, 288KiB! If it >>>> were i.e. 20, that'd be 6 x 5 = 120KiB) >>>> >>> This is a great point! I really didn't even consider grouping pages >>> together for compression -- because I was unaware of its effect on >>> compression ratios and due to additional complexity it involves. This >>> grouping of pages is definitely worth a genuine try. But first, to keep >>> it simple, I will implement for single page compression. With this thing >>> stable we can do some algo tweaking to have this grouping. >>> >> >> Yeah, the page grouping and balancing code would be rather complex. >> Just the idea is complex. > > And thats why I will look at it later or maybe just make it flexible > enough for now. > >> >>>>> De/Compression codes are dumb. They just compress what they have in >>>>> input buffer and output a compressed buffer. >>>>> >>>>> Also, these algos are pluggable as separate kernel modules. They can >>>>> de/register themselves with ccahe (as is done by JFFS2). 
But, for now >>>>> I will just place them within kernel itself but overall approach makes >>>>> it easy to get de/compression algos as pluggable modules :) >>>>> >>>> :) And the algorithm to use them? I was considering using a reference >>>> block to compress to get a CI for the algorithm itself. >>>> >>> I was thinking of an alternate (simpler) approach. This is because we >>> have only 3 principal de/compression algos -- WKdm, WK4x4 and LZO. Of >>> these WKdm and WK4x4 are designed to compress typical anonymous memory >>> pages (see >>> http://www.cs.amherst.edu/~sfkaplan/papers/sfkaplan-dissertation.ps.gz >>> for details of WKdm and WK4x4) while LZO is much better at typical >>> filesystem (textual) data. Also speeds (in general): WKdm > WK4x4 > LZO >>> >>> So, considering this, I think a simple algo selection scheme like >>> WKdm/WK4x4 if page is anonymous and LZO if fs-data. >>> Or, depending on kind of file, WKdm/WK4x4 can give comparable >>> compression with much better speeds. To handle cases, we can record >>> 'last compression ratio' for each mapping (recorded in corres. >>> address_space struct itself). This can be used to have separate algo >>> selection per open file. >> >> Using a reference algorithm would be simpler if you wanted to do >> multi-layer compression. Think about the following events: >> >> - Page 0x01 compressed (WKdm) >> - Page 0x02 compressed (WKdm) >> - Page 0x03 compressed (WKdm) >> - CPU is idle >> - Page 0x01 recompressed (WK4x4) >> - Page 0x02 recompressed (WK4x4) >> - Page 0x04 compressed (WKdm) >> >> Current state: >> >> 0x01 WK4x4 >> 0x02 WK4x4 >> 0x03 WKdm >> 0x04 WKdm >> >> - CPU is idle >> - Rearrange pages based on compressibility >> >> PROBLEM: How do you index compressibility? >> >> - Memory pressure is mounting >> - Get pages out of memory -- Least compressible? > > I don't think its good to kick out least compressible pages first. What > ultimately matters is to have working set of running programs to stay in > memory -- compressed or uncompressed. But then, its hard to guess > working set of each running program...Its a big work in itself to come > up with a good heuristic for which (and how many) pages to kick out from > ccache when memory gets real low. > Eh, that's what letting the existing swap logic figure it out is for. Can't make it any LESS efficient than it already is right? :) Plus the swap cache is looking out for us. Some light LRU heuristics may be useful. Possibly time based ("If this page has been used in the last 40,000 jiffies, try to kill something else") >> >> PROBLEM: Need to solve previous problem.. >> >> This is why I was considering multiple, pluggable algorithms and >> reference instrumentation. The references would be: >> >> - Each new PAGE is compressed with a fast, extremely light algorithm >> that doesn't produce useful ratios but runs almost straight through the >> code (i.e. a streaming dictionary compressor rather than a windowed one, >> which immediately is passed to a streaming huffman encoder) >> - PAGE_SIZE - compressed_page_size == Compression Index (CI) >> >> - Each new algorithm is given a reference block to compress to test its >> compression efficiency >> - block_size - compressed_block_size == CI for the algorithm >> - We could have 'anonymous memory' and 'file data' blocks >> - Could produce an anonmem_CI and filemem_CI >> > > This is just too much work to be done before _every_ compression. It can > affect performance badly. 
The page compression indexing would happen when a page was sent to the ccache, and occurs using a very light algorithm on a 4KiB page. This is not likely to affect much of anything, maybe cost 50,000 cycles (WKdm). Estimated times: CPU CLK SEC MSEC EXAMPLE 1MHz 1/20 50 8-bit Nintendo (6502) 200MHz 1/4000 0.25 iPaq 3600 PDA (206MHz) 400MHz 1/8000 0.125 iPaq 5400 PDA (Xscale) 800MHz 1/16000 0.0625 Old Dell 1.6GHz 1/32000 0.03125 Centrino Laptop 2.0GHz 1/40000 0.025 Desktop It may even be possible to find an algorithm that does strict analysis but not compression (like a hash) that spits out a value correlating to the redundancy and encodability of the stream; but that would take a lot of research to come up with. A really fast, really light compressor would be easier. The algorithm CI reference would be computed when the algorithm is first loaded, before it was put into use. This is a one-time cost for each algorithm, and can be ignored; it likely operates under a millisecond but it wouldn't hurt anything if it ran for 100-200mS really. This value would become persistent for the life of the algorithm (i.e. until someone rmmod's it), so it wouldn't have to be recomputed again. Note that given these times there could be a negative impact on performance; on faster systems, of course, the impact becomes quickly negligible. It is still significant for embedded systems. It is also notable that we can amortize the costs by conjecturing that under normal circumstances, the system won't be swapping excessively, and thus wouldn't be using compressed cache excessively; thus the CI calculation can be largely ignored if it is sufficiently small. We can further amortize the costs by conjecturing that when ccache thrashing occurs (much paging in and out of ccache), pages being sent back and forth between memory and ccache are usually not going to be dirty. In this case, we may be able to store the CI of the page and only re-calculate it if the page is dirty (i.e. altered). Pages that were decompressed, read, and recompressed would be unchanged and have the same CI, thus we could skip recalculating the CI. > >> This would allow for the following useful elements: >> >> - Groups may be given a CI == sum of the CI's of each page in the group >> - Pages each have a CI that doesn't change based on algorithm used >> - Algorithms each have a CI to describe their efficiency >> >> Which can be used for: >> >> - It is possible to quickly take a short incremental step to try to get >> the harder-to-compress pages into the same groups (i.e. find the first >> group from the hardest-to-compress end that is 'wrong', find all groups >> containing the next N hardest-to-compress pages, break those groups >> apart, reorder them, the group is now 'proper') >> - This allows for idle time to be used to sort pages and groups in >> such a way that our first candidates for writing to disk are pages that >> are 'hard to compress' (taking more memory) >> > > Hard to compress factor should be combined with factor of having working > set of apps within memory (and maybe other too - I don't know). Yes.. perhaps time based? ("This page was used some time within the last 4000 jiffies, it's probably part of a working set"). 
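A sketch of that amortization: cache the CI alongside ccache's per-page metadata and recompute it only when the page is dirty. The ccache_page_info structure and the light_compress() reference compressor are assumptions, not existing code:

struct ccache_page_info {
	unsigned short ci;		/* PAGE_SIZE - compressed size */
	unsigned short ci_valid:1;
};

static unsigned int page_ci(struct page *page, struct ccache_page_info *info)
{
	/* Clean page: the cached value still describes its contents. */
	if (info->ci_valid && !PageDirty(page))
		return info->ci;

	info->ci = PAGE_SIZE - light_compress(page);	/* assumed reference compressor */
	info->ci_valid = 1;
	return info->ci;
}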
> >> - It is possible to use a "Light" algorithm (such as WKdm) initially, >> and move to a "Heavy" algorithm by decompressing the block and >> recompressing it during idle CPU time >> - Maximize efficiency without adding unnecessary overhead while the >> system is busy >> > > -- compress a page with WKdm > [CPU is idle] > -- extract these pages compressed with WKdm from ccache > [CPU is now busy] > -- compress them with LZO > -- put them again in ccache > > There can be lot of potential problems with this thing. We need to look > at it carefully in detail. > Yeah, like systems that are busy running a large number of programs (i.e. building a kernel) where the CPU is working on and off in quick spurts. You can get 80-150mS idle periods, if you try to do 200mS of work there you just slowed the system down. > I was just wondering how swap prefetching patch senses CPU as 'idle' - > what are the things it considers?, how it handles problem of 'spike' > loads?. Will look at all this later. > When the CPU is idle, linux activates the "Idle Loop." The Idle Loop pretty much uses HLT to halt CPU use for a little bit when the system has nothing to do. Otherwise the CPU would be going nuts all the time ;) At any rate, there's a mechanism in there somewhere for it. > >> Possibly other things too. >> >> Anyway the point is I was more thinking on how to get it so when you >> initially compress pages, you use very light algorithms; and when you >> have idle CPU time, you burn it to order and condense the compressed >> cache. This avoids CPU usage during initial entry into the cache. >> >>>>> Many of the setting are also tunable via sysctl interface. (e.g. >>>>> /proc/sys/vm/max_anon_cc_size). >>>>> >>>> I was considering being conservative with the maximum cache size in >>>> general; but more specifically, allowing the cache to grow until it >>>> notices a lot decompression is happening per time space (compression we >>>> do NOT care about; decompression means the system wants something >>>> back). >>>> If we suddenly see 300 decompressions a second, we're not being very >>>> efficient; time to start shrinking the compressed cache. (Under real >>>> memory pressure, this shrinkage will involve swapping; otherwise, just >>>> expanding requested pages back into memory.) >>>> >>>> Increasing the size of compressed cache during idle time can be done >>>> conservatively, but I currently can't think of a way to figure out how >>>> big to make it when not under load. Increasing passively (i.e. >>>> swapping >>>> LRU data into compressed cache) would get some memory into (cheap) >>>> compressed cache before the system came under load (decompression is >>>> cheaper than compression). >>> There are _many_ things to consider when developing heuristic for >>> adaptive ccache resizing scheme. I have touched this area for now. I >>> first aim to have a static sized ccache working first. >>> >>> This said, it's always good to keep thinking over this...it's >>> interesting and is going to be useful :) >>> >>>>> Apart from some kernel hooks (as in lookup functions, >>>>> shrink_page_list(), pageout()), all ccache code is pretty much in >>>>> separate .[ch] files or as separte kernel modules. So, it's also very >>>>> less intrusive. >>>> Very nice. >>>> >>>>>> Another advantage is that no matter what I do, I can't swap pages to >>>>>> disk in any worse order than the kernel normally would, since I'm >>>>>> being >>>>>> handed pages based on existing LRU logic. 
To this end I've >>>>>> decided to >>>>>> take the "Maximize Compression" approach and try to swap out pages >>>>>> that >>>>>> are "hard to compress" first. >>>>>> >>>>>> Actual compression will be done by a module; although actual >>>>>> compression >>>>>> algorithms will be loaded as slave modules to this module. :) >>>>>> >>>>>> I've also made note that due to the nature of this design, large >>>>>> amounts >>>>>> of flexibility are presented. We can talk to the swap device in any >>>>>> way >>>>>> we want; the Swap Zone would supply a mechanism for modules to access >>>>>> the swap device to swap in and out anything, but would always ask the >>>>>> modules themselves to get pages for it. In theory, we can compress >>>>>> whatever we like and swap it out; >>>>> Approach I described above is flexible enough to handle all kinds of >>>>> pages: anon pages, clean and dirty page cache pages. >>>>> >>>> Compressed pages on swap? Other uses of the code (i.e. implementing >>>> something besides compression)? >>>> >>> maybe I didn't understand you here. I meant flexibility in terms of >>> what kind of pages we can keep in compression structure, a compression >>> structure that should give good performance for almost all operations >>> (lookup, expanding, shrinking, deleting a page, minimize fragmentation >>> etc.) >>> >> >> Mmm.. I was looking at toying with the infra later to 1) Compress data >> that's in swap (and keep in memory a ccache swp_entry_t to real >> swp_entry_t mapping to find it later); and 2) Implement a toy >> distributed, encrypted storage backing (i.e. pages can be encrypted and >> swapped over network to 2-3 servers for redundant storage off-device, >> think wireless PDAs in a corporate network...) >> > > Interesting. > > Many ideas keep flying in my mind but when I look at my code...lol..not > even static cccache is working yet...LOL Yeah, just ignore that one. :) I'm always thinking really, really far ahead and in different areas. > >>>>>> of course, we need to relate swap file >>>>>> indexes to real page indexes, and deal with suspend-to-disk et al. >>>>> With approach described, we automatically use exisiting mech. of >>>>> kernel for our use. This virtual swap is just another swap disk to >>>>> other kernel parts. >>>> What happens when you suspend to disk? Does the virtual swap get >>>> force-flushed to real swap? >>> It has to -- after all RAM can retain stuff in suspend mode. >>> >>>> Can you compress pages in swap and have the >>>> kernel not wonder wtf you're smoking when it tries to come back up? >>>> >>> Yes, it should be easy to swap out pages as compressed form itself. This >>> was done in Castro's patch too -- this is not to save swap disk space >>> (rest of space was padded) but to reduce unnecessary decompression until >>> they are _really_ required. >>> >> >> If you work in multi-page blocks and get better than 1:((N-1)/N) >> compression anywhere, you'll squeeze out a page. In groups of 5 pages >> it's entirely feasible to get better than 20% compression, squeezing out >> a whole page of swap usage. >> >> Of course, this in no way interferes with the original goal of reducing >> unnecessary decompression; it's a happy side-effect. :) >> > > This grouping of pages will also reduce metadata required for ccache. > Each group will have a single chunk_head and coressponding chunks i.e. > no. of cunk_heads will be reduced. Interesting. 
That's another thing about getting a compression index for a page-- if it's too low, you can immediately decide the page is too hard to compress because the metadata to control it is going to be bigger than the space savings. > >>>> (this would all just require extra infrastructure, mainly APIs to say >>>> 'Don't compress these pages' and 'stop storing shit in memory until I >>>> say so', nothing majorly intrusive) >>>> >>>>> ----------------------- >>>>> I have a Wiki setup for this project linux-mm.org/CompressedCaching >>>>> Here, I have detailed the approach I'm taking (at least for page-cache >>>>> pages). It's slightly out-dated now --- It doesn't detail approach for >>>>> anon pages in detail.....I will update it soon. >>>>> Feel free to edit it. >>>>> >>>>> Since I'm doing this project as part of Google SoC, I have many >>>>> 'deadlines' to consider. So, now bringing major changes to approach >>>>> will overwhelm me with work. It will be really nice if you can provide >>>>> feedback on my approach :) >>>> Your approach sounds at least to be minimally intrusive, which is good >>>> because it lets you keep the code separate and do interesting things. >>>> You also use the swapping mechanism, so you should be able to take >>>> advantage of page cache implicitly (a goal of my own design). >>>> >>>> You need to consider page grouping; Experiment with different sized >>>> groups, especially around 16K, 20K, 24K, 28K, and 32K. Try to make the >>>> code flexible enough to handle variable sized page groups too; it would >>>> be interesting to have multiple group sizes at run, but possibly more >>>> complex than needed. I found page grouping to present gains of 10-15% >>>> when moving between 4K-8K pages, and then about 5% between 16K-32K, and >>>> then 0.1% between 32K-64K on Rodrigo's patch; this tells me there's >>>> nothing interesting about any group bigger than 32K, and maybe you can >>>> get away with something smaller and still maximize gains. >>>> >>> I will definitely lookup into how flexible my current approach is w.r.t >>> grouping pages and will try to get some flexibility here if it shows up >>> rigid. This grouping of pages seems really important. >>> >> >> Run some simple tests. Feed the modules a group of test data; have them >> compress it in chunks of 4K, 8K, 12K, 16K, 20K, 24K, 28K, 32K, 36K; and >> then output the average compression ratio for each group size. The >> output would look similar to: >> >> >> 4K 16% >> 8K 23% >> 12K 30% >> 16K 35% >> 20K 39% >> 24K 42% <-- This would be a good place to stop... >> 28K 43% <-- Not worth the gains >> 32K 43% >> 36K 44% >> >> (Numbers faked) >> >> And you'd basically be looking for where the numbers stopped going up by >> useful degrees (in this example, 42% -> 43% we'll say 1% isn't useful >> for the added overhead). If you keep going and going you'll find a >> point where no matter how much bigger the block gets you can't gain any >> more compression; it's likely good to stop a bit before that, knocking >> off 1% or 2%. >> > > Your analysis of grouping pages interests me most :) I will definitely > look into having this grouping supported by my design.....again, a bell > rings -- nitin! get that static ccache with simplest of things you can > do first!! ;) > Yeah, get it working first. Just don't box yourself in a corner ;) You can try writing a test for various page sizes once the cache is actually working. 
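A userspace sketch of such a test, using zlib's compress2() purely as a stand-in for WKdm/WK4x4/LZO; it compresses the given sample file in chunks of 4K..36K and prints the average space saved per chunk size:

#include <stdio.h>
#include <zlib.h>

int main(int argc, char **argv)
{
	static unsigned char in[36 * 1024], out[72 * 1024];
	size_t sizes[] = { 4096, 8192, 12288, 16384, 20480,
			   24576, 28672, 32768, 36864 };
	FILE *f;

	if (argc < 2 || !(f = fopen(argv[1], "rb")))
		return 1;

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		unsigned long total_in = 0, total_out = 0;
		size_t n;

		rewind(f);
		while ((n = fread(in, 1, sizes[i], f)) == sizes[i]) {
			uLongf outlen = sizeof(out);

			/* Level 1 == fastest setting, closest to a "light" algorithm. */
			if (compress2(out, &outlen, in, n, 1) != Z_OK)
				continue;
			total_in += n;
			total_out += outlen;
		}
		if (total_in)
			printf("%2zuK  %5.1f%%\n", sizes[i] / 1024,
			       100.0 * ((double)total_in - (double)total_out) / total_in);
	}
	fclose(f);
	return 0;
}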
>>>> One part of my design was going to be to let the kernel's existing swap >>>> code make ALL the LRU decisions and base my decisions entirely on >>>> compressibility of the pages. Pages would be regrouped during idle >>>> time >>>> and groups would be kept in order of compressibility. The idea here >>>> was >>>> to be able to throw hard-to-compress pages to swap first during memory >>>> pressure. This would help maximize the in-memory compression ratio. >>>> >>>> As for multiple algorithms, I would say to always compress with your >>>> lightest, fastest algorithm FIRST (this would be WKdm). During idle >>>> time you can go back and decompress/recompress pages with something >>>> heavier (WK4x4). There is no shame in abusing the CPU when it's >>>> idle ;) >>>> >>> For desktop systems, this makes sense. But what about OLPC systems? You >>> really can't abuse idle CPU, you got to conserve power (I'm not sure of >>> its effect on power requirements but it should be like more we run more >>> we eat ;) ) >>> >> >> True :) >> > > > Cheers, > Nitin Gupta > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQIVAwUBRJWGugs1xW0HCTEFAQLpKBAAjbnj4W/mLvcudtAcH2DLeXpnL6/n1EHt gJ2c8bb9kdwRdDj6O6drCBPn4F2i2c0qiAI+6AIEtspFsTHDw3tHQi+9QxEChMf8 gMKLQ5fZI9SL/rXUbG6o6pF+pCvPtWjdfs+Qy1CDa9MIinh5DSFDQ1DGV1agBHtC Ow22apMS3p4xtNvnbRZ9l5d5GkS35nFJTnGDNVA2N6eKRk+UM2wlEqjcjFhMQyH7 XWQVTu4GiC9tDGDPUplGIfKIOHFfC7UqqLrciSQzgbr3gbIKfs1Qa/v5qNtWJLzr VeGKPCypK/74blA2+PjISH29aJ+vC6jJHQIFSy3fi9abt4esZ3JgxwyqNO+uIo1+ RKdvHMLKf71DNBf2QVV3BzBP7ZBYelRZFNCaSdfrGd5NFgIQR06JW5Nf14adpi+D hXK44YpkD1TVFuHoitVURODNBmwdrNF+Bqay/fY9hi31RSadylGT91U2OpFrSKyC 3cjfXB36HwDHNWtRxb2WzlETJvA4oMLWNvLbZAn6MqXZnBq6E4VwJNP+z3y22PNl dLOadtu+3qWbt0UyCHQp4sbpHgCgiG0wwDRMAJCNHhnXH7Tww93R84RkgtF8GkRC n1FtIH0mv3zWWSbFuei88KSqk8unH9xq9tKGKboHmJiDF3/AyZd/De8F+icTCIn9 0IfPKD+e6qQ= =v+xT -----END PGP SIGNATURE----- |
From: Andreas M. <an...@us...> - 2006-06-19 08:45:28
Hi, On Sun, Jun 18, 2006 at 08:58:54PM +0530, Nitin Gupta wrote: > I was just wondering how swap prefetching patch senses CPU as 'idle' - what are > the things it considers?, how it handles problem of 'spike' loads?. Will look at > all this later. I think that would be this one: http://lkml.org/lkml/2006/6/18/39 used by http://lkml.org/lkml/2006/6/18/40 Andreas Mohr
From: Nitin G. <nit...@gm...> - 2006-06-19 09:21:41
Hi, > > On Sun, Jun 18, 2006 at 08:58:54PM +0530, Nitin Gupta wrote: >> I was just wondering how swap prefetching patch senses CPU as 'idle' - what are >> the things it considers?, how it handles problem of 'spike' loads?. Will look at >> all this later. > > I think that would be this one: > http://lkml.org/lkml/2006/6/18/39 > used by > http://lkml.org/lkml/2006/6/18/40 > I have a strange problem: I haven't been able to access any page on lkml.org for a long time (no firewalls). Google's cached copy of lkml.org suggests the site has been having some trouble, but that looks recent, while I've had this problem for months.... -- Nitin
From: Nitin G. <nit...@gm...> - 2006-06-19 20:34:52
John Moser wrote: >> I don't think its good to kick out least compressible pages first. What >> ultimately matters is to have working set of running programs to stay in >> memory -- compressed or uncompressed. But then, its hard to guess >> working set of each running program...Its a big work in itself to come >> up with a good heuristic for which (and how many) pages to kick out from >> ccache when memory gets real low. >> > > Eh, that's what letting the existing swap logic figure it out is for. > Can't make it any LESS efficient than it already is right? :) Plus the > swap cache is looking out for us. > Ah..there is no swap logic for which pages to discard when swap goes (almost) full (there's one: OOM killer :) ). After all we are swapping to just another swap device which we call virtual swap. So, it is our work to develop this logic to handle case of ccache reaching its size limit (or when we want to shrink it for other reasons). > Some light LRU heuristics may be useful. Possibly time based ("If this > page has been used in the last 40,000 jiffies, try to kill something else") > So, you are saying to maintain jiffies counter for every page in ccache? Or if you store jiffies val then how will you lookup page that has jiffies val in some particular(40,000) window? This 'LRU' scheme won't work. > > It may even be possible to find an algorithm that does strict analysis > but not compression (like a hash) that spits out a value correlating to > the redundancy and encodability of the stream; but that would take a lot > of research to come up with. A really fast, really light compressor > would be easier. your hash idea is a real nice topic for MS...but I've not yet planned for that :) (btw...I searched for something like this on citeseer but no luck) WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) ..and still WKdm can't give you an idea of how well others(LZO) will compress it. > > The algorithm CI reference would be computed when the algorithm is first > loaded, before it was put into use. This is a one-time cost for each > algorithm, and can be ignored; it likely operates under a millisecond > but it wouldn't hurt anything if it ran for 100-200mS really. This > value would become persistent for the life of the algorithm (i.e. until > someone rmmod's it), so it wouldn't have to be recomputed again. > CI only makes sense when it tells how well _that_ algo compresses _this_ page. If you just want a sane initial CI value for every algo, just hard-code a value you can derive from many previous papers that compare these algos (typically WKdm, WK4x4 and LZO). > Note that given these times there could be a negative impact on > performance; on faster systems, of course, the impact becomes quickly > negligible. It is still significant for embedded systems. > > It is also notable that we can amortize the costs by conjecturing that > under normal circumstances, the system won't be swapping excessively, > and thus wouldn't be using compressed cache excessively; thus the CI > calculation can be largely ignored if it is sufficiently small. ccache is not even invoked in 'normal' conditions. It drops in when system is choking for air...err..RAM :) > > We can further amortize the costs by conjecturing that when ccache > thrashing occurs (much paging in and out of ccache), pages being sent > back and forth between memory and ccache are usually not going to be > dirty. In this case, we may be able to store the CI of the page and > only re-calculate it if the page is dirty (i.e. 
altered). Pages that > were decompressed, read, and recompressed would be unchanged and have > the same CI, thus we could skip recalculating the CI. > Yes, ccache (as I expect) will be loaded with clean page cache pages (if you choose to compress such pages) but then where will you store per-page CI values and maintain them even after a clean page is decompressed and freed from ccache? maybe a field somewhere...but this doesn't seems interesting.. >>> This would allow for the following useful elements: >>> >>> - Groups may be given a CI == sum of the CI's of each page in the group >>> - Pages each have a CI that doesn't change based on algorithm used >>> - Algorithms each have a CI to describe their efficiency >>> >>> Which can be used for: >>> >>> - It is possible to quickly take a short incremental step to try to get >>> the harder-to-compress pages into the same groups (i.e. find the first >>> group from the hardest-to-compress end that is 'wrong', find all groups >>> containing the next N hardest-to-compress pages, break those groups >>> apart, reorder them, the group is now 'proper') >>> - This allows for idle time to be used to sort pages and groups in >>> such a way that our first candidates for writing to disk are pages that >>> are 'hard to compress' (taking more memory) >>> >> Hard to compress factor should be combined with factor of having working >> set of apps within memory (and maybe other too - I don't know). > > Yes.. perhaps time based? ("This page was used some time within the > last 4000 jiffies, it's probably part of a working set"). I don't think this is the way to determine app working sets. This paper: I. Chihaia and T. Gross. Adaptive Main Memory Compression (http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf) mentions of a adaptive resizing scheme that keeps working sets of app in memory by a user space process that 'monitors' this app. But the worst thing is: -- Where is that code? Where is description of method used? googled that for hours...YMMV -- They seem to assume that only a single heavyweight app is running. While systems like OLPC are generic ones (albeit slower ones). > >>> - It is possible to use a "Light" algorithm (such as WKdm) initially, >>> and move to a "Heavy" algorithm by decompressing the block and >>> recompressing it during idle CPU time >>> - Maximize efficiency without adding unnecessary overhead while the >>> system is busy >>> >> -- compress a page with WKdm >> [CPU is idle] >> -- extract these pages compressed with WKdm from ccache >> [CPU is now busy] >> -- compress them with LZO >> -- put them again in ccache >> >> There can be lot of potential problems with this thing. We need to look >> at it carefully in detail. >> > > Yeah, like systems that are busy running a large number of programs > (i.e. building a kernel) where the CPU is working on and off in quick > spurts. You can get 80-150mS idle periods, if you try to do 200mS of > work there you just slowed the system down. > >> I was just wondering how swap prefetching patch senses CPU as 'idle' - >> what are the things it considers?, how it handles problem of 'spike' >> loads?. Will look at all this later. >> > > When the CPU is idle, linux activates the "Idle Loop." The Idle Loop > pretty much uses HLT to halt CPU use for a little bit when the system > has nothing to do. Otherwise the CPU would be going nuts all the time ;) > > At any rate, there's a mechanism in there somewhere for it. 
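One possible shape for the recompress-with-a-heavier-algorithm pass sketched above, with an explicit bail-out as soon as the system stops looking idle; the chunk_head fields, lists and every helper named here are assumptions for illustration only:

static void ccache_recompress_pass(void)
{
	struct chunk_head *ch, *tmp;

	/* Safe variant, in case a recompressed chunk moves off this list. */
	list_for_each_entry_safe(ch, tmp, &ccache_light_chunks, list) {
		/* Spike load: stop immediately and leave the rest for later. */
		if (!ccache_system_idle())
			break;

		/* Inflate the lightly-compressed chunk, then redo it harder. */
		ccache_decompress_chunk(ch, scratch_buf);
		if (ccache_compress_chunk(ch, scratch_buf, &ccache_lzo) < 0)
			ccache_compress_chunk(ch, scratch_buf, &ccache_wkdm);
	}
}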
Andreas Mohr provided these links (to know when CPU is 'idle') -- http://lkml.org/lkml/2006/6/18/39 -- http://lkml.org/lkml/2006/6/18/40 -- http://threebit.net/mail-archive/linux-kernel/msg05303.html Cheers, Nitin Gupta
From: John M. <joh...@gm...> - 2006-06-20 00:01:47
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nitin Gupta wrote: > John Moser wrote: >>> I don't think its good to kick out least compressible pages first. What >>> ultimately matters is to have working set of running programs to stay in >>> memory -- compressed or uncompressed. But then, its hard to guess >>> working set of each running program...Its a big work in itself to come >>> up with a good heuristic for which (and how many) pages to kick out from >>> ccache when memory gets real low. >>> >> >> Eh, that's what letting the existing swap logic figure it out is for. >> Can't make it any LESS efficient than it already is right? :) Plus the >> swap cache is looking out for us. >> > > Ah..there is no swap logic for which pages to discard when swap goes > (almost) full (there's one: OOM killer :) ). After all we are swapping > to just another swap device which we call virtual swap. > I mean for which pages to emit from ccache to disk. > So, it is our work to develop this logic to handle case of ccache > reaching its size limit (or when we want to shrink it for other reasons). > >> Some light LRU heuristics may be useful. Possibly time based ("If this >> page has been used in the last 40,000 jiffies, try to kill something >> else") >> > > So, you are saying to maintain jiffies counter for every page in ccache? > Or if you store jiffies val then how will you lookup page that has > jiffies val in some particular(40,000) window? This 'LRU' scheme won't > work. > could just throw a jiffies count (number of jiffies since boot time should be available) into the thing when it's requested from ccache; and before sending it to disk, check to see page->jiffies < JIFFIES_WE_THINK_MEAN_THIS_WAS_RECENTLY_WANTED I don't really know, or feel like thinking it out too far. This kind of thing can be deferred until ccache is really actually working. >> >> It may even be possible to find an algorithm that does strict analysis >> but not compression (like a hash) that spits out a value correlating to >> the redundancy and encodability of the stream; but that would take a lot >> of research to come up with. A really fast, really light compressor >> would be easier. > > your hash idea is a real nice topic for MS...but I've not yet planned > for that :) > (btw...I searched for something like this on citeseer but no luck) > > WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) > ..and still WKdm can't give you an idea of how well others(LZO) will > compress it. > In general there is a correlation. Remember compression is a characteristic of the algorithm and the data. It is computationally impossible to compress truly random data beyond a specific average ratio; however, small isolated blocks may encode well. In general we say one algorithm performs better than another because of this. 7zip gets smaller output than RAR or ZIP, consistantly, on various input datas; this is not a coincidence, the LZMA algorithm is actually better at redundancy elimination. The fact that an algorithm compresses a set of data more than the next shows that algorithm is better at eliminating redundancy; while the fact that a given algorithm compresses one piece of data better than the next shows that that data is more redundant. 
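The time-based check suggested earlier in this message could be expressed roughly as follows; last_used and CCACHE_RECENT are assumed fields/constants, while time_before() is the usual kernel helper for jiffies comparisons:

#include <linux/jiffies.h>

#define CCACHE_RECENT	40000	/* jiffies: tune, or derive from HZ */

/* Stamp on every lookup that hits ccache:  ch->last_used = jiffies; */

static int ccache_recently_used(struct chunk_head *ch)
{
	return time_before(jiffies, ch->last_used + CCACHE_RECENT);
}

/* When picking pages to push to real swap, prefer chunks for which
 * ccache_recently_used() returns false. */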
Think of it in terms of a punett square with the crossed values giving the output data size of compressing a 32K block of data: Algorithms: Data sets: X = weak compression A = minimal redundancy Y = strong compression B = much redundancy A B X 30 | 25 ----+---- Y 25 | 15 A weak algorithm with non-redundant data will produce a block only slightly smaller (if not bigger) than the input; while the same algorithm on more redundant data will produce a block that's a little smaller than that. In contrast, a strong algorithm on non-redundant data will produce a block that's somewhat smaller, and on redundant data will produce a block that's much smaller. You are correct in assuming that it's not a mathematical fact that any given algorithm can be used to predict how other algorithms will perform with the data; however, compression algorithms are basically analysis of data, and they do happen to as a side effect of their nature expose a property of the data. We can use this to conjecture that in most cases a compression algorithm will effectively describe how well other algorithms will perform, just not in a 1:1 manner; it will produce an approximate general trend. > >> >> The algorithm CI reference would be computed when the algorithm is first >> loaded, before it was put into use. This is a one-time cost for each >> algorithm, and can be ignored; it likely operates under a millisecond >> but it wouldn't hurt anything if it ran for 100-200mS really. This >> value would become persistent for the life of the algorithm (i.e. until >> someone rmmod's it), so it wouldn't have to be recomputed again. >> > > CI only makes sense when it tells how well _that_ algo compresses _this_ > page. If you just want a sane initial CI value for every algo, just > hard-code a value you can derive from many previous papers that compare > these algos (typically WKdm, WK4x4 and LZO). > See above :P >> Note that given these times there could be a negative impact on >> performance; on faster systems, of course, the impact becomes quickly >> negligible. It is still significant for embedded systems. >> >> It is also notable that we can amortize the costs by conjecturing that >> under normal circumstances, the system won't be swapping excessively, >> and thus wouldn't be using compressed cache excessively; thus the CI >> calculation can be largely ignored if it is sufficiently small. > > ccache is not even invoked in 'normal' conditions. It drops in when > system is choking for air...err..RAM :) > Under normal conditions, the system bumps the top of RAM and swaps a handful of pages out. Under abnormal conditions, the system has a working set 10% bigger than physical RAM and begins swapping data in and out every few milliseconds because it can't keep EVERYTHING it needs NOW in memory. Boot with 'mem=32M' and start KDE to see what I'm talking about. >> >> We can further amortize the costs by conjecturing that when ccache >> thrashing occurs (much paging in and out of ccache), pages being sent >> back and forth between memory and ccache are usually not going to be >> dirty. In this case, we may be able to store the CI of the page and >> only re-calculate it if the page is dirty (i.e. altered). Pages that >> were decompressed, read, and recompressed would be unchanged and have >> the same CI, thus we could skip recalculating the CI. 
>> > > Yes, ccache (as I expect) will be loaded with clean page cache pages (if > you choose to compress such pages) but then where will you store > per-page CI values and maintain them even after a clean page is > decompressed and freed from ccache? > maybe a field somewhere...but this doesn't seems interesting.. > *shrug* Don't worry about it for now anyway. It's just an optimization algorithm, worry about it in the future when ccache actually works first. >>>> This would allow for the following useful elements: >>>> >>>> - Groups may be given a CI == sum of the CI's of each page in the >>>> group >>>> - Pages each have a CI that doesn't change based on algorithm used >>>> - Algorithms each have a CI to describe their efficiency >>>> >>>> Which can be used for: >>>> >>>> - It is possible to quickly take a short incremental step to try to >>>> get >>>> the harder-to-compress pages into the same groups (i.e. find the first >>>> group from the hardest-to-compress end that is 'wrong', find all groups >>>> containing the next N hardest-to-compress pages, break those groups >>>> apart, reorder them, the group is now 'proper') >>>> - This allows for idle time to be used to sort pages and groups in >>>> such a way that our first candidates for writing to disk are pages that >>>> are 'hard to compress' (taking more memory) >>>> >>> Hard to compress factor should be combined with factor of having working >>> set of apps within memory (and maybe other too - I don't know). >> >> Yes.. perhaps time based? ("This page was used some time within the >> last 4000 jiffies, it's probably part of a working set"). > > I don't think this is the way to determine app working sets. > This paper: I. Chihaia and T. Gross. Adaptive Main Memory Compression > (http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf) > > mentions of a adaptive resizing scheme that keeps working sets of app in > memory by a user space process that 'monitors' this app. But the worst > thing is: > -- Where is that code? Where is description of method used? googled that > for hours...YMMV > -- They seem to assume that only a single heavyweight app is running. > While systems like OLPC are generic ones (albeit slower ones). > It's still the best way I can think of. An app that's got 4 segments of memory set up: [main()][hash_md5()][DATA:db_index][user_add()] main() is going to call some other stuff and probably never get returned into until the program exits; user_add() will happen when someone is administrating the program. Whenever the database is accessed, some of DATA:db_index is read through (among other chunks of the database); and hash_md5() is used to check for database corruption. In the normal course of the program, you'll probably find that, when the program is active, user_add() is rarely used; main() is pretty much NEVER used; hash_md5() is used every time the program does ANYTHING; and DATA:db_index is read through every time someone needs to find something in the database. In these cases, we can say that the program's current working set probably excludes main() and user_add() purely because they haven't been used recently; and probably includes hash_md5() and DATA:db_index because they've been used very recently. If the program is asleep, and rarely has any activity, we can probably page it out without anyone caring much in favor of some other program that's currently hammering the CPU. 
>> >>>> - It is possible to use a "Light" algorithm (such as WKdm) initially, >>>> and move to a "Heavy" algorithm by decompressing the block and >>>> recompressing it during idle CPU time >>>> - Maximize efficiency without adding unnecessary overhead while the >>>> system is busy >>>> >>> -- compress a page with WKdm >>> [CPU is idle] >>> -- extract these pages compressed with WKdm from ccache >>> [CPU is now busy] >>> -- compress them with LZO >>> -- put them again in ccache >>> >>> There can be lot of potential problems with this thing. We need to look >>> at it carefully in detail. >>> >> >> Yeah, like systems that are busy running a large number of programs >> (i.e. building a kernel) where the CPU is working on and off in quick >> spurts. You can get 80-150mS idle periods, if you try to do 200mS of >> work there you just slowed the system down. >> >>> I was just wondering how swap prefetching patch senses CPU as 'idle' - >>> what are the things it considers?, how it handles problem of 'spike' >>> loads?. Will look at all this later. >>> >> >> When the CPU is idle, linux activates the "Idle Loop." The Idle Loop >> pretty much uses HLT to halt CPU use for a little bit when the system >> has nothing to do. Otherwise the CPU would be going nuts all the time ;) >> >> At any rate, there's a mechanism in there somewhere for it. > > Andreas Mohr provided these links (to know when CPU is 'idle') > -- http://lkml.org/lkml/2006/6/18/39 > -- http://lkml.org/lkml/2006/6/18/40 > -- http://threebit.net/mail-archive/linux-kernel/msg05303.html > cool. CPU 'idles' btw by entering a loop that does absolutely nothing; however the kernel uses the HLT instruction to actually halt the CPU so it actually processes nothing for a time. > > > Cheers, > Nitin Gupta > > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQIVAwUBRJc5Wws1xW0HCTEFAQLD2g/+LdhieH0T6OPCveThzBfPtZOI/1ZuL9Hm zqGLdMoZUk83OWOGVMkTmpn4pZEa/QzSAoZd97SkgFrBjfdLBUxZi0tp3i++2x58 bW33wuS/NH0OyCu9Up+lGBQ5LPfVdMk7/9CiuGdOFSvX2t1Xt5K7S1hd/QRpJE50 6jlZpcvnStkEHHwMSi8nGDw2ntUIQwnXAptjXUq/fFEbPJaWNGDB+p3I9qfwg2gY PfMLgkdobhWcbtp6pcHdpyUcFkSYA4FFIggWpcDQw+TpWpMuGVbrPH9sMHjf7jim XN+uUM7tqPtCg1GzDcWthoEF9u+DBxD39maOT/3PD4VMFMIX5bYjyKbohuuf0VIv +5AvfFfNAjOwlpbVZsKbsCt2NXOi2Z+3+3JobRjds7PZ2XJPAKiA4SpOY99vCu5p aZw7ERfp/elFGI8DRMeRSXGVVf0ZUnQCPDqNH+Nze8VBRecdK+jd8sp068dtXw5z 6YOwZARlEfVxj2Hx+4+iefEOWtzpztPBD/NhxpiHOpSWmsIEhj3XrX5YLtYWEDHl UUgcNUZ7BFV9n3nnjW8mYU94VvJJLRbSP2YM08/HGPV5J7CPWJKBISd+k7umUV3h kkqmTeHFnjNB8CoGWmEHatkUpVKTKGDttEK7MVAChAy4jRypHX/tIqdfohalZzxr ImTpgaFc5vw= =+mxu -----END PGP SIGNATURE----- |
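A minimal sketch of the jiffies check John floats above; the struct, field, and threshold are made-up names, while jiffies and time_before() are the real kernel primitives (time_before() also handles jiffies wrap-around):

#include <linux/jiffies.h>

#define CC_RECENT_WINDOW	40000UL	/* "recently wanted" threshold, in jiffies */

struct cc_page {			/* hypothetical per-page bookkeeping */
	unsigned long last_used;	/* jiffies value stamped on ccache lookup */
	/* ... compressed data, size, flags ... */
};

/* called whenever the page is handed back out of ccache */
static inline void cc_page_touch(struct cc_page *cp)
{
	cp->last_used = jiffies;
}

/* called when picking victims to push out to the real swap disk:
 * skip pages that look like part of someone's working set */
static inline int cc_page_recently_used(const struct cc_page *cp)
{
	return time_before(jiffies, cp->last_used + CC_RECENT_WINDOW);
}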
From: Nitin G. <nit...@gm...> - 2006-06-20 16:16:47
|
John Moser wrote: > Nitin Gupta wrote: >> John Moser wrote: >>> It may even be possible to find an algorithm that does strict analysis >>> but not compression (like a hash) that spits out a value correlating to >>> the redundancy and encodability of the stream; but that would take a lot >>> of research to come up with. A really fast, really light compressor >>> would be easier. >> your hash idea is a real nice topic for MS...but I've not yet planned >> for that :) >> (btw...I searched for something like this on citeseer but no luck) >> >> WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) >> ..and still WKdm can't give you an idea of how well others(LZO) will >> compress it. >> > > In general there is a correlation. Remember compression is a > characteristic of the algorithm and the data. It is computationally > impossible to compress truly random data beyond a specific average > ratio; however, small isolated blocks may encode well. > > In general we say one algorithm performs better than another because of > this. 7zip gets smaller output than RAR or ZIP, consistantly, on > various input datas; this is not a coincidence, the LZMA algorithm is > actually better at redundancy elimination. The fact that an algorithm > compresses a set of data more than the next shows that algorithm is > better at eliminating redundancy; while the fact that a given algorithm > compresses one piece of data better than the next shows that that data > is more redundant. > > Think of it in terms of a punett square with the crossed values giving > the output data size of compressing a 32K block of data: > > Algorithms: Data sets: > X = weak compression A = minimal redundancy > Y = strong compression B = much redundancy > > A B > X 30 | 25 > ----+---- > Y 25 | 15 > > A weak algorithm with non-redundant data will produce a block only > slightly smaller (if not bigger) than the input; while the same > algorithm on more redundant data will produce a block that's a little > smaller than that. In contrast, a strong algorithm on non-redundant > data will produce a block that's somewhat smaller, and on redundant data > will produce a block that's much smaller. > > You are correct in assuming that it's not a mathematical fact that any > given algorithm can be used to predict how other algorithms will perform > with the data; however, compression algorithms are basically analysis of > data, and they do happen to as a side effect of their nature expose a > property of the data. We can use this to conjecture that in most cases > a compression algorithm will effectively describe how well other > algorithms will perform, just not in a 1:1 manner; it will produce an > approximate general trend. This seems very logical and we can confirm this at least for particular cases of WKdm, WK4x4 and LZO. If you have some time, it would be awesome if you take 'compress-test' kernel module: http://linux-mm.org/CompressedCaching/Code?action=AttachFile&do=get&target=compress-test.tar.gz and feed all three algos (included with this module), data from /lib (binary files), normal text files and other kind of data you can think of, and compare how effectively each compresses the data. This thing can be easily automated and lots of data on this will allow us to see if we can really conjecture how well performance of WKdm (lightest) describes performance of other two. If you are interested in doing this, I can send you all the details for this module. 
In fact, all you need to do is feed some data (in single pages) to /proc/compress_test/compress and then read this same file to see the compressed size. You can dynamically switch algos by writing 0, 1, or 2 to /proc/compress_test/algo_idx. --------------------------- [couldn't now think of a reply for rest of your mail] --------------------------- Also, I have some doubts on the page grouping you mentioned earlier: AFAIK, Rodrigo's patch never compressed pages in groups - all pages were compressed individually. The page grouping you were mentioning was the size of 'cells'. So, what is the meaning of grouping of pages w.r.t the 'chunked' structure? Though we can compress more than one page together and then store them in chunks, this was not what you were referring to...right? Cheers, Nitin Gupta |
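For reference, a small user-space sketch of driving the compress-test /proc interface described above; only the write-page / read-back / algo_idx usage comes from the mail, and the exact text returned by reading /proc/compress_test/compress is an assumption here:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SZ 4096

static void select_algo(int idx)	/* 0 = WKdm, 1 = WK4x4, 2 = LZO */
{
	FILE *f = fopen("/proc/compress_test/algo_idx", "w");

	if (!f) { perror("algo_idx"); exit(1); }
	fprintf(f, "%d", idx);
	fclose(f);
}

int main(int argc, char **argv)
{
	unsigned char page[PAGE_SZ];
	char result[256];
	FILE *in, *proc;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <file> <algo_idx>\n", argv[0]);
		return 1;
	}

	in = fopen(argv[1], "rb");
	if (!in || fread(page, 1, PAGE_SZ, in) != PAGE_SZ) {
		fprintf(stderr, "could not read one full page from %s\n", argv[1]);
		return 1;
	}
	fclose(in);

	select_algo(atoi(argv[2]));

	/* feed one page of data to the module ... */
	proc = fopen("/proc/compress_test/compress", "wb");
	if (!proc) { perror("compress (write)"); return 1; }
	fwrite(page, 1, PAGE_SZ, proc);
	fclose(proc);

	/* ... then read the same file back to see the compressed size */
	proc = fopen("/proc/compress_test/compress", "r");
	if (!proc) { perror("compress (read)"); return 1; }
	if (fgets(result, sizeof(result), proc))
		fputs(result, stdout);
	fclose(proc);

	return 0;
}

Looping this over every 4096-byte block of a few files from /lib, some text files, and other data would give the WKdm-vs-WK4x4-vs-LZO comparison being asked for.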
From: John M. <joh...@gm...> - 2006-06-20 18:42:13
|
Nitin Gupta wrote: > John Moser wrote: >> Nitin Gupta wrote: >>> John Moser wrote: >>>> It may even be possible to find an algorithm that does strict analysis >>>> but not compression (like a hash) that spits out a value correlating to >>>> the redundancy and encodability of the stream; but that would take a >>>> lot >>>> of research to come up with. A really fast, really light compressor >>>> would be easier. >>> your hash idea is a real nice topic for MS...but I've not yet planned >>> for that :) >>> (btw...I searched for something like this on citeseer but no luck) >>> >>> WKdm is lightest I know...something lighter is maybe dear ol' >>> memcpy() ;) >>> ..and still WKdm can't give you an idea of how well others(LZO) will >>> compress it. >>> >> >> In general there is a correlation. Remember compression is a >> characteristic of the algorithm and the data. It is computationally >> impossible to compress truly random data beyond a specific average >> ratio; however, small isolated blocks may encode well. >> >> In general we say one algorithm performs better than another because of >> this. 7zip gets smaller output than RAR or ZIP, consistantly, on >> various input datas; this is not a coincidence, the LZMA algorithm is >> actually better at redundancy elimination. The fact that an algorithm >> compresses a set of data more than the next shows that algorithm is >> better at eliminating redundancy; while the fact that a given algorithm >> compresses one piece of data better than the next shows that that data >> is more redundant. >> >> Think of it in terms of a punett square with the crossed values giving >> the output data size of compressing a 32K block of data: >> >> Algorithms: Data sets: >> X = weak compression A = minimal redundancy >> Y = strong compression B = much redundancy >> >> A B >> X 30 | 25 >> ----+---- >> Y 25 | 15 >> >> A weak algorithm with non-redundant data will produce a block only >> slightly smaller (if not bigger) than the input; while the same >> algorithm on more redundant data will produce a block that's a little >> smaller than that. In contrast, a strong algorithm on non-redundant >> data will produce a block that's somewhat smaller, and on redundant data >> will produce a block that's much smaller. >> >> You are correct in assuming that it's not a mathematical fact that any >> given algorithm can be used to predict how other algorithms will perform >> with the data; however, compression algorithms are basically analysis of >> data, and they do happen to as a side effect of their nature expose a >> property of the data. We can use this to conjecture that in most cases >> a compression algorithm will effectively describe how well other >> algorithms will perform, just not in a 1:1 manner; it will produce an >> approximate general trend. > > This seems very logical and we can confirm this at least for particular > cases of WKdm, WK4x4 and LZO. 
> > If you have some time, it would be awesome if you take 'compress-test' > kernel module: > > http://linux-mm.org/CompressedCaching/Code?action=AttachFile&do=get&target=compress-test.tar.gz > bluefox@iceslab:~/compress-test$ sudo ./testscript.sh ==== LIBRARY FILE TEST ==== Block 1 of /lib/libc.so.6 (4096 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000241 seconds, 17.0 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1000 (0 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000443 seconds, 9.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 952 (0 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000432 seconds, 9.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1593 (1 kB) Block 2 of /lib/libc.so.6 (8192 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 6.1e-05 seconds, 67.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1676 (1 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000197 seconds, 20.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1592 (1 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000177 seconds, 23.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2620 (2 kB) Block 3 of /lib/libc.so.6 (12288 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 9.6e-05 seconds, 42.7 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2512 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000162 seconds, 25.3 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2328 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000241 seconds, 17.0 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2877 (2 kB) Block 4 of /lib/libc.so.6 (16384 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000104 seconds, 39.4 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2692 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000158 seconds, 25.9 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2528 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000223 seconds, 18.4 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2854 (2 kB) Block 5 of /lib/libc.so.6 (20480 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.0001 seconds, 41 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2764 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000257 seconds, 15.9 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2580 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00028 seconds, 14.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2854 (2 kB) ==== RANDOM DATA TEST ==== Block 1 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001142 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 
records out 4096 bytes (4.1 kB) copied, 0.001151 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001132 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 2 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001079 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00115 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001254 seconds, 3.3 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 3 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001086 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001187 seconds, 3.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001262 seconds, 3.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 4 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001074 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00117 seconds, 3.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001276 seconds, 3.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 5 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000997 seconds, 4.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00101 seconds, 4.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001118 seconds, 3.7 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) (interesting results with random data) It looks like on libc, WKdm and WK4x4 do well but LZO fails to perform; while more random data squeezes down tighter with LZO. It's interesting that the random data sizes are consistent; I used /dev/urandom to calculate pi once, it's VERY good at producing random data, I am assuming thus that we are probably looking at the theoretical limits of compression on 4096 bytes of data (i.e. they get bigger). It does appear that there is a correlation here. For libc, WKdm and WK4x4 compress better than LZO; but in general, if a piece of data A compresses smaller than a piece of data B under one algorithm, then it also compresses smaller under the other algorithms. Side by side: /lib/libc.so.6 Block WKdm WK4x4 LZO 1 1000 952 1593 2 1676 1592 2620 3 2512 2328 2877 Notice as the output size of WKdm grows, so does the output size of WK4x4 and LZO. 
I have made note that for library data it seems in terms of output size LZO > WKdm > WK4x4; whereas for random data it seems to progress WKdm > WK4x4 > LZO. It is probably possible to write a small period analysis program to determine which to use; a simple monte-carlo could detect more random data, but counting the 0 bytes is also significant (WKdm and WK4x4 treat 0 as a special case, due to C memory alignment and unused bytes of memory pages tending to be filled with 0's). A quick test on the first 4096 bytes of each of /dev/urandom and /lib/libc.so.6 gives: bluefox@icebox:/tmp/x$ ./mcarlo hits = 1607, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.423745e+02 <--- Very random (high quality entropy) Nulls: 15 hits = 1595, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.793874e+01 Nulls: 13 hits = 1593, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.304198e+01 Nulls: 16 hits = 1592, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.103888e+01 Nulls: 17 hits = 1606, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 2.051743e+02 Nulls: 18 hits = 1570, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.330028e+01 Nulls: 19 hits = 1612, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.460953e+02 Nulls: 9 hits = 1578, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.678940e+01 Nulls: 13 hits = 1600, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 6.026764e+01 Nulls: 14 hits = 1621, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 4.094506e+01 Nulls: 19 hits = 2038, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.192071e+00 <--- Very not random (low quality) Nulls: 2341 hits = 2046, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.170274e+00 Nulls: 3448 <--- Lots of '\0' (WK-friendly!) hits = 2047, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.167605e+00 Nulls: 2646 hits = 2011, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.272035e+00 Nulls: 2194 hits = 1994, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.328130e+00 Nulls: 2022 hits = 1989, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.345582e+00 Nulls: 2032 hits = 1990, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.342055e+00 Nulls: 2019 hits = 1990, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.342055e+00 Nulls: 2029 hits = 1980, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.378180e+00 Nulls: 2025 hits = 1988, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.349127e+00 Nulls: 2008 > > and feed all three algos (included with this module), data from /lib > (binary files), normal text files and other kind of data you can think > of, and compare how effectively each compresses the data. This thing can > be easily automated and lots of data on this will allow us to see if we > can really conjecture how well performance of WKdm (lightest) describes > performance of other two. > > If you are interested in doing this, I can send you all the details for > this module. In fact, all you need to do is feed some data (in single > pages) to /proc/compress_test/compress and then read this same file to > see compressed size. You can dynamically switch algos by writing 0, 1, > or 2 to /proc/compress_test/algo_idx. 
> > --------------------------- > [couldn't now think of a reply for rest of your mail] > --------------------------- > > Also, I have some doubts on page grouping you mentioned earlier: > > AFAIK, Rodrigo's patch never compressed pages in groups - all pages were > compressed individually. The page grouping you were mentioning about was > the size of 'cells'. So, what is the meaning of grouping of pages w.r.t > 'chunked' structure? +Double Page Size +CONFIG_COMP_DOUBLE_PAGE + + Select this option to make compressed cache use double sized pages + (two pages contiguos) instead of single memory pages. That increases + the effect of the compressed cache use even when the compression + ratio isn't very good. On i386, compressed cache pages will have + 8KiB instead of the regular 4KiB. + + The size of the compressed cache page is listed in the output of + /proc/comp_cache_stat. + + If unsure, say Y here. Not sure. I can't tell at a glance how the logic works (I have difficulty figuring out the mm/ code), you may be right. > > Though we can compression more than one page together and then store > them in chunks but this was not you were mentioning about...right? > Yeah I was thinking crunch together multiple pages at once... > > Cheers, > Nitin Gupta > |
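A sketch of what "crunching together multiple pages at once" could look like on the compression side; the group size and the ccache_compress() hook are placeholders, the point is only that the compressor sees one contiguous buffer so redundancy across page boundaries can be exploited:

#include <stddef.h>
#include <string.h>

#define GRP_PAGES	8			/* e.g. 8 x 4KiB = 32KiB groups */
#define GRP_BYTES	(GRP_PAGES * 4096)

/* hypothetical compressor hook: returns number of bytes written to 'out' */
size_t ccache_compress(const void *in, size_t in_len, void *out);

static size_t compress_group(void *pages[GRP_PAGES], void *out)
{
	static unsigned char buf[GRP_BYTES];	/* scratch buffer; not reentrant */
	int i;

	/* gather the group into one contiguous input buffer ... */
	for (i = 0; i < GRP_PAGES; i++)
		memcpy(buf + i * 4096, pages[i], 4096);

	/* ... and compress it as a single unit */
	return ccache_compress(buf, GRP_BYTES, out);
}

The flip side is that restoring any single page means decompressing the whole group, which is exactly the cost/benefit trade-off being debated in this thread.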
From: John M. <joh...@gm...> - 2006-06-21 00:00:13
Attachments:
signature.asc
mcarlo_optimal.c
|
John Moser wrote: > > A quick test on the first 4096 bytes of each of /dev/urandom and > /lib/libc.so.6 gives: > A bunch of numbers for libc that are only a few digits away. I've written a more optimized version that processes 2 longs at a time to save cycle count by about 8 times on 32-bit and 16 times on 64-bit. Output below: bluefox@icebox:/tmp/x$ ./mcarlo_optimal hits = 401, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.138932e+02 Nulls: 22 hits = 391, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.150680e+01 Nulls: 14 hits = 384, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 7.062513e+00 Nulls: 20 hits = 413, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.176888e+01 Nulls: 21 hits = 394, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.575606e+01 Nulls: 11 hits = 392, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 1.264340e+01 Nulls: 18 hits = 416, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 9.224467e+00 Nulls: 11 hits = 407, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 2.625027e+01 Nulls: 31 hits = 397, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 2.498117e+01 Nulls: 13 hits = 396, shots = 512 Final Calculations for /dev/urandom: Q.Idx: 2.090185e+01 Nulls: 18 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2341 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 3448 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2646 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2194 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2022 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2032 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2019 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2029 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2025 hits = 512, shots = 512 Final Calculations for /lib/libc.so.6: Q.Idx: 1.164948e+00 Nulls: 2008 Notice /dev/urandom still gets over 7 and usually over 10, 20, or even over 100 for a Quality Index; while libc6 still never touches 2. Instead of comparing 2 bytes, this code compares longs, which (in this example) groups 4 bytes together and analyzes them to determine a hit or miss. As such, the slight variations in libc6 smooth out and we basically hit all the time. In case you're wondering, what we're 'hitting' is a circle. This algorithm mathematically inscribes a circle in a square; cuts the square down to the top-right corner; and tests where random values fall. ________ | ..--.. |<-- square |/ \| | X | X = Inside Circle | | |\ /| |_--__--_| ____ |-.. |<-- Top-right quadrant of square | X \| |____| X = inside circle If they fall a distance of 1.0 or less from bottom-left of this quadrant, they're in the circle; if not, they're outside the circle. The number of hits inside the circle over the total shots gives the value of pi; we can approximate this by calculating hits/shots for the quadrant and multiplying by 4. Simple monte-carlo algorithm. The optimized version processes 8 or 16 bytes at a time using CPU-native words; and uses one pass of AND comparisons and bit shifts to test how many NULL bytes are processed. 
There are a lot of pointer dereferences; the compiler will cache these for us so they won't cost cycles. Note that in practice we can probably discard the entire monte-carlo calculation, since having a lot of NULs is probably enough to justify using the Wilson-Kaplan family of algorithms over Lempel-Ziv. Using the monte-carlo based quality index is not good as a general compression index; compressed data has a high Q.Idx despite not being compressible. |
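For what it's worth, the quarter-circle test described above can be expressed with integers only; this is an illustrative sketch, not the attached mcarlo_optimal.c:

#include <stddef.h>

struct mc_result {
	unsigned int hits;
	unsigned int shots;
	unsigned int nulls;
};

/* Treat consecutive byte pairs as (x, y) shots into a 256x256 quadrant
 * and count the ones that land inside the inscribed quarter circle.
 * For random data hits/shots tends toward pi/4 (~0.78); for structured
 * data full of small values and zeros it tends toward 1. */
static void mc_scan(const unsigned char *buf, size_t len, struct mc_result *r)
{
	size_t i;

	r->hits = r->shots = r->nulls = 0;

	for (i = 0; i + 1 < len; i += 2) {
		unsigned int x = buf[i], y = buf[i + 1];

		if (!x)
			r->nulls++;
		if (!y)
			r->nulls++;

		r->shots++;
		if (x * x + y * y <= 255u * 255u)
			r->hits++;
	}
}

The numbers above fit that pattern: /dev/urandom lands around 384-416 hits out of 512 shots (roughly pi/4), while libc hits 512 out of 512 every time.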
From: Nitin G. <nit...@gm...> - 2006-06-21 22:14:47
|
John Moser wrote: >> Just a quick note: We can't have floating point in kernel. So, I doubt >> how will we have monte-carlo algorithm ported to kernel? >> > > Nods. I don't know if it can be rewritten in fixed point; however, it > is notable that mcarlo is probably not as useful as counting null bytes. > Wilson-Kaplan algorithms treat 0 as a special case; think about > anonymous memory: > > ADDR DATA > 0 char a > \0 (pad) > \0 (pad) > 4 int i > int i > int i > int i > 8 short j > short j > \0 (pad) > \0 (pad) > 12 char b > \0 (pad) > \0 (pad) > \0 (pad) > 16 .... > > Data values will often be aligned to 4 bytes; stack frames tend to be > aligned to 16 bytes I guess (STACK_ROUND() rounds to 16 bytes). > Executable memory is even worse; loops and small functions and other > code blocks tend to be aligned to page boundaries, so you have things > that look like: > > > ...; > ...; > jmp @a; > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > @a: /* Aligned to 4096 byte page boundary */ > ...; > loop; > ...; > ret; /*return*/ > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 > function foo() { /* Function aligned to a page boundary */ > ........ > > And so on. You get small clumps of zeros here and there that don't > really compress any better than small clumps of A's or 7's or such; but > WKdm and WK4x4 expect these so they have a special way to compress them > with lots more efficiency than LZO. > /* --------------- Frame This One ---------------------------------------- */ > At any rate, just running through with a NULL byte counter is probably > enough to decide whether to use WK or LZ. > /* ------- Simple, Fast and Reliable ------------------------------------- */ Great suggestion John Moser :-) And..I think, this NULL byte count can be really optimized to a level that time to perform this _each_ time when compressing pages should be far compensated by time and space we'll save by having a good hit at selecting right compression algorithm. Cheers, Nitin Gupta |
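A minimal sketch of that NULL-byte count, walking the page one unsigned long at a time; the threshold and names are placeholders, only the word-at-a-time scan itself is the point:

#define CC_PAGE_BYTES		4096
#define CC_WK_NUL_THRESHOLD	512	/* hypothetical: 1/8 of the page is NUL */

static unsigned int cc_count_nul_bytes(const void *page)
{
	const unsigned long *p = page;
	unsigned int i, b, nuls = 0;

	for (i = 0; i < CC_PAGE_BYTES / sizeof(unsigned long); i++) {
		unsigned long w = p[i];

		if (!w) {			/* all-zero word: the common case */
			nuls += sizeof(unsigned long);
			continue;
		}
		for (b = 0; b < sizeof(unsigned long); b++)
			if (!((w >> (b * 8)) & 0xff))
				nuls++;
	}
	return nuls;
}

/* 1 => prefer WKdm/WK4x4, 0 => prefer LZO */
static inline int cc_prefer_wk(const void *page)
{
	return cc_count_nul_bytes(page) >= CC_WK_NUL_THRESHOLD;
}

Whether 512 NULs per page is the right cut-off is exactly the kind of question the compress-test data can answer.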
From: John M. <joh...@gm...> - 2006-06-22 00:04:15
|
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nitin Gupta wrote: > > John Moser wrote: >>> Just a quick note: We can't have floating point in kernel. So, I doubt >>> how will we have monte-carlo algorithm ported to kernel? >>> >> >> Nods. I don't know if it can be rewritten in fixed point; however, it >> is notable that mcarlo is probably not as useful as counting null bytes. >> Wilson-Kaplan algorithms treat 0 as a special case; think about >> anonymous memory: [......] > /* --------------- Frame This One > ---------------------------------------- */ >> At any rate, just running through with a NULL byte counter is probably >> enough to decide whether to use WK or LZ. >> > /* ------- Simple, Fast and Reliable > ------------------------------------- */ > > Great suggestion John Moser :-) > > And..I think, this NULL byte count can be really optimized to a level > that time to perform this _each_ time when compressing pages should be > far compensated by time and space we'll save by having a good hit at > selecting right compression algorithm. > Optimizing, you can probably have it escape out when it hits a certain number of NULLs; but you still want to avoid running it every time. Mostly a page of memory is going to stay what it is; the stack, for example, will always have loads of padding in it except for the case of large on-stack buffers full of text and such (typically such buffers are short and negligible; large ones would mean more than say half a page?). At the very least, there's a lot of memory that won't change, and we can avoid re-working this when we see that stuff come back. This is especially true of mmap()'d segments of executables and libraries, although those aren't normally swapped anyway (they get invalidated and cleared from memory, re-read from disk later; in the future compressed caching could compress that stuff and just invalidate/free it when memory pressure is high rather than sending it to swap). There's other data that behaves this way, though; for example, most of the control information for GUIs like GTK won't change once the window is initially built and drawn.. > > > Cheers, > Nitin Gupta > > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQIVAwUBRJnc5gs1xW0HCTEFAQJ5mRAAgfK7ZJFGudGjZpW9yqtXeCUGMCg8E2ss 8KLEl+bT+Jzja8+Qh+DYk1N6mMx4G7l0w/wR+M4JdnDEC2v/EzGMsbQ2IGWsEEl0 7eKpCUqvJUNiwaiNhsdoUy2BrC/6H3EXt4EEoqJtkFGAY1XTqT3kzHzopFlcFDPC XOQaLMvnDGIIVLarH+CY6SfWvBv6OcEYTLYFg43FWhZyOhmZboQGAy6CAp9XYELq 0MsmSeM1FA4cHQOAqd68HCY23ZubS1sy/IrnfFaxp3z5M/YQr797sMLceyPjM2Ao E/ZI3eVbtG+Qm+ZKqjpjGB+d/4MlD5brr/MlEpQqVXmGNkNTCMiwLW29r3/6or28 BIwkZ08kLOU/plqa5jum9nvb7IfQVK9ITUTRC39J9+xETCRAIV5mgR7PSCaIo7jm bbA2L6dizHZIfr7uwYIQB8dCT2+qKZipOCPHyeDtVfacJ1xPwch3hImjh0Jdg6KK lePMN52MLVlRp6fL7n3eEkMYXb9q2r3BzJ1jT+XBtF4xgfZXFxWpPEvUlhuFcZtD bxSS/cslgyHDajxMl4N9jffTd6wYKBMwsFKllNQO6YVSSS6T7KGpPRdwIAOj9Wq4 6/3HFmDQRsDjgsnwkFZx1toViSTTsqk518uJPgqFnT2x+6stC8hk8T0GvMCCAmQi kXZK9r5RaYQ= =l3kb -----END PGP SIGNATURE----- |
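One possible shape for the "avoid re-working this when it comes back" point, purely illustrative: cache the algorithm choice next to the compressed entry and only redo the NUL scan when the page comes back dirty. The struct, the hint field, and cc_prefer_wk_page() are invented here; PageDirty() is the real page-flag test:

#include <linux/mm.h>
#include <linux/page-flags.h>

int cc_prefer_wk_page(struct page *page);	/* hypothetical: kmap the page and
						 * run the NUL-count check on it */

struct cc_entry {				/* hypothetical per-entry bookkeeping */
	unsigned char algo_hint;		/* 0 = unknown, 1 = WK, 2 = LZO */
	/* ... */
};

static int cc_pick_algo(struct page *page, struct cc_entry *e)
{
	/* clean page seen before: its contents have not changed,
	 * so the old decision is still valid */
	if (e->algo_hint && !PageDirty(page))
		return e->algo_hint;

	e->algo_hint = cc_prefer_wk_page(page) ? 1 : 2;
	return e->algo_hint;
}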