linuxcompressed-devel Mailing List for Linux Compressed Cache (Page 6)
From: John M. <joh...@gm...> - 2006-06-21 00:00:13
|
John Moser wrote:
>
> A quick test on the first 4096 bytes of each of /dev/urandom and
> /lib/libc.so.6 gives:
> A bunch of numbers for libc that are only a few digits away.

I've written a more optimized version that processes two longs at a time, cutting the cycle count by roughly a factor of 8 on 32-bit and 16 on 64-bit. Output below:

bluefox@icebox:/tmp/x$ ./mcarlo_optimal
hits = 401, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 1.138932e+02  Nulls: 22
hits = 391, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 1.150680e+01  Nulls: 14
hits = 384, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 7.062513e+00  Nulls: 20
hits = 413, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 1.176888e+01  Nulls: 21
hits = 394, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 1.575606e+01  Nulls: 11
hits = 392, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 1.264340e+01  Nulls: 18
hits = 416, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 9.224467e+00  Nulls: 11
hits = 407, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 2.625027e+01  Nulls: 31
hits = 397, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 2.498117e+01  Nulls: 13
hits = 396, shots = 512  Final Calculations for /dev/urandom:  Q.Idx: 2.090185e+01  Nulls: 18
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2341
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 3448
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2646
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2194
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2022
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2032
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2019
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2029
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2025
hits = 512, shots = 512  Final Calculations for /lib/libc.so.6:  Q.Idx: 1.164948e+00  Nulls: 2008

Notice that /dev/urandom still scores over 7, and usually over 10, 20, or even over 100, for a Quality Index, while libc6 still never touches 2. Instead of comparing two bytes, this code compares longs, which (in this example) groups 4 bytes together and analyzes them to determine a hit or miss. As a result, the slight variations in libc6 smooth out and we basically hit all the time.

In case you're wondering, what we're 'hitting' is a circle. The algorithm mathematically inscribes a circle in a square, cuts the square down to its top-right quadrant, and tests where random values fall:

  ________
 | ..--.. |  <-- square
 |/      \|
 |    X   |      X = Inside Circle
 |        |
 |\      /|
 |_--__--_|

  ____
 |-.. |     <-- Top-right quadrant of square
 | X \|
 |____|

If a point falls a distance of 1.0 or less from the bottom-left corner of this quadrant, it's inside the circle; otherwise it's outside. The ratio of hits inside the circle to total shots approximates pi/4, so calculating hits/shots for the quadrant and multiplying by 4 approximates pi. It's a simple Monte Carlo algorithm. The optimized version processes 8 or 16 bytes at a time using CPU-native words, and uses one pass of AND masks and bit shifts to count how many NULL bytes it has processed.
There are a lot of pointer dereferences; the compiler will cache these for us, so they won't cost extra cycles. Note that in practice we can probably discard the entire Monte Carlo calculation, since seeing a lot of NULs is probably enough on its own to justify choosing a Wilson-Kaplan family algorithm over Lempel-Ziv. The Monte Carlo quality index is also not useful as a general compressibility index: already-compressed data gets a high Q.Idx despite not being compressible any further. |
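For anyone who wants to play with the idea, here is a minimal userspace sketch in the spirit of the check described above. It is not the mcarlo source; the buffer size, the thresholds, and the output format are all illustrative only.

/*
 * Minimal sketch of the NUL-count / Monte Carlo check discussed above.
 * NOT the original mcarlo code; thresholds and names are made up.
 */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define PAGE_BYTES 4096
static const double PI_OVER_4 = 0.785398163;

/* Count zero bytes: WK-family compressors treat 0x00 as a special case. */
static size_t count_nulls(const uint8_t *buf, size_t len)
{
	size_t i, nulls = 0;

	for (i = 0; i < len; i++)
		if (buf[i] == 0)
			nulls++;
	return nulls;
}

/*
 * Treat consecutive byte pairs as (x, y) points in the unit square and
 * count how many land inside the inscribed quarter circle.  Random data
 * gives hits/shots close to pi/4; structured data drifts away from it.
 */
static double quarter_circle_ratio(const uint8_t *buf, size_t len)
{
	size_t shots = len / 2, hits = 0, i;

	for (i = 0; i < shots; i++) {
		double x = buf[2 * i] / 255.0;
		double y = buf[2 * i + 1] / 255.0;

		if (x * x + y * y <= 1.0)
			hits++;
	}
	return (double)hits / (double)shots;
}

int main(void)
{
	uint8_t page[PAGE_BYTES];
	size_t got = fread(page, 1, sizeof(page), stdin);

	if (got < 2)
		return 1;

	size_t nulls = count_nulls(page, got);
	double ratio = quarter_circle_ratio(page, got);

	printf("nulls = %zu, hits/shots = %f (pi/4 = %f)\n",
	       nulls, ratio, PI_OVER_4);

	/* Illustrative decision: many NULs -> prefer a WK-family algorithm;
	 * a ratio near pi/4 -> data looks random, expect poor compression. */
	if (nulls > got / 8)
		puts("looks WK-friendly");
	else if (fabs(ratio - PI_OVER_4) < 0.02)
		puts("looks close to random");
	return 0;
}

Build with -lm and feed it one page at a time, e.g. head -c 4096 /dev/urandom | ./nulcheck versus head -c 4096 /lib/libc.so.6 | ./nulcheck.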
From: Mauricio L. <mau...@gm...> - 2006-06-20 23:23:08
|
Hi Gupta,

Regarding the master chunk list (chunks) in the struct chunk, does it store the chunks of one page cache or several page caches? Will there be one master chunk list per page cache radix tree?

I also wonder where the chunk list header will be stored. For instance, the list header of task_struct is stored in the tasks field of init_task, so when an insertion happens in the list of task_struct, something like this is used:

list_add_tail(&p->tasks, &init_task.tasks);

where the first argument is the element to be inserted and the second argument corresponds to the list header of task_struct. So where will the list header of chunks be stored? Any idea?

BR,

Mauricio Lin.

On 6/19/06, Nitin Gupta <nit...@gm...> wrote:
> Hi Mauricio,
>
> Mauricio Lin wrote:
> > Hi all,
> >
> > After checking the chunk list in
> > http://linux-mm.org/CompressedCaching, I have some questions as
> > follows.
> >
> > Did anyone start to implement the chunk list operations? Is there any
> > progress on it?
> >
> No, I haven't yet started the compression structure implementation.
>
> > Regarding the "struct chunk_head", the field "chunk_list" points to
> > the related chunks, right?
> > I mean if a page A in the page cache is replaced by the corresponding
> > chunk head, the field chunk_list in such chunk_head points to the list
> > of related chunks of page A, even if some chunks are located in other
> > physical pages, right?
> >
> Yes. A page is compressed and stored in chunks (which can be present in
> different physical pages). These chunks are linked together using chunk->next.
>
> > Where can I find more info about the purpose of the lru field in the struct
> > chunk_head?
> >
> chunk_head->lru has basically the same purpose as struct page->lru.
>
> When the compressed cache becomes full, we have to free some pages from ccache to the
> swap disk. The chunk_head->lru field is to keep these chunk_heads in LRU
> order. So, when freeing pages from ccache, take a chunk_head from the head of the LRU
> list and free the chunk_head and corres. chunk(s).
>
> These freed chunks are then merged with other physically adjacent free chunks
> (found using chunk->chunks->{prev, next}). A chunk is identified as free by a flag
> (say, CH_free) stored in the 4 MSBs of chunk->size (a chunk has a max size of
> PAGE_SIZE, so 4 bits are free).
>
> Also, as mentioned on the site: LRU lists will be maintained separately for clean
> page cache pages, dirty page cache pages and swap cache pages. This will allow
> pages of these individual types to be separately tracked, and when compressed
> cache size reaches its limit, specific types of pages can be freed in priority to
> others (like put more pressure on freeing clean page cache pages than dirty page
> cache pages, and least on swap cache pages).
>
> Cheers,
> Nitin Gupta
>
|
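The design notes in the thread do not say where the master list head lives; one way it could look, following the init_task.tasks analogy from the question above, is sketched below. The global name, the locking, and the reduced struct chunk are hypothetical, not from any ccache patch.

/*
 * Hypothetical sketch of a global master chunk list head and the
 * corresponding insertion, in the style of init_task.tasks.
 */
#include <linux/list.h>
#include <linux/spinlock.h>

struct chunk {
	struct list_head chunks;	/* linkage into the master chunk list */
	/* other fields (next, size, ...) omitted here */
};

/* A single global list head, analogous to init_task.tasks. */
static LIST_HEAD(ccache_master_chunks);
static DEFINE_SPINLOCK(ccache_chunks_lock);

static void ccache_add_chunk(struct chunk *c)
{
	spin_lock(&ccache_chunks_lock);
	list_add_tail(&c->chunks, &ccache_master_chunks);
	spin_unlock(&ccache_chunks_lock);
}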
From: John M. <joh...@gm...> - 2006-06-20 18:42:13
|
Nitin Gupta wrote: > John Moser wrote: >> Nitin Gupta wrote: >>> John Moser wrote: >>>> It may even be possible to find an algorithm that does strict analysis >>>> but not compression (like a hash) that spits out a value correlating to >>>> the redundancy and encodability of the stream; but that would take a >>>> lot >>>> of research to come up with. A really fast, really light compressor >>>> would be easier. >>> your hash idea is a real nice topic for MS...but I've not yet planned >>> for that :) >>> (btw...I searched for something like this on citeseer but no luck) >>> >>> WKdm is lightest I know...something lighter is maybe dear ol' >>> memcpy() ;) >>> ..and still WKdm can't give you an idea of how well others(LZO) will >>> compress it. >>> >> >> In general there is a correlation. Remember compression is a >> characteristic of the algorithm and the data. It is computationally >> impossible to compress truly random data beyond a specific average >> ratio; however, small isolated blocks may encode well. >> >> In general we say one algorithm performs better than another because of >> this. 7zip gets smaller output than RAR or ZIP, consistantly, on >> various input datas; this is not a coincidence, the LZMA algorithm is >> actually better at redundancy elimination. The fact that an algorithm >> compresses a set of data more than the next shows that algorithm is >> better at eliminating redundancy; while the fact that a given algorithm >> compresses one piece of data better than the next shows that that data >> is more redundant. >> >> Think of it in terms of a punett square with the crossed values giving >> the output data size of compressing a 32K block of data: >> >> Algorithms: Data sets: >> X = weak compression A = minimal redundancy >> Y = strong compression B = much redundancy >> >> A B >> X 30 | 25 >> ----+---- >> Y 25 | 15 >> >> A weak algorithm with non-redundant data will produce a block only >> slightly smaller (if not bigger) than the input; while the same >> algorithm on more redundant data will produce a block that's a little >> smaller than that. In contrast, a strong algorithm on non-redundant >> data will produce a block that's somewhat smaller, and on redundant data >> will produce a block that's much smaller. >> >> You are correct in assuming that it's not a mathematical fact that any >> given algorithm can be used to predict how other algorithms will perform >> with the data; however, compression algorithms are basically analysis of >> data, and they do happen to as a side effect of their nature expose a >> property of the data. We can use this to conjecture that in most cases >> a compression algorithm will effectively describe how well other >> algorithms will perform, just not in a 1:1 manner; it will produce an >> approximate general trend. > > This seems very logical and we can confirm this at least for particular > cases of WKdm, WK4x4 and LZO. 
> > If you have some time, it would be awesome if you take 'compress-test' > kernel module: > > http://linux-mm.org/CompressedCaching/Code?action=AttachFile&do=get&target=compress-test.tar.gz > bluefox@iceslab:~/compress-test$ sudo ./testscript.sh ==== LIBRARY FILE TEST ==== Block 1 of /lib/libc.so.6 (4096 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000241 seconds, 17.0 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1000 (0 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000443 seconds, 9.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 952 (0 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000432 seconds, 9.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1593 (1 kB) Block 2 of /lib/libc.so.6 (8192 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 6.1e-05 seconds, 67.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1676 (1 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000197 seconds, 20.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 1592 (1 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000177 seconds, 23.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2620 (2 kB) Block 3 of /lib/libc.so.6 (12288 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 9.6e-05 seconds, 42.7 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2512 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000162 seconds, 25.3 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2328 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000241 seconds, 17.0 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2877 (2 kB) Block 4 of /lib/libc.so.6 (16384 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000104 seconds, 39.4 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2692 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000158 seconds, 25.9 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2528 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000223 seconds, 18.4 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2854 (2 kB) Block 5 of /lib/libc.so.6 (20480 in) Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.0001 seconds, 41 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2764 (2 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000257 seconds, 15.9 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2580 (2 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00028 seconds, 14.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 2854 (2 kB) ==== RANDOM DATA TEST ==== Block 1 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001142 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 
records out 4096 bytes (4.1 kB) copied, 0.001151 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001132 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 2 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001079 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00115 seconds, 3.6 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001254 seconds, 3.3 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 3 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001086 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001187 seconds, 3.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001262 seconds, 3.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 4 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001074 seconds, 3.8 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00117 seconds, 3.5 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001276 seconds, 3.2 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) Block 5 of /dev/urandom Algorithm index: 0 = WKdm 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000997 seconds, 4.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4368 (4 kB) Algorithm index: 1 = WK4x4 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.00101 seconds, 4.1 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4352 (4 kB) Algorithm index: 2 = LZO 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.001118 seconds, 3.7 MB/s Data size given: 4096 (4 kB) -- padded to 4kB Compressed size: 4116 (4 kB) (interesting results with random data) It looks like on libc, WKdm and WK4x4 do well but LZO fails to perform; while more random data squeezes down tighter with LZO. It's interesting that the random data sizes are consistent; I used /dev/urandom to calculate pi once, it's VERY good at producing random data, I am assuming thus that we are probably looking at the theoretical limits of compression on 4096 bytes of data (i.e. they get bigger). It does appear that there is a correlation here. For libc, WKdm and WK4x4 compress better than LZO; but in general, if a piece of data A compresses smaller than a piece of data B under one algorithm, then it also compresses smaller under the other algorithms. Side by side: /lib/libc.so.6 Block WKdm WK4x4 LZO 1 1000 952 1593 2 1676 1592 2620 3 2512 2328 2877 Notice as the output size of WKdm grows, so does the output size of WK4x4 and LZO. 
I have made note that for library data it seems in terms of output size LZO > WKdm > WK4x4; whereas for random data it seems to progress WKdm > WK4x4 > LZO. It is probably possible to write a small period analysis program to determine which to use; a simple monte-carlo could detect more random data, but counting the 0 bytes is also significant (WKdm and WK4x4 treat 0 as a special case, due to C memory alignment and unused bytes of memory pages tending to be filled with 0's). A quick test on the first 4096 bytes of each of /dev/urandom and /lib/libc.so.6 gives: bluefox@icebox:/tmp/x$ ./mcarlo hits = 1607, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.423745e+02 <--- Very random (high quality entropy) Nulls: 15 hits = 1595, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.793874e+01 Nulls: 13 hits = 1593, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.304198e+01 Nulls: 16 hits = 1592, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 3.103888e+01 Nulls: 17 hits = 1606, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 2.051743e+02 Nulls: 18 hits = 1570, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.330028e+01 Nulls: 19 hits = 1612, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.460953e+02 Nulls: 9 hits = 1578, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 1.678940e+01 Nulls: 13 hits = 1600, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 6.026764e+01 Nulls: 14 hits = 1621, shots = 2048 Final Calculations for /dev/urandom: Q.Idx: 4.094506e+01 Nulls: 19 hits = 2038, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.192071e+00 <--- Very not random (low quality) Nulls: 2341 hits = 2046, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.170274e+00 Nulls: 3448 <--- Lots of '\0' (WK-friendly!) hits = 2047, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.167605e+00 Nulls: 2646 hits = 2011, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.272035e+00 Nulls: 2194 hits = 1994, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.328130e+00 Nulls: 2022 hits = 1989, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.345582e+00 Nulls: 2032 hits = 1990, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.342055e+00 Nulls: 2019 hits = 1990, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.342055e+00 Nulls: 2029 hits = 1980, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.378180e+00 Nulls: 2025 hits = 1988, shots = 2048 Final Calculations for /lib/libc.so.6: Q.Idx: 1.349127e+00 Nulls: 2008 > > and feed all three algos (included with this module), data from /lib > (binary files), normal text files and other kind of data you can think > of, and compare how effectively each compresses the data. This thing can > be easily automated and lots of data on this will allow us to see if we > can really conjecture how well performance of WKdm (lightest) describes > performance of other two. > > If you are interested in doing this, I can send you all the details for > this module. In fact, all you need to do is feed some data (in single > pages) to /proc/compress_test/compress and then read this same file to > see compressed size. You can dynamically switch algos by writing 0, 1, > or 2 to /proc/compress_test/algo_idx. 
> > --------------------------- > [couldn't now think of a reply for rest of your mail] > --------------------------- > > Also, I have some doubts on page grouping you mentioned earlier: > > AFAIK, Rodrigo's patch never compressed pages in groups - all pages were > compressed individually. The page grouping you were mentioning about was > the size of 'cells'. So, what is the meaning of grouping of pages w.r.t > 'chunked' structure? +Double Page Size +CONFIG_COMP_DOUBLE_PAGE + + Select this option to make compressed cache use double sized pages + (two pages contiguos) instead of single memory pages. That increases + the effect of the compressed cache use even when the compression + ratio isn't very good. On i386, compressed cache pages will have + 8KiB instead of the regular 4KiB. + + The size of the compressed cache page is listed in the output of + /proc/comp_cache_stat. + + If unsure, say Y here. Not sure. I can't tell at a glance how the logic works (I have difficulty figuring out the mm/ code), you may be right. > > Though we can compression more than one page together and then store > them in chunks but this was not you were mentioning about...right? > Yeah I was thinking crunch together multiple pages at once... > > Cheers, > Nitin Gupta > |
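As a rough illustration of the "crunch together multiple pages at once" idea, grouping amounts to concatenating the pages and handing the whole buffer to a single compressor call, so redundancy across pages becomes visible to the compressor. The helper below is a toy sketch with a hypothetical compressor callback; buffer handling in a real kernel patch would look quite different.

/*
 * Toy sketch of group compression.  The compress_fn callback is
 * hypothetical -- any block compressor (WKdm, WK4x4, LZO, ...) with
 * this shape would do.  The on-stack buffer is for brevity only.
 */
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096
#define GROUP_PAGES     8		/* a 32 KiB group, per the thread */

typedef size_t (*compress_fn)(const void *src, size_t src_len,
			      void *dst, size_t dst_len);

static size_t compress_group(void *pages[GROUP_PAGES], void *out,
			     size_t out_len, compress_fn compress)
{
	unsigned char buf[GROUP_PAGES * PAGE_SIZE_BYTES];
	int i;

	/* Concatenate the group so cross-page redundancy is visible. */
	for (i = 0; i < GROUP_PAGES; i++)
		memcpy(buf + i * PAGE_SIZE_BYTES, pages[i], PAGE_SIZE_BYTES);

	return compress(buf, sizeof(buf), out, out_len);
}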
From: Nitin G. <nit...@gm...> - 2006-06-20 16:16:47
|
John Moser wrote: > Nitin Gupta wrote: >> John Moser wrote: >>> It may even be possible to find an algorithm that does strict analysis >>> but not compression (like a hash) that spits out a value correlating to >>> the redundancy and encodability of the stream; but that would take a lot >>> of research to come up with. A really fast, really light compressor >>> would be easier. >> your hash idea is a real nice topic for MS...but I've not yet planned >> for that :) >> (btw...I searched for something like this on citeseer but no luck) >> >> WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) >> ..and still WKdm can't give you an idea of how well others(LZO) will >> compress it. >> > > In general there is a correlation. Remember compression is a > characteristic of the algorithm and the data. It is computationally > impossible to compress truly random data beyond a specific average > ratio; however, small isolated blocks may encode well. > > In general we say one algorithm performs better than another because of > this. 7zip gets smaller output than RAR or ZIP, consistantly, on > various input datas; this is not a coincidence, the LZMA algorithm is > actually better at redundancy elimination. The fact that an algorithm > compresses a set of data more than the next shows that algorithm is > better at eliminating redundancy; while the fact that a given algorithm > compresses one piece of data better than the next shows that that data > is more redundant. > > Think of it in terms of a punett square with the crossed values giving > the output data size of compressing a 32K block of data: > > Algorithms: Data sets: > X = weak compression A = minimal redundancy > Y = strong compression B = much redundancy > > A B > X 30 | 25 > ----+---- > Y 25 | 15 > > A weak algorithm with non-redundant data will produce a block only > slightly smaller (if not bigger) than the input; while the same > algorithm on more redundant data will produce a block that's a little > smaller than that. In contrast, a strong algorithm on non-redundant > data will produce a block that's somewhat smaller, and on redundant data > will produce a block that's much smaller. > > You are correct in assuming that it's not a mathematical fact that any > given algorithm can be used to predict how other algorithms will perform > with the data; however, compression algorithms are basically analysis of > data, and they do happen to as a side effect of their nature expose a > property of the data. We can use this to conjecture that in most cases > a compression algorithm will effectively describe how well other > algorithms will perform, just not in a 1:1 manner; it will produce an > approximate general trend. This seems very logical and we can confirm this at least for particular cases of WKdm, WK4x4 and LZO. If you have some time, it would be awesome if you take 'compress-test' kernel module: http://linux-mm.org/CompressedCaching/Code?action=AttachFile&do=get&target=compress-test.tar.gz and feed all three algos (included with this module), data from /lib (binary files), normal text files and other kind of data you can think of, and compare how effectively each compresses the data. This thing can be easily automated and lots of data on this will allow us to see if we can really conjecture how well performance of WKdm (lightest) describes performance of other two. If you are interested in doing this, I can send you all the details for this module. 
In fact, all you need to do is feed some data (in single pages) to /proc/compress_test/compress and then read this same file to see the compressed size. You can dynamically switch algos by writing 0, 1, or 2 to /proc/compress_test/algo_idx.

---------------------------
[couldn't think of a reply for the rest of your mail right now]
---------------------------

Also, I have some doubts about the page grouping you mentioned earlier:

AFAIK, Rodrigo's patch never compressed pages in groups - all pages were compressed individually. The page grouping you were mentioning was the size of 'cells'. So, what is the meaning of grouping of pages w.r.t. the 'chunked' structure?

Though we can compress more than one page together and then store the result in chunks, that was not what you were mentioning... right?

Cheers,
Nitin Gupta
|
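A rough driver for that procedure could look like the sketch below. The /proc paths come from the description above, but the exact read-back format of /proc/compress_test/compress is assumed, so treat this as an untested sketch rather than the module's real test script.

/*
 * Sketch: feed one page at a time to /proc/compress_test/compress and
 * read the same file back, cycling through the three algorithms.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_BYTES 4096

static void set_algo(int idx)		/* 0 = WKdm, 1 = WK4x4, 2 = LZO */
{
	FILE *f = fopen("/proc/compress_test/algo_idx", "w");

	if (!f) { perror("algo_idx"); exit(1); }
	fprintf(f, "%d\n", idx);
	fclose(f);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/lib/libc.so.6";
	unsigned char page[PAGE_BYTES];
	char result[256];
	FILE *in = fopen(path, "rb");
	int algo, block;

	if (!in) { perror(path); return 1; }

	for (algo = 0; algo <= 2; algo++) {
		set_algo(algo);
		rewind(in);
		for (block = 0; block < 5; block++) {
			FILE *proc;

			if (fread(page, 1, PAGE_BYTES, in) != PAGE_BYTES)
				break;

			/* Feed one page... */
			proc = fopen("/proc/compress_test/compress", "w");
			if (!proc) { perror("compress"); return 1; }
			fwrite(page, 1, PAGE_BYTES, proc);
			fclose(proc);

			/* ...then read back whatever the module reports. */
			proc = fopen("/proc/compress_test/compress", "r");
			if (!proc) { perror("compress"); return 1; }
			if (fgets(result, sizeof(result), proc))
				printf("algo %d, block %d: %s", algo, block, result);
			fclose(proc);
		}
	}
	fclose(in);
	return 0;
}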
From: John M. <joh...@gm...> - 2006-06-20 00:01:47
|
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nitin Gupta wrote: > John Moser wrote: >>> I don't think its good to kick out least compressible pages first. What >>> ultimately matters is to have working set of running programs to stay in >>> memory -- compressed or uncompressed. But then, its hard to guess >>> working set of each running program...Its a big work in itself to come >>> up with a good heuristic for which (and how many) pages to kick out from >>> ccache when memory gets real low. >>> >> >> Eh, that's what letting the existing swap logic figure it out is for. >> Can't make it any LESS efficient than it already is right? :) Plus the >> swap cache is looking out for us. >> > > Ah..there is no swap logic for which pages to discard when swap goes > (almost) full (there's one: OOM killer :) ). After all we are swapping > to just another swap device which we call virtual swap. > I mean for which pages to emit from ccache to disk. > So, it is our work to develop this logic to handle case of ccache > reaching its size limit (or when we want to shrink it for other reasons). > >> Some light LRU heuristics may be useful. Possibly time based ("If this >> page has been used in the last 40,000 jiffies, try to kill something >> else") >> > > So, you are saying to maintain jiffies counter for every page in ccache? > Or if you store jiffies val then how will you lookup page that has > jiffies val in some particular(40,000) window? This 'LRU' scheme won't > work. > could just throw a jiffies count (number of jiffies since boot time should be available) into the thing when it's requested from ccache; and before sending it to disk, check to see page->jiffies < JIFFIES_WE_THINK_MEAN_THIS_WAS_RECENTLY_WANTED I don't really know, or feel like thinking it out too far. This kind of thing can be deferred until ccache is really actually working. >> >> It may even be possible to find an algorithm that does strict analysis >> but not compression (like a hash) that spits out a value correlating to >> the redundancy and encodability of the stream; but that would take a lot >> of research to come up with. A really fast, really light compressor >> would be easier. > > your hash idea is a real nice topic for MS...but I've not yet planned > for that :) > (btw...I searched for something like this on citeseer but no luck) > > WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) > ..and still WKdm can't give you an idea of how well others(LZO) will > compress it. > In general there is a correlation. Remember compression is a characteristic of the algorithm and the data. It is computationally impossible to compress truly random data beyond a specific average ratio; however, small isolated blocks may encode well. In general we say one algorithm performs better than another because of this. 7zip gets smaller output than RAR or ZIP, consistantly, on various input datas; this is not a coincidence, the LZMA algorithm is actually better at redundancy elimination. The fact that an algorithm compresses a set of data more than the next shows that algorithm is better at eliminating redundancy; while the fact that a given algorithm compresses one piece of data better than the next shows that that data is more redundant. 
Think of it in terms of a punett square with the crossed values giving the output data size of compressing a 32K block of data: Algorithms: Data sets: X = weak compression A = minimal redundancy Y = strong compression B = much redundancy A B X 30 | 25 ----+---- Y 25 | 15 A weak algorithm with non-redundant data will produce a block only slightly smaller (if not bigger) than the input; while the same algorithm on more redundant data will produce a block that's a little smaller than that. In contrast, a strong algorithm on non-redundant data will produce a block that's somewhat smaller, and on redundant data will produce a block that's much smaller. You are correct in assuming that it's not a mathematical fact that any given algorithm can be used to predict how other algorithms will perform with the data; however, compression algorithms are basically analysis of data, and they do happen to as a side effect of their nature expose a property of the data. We can use this to conjecture that in most cases a compression algorithm will effectively describe how well other algorithms will perform, just not in a 1:1 manner; it will produce an approximate general trend. > >> >> The algorithm CI reference would be computed when the algorithm is first >> loaded, before it was put into use. This is a one-time cost for each >> algorithm, and can be ignored; it likely operates under a millisecond >> but it wouldn't hurt anything if it ran for 100-200mS really. This >> value would become persistent for the life of the algorithm (i.e. until >> someone rmmod's it), so it wouldn't have to be recomputed again. >> > > CI only makes sense when it tells how well _that_ algo compresses _this_ > page. If you just want a sane initial CI value for every algo, just > hard-code a value you can derive from many previous papers that compare > these algos (typically WKdm, WK4x4 and LZO). > See above :P >> Note that given these times there could be a negative impact on >> performance; on faster systems, of course, the impact becomes quickly >> negligible. It is still significant for embedded systems. >> >> It is also notable that we can amortize the costs by conjecturing that >> under normal circumstances, the system won't be swapping excessively, >> and thus wouldn't be using compressed cache excessively; thus the CI >> calculation can be largely ignored if it is sufficiently small. > > ccache is not even invoked in 'normal' conditions. It drops in when > system is choking for air...err..RAM :) > Under normal conditions, the system bumps the top of RAM and swaps a handful of pages out. Under abnormal conditions, the system has a working set 10% bigger than physical RAM and begins swapping data in and out every few milliseconds because it can't keep EVERYTHING it needs NOW in memory. Boot with 'mem=32M' and start KDE to see what I'm talking about. >> >> We can further amortize the costs by conjecturing that when ccache >> thrashing occurs (much paging in and out of ccache), pages being sent >> back and forth between memory and ccache are usually not going to be >> dirty. In this case, we may be able to store the CI of the page and >> only re-calculate it if the page is dirty (i.e. altered). Pages that >> were decompressed, read, and recompressed would be unchanged and have >> the same CI, thus we could skip recalculating the CI. 
>> > > Yes, ccache (as I expect) will be loaded with clean page cache pages (if > you choose to compress such pages) but then where will you store > per-page CI values and maintain them even after a clean page is > decompressed and freed from ccache? > maybe a field somewhere...but this doesn't seems interesting.. > *shrug* Don't worry about it for now anyway. It's just an optimization algorithm, worry about it in the future when ccache actually works first. >>>> This would allow for the following useful elements: >>>> >>>> - Groups may be given a CI == sum of the CI's of each page in the >>>> group >>>> - Pages each have a CI that doesn't change based on algorithm used >>>> - Algorithms each have a CI to describe their efficiency >>>> >>>> Which can be used for: >>>> >>>> - It is possible to quickly take a short incremental step to try to >>>> get >>>> the harder-to-compress pages into the same groups (i.e. find the first >>>> group from the hardest-to-compress end that is 'wrong', find all groups >>>> containing the next N hardest-to-compress pages, break those groups >>>> apart, reorder them, the group is now 'proper') >>>> - This allows for idle time to be used to sort pages and groups in >>>> such a way that our first candidates for writing to disk are pages that >>>> are 'hard to compress' (taking more memory) >>>> >>> Hard to compress factor should be combined with factor of having working >>> set of apps within memory (and maybe other too - I don't know). >> >> Yes.. perhaps time based? ("This page was used some time within the >> last 4000 jiffies, it's probably part of a working set"). > > I don't think this is the way to determine app working sets. > This paper: I. Chihaia and T. Gross. Adaptive Main Memory Compression > (http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf) > > mentions of a adaptive resizing scheme that keeps working sets of app in > memory by a user space process that 'monitors' this app. But the worst > thing is: > -- Where is that code? Where is description of method used? googled that > for hours...YMMV > -- They seem to assume that only a single heavyweight app is running. > While systems like OLPC are generic ones (albeit slower ones). > It's still the best way I can think of. An app that's got 4 segments of memory set up: [main()][hash_md5()][DATA:db_index][user_add()] main() is going to call some other stuff and probably never get returned into until the program exits; user_add() will happen when someone is administrating the program. Whenever the database is accessed, some of DATA:db_index is read through (among other chunks of the database); and hash_md5() is used to check for database corruption. In the normal course of the program, you'll probably find that, when the program is active, user_add() is rarely used; main() is pretty much NEVER used; hash_md5() is used every time the program does ANYTHING; and DATA:db_index is read through every time someone needs to find something in the database. In these cases, we can say that the program's current working set probably excludes main() and user_add() purely because they haven't been used recently; and probably includes hash_md5() and DATA:db_index because they've been used very recently. If the program is asleep, and rarely has any activity, we can probably page it out without anyone caring much in favor of some other program that's currently hammering the CPU. 
>> >>>> - It is possible to use a "Light" algorithm (such as WKdm) initially, >>>> and move to a "Heavy" algorithm by decompressing the block and >>>> recompressing it during idle CPU time >>>> - Maximize efficiency without adding unnecessary overhead while the >>>> system is busy >>>> >>> -- compress a page with WKdm >>> [CPU is idle] >>> -- extract these pages compressed with WKdm from ccache >>> [CPU is now busy] >>> -- compress them with LZO >>> -- put them again in ccache >>> >>> There can be lot of potential problems with this thing. We need to look >>> at it carefully in detail. >>> >> >> Yeah, like systems that are busy running a large number of programs >> (i.e. building a kernel) where the CPU is working on and off in quick >> spurts. You can get 80-150mS idle periods, if you try to do 200mS of >> work there you just slowed the system down. >> >>> I was just wondering how swap prefetching patch senses CPU as 'idle' - >>> what are the things it considers?, how it handles problem of 'spike' >>> loads?. Will look at all this later. >>> >> >> When the CPU is idle, linux activates the "Idle Loop." The Idle Loop >> pretty much uses HLT to halt CPU use for a little bit when the system >> has nothing to do. Otherwise the CPU would be going nuts all the time ;) >> >> At any rate, there's a mechanism in there somewhere for it. > > Andreas Mohr provided these links (to know when CPU is 'idle') > -- http://lkml.org/lkml/2006/6/18/39 > -- http://lkml.org/lkml/2006/6/18/40 > -- http://threebit.net/mail-archive/linux-kernel/msg05303.html > cool. CPU 'idles' btw by entering a loop that does absolutely nothing; however the kernel uses the HLT instruction to actually halt the CPU so it actually processes nothing for a time. > > > Cheers, > Nitin Gupta > > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQIVAwUBRJc5Wws1xW0HCTEFAQLD2g/+LdhieH0T6OPCveThzBfPtZOI/1ZuL9Hm zqGLdMoZUk83OWOGVMkTmpn4pZEa/QzSAoZd97SkgFrBjfdLBUxZi0tp3i++2x58 bW33wuS/NH0OyCu9Up+lGBQ5LPfVdMk7/9CiuGdOFSvX2t1Xt5K7S1hd/QRpJE50 6jlZpcvnStkEHHwMSi8nGDw2ntUIQwnXAptjXUq/fFEbPJaWNGDB+p3I9qfwg2gY PfMLgkdobhWcbtp6pcHdpyUcFkSYA4FFIggWpcDQw+TpWpMuGVbrPH9sMHjf7jim XN+uUM7tqPtCg1GzDcWthoEF9u+DBxD39maOT/3PD4VMFMIX5bYjyKbohuuf0VIv +5AvfFfNAjOwlpbVZsKbsCt2NXOi2Z+3+3JobRjds7PZ2XJPAKiA4SpOY99vCu5p aZw7ERfp/elFGI8DRMeRSXGVVf0ZUnQCPDqNH+Nze8VBRecdK+jd8sp068dtXw5z 6YOwZARlEfVxj2Hx+4+iefEOWtzpztPBD/NhxpiHOpSWmsIEhj3XrX5YLtYWEDHl UUgcNUZ7BFV9n3nnjW8mYU94VvJJLRbSP2YM08/HGPV5J7CPWJKBISd+k7umUV3h kkqmTeHFnjNB8CoGWmEHatkUpVKTKGDttEK7MVAChAy4jRypHX/tIqdfohalZzxr ImTpgaFc5vw= =+mxu -----END PGP SIGNATURE----- |
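A sketch of the jiffies-based "was this used recently?" check floated in this exchange is below. The last_used field, the threshold, and the helper names are hypothetical; nothing like this exists in the current ccache code, and the message itself defers the idea until ccache actually works.

/*
 * Hypothetical recency check for ccache eviction, using jiffies.
 */
#include <linux/jiffies.h>

#define CCACHE_RECENT_WINDOW	40000UL		/* jiffies, per the example */

struct chunk_head {
	/* ... fields from the chunked design ... */
	unsigned long last_used;	/* jiffies when last (de)compressed */
};

/* Called whenever a page is decompressed out of ccache. */
static inline void ccache_touch(struct chunk_head *ch)
{
	ch->last_used = jiffies;
}

/* Prefer evicting chunk_heads that have not been touched recently. */
static inline int ccache_recently_used(const struct chunk_head *ch)
{
	return time_after(ch->last_used + CCACHE_RECENT_WINDOW, jiffies);
}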
From: Nitin G. <nit...@gm...> - 2006-06-19 22:55:00
|
Hi Mauricio,

Mauricio Lin wrote:
> Hi all,
>
> After checking the chunk list in
> http://linux-mm.org/CompressedCaching, I have some questions as
> follows.
>
> Did anyone start to implement the chunk list operations? Is there any
> progress on it?
>
No, I haven't yet started the compression structure implementation.

> Regarding the "struct chunk_head", the field "chunk_list" points to
> the related chunks, right?
> I mean if a page A in the page cache is replaced by the corresponding
> chunk head, the field chunk_list in such chunk_head points to the list
> of related chunks of page A, even if some chunks are located in other
> physical pages, right?
>
Yes. A page is compressed and stored in chunks (which can be present in different physical pages). These chunks are linked together using chunk->next.

> Where can I find more info about the purpose of the lru field in the struct
> chunk_head?
>
chunk_head->lru has basically the same purpose as struct page->lru.

When the compressed cache becomes full, we have to free some pages from ccache to the swap disk. The chunk_head->lru field is to keep these chunk_heads in LRU order. So, when freeing pages from ccache, take a chunk_head from the head of the LRU list and free the chunk_head and corres. chunk(s).

These freed chunks are then merged with other physically adjacent free chunks (found using chunk->chunks->{prev, next}). A chunk is identified as free by a flag (say, CH_free) stored in the 4 MSBs of chunk->size (a chunk has a max size of PAGE_SIZE, so 4 bits are free).

Also, as mentioned on the site: LRU lists will be maintained separately for clean page cache pages, dirty page cache pages and swap cache pages. This will allow pages of these individual types to be separately tracked, and when compressed cache size reaches its limit, specific types of pages can be freed in priority to others (like put more pressure on freeing clean page cache pages than dirty page cache pages, and least on swap cache pages).

Cheers,
Nitin Gupta
|
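Read literally, the structures described above could be sketched as follows. Only the fields named in the mail (chunk_list, lru, next, chunks, size) come from the description; the flag encoding details and the list names are guesses, not code from the actual patch.

/*
 * Sketch of the chunked compression structure as described in this
 * thread -- field layout and flag encoding are approximate.
 */
#include <linux/list.h>

#define CH_free		(1U << 15)	/* chunk is free -- one of the 4 flag MSBs */
#define CH_SIZE_MASK	0x0fffU		/* low bits hold the size (exact encoding is a guess) */

struct chunk {
	struct chunk *next;		/* next chunk of the same compressed page */
	struct list_head chunks;	/* physical neighbours, used when merging free chunks */
	unsigned short size;		/* chunk size, with flags in the high bits */
};

struct chunk_head {
	struct chunk *chunk_list;	/* first chunk of this compressed page */
	struct list_head lru;		/* LRU position, like page->lru */
};

/* Separate LRU lists, as described: clean page cache, dirty page cache
 * and swap cache pages are tracked and shrunk with different priorities. */
static LIST_HEAD(ccache_lru_clean);
static LIST_HEAD(ccache_lru_dirty);
static LIST_HEAD(ccache_lru_swap);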
From: Mauricio L. <mau...@gm...> - 2006-06-19 22:25:59
|
Hi all,

After checking the chunk list in http://linux-mm.org/CompressedCaching, I have some questions as follows.

Did anyone start to implement the chunk list operations? Is there any progress on it?

Regarding the "struct chunk_head", the field "chunk_list" points to the related chunks, right? I mean if a page A in the page cache is replaced by the corresponding chunk head, the field chunk_list in such chunk_head points to the list of related chunks of page A, even if some chunks are located in other physical pages, right?

Where can I find more info about the purpose of the lru field in the struct chunk_head?

BR,

Mauricio Lin.
|
From: Nitin G. <nit...@gm...> - 2006-06-19 20:34:52
|
John Moser wrote: >> I don't think its good to kick out least compressible pages first. What >> ultimately matters is to have working set of running programs to stay in >> memory -- compressed or uncompressed. But then, its hard to guess >> working set of each running program...Its a big work in itself to come >> up with a good heuristic for which (and how many) pages to kick out from >> ccache when memory gets real low. >> > > Eh, that's what letting the existing swap logic figure it out is for. > Can't make it any LESS efficient than it already is right? :) Plus the > swap cache is looking out for us. > Ah..there is no swap logic for which pages to discard when swap goes (almost) full (there's one: OOM killer :) ). After all we are swapping to just another swap device which we call virtual swap. So, it is our work to develop this logic to handle case of ccache reaching its size limit (or when we want to shrink it for other reasons). > Some light LRU heuristics may be useful. Possibly time based ("If this > page has been used in the last 40,000 jiffies, try to kill something else") > So, you are saying to maintain jiffies counter for every page in ccache? Or if you store jiffies val then how will you lookup page that has jiffies val in some particular(40,000) window? This 'LRU' scheme won't work. > > It may even be possible to find an algorithm that does strict analysis > but not compression (like a hash) that spits out a value correlating to > the redundancy and encodability of the stream; but that would take a lot > of research to come up with. A really fast, really light compressor > would be easier. your hash idea is a real nice topic for MS...but I've not yet planned for that :) (btw...I searched for something like this on citeseer but no luck) WKdm is lightest I know...something lighter is maybe dear ol' memcpy() ;) ..and still WKdm can't give you an idea of how well others(LZO) will compress it. > > The algorithm CI reference would be computed when the algorithm is first > loaded, before it was put into use. This is a one-time cost for each > algorithm, and can be ignored; it likely operates under a millisecond > but it wouldn't hurt anything if it ran for 100-200mS really. This > value would become persistent for the life of the algorithm (i.e. until > someone rmmod's it), so it wouldn't have to be recomputed again. > CI only makes sense when it tells how well _that_ algo compresses _this_ page. If you just want a sane initial CI value for every algo, just hard-code a value you can derive from many previous papers that compare these algos (typically WKdm, WK4x4 and LZO). > Note that given these times there could be a negative impact on > performance; on faster systems, of course, the impact becomes quickly > negligible. It is still significant for embedded systems. > > It is also notable that we can amortize the costs by conjecturing that > under normal circumstances, the system won't be swapping excessively, > and thus wouldn't be using compressed cache excessively; thus the CI > calculation can be largely ignored if it is sufficiently small. ccache is not even invoked in 'normal' conditions. It drops in when system is choking for air...err..RAM :) > > We can further amortize the costs by conjecturing that when ccache > thrashing occurs (much paging in and out of ccache), pages being sent > back and forth between memory and ccache are usually not going to be > dirty. In this case, we may be able to store the CI of the page and > only re-calculate it if the page is dirty (i.e. 
altered). Pages that > were decompressed, read, and recompressed would be unchanged and have > the same CI, thus we could skip recalculating the CI. > Yes, ccache (as I expect) will be loaded with clean page cache pages (if you choose to compress such pages) but then where will you store per-page CI values and maintain them even after a clean page is decompressed and freed from ccache? maybe a field somewhere...but this doesn't seems interesting.. >>> This would allow for the following useful elements: >>> >>> - Groups may be given a CI == sum of the CI's of each page in the group >>> - Pages each have a CI that doesn't change based on algorithm used >>> - Algorithms each have a CI to describe their efficiency >>> >>> Which can be used for: >>> >>> - It is possible to quickly take a short incremental step to try to get >>> the harder-to-compress pages into the same groups (i.e. find the first >>> group from the hardest-to-compress end that is 'wrong', find all groups >>> containing the next N hardest-to-compress pages, break those groups >>> apart, reorder them, the group is now 'proper') >>> - This allows for idle time to be used to sort pages and groups in >>> such a way that our first candidates for writing to disk are pages that >>> are 'hard to compress' (taking more memory) >>> >> Hard to compress factor should be combined with factor of having working >> set of apps within memory (and maybe other too - I don't know). > > Yes.. perhaps time based? ("This page was used some time within the > last 4000 jiffies, it's probably part of a working set"). I don't think this is the way to determine app working sets. This paper: I. Chihaia and T. Gross. Adaptive Main Memory Compression (http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf) mentions of a adaptive resizing scheme that keeps working sets of app in memory by a user space process that 'monitors' this app. But the worst thing is: -- Where is that code? Where is description of method used? googled that for hours...YMMV -- They seem to assume that only a single heavyweight app is running. While systems like OLPC are generic ones (albeit slower ones). > >>> - It is possible to use a "Light" algorithm (such as WKdm) initially, >>> and move to a "Heavy" algorithm by decompressing the block and >>> recompressing it during idle CPU time >>> - Maximize efficiency without adding unnecessary overhead while the >>> system is busy >>> >> -- compress a page with WKdm >> [CPU is idle] >> -- extract these pages compressed with WKdm from ccache >> [CPU is now busy] >> -- compress them with LZO >> -- put them again in ccache >> >> There can be lot of potential problems with this thing. We need to look >> at it carefully in detail. >> > > Yeah, like systems that are busy running a large number of programs > (i.e. building a kernel) where the CPU is working on and off in quick > spurts. You can get 80-150mS idle periods, if you try to do 200mS of > work there you just slowed the system down. > >> I was just wondering how swap prefetching patch senses CPU as 'idle' - >> what are the things it considers?, how it handles problem of 'spike' >> loads?. Will look at all this later. >> > > When the CPU is idle, linux activates the "Idle Loop." The Idle Loop > pretty much uses HLT to halt CPU use for a little bit when the system > has nothing to do. Otherwise the CPU would be going nuts all the time ;) > > At any rate, there's a mechanism in there somewhere for it. 
Andreas Mohr provided these links (to know when CPU is 'idle') -- http://lkml.org/lkml/2006/6/18/39 -- http://lkml.org/lkml/2006/6/18/40 -- http://threebit.net/mail-archive/linux-kernel/msg05303.html Cheers, Nitin Gupta |
From: Nitin G. <nit...@gm...> - 2006-06-19 09:21:41
|
Hi,

>
> On Sun, Jun 18, 2006 at 08:58:54PM +0530, Nitin Gupta wrote:
>> I was just wondering how the swap prefetching patch senses the CPU as 'idle' - what are
>> the things it considers? How does it handle the problem of 'spike' loads? Will look at
>> all this later.
>
> I think that would be this one:
> http://lkml.org/lkml/2006/6/18/39
> used by
> http://lkml.org/lkml/2006/6/18/40
>

I have a strange problem. I can't access any page on lkml.org (and haven't been able to for a long time)! (No firewalls.) I checked Google's cached page for lkml.org and it seems the site is having some trouble... but that looks recent, while I've had this problem for months...

-- Nitin
|
From: Andreas M. <an...@us...> - 2006-06-19 08:45:28
|
Hi,

On Sun, Jun 18, 2006 at 08:58:54PM +0530, Nitin Gupta wrote:
> I was just wondering how the swap prefetching patch senses the CPU as 'idle' - what are
> the things it considers? How does it handle the problem of 'spike' loads? Will look at
> all this later.

I think that would be this one:
http://lkml.org/lkml/2006/6/18/39
used by
http://lkml.org/lkml/2006/6/18/40

Andreas Mohr
|
From: John M. <joh...@gm...> - 2006-06-18 17:07:20
|
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Nitin Gupta wrote: > John Moser wrote: >> Nitin Gupta wrote: >>> John Moser wrote: >>>> Nitin Gupta wrote: >>>>> On 6/17/06, John Moser <joh...@gm...> wrote: >>>>> * Storing pages: >>>>> -- First, it creates a 'virtual swap' which is a swap with its own >>>>> swap_map. This swap is given highest priority. So, when pages are to >>>>> be swapped out, existing code path naturally/transparently uses this >>>>> swap device (until it's full) for swapping out. (prelim patches for >>>>> this to be posted very soon on lc-devel). >>>> (See I don't understand the swap system that well :) >>>> >>>> I was thinking eventually I'd have to convince the kernel to *try* to >>>> swap even if there was no swap device.. >>>> >>>> Once this swap device is full it needs to start pushing its pages to a >>>> REAL swap device, i.e. on disk. >>>> >>> This is exactly what this virtual swap is meant for :) >>> Pages assigned to this virtual swap will eventually go to ccache and >>> when it's full, pages are sent to real swap disk -- all this with >>> minimal code intrusion. This approach works even if there are no real >>> swap disks -- only virtual swap will be used in that case. >>> >> >> Awesome. And the swap cache and existing LRU logic is all used, right? >> (These are important to get memory usage patterns right; we should >> theoretically be able to send pages physically to disk in any order we >> want without being less efficient than general Linux...) >> > > Yes! > >>>>> -- in shrink_page_list() it calls should_add_to_ccache() to determine >>>>> if page can really be added to ccache. It can consider things like no. >>>>> of pages already in ccache, CPU load, type of page (anon or fs-backed) >>>>> and some heuristics that tell if page is even worth mem space it will >>>>> take (no such heurustic yet worked out but it is really important >>>>> thing). >>>>> >>>> I was thinking putting that code in a module. The page would be >>>> compressed with a fast reference algorithm to be used to get a >>>> Compression Index CI = PAGE_SIZE - COMPRESSED_PAGE_SIZE. Squeezing >>>> down >>>> a single page is almost instant anyway. >>>> >>>> As for "type of page," certain FS-backed pages could be freed, others >>>> would have to be persistent. I was thinking more on letting the >>>> calling >>>> code mark pages that could be freed as "transient" and just freeing >>>> them >>>> if the compressed memory area got too big, rather than swapping. >>>> >>> Again, this is exactly what's happening. Pages guessed to be swapped out >>> by should_add_to_ccache() are marked by PG_will_compress flag. But, this >>> does not guarantee that they will really be compressed. If finally, >>> compression fails (lack of mem, CPU load high, poor compressibility >>> etc.) this flag is cleared and everything else goes as usual. If it >>> really gets compressed, its marked with PG_compressed. Thus it goes with >>> your two step 'transient' approach...I think. >>> >> >> Transient pages == come from the disk cache system. If the disk cache >> decides it wants some page back and we don't have it, it can go read it >> back from disk (i.e. /lib/*, executables, stuff in /usr...) >> >>>>> -- then pages filtered by should_add_to_ccache() go to pageout() as >>>>> usual where they are sent to add_to_ccache() where they are compressed >>>>> and added to compression structure. If add_to_ccache() fails for some >>>>> reason (e.g. insuff. RAM), the page falls back to normal swap disks >>>>> (rest of pageout() code). 
>>>> I was thinking having the Swap Zone code supply the code to write to >>>> physical disk, and just having stuff pass through it: >>>> >>>> RAM -> Swap Zone -> Compress -> NOPE DOESN'T SHRINK -> Swap File >>>> >>>> This was also mainly for when the compressed area got too big and >>>> needed >>>> shrinkage (by, of course, real swapping). Of course, compression >>>> itself >>>> was going to be a module, so normal operation was: >>>> >>>> RAM -> Swap Zone -> Swap File >>>> >>>> until such time as the compress module was loaded. :) >>>> >>> As I understand you are really after an approach that can easily (and >>> with minimum code intrusion in kernel) fall back to real swap disks when >>> compression can't be done. This is being done well in current approach. >>> >> >> Yeah. I was looking at a different use of the same code later though >> but :) So far your design looks technically about the same as mine. >> > > hmm...somewhat different: > -- I am not looking to have multi-stage compression i.e. first compress > with quickest then later compress them with heavyweights instead when > CPU is idle. > -- Not (yet) considering grouping of pages for compression. > -- And I've not yet started work on a heuristic for adaptive ccache size. > >>>>> * Restoring pages: >>>>> Whenever swap-cache/page-cache is looked-up for a page >>>>> (find_get_page(), find_get_pages() etc.), we check PG_compressed flag. >>>>> If its set we decompress() page and return it. >>>>> >>>> I skipped pages compressed in place because compressing individual >>>> pages >>>> is non-interesting. I noticed 15-20% compression with 4K pages on >>>> Castro's patch; pushing this up to 8KiB, 16KiB, 32KiB, and 64KiB, I >>>> noticed gains up to 50-60% when we hit 32KiB and nothing more beyond. >>>> >>>> I'm pretty sure that there's an asymptotic limit at 20, 24, 28, or >>>> 32KiB >>>> (the lower the better); I want to compress pages in groups of whatever >>>> that limit falls on for maximum performance. >>>> >>>> This makes the approach I had in mind a lot more complex. Pages would >>>> be grouped using a grouping algorithm; idle time would be used to >>>> disrupt groups and reconstruct more optimal grouping; and in the end we >>>> would also attempt to swap out more difficult to compress pages first. >>>> >>>> (In fact this is why I want to find the smallest group for the >>>> asymptotic limit-- 32KiB, 8 4KiB pages, my regrouping algorithm has to >>>> decompress, rearrange, and recompress up to 9 groups, 288KiB! If it >>>> were i.e. 20, that'd be 6 x 5 = 120KiB) >>>> >>> This is a great point! I really didn't even consider grouping pages >>> together for compression -- because I was unaware of its effect on >>> compression ratios and due to additional complexity it involves. This >>> grouping of pages is definitely worth a genuine try. But first, to keep >>> it simple, I will implement for single page compression. With this thing >>> stable we can do some algo tweaking to have this grouping. >>> >> >> Yeah, the page grouping and balancing code would be rather complex. >> Just the idea is complex. > > And thats why I will look at it later or maybe just make it flexible > enough for now. > >> >>>>> De/Compression codes are dumb. They just compress what they have in >>>>> input buffer and output a compressed buffer. >>>>> >>>>> Also, these algos are pluggable as separate kernel modules. They can >>>>> de/register themselves with ccahe (as is done by JFFS2). 
But, for now >>>>> I will just place them within kernel itself but overall approach makes >>>>> it easy to get de/compression algos as pluggable modules :) >>>>> >>>> :) And the algorithm to use them? I was considering using a reference >>>> block to compress to get a CI for the algorithm itself. >>>> >>> I was thinking of an alternate (simpler) approach. This is because we >>> have only 3 principal de/compression algos -- WKdm, WK4x4 and LZO. Of >>> these WKdm and WK4x4 are designed to compress typical anonymous memory >>> pages (see >>> http://www.cs.amherst.edu/~sfkaplan/papers/sfkaplan-dissertation.ps.gz >>> for details of WKdm and WK4x4) while LZO is much better at typical >>> filesystem (textual) data. Also speeds (in general): WKdm > WK4x4 > LZO >>> >>> So, considering this, I think a simple algo selection scheme like >>> WKdm/WK4x4 if page is anonymous and LZO if fs-data. >>> Or, depending on kind of file, WKdm/WK4x4 can give comparable >>> compression with much better speeds. To handle cases, we can record >>> 'last compression ratio' for each mapping (recorded in corres. >>> address_space struct itself). This can be used to have separate algo >>> selection per open file. >> >> Using a reference algorithm would be simpler if you wanted to do >> multi-layer compression. Think about the following events: >> >> - Page 0x01 compressed (WKdm) >> - Page 0x02 compressed (WKdm) >> - Page 0x03 compressed (WKdm) >> - CPU is idle >> - Page 0x01 recompressed (WK4x4) >> - Page 0x02 recompressed (WK4x4) >> - Page 0x04 compressed (WKdm) >> >> Current state: >> >> 0x01 WK4x4 >> 0x02 WK4x4 >> 0x03 WKdm >> 0x04 WKdm >> >> - CPU is idle >> - Rearrange pages based on compressibility >> >> PROBLEM: How do you index compressibility? >> >> - Memory pressure is mounting >> - Get pages out of memory -- Least compressible? > > I don't think its good to kick out least compressible pages first. What > ultimately matters is to have working set of running programs to stay in > memory -- compressed or uncompressed. But then, its hard to guess > working set of each running program...Its a big work in itself to come > up with a good heuristic for which (and how many) pages to kick out from > ccache when memory gets real low. > Eh, that's what letting the existing swap logic figure it out is for. Can't make it any LESS efficient than it already is right? :) Plus the swap cache is looking out for us. Some light LRU heuristics may be useful. Possibly time based ("If this page has been used in the last 40,000 jiffies, try to kill something else") >> >> PROBLEM: Need to solve previous problem.. >> >> This is why I was considering multiple, pluggable algorithms and >> reference instrumentation. The references would be: >> >> - Each new PAGE is compressed with a fast, extremely light algorithm >> that doesn't produce useful ratios but runs almost straight through the >> code (i.e. a streaming dictionary compressor rather than a windowed one, >> which immediately is passed to a streaming huffman encoder) >> - PAGE_SIZE - compressed_page_size == Compression Index (CI) >> >> - Each new algorithm is given a reference block to compress to test its >> compression efficiency >> - block_size - compressed_block_size == CI for the algorithm >> - We could have 'anonymous memory' and 'file data' blocks >> - Could produce an anonmem_CI and filemem_CI >> > > This is just too much work to be done before _every_ compression. It can > affect performance badly. 
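The per-mapping algorithm selection sketched a little earlier in this mail (WKdm/WK4x4 for anonymous pages, LZO for file-backed data, guided by a recorded 'last compression ratio') could start out as simply as this; the struct, the field name and the 60% cut-off are invented for illustration, not taken from any patch:

/* Illustration of a per-mapping algorithm pick. last_ratio would live in
 * (or shadow) the address_space and record how well WKdm did last time. */
enum cc_algo { CC_WKDM, CC_WK4X4, CC_LZO };

struct mapping_stats {
	int last_ratio;   /* compressed size as % of PAGE_SIZE, 0 = unknown */
};

static enum cc_algo pick_algo(int page_is_anon, const struct mapping_stats *st)
{
	if (page_is_anon)
		return CC_WKDM;   /* fast, tuned for anonymous memory */

	/* For file-backed data, only stay with WKdm if it did reasonably
	 * well on this mapping last time; otherwise use LZO. */
	if (st && st->last_ratio && st->last_ratio < 60)
		return CC_WKDM;

	return CC_LZO;
}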
The page compression indexing would happen when a page was sent to the ccache, and occurs using a very light algorithm on a 4KiB page. This is not likely to affect much of anything, maybe cost 50,000 cycles (WKdm). Estimated times:

  CPU CLK    SEC        MSEC      EXAMPLE
  1MHz       1/20       50        8-bit Nintendo (6502)
  200MHz     1/4000     0.25      iPaq 3600 PDA (206MHz)
  400MHz     1/8000     0.125     iPaq 5400 PDA (Xscale)
  800MHz     1/16000    0.0625    Old Dell
  1.6GHz     1/32000    0.03125   Centrino Laptop
  2.0GHz     1/40000    0.025     Desktop

It may even be possible to find an algorithm that does strict analysis but not compression (like a hash) that spits out a value correlating to the redundancy and encodability of the stream; but that would take a lot of research to come up with. A really fast, really light compressor would be easier.

The algorithm CI reference would be computed when the algorithm is first loaded, before it was put into use. This is a one-time cost for each algorithm, and can be ignored; it likely operates under a millisecond but it wouldn't hurt anything if it ran for 100-200mS really. This value would become persistent for the life of the algorithm (i.e. until someone rmmod's it), so it wouldn't have to be recomputed again.

Note that given these times there could be a negative impact on performance; on faster systems, of course, the impact becomes quickly negligible. It is still significant for embedded systems.

It is also notable that we can amortize the costs by conjecturing that under normal circumstances, the system won't be swapping excessively, and thus wouldn't be using compressed cache excessively; thus the CI calculation can be largely ignored if it is sufficiently small.

We can further amortize the costs by conjecturing that when ccache thrashing occurs (much paging in and out of ccache), pages being sent back and forth between memory and ccache are usually not going to be dirty. In this case, we may be able to store the CI of the page and only re-calculate it if the page is dirty (i.e. altered). Pages that were decompressed, read, and recompressed would be unchanged and have the same CI, thus we could skip recalculating the CI.

>
>> This would allow for the following useful elements:
>>
>> - Groups may be given a CI == sum of the CI's of each page in the group
>> - Pages each have a CI that doesn't change based on algorithm used
>> - Algorithms each have a CI to describe their efficiency
>>
>> Which can be used for:
>>
>> - It is possible to quickly take a short incremental step to try to get
>> the harder-to-compress pages into the same groups (i.e. find the first
>> group from the hardest-to-compress end that is 'wrong', find all groups
>> containing the next N hardest-to-compress pages, break those groups
>> apart, reorder them, the group is now 'proper')
>> - This allows for idle time to be used to sort pages and groups in
>> such a way that our first candidates for writing to disk are pages that
>> are 'hard to compress' (taking more memory)
>>
>
> Hard to compress factor should be combined with factor of having working
> set of apps within memory (and maybe other too - I don't know).

Yes.. perhaps time based? ("This page was used some time within the last 4000 jiffies, it's probably part of a working set").
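The bookkeeping described above -- compute a CI with a light reference compressor, cache it, and only recompute when the page has been dirtied -- can be sketched in a few lines; light_compress() and the metadata struct are hypothetical stand-ins, not real interfaces:

#include <stddef.h>

#define CC_PAGE_SIZE 4096

/* Hypothetical per-page metadata; the real ccache metadata will differ. */
struct cc_page_info {
	int ci;        /* CC_PAGE_SIZE - compressed size, cached */
	int ci_valid;  /* 0 until first computed */
	int dirty;     /* set whenever the page contents change */
};

/* Stand-in for the "very light" reference compressor; returns compressed size. */
extern size_t light_compress(const void *src, size_t len, void *dst);

static int page_ci(struct cc_page_info *info, const void *page_data, void *scratch)
{
	if (!info->ci_valid || info->dirty) {
		size_t clen = light_compress(page_data, CC_PAGE_SIZE, scratch);

		info->ci = (int)(CC_PAGE_SIZE - clen);
		info->ci_valid = 1;
		info->dirty = 0;
	}
	return info->ci;  /* higher CI == easier to compress */
}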
> >> - It is possible to use a "Light" algorithm (such as WKdm) initially, >> and move to a "Heavy" algorithm by decompressing the block and >> recompressing it during idle CPU time >> - Maximize efficiency without adding unnecessary overhead while the >> system is busy >> > > -- compress a page with WKdm > [CPU is idle] > -- extract these pages compressed with WKdm from ccache > [CPU is now busy] > -- compress them with LZO > -- put them again in ccache > > There can be lot of potential problems with this thing. We need to look > at it carefully in detail. > Yeah, like systems that are busy running a large number of programs (i.e. building a kernel) where the CPU is working on and off in quick spurts. You can get 80-150mS idle periods, if you try to do 200mS of work there you just slowed the system down. > I was just wondering how swap prefetching patch senses CPU as 'idle' - > what are the things it considers?, how it handles problem of 'spike' > loads?. Will look at all this later. > When the CPU is idle, linux activates the "Idle Loop." The Idle Loop pretty much uses HLT to halt CPU use for a little bit when the system has nothing to do. Otherwise the CPU would be going nuts all the time ;) At any rate, there's a mechanism in there somewhere for it. > >> Possibly other things too. >> >> Anyway the point is I was more thinking on how to get it so when you >> initially compress pages, you use very light algorithms; and when you >> have idle CPU time, you burn it to order and condense the compressed >> cache. This avoids CPU usage during initial entry into the cache. >> >>>>> Many of the setting are also tunable via sysctl interface. (e.g. >>>>> /proc/sys/vm/max_anon_cc_size). >>>>> >>>> I was considering being conservative with the maximum cache size in >>>> general; but more specifically, allowing the cache to grow until it >>>> notices a lot decompression is happening per time space (compression we >>>> do NOT care about; decompression means the system wants something >>>> back). >>>> If we suddenly see 300 decompressions a second, we're not being very >>>> efficient; time to start shrinking the compressed cache. (Under real >>>> memory pressure, this shrinkage will involve swapping; otherwise, just >>>> expanding requested pages back into memory.) >>>> >>>> Increasing the size of compressed cache during idle time can be done >>>> conservatively, but I currently can't think of a way to figure out how >>>> big to make it when not under load. Increasing passively (i.e. >>>> swapping >>>> LRU data into compressed cache) would get some memory into (cheap) >>>> compressed cache before the system came under load (decompression is >>>> cheaper than compression). >>> There are _many_ things to consider when developing heuristic for >>> adaptive ccache resizing scheme. I have touched this area for now. I >>> first aim to have a static sized ccache working first. >>> >>> This said, it's always good to keep thinking over this...it's >>> interesting and is going to be useful :) >>> >>>>> Apart from some kernel hooks (as in lookup functions, >>>>> shrink_page_list(), pageout()), all ccache code is pretty much in >>>>> separate .[ch] files or as separte kernel modules. So, it's also very >>>>> less intrusive. >>>> Very nice. >>>> >>>>>> Another advantage is that no matter what I do, I can't swap pages to >>>>>> disk in any worse order than the kernel normally would, since I'm >>>>>> being >>>>>> handed pages based on existing LRU logic. 
To this end I've >>>>>> decided to >>>>>> take the "Maximize Compression" approach and try to swap out pages >>>>>> that >>>>>> are "hard to compress" first. >>>>>> >>>>>> Actual compression will be done by a module; although actual >>>>>> compression >>>>>> algorithms will be loaded as slave modules to this module. :) >>>>>> >>>>>> I've also made note that due to the nature of this design, large >>>>>> amounts >>>>>> of flexibility are presented. We can talk to the swap device in any >>>>>> way >>>>>> we want; the Swap Zone would supply a mechanism for modules to access >>>>>> the swap device to swap in and out anything, but would always ask the >>>>>> modules themselves to get pages for it. In theory, we can compress >>>>>> whatever we like and swap it out; >>>>> Approach I described above is flexible enough to handle all kinds of >>>>> pages: anon pages, clean and dirty page cache pages. >>>>> >>>> Compressed pages on swap? Other uses of the code (i.e. implementing >>>> something besides compression)? >>>> >>> maybe I didn't understand you here. I meant flexibility in terms of >>> what kind of pages we can keep in compression structure, a compression >>> structure that should give good performance for almost all operations >>> (lookup, expanding, shrinking, deleting a page, minimize fragmentation >>> etc.) >>> >> >> Mmm.. I was looking at toying with the infra later to 1) Compress data >> that's in swap (and keep in memory a ccache swp_entry_t to real >> swp_entry_t mapping to find it later); and 2) Implement a toy >> distributed, encrypted storage backing (i.e. pages can be encrypted and >> swapped over network to 2-3 servers for redundant storage off-device, >> think wireless PDAs in a corporate network...) >> > > Interesting. > > Many ideas keep flying in my mind but when I look at my code...lol..not > even static cccache is working yet...LOL Yeah, just ignore that one. :) I'm always thinking really, really far ahead and in different areas. > >>>>>> of course, we need to relate swap file >>>>>> indexes to real page indexes, and deal with suspend-to-disk et al. >>>>> With approach described, we automatically use exisiting mech. of >>>>> kernel for our use. This virtual swap is just another swap disk to >>>>> other kernel parts. >>>> What happens when you suspend to disk? Does the virtual swap get >>>> force-flushed to real swap? >>> It has to -- after all RAM can retain stuff in suspend mode. >>> >>>> Can you compress pages in swap and have the >>>> kernel not wonder wtf you're smoking when it tries to come back up? >>>> >>> Yes, it should be easy to swap out pages as compressed form itself. This >>> was done in Castro's patch too -- this is not to save swap disk space >>> (rest of space was padded) but to reduce unnecessary decompression until >>> they are _really_ required. >>> >> >> If you work in multi-page blocks and get better than 1:((N-1)/N) >> compression anywhere, you'll squeeze out a page. In groups of 5 pages >> it's entirely feasible to get better than 20% compression, squeezing out >> a whole page of swap usage. >> >> Of course, this in no way interferes with the original goal of reducing >> unnecessary decompression; it's a happy side-effect. :) >> > > This grouping of pages will also reduce metadata required for ccache. > Each group will have a single chunk_head and coressponding chunks i.e. > no. of cunk_heads will be reduced. Interesting. 
That's another thing about getting a compression index for a page-- if it's too low, you can immediately decide the page is too hard to compress because the metadata to control it is going to be bigger than the space savings. > >>>> (this would all just require extra infrastructure, mainly APIs to say >>>> 'Don't compress these pages' and 'stop storing shit in memory until I >>>> say so', nothing majorly intrusive) >>>> >>>>> ----------------------- >>>>> I have a Wiki setup for this project linux-mm.org/CompressedCaching >>>>> Here, I have detailed the approach I'm taking (at least for page-cache >>>>> pages). It's slightly out-dated now --- It doesn't detail approach for >>>>> anon pages in detail.....I will update it soon. >>>>> Feel free to edit it. >>>>> >>>>> Since I'm doing this project as part of Google SoC, I have many >>>>> 'deadlines' to consider. So, now bringing major changes to approach >>>>> will overwhelm me with work. It will be really nice if you can provide >>>>> feedback on my approach :) >>>> Your approach sounds at least to be minimally intrusive, which is good >>>> because it lets you keep the code separate and do interesting things. >>>> You also use the swapping mechanism, so you should be able to take >>>> advantage of page cache implicitly (a goal of my own design). >>>> >>>> You need to consider page grouping; Experiment with different sized >>>> groups, especially around 16K, 20K, 24K, 28K, and 32K. Try to make the >>>> code flexible enough to handle variable sized page groups too; it would >>>> be interesting to have multiple group sizes at run, but possibly more >>>> complex than needed. I found page grouping to present gains of 10-15% >>>> when moving between 4K-8K pages, and then about 5% between 16K-32K, and >>>> then 0.1% between 32K-64K on Rodrigo's patch; this tells me there's >>>> nothing interesting about any group bigger than 32K, and maybe you can >>>> get away with something smaller and still maximize gains. >>>> >>> I will definitely lookup into how flexible my current approach is w.r.t >>> grouping pages and will try to get some flexibility here if it shows up >>> rigid. This grouping of pages seems really important. >>> >> >> Run some simple tests. Feed the modules a group of test data; have them >> compress it in chunks of 4K, 8K, 12K, 16K, 20K, 24K, 28K, 32K, 36K; and >> then output the average compression ratio for each group size. The >> output would look similar to: >> >> >> 4K 16% >> 8K 23% >> 12K 30% >> 16K 35% >> 20K 39% >> 24K 42% <-- This would be a good place to stop... >> 28K 43% <-- Not worth the gains >> 32K 43% >> 36K 44% >> >> (Numbers faked) >> >> And you'd basically be looking for where the numbers stopped going up by >> useful degrees (in this example, 42% -> 43% we'll say 1% isn't useful >> for the added overhead). If you keep going and going you'll find a >> point where no matter how much bigger the block gets you can't gain any >> more compression; it's likely good to stop a bit before that, knocking >> off 1% or 2%. >> > > Your analysis of grouping pages interests me most :) I will definitely > look into having this grouping supported by my design.....again, a bell > rings -- nitin! get that static ccache with simplest of things you can > do first!! ;) > Yeah, get it working first. Just don't box yourself in a corner ;) You can try writing a test for various page sizes once the cache is actually working. 
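A quick userspace version of that test is easy to throw together. The sketch below uses zlib's compress2() purely as a stand-in for WKdm/WK4x4/LZO, and sample.dat is whatever dump of memory or file data you want to feed it (build with -lz):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Chunk-size sweep: compress a sample file in chunks of 4K..36K and print
 * the average space saving per chunk size. zlib is only a stand-in here. */
int main(int argc, char **argv)
{
    FILE *f = fopen(argc > 1 ? argv[1] : "sample.dat", "rb");
    long size, chunk, off;
    unsigned char *data;

    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    fseek(f, 0, SEEK_SET);

    data = malloc(size);
    if (!data || fread(data, 1, size, f) != (size_t)size) { fclose(f); return 1; }
    fclose(f);

    for (chunk = 4096; chunk <= 36 * 1024; chunk += 4096) {
        double saved = 0.0;
        long nchunks = 0;

        for (off = 0; off + chunk <= size; off += chunk) {
            uLongf clen = compressBound(chunk);
            unsigned char *cbuf = malloc(clen);

            if (cbuf && compress2(cbuf, &clen, data + off, chunk,
                                  Z_DEFAULT_COMPRESSION) == Z_OK)
                saved += 1.0 - (double)clen / chunk;
            free(cbuf);
            nchunks++;
        }
        if (nchunks)
            printf("%2ldK  %5.1f%%\n", chunk / 1024, 100.0 * saved / nchunks);
    }
    free(data);
    return 0;
}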
>>>> One part of my design was going to be to let the kernel's existing swap >>>> code make ALL the LRU decisions and base my decisions entirely on >>>> compressibility of the pages. Pages would be regrouped during idle >>>> time >>>> and groups would be kept in order of compressibility. The idea here >>>> was >>>> to be able to throw hard-to-compress pages to swap first during memory >>>> pressure. This would help maximize the in-memory compression ratio. >>>> >>>> As for multiple algorithms, I would say to always compress with your >>>> lightest, fastest algorithm FIRST (this would be WKdm). During idle >>>> time you can go back and decompress/recompress pages with something >>>> heavier (WK4x4). There is no shame in abusing the CPU when it's >>>> idle ;) >>>> >>> For desktop systems, this makes sense. But what about OLPC systems? You >>> really can't abuse idle CPU, you got to conserve power (I'm not sure of >>> its effect on power requirements but it should be like more we run more >>> we eat ;) ) >>> >> >> True :) >> > > > Cheers, > Nitin Gupta > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iQIVAwUBRJWGugs1xW0HCTEFAQLpKBAAjbnj4W/mLvcudtAcH2DLeXpnL6/n1EHt gJ2c8bb9kdwRdDj6O6drCBPn4F2i2c0qiAI+6AIEtspFsTHDw3tHQi+9QxEChMf8 gMKLQ5fZI9SL/rXUbG6o6pF+pCvPtWjdfs+Qy1CDa9MIinh5DSFDQ1DGV1agBHtC Ow22apMS3p4xtNvnbRZ9l5d5GkS35nFJTnGDNVA2N6eKRk+UM2wlEqjcjFhMQyH7 XWQVTu4GiC9tDGDPUplGIfKIOHFfC7UqqLrciSQzgbr3gbIKfs1Qa/v5qNtWJLzr VeGKPCypK/74blA2+PjISH29aJ+vC6jJHQIFSy3fi9abt4esZ3JgxwyqNO+uIo1+ RKdvHMLKf71DNBf2QVV3BzBP7ZBYelRZFNCaSdfrGd5NFgIQR06JW5Nf14adpi+D hXK44YpkD1TVFuHoitVURODNBmwdrNF+Bqay/fY9hi31RSadylGT91U2OpFrSKyC 3cjfXB36HwDHNWtRxb2WzlETJvA4oMLWNvLbZAn6MqXZnBq6E4VwJNP+z3y22PNl dLOadtu+3qWbt0UyCHQp4sbpHgCgiG0wwDRMAJCNHhnXH7Tww93R84RkgtF8GkRC n1FtIH0mv3zWWSbFuei88KSqk8unH9xq9tKGKboHmJiDF3/AyZd/De8F+icTCIn9 0IfPKD+e6qQ= =v+xT -----END PGP SIGNATURE----- |
From: Nitin G. <nit...@gm...> - 2006-06-18 15:28:58
|
John Moser wrote: > Nitin Gupta wrote: >> John Moser wrote: >>> Nitin Gupta wrote: >>>> On 6/17/06, John Moser <joh...@gm...> wrote: >>>> * Storing pages: >>>> -- First, it creates a 'virtual swap' which is a swap with its own >>>> swap_map. This swap is given highest priority. So, when pages are to >>>> be swapped out, existing code path naturally/transparently uses this >>>> swap device (until it's full) for swapping out. (prelim patches for >>>> this to be posted very soon on lc-devel). >>> (See I don't understand the swap system that well :) >>> >>> I was thinking eventually I'd have to convince the kernel to *try* to >>> swap even if there was no swap device.. >>> >>> Once this swap device is full it needs to start pushing its pages to a >>> REAL swap device, i.e. on disk. >>> >> This is exactly what this virtual swap is meant for :) >> Pages assigned to this virtual swap will eventually go to ccache and >> when it's full, pages are sent to real swap disk -- all this with >> minimal code intrusion. This approach works even if there are no real >> swap disks -- only virtual swap will be used in that case. >> > > Awesome. And the swap cache and existing LRU logic is all used, right? > (These are important to get memory usage patterns right; we should > theoretically be able to send pages physically to disk in any order we > want without being less efficient than general Linux...) > Yes! >>>> -- in shrink_page_list() it calls should_add_to_ccache() to determine >>>> if page can really be added to ccache. It can consider things like no. >>>> of pages already in ccache, CPU load, type of page (anon or fs-backed) >>>> and some heuristics that tell if page is even worth mem space it will >>>> take (no such heurustic yet worked out but it is really important >>>> thing). >>>> >>> I was thinking putting that code in a module. The page would be >>> compressed with a fast reference algorithm to be used to get a >>> Compression Index CI = PAGE_SIZE - COMPRESSED_PAGE_SIZE. Squeezing down >>> a single page is almost instant anyway. >>> >>> As for "type of page," certain FS-backed pages could be freed, others >>> would have to be persistent. I was thinking more on letting the calling >>> code mark pages that could be freed as "transient" and just freeing them >>> if the compressed memory area got too big, rather than swapping. >>> >> Again, this is exactly what's happening. Pages guessed to be swapped out >> by should_add_to_ccache() are marked by PG_will_compress flag. But, this >> does not guarantee that they will really be compressed. If finally, >> compression fails (lack of mem, CPU load high, poor compressibility >> etc.) this flag is cleared and everything else goes as usual. If it >> really gets compressed, its marked with PG_compressed. Thus it goes with >> your two step 'transient' approach...I think. >> > > Transient pages == come from the disk cache system. If the disk cache > decides it wants some page back and we don't have it, it can go read it > back from disk (i.e. /lib/*, executables, stuff in /usr...) > >>>> -- then pages filtered by should_add_to_ccache() go to pageout() as >>>> usual where they are sent to add_to_ccache() where they are compressed >>>> and added to compression structure. If add_to_ccache() fails for some >>>> reason (e.g. insuff. RAM), the page falls back to normal swap disks >>>> (rest of pageout() code). 
>>> I was thinking having the Swap Zone code supply the code to write to >>> physical disk, and just having stuff pass through it: >>> >>> RAM -> Swap Zone -> Compress -> NOPE DOESN'T SHRINK -> Swap File >>> >>> This was also mainly for when the compressed area got too big and needed >>> shrinkage (by, of course, real swapping). Of course, compression itself >>> was going to be a module, so normal operation was: >>> >>> RAM -> Swap Zone -> Swap File >>> >>> until such time as the compress module was loaded. :) >>> >> As I understand you are really after an approach that can easily (and >> with minimum code intrusion in kernel) fall back to real swap disks when >> compression can't be done. This is being done well in current approach. >> > > Yeah. I was looking at a different use of the same code later though > but :) So far your design looks technically about the same as mine. > hmm...somewhat different: -- I am not looking to have multi-stage compression i.e. first compress with quickest then later compress them with heavyweights instead when CPU is idle. -- Not (yet) considering grouping of pages for compression. -- And I've not yet started work on a heuristic for adaptive ccache size. >>>> * Restoring pages: >>>> Whenever swap-cache/page-cache is looked-up for a page >>>> (find_get_page(), find_get_pages() etc.), we check PG_compressed flag. >>>> If its set we decompress() page and return it. >>>> >>> I skipped pages compressed in place because compressing individual pages >>> is non-interesting. I noticed 15-20% compression with 4K pages on >>> Castro's patch; pushing this up to 8KiB, 16KiB, 32KiB, and 64KiB, I >>> noticed gains up to 50-60% when we hit 32KiB and nothing more beyond. >>> >>> I'm pretty sure that there's an asymptotic limit at 20, 24, 28, or 32KiB >>> (the lower the better); I want to compress pages in groups of whatever >>> that limit falls on for maximum performance. >>> >>> This makes the approach I had in mind a lot more complex. Pages would >>> be grouped using a grouping algorithm; idle time would be used to >>> disrupt groups and reconstruct more optimal grouping; and in the end we >>> would also attempt to swap out more difficult to compress pages first. >>> >>> (In fact this is why I want to find the smallest group for the >>> asymptotic limit-- 32KiB, 8 4KiB pages, my regrouping algorithm has to >>> decompress, rearrange, and recompress up to 9 groups, 288KiB! If it >>> were i.e. 20, that'd be 6 x 5 = 120KiB) >>> >> This is a great point! I really didn't even consider grouping pages >> together for compression -- because I was unaware of its effect on >> compression ratios and due to additional complexity it involves. This >> grouping of pages is definitely worth a genuine try. But first, to keep >> it simple, I will implement for single page compression. With this thing >> stable we can do some algo tweaking to have this grouping. >> > > Yeah, the page grouping and balancing code would be rather complex. > Just the idea is complex. And thats why I will look at it later or maybe just make it flexible enough for now. > >>>> De/Compression codes are dumb. They just compress what they have in >>>> input buffer and output a compressed buffer. >>>> >>>> Also, these algos are pluggable as separate kernel modules. They can >>>> de/register themselves with ccahe (as is done by JFFS2). 
But, for now >>>> I will just place them within kernel itself but overall approach makes >>>> it easy to get de/compression algos as pluggable modules :) >>>> >>> :) And the algorithm to use them? I was considering using a reference >>> block to compress to get a CI for the algorithm itself. >>> >> I was thinking of an alternate (simpler) approach. This is because we >> have only 3 principal de/compression algos -- WKdm, WK4x4 and LZO. Of >> these WKdm and WK4x4 are designed to compress typical anonymous memory >> pages (see >> http://www.cs.amherst.edu/~sfkaplan/papers/sfkaplan-dissertation.ps.gz >> for details of WKdm and WK4x4) while LZO is much better at typical >> filesystem (textual) data. Also speeds (in general): WKdm > WK4x4 > LZO >> >> So, considering this, I think a simple algo selection scheme like >> WKdm/WK4x4 if page is anonymous and LZO if fs-data. >> Or, depending on kind of file, WKdm/WK4x4 can give comparable >> compression with much better speeds. To handle cases, we can record >> 'last compression ratio' for each mapping (recorded in corres. >> address_space struct itself). This can be used to have separate algo >> selection per open file. > > Using a reference algorithm would be simpler if you wanted to do > multi-layer compression. Think about the following events: > > - Page 0x01 compressed (WKdm) > - Page 0x02 compressed (WKdm) > - Page 0x03 compressed (WKdm) > - CPU is idle > - Page 0x01 recompressed (WK4x4) > - Page 0x02 recompressed (WK4x4) > - Page 0x04 compressed (WKdm) > > Current state: > > 0x01 WK4x4 > 0x02 WK4x4 > 0x03 WKdm > 0x04 WKdm > > - CPU is idle > - Rearrange pages based on compressibility > > PROBLEM: How do you index compressibility? > > - Memory pressure is mounting > - Get pages out of memory -- Least compressible? I don't think its good to kick out least compressible pages first. What ultimately matters is to have working set of running programs to stay in memory -- compressed or uncompressed. But then, its hard to guess working set of each running program...Its a big work in itself to come up with a good heuristic for which (and how many) pages to kick out from ccache when memory gets real low. > > PROBLEM: Need to solve previous problem.. > > This is why I was considering multiple, pluggable algorithms and > reference instrumentation. The references would be: > > - Each new PAGE is compressed with a fast, extremely light algorithm > that doesn't produce useful ratios but runs almost straight through the > code (i.e. a streaming dictionary compressor rather than a windowed one, > which immediately is passed to a streaming huffman encoder) > - PAGE_SIZE - compressed_page_size == Compression Index (CI) > > - Each new algorithm is given a reference block to compress to test its > compression efficiency > - block_size - compressed_block_size == CI for the algorithm > - We could have 'anonymous memory' and 'file data' blocks > - Could produce an anonmem_CI and filemem_CI > This is just too much work to be done before _every_ compression. It can affect performance badly. > This would allow for the following useful elements: > > - Groups may be given a CI == sum of the CI's of each page in the group > - Pages each have a CI that doesn't change based on algorithm used > - Algorithms each have a CI to describe their efficiency > > Which can be used for: > > - It is possible to quickly take a short incremental step to try to get > the harder-to-compress pages into the same groups (i.e. 
find the first > group from the hardest-to-compress end that is 'wrong', find all groups > containing the next N hardest-to-compress pages, break those groups > apart, reorder them, the group is now 'proper') > - This allows for idle time to be used to sort pages and groups in > such a way that our first candidates for writing to disk are pages that > are 'hard to compress' (taking more memory) > Hard to compress factor should be combined with factor of having working set of apps within memory (and maybe other too - I don't know). > - It is possible to use a "Light" algorithm (such as WKdm) initially, > and move to a "Heavy" algorithm by decompressing the block and > recompressing it during idle CPU time > - Maximize efficiency without adding unnecessary overhead while the > system is busy > -- compress a page with WKdm [CPU is idle] -- extract these pages compressed with WKdm from ccache [CPU is now busy] -- compress them with LZO -- put them again in ccache There can be lot of potential problems with this thing. We need to look at it carefully in detail. I was just wondering how swap prefetching patch senses CPU as 'idle' - what are the things it considers?, how it handles problem of 'spike' loads?. Will look at all this later. > Possibly other things too. > > Anyway the point is I was more thinking on how to get it so when you > initially compress pages, you use very light algorithms; and when you > have idle CPU time, you burn it to order and condense the compressed > cache. This avoids CPU usage during initial entry into the cache. > >>>> Many of the setting are also tunable via sysctl interface. (e.g. >>>> /proc/sys/vm/max_anon_cc_size). >>>> >>> I was considering being conservative with the maximum cache size in >>> general; but more specifically, allowing the cache to grow until it >>> notices a lot decompression is happening per time space (compression we >>> do NOT care about; decompression means the system wants something back). >>> If we suddenly see 300 decompressions a second, we're not being very >>> efficient; time to start shrinking the compressed cache. (Under real >>> memory pressure, this shrinkage will involve swapping; otherwise, just >>> expanding requested pages back into memory.) >>> >>> Increasing the size of compressed cache during idle time can be done >>> conservatively, but I currently can't think of a way to figure out how >>> big to make it when not under load. Increasing passively (i.e. swapping >>> LRU data into compressed cache) would get some memory into (cheap) >>> compressed cache before the system came under load (decompression is >>> cheaper than compression). >> There are _many_ things to consider when developing heuristic for >> adaptive ccache resizing scheme. I have touched this area for now. I >> first aim to have a static sized ccache working first. >> >> This said, it's always good to keep thinking over this...it's >> interesting and is going to be useful :) >> >>>> Apart from some kernel hooks (as in lookup functions, >>>> shrink_page_list(), pageout()), all ccache code is pretty much in >>>> separate .[ch] files or as separte kernel modules. So, it's also very >>>> less intrusive. >>> Very nice. >>> >>>>> Another advantage is that no matter what I do, I can't swap pages to >>>>> disk in any worse order than the kernel normally would, since I'm being >>>>> handed pages based on existing LRU logic. To this end I've decided to >>>>> take the "Maximize Compression" approach and try to swap out pages that >>>>> are "hard to compress" first. 
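Given per-group compression indexes, the "hard to compress first" ordering mentioned here can start out as nothing more than a scan for the lowest total CI; the struct below is invented for illustration and looks nothing like the real ccache metadata:

/* Pick the swap-out candidate whose group has the lowest compression
 * index, i.e. the group costing the most memory per page it holds. */
struct cc_group {
	int ci_total;    /* sum of the CIs of the pages in this group */
	int nr_pages;
};

static int pick_swapout_group(const struct cc_group *groups, int count)
{
	int best = -1, best_ci = 0, i;

	for (i = 0; i < count; i++) {
		if (groups[i].nr_pages == 0)
			continue;
		if (best < 0 || groups[i].ci_total < best_ci) {
			best = i;
			best_ci = groups[i].ci_total;
		}
	}
	return best;   /* index of hardest-to-compress group, or -1 if none */
}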
>>>>> >>>>> Actual compression will be done by a module; although actual >>>>> compression >>>>> algorithms will be loaded as slave modules to this module. :) >>>>> >>>>> I've also made note that due to the nature of this design, large >>>>> amounts >>>>> of flexibility are presented. We can talk to the swap device in any >>>>> way >>>>> we want; the Swap Zone would supply a mechanism for modules to access >>>>> the swap device to swap in and out anything, but would always ask the >>>>> modules themselves to get pages for it. In theory, we can compress >>>>> whatever we like and swap it out; >>>> Approach I described above is flexible enough to handle all kinds of >>>> pages: anon pages, clean and dirty page cache pages. >>>> >>> Compressed pages on swap? Other uses of the code (i.e. implementing >>> something besides compression)? >>> >> maybe I didn't understand you here. I meant flexibility in terms of >> what kind of pages we can keep in compression structure, a compression >> structure that should give good performance for almost all operations >> (lookup, expanding, shrinking, deleting a page, minimize fragmentation >> etc.) >> > > Mmm.. I was looking at toying with the infra later to 1) Compress data > that's in swap (and keep in memory a ccache swp_entry_t to real > swp_entry_t mapping to find it later); and 2) Implement a toy > distributed, encrypted storage backing (i.e. pages can be encrypted and > swapped over network to 2-3 servers for redundant storage off-device, > think wireless PDAs in a corporate network...) > Interesting. Many ideas keep flying in my mind but when I look at my code...lol..not even static cccache is working yet...LOL >>>>> of course, we need to relate swap file >>>>> indexes to real page indexes, and deal with suspend-to-disk et al. >>>> With approach described, we automatically use exisiting mech. of >>>> kernel for our use. This virtual swap is just another swap disk to >>>> other kernel parts. >>> What happens when you suspend to disk? Does the virtual swap get >>> force-flushed to real swap? >> It has to -- after all RAM can retain stuff in suspend mode. >> >>> Can you compress pages in swap and have the >>> kernel not wonder wtf you're smoking when it tries to come back up? >>> >> Yes, it should be easy to swap out pages as compressed form itself. This >> was done in Castro's patch too -- this is not to save swap disk space >> (rest of space was padded) but to reduce unnecessary decompression until >> they are _really_ required. >> > > If you work in multi-page blocks and get better than 1:((N-1)/N) > compression anywhere, you'll squeeze out a page. In groups of 5 pages > it's entirely feasible to get better than 20% compression, squeezing out > a whole page of swap usage. > > Of course, this in no way interferes with the original goal of reducing > unnecessary decompression; it's a happy side-effect. :) > This grouping of pages will also reduce metadata required for ccache. Each group will have a single chunk_head and coressponding chunks i.e. no. of cunk_heads will be reduced. >>> (this would all just require extra infrastructure, mainly APIs to say >>> 'Don't compress these pages' and 'stop storing shit in memory until I >>> say so', nothing majorly intrusive) >>> >>>> ----------------------- >>>> I have a Wiki setup for this project linux-mm.org/CompressedCaching >>>> Here, I have detailed the approach I'm taking (at least for page-cache >>>> pages). 
It's slightly out-dated now --- It doesn't detail approach for >>>> anon pages in detail.....I will update it soon. >>>> Feel free to edit it. >>>> >>>> Since I'm doing this project as part of Google SoC, I have many >>>> 'deadlines' to consider. So, now bringing major changes to approach >>>> will overwhelm me with work. It will be really nice if you can provide >>>> feedback on my approach :) >>> Your approach sounds at least to be minimally intrusive, which is good >>> because it lets you keep the code separate and do interesting things. >>> You also use the swapping mechanism, so you should be able to take >>> advantage of page cache implicitly (a goal of my own design). >>> >>> You need to consider page grouping; Experiment with different sized >>> groups, especially around 16K, 20K, 24K, 28K, and 32K. Try to make the >>> code flexible enough to handle variable sized page groups too; it would >>> be interesting to have multiple group sizes at run, but possibly more >>> complex than needed. I found page grouping to present gains of 10-15% >>> when moving between 4K-8K pages, and then about 5% between 16K-32K, and >>> then 0.1% between 32K-64K on Rodrigo's patch; this tells me there's >>> nothing interesting about any group bigger than 32K, and maybe you can >>> get away with something smaller and still maximize gains. >>> >> I will definitely lookup into how flexible my current approach is w.r.t >> grouping pages and will try to get some flexibility here if it shows up >> rigid. This grouping of pages seems really important. >> > > Run some simple tests. Feed the modules a group of test data; have them > compress it in chunks of 4K, 8K, 12K, 16K, 20K, 24K, 28K, 32K, 36K; and > then output the average compression ratio for each group size. The > output would look similar to: > > > 4K 16% > 8K 23% > 12K 30% > 16K 35% > 20K 39% > 24K 42% <-- This would be a good place to stop... > 28K 43% <-- Not worth the gains > 32K 43% > 36K 44% > > (Numbers faked) > > And you'd basically be looking for where the numbers stopped going up by > useful degrees (in this example, 42% -> 43% we'll say 1% isn't useful > for the added overhead). If you keep going and going you'll find a > point where no matter how much bigger the block gets you can't gain any > more compression; it's likely good to stop a bit before that, knocking > off 1% or 2%. > Your analysis of grouping pages interests me most :) I will definitely look into having this grouping supported by my design.....again, a bell rings -- nitin! get that static ccache with simplest of things you can do first!! ;) >>> One part of my design was going to be to let the kernel's existing swap >>> code make ALL the LRU decisions and base my decisions entirely on >>> compressibility of the pages. Pages would be regrouped during idle time >>> and groups would be kept in order of compressibility. The idea here was >>> to be able to throw hard-to-compress pages to swap first during memory >>> pressure. This would help maximize the in-memory compression ratio. >>> >>> As for multiple algorithms, I would say to always compress with your >>> lightest, fastest algorithm FIRST (this would be WKdm). During idle >>> time you can go back and decompress/recompress pages with something >>> heavier (WK4x4). There is no shame in abusing the CPU when it's idle ;) >>> >> For desktop systems, this makes sense. But what about OLPC systems? 
You >> really can't abuse idle CPU, you got to conserve power (I'm not sure of >> its effect on power requirements but it should be like more we run more >> we eat ;) ) >> > > True :) > Cheers, Nitin Gupta |
From: Anderson B. <bri...@gm...> - 2006-06-09 19:17:42
|
On 6/9/06, Nitin Gupta <nit...@gm...> wrote: > On 6/9/06, Anderson Briglia <bri...@gm...> wrote: > > > > I have been trying to compress a page using that WKdm code from > > > > project's CVS. But, something is wrong because the system hangs after > > > > some compression/decompression pages. I'm using this interface: > > > > > > > > to compress "original_page": > > > > > > > > int len; > > > > len = WKdm_compress(page_address(original_page), page_address(comp_page), 1024); > > > > memcpy(page_address(original_page), page_address(comp_page), len); > > > > > > > > to decompress: > > > > > > > > WKdm_decompress(page_address(original_page), page_address(decomp_page), 1024); > > > > memcpy(page_address(original_page), page_address(decomp_page), PAGE_SIZE); > > > > > > > > Is needed to do some padding? I saw at compress_test module that the > > > > comp_data is filled with 0's after the compression. Is it necessary? > > > > > > Yes, WKdm only works when input is exactly 1024 words. > > > > So, I guess I must do a loop if I want to compress one page-full of > > data, right? A loop of four iteractions, of course. > > > > No, I meant 1024 words==PAGE_SIZE (4K bytes). WKdm needs input of 4K > bytes to work so you can compress a page with a single call. Oh, ok! I'm not very familiarized with compression algorithms. :) > > > > Are you compressing pages with less than page-full of data and still without > > > padding? > > > > I'm compressing pages came from the shrink_list( ). I believe those > > pages have PAGE_SIZE of data. > > > > Are you just compressing and discarding pages as they come in > shrink_list() or have you developed any compression structure to store > pages and then later restore them when lookup is performed? No, I don't have a compression structure, yet. I guess that idea of chunks should work, but if I design some other structure, you will know. Actually I'm testing the pages' compression: after the compression, the page is sent to the swap area and when the page is recovered from swap, it's decompressed. Of course we will not have any benefit doing it, but it's a good exercise to face the "scary MM code", :). I'll use your latest compress_test code. Cheers, Anderson Briglia |
From: Nitin G. <nit...@gm...> - 2006-06-09 17:06:26
|
Hi All, Now, all three algos are ported to kernel mode - WKdm, WK4x4, LZO. You can take them from CVS (linux26/lib) or else linux-mm.org/CompressCaching/Code has 'compress_test' (updated) module which allows you to easily test all three of them (all 3 algos are included with this module). Back to that scary MM code... :) Cheers, Nitin |
From: Nitin G. <nit...@gm...> - 2006-06-09 15:23:00
|
On 6/9/06, Anderson Briglia <bri...@gm...> wrote: > > > I have been trying to compress a page using that WKdm code from > > > project's CVS. But, something is wrong because the system hangs after > > > some compression/decompression pages. I'm using this interface: > > > > > > to compress "original_page": > > > > > > int len; > > > len = WKdm_compress(page_address(original_page), page_address(comp_page), 1024); > > > memcpy(page_address(original_page), page_address(comp_page), len); > > > > > > to decompress: > > > > > > WKdm_decompress(page_address(original_page), page_address(decomp_page), 1024); > > > memcpy(page_address(original_page), page_address(decomp_page), PAGE_SIZE); > > > > > > Is needed to do some padding? I saw at compress_test module that the > > > comp_data is filled with 0's after the compression. Is it necessary? > > > > Yes, WKdm only works when input is exactly 1024 words. > > So, I guess I must do a loop if I want to compress one page-full of > data, right? A loop of four iteractions, of course. > No, I meant 1024 words==PAGE_SIZE (4K bytes). WKdm needs input of 4K bytes to work so you can compress a page with a single call. > > Are you compressing pages with less than page-full of data and still without > > padding? > > I'm compressing pages came from the shrink_list( ). I believe those > pages have PAGE_SIZE of data. > Are you just compressing and discarding pages as they come in shrink_list() or have you developed any compression structure to store pages and then later restore them when lookup is performed? Cheers, Nitin Gupta |
From: Anderson B. <bri...@gm...> - 2006-06-09 13:30:37
|
> > I have been trying to compress a page using that WKdm code from > > project's CVS. But, something is wrong because the system hangs after > > some compression/decompression pages. I'm using this interface: > > > > to compress "original_page": > > > > int len; > > len = WKdm_compress(page_address(original_page), page_address(comp_page), 1024); > > memcpy(page_address(original_page), page_address(comp_page), len); > > > > to decompress: > > > > WKdm_decompress(page_address(original_page), page_address(decomp_page), 1024); > > memcpy(page_address(original_page), page_address(decomp_page), PAGE_SIZE); > > > > Is needed to do some padding? I saw at compress_test module that the > > comp_data is filled with 0's after the compression. Is it necessary? > > Yes, WKdm only works when input is exactly 1024 words. So, I guess I must do a loop if I want to compress one page-full of data, right? A loop of four iterations, of course. > Are you compressing pages with less than page-full of data and still without > padding? I'm compressing pages that came from shrink_list(). I believe those pages have PAGE_SIZE of data. Cheers, Anderson Briglia |
From: Nitin G. <nit...@gm...> - 2006-06-09 08:35:02
|
Hi Briglia, > I have been trying to compress a page using that WKdm code from > project's CVS. But, something is wrong because the system hangs after > some compression/decompression pages. I'm using this interface: > > to compress "original_page": > > int len; > len = WKdm_compress(page_address(original_page), page_address(comp_page), 1024); > memcpy(page_address(original_page), page_address(comp_page), len); > > to decompress: > > WKdm_decompress(page_address(original_page), page_address(decomp_page), 1024); > memcpy(page_address(original_page), page_address(decomp_page), PAGE_SIZE); > > Is needed to do some padding? I saw at compress_test module that the > comp_data is filled with 0's after the compression. Is it necessary? Yes, WKdm only works when input is exactly 1024 words. Are you compressing pages with less than page-full of data and still without padding? Cheers, Nitin |
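Turning that answer into code, a wrapper that always hands WKdm a full 1024-word (4K) input could look like the sketch below. The prototypes are guessed from the calls quoted in this thread, and the assumptions that the return value is a byte count and that the destination needs roughly 2x headroom for incompressible data are exactly that -- assumptions to check against the sources in linux26/lib/WKdm:

#include <string.h>

#define CC_PAGE_SIZE  4096
#define CC_PAGE_WORDS (CC_PAGE_SIZE / 4)   /* 1024 32-bit words */

/* Prototypes guessed from the calls above; the real header is authoritative. */
unsigned int WKdm_compress(void *src, void *dst, unsigned int words);
void WKdm_decompress(void *src, void *dst, unsigned int words);

/* Compress up to one page of data, zero-padding short input so that WKdm
 * always sees exactly 1024 words. dst is assumed to be >= 2 * CC_PAGE_SIZE. */
static unsigned int compress_page(const void *data, unsigned int len,
				  unsigned char *dst)
{
	unsigned char buf[CC_PAGE_SIZE];

	memset(buf, 0, CC_PAGE_SIZE);
	memcpy(buf, data, len < CC_PAGE_SIZE ? len : CC_PAGE_SIZE);

	return WKdm_compress(buf, dst, CC_PAGE_WORDS);
}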
From: Nitin G. <nit...@gm...> - 2006-06-09 03:02:07
|
Hi All,

Just finished porting LZO (ver 2.02) to kernel mode. The attached file is a kernel module (compress-test) that I created to test these kernel ports of de/compression algorithms. This module has been much refined since it was last posted. Now, two algos are ported - LZO and WKdm (WK4x4 also soon). You can test both of them using this module. It creates 3 /proc entries: /proc/compress-test/{compress, decompress, algo_idx}.

---------------------
* In short:

Write to /proc/compress-test entries:
1. compress: compress data written to it and store in internal buffer.
2. algo_idx: write index of algo you want to test (0: WKdm, 1: LZO)

Read from /proc/compress-test entries:
1. compress: show original and compressed size (TODO: add other stats like time taken too)
2. decompress: decompress compressed data stored in internal buffer.
3. algo_idx: shows list of algos supported with their index.
----------------------

* committed these two algos to CVS in linux26/lib/{WKdm, LZO}
* linux-mm.org/CompressedCaching/Code page has been updated -- this module (with both these algos) is available there.
* TODO Next: get virt swap actually working.

Cheers, Nitin Gupta |
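From userspace the module can be driven roughly like this; the paths and read/write semantics are taken from the description above, while the exact format expected when writing algo_idx is an assumption:

#include <stdio.h>

/* Walk through the compress-test /proc interface: select LZO, feed it some
 * data, then read back the size report and the decompressed data. */
int main(void)
{
    char buf[4096];
    FILE *f;

    f = fopen("/proc/compress-test/algo_idx", "w");      /* 1 == LZO */
    if (!f) { perror("algo_idx"); return 1; }
    fputs("1", f);
    fclose(f);

    f = fopen("/proc/compress-test/compress", "w");      /* data to compress */
    if (!f) { perror("compress"); return 1; }
    fputs("hello hello hello hello hello hello hello", f);
    fclose(f);

    f = fopen("/proc/compress-test/compress", "r");      /* size report */
    if (f) {
        while (fgets(buf, sizeof(buf), f))
            fputs(buf, stdout);
        fclose(f);
    }

    f = fopen("/proc/compress-test/decompress", "r");    /* data comes back */
    if (f) {
        while (fgets(buf, sizeof(buf), f))
            fputs(buf, stdout);
        fclose(f);
    }
    return 0;
}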
From: Anderson B. <bri...@gm...> - 2006-06-08 21:10:22
|
Hi all, I have been trying to compress a page using that WKdm code from project's CVS. But, something is wrong because the system hangs after some compression/decompression pages. I'm using this interface: to compress "original_page": int len; len = WKdm_compress(page_address(original_page), page_address(comp_page), 1024); memcpy(page_address(original_page), page_address(comp_page), len); to decompress: WKdm_decompress(page_address(original_page), page_address(decomp_page), 1024); memcpy(page_address(original_page), page_address(decomp_page), PAGE_SIZE); Is needed to do some padding? I saw at compress_test module that the comp_data is filled with 0's after the compression. Is it necessary? Best regards, Anderson Briglia |
From: Nitin G. <nit...@gm...> - 2006-06-08 06:36:41
|
Hi Briglia, > > I'm testing the patch and the module but I got a 'Segmentation Fault' > problem when reading /proc/swaps entry after creating the virtual swap > area. > Is this virtual swap usable? Because when a page will be swapped-out > to the virtual swap area I'm getting a Kernel Null pointer reference. > This virtual swap patch is just the beginning point and these seg faults are 'expected' behavior -- nothing is usable now. To make it work, it is required to integrate the work from previous patches (as on CompressedCaching/Code) and this is what is my next TODO.... Currently, just for a bit change of work, I am working on porting LZO (ver 2.02). It should be completed today. Then I will continue to make this virtual swap actually work. Thanks for looking into this work :) Best Regards, Nitin Gupta |
From: Anderson B. <bri...@gm...> - 2006-06-08 00:45:05
|
Hi Nitin, > The approach I've taken is this: > > * General > It creates a virtual swap device (another swap_info_struct in swap_info[]). Its > size is set though a /proc entry created by 'cc_control' module (attached). This > swap is set to highest priority and is identified in other code paths with > SWP_COMPRESSED flag. So, when adding page to swap cache, it is given 'type' no. > as for this virt swap until it reaches its max capacity (as set through /proc > interface). > > * Storing page to ccache > When swap to disk occurs in pageout(), same things will occur as is done for > page-cache pages -- it doesn't need to know if it's anon or fs-backed page. > These things are - replace corresponding swapper radix tree node to a > 'chunk_head' which then points to the location of 'chunks' in ccache storage > structure. > > * Restoring an anon page from ccache > When page fault occurs, handle_pte_fault() first checks swap cache if it already > has required page which then calls find_get_page() over swapper space radix > tree. Here, a compressed page is handled exactly as for page cache page i.e. > checked if looked-up page has PG_compressed set, decompress it and make radix > node point back to this uncompressed page. > > * Notes > -- This approach deals easily with case where no real swap devices exist. This > virt swap is a separate entity with all separate data structures like swap_map[] > as for a normal swap device. > -- This only requires working with arch independent swp_entry_t instead of arch > dependent representation of it in PTEs. > -- More code sharing for handling page-cache and swap-cache pages - much of code > wouldn't know if its working with anon or fs-backed page like > add_to_swap_cache(), find_get_page() (and friends). > > * About files attached > -- cc_control module: This creates two proc entries: > /proc/cc_control/{max_anon_cc_size, max_fs_backed_cc_size}. Write to > max_anon_cc_size, the size of ccache for anon pages in units of no. of pages. > Writing this value calls set_anon_cc_size() in mm/swapfile.c which creates this > virtual swap device with size as passed and highest priority (hard-coded to > 100). (max_fs_backed_cc_size is dummy for now). > -- patch for 2.6.17-rc5 that has this set_anon_cc_size() with some small misc > handling. I'm testing the patch and the module but I got a 'Segmentation Fault' problem when reading /proc/swaps entry after creating the virtual swap area. Is this virtual swap usable? Because when a page will be swapped-out to the virtual swap area I'm getting a Kernel Null pointer reference. Best regards, Anderson Briglia |
From: Anderson B. <bri...@gm...> - 2006-06-07 20:16:39
|
[RESENDING] Hi Nitin, I'm testing the patch and the module but I got a 'Segmentation Fault' problem when reading /proc/swaps entry after creating the virtual swap area. Is this virtual swap usable? Because when a page will be swapped-out to the virtual swap area I'm getting a Kernel Null pointer reference. Best regards, Anderson Briglia PS: I sent this message yesterday to the linux-compressed mailing list, but it didn't come for some reason. On 6/4/06, Nitin Gupta <nit...@gm...> wrote: > Hi, > > This is kind of weekly work summary for ccache work... :) > > I originally started this project with focus on "general" desktop systems > (OpenOffice + KDE/Gnome on 128MB RAM?). But now I'll instead focus it on much > more exciting case -- The OLPC systems. This should also bring its applicability > to a wide range of Linux based embedded devices. This does not put original goal > out of way but it just results in some change of priority - much of things are > same for both. > > The most important things I learnt for OLPC systems w.r.t ccaching are: > > -- They DO NOT want on-disk compressed swapping (problems: slow speed for > writing to flash space, wear leveling). > -- For in-memory compression, they have much higher priority for compressing > anonymous memory pages instead of filesystem backed (page-cache) pages. > > I originally started out to handle page-cache pages. Since I'm now focusing on > OLPC systems, I'll change gears to anon pages now. This has slowed development > for time being as I still have only half-baked roadmap for anon pages, but now > design work has progressed nicely and have started its implementation work :) > -------------------------------------------------------------------- > > The approach I've taken is this: > > * General > It creates a virtual swap device (another swap_info_struct in swap_info[]). Its > size is set though a /proc entry created by 'cc_control' module (attached). This > swap is set to highest priority and is identified in other code paths with > SWP_COMPRESSED flag. So, when adding page to swap cache, it is given 'type' no. > as for this virt swap until it reaches its max capacity (as set through /proc > interface). > > * Storing page to ccache > When swap to disk occurs in pageout(), same things will occur as is done for > page-cache pages -- it doesn't need to know if it's anon or fs-backed page. > These things are - replace corresponding swapper radix tree node to a > 'chunk_head' which then points to the location of 'chunks' in ccache storage > structure. > > * Restoring an anon page from ccache > When page fault occurs, handle_pte_fault() first checks swap cache if it already > has required page which then calls find_get_page() over swapper space radix > tree. Here, a compressed page is handled exactly as for page cache page i.e. > checked if looked-up page has PG_compressed set, decompress it and make radix > node point back to this uncompressed page. > > * Notes > -- This approach deals easily with case where no real swap devices exist. This > virt swap is a separate entity with all separate data structures like swap_map[] > as for a normal swap device. > -- This only requires working with arch independent swp_entry_t instead of arch > dependent representation of it in PTEs. > -- More code sharing for handling page-cache and swap-cache pages - much of code > wouldn't know if its working with anon or fs-backed page like > add_to_swap_cache(), find_get_page() (and friends). |
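For readers without the wiki at hand, a very rough guess at the shape of the chunk_head/chunk metadata mentioned above -- the actual structures described on CompressedCaching almost certainly differ in layout and naming:

#include <linux/list.h>
#include <linux/types.h>

/* Guesswork for illustration only. One chunk_head replaces the radix-tree
 * slot of a compressed page and points at the chunks holding its data. */
struct cc_chunk {
	struct list_head list;    /* chunks making up one compressed page */
	void *data;               /* piece of the compressed image */
	unsigned short size;      /* bytes used in this chunk */
};

struct chunk_head {
	struct list_head chunks;       /* list of struct cc_chunk */
	unsigned int compressed_size;  /* total compressed bytes */
	unsigned short algo;           /* which algorithm produced it */
	unsigned long flags;
};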
From: Nitin G. <nit...@gm...> - 2006-06-06 02:01:24
|
Hi, This is kind of weekly work summary for ccache work... :) I originally started this project with focus on "general" desktop systems (OpenOffice + KDE/Gnome on 128MB RAM?). But now I'll instead focus it on much more exciting case -- The OLPC systems. This should also bring its applicability to a wide range of Linux based embedded devices. This does not put original goal out of way but it just results in some change of priority - much of things are same for both. The most important things I learnt for OLPC systems w.r.t ccaching are: -- They DO NOT want on-disk compressed swapping (problems: slow speed for writing to flash space, wear leveling). -- For in-memory compression, they have much higher priority for compressing anonymous memory pages instead of filesystem backed (page-cache) pages. I originally started out to handle page-cache pages. Since I'm now focusing on OLPC systems, I'll change gears to anon pages now. This has slowed development for time being as I still have only half-baked roadmap for anon pages, but now design work has progressed nicely and have started its implementation work :) -------------------------------------------------------------------- The approach I've taken is this: * General It creates a virtual swap device (another swap_info_struct in swap_info[]). Its size is set though a /proc entry created by 'cc_control' module (attached). This swap is set to highest priority and is identified in other code paths with SWP_COMPRESSED flag. So, when adding page to swap cache, it is given 'type' no. as for this virt swap until it reaches its max capacity (as set through /proc interface). * Storing page to ccache When swap to disk occurs in pageout(), same things will occur as is done for page-cache pages -- it doesn't need to know if it's anon or fs-backed page. These things are - replace corresponding swapper radix tree node to a 'chunk_head' which then points to the location of 'chunks' in ccache storage structure. * Restoring an anon page from ccache When page fault occurs, handle_pte_fault() first checks swap cache if it already has required page which then calls find_get_page() over swapper space radix tree. Here, a compressed page is handled exactly as for page cache page i.e. checked if looked-up page has PG_compressed set, decompress it and make radix node point back to this uncompressed page. * Notes -- This approach deals easily with case where no real swap devices exist. This virt swap is a separate entity with all separate data structures like swap_map[] as for a normal swap device. -- This only requires working with arch independent swp_entry_t instead of arch dependent representation of it in PTEs. -- More code sharing for handling page-cache and swap-cache pages - much of code wouldn't know if its working with anon or fs-backed page like add_to_swap_cache(), find_get_page() (and friends). * About files attached -- cc_control module: This creates two proc entries: /proc/cc_control/{max_anon_cc_size, max_fs_backed_cc_size}. Write to max_anon_cc_size, the size of ccache for anon pages in units of no. of pages. Writing this value calls set_anon_cc_size() in mm/swapfile.c which creates this virtual swap device with size as passed and highest priority (hard-coded to 100). (max_fs_backed_cc_size is dummy for now). -- patch for 2.6.17-rc5 that has this set_anon_cc_size() with some small misc handling. TODO Next: merge work from prev patches (as on CompressedCaching/Code) with this one to get on with anon page ccache. 
(I'll try to update the CompressedCaching page soon for anon page handling.)

PS: Marcelo, I'm unable to send mail to marcelo AT kvack DOT org (I just get a mail delivery failure notice), so I'm sending to mtosatti AT redhat DOT com instead. Anyway, I've added your kvack address too, to see if it gets through this time... :)

Cheers,
Nitin Gupta |
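Reading the mail above, set_anon_cc_size() would presumably carve out a slot in swap_info[] much like sys_swapon() does, only without a backing device. A minimal sketch follows; the swap_info_struct field usage mirrors 2.6-era mm/swapfile.c as far as I recall it, insertion into the priority-ordered swap list and the lowest_bit/highest_bit bookkeeping are omitted, and only SWP_COMPRESSED is taken from the mail -- everything else is guesswork about the real patch:

    /* mm/swapfile.c (sketch only) -- register a device-less "virtual" swap */
    #include <linux/swap.h>
    #include <linux/vmalloc.h>

    int set_anon_cc_size(unsigned long nr_pages)
    {
            struct swap_info_struct *p;
            unsigned short *map;
            int type;

            /* allocate the usage map before taking the lock */
            map = vmalloc(nr_pages * sizeof(unsigned short));
            if (!map)
                    return -ENOMEM;
            memset(map, 0, nr_pages * sizeof(unsigned short));

            spin_lock(&swap_lock);
            for (type = 0; type < MAX_SWAPFILES; type++)    /* find a free slot */
                    if (!(swap_info[type].flags & SWP_USED))
                            break;
            if (type == MAX_SWAPFILES) {
                    spin_unlock(&swap_lock);
                    vfree(map);
                    return -EPERM;
            }

            p = &swap_info[type];
            p->swap_map = map;
            p->flags = SWP_USED | SWP_WRITEOK | SWP_COMPRESSED;
            p->prio  = 100;                 /* highest priority, as in the mail */
            p->max   = nr_pages;
            p->pages = nr_pages;
            nr_swap_pages += nr_pages;
            total_swap_pages += nr_pages;
            spin_unlock(&swap_lock);
            return 0;
    }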
From: Rik v. R. <ri...@re...> - 2006-06-06 01:50:25
|
On Mon, 5 Jun 2006, Nitin Gupta wrote:

> This is a kind of weekly work summary for the ccache work... :)

Awesome. Let me know if you need any advice, etc...

> It creates a virtual swap device (another swap_info_struct in
> swap_info[]). Its size is set through a /proc entry created by the
> 'cc_control' module (attached).

Nice. I wonder if this should/could be handled through flags to swapon, so we can have this device in /etc/fstab? Not an immediate concern though - that would be a cleanup for when things actually work ;)

> * About the attached files
> -- cc_control module: This creates two proc entries:
> /proc/cc_control/{max_anon_cc_size, max_fs_backed_cc_size}.

This should probably be in /proc/sys/vm, which also means you can use the sysctl code and easily read/write ASCII numbers into these files.

--
All Rights Reversed |
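To make Rik's suggestion concrete, the knob could be wired into the existing vm_table[] in kernel/sysctl.c so it shows up as /proc/sys/vm/max_anon_cc_size. A sketch, assuming the 2.6-era ctl_table layout and proc_dointvec() signature; the ctl_name value and the handler name are hypothetical, and newer kernels drop ctl_name and changed the proc_handler prototype:

    static int sysctl_max_anon_cc_size;          /* in pages */

    /* after the usual integer parsing, forward the new value on writes */
    static int max_anon_cc_size_handler(ctl_table *table, int write,
                                        struct file *filp, void __user *buffer,
                                        size_t *lenp, loff_t *ppos)
    {
            int ret = proc_dointvec(table, write, filp, buffer, lenp, ppos);

            if (!ret && write)
                    ret = set_anon_cc_size(sysctl_max_anon_cc_size);
            return ret;
    }

    /* entry to add to vm_table[] in kernel/sysctl.c */
    {
            .ctl_name     = VM_MAX_ANON_CC_SIZE,   /* new, hypothetical id */
            .procname     = "max_anon_cc_size",
            .data         = &sysctl_max_anon_cc_size,
            .maxlen       = sizeof(int),
            .mode         = 0644,
            .proc_handler = &max_anon_cc_size_handler,
    },

With something like this in place, "echo 16384 > /proc/sys/vm/max_anon_cc_size" would do what the current cc_control module's /proc write does.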
From: Anderson B. <bri...@gm...> - 2006-05-29 21:24:36
|
On 5/29/06, Nitin Gupta <nit...@gm...> wrote:
> Hi,
>
> >> > About CC, I and Briglia (and...@gm...
> >> > <mailto:and...@gm...>) are making some tests with the
> >> > Rodrigo Castro version (kernel 2.4) for the ARM platform. We are
> >> > studying a possible port to 2.6 and would like to cooperate with
> >> > the project.
> >>
> >> Great!
> >>
> >> My exams (and other misc things) just ended today... now I can really
> >> jump into compressed caching :)
> >> Are you doing ccaching for "general" desktop systems or some kind of
> >> embedded device?
> >
> > Some kind of embedded device. :) Actually we tested and did some
> > measurements using Rodrigo's patches for kernel 2.4.18. We tried to
> > implement swap compression on the 2.6 kernel, but we stopped when
> > writing to disk presented itself as a problem.
>
> (I assume you are using flash disks for storage.)
> So, have you dropped the idea of swapping to flash disks? (Compressed)
> swapping to flash disks has lots of issues, as far as I have read about it.

We were using an MMC card as the swap device. I believe its behavior is similar to that of an IDE block device, for instance. The problem was related to the bio structure: how do you send more than one page if the bio includes only one page per operation?

Best regards,

Anderson Briglia

> >> Just in case you didn't notice it earlier -- I have made a page,
> >> linux-mm.org/CompressedCaching, giving all the details collected so far
> >> for the ccaching implementation on 2.6.x kernels (CompressedCaching/Code
> >> has the bare-bones implementation created so far). Also, this is where
> >> I'll keep track of the project.
> >
> > Yes, we have been visiting your project's site. Will you use just the
> > site, or should we communicate via this list as well?
>
> The site is just to keep the design details updated. We can always
> communicate through this list (and #linux-mm-cc).
>
> >> I intend to now focus on ccaching w.r.t. OLPC systems. Thus, I'll now
> >> work on handling anon pages first instead of page-cache pages.
> >>
> >> The biggest thing is the compression storage structure -- the heart of
> >> all this work. I've actually received zero feedback on it so far. Since
> >> it involves my semi-random guesswork too, it really needs your help :)
> >> All other work revolves around this thing. In particular, please see
> >> the 'Problems' listed at the end of that page.
> >
> > Ok. We intend to help you and the others design this crucial part.
>
> Thanks!
>
> > How about that chunks approach? We thought you had already decided to
> > use it.
>
> Yes, I've decided to use the chunk approach as described on the
> CompressedCaching page. But still, if you can give your feedback, we can
> always improve it.
>
> >> PS: I've zero idea about ARM systems!
> >
> > Don't worry about this. The kernel core is similar. Of course we have
> > other issues to care about, but the design of ccaching can be used on
> > both systems.
>
> Best Regards,
> Nitin Gupta
|
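On the bio question: as far as I understand the 2.6 block layer, a bio is not limited to a single page -- bio_alloc() takes the number of io vecs, and bio_add_page() can be called repeatedly until the device's limits are hit. A rough sketch; the field names follow the 2.6-era block layer (recent kernels use bio_set_dev(), bi_iter.bi_sector and a one-argument submit_bio()), and the function name, the completion callback and all end_io/error handling are left as assumptions:

    #include <linux/bio.h>

    /* build a single bio carrying several pages and submit it as one write */
    static int swap_out_cluster(struct block_device *bdev, sector_t sector,
                                struct page **pages, int nr_pages,
                                bio_end_io_t *end_io)
    {
            struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
            int i;

            if (!bio)
                    return -ENOMEM;

            bio->bi_bdev   = bdev;
            bio->bi_sector = sector;
            bio->bi_end_io = end_io;

            for (i = 0; i < nr_pages; i++)
                    /* bio_add_page() returns 0 once the queue's segment
                     * limits are reached, so stop adding pages there */
                    if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE)
                            break;

            submit_bio(WRITE, bio);
            return i;               /* number of pages actually queued */
    }

The end_io callback would then unlock/put the pages and call bio_put(); anything not queued here would go into the next bio.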