From: Julian S. <js...@ac...> - 2002-11-25 00:40:12
Spent ages trying to think of good LRU algorithms which don't use reference bits, so they will work with translation chaining. It's very difficult -- many constraints.

Eventually measured the simple strategy of "throw away all translations and try again". With a 16 MB cache (the default size is 32 MB), running kate (69.7 M bbs) under memcheck I get:

  LRU         124469 total translations (1.98 M -> 25.2 M), 46.91 seconds
  forget-all  144095 total translations (2.31 M -> 29.4 M), 48.40 seconds

so perhaps this scheme is good enough. With the cache restored to 32 MB it will look even better. At 32 MB the vast majority of programs never manage to fill it up even once, so for those, this scheme is no loss at all.

J

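The forget-all policy described above amounts to a bump allocator that resets wholesale. A minimal sketch in C, assuming a single fixed-size arena; all names (`tc_alloc`, `TC_SIZE`, etc.) are illustrative, not Valgrind's actual identifiers, and unchaining is elided:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the "forget-all" policy: the translation cache (TC) is a
   single bump-allocated arena.  When a new translation does not fit,
   everything is discarded and allocation restarts at the beginning.
   (In the real system, all blocks must be unchained before the flush.) */

#define TC_SIZE (16 * 1024 * 1024)

static unsigned char tc[TC_SIZE];
static size_t tc_used = 0;
static unsigned long n_flushes = 0;

/* Caller must treat a flush as invalidating every previous pointer. */
unsigned char *tc_alloc(size_t nbytes)
{
    if (nbytes > TC_SIZE) return NULL;   /* can never fit */
    if (tc_used + nbytes > TC_SIZE) {    /* full: forget everything */
        tc_used = 0;
        n_flushes++;
    }
    unsigned char *p = &tc[tc_used];
    tc_used += nbytes;
    return p;
}

unsigned long tc_flush_count(void) { return n_flushes; }
```

The measurements above are essentially counting how many extra translations this reset-everything behaviour forces compared with LRU.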
From: Jeremy F. <je...@go...> - 2002-11-25 19:15:46
On Sun, 2002-11-24 at 16:47, Julian Seward wrote:
> Spent ages trying to think of good LRU algorithms which don't use reference
> bits so they will work with translation chaining. It's very difficult
> -- many constraints.
What are the constraints? The need for reference bits isn't so hard if
you're willing to periodically unchain in order to take a measurement.
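The unchain-to-measure idea can be sketched as follows. This is a hypothetical illustration, not Valgrind code: while blocks are unchained, every block entry funnels through the dispatcher, which can record an emulated reference bit before re-chaining.

```c
#include <assert.h>
#include <stdbool.h>

/* Emulating reference bits by periodic unchaining: clear all bits and
   unpatch all direct jumps; the dispatcher then sees each block's next
   entry, marks it referenced, and re-chains it.  Names are invented. */

#define N_BLOCKS 128

static bool ref_bit[N_BLOCKS];
static bool chained[N_BLOCKS];

/* Start a measurement interval: clear ref bits and force all control
   transfers back through the dispatcher. */
void begin_measurement(void)
{
    for (int i = 0; i < N_BLOCKS; i++) {
        ref_bit[i] = false;
        chained[i] = false;   /* "unchain": direct jumps unpatched */
    }
}

/* The dispatcher sees a block entry only while it is unchained. */
void dispatch(int block)
{
    if (!chained[block]) {
        ref_bit[block] = true;   /* the emulated reference bit */
        chained[block] = true;   /* re-chain; later entries bypass us */
    }
}

bool was_referenced(int block) { return ref_bit[block]; }
```

Blocks whose bit is still clear at the end of the interval are candidates for eviction, at the cost of one global unchain per measurement.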
> Eventually measured the simple strategy of throw away all translations and
> try again.
>
> With a 16 MB cache (the default size is 32 M), running kate (69.7 M bbs)
> I get (running memcheck)
>
> LRU 124469 total translations (1.98 M -> 25.2 M), 46.91 seconds
> forget-all 144095 total translations (2.31 M -> 29.4 M), 48.40 seconds
>
> so perhaps this scheme is good enough. With the cache restored to 32 MB
> it will look even better. At 32M the vast majority of programs never manage
> to fill it up even once, so for those, this scheme is no loss at all.
This looks quite reasonable.
J
From: Julian S. <js...@ac...> - 2002-11-25 23:29:54
> > Spent ages trying to think of good LRU algorithms which don't use
> > reference bits so they will work with translation chaining. It's very
> > difficult -- many constraints.
>
> What are the constraints? The need for reference bits isn't so hard if
> you're willing to periodically unchain in order to take a measurement.

My understanding of this was enhanced by reading "Page Replacement and Reference Bit Emulation in Mach" (Richard P. Draves), at CMU. Operating systems for the VAX are also of particular interest, because apparently the VAX didn't support reference bits, so there is no straightforward way to do LRU.

Anyway. In one sense our problem is simpler, since we don't have to deal with clean vs dirty "pages"; ours are always clean. However, our problem is more constrained in two ways.

Firstly, we cannot just throw away translations individually; they all have to be unchained first to make this safe. We even have to do this if we merely want to move them around a bit. This is a global action which we cannot afford to do too often. We could conceivably keep track of which blocks point to which other blocks and so do incremental unchaining, but that sounds complex.

Second, our "pages" are of varying sizes, and are small. This makes dealing with queues of them awkward, and they are so small that keeping them in linked lists is a significant space overhead.

-------------

Pretty much all the papers I looked at on Sunday suggest that the brain-dead throw-it-all-away approach is about as good as it gets. I did find a detailed presentation (37 slides) about WABI at Hot Chips (?). WABI got pretty sophisticated in the end, and they suggest an improvement (which is similar to VMS's way of doing paging on the VAX): operate the cache as a FIFO. When it gets full, unchain everything, throw away the oldest N% of the translations, and keep going (the brain-dead algorithm is just the special case N=100).

This is simple to do and, according to them, somewhat ameliorates the phenomenon of throwing away translations still in active use. It is simple enough to implement that I might try it.

VMS does FIFO on its pages, but shifted-out pages are not immediately dumped. Instead they are put in a holding-pen style arrangement from which they can be brought back into active use should a reference arise (so-called Second-Chance FIFO). This protects VMS from the worst consequences of the FIFO scheme. We could do this too -- it is awkward but not impossible with variable-length pages -- but I just can't be bothered.

J

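The WABI-style refinement (FIFO, discard only the oldest N%) can be sketched like this. The record layout and names are invented for illustration, unchaining is again elided, and translations are represented only by their sizes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* FIFO cache that, when full, discards only the oldest DISCARD_PCT
   percent of translations instead of all of them; DISCARD_PCT = 100
   reduces to the forget-all scheme.  All names are hypothetical. */

#define CACHE_SIZE  1000
#define MAX_TRANS   256
#define DISCARD_PCT 25            /* throw away the oldest 25% */

typedef struct { size_t size; } Trans;

static Trans  fifo[MAX_TRANS];    /* fifo[0] is the oldest */
static int    n_trans = 0;
static size_t bytes_used = 0;

static void discard_oldest(void)
{
    /* drop ceil(25%) of the entries, at least one */
    int n_drop = (n_trans * DISCARD_PCT + 99) / 100;
    if (n_drop == 0) n_drop = 1;
    for (int i = 0; i < n_drop; i++)
        bytes_used -= fifo[i].size;
    memmove(fifo, fifo + n_drop, (n_trans - n_drop) * sizeof(Trans));
    n_trans -= n_drop;
}

/* Returns 0 on success.  In reality every block would be unchained
   before any translation is discarded. */
int cache_add(size_t size)
{
    if (size > CACHE_SIZE) return -1;
    while (bytes_used + size > CACHE_SIZE || n_trans == MAX_TRANS)
        discard_oldest();
    fifo[n_trans].size = size;
    n_trans++;
    bytes_used += size;
    return 0;
}

size_t cache_bytes_used(void) { return bytes_used; }
int    cache_count(void)      { return n_trans; }
```

Second-Chance FIFO would add a holding pen between `discard_oldest` and actual deletion, from which a referenced translation could be rescued.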
From: Julian S. <js...@ac...> - 2002-11-27 01:02:29
Good news. I've just got a variant of forget-all running -- the variant I mentioned in this thread yesterday, I think. It divides the TC into four chunks, each of size 4 MB (for a total TC of 16 MB, so as to be fair to the previous measurements). When the TC gets full, we throw away the oldest chunk and start filling that up again. This is a crude approximation to FIFO.

This means FIFO can happen without actually moving any translations around. It also means we can dynamically allocate the chunks, so that small programs do not suffer the space overhead of allocating the entire TC when they will most likely never use it. This is a Good Thing.

Anyway, the good news is:

> With a 16 MB cache (the default size is 32 M), running kate (69.7 M bbs)
> I get (running memcheck)
>
> LRU        124469 total translations (1.98 M -> 25.2 M), 46.91 seconds
> forget-all 144095 total translations (2.31 M -> 29.4 M), 48.40 seconds

  forget-all in quarters
             128513 total translations (2.05 M -> 27.3 M), 44.74 seconds

So in terms of translations inadvertently dumped, this is nearly as good as LRU, and a significant improvement on the utterly brain-dead algorithm. In speed terms it's even better. That's probably due to three reasons:

(1) we no longer need to mark the current epoch on each TT entry each time it is used, saving 2 insns/bb and some memory traffic in the dispatch loop [this is of course the whole point of this exercise, so that t-chaining is properly supported];

(2) I took the opportunity to word-align the starts of the translations, which they weren't before (!);

(3) the TT entries have shrunk from 16 bytes to 8, reducing D-cache thrashing.

And it's all a great deal simpler than it was before. I'm pleased. I'll clean it up and commit it, and then merge in the t-chaining stuff. Then the fast-jcc stuff (another clear win). Then look at SYNCEIP.

J

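The chunked scheme admits a very small allocator: advance to the next chunk when the current one fills, and if that chunk still holds old translations they are by construction the oldest, so discard them wholesale. A sketch under invented names (`tcq_alloc` etc.), with sizes scaled down and unchaining elided:

```c
#include <assert.h>
#include <stddef.h>

/* "Forget-all in quarters": the TC is split into N_CHUNKS chunks;
   allocation fills the current chunk, and on wrap-around the oldest
   chunk is discarded wholesale and refilled.  This approximates FIFO
   without moving any translations.  All names are illustrative. */

#define N_CHUNKS   4
#define CHUNK_SIZE (4 * 1024)     /* scaled down from 4 MB */

typedef struct {
    unsigned char buf[CHUNK_SIZE];
    size_t used;
} Chunk;

static Chunk chunks[N_CHUNKS];
static int cur = 0;               /* chunk currently being filled */
static unsigned long n_discards = 0;

unsigned char *tcq_alloc(size_t nbytes)
{
    if (nbytes > CHUNK_SIZE) return NULL;
    if (chunks[cur].used + nbytes > CHUNK_SIZE) {
        /* Advance to the next chunk.  If it holds old translations
           they are the oldest, so (after unchaining everything,
           elided here) throw them all away and refill. */
        cur = (cur + 1) % N_CHUNKS;
        if (chunks[cur].used > 0) n_discards++;
        chunks[cur].used = 0;
    }
    unsigned char *p = &chunks[cur].buf[chunks[cur].used];
    chunks[cur].used += nbytes;
    return p;
}

unsigned long tcq_discard_count(void) { return n_discards; }
```

Note that chunks can be mmap'd lazily on first use, which is what makes the dynamic-allocation point above work for small programs.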
From: Jeremy F. <je...@go...> - 2002-11-27 07:39:48
On Tue, 2002-11-26 at 17:09, Julian Seward wrote:
> Good news. I've just got a variant of forget-all running -- the variant
> I mentioned in this thread yesterday, I think. It divides the TC into four
> chunks, each of size 4M (for a total TC of 16M, so as to be fair to the
> previous measurements). When TC gets full, we throw away the oldest
> chunk and start filling that up. This is a crude approximation to FIFO.
Close enough.
> This means FIFO can happen without actually moving any translations
> around. It also means we can dynamically allocate the chunks, so that
> small programs do not suffer the space overhead of allocating the entire
> TC when they will most likely never use it. This is a Good Thing.
Yep (though allocated unused memory just costs page table mappings
rather than real memory).
> (1) we no longer need to mark the current epoch on each TT entry each time
> it is used, saving 2 insns/bb and some mem traffic in the dispatch loop.
> [this is of course the whole point of this exercise, so that t-chaining
> is properly supported]
> (2) I took the opportunity to word-align the starts of the translations,
> which they weren't before. (!)
Yes, that's a good idea. It may be even better to align to a cache line
boundary (worth measuring, anyway).
> (3) the TT entries have shrunk from 16 bytes to 8, reducing Dcache
> thrashing.
Well, the t-chaining patch adds 4 shorts to each TTE, to keep track of
where the jump sites are.
At least as significant as all of these is the horrible cache effects of
moving code around. Reading the instruction stream is moderately bad
(because you end up with aliasing between I & D caches), and writing
into the instruction stream is really nasty.
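The entry layouts being discussed might look roughly like the following; the field names and exact arrangements are guesses for illustration, not Valgrind's real structures.

```c
#include <assert.h>
#include <stdint.h>

/* Old-style entry: per-use epoch stamping for LRU (16 bytes). */
typedef struct {
    uint32_t orig_addr;   /* guest address of the basic block */
    uint32_t trans_addr;  /* address of the translation in the TC */
    uint32_t last_epoch;  /* stamped on every use -- the LRU cost */
    uint32_t trans_size;
} TTEntryOld;

/* New-style entry: no epoch needed once chunked FIFO replaces LRU
   (8 bytes), halving D-cache traffic on translation-table lookups. */
typedef struct {
    uint32_t orig_addr;
    uint32_t trans_addr;
} TTEntryNew;

/* With t-chaining, four shorts per entry record where the patchable
   jump sites are, so blocks can be unchained later. */
typedef struct {
    uint32_t orig_addr;
    uint32_t trans_addr;
    uint16_t jump_site[4];  /* offsets of patchable jumps, if any */
} TTEntryChained;
```

On these layouts the chained entry is back to 16 bytes, which is Jeremy's point: the 8-byte saving is partly given back once jump-site bookkeeping is added.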
J
From: Nicholas N. <nj...@ca...> - 2002-11-27 15:43:19
On Wed, 27 Nov 2002, Julian Seward wrote:

> Good news. I've just got a variant of forget-all running -- the variant
> I mentioned in this thread yesterday, I think. It divides the TC into four
> chunks, each of size 4M (for a total TC of 16M, so as to be fair to the
> previous measurements). When TC gets full, we throw away the oldest
> chunk and start filling that up. This is a crude approximation to FIFO.

Nice one. Is the splitting into 4 hard-coded -- have you tried e.g. 8? Might make a slight improvement...

N