From: Rob L. <ro...@la...> - 2005-11-03 01:37:16
|
On Wednesday 02 November 2005 02:43, Nick Piggin wrote: > > Hmmm. I don't see at this point. > > Why do you think ZONE_REMOVABLE can satisfy for hugepage. > > At leaset, my ZONE_REMOVABLE patch doesn't any concern about > > fragmentation. > > Well I think it can satisfy hugepage allocations simply because > we can be reasonably sure of being able to free contiguous regions. > Of course it will be memory no longer easily reclaimable, same as > the case for the frag patches. Nor would be name ZONE_REMOVABLE any > longer be the most appropriate! > > But my point is, the basic mechanism is there and is workable. > Hugepages and memory unplug are the two main reasons for IBM to be > pushing this AFAIKS. Who cares what IBM is pushing? I'm interested in fragmentation avoidance for User Mode Linux. I use User Mode Linux to virtualize a system build, and one problem I currently have is that some workloads temporarily use a lot of memory. For example, I can run a complete system build in about 48 megs of ram: except for building GCC. That spikes to a couple hundred megabytes. If I allocate 256 megabytes of memory to UML, that's half the memory on my laptop and UML will just use it for redundant cacheing and such while desktop performance gets a bit unhappy with the build going. UML gets an instance's "physical memory" by allocating a temporary file, mmapping it, and deleting it (which signals to the vfs that flushing this data to backing store should only be done under memory pressure from the rest of the OS, because the file's going away when it's closed so there's no With fragmentation reduction and prezeroing, UML suddenly gains the option of calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of prezeroing, B) a way of giving memory back to the host OS when it's not in use. This has _nothing_ to do with IBM. Or large systems. This is some random developer trying to run a virtualized system build on his laptop. 
(The reason I need to use UML is that I build uClibc with the newest 2.6 kernel headers I can, link apps against it, and then run many of those apps during later stages of the build. If the kernel headers used to build libc are sufficiently newer than the kernel the build is running under, I get segfaults because the new libc tries to use kernel features that aren't there on the host system, but will be in the final system. I also get the ability to mknod/chown/chroot without needing root access on the host system for free...) Rob |
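The "temporary file, mmap it, delete it" approach described above can be sketched in userspace. This is only an illustration in Python, not UML's actual code: the function name `make_backing_memory` and the `physmem-` prefix are made up for the example, and it assumes a Unix host where an open file can be unlinked.

```python
import mmap
import os
import tempfile


def make_backing_memory(size, directory=None):
    """Create `size` bytes of file-backed memory whose file has no name on disk.

    Hypothetical helper for illustration; not taken from UML.
    """
    # mkstemp creates and opens a uniquely named file; the name is only
    # needed long enough to obtain a file descriptor.
    fd, path = tempfile.mkstemp(prefix="physmem-", dir=directory)
    try:
        os.unlink(path)         # delete the name; the open fd keeps the data alive
        os.ftruncate(fd, size)  # give the now nameless file its full length
        # Shared, read/write mapping (the defaults on Unix).
        return mmap.mmap(fd, size)
    finally:
        # The mapping stays valid after our descriptor is closed.
        os.close(fd)


mem = make_backing_memory(4 * mmap.PAGESIZE)
mem[:5] = b"hello"
print(mem[:5])
```

Pointing `directory` at a tmpfs mount is what keeps the contents out of any on-disk filesystem; with an ordinary disk-backed directory the pages are still file data that the host may eventually write out.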
From: Jeff D. <jd...@ad...> - 2005-11-03 04:34:42
|
On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote: > With fragmentation reduction and prezeroing, UML suddenly gains the option of > calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of > prezeroing, B) a way of giving memory back to the host OS when it's not in > use. DONT_NEED is insufficient. It doesn't discard the data in dirty file-backed pages. Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE)) which does do the trick, and I have a UML patch which adds memory hotplug. This combination does free memory back to the host. Jeff |
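Jeff's point can be checked from userspace with a small experiment. The sketch below is written in Python for brevity and makes several assumptions: Python 3.8 or later (for `mmap.madvise`), a Linux host, and hypothetical helper names. The second function only works where the kernel and the filesystem holding the temporary file support MADV_REMOVE, so it returns None otherwise.

```python
import mmap
import os
import tempfile


def dontneed_keeps_file_data(size=mmap.PAGESIZE):
    """Dirty a shared file mapping, call MADV_DONTNEED, report whether data survived."""
    fd, path = tempfile.mkstemp(prefix="dontneed-")
    try:
        os.unlink(path)
        os.ftruncate(fd, size)
        with mmap.mmap(fd, size) as m:     # MAP_SHARED by default on Unix
            m[:4] = b"data"                # dirties the first page
            m.madvise(mmap.MADV_DONTNEED)  # drops our page-table entries only
            return m[:4] == b"data"        # faulted back in from the page cache
    finally:
        os.close(fd)


def try_madv_remove(size=mmap.PAGESIZE):
    """Return the bytes read back after MADV_REMOVE, or None if unsupported here."""
    if not hasattr(mmap, "MADV_REMOVE"):
        return None
    fd, path = tempfile.mkstemp(prefix="remove-")
    try:
        os.unlink(path)
        os.ftruncate(fd, size)
        with mmap.mmap(fd, size) as m:
            m[:4] = b"data"
            try:
                m.madvise(mmap.MADV_REMOVE)  # ask the kernel to punch a hole
            except OSError:
                return None                  # this filesystem cannot do it
            return m[:4]
    finally:
        os.close(fd)


print("data survives MADV_DONTNEED:", dontneed_keeps_file_data())
print("after MADV_REMOVE:", try_madv_remove())
```

The first result shows why DONT_NEED alone gives nothing back to the host for a file-backed mapping: the dirty pages stay in the page cache (or in tmpfs) and simply get mapped again on the next access.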
From: Rob L. <ro...@la...> - 2005-11-03 05:42:29
|
On Wednesday 02 November 2005 23:26, Jeff Dike wrote: > On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote: > > With fragmentation reduction and prezeroing, UML suddenly gains the > > option of calling madvise(DONT_NEED) on sufficiently large blocks as A) a > > fast way of prezeroing, B) a way of giving memory back to the host OS > > when it's not in use. > > DONT_NEED is insufficient. It doesn't discard the data in dirty > file-backed pages. I thought DONT_NEED would discard the page cache, and punch was only needed to free up the disk space. I was hoping that since the file was deleted from disk and is already getting _some_ special treatment (since it's a longstanding "poor man's shared memory" hack), that madvise wouldn't flush the data to disk, but would just zero it out. A bit optimistic on my part, I know. :) > Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE)) > which does do the trick, and I have a UML patch which adds memory > hotplug. This combination does free memory back to the host. I saw it wander by, and am all for it. If it goes in, it's obviously the right thing to use. You may remember I asked about this two years ago: http://seclists.org/lists/linux-kernel/2003/Dec/0919.html And a reply indicated that SVr4 had it, but we don't. I assume the "naming discussion" mentioned in the recent thread already scrubbed through this old thread to determine that the SVr4 API was icky. http://seclists.org/lists/linux-kernel/2003/Dec/0955.html > Jeff Rob |
From: Blaisorblade <bla...@ya...> - 2005-11-04 03:21:44
|
On Thursday 03 November 2005 06:41, Rob Landley wrote: > On Wednesday 02 November 2005 23:26, Jeff Dike wrote: > > On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote: > > > With fragmentation reduction and prezeroing, UML suddenly gains the > > > option of calling madvise(DONT_NEED) on sufficiently large blocks as A) > > > a fast way of prezeroing, B) a way of giving memory back to the host OS > > > when it's not in use. > > DONT_NEED is insufficient. It doesn't discard the data in dirty > > file-backed pages. > I thought DONT_NEED would discard the page cache, and punch was only needed > to free up the disk space. This is correct, but... > I was hoping that since the file was deleted from disk and is already > getting _some_ special treatment (since it's a longstanding "poor man's > shared memory" hack), that madvise wouldn't flush the data to disk, but > would just zero it out. A bit optimistic on my part, I know. :) I read at some point that this optimization existed but was deemed obsolete and removed. Why obsolete? Because... we have tmpfs! And that's the point. With DONTNEED, we detach references from page tables, but the content is still pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?) > > Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE)) > > which does do the trick, and I have a UML patch which adds memory > > hotplug. This combination does free memory back to the host. > I saw it wander by, and am all for it. If it goes in, it's obviously the > right thing to use. Btw, on this side of the picture, I think fragmentation avoidance is not needed for that. I guess you refer to using frag. avoidance on the guest (if it matters for the host, let me know). Once it is available, using it will be nice, but currently we'd do madvise() on a page-by-page basis, and we'd do it on non-consecutive pages (basically, free pages we either find or free on purpose). 
> You may remember I asked about this two years ago: > http://seclists.org/lists/linux-kernel/2003/Dec/0919.html > And a reply indicated that SVr4 had it, but we don't. I assume the "naming > discussion" mentioned in the recent thread already scrubbed through this > old thread to determine that the SVr4 API was icky. > http://seclists.org/lists/linux-kernel/2003/Dec/0955.html I assume not everybody did (even if somebody pointed out the existence of the SVr4 API), but there was a need, in at least one use case, for a virtual-address-based API rather than a file-offset-based one like the SVr4 one: that user would have to implement backward mapping in userspace just for this purpose, while we already have it in the kernel. Anyway, the sys_punch() API will follow later - customers mainly need madvise() for now. -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Mail: 1GB free for messages, and 10MB attachments http://mail.yahoo.it |
From: Blaisorblade <bla...@ya...> - 2005-11-04 17:19:23
|
(Note - I've removed a few CC's since there are too many of us, sorry for any inconvenience). On Friday 04 November 2005 16:50, Rob Landley wrote: > On Thursday 03 November 2005 21:26, Blaisorblade wrote: > > > I was hoping that since the file was deleted from disk and is already > > > getting _some_ special treatment (since it's a longstanding "poor man's > > > shared memory" hack), that madvise wouldn't flush the data to disk, but > > > would just zero it out. A bit optimistic on my part, I know. :) > > > > I read at some time that this optimization existed but was deemed > > obsolete and removed. > > > > Why obsolete? Because... we have tmpfs! And that's the point. With > > DONTNEED, we detach references from page tables, but the content is still > > pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?) > > If I had that kind of control over environment my build would always be > deployed in (including root access), I wouldn't need UML. :) Yep, right for your case... however, currently the majority of users use tmpfs (I hope, for their sake)... > > I guess you refer to using frag. avoidance on the guest > > Yes. Moot point since Linus doesn't want it. See the last lwn.net issue (when it becomes available) on this. In short, however, the real point is that we need this kind of support. > Might be a performance issue if that gets introduced with per-page > granularity, I'm aware of this possibility, and in fact I said "Frag. avoidance will be nice to use". However, I'm not sure that the system call overhead is so big compared to flushing the TLB entries... But for now we don't have the issue - you don't do hotunplug frequently. When somebody writes the auto-hotunplug management daemon we could have a problem with this... > and how do you avoid giving back pages we're about to re-use? 
Jeff's trick is to call the buddy allocator (__get_free_pages()) to get a full page (and it will do any needed work to free memory), so nobody else will use it, and then madvise() it. If a better API exists, that will be used. > Oh well, bench it when it happens. (And in any case, it needs a tunable to > beat the page cache into submission or there's no free memory to give back. I couldn't parse your sentence. The allocation will free memory just as happens whenever memory is needed. However, look at /proc/sys/vm/swappiness or use Con Kolivas's patches to find new tunables and policies. > If there's already such a tuneable, I haven't found it yet.) -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade |
From: Rob L. <ro...@la...> - 2005-11-04 17:44:53
|
On Friday 04 November 2005 11:18, Blaisorblade wrote: > > Oh well, bench it when it happens. (And in any case, it needs a tunable > > to beat the page cache into submission or there's no free memory to give > > back. > > I couldn't parse your sentence. The allocation will free memory like when > memory is needed. If you've got a daemon running in the virtual system to hand back memory to the host, then you don't need a tuneable. What I was thinking is that if we get prezeroing infrastructure that can use various prezeroing accelerators (as has been discussed but I don't believe merged), then a logical prezeroing accelerator for UML would be calling madvise on the host system. This has the advantage of automatically giving back to the host system any memory that's not in use, but would require some way to tell kswapd or some such that keeping around lots of prezeroed memory is preferable to keeping around lots of page cache. In my case, I have a workload that can mostly work with 32-48 megs of ram, but it spikes up to 256 at one point. Right now, I'm telling UML mem=64 megs and then feeding it a 256 meg swap file on ubd, but this is hideously inefficient when it actually tries to use this swap file. (And since the host system is running a 2.6.10 kernel, there's a five minute period during each build where things on my desktop actually freeze for 15-30 seconds at a time. And this is on a laptop with 512 megs of ram. I think it's because the disk is so overwhelmed, and some things (like vim's .swp file, and something similar in kmail's composer) do a gratuitous fsync...) > However look at /proc/sys/vm/swappiness Setting swappiness to 0 triggers the OOM killer on 2.6.14 for a load that completes with swappiness at 60. I mentioned this on the list a little while ago and some people asked for copies of my test script... > or use Con Kolivas's patches to find new tunable and policies. 
The daemon you mentioned is an alternative, but I'm not quite sure how rapid the daemon's reaction is going to be to potential OOM situations when something suddenly wants an extra 200 megs... Rob |
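For anyone wanting to check the tunable under discussion, the current value can be read straight from procfs. A minimal sketch in Python follows; the function name is made up for the example, and it returns None on systems where the file is missing or unreadable. Changing the value means writing to the same file as root, which the sketch deliberately does not do.

```python
from pathlib import Path

SWAPPINESS = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS):
    """Return the current vm.swappiness value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


print("vm.swappiness =", read_swappiness())
```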
From: Blaisorblade <bla...@ya...> - 2005-11-04 19:12:02
|
(Ok, now the thing is UML-only). On Friday 04 November 2005 18:44, Rob Landley wrote: > On Friday 04 November 2005 11:18, Blaisorblade wrote: > > > Oh well, bench it when it happens. (And in any case, it needs a > > > tunable to beat the page cache into submission or there's no free > > > memory to give back. > > > > I couldn't parse your sentence. The allocation will free memory like when > > memory is needed. > > If you've got a daemon running in the virtual system to hand back memory to > the host, then you don't need a tuneable. I think Jeff's idea was a daemon running on the host (not as root) to manage splitting of memory between UMLs (and possibly the host). > What I was thinking is that if we get prezeroing infrastructure that can > use various prezeroing accelerators (as has been discussed but I don't > believe merged), then a logical prezeroing accelerator for UML would be > calling madvise on the host system. This has the advantage of > automatically giving back to the host system any memory that's not in use, > but would require some way to tell kswapd or some such that keeping around > lots of prezeroed memory is preferable to keeping around lots of page > cache. Ah, ok, I see, but a tuneable to say this is almost useless for anything else I guess, so it won't even get coded. > In my case, I have a workload that can mostly work with 32-48 megs of ram, > but it spikes up to 256 at one point. Yes, I remember the complete build (with GCC needing lots of memory). > Right now, I'm telling UML mem=64 > megs and the feeding it a 256 swap file on ubd, but this is hideously > inefficient when it actually tries to use this swap file. (And since the > host system is running a 2.6.10 kernel, there's a five minute period during > each build where things on my desktop actually freeze for 15-30 seconds at > a time. And this is on a laptop with 512 megs of ram. 
> I think it's > because the disk is so overwhelmed, and some things (like vim's .swp file, > and something similar in kmail's composer) do a gratuitous fsync... Yep, that's possible (running Gentoo, I often go to loads like 8-10, including a CPU-hog in the background, and things become a bit slow). However, I feel that really it's the simple "fork" which slows down to a crawl (and given that memory allocation will easily sleep waiting for some memory to be freed - i.e. to be freed or synced to disk, that's reasonable). And, btw, Frag. Avoidance would help for that too... > > However look at /proc/sys/vm/swappiness > Setting swappiness to 0 triggers the OOM killer on 2.6.14 for a load that > completes with swappiness at 60. Yep, I see - it becomes so reluctant to swap that it prefers killing. Unintended, but at least a reasonable bug... > I mentioned this on the list a little > while ago and some people asked for copies of my test script... > > or use Con Kolivas's patches to find new tunable and policies. > The daemon you mentioned is an alternative, but I'm not quite sure how > rapid the daemon's reaction is going to be to potential OOM situations when > something suddenly wants an extra 200 megs... The daemon will have to be designed and written, so we'll see... and we _could_ add a pre-OOM hook (it would be meaningful for Xen and any other virtualization tool)... to trigger a mconsole notification on the host and wait for any response from the daemon... At that point I become curious about "how much should the daemon give to the guest", and that would be configurable policy... but the policy file (which I already guess will be more complex than the daemon itself) would like some way to gather "how much memory does it need" information. Jeff and I have already started discussing on IRC some ideas for estimating past usage, but predicting future usage is more difficult. 
It's still possible to calculate the rate of new allocations, but not to know what's happening inside... the only possibility I see is to allow the notification to include the amount of needed memory (you can already do "echo something nice > /proc/notify", we now only need a client). But this allows untrusted users to DoS the host. Not fully though, since you can never hotplug memory which wasn't hot-unplugged first - i.e. you would boot your UML with mem=256m and then immediately hot-unplug most of it. -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade |
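The bound on the DoS described above (a guest can only hotplug back what it hot-unplugged earlier, so it never exceeds its boot-time mem= value) can be expressed as a toy accounting model. The class below is purely hypothetical Python written for illustration; it is not UML code and the names are invented.

```python
class GuestMemory:
    """Toy model (hypothetical, not UML code) of hot-unplug/hotplug accounting."""

    def __init__(self, boot_mb):
        self.boot_mb = boot_mb   # the mem= value: a hard ceiling
        self.unplugged_mb = 0    # memory currently handed back to the host

    @property
    def online_mb(self):
        return self.boot_mb - self.unplugged_mb

    def hot_unplug(self, mb):
        """Give up to `mb` megabytes back to the host; return the amount released."""
        released = max(0, min(mb, self.online_mb))
        self.unplugged_mb += released
        return released

    def hot_plug(self, mb):
        """Reclaim up to `mb` megabytes; only previously unplugged memory counts."""
        granted = max(0, min(mb, self.unplugged_mb))
        self.unplugged_mb -= granted
        return granted


guest = GuestMemory(boot_mb=256)
guest.hot_unplug(192)        # boot big, then immediately shrink to 64 MB
print(guest.online_mb)       # 64
print(guest.hot_plug(1000))  # 192: the request is clamped to what was unplugged
print(guest.online_mb)       # 256: back at the boot-time ceiling, never above it
```

However greedy the request in the notification, the worst case for the host is the guest returning to the size it was booted with.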
From: Rob L. <ro...@la...> - 2005-11-04 20:41:35
|
On Friday 04 November 2005 13:10, Blaisorblade wrote: > > If you've got a daemon running in the virtual system to hand back memory > > to the host, then you don't need a tuneable. > > I think Jeff's idea was a daemon running on the host (not as root) to > manage splitting of memory between UMLs (and possibly the host). That's more configuration on the host that's not really needed. Doesn't do my case any good. > > What I was thinking is that if we get prezeroing infrastructure that can > > use various prezeroing accelerators (as has been discussed but I don't > > believe merged), then a logical prezeroing accelerator for UML would be > > calling madvise on the host system. This has the advantage of > > automatically giving back to the host system any memory that's not in > > use, but would require some way to tell kswapd or some such that keeping > > around lots of prezeroed memory is preferable to keeping around lots of > > page cache. > > Ah, ok, I see, but a tuneable to say this is almost useless for anything > else I guess, so it won't even get coded. If we get prezeroing, the tunable is useful. If we haven't got prezeroing, this infrastructure probably won't get in. > > I think it's > > because the disk is so overwhelmed, and some things (like vim's .swp > > file, and something similar in kmail's composer) do a gratuitous fsync... > > Yep, that's possible (running Gentoo, I often go to loads like 8-10, > including a CPU-hog in the background, and things become a bit slow). It's not load for me, it's disk bandwidth. Every time it writes to the swap UBD, that data is scheduled for write-out. So if it's thrashing the swap file, even though it's reading the data back in fairly quickly the data still gets written out to disk, again and again, each time it's touched. Result: the disk I/O becomes a bottleneck and the disk is _PEGGED_ as long as the swap storm continues. Not try to have anything else on the system that wants to access the disk. 
Yes, reads are prioritized but anything that does fsync and waits for a write (as vi and kmail's composer do, or anything _else_ that wants to swap out a page to free up some memory) winds up waiting somewhere around 15 seconds. (Yeah, 20 gigs/second on linear reads. Not quite so much on an endless series of small random seeks.) > However, I feel that really it's the simple "fork" which slows down like a > crawl (and given that memory allocation will easily sleep waiting for some > memory to be freed - i.e. to be freed or synced to disk, that's > reasonable). You don't have to fork to block in this swap storm. The fact my old ubuntu's on a 2.6.10 kernel might have something to do with it, though... > And, btw, Frag. Avoidance would help for that too... If it ever goes in... > > > However look at /proc/sys/vm/swappiness > > > > Setting swappiness to 0 triggers the OOM killer on 2.6.14 for a load that > > completes with swappiness at 60. > > Yep, I see - it becomes so reluctant to swapping that it prefers killing. > Unintended, but at least a reasonable bug... Triggering the OOM killer when you have _any_ writes pending is silly. Wait and memory will free up. And yet we do this all the time. We trigger the OOM killer when there's still swap space, too. I thought the point of a swap block device instead of a swap file was that there were no memory allocations needed to flush out memory. The oom killer is theoretically for when waiting won't help. But the implementation doesn't seem to match that... > > I mentioned this on the list a little > > while ago and some people asked for copies of my test script... > > > > > or use Con Kolivas's patches to find new tunable and policies. > > > > The daemon you mentioned is an alternative, but I'm not quite sure how > > rapid the daemon's reaction is going to be to potential OOM situations > > when something suddenly wants an extra 200 megs... > > The daemon will have to be designed and written, so we'll see... 
and we > _could_ add a pre-OOM hook (it would be meaningful for Xen and any other > virtualization tool)... to trigger a mconsole notification on the host and > wait for any response from the daemon... > > At that point I become curious for "how much should the daemon give to the > guest", and that would be policy configurable... but the policy file (which > I already guess will be more complex than the daemon itself) would like > some way to gather "how memory it needs" informations. > > We already started discussing on IRC with Jeff some ideas for estimating > the past usage, but predicting the future one is more difficult. > > It's still possible to calculate the speed of new allocations, but not to > now what's happening inside... the only possibility I see is to allow the > notification to include the amount of needed memory (you can already do > "echo something nice > /proc/notify", we now only need a client). > > But this allows DoSing the host with untrusted users. Not fully though, > since you can never hotplug memory which wasn't hot-unplugged first - i.e. > you would boot your UML with mem=256m and then immediately hot-unplug the > most of it. In theory, UML memory isn't all that much different from allocating normal user memory. So a DOS shouldn't enter into it. I like the idea that a UML instance can figure out when it has extra memory it's not using, and hand it back to the host. And for a UML to maintain any significant quantity of page cache that isn't tmpfs or ramfs is probably a bad idea (with the possible exception of NFS mounts). With hostfs or UBD, the host should have it cached so it's just two copies of memory, and fetching it again is easy. (Dentries I can see caching lots of.) Hence a UML instance could indeed (in theory) have lots of actually free memory (aggressively reclaim page cache) that it could madvise back to the host. And if it can do that, why does it need the daemon? Rob |
From: Rob L. <ro...@la...> - 2005-11-04 20:57:19
|
On Friday 04 November 2005 14:41, Rob Landley wrote: > seconds. (Yeah, 20 gigs/second on linear reads. Not quite so much on an Megs, obviously. Not _that_ cool of a laptop... Rob |
From: Blaisorblade <bla...@ya...> - 2005-11-04 23:37:49
|
Jeff, some input for you - can you have a look? On Friday 04 November 2005 21:41, Rob Landley wrote: > On Friday 04 November 2005 13:10, Blaisorblade wrote: > > > If you've got a daemon running in the virtual system to hand back > > > memory to the host, then you don't need a tuneable. > > > > I think Jeff's idea was a daemon running on the host (not as root) to > > manage splitting of memory between UMLs (and possibly the host). > > That's more configuration on the host that's not really needed. Doesn't do > my case any good. We'll consider your case too then... for your job the daemon could be done by a thread started by UML on the host. The idea (still very preliminary) was conceived for hosting providers, to my knowledge. However, see below - we can directly madvise() a page when it's freed. In this case, we'd need the guest to keep some free memory - and this can be done via one of Con Kolivas's VM patches on the guest. > > > What I was thinking is that if we get prezeroing infrastructure that > > > can use various prezeroing accelerators (as has been discussed but I > > > don't believe merged), then a logical prezeroing accelerator for UML > > > would be calling madvise on the host system. This has the advantage of > > > automatically giving back to the host system any memory that's not in > > > use, but would require some way to tell kswapd or some such that > > > keeping around lots of prezeroed memory is preferable to keeping around > > > lots of page cache. > > > > Ah, ok, I see, but a tuneable to say this is almost useless for anything > > else I guess, so it won't even get coded. > If we get prezeroing, the tunable is useful. If we haven't got prezeroing, > this infrastructure probably won't get in. Hmm.... yep, prezeroing is useless if you don't keep some prezeroed memory, right.... I answered in too much of a hurry. Btw, indeed I previously planned to use the existing arch_free_pages() hook for page freeing, to call madvise() (conditionally I mean)... 
actually yes, you can't make sure that page isn't going to be reused, but if the page is _freed_ and you still want the content kept, you will lose _anyway_. The biggest risk is to madvise() a page uselessly, and that hurts performance a bit, except that in general we should win by letting the host use more memory. > > > I think it's > > > because the disk is so overwhelmed, and some things (like vim's .swp > > > file, and something similar in kmail's composer) do a gratuitous > > > fsync... > > > > Yep, that's possible (running Gentoo, I often go to loads like 8-10, > > including a CPU-hog in the background, and things become a bit slow). > It's not load for me, it's disk bandwidth. Every time it writes to the > swap UBD, that data is scheduled for write-out. So if it's thrashing the > swap file, even though it's reading the data back in fairly quickly the > data still gets written out to disk, again and again, each time it's > touched. Result: the disk I/O becomes a bottleneck and the disk is > _PEGGED_ as long as the swap storm continues. > Not try to have anything else on the system that wants to access the disk. > Yes, reads are prioritized but anything that does fsync and waits for a > write (as vi and kmail's composer do, or anything _else_ that wants to swap > out a page to free up some memory) winds up waiting somewhere around 15 > seconds. > (Yeah, 20 gigs/second on linear reads. Not quite so much on an > endless series of small random seeks.) :: LOL :: If you have such a laptop, then I'm writing this email via Kmail on a Windows client via Putty and Cygwin/X from my laptop. Ah, sorry, I'm indeed doing this... > > However, I feel that really it's the simple "fork" which slows down like > > a crawl (and given that memory allocation will easily sleep waiting for > > some memory to be freed - i.e. to be freed or synced to disk, that's > > reasonable). > You don't have to fork to block in this swap storm. 
Surely, but at the console you really feel ls or free taking ages ... > The fact my old > ubuntu's on a 2.6.10 kernel might have something to do with it, though... > > Yep, I see - it becomes so reluctant to swapping that it prefers killing. > > Unintended, but at least a reasonable bug... > Triggering the OOM killer when you have _any_ writes pending is silly. Hey, I said "reasonable bug", but don't forget "bug" in the sentence. It's reasonable as opposed to "busybox install doesn't work in UML", which is an unreasonable bug. > Wait and memory will free up. And yet we do this all the time. We trigger > the OOM killer when there's still swap space, too. I thought the point of > a swap block device instead of a swap file was that there were no memory > allocations needed to flush out memory. > > The oom killer is theoretically for when waiting won't help. But the > implementation doesn't seem to match that... > > > The daemon you mentioned is an alternative, but I'm not quite sure how > > > rapid the daemon's reaction is going to be to potential OOM situations > > > when something suddenly wants an extra 200 megs... > > > > The daemon will have to be designed and written, so we'll see... and we > > _could_ add a pre-OOM hook (it would be meaningful for Xen and any other > > virtualization tool)... to trigger a mconsole notification on the host > > and wait for any response from the daemon... > > > > At that point I become curious for "how much should the daemon give to > > the guest", and that would be policy configurable... but the policy file > > (which I already guess will be more complex than the daemon itself) would > > like some way to gather "how memory it needs" informations. > > > > We already started discussing on IRC with Jeff some ideas for estimating > > the past usage, but predicting the future one is more difficult. > > > > It's still possible to calculate the speed of new allocations, but not to > > now what's happening inside... 
the only possibility I see is to allow the > > notification to include the amount of needed memory (you can already do > > "echo something nice > /proc/notify", we now only need a client). > > > > But this allows DoSing the host with untrusted users. Not fully though, > > since you can never hotplug memory which wasn't hot-unplugged first - > > i.e. you would boot your UML with mem=256m and then immediately > > hot-unplug the most of it. > > In theory, UML memory isn't all that much different from allocating normal > user memory. So a DOS shouldn't enter into it. > > I like the idea that a UML instance can figure out when it has extra memory > it's not using, and hand it back to the host. And for a UML to maintain > any significant quantity of page cache that isn't tmpfs or ramfs is > probably a bad idea (with the possible exception of NFS mounts). With > hostfs or UBD, the host should have it cached to it's just two copies of > memory, and fetching it again is easy. (Dentries I can see cacheing lots > of.) > > Hence a UML instance could indeed (in theory) have lots of actually free > memory (aggressively reclaim page cache) that it could madvise back to the > host. And if it can do that, why does it need the deamon? > > Rob -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB http://mail.yahoo.it |
From: Rob L. <ro...@la...> - 2005-11-05 01:46:05
|
On Friday 04 November 2005 17:42, Blaisorblade wrote: > > That's more configuration on the host that's not really needed. Doesn't > > do my case any good. > > We'll consider your case too then... for your job the daemon could be done > by a thread started by UML on the host. The idea (still very preliminar) > was conceived for hosting providers, to my knowlegde. If I don't have to run it, I don't mind. However, keep in mind I get runaway UML threads left after UML theoretically exits on a fairly regular basis. (Haven't banged on 2.6.14 so much, dunno if it still does that.) Due to the fact the process's name has been replaced with random memory, I can't "killall" the bastards, I have to manually go in and track them down by hand, which sucks and can't be scripted. Sometimes they're leftover stopped processes that I have to kill -CONT after I kill them to make them go away. (Personally, I consider that a kernel bug. kill -9 will kill anything, except a stopped process...) > > If we get prezeroing, the tunable is useful. If we haven't got > > prezeroing, this infrastructure probably won't get in. > > Hmm.... yep, prezeroing is useless if you don't keep some prezeroed memory, > right.... I had answered too much in a hurry. > > Btw, indeed I previously planned to use the existing arch_free_pages() hook > for page freeing, to call madvise() (conditionally I mean)... actually yes, > you can't make sure that page isn't going to be reused, but if the page is > _freed_ and you want still the content kept you will _anyway_ loose. > > The biggest risk is to madvise() a page uselessly, and that disturbs a bit > performance, except that in general we should win by letting the host use > more memory. Another advantage of prezeroing: it maps really well to what we're trying to do here. It makes a class of not just free but zeroed memory (and madvise zeroes the memory for us). 
The decision of how _much_ memory to keep away from the page cache already has to be dealt with (and if it's tuneable we can reap the page cache viciously in UML). In theory we'd want to keep some "free" memory back from being zeroed so we don't thrash with the host OS, but that might not be too hard.

Without prezeroing, it's more work. Not quite sure how much more because the vm has changed a lot since I last believed I understood it...

> a Windows client via Putty and Cygwin/X from my laptop. Ah, sorry, I'm indeed doing this...

Back under OS/2 I once ran Win/OS2, brought up its DOS box, and ran a Commodore 64 emulator in it. Because I could. (Don't remember if I managed to shoehorn DESQview in there, but if not, it wasn't for lack of trying.)

> > The fact my old ubuntu's on a 2.6.10 kernel might have something to do with it, though...

> > > Yep, I see - it becomes so reluctant to swap that it prefers killing. Unintended, but at least a reasonable bug...

> > Triggering the OOM killer when you have _any_ writes pending is silly.

> Hey, I said "reasonable bug", but don't forget "bug" in the sentence. It's reasonable as opposed to "busybox install doesn't work in UML", which is an unreasonable bug.

Busybox install does work in UML, though. I've done it. :)

Rob
|
From: Jeff D. <jd...@ad...> - 2005-11-05 04:35:54
|
On Fri, Nov 04, 2005 at 07:45:53PM -0600, Rob Landley wrote:
> Another advantage of prezeroing: it maps really well to what we're trying to do here. It makes a class of not just free but zeroed memory (and madvise zeroes the memory for us).

Yeah, we would need to keep track of zeroed pages somehow. This isn't currently done, and it would require some generic kernel hooks. And it would require them in some delicate places.

If there were some sort of background zeroing and tracking of zeroed pages already there, it would be a lot easier for UML. Without that, I'm not sure how to take advantage of the host's zeroing.

Jeff
|
From: Jeff D. <jd...@ad...> - 2005-11-05 04:53:13
|
On Fri, Nov 04, 2005 at 02:41:11PM -0600, Rob Landley wrote:
> On Friday 04 November 2005 13:10, Blaisorblade wrote:
> > > What I was thinking is that if we get prezeroing infrastructure that can use various prezeroing accelerators (as has been discussed but I don't believe merged), then a logical prezeroing accelerator for UML would be calling madvise on the host system. This has the advantage of automatically giving back to the host system any memory that's not in use, but would require some way to tell kswapd or some such that keeping around lots of prezeroed memory is preferable to keeping around lots of page cache.

> > Ah, ok, I see, but a tuneable to say this is almost useless for anything else I guess, so it won't even get coded.

> If we get prezeroing, the tunable is useful. If we haven't got prezeroing, this infrastructure probably won't get in.

I'm not really convinced that prezeroing would be that useful, particularly through madvise. The reason is that the normal case for a system is that it has no free memory because it's caching anything that might be useful. The one case I can think of where you all of a sudden have a lot of free memory that might not be used for a while is when a large process exits, and you get a lot of freed data, page tables, etc. Then, we could possibly madvise that and stick it on a zeroed pages list. Forgetting about the extra infrastructure needed to implement it, even that would be under constant threat. Witness the occasional proposals to do pre-swapping - swapping in stuff before it's needed when you have some free memory for it.

Looking at it another way, what this would basically be doing is moving page zeroing from userspace to kernel space, which is counter to the direction that things generally go.

> It's not load for me, it's disk bandwidth. Every time it writes to the swap UBD, that data is scheduled for write-out.
> So if it's thrashing the swap file, even though it's reading the data back in fairly quickly the data still gets written out to disk, again and again, each time it's touched. Result: the disk I/O becomes a bottleneck and the disk is _PEGGED_ as long as the swap storm continues.

Do you understand exactly what's happening here? Because I don't, and I wish someone could explain it. UML shouldn't be able to bog down the host like that. Its one-request-at-a-time pseudo-AIO shouldn't make that much IO happen that suddenly. There are other things that do IO for a living (kernel builds, updatedb) and they don't seem to bog down the system like this.

Jeff
|
From: Rob L. <ro...@la...> - 2005-11-05 23:19:42
|
On Friday 04 November 2005 23:45, Jeff Dike wrote:
> > If we get prezeroing, the tunable is useful. If we haven't got prezeroing, this infrastructure probably won't get in.

> I'm not really convinced that prezeroing would be that useful, particularly through madvise. The reason is that the normal case for a system is that it has no free memory because it's caching anything that might be useful. The one case I can think of where you all of a sudden have a lot of free memory that might not be used for a while is when a large process exits, and you get a lot of freed data, page tables, etc. Then, we could possibly madvise that and stick it on a zeroed pages list. Forgetting about the extra infrastructure needed to implement it, even that would be under constant threat. Witness the occasional proposals to do pre-swapping - swapping in stuff before it's needed when you have some free memory for it.

For the specific case of User Mode Linux, the extra caching is largely a waste of time. For UBD or hostfs, the host OS has the data cached so we're spending twice as much memory as necessary in hopes of avoiding a syscall and copy.

A proposal floated by on Linux-Kernel yesterday for a zone for hugepages, something the kernel could only put anonymous pages and pagecache in. If we had the option to keep page cache out of it as well, then we could specify at boot time, "I'm giving this UML instance mem=256 but I only want the default 32M of that to be used by anything but anonymous pages. When those are free, it's fine for them to be free, I _want_ them to be free. The host can put that to good use in service of Konqueror and Kmail."

This would be useful to me. If the sucker reaps all the page cache and dentries and _still_ runs out of memory for a kernel allocation, then yes I've misconfigured it. But it had better reap all the page cache and dentries first...
> Looking at it another way, what this would basically be doing is moving page zeroing from userspace to kernel space, which is counter to the direction that things generally go.

Page zeroing is currently done in userspace?

Juggling memory is something that userspace has traditionally deeply sucked at. Having to page in a daemon to make decisions in a low memory situation is unlikely to improve matters.

> > It's not load for me, it's disk bandwidth. Every time it writes to the swap UBD, that data is scheduled for write-out. So if it's thrashing the swap file, even though it's reading the data back in fairly quickly the data still gets written out to disk, again and again, each time it's touched. Result: the disk I/O becomes a bottleneck and the disk is _PEGGED_ as long as the swap storm continues.

> Do you understand exactly what's happening here? Because I don't, and I wish someone could explain it. UML shouldn't be able to bog down the host like that. Its one-request-at-a-time pseudo-AIO shouldn't make that much IO happen that suddenly. There are other things that do IO for a living (kernel builds, updatedb) and they don't seem to bog down the system like this.

Just some educated guesses.

1) Ubuntu is defaulting to the anticipatory I/O scheduler, and that melts down under sufficiently heavy loads. I should switch it to CFQ.

2) Some applications (vim and kmail most noticeably) do an fsync(), and in situations with lots of disk activity an fsync can block for 30 seconds. (Why konqueror suffers from this is another question, but konqueror has always been vulnerable to low memory situations...)

Either way it's a host kernel problem. UML is a more or less normal userspace app, and it shouldn't be able to bog the system like...

P.S. You're mentioning loads the kernel guys have specifically optimized for. Kernel builds are what the kernel has been optimized for over the past 10 years, but it's not a big I/O test.
The I/O it does is nicely localized by directory, sucking in entire files each time so readahead is never wasted, and it's generally CPU bound even on a fast machine.

As for updatedb, that's also not random seeks taking small chunks out of a file, and it's dominated by reads. The anticipatory scheduler is effectively optimized for updatedb. Of course the first thing I do on any new system is kill cron...

Swapping has always been tougher to optimize for, and it turns out that even swapping-style access patterns originating in userspace and happening inside a file are still a bit of a pain. I'll see if CFQ improves matters. If nothing else, the ability of "nice 20 mybuild" to actually affect disk I/O would be a _serious_ bonus...

> Jeff

Rob
|
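For anyone wanting to check which elevator the host is actually using before switching to CFQ, here is a small sketch. It assumes a 2.6 kernel that exposes /sys/block/<device>/queue/scheduler, where the active scheduler is the name shown in square brackets. The helper names and the device name "hda" are only examples.

```python
from pathlib import Path


def active_scheduler(text: str) -> str:
    """Return the bracketed (active) entry from a queue/scheduler line."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no active scheduler marked in %r" % text)


def read_active_scheduler(device: str) -> str:
    # Hypothetical helper: 'device' is a block device name such as "hda".
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    print(active_scheduler("noop [anticipatory] deadline cfq"))
```

On kernels that support runtime switching, writing a scheduler name (for example "cfq") into that same file as root changes the elevator for that device.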
From: Blaisorblade <bla...@ya...> - 2005-11-05 11:25:22
|
On Saturday 05 November 2005 06:45, Jeff Dike wrote:
> On Fri, Nov 04, 2005 at 02:41:11PM -0600, Rob Landley wrote:
> > On Friday 04 November 2005 13:10, Blaisorblade wrote:
> > > > What I was thinking is that if we get prezeroing infrastructure that can use various prezeroing accelerators (as has been discussed but I don't believe merged), then a logical prezeroing accelerator for UML would be calling madvise on the host system. This has the advantage of automatically giving back to the host system any memory that's not in use, but would require some way to tell kswapd or some such that keeping around lots of prezeroed memory is preferable to keeping around lots of page cache.

> > > Ah, ok, I see, but a tuneable to say this is almost useless for anything else I guess, so it won't even get coded.

> > If we get prezeroing, the tunable is useful. If we haven't got prezeroing, this infrastructure probably won't get in.

> I'm not really convinced that prezeroing would be that useful, particularly through madvise. The reason is that the normal case for a system is that it has no free memory because it's caching anything that might be useful. The one case I can think of where you all of a sudden have a lot of free memory that might not be used for a while is when a large process exits, and you get a lot of freed data, page tables, etc. Then, we could possibly madvise that and stick it on a zeroed pages list. Forgetting about the extra infrastructure needed to implement it, even that would be under constant threat. Witness the occasional proposals to do pre-swapping - swapping in stuff before it's needed when you have some free memory for it.

I've proposed in fact including (for now) another of Con's patches, which gives some preference to free memory over pagecache (to speed up page allocation)...
but I don't quite understand why none of Con's patches get merged, at least in -mm (not that I follow that a lot)...

Also, using pre-zeroing accelerators would mean that we need to keep some zeroed memory at hand...

> Looking at it another way, what this would basically be doing is moving page zeroing from userspace to kernel space, which is counter to the direction that things generally go.

Nope - it's not reimplementing memset() as a syscall - which would move things from userspace to kernelspace. Instead, pre-zeroing is moving the existing kernelspace memset() to hardware. And relying on it doesn't seem bad...

> > It's not load for me, it's disk bandwidth. Every time it writes to the swap UBD, that data is scheduled for write-out. So if it's thrashing the swap file, even though it's reading the data back in fairly quickly the data still gets written out to disk, again and again, each time it's touched. Result: the disk I/O becomes a bottleneck and the disk is _PEGGED_ as long as the swap storm continues.

Sorry, you mention UML fsyncing to the swap file... this is misconfiguration! Disable CONFIG_*UBD*_SYNC and enable it per-device with ubd0s= rather than ubd0= (see --help output). The net effect is the same, except you don't get synchronous swapping! Can you try and report _any_ difference?

> Do you understand exactly what's happening here? Because I don't, and I wish someone could explain it. UML shouldn't be able to bog down the host like that. Its one-request-at-a-time pseudo-AIO shouldn't make that much IO happen that suddenly. There are other things that do IO for a living (kernel builds, updatedb) and they don't seem to bog down the system like this.

Some ideas on this:

1) In my experience, even with the CFQ scheduler, updatedb _does_ do a lot of harm to the system... (but I run a Folding@Home CPU hog, so I may not be considered totally trustworthy). Without CFQ it's a total pain (even without CPU hogs)...
2) Also, fsync() is a bad idea here.... the host elevator can either prioritize only UML's writes wrt. all other writes (which could be seen as unfair and so wouldn't be implemented) or prioritize all writes, or do nothing to speed up fsync() - and I guess some elevator prioritizes writes on fsync(). Which bogs down the host.

3) Have you looked at C. Aker's ubd token limiter?

4) Also, remember that you mustn't count I/O, but rather seeks (one seek can cost about 10 ms+ on a laptop, i.e. 100 KB of sequential I/O)....

* remember that guest's ext3 (and recently reiserfs too) is proud of avoiding fragmentation by spreading files on the whole disk... (thus making ubd0 _much_ sparse).

* and finally, remember that we usually run UML on sparse files, which is an atypical workload and not optimized against fragmentation... we could well have a sequential read in UML become a crazy seek storm in the host.

About this, a paper at OLS (Virtualized GNU/Linux testing across distros?) talks about "special I/O elevator setup" for UML, and the authors talked to you for some issues, and IIRC maybe even work at Intel...

***) Jeff, what about talking to them and asking them to submit us their code, or at the very least their recipe?

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
|
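The "count seeks, not I/O" arithmetic in point 4 can be worked through as follows. This is only a back-of-the-envelope sketch: the 10 MB/s sequential rate is an assumption inferred from the two figures given (10 ms per seek, about 100 KB of sequential I/O), and the function names are invented for the example.

```python
def seek_equivalent_bytes(seek_ms: int, throughput_mb_s: int) -> int:
    """Bytes that could have been streamed sequentially during one seek.

    ms * (MB/s) * 1000 = bytes, taking 1 MB as 10**6 bytes.
    """
    return seek_ms * throughput_mb_s * 1000


def effective_rate(chunk_bytes: int, seek_ms: float, throughput_mb_s: float) -> float:
    """Bytes per second achieved when every chunk costs one seek first."""
    transfer_s = chunk_bytes / (throughput_mb_s * 1_000_000)
    return chunk_bytes / (seek_ms / 1000.0 + transfer_s)


if __name__ == "__main__":
    # One 10 ms seek at an assumed 10 MB/s costs as much as this many bytes.
    print(seek_equivalent_bytes(10, 10))
    # Swap-style traffic: 4 KiB pages, one seek per page.
    print(round(effective_rate(4096, 10, 10)))
```

Under those assumed numbers, page-sized random I/O achieves well under 1 MB/s, which fits the observation that a swap storm can peg the disk while moving very little data.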
From: Rob L. <ro...@la...> - 2005-11-05 23:44:49
|
On Saturday 05 November 2005 05:30, Blaisorblade wrote:
> I've proposed in fact including (for now) another of Con's patches, which gives some preference to free memory over pagecache (to speed up page allocation)... but I don't quite understand why none of Con's patches get merged, at least in -mm (not that I follow that a lot)...

> Also, using pre-zeroing accelerators would mean that we need to keep some zeroed memory at hand...

In theory, the state of truly free memory is irrelevant. The fact madvise zeroes it out is nice, but not actually required. (And I'm not sure madvise would actually zero if /tmp isn't tmpfs, so relying on the zeroing behavior might not be quite advisable just yet anyway.)

> > > It's not load for me, it's disk bandwidth. Every time it writes to the swap UBD, that data is scheduled for write-out. So if it's thrashing the swap file, even though it's reading the data back in fairly quickly the data still gets written out to disk, again and again, each time it's touched. Result: the disk I/O becomes a bottleneck and the disk is _PEGGED_ as long as the swap storm continues.

> Sorry, you mention UML fsyncing to the swap file... this is misconfiguration! Disable CONFIG_*UBD*_SYNC and enable it per-device with ubd0s= rather than ubd0= (see --help output). The net effect is the same, except you don't get synchronous swapping! Can you try and report _any_ difference?

I have it disabled in UML. It's vi and kmail that seem to be doing fsync. (In vi's case, it has a .swp file that allows you to recover from crashes. In kmail's case, it has a similar saved state that allows you to resume composing your email after a crash. The problem is, both _block_ waiting for the fsync to finish, which sucks mightily when you're trying to type when it blocks.)

All UML is doing is thrashing the heck out of the disk.
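To put a number on how long an fsync() blocks while the disk is busy, something like the following could be run on the host during a swap storm. It is a sketch, not a benchmark: it writes one small temporary file and times a single os.fsync() call, and the function name is made up for the example.

```python
import os
import tempfile
import time


def timed_fsync(payload: bytes = b"x" * 4096) -> float:
    """Write a small temporary file and return how long fsync() blocked, in seconds."""
    with tempfile.NamedTemporaryFile() as handle:
        handle.write(payload)
        handle.flush()                  # push Python's buffer to the kernel
        start = time.monotonic()
        os.fsync(handle.fileno())       # block until the kernel reports it written
        return time.monotonic() - start


if __name__ == "__main__":
    print("fsync blocked for %.3f s" % timed_fsync())
```

Note that if the temporary directory is on tmpfs the call returns almost immediately, so the file would need to live on the same disk as the swap file for the measurement to mean anything.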
> Some ideas on this:

> 1) In my experience, even with the CFQ scheduler, updatedb _does_ do a lot of harm to the system... (but I run a Folding@Home CPU hog, so I may not be considered totally trustworthy). Without CFQ it's a total pain (even without CPU hogs)...

updatedb is mostly reads, but due to our over-eager page cache, reads can bloat the page cache to push running programs out of memory. There was some work back in the early 2.4 timeframe to add new page cache pages to the 'expired' list or some such, except that under load this fought with readahead... (I lost track of things after Rik's vm got yanked in favor of Andrea Arcangeli's because I never _did_ get a clear explanation of what the heck a classzone was...)

> 2) Also, fsync() is a bad idea here.... the host elevator can either prioritize only UML's writes wrt. all other writes (which could be seen as unfair and so wouldn't be implemented) or prioritize all writes, or do nothing to speed up fsync() - and I guess some elevator prioritizes writes on fsync(). Which bogs down the host.

The anticipatory scheduler does stupid things sometimes when both writes and reads are under pressure.

There have been a dozen different approaches to try to make this all work. Token based swap thrashing control was the most recent, I believe. I've hit every single bad case out there, under 2.4.4 I once got my desktop so badly into swap city that I went to lunch, came back, and it was STILL SWAPPING. Trying to switch to a different konqueror window! (Power cycle time.)

I've been able to bring any linux desktop system to its knees (it's easy, open 40 konqueror tabs and a copy of kmail with 60,000 linux-kernel messages in "threaded" mode, while using vi and kmail's composer. Compiling stuff is an optional extra...) I thought upgrading to 512 megs of ram might make it go away, but apparently not...

> 3) Have you looked at C. Aker's ubd token limiter?
The UML instance is legitimately swapping, it's running something with a ~200 meg working set in 64 megs of ram and 256 megs of swap. (This is why I'm interested in any "give pages back to the host system" approach that would let me just _give_ UML 256 megs of ram without starving my desktop because UML has filled itself up with redundant page cache.)

I did try telling UML "echo 0 > /proc/sys/vm/swappiness", but that just triggered UML's OOM killer, as I mentioned. Which I consider to be a separate bug...

> 4) Also, remember that you mustn't count I/O, but rather seeks (one seek can cost about 10 ms+ on a laptop, i.e. 100 KB of sequential I/O)....

> * remember that guest's ext3 (and recently reiserfs too) is proud of avoiding fragmentation by spreading files on the whole disk... (thus making ubd0 _much_ sparse).

The guest actually has an ext2 partition loopback mounted out of hostfs, but the situation that's under load isn't stressing that. The stress is entirely on the swap file, and swapping is inherently pretty seeky.

> * and finally, remember that we usually run UML on sparse files,

I was doing that too. The dd to create the file on the parent filesystem created a sparse file, which UML was happy to loopback mount because hostfs hid the sparseness of it. I stopped doing that because it was yet another way to peg the disk with seek activity. (Trying it again has been a todo item for a while...)

> which is an atypical workload and not optimized against fragmentation... we could well have a sequential read in UML become a crazy seek storm in the host.

You could, but I'm not feeding it sparse files, exactly to avoid this possibility.

> About this, a paper at OLS (Virtualized GNU/Linux testing across distros?) talks about "special I/O elevator setup" for UML, and the authors talked to you for some issues, and IIRC maybe even work at Intel...
I've configured UML to use the NOP elevator, because all my I/O goes through the parent system which should have its own elevator. I can try feeding UML an elevator if you think it'll help...

> ***) Jeff, what about talking to them and asking them to submit us their code, or at the very least their recipe?

You're welcome to my test case. It's my Firmware Linux build online at http://www.landley.net/code/firmware (I should have an updated version using 2.6.14 out in a few days, working on some unrelated stuff at the moment...)

Rob
|
From: Blaisorblade <bla...@ya...> - 2005-11-07 19:15:49
|
On Sunday 06 November 2005 00:44, Rob Landley wrote:
> On Saturday 05 November 2005 05:30, Blaisorblade wrote:
> > I've proposed in fact including (for now) another of Con's patches, which gives some preference to free memory over pagecache (to speed up page allocation)... but I don't quite understand why none of Con's patches get merged, at least in -mm (not that I follow that a lot)...

> > Also, using pre-zeroing accelerators would mean that we need to keep some zeroed memory at hand...

> In theory, the state of truly free memory is irrelevant. The fact madvise zeroes it out is nice, but not actually required. (And I'm not sure madvise would actually zero if /tmp isn't tmpfs, so relying on the zeroing behavior might not be quite advisable just yet anyway.)

The API for that is not yet implemented, but the idea is "we're going to punch a hole in the file", so the zeroing (which is, actually, implicit, especially in this case) is the only correct way.

In fact, the zeroing is done by remapping read-only to empty_zero_page on the next read fault; only when there is a write fault do we need to allocate a new zeroed page.

> updatedb is mostly reads, but due to our over-eager page cache, reads can bloat the page cache to push running programs out of memory. There was some work back in the early 2.4 timeframe to add new page cache pages to the 'expired' list or some such, except that under load this fought with readahead... (I lost track of things after Rik's vm got yanked in favor of Andrea Arcangeli's because I never _did_ get a clear explanation of what the heck a classzone was...)

I've no idea of the "expired" list but I guess current Rik's work on

> The anticipatory scheduler does stupid things sometimes when both writes and reads are under pressure. There have been a dozen different approaches to try to make this all work. Token based swap thrashing control was the most recent, I believe.
Doesn't work well yet, so it's currently disabled... (see linux-mm wiki and/or comments in the code).

> I've hit every single bad case out there, under 2.4.4 I once got my desktop so badly into swap city that I went to lunch, came back, and it was STILL SWAPPING. Trying to switch to a different konqueror window! (Power cycle time.)

:: LOL ::

To get that I simply needed to run "make -j" on a kernel tree (not "-jN", "-j", i.e. without limit).

> I've been able to bring any linux desktop system to its knees (it's easy, open 40 konqueror tabs

You'll likely prefer Opera then - I have your same habit, but I'm actively trying to switch away.

> and a copy of kmail with 60,000 linux-kernel messages in "threaded" mode, while using vi and kmail's composer. Compiling stuff is an optional extra...) I thought upgrading to 512 megs of ram might make it go away, but apparently not...

> > 3) Have you looked at C. Aker's ubd token limiter?

> The UML instance is legitimately swapping, it's running something with a ~200 meg working set in 64 megs of ram and 256 megs of swap. (This is why I'm interested in any "give pages back to the host system" approach that would let me just _give_ UML 256 megs of ram without starving my desktop because UML has filled itself up with redundant page cache.)

ubd token limiter doesn't interact with VM, it's just a "be nice on the host, and auto-limit yourself."

It slows down request submission, and it's used by Christoph for hosting multiple UMLs, but will help you too I guess.

> The dd to create the file on the parent filesystem created a sparse file, which UML was happy to loopback mount because hostfs hid the sparseness of it.

1) I think that loop-back mount is supposed to work on sparse files too (the only thing which is refused is "swapon <sparse file>").
2) How does hostfs hide that? That's interesting (as a bug, I mean).

> > which is an atypical workload and not optimized against fragmentation...
> > we could well have a sequential read in UML become a crazy seek storm in the host.

> You could,

Sorry, "we could" means "it could happen that".

> but I'm not feeding it sparse files, exactly to avoid this possibility.

> > About this, a paper at OLS (Virtualized GNU/Linux testing across distros?) talks about "special I/O elevator setup" for UML, and the authors talked to you for some issues, and IIRC maybe even work at Intel...

> I've configured UML to use the NOP elevator, because all my I/O goes through the parent system which should have its own elevator. I can try feeding UML an elevator if you think it'll help...

Dunno... my idea is that nop can make sense, but that paper seemed to hint at something smarter than this.

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
|
From: Rob L. <ro...@la...> - 2005-11-08 00:32:30
|
On Sunday 06 November 2005 11:18, Blaisorblade wrote:
> > In theory, the state of truly free memory is irrelevant. The fact madvise zeroes it out is nice, but not actually required. (And I'm not sure madvise would actually zero if /tmp isn't tmpfs, so relying on the zeroing behavior might not be quite advisable just yet anyway.)

> The API for that is not yet implemented, but the idea is "we're going to punch a hole in the file", so the zeroing (which is, actually, implicit, especially in this case) is the only correct way.

I saw the punch API patch float by on the kernel list. (I was under the impression that madvise on something ramfs based was a special case, it _would_ zero out the section of data in the file. But the punch API is more generic, and frees up disk space too.)

> In fact, the zeroing is done by remapping read-only to empty_zero_page on the next read fault; only when there is a write fault do we need to allocate a new zeroed page.

That's what I'd expect, and exactly what we want: more memory available for use by the host system until UML needs it back, but when UML _does_ need it back no immediate threat of the (UML-side) OOM killer triggering before we can reactivate it.

> > updatedb is mostly reads, but due to our over-eager page cache, reads can bloat the page cache to push running programs out of memory. There was some work back in the early 2.4 timeframe to add new page cache pages to the 'expired' list or some such, except that under load this fought with readahead... (I lost track of things after Rik's vm got yanked in favor of Andrea Arcangeli's because I never _did_ get a clear explanation of what the heck a classzone was...)

> I've no idea of the "expired" list but I guess current Rik's work on

This was three or four years ago. It's all changed...
> > I've hit every single bad case out there, under 2.4.4 I once got my desktop so badly into swap city that I went to lunch, came back, and it was STILL SWAPPING. Trying to switch to a different konqueror window! (Power cycle time.)

> :: LOL ::
>
> To get that I simply needed to run "make -j" on a kernel tree (not "-jN", "-j", i.e. without limit).

Yeah, that'd do it too.

> > The UML instance is legitimately swapping, it's running something with a ~200 meg working set in 64 megs of ram and 256 megs of swap. (This is why I'm interested in any "give pages back to the host system" approach that would let me just _give_ UML 256 megs of ram without starving my desktop because UML has filled itself up with redundant page cache.)

> ubd token limiter doesn't interact with VM, it's just a "be nice on the host, and auto-limit yourself."
>
> It slows down request submission, and it's used by Christoph for hosting multiple UMLs, but will help you too I guess.

In theory the host should get this right, though. What I really want is ionice, and I'm under the impression that one of the schedulers made this possible a few months back. Dunno if it got merged, or what userspace changes I'd need...

> > The dd to create the file on the parent filesystem created a sparse file, which UML was happy to loopback mount because hostfs hid the sparseness of it.

> 1) I think that loop-back mount is supposed to work on sparse files too (the only thing which is refused is "swapon <sparse file>").
> 2) How does hostfs hide that? That's interesting (as a bug, I mean).

I mean I made a sparse file on ext2 and then ran a UML instance that hostfs mounted that ext2. If hostfs cares that the underlying filesystem is sparse or compressed or whatnot, it's caring way too much about implementation details of the underlying filesystem. :)

I suppose if I used the funky block allocation range ioctl thing that LILO uses to figure out where kernel images live on disk, maybe it would pass it through. I haven't tried. (I vaguely recall that at one point loopback mounting sparse files didn't work, but loopback mounting used to be brittle five years or so back. It seems to have been cleaned up a bit since then...)

Rob
|
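Since the thread keeps coming back to whether a given image file is sparse, here is one common way to check from the host side. It is a heuristic sketch, assuming a filesystem that reports st_blocks in 512-byte units (as Linux does): a file whose allocated blocks cover less than its logical size has holes. The function names are invented for the example.

```python
import os


def looks_sparse(size_bytes: int, blocks_512: int) -> bool:
    """True if fewer bytes are allocated on disk than the file's logical size."""
    return blocks_512 * 512 < size_bytes


def file_looks_sparse(path: str) -> bool:
    # st_size is the logical length, st_blocks the allocated 512-byte blocks.
    info = os.stat(path)
    return looks_sparse(info.st_size, info.st_blocks)


if __name__ == "__main__":
    import tempfile

    with tempfile.NamedTemporaryFile() as image:
        image.truncate(64 * 1024 * 1024)   # extend without writing any data
        image.flush()
        print(file_looks_sparse(image.name))
```

Filesystems that compress data or store tiny files inline can fool this check, so it only hints at sparseness and is not proof of it.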
From: Blaisorblade <bla...@ya...> - 2005-11-08 16:07:56
|
On Tuesday 08 November 2005 01:32, Rob Landley wrote:
> On Sunday 06 November 2005 11:18, Blaisorblade wrote:
> In theory the host should get this right, though. What I really want is ionice, and I'm under the impression that one of the schedulers made this possible a few months back. Dunno if it got merged, or what userspace changes I'd need...

IIRC that was exactly CFQ, but no details on this.

> > > The dd to create the file on the parent filesystem created a sparse file, which UML was happy to loopback mount because hostfs hid the sparseness of it.

> > 1) I think that loop-back mount is supposed to work on sparse files too (the only thing which is refused is "swapon <sparse file>").
> > 2) How does hostfs hide that? That's interesting (as a bug, I mean).

> I mean I made a sparse file on ext2 and then ran a UML instance that hostfs mounted that ext2.

> If hostfs cares that the underlying filesystem is sparse or compressed or whatnot, it's caring way too much about implementation details of the underlying filesystem. :)

No, the point is that you can loop-back mount even sparse files.

> I suppose if I used the funky block allocation range ioctl thing that LILO uses to figure out where kernel images live on disk, maybe it would pass it through. I haven't tried.

I think it simply isn't implemented - I barely know about that (guess it's ->bmap or ->fibmap in the fs methods). Also, there's no point in that - can the application, after having the block numbers, open the block device in any way? And obviously it's worse for LILO.

> (I vaguely recall that at one point loopback mounting sparse files didn't work, but loopback mounting used to be brittle five years or so back. It seems to have been cleaned up a bit since then...)

Ah, ok, didn't see this part before.

But hey, it's (now) standard practice to loopback-mount root_fs images to alter them.
I've been using linux since less than 3 years (say RedHat 7.3 was my first distro), though, so I can't remember about before. Which means that, when just testing the build process in itself, since you can create a chroot on the loopback-mount, it's perfectly ok to run it in a chroot on the host, for speed. (Yep, I know, but if you need to rebuild the same thing over and over, and the problems are with the compilation rather than with UML, you can do that in the chroot). This can apply for instance when installing Gentoo. Ok, ok, that's maybe too obvious to be laid out - sorry if it's so. -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Messenger: chiamate gratuite in tutto il mondo http://it.messenger.yahoo.com |
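For reference, the block-mapping ioctl discussed above (the one LILO uses to find where a kernel image lives on disk) is FIBMAP. Here is a rough sketch of what asking for a file's block numbers looks like on a Linux host. The function name `map_blocks` is made up for this note, the ioctl numbers are the `_IO(0x00, 1)` and `_IO(0x00, 2)` values from `<linux/fs.h>`, and FIBMAP normally needs CAP_SYS_RAWIO, so as an ordinary user the call is expected to be refused. Nothing here says whether hostfs passes the request through. The sketch just returns None when the ioctl fails.

```python
import fcntl
import os
import struct
import tempfile

FIBMAP = 1    # _IO(0x00, 1) in <linux/fs.h>: logical block -> physical block
FIGETBSZ = 2  # _IO(0x00, 2) in <linux/fs.h>: filesystem block size


def map_blocks(path, max_blocks=16):
    """Return physical block numbers for the first logical blocks of path.

    A 0 entry means that logical block is not mapped (for example a hole
    in a sparse file). Returns None if the ioctl is refused or unsupported.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = fcntl.ioctl(fd, FIGETBSZ, struct.pack("i", 0))
        block_size = struct.unpack("i", raw)[0]
        size = os.fstat(fd).st_size
        n_blocks = min(max_blocks, (size + block_size - 1) // block_size)
        result = []
        for logical in range(n_blocks):
            # The kernel overwrites the int with the physical block number.
            raw = fcntl.ioctl(fd, FIBMAP, struct.pack("i", logical))
            result.append(struct.unpack("i", raw)[0])
        return result
    except OSError:
        # Typically EPERM without CAP_SYS_RAWIO, or EINVAL if the
        # filesystem has no block mapping to offer.
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * 8192)
        f.flush()
        print(map_blocks(f.name))
```

Even when it works, the block numbers are only meaningful relative to the device holding the filesystem, which is the objection raised above: a guest that cannot open that device has little use for them.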
From: Rob L. <ro...@la...> - 2005-11-09 00:50:36
|
On Tuesday 08 November 2005 09:56, Blaisorblade wrote:
> But hey, it's (now) standard practice to loopback-mount root_fs images to
> alter them. I've been using Linux for less than 3 years (say RedHat 7.3
> was my first distro), though, so I can't say how it was before.

Red Hat 5.something here. (5.1? It _hated_ my video card. And I was already trying to do strange things like loopback mounting a file that lived in another loopback mount...)

> Which means that, when just testing the build process in itself, since you
> can create a chroot on the loopback-mount, it's perfectly ok to run it in a
> chroot on the host, for speed.

I've done that (with my build, run the sources/scripts/1.1-* script in the chroot environment after --bind mounting in /tools and /tools/sources). But at the end of the build the packaging step makes a standalone User Mode Linux (with appended squashfs) to test it out, so I'll need it eventually anyway...

Of course the gcc build is still quite a memory hog without UML. But the system is more graceful about dealing with it. (Better at random memory pressure than random IO pressure, apparently. It might just be that the anticipatory scheduler on the host system is particularly _bad_ about dealing with saturation-level, random-seek, equal mixes of reads and writes scattered across a couple hundred megabytes of disk space.) I should be upgrading the host system soon, so it's not worth pursuing this unless the new kernel has the same problem...

Rob
|