From: Rob L. <ro...@la...> - 2005-11-24 12:11:19
|
So apparently, one reason for the pathological behavior of UML (pegging the hard drive, which I mentioned earlier) is that by default Ubuntu doesn't mount tmpfs on /tmp. This means /tmp is part of the root filesystem, which is ext3, and every touched page gets scheduled for writeout after a few seconds. (The optimization not to do that for deleted files was apparently taken out of 2.6.)

There is a tmpfs mount: it's /dev/shm. And apparently, even if tmpfs isn't exposed as a separate filesystem, System V shared memory will still use it.

So my question is, could System V shared memory be used in place of the tmpfs mount? (Can it be mapped in the right location and inherited across fork()?) Or is this just a case of "systems that don't mount tmpfs on /tmp are screwed, it's another prerequisite for running UML"?

Rob
-- 
Steve Ballmer: Innovation!
Inigo Montoya: You keep using that word. I do not think it means what you think it means.
From: Blaisorblade <bla...@ya...> - 2005-11-24 20:41:18
|
On Thursday 24 November 2005 13:11, Rob Landley wrote:
> So apparently, one reason for the pathological behavior of UML (pegging the hard drive, which I mentioned earlier) is that by default Ubuntu doesn't mount tmpfs on /tmp. This means /tmp is part of the root filesystem, which is ext3, and every touched page gets scheduled for writeout after a few seconds. (The optimization not to do that for deleted files was apparently taken out of 2.6.)
>
> There is a tmpfs mount: it's /dev/shm.
>
> And apparently, even if tmpfs isn't exposed as a separate filesystem, System V shared memory will still use it.

Ah, OK... it's more or less true.

> So my question is, could System V shared memory be used in place of the tmpfs mount? (Can it be mapped in the right location and inherited across fork()?)

IIRC you can share a SysV shmem area across arbitrary processes: anybody who calls ftok() on the same file gets the same key and can open the shmem area.

I mostly wonder about automatic cleanup. One (mis)feature of SysV IPC is persistence until reboot (i.e. no auto-cleanup when the process exits). In fact, we make processes sleep on pipes rather than use SysV semaphores exactly for this reason (I wanted to use futexes, but never found the time).

However, I just found out (see shmctl(2)) that IPC_RMID implements the refcount "garbage collection" algorithm, so apparently it *could* be used.

The question is whether we want it, and considering the new features being added to shmfs, the answer is probably either "no" or "we accept patches if somebody else is willing to maintain them" (adding yet another code path doesn't make me that happy - see the effort needed to make TT and SKAS3, and now SKAS0 and SKAS3, keep working).

> Or is this just a case of "systems that don't mount tmpfs on /tmp are screwed, it's another prerequisite for running UML"?

First, UML works anyway.

Set one of TMPDIR / TMP / TEMP (I don't remember the exact priorities, but IIRC TMPDIR has the highest priority) to point to /dev/shm. Actually, we could even make it the default (but we must cater for older systems).

It's used for POSIX shmem, so it's as standard on >=2.4 Linuxes as SysV shmem.

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
From: Rob L. <ro...@la...> - 2005-11-25 08:26:19
|
On Thursday 24 November 2005 14:40, Blaisorblade wrote:
> However, I just found out (see shmctl(2)) that IPC_RMID implements the refcount "garbage collection" algorithm, so apparently it *could* be used.
>
> The question is whether we want it, and considering the new features being added to shmfs, the answer is probably either "no" or "we accept patches if somebody else is willing to maintain them" (adding yet another code path doesn't make me that happy - see the effort needed to make TT and SKAS3, and now SKAS0 and SKAS3, keep working).

Hmmm... (Eyes to-do list...)

> > Or is this just a case of "systems that don't mount tmpfs on /tmp are screwed, it's another prerequisite for running UML"?
>
> First, UML works anyway.
>
> Set one of TMPDIR / TMP / TEMP (I don't remember the exact priorities, but IIRC TMPDIR has the highest priority) to point to /dev/shm. Actually, we could even make it the default (but we must cater for older systems).
>
> It's used for POSIX shmem, so it's as standard on >=2.4 Linuxes as SysV shmem.

Expecting /dev/shm to be tmpfs seems more reliable than expecting /tmp to be. (After all, its original name was shmfs...)

I just added

  [ -d /dev/shm ] && export TMPDIR=/dev/shm

to my build script, and it seems to help.

Thanks,

Rob
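The one-liner above only checks that /dev/shm exists as a directory. A slightly more cautious variant, sketched here in Python purely for illustration, also looks up the filesystem type in text laid out like /proc/mounts. The helper names `fs_type` and `pick_tmpdir` are invented for this example and are not part of UML or of anyone's build script. Leaving an already-set TMPDIR alone is an assumption of the sketch, not something the thread prescribes.

```python
import os


def fs_type(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path.

    mounts_text is text in /proc/mounts layout, one mount per line:
    "device mountpoint fstype options dump pass".
    Escaped characters in mount points (such as \\040 for a space) are not
    decoded in this sketch.
    """
    path = os.path.normpath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same point shadow an earlier one.
            if len(mount_point) >= best_len:
                best_len, best_type = len(mount_point), fstype
    return best_type


def pick_tmpdir(environ, mounts_text, candidate="/dev/shm"):
    """Keep an existing TMPDIR; otherwise suggest candidate if it is tmpfs."""
    if environ.get("TMPDIR"):
        return environ["TMPDIR"]
    if fs_type(candidate, mounts_text) == "tmpfs":
        return candidate
    return None
```

On a real system one would presumably call `pick_tmpdir(os.environ, open("/proc/mounts").read())` and export the result only when it is not `None`.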
From: Jeff D. <jd...@ad...> - 2005-11-25 09:02:30
|
On Thu, Nov 24, 2005 at 06:11:01AM -0600, Rob Landley wrote:
> So my question is, could system v shared memory be used in place of the tmpfs mount? (Can it be mapped in the right location and inherited across fork()?)

tmpfs and shmfs are two names for the same underlying code. I think the shmfs mount is for the benefit of things that use SysV shared memory.

So, no. Just use tmpfs on /tmp.

Jeff
From: Nix <ni...@es...> - 2005-11-25 14:57:02
|
On Thu, 24 Nov 2005, Rob Landley uttered the following:
> There is a tmpfs mount, it's /dev/shm. And apparently, even if tmpfs isn't exposed as a separate filesystem, system V shared memory will still use it.

s/System V/POSIX/

It's the shm_open()/shm_unlink() functions you're looking for (with plain close() on the descriptor). They've been present in glibc since 2.2, so you should be able to use them without any real difficulty if need be.

> So my question is, could system v shared memory be used in place of the tmpfs mount? (Can it be mapped in the right location and inherited across fork()?)

You could certainly do just that with POSIX shm :)

-- 
`Y'know, London's nice at this time of year. If you like your cities freezing cold and full of surly gits.' --- David Damerell
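To make the "mapped and inherited across fork()" part concrete, here is a minimal sketch in Python. It does not call shm_open() itself; instead it maps a nameless temporary file created in /dev/shm (the directory the thread says POSIX shmem uses), falling back to the ordinary temp directory when /dev/shm is not writable. The function names are made up for this example, and this is not how UML sets up its physmem file.

```python
import mmap
import os
import tempfile


def default_shm_dir():
    """Prefer /dev/shm when it is a writable directory, else the normal temp dir."""
    if os.path.isdir("/dev/shm") and os.access("/dev/shm", os.W_OK | os.X_OK):
        return "/dev/shm"
    return tempfile.gettempdir()


def share_across_fork(message=b"hello", directory=None):
    """Map a nameless file MAP_SHARED, fork, and read back what the child wrote."""
    size = mmap.PAGESIZE
    with tempfile.TemporaryFile(dir=directory or default_shm_dir()) as backing:
        backing.truncate(size)  # give the file one page of zeroes to map
        mem = mmap.mmap(backing.fileno(), size, flags=mmap.MAP_SHARED,
                        prot=mmap.PROT_READ | mmap.PROT_WRITE)
        pid = os.fork()
        if pid == 0:
            # Child: the mapping was inherited across fork(); write through it.
            try:
                mem[:len(message)] = message
            finally:
                os._exit(0)
        os.waitpid(pid, 0)
        # Parent: the child's store is visible because the mapping is shared.
        data = bytes(mem[:len(message)])
        mem.close()
        return data
```

Because the backing file never has a visible name, it disappears as soon as the last process closes it, which sidesteps the cleanup worry raised earlier about SysV IPC.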
From: Chris L. <ch...@ex...> - 2005-11-25 15:03:40
|
On Fri, Nov 25, 2005 at 02:56:49PM +0000, Nix wrote:
> You could certainly do just that with POSIX shm :)

Another option is to mlock the memory, which should prevent paging, but requires root. I have a patch which does this using a helper binary, if people would like it.

-- 
``As usual the Liberals offer a mixture of sound and original ideas. Unfortunately none of the sound ideas is original and none of the original ideas is sound.'' (Harold Macmillan)
From: Nix <ni...@es...> - 2005-11-25 15:36:48
|
On Fri, 25 Nov 2005, Chris Lightfoot murmured woefully:
> On Fri, Nov 25, 2005 at 02:56:49PM +0000, Nix wrote:
>> You could certainly do just that with POSIX shm :)
>
> Another option is to mlock the memory, which should prevent paging, but requires root. I have a patch which does this using a helper binary, if people would like it.

Well, mlocking it is certainly not practical for everyone :) while shm_open() and friends *is* practical as a general solution.

e.g., one of my more important UMLs, my firewall:

  nix@loki 27 /home/nix% ps -o rss,vsz -C uml-esperi
    RSS   VSZ
  34296 99788
   1468  1624
  34296 99788
  34296 99788
  34296 99788

That's a very large RSS because I'm sshing in through it; normally it's more like 5MB. The host only has 128MB RAM and does many other things as well: mlock()ing that 99MB into RAM would render the host almost useless!
From: Rob L. <ro...@la...> - 2005-11-25 16:04:09
|
On Friday 25 November 2005 09:03, Chris Lightfoot wrote:
> On Fri, Nov 25, 2005 at 02:56:49PM +0000, Nix wrote:
> > You could certainly do just that with POSIX shm :)
>
> Another option is to mlock the memory, which should prevent paging, but requires root. I have a patch which does this using a helper binary, if people would like it.

A) mlock would be a bad thing. Not only is it a trivial DoS waiting to happen, but I like the UML physmem being swapped out under memory pressure. I just don't want it uselessly written to disk over and over in the absence of any memory pressure whatsoever, consuming all I/O bandwidth to no purpose, which is the effect when it's not on tmpfs.

B) Still requires root. The suid root helper program is only available in exactly the same circumstances in which you can just ask the admin to mount tmpfs somewhere for you.

Rob
From: Nix <ni...@es...> - 2005-11-25 19:34:00
|
On Fri, 25 Nov 2005, Rob Landley uttered the following:
> A) mlock would be a bad thing. Not only is it a trivial DoS waiting to happen, but I like the UML physmem being swapped out under memory pressure. I just don't want it uselessly written to disk over and over in the absence of any memory pressure whatsoever, consuming all I/O bandwidth to no purpose, which is the effect when it's not on tmpfs.

Maybe this is a stupid question, but... why do *any* systems other than extremely memory-constrained ones not mount tmpfs on /tmp? It seems to me to have numerous advantages and no disadvantages.

In fact, even when you're memory-constrained, if you *have* disk space that you could spend on /tmp, you can swap to it instead, and spend the space on virtual memory when you're not spending it on /tmp.

So, er, why?
From: Rob L. <ro...@la...> - 2005-11-25 20:19:10
|
On Friday 25 November 2005 13:33, Nix wrote:
> On Fri, 25 Nov 2005, Rob Landley uttered the following:
> > A) mlock would be a bad thing. Not only is it a trivial DoS waiting to happen, but I like the UML physmem being swapped out under memory pressure. I just don't want it uselessly written to disk over and over in the absence of any memory pressure whatsoever, consuming all I/O bandwidth to no purpose, which is the effect when it's not on tmpfs.
>
> Maybe this is a stupid question, but... why do *any* systems other than extremely memory-constrained ones not mount tmpfs on /tmp? It seems to me to have numerous advantages and no disadvantages.

Actually, I consider the fact the OOM killer doesn't delete files out of tmpfs mounts to be a potential disadvantage in this context.

Using /tmp for anything has been kind of discouraged for a while, because throwing any insufficiently randomized filename in there is a security hole waiting to happen. By the time tmpfs was widely available as something you might mount on /tmp, the use of /tmp had been largely replaced with things like the ~/.kde directory or /var/spool/appdir with ownership and permissions enforced.

Most of the remaining uses of /tmp are actually for things like named sockets (where tmpfs really doesn't help at all), or for tiny little files (like all the mcop crap) that on a different day would live under /var. It's used for inter-process communication, not for temporary storage space. Long ago things like vi would create temporary files in /tmp, but these days it uses .${filename}.swp in the same directory as the file being edited.

(As a matter of fact, there's even a /var/tmp that konqueror recently started storing its cache in. It used to be in ~/.kde. So there isn't just _one_ tmp directory; if you try to tmpfs mount your /tmp then you need to do more than one.)

I suspect that the real reason nobody mounts tmpfs on /tmp is that nobody _bothers_. Nobody in their right mind puts anything big under /tmp, the few remaining uses are largely IPC between different users on the same machine, and even X11 has mostly moved away from that. Things like postfix and cups use subdirectories under /var/spool that aren't world readable.

Keep in mind that tmpfs used to be shmfs, and what it's good at is providing shared memory. What UML really _wants_ is shared memory, which has traditionally been available through /dev/shm. Insisting that /tmp behave like /dev/shm because otherwise what you get doesn't behave like shared memory A) doesn't make a whole lot of sense, B) doesn't match existing practice.

> In fact, even when you're memory-constrained, if you *have* disk space that you could spend on /tmp, you can swap to it instead, and spend the space on virtual memory when you're not spending it on /tmp.

"Can" doesn't mean "should". Yes you can make a 10 gigabyte swap partition, but most people actively don't want one because if your system ever winds up using more than about twice as much swap space as it has physical memory, it's likely that the amount of swap thrashing you're doing is getting pathological. Having a runaway app have to churn through 10 gigabytes of swap space before the OOM killer terminates it can turn 30 seconds of paralysis into 10 minutes. Not an improvement.

Also, although it's pretty common to have 10 gigabytes of spare disk space on a modern laptop, it is _not_ common to have 10 gigabytes of spare swap space, and that's for a reason. Extra space in your filesystem can be used for all sorts of things. Extra swap space is normally wasted.

So having /tmp just be a normal directory isn't really that bad of a choice. It normally manifests no downsides whatsoever. And encouraging people to use /tmp is considered a security hole.

> So, er, why?

/dev/shm appears to be the widely available tmpfs mount, because its purpose is to provide shared memory. It is not and never has been the purpose of /tmp to provide shared memory.

Rob
From: Chris L. <ch...@ex...> - 2005-11-25 23:46:37
|
On Fri, Nov 25, 2005 at 02:18:43PM -0600, Rob Landley wrote:
> Using /tmp for anything has been kind of discouraged for a while, because throwing any insufficiently randomized filename in there is a security hole waiting to happen.

Which case are you worried about here? SFAIK all the filesystems anyone is likely to mount on /tmp implement O_EXCL correctly, and in any case (as was remarked elsewhere) there's always mkdir.

-- 
Sudden death syndrome, eh? Sounds nasty. What are the symptoms? (seen on the internet)
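For readers who have not met the two mechanisms mentioned here, the following is a small sketch in Python of what they buy you. `create_exclusive` and `demo` are names invented for this example. Opening with O_CREAT|O_EXCL fails if anything already exists at the path, including a symlink an attacker planted in advance, and a freshly made private directory (the "there's always mkdir" route, done here with `tempfile.mkdtemp`) keeps other users out of the namespace altogether.

```python
import os
import tempfile


def create_exclusive(path, mode=0o600):
    """Create path for writing; fail if anything (even a symlink) is already there."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)


def demo():
    """Return what happened for a fresh name, a reused name, and a planted symlink."""
    results = []
    workdir = tempfile.mkdtemp()  # private directory, mode 0700
    try:
        target = os.path.join(workdir, "scratch")
        os.close(create_exclusive(target))
        results.append("created")
        try:
            create_exclusive(target)
        except FileExistsError:
            results.append("refused existing file")
        link = os.path.join(workdir, "link")
        os.symlink("/nonexistent/victim", link)  # stands in for an attacker's symlink
        try:
            create_exclusive(link)
        except FileExistsError:
            results.append("refused symlink")
    finally:
        for name in os.listdir(workdir):
            os.unlink(os.path.join(workdir, name))
        os.rmdir(workdir)
    return results
```

The classic /tmp bugs come from programs that skip both steps and open a predictable name with plain O_CREAT, which follows whatever is already sitting at that path.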
From: Rob L. <ro...@la...> - 2005-11-26 10:04:23
|
On Friday 25 November 2005 17:46, Chris Lightfoot wrote:
> On Fri, Nov 25, 2005 at 02:18:43PM -0600, Rob Landley wrote:
> > Using /tmp for anything has been kind of discouraged for a while, because throwing any insufficiently randomized filename in there is a security hole waiting to happen.
>
> Which case are you worried about here? SFAIK all the filesystems anyone is likely to mount on /tmp implement O_EXCL correctly, and in any case (as was remarked elsewhere) there's always mkdir.

I think programmers got the general impression using /tmp for temporary files was a really stupid idea from the fact that it keeps cropping up on things like LWN's security section. Here's the ones they linked to just last week as still being fixed by various distros:

  http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CAN-2004-0968
  http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CAN-2005-2672
  http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2005-2851
  http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CAN-2005-2104
  http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2005-3124

Rob
From: Chris L. <ch...@ex...> - 2005-11-26 10:16:00
|
On Sat, Nov 26, 2005 at 04:03:54AM -0600, Rob Landley wrote:
> On Friday 25 November 2005 17:46, Chris Lightfoot wrote:
> > On Fri, Nov 25, 2005 at 02:18:43PM -0600, Rob Landley wrote:
> > > Using /tmp for anything has been kind of discouraged for a while, because throwing any insufficiently randomized filename in there is a security hole waiting to happen.
> >
> > Which case are you worried about here? SFAIK all the filesystems anyone is likely to mount on /tmp implement O_EXCL correctly, and in any case (as was remarked elsewhere) there's always mkdir.
>
> I think programmers got the general impression using /tmp for temporary files was a really stupid idea from the fact that it keeps cropping up on things like LWN's security section. Here's the ones they linked to just last week as still being fixed by various distros: [...]

hmm. I'm not sure any of that's an argument for avoiding use of /tmp in new programs. I'm not really sure what the sensible alternative is, either: at least you can sensibly write policy about (e.g.) cleaning old files out of /tmp if you want to, whereas if you have multiple ad-hoc policies for temporary files, you can't.

-- 
language not worship must pink delirious sleep produce (fridge poetry)
From: Rob L. <ro...@la...> - 2005-11-26 10:04:22
|
On Friday 25 November 2005 15:04, Nix wrote:
> The ~/.kde directory doesn't contain temporary files, but persistent state:

~/.kde/share/apps/kmail/lock is persistent state?

I do know that half the time the darn battery runs out and kde suddenly shuts down my desktop without the courtesy of even _warning_ me first (oh it pops up a window three seconds before doing it), kmail doesn't have a chance to zap this file before being killed and thus I have to drill down and zap the sucker by hand or it'll refuse to run when I boot back up.

Circa Red Hat 9, konqueror's cache files were under .kde. I have no idea what the junk in .kde/share/apps/kpdf is for...

But I take your point. They've instituted a policy and tried to clean this up. Similarly, .bash_history, .bittorrent, .DCOPserver*, .mcop, and all the other fun stuff written into home must be considered persistent state.

> and the same is true of /var/spool,

Doesn't /var/spool/cups contain files spooled to the printer? (I dunno, the only printer in the house is hooked up to my fiancée's windows machine.)

> > things like vi would create temporary files in /tmp,
>
> /var/tmp, because the entire point of those files was to survive a reboot.

Is it? I thought it was to support undo.

> > but these days it uses .${filename}.swp in the same directory as the file being edited.
>
> Yes, and I absolutely despise this behaviour. Is there any way to force vim to use /var/tmp like everyone else?

It's a compile-time option. (I accidentally set it to use /tmp once and had to figure out how to undo it.)

> - programs writing to $TMPDIR; config.guess, configure, and GCC are big users on my systems, but lots of other apps write here for a while. Of course you could point TMPDIR somewhere else, but does anyone do that?
>
> There are a quite surprising number of these: generally the files live for brief instants before being unlinked, if at all. (mkstemp() creates its files in $TMPDIR, after all, and often for those files minimal overhead is what counts; and like it or not tmpfs has lower overhead than ext*fs.)

Files that live for brief instants never get written out to disk anyway. That's why there's the delay before dirty pages in the page cache are scheduled for writeout. So tmpfs doesn't help there.

> - users. A *lot* of my users dump temporary crud in /tmp:

Yeah, at Rutgers we used to do that on the Sun machines to get around the disk quota.

> the names of these files aren't predictable unless you're telepathic so we're pretty safe from symlink attacks. (My local users are Nice Guys anyway, or I shoot them. No shots have so far been necessary.)
>
> Maybe your users don't dump everything they don't care much about in /tmp: mine are always sticking all sorts of things in there from half-chewed LaTeX through to boring logfiles and stuff being looked over on its way to the printer :)

Sounds like your users are old unix hands who cut their teeth on traditional Unix boxes in the days before Linux.

> > Keep in mind that tmpfs used to be shmfs, and what it's good at is providing shared memory.
>
> Yep. It just so happens that this gives good properties for transient stuff that should vanish no later than the next reboot, and generally lives only for as long as someone has it open.

*shrug*. The truly transient stuff never leaves the page cache, no matter what the filesystem. (Especially if you mount with noatime, which is the norm these days.)

> > What UML really _wants_ is shared memory, which has traditionally been available through /dev/shm. Insisting that /tmp behave like /dev/shm because otherwise what you get doesn't behave like shared memory A) doesn't make a whole lot of sense, B) doesn't match existing practice.
>
> `Existing practice' seems to me to have pretty much wanted something, uh, like tmpfs. But maybe your existing practice of /tmp is very different from mine. (It certainly sounds like it.)

Out there in the field, today, /tmp is not usually tmpfs. And nobody's seen enough benefit in it to bother deploying it on the Fedora, Gentoo, and Ubuntu systems I've tested.

I suspect that knoppix uses tmpfs for /tmp, since it has no backing store. (Firing up knoppix 4.0 under qemu...)

Heh. I was sort of right. /tmp doesn't have anything explicitly mounted on it, but inherits the unionfs mount on root, which is a combination of the cdrom and a tmpfs mount on /ramdisk. So it is sort of tmpfs, but not explicitly. It seems to line up with Jeff's recommendations entirely by accident. :)

> > "Can" doesn't mean "should". Yes you can make a 10 gigabyte swap partition, but most people actively don't want one because if your system ever winds up using more than about twice as much swap space as it has physical memory, it's likely that the amount of swap thrashing you're doing is getting pathological.
>
> You've never used dar in infinint mode

Never even heard of it.

> or watched large matrix maths stuff churn through to completion :/

Oh I've watched large jobs thrash the heck out of a machine all afternoon. Classic ray tracing, for example...

> there really are things with insane memory requirements and good locality of reference. (I think the most I ever saw dar eat was 15GB of swap. *gah*)

I'm not saying there aren't uses for it, I'm just saying it's not the norm and hence not a sane default.

> > Having a runaway app have to churn through 10 gigabytes of swap space before the OOM killer terminates it can turn 30 seconds of paralysis into 10 minutes. Not an improvement.
>
> The problem there is that it's churning, i.e. that its locality of reference is crap. Such a program should indeed not be allowed to eat that much swap.

Well, there's also the fact a high-end modern laptop probably has about 80 gigs of storage space and the cheaper ones have 40 or even 20. So eating 10 gigs of that just isn't an option. More laptops were sold last year than workstations.

> > Also, although it's pretty common to have 10 gigabytes of spare disk space on a modern laptop, it is _not_ common to have 10 gigabytes of spare swap space, and that's for a reason. Extra space in your filesystem can be used for all sorts of things. Extra swap space is normally wasted.
>
> You can zap it if you need it for something else pretty easily. swapfiles are no slower than swap partitions these days, and swap partitions are easy to turn into filesystems too.

I've done this, but it's not automatic. (Did they ever make swapfiles reliable so they don't lock up under low memory situations?)

Rob
From: Blaisorblade <bla...@ya...> - 2005-11-27 16:49:36
|
On Friday 25 November 2005 23:31, Rob Landley wrote:
> On Friday 25 November 2005 15:04, Nix wrote:
> > The ~/.kde directory doesn't contain temporary files, but persistent state:
>
> ~/.kde/share/apps/kmail/lock is persistent state?
>
> I do know that half the time the darn battery runs out and kde suddenly shuts down my desktop without the courtesy of even _warning_ me first (oh it pops up a window three seconds before doing it), kmail doesn't have a chance to zap this file before being killed and thus I have to drill down and zap the sucker by hand or it'll refuse to run when I boot back up.
>
> Circa Red Hat 9, konqueror's cache files were under .kde. I have no idea what the junk in .kde/share/apps/kpdf is for...
>
> But I take your point. They've instituted a policy and tried to clean this up. Similarly, .bash_history, .bittorrent, .DCOPserver*, .mcop, and all the other fun stuff written into home must be considered persistent state.
>
> > and the same is true of /var/spool,
>
> Doesn't /var/spool/cups contain files spooled to the printer? (I dunno, the only printer in the house is hooked up to my fiancée's windows machine.)

Spool *is* persistent... this is why cups is a daemon. Indeed, when you power on a printer and it starts printing without further action, you (or the average user, or me for a couple of seconds) say "it's a daemon's fault!"...

> > > things like vi would create temporary files in /tmp,
> >
> > /var/tmp, because the entire point of those files was to survive a reboot.
>
> Is it? I thought it was to support undo.

> Files that live for brief instants never get written out to disk anyway. That's why there's the delay before dirty pages in the page cache are scheduled for writeout. So tmpfs doesn't help there.

That's not entirely correct as far as performance goes: the file might never get written out, but on most filesystems (excluding XFS, Reiser4 and some experimental ext3 versions) a few preparatory steps are still done (block allocation, for instance, which involves poking through free-list bitmaps and is even computationally intensive*). Delayed allocation was invented exactly for that.

* On a RAID array with ext3, in a benchmark, it limited writeout speed to 300 Mb/s instead of 500 Mb/s (see OLS 2005, ext3 paper).

> > - users. A *lot* of my users dump temporary crud in /tmp:

> *shrug*. The truly transient stuff never leaves the page cache, no matter what the filesystem. (Especially if you mount with noatime, which is the norm these days.)

I've seen it rarely used... only Gentoo suggests that.
From: Nix <ni...@es...> - 2005-11-27 18:18:05
|
[Sorry for response delay, steaming cold/flu] On Fri, 25 Nov 2005, Rob Landley worried: > On Friday 25 November 2005 15:04, Nix wrote: >> The ~/.kde directory doesn't contain temporary files, but persistent state: > > ~/.kde/share/apps/kmail/lock is persistent state? No, but KDE is a bit of a mess in some areas, and this is one of htem. > I do know that half the time the darn battery runs out and kde suddely shuts > down my desktop without the courtesy of even _warning_ me first (oh it pops > up a window three seconds before doing it), kmail doesn't have a chance to > zap this file before being killed and thus I have to drill down and zap the > sucker by hand or it'll refuse to run when I boot back up. ... and this is why it should be in /tmp. > Circa Red Hat 9, konqueror's cache files were under .kde. I have no idea what > the junk in .kde/share/apps/kpdf is for... Not true as of reasonably recent Konquerors. > But I take your point. They've instituted a policy and tried to clean this > up. Similarly, .bash_history, .bittorrent, .DCOPserver*, .mcop, and all the > other fun stuff written into home must be considered persistent state. Certainly the bash history and bittorrent stuff is persistent. .mcop is persistent (the trader cache should outlast reboots). >> and the same is true of /var/spool, > > Doesn't /var/spool/cups contains files spooled to the printer? (I dunno, the > only printer in the house is hooked up to my fiance's windows machine.) Yes. Again, if the machine reboots, you don't want to lose stuff you've got waiting to print. >> > things like vi would create temporary files in /tmp, >> >> /var/tmp, because the entire point of those files was to survive a reboot. > > Is it? I thought it was to support undo. Nah, as this XEmacs user understands it, it's for `vi -r'. (XEmacs does that with stuff in the local directory and/or any-directory-of-your-choice, so I picked one under /var/tmp. :) ) >> > but these days it >> > uses . 
${filename}.swp in the same directory as the file being edited. >> >> Yes, and I absolutely despise this behaviour. Is there any way to force vim >> to use /var/tmp like everyone else? > > It's a compile-time option. (I accidentally set it to use /tmp once and had > to figure out how to undo it.) Is it? Oh good, I'll flip it next time I upgrade :) >> - programs writing to $TMPDIR; config.guess, configure, and GCC are big >> users on my systems, but lots of other apps write here for a while. Of >> course you could point TMPDIR somewhere else, but does anyone do that? >> >> There are a quite surprising number of these: generally the files live >> for brief instants before being unlinked, if at all. (mkstemp() creates >> its files in $TMPDIR, after all, and often for those files minimal overhead >> is what counts; and like it or not tmpfs has lower overhead than ext*fs.) > > Files that live for brief instants never get written out to disk anyway. Aside: it's easy to test this by writing something that creates and unlinks a file, dumps stuff into it, then deletes it, and loops on that: watch the disk light. I'll write a testcase because I'm so sure I'm right. [five minutes later] ... oops. I just, er, proved I was wrong. Ah well. You live and learn. This was certainly true in 2.4 but in 2.6 it seems to be the case that dirty blocks get magically undirtied if the file in question gets completely unlinked and not kept open by anything before the blocks hit the disk (unless the file is too large to fit in the page cache of course; even then it might fit in tmpfs, as tmpfs is swap-backed but even I'll admit that multi-hundred-megabyte writes to /tmp are rare things for programs to do.) > That's why there's the delay before dirty pages in the page cache are > scheduled for writeout. So tmpfs doesn't help there. Well, it does if the consuming program takes some time to consume the file, or the producing program takes some time to generate it (e.g. 
GCC; yes, even in -pipe mode, some temporary files in /tmp are used.) >> - users. A *lot* of my users dump temporary crud in /tmp: > > Yeah, at Rutgers we used to do that on the Sun machines to get around the disk > quota. Mine do it to avoid cluttering up their $HOMEs with crap. (Well, all but one whose home directory looks like a sewer. I avoid looking in there unless forced.) >> Maybe your users don't dump everything they don't care much about in >> /tmp: mine are always sticking all sorts of things in there from >> half-chewed LaTeX through to boring logfiles and stuff being looked over >> on its way to the printer :) > > Sounds like your users are old unix hands who cut their teeth on traditional > Unix boxes in the days before Linux. Two of them are for certain: I don't know about the rest. They're not doing it for efficiency reasons, just out of tidiness. >> `Existing practice' seems to me to have pretty much wanted something, >> uh, like tmpfs. But maybe your existing practice of /tmp is very different >> from mine. (It certainly sounds like it.) > > Out there in the field, today, /tmp is not usually tmpfs. Out there in the field, today, the average Linux box is running Oracle and very little else :( > And nobody's seen > enough benefit in it to bother deploying it on the Fedora, Gentoo, and Ubuntu > systems I've tested. I haven't seen a non-tmpfs-for-/tmp Linux box in years. I guess this is another transatlantic divide thing :) >> You've never used dar in infinint mode > > Never even heard of it. A very powerful backup program with, ahem, notable memory consumption problems in its default configuration (unless you *like* 1Gb memory consumption per million files, approx.) >> or watched large matrix maths stuff >> churn through to completion :/ > > Oh I've watched large jobs thrash the heck out of a machine all afternoon. > Classic ray tracing, for example... 
Ray tracing is a worst case; it has very little locality of reference at all (at least not unless the ray tracer has been optimized for parallelism, which `classic' ones generally haven't been). >> > Also, although it's pretty common to have 10 gigabytes of spare disk >> > space on a modern laptop, it is _not_ common to have 10 gigabytes of >> > spare swap space, and that's for a reason. Extra space in your >> > filesystem can be used for all sorts of things. Extra swap space is >> > normally wasted. >> >> You can zap it if you need it for something else pretty easily. swapfiles >> are no slower than swap partitions these days, and swap partitions are easy >> to turn into filesystems too. > > I've done this, but it's not automatic. (Did they ever make swapfiles > reliable so they don't lock up under low memory situations?) As I understand it all the downsides of swapfiles (speed, reliability et al) went away in the 2.5.x timeframe. -- `Y'know, London's nice at this time of year. If you like your cities freezing cold and full of surly gits.' --- David Damerell |
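The create/write/unlink testcase mentioned in the message above could look roughly like the following. This is a minimal sketch, not the testcase that was actually run: the function name, the iteration count, and the payload size are all made up for illustration, and it takes one reading of the description (write, close, then unlink).

```python
import os
import tempfile


def churn_short_lived_files(directory, iterations=100, size=64 * 1024):
    """Create, fill, and unlink a file `iterations` times; return bytes written.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    payload = b"x" * size
    total = 0
    for _ in range(iterations):
        # mkstemp() creates the file atomically under a random name.
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            total += os.write(fd, payload)
        finally:
            os.close(fd)
            # Remove the file right away, long before any delayed writeback
            # of its dirty pages would normally be scheduled.
            os.unlink(path)
    return total


if __name__ == "__main__":
    # Point this at a directory on the filesystem under test (for example an
    # ext3 /tmp versus a tmpfs /dev/shm) and watch disk activity while it runs.
    target = tempfile.gettempdir()
    print(churn_short_lived_files(target), "bytes written and discarded in", target)
```

Whether the disk light stays dark depends on the kernel version and filesystem, which is exactly what the thread is arguing about; the script only generates the workload.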
From: Rob L. <ro...@la...> - 2005-11-27 19:25:06
|
On Sunday 27 November 2005 12:17, Nix wrote: > [Sorry for response delay, steaming cold/flu] > > On Fri, 25 Nov 2005, Rob Landley worried: > > On Friday 25 November 2005 15:04, Nix wrote: > >> The ~/.kde directory doesn't contain temporary files, but persistent > >> state: > > > > ~/.kde/share/apps/kmail/lock is persistent state? > > No, but KDE is a bit of a mess in some areas, and this is one of them. My fiance's laptop has xfce on it. Her assessment? "The mouse is cute." And she doesn't actively hate it. Pondering switching over to that. It'd mean giving up Konqueror, but that's the Konqueror developers' fault for gluing it to hundreds of megabytes of unnecessary crap... > > I do know that half the time the darn battery runs out and kde suddenly > > shuts down my desktop without the courtesy of even _warning_ me first (oh > > it pops up a window three seconds before doing it), kmail doesn't have a > > chance to zap this file before being killed and thus I have to drill down > > and zap the sucker by hand or it'll refuse to run when I boot back up. > > ... and this is why it should be in /tmp. As with all pid files, it should check and see if there is a currently running process with that PID (and that this process is using the same binary as it is, which you can find under /proc) and if not, zap the pid file as stale. There should probably be a library function for this, it's to let you find the currently running instance. You should confirm once you find... > > Circa Red Hat 9, konqueror's cache files were under .kde. I have no idea > > what the junk in .kde/share/apps/kpdf is for... > > Not true as of reasonably recent Konquerors. It's now under /var/tmp, as I mentioned. (Apparently, they want the cache to persist between reboots, despite the fact I told it cookies shouldn't. So when _is_ /var/tmp cleaned, anyway? Randomly?) > > But I take your point. They've instituted a policy and tried to clean > > this up.
Similarly, .bash_history, .bittorrent, .DCOPserver*, .mcop, and > > all the other fun stuff written into home must be considered persistent > > state. > > Certainly the bash history and bittorrent stuff is persistent. .mcop is > persistent (the trader cache should outlast reboots). *shrug*. The only one I know what it actually does is the bash history... > >> and the same is true of /var/spool, > > > > Doesn't /var/spool/cups contains files spooled to the printer? (I dunno, > > the only printer in the house is hooked up to my fiance's windows > > machine.) > > Yes. Again, if the machine reboots, you don't want to lose stuff you've got > waiting to print. Actually, I do. Very much so, in some cases. But I can see it being a preference... > >> > but these days it > >> > uses . ${filename}.swp in the same directory as the file being edited. > >> > >> Yes, and I absolutely despise this behaviour. Is there any way to force > >> vim to use /var/tmp like everyone else? > > > > It's a compile-time option. (I accidentally set it to use /tmp once and > > had to figure out how to undo it.) > > Is it? Oh good, I'll flip it next time I upgrade :) Also look for a vimrc file under /etc somewhere. You can override just about anything from there. > > Files that live for brief instants never get written out to disk anyway. > > Aside: it's easy to test this by writing something that creates and > unlinks a file, dumps stuff into it, then deletes it, and loops on that: > watch the disk light. I'll write a testcase because I'm so sure I'm right. > > [five minutes later] > > ... oops. I just, er, proved I was wrong. Ah well. You live and > learn. 
This was certainly true in 2.4 but in 2.6 it seems to be the case > that dirty blocks get magically undirtied if the file in question gets > completely unlinked and not kept open by anything before the blocks hit > the disk (unless the file is too large to fit in the page cache of > course; even then it might fit in tmpfs, as tmpfs is swap-backed but > even I'll admit that multi-hundred-megabyte writes to /tmp are rare > things for programs to do.) You were right about noatime not being the default, though. Should be, but then "should be" is Jeff's argument for tmpfs on /tmp and I'm the one pushing against that. (Patch forthcoming, I added it to the front my to-do list, might even get to it this evening.) > > That's why there's the delay before dirty pages in the page cache are > > scheduled for writeout. So tmpfs doesn't help there. > > Well, it does if the consuming program takes some time to consume the > file, or the producing program takes some time to generate it (e.g. GCC; > yes, even in -pipe mode, some temporary files in /tmp are used.) In theory that's idle disk time and the sucker is very CPU limited in that case. More or less by definition. (But if the disk is highly bogged by something else. Of course then it's possible you're swapping, so...) > >> - users. A *lot* of my users dump temporary crud in /tmp: > > > > Yeah, at Rutgers we used to do that on the Sun machines to get around the > > disk quota. > > Mine do it to avoid cluttering up their $HOMEs with crap. (Well, all but > one whose home directory looks like a sewer. I avoid looking in there > unless forced.) Doesn't everybody's ~ look like a junk drawer? Every time I reinstall I start with a fresh home directory and the previous stuff in /home/old or some such, and copy stuff over as I need it. > > Sounds like your users are old unix hands who cut their teeth on > > traditional Unix boxes in the days before Linux. > > Two of them are for certain: I don't know about the rest. 
They're not > doing it for efficiency reasons, just out of tidiness. My programming style tends to have this in common with farming: The end result is as tidy as I can make it, but the workspace is piles of dirt with trenches dug in it. (Laws and sausage are apparently made the same way.) > >> `Existing practice' seems to me to have pretty much wanted something, > >> uh, like tmpfs. But maybe your existing practice of /tmp is very > >> different from mine. (It certainly sounds like it.) > > > > Out there in the field, today, /tmp is not usually tmpfs. > > Out there in the field, today, the average Linux box is running Oracle and > very little else :( Not in my experience. Start by thinking about apache, and from what I've seen mysql installations outnumber Oracle (not in dollar volume but in units and users). I was rooting for postgresql for a while, but apparently there's not as much middle ground as you'd think. These days I'm rooting for the simple entirely in-memory databases... > > And nobody's > > seen enough benefit in it to bother deploying it on the Fedora, Gentoo, > > and Ubuntu systems I've tested. > > I haven't seen a non-tmpfs-for-/tmp Linux box in years. I guess this is > another transatlantic divide thing :) You see a lot of hand-tuned systems. I see a lot of "IT isn't really what we do" systems and a lot of "put it together myself with duct tape", and haven't seen much in between recently. > > Oh I've watched large jobs thrash the heck out of a machine all > > afternoon. Classic ray tracing, for example... > > Ray tracing is a worst case; it has very little locality of reference at > all (at least not unless the ray tracer has been optimized for parallelism, > which `classic' ones generally haven't been). Hence the thrashing, yes. :) > >> You can zap it if you need it for something else pretty easily. > >> swapfiles are no slower than swap partitions these days, and swap > >> partitions are easy to turn into filesystems too.
> > > > I've done this, but it's not automatic. (Did they ever make swapfiles > > reliable so they don't lock up under low memory situations?) > > As I understand it all the downsides of swapfiles (speed, reliability et > al) went away in the 2.5.x timeframe. I knew they were improving it, but I hadn't been following too closely. *ponders current setup*. I could do tmpfs backed by a swap file living on an ext2 partition that's loopback mounted from a hostfs that's exported from an ext3 partition. I wonder if that has a bat's chance of actually working? Of course at that point, I'd have an almost unbearable urge to stick QEMU in there somewhere. On general principles. Rob -- Steve Ballmer: Innovation! Inigo Montoya: You keep using that word. I do not think it means what you think it means. |
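The stale pid-file check described earlier in this message (look up the recorded PID, confirm under /proc that it is running the same binary, otherwise treat the file as stale) can be sketched as below. This is only an illustration under stated assumptions: it is Linux-specific because it relies on /proc/<pid>/exe, and the function name is hypothetical, not an existing library call.

```python
import os


def pid_file_is_stale(pid_file, expected_exe):
    """Return True if pid_file does not name a live process running expected_exe.

    Hypothetical helper; relies on the Linux /proc/<pid>/exe symlink.
    """
    with open(pid_file) as handle:
        contents = handle.read().strip()
    try:
        pid = int(contents)
    except ValueError:
        return True  # garbage in the file: nothing sensible to protect
    try:
        # /proc/<pid>/exe is a symlink to the binary the process is running.
        running_exe = os.readlink(f"/proc/{pid}/exe")
    except FileNotFoundError:
        return True  # no such process, so the file was left behind
    except PermissionError:
        return False  # the process exists but is not ours; do not touch
    return running_exe != expected_exe
```

A caller would unlink the file and carry on when this returns True. Note that the check and the unlink are not atomic, so two instances starting at the same moment could still race; a real implementation would need a lock around them.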
From: Rob L. <ro...@la...> - 2005-11-26 11:47:36
|
On Friday 25 November 2005 20:12, Nix wrote: > If it's a problem you have both hostile users and no size limits on /tmp > and you therefore have bigger problems anyway. :) The size limits on /tmp aren't per-user. > >> Yeah, true, if you think the OOM killer is worthwhile (I do: most of the > >> MM hackers don't. I know who knows more about the Linux kernel's MM and > >> it's not me!) > > > > Its heuristics are crap (many cases breaking them), and the concept is > > crap: damn hell, a C programmer has been taught to check that malloc() > > can return NULL, not that he should patch a kernel to get a meaningful > > behaviour. > > Yeah, but it does sort of work. Personally I prefer to just never run out > of memory :) My laptop has 512 megs of ram, and 700 megs of swap. I'm running QEMU to boot a knoppix image with 256 megs of ram, running UML to build gcc 4 (which has a high water mark of disk usage somewhere north of 128 megs). I have two konqueror windows open with an average of 30 tabs in each. I have kmail open with a threaded view of linux-kernel with 69,649 messages in that folder. Plus the general overhead for kde, two standalone pdf viewers, several terminal windows, a partridge in a pear tree, and so on. It's been a few weeks since I've triggered the OOM killer, but I've done it. > > However, the idea of an OOM could be made to work, if you can kill an app > > based on the derivative of its memory usage (i.e. how fast usage has > > increased over the last moments). > > ... and the VM appears to be growing things that might help in that area :) We get better as time goes on. My original point was that the semantics of what UML wants is shared memory. It's trusting /tmp to provide different behavior than simply using ~, and this turns out to be a very unreliable assumption. There is a directory (/dev/shm) whose entire definition is to provide those semantics, and shouldn't even _exist_ if it doesn't. I believe that would be a better directory to use.
I can submit a patch for this. It's arch/um/os-Linux/mem.c, line 37, in find_tempdir(). And while I'm at it, os-Linux/start_up.c has a check_tmpexec() that has "/tmp" hardwired into its messages, even if that's not what find_tempdir() returned... > >> > Using /tmp for anything has been kind of discouraged for a while, > >> > because throwing any insufficiently randomized filename in there is a > >> > security hole waiting to happen. > >> > >> Um, atomically create a directory, > > > > DoS-able if filenames are predictable... > > ... with a random name, obviously. :) Like "/tmp/uml.ctl" in arch/um/drivers/daemon_kern.c, line 70? (It's not obvious where this file is actually created, it's one of those funky callback things where data in a structure is used somewhere else...) > > Never seen anybody doing it, IIRC. Not even mkstemp() (even if today I > > discover mkdtemp()). > > Oh. I do it all the time. I prefer not to work under the assumption that > I'm more brilliant than thirty years of Unix hackers and spotted > something none of them did, but so be it... 30 years ago the Unix hackers were working on a 16-bit PDP-11 with two RK05 disk packs storing 2.5 megabytes each. And the reason they duplicated /bin and /sbin and /lib under /usr is that they ran out of space on the root disk and had to leak the OS into the second disk pack which had previously held all the user home directories. And people never revisited this decision for the next three decades, despite the fact the "needed for early boot" rationale was entirely a pragmatic thing of the moment, and makes _no_ sense on a modern system ever since the invention of the initial ramdisk, let alone initramfs. I personally symlink /bin, /sbin, and /lib to the corresponding /usr directories and consolidate the whole mess, myself. 
Yes, you have to patch gcc's paths (in collect2) to not search _both_ /lib and /usr/lib because if gnu's linker finds the same symbols in two different libraries it statically links them in rather than trying to figure out which one is right, resulting in executables as big as if they're statically linked but still refusing to run if they can't find their shared libraries at run time. That's a bug in ld. The point is, it's important to know _what_ conclusions the 30 years of unix hackers came to, but keep in mind that the computing environment of 2005 is in some ways very different from the computing environments of 1976 or 1984. > > - back in 2.4, tmpfs on /tmp broke mkinitrd since it tried to loop-mount > > the new initrd, which was in /tmp. And loop-mount over tmpfs didn't work. > > Ah, well, I never use initrd if I can avoid it, and a bug in one tool is > a reason to *fix that tool*, not rejig the whole damn system. I agree initrd is kinda pointless, but initramfs isn't. The kernel guys are moving towards initramfs being required someday. These are still nebulous future plans with no actual deadline, but they include moving to dynamically assigned major/minor numbers (so you need something like udev to populate /dev), having userspace find and mount the real root partition (so when you're booting from a USB key but your root partition lives on an NFS server that in order to access it you have to dhcp yourself an address, nslookup the server name, and then login with a public key from said USB stick...) All the various partitioning schemes could be moved over to device mapper. And so on. They'd proposed a serious kernel crapectomy "for 2.7" back before 2.7 got put on indefinite hold. How they're rolling it out now, we dunno. They seem to be happy chewing their current mouthful, at the moment... > (and `mount', of course, only lists mounts if you trust /proc/mounts to > be accurate. If the kernel doesn't know what's mounted, you have bigger problems.
> What does it look like in this brave new world of shared > subtrees? I had this discussion on the kernel list a week or so back: namespaces are reference counted so as soon as the last process that can see a mount goes away, umount happens. This means that umount -a should only zap everything in your current namespace, so that after init kills all sub-processes it can then run umount -a for pid 1, life is good. I had this discussion because I wanted to make sure busybox umount would be doing it right. > Obviously /etc/mtab *must* be a symlink to /proc/mounts, now, > only oops that breaks the quota tools...) I rewrote busybox mount so that things work properly with /proc/mounts. And I vaguely remember coming up with an in-house patch to fix the quota tools (they were upset by rootfs) something like four years ago. > > (Btw, the problem was that he added a new external disk, but labeled it > > /boot, like an existing /boot partition , so mount -a choked with > > "duplicate label '/boot'" and it stopped before mounting /home). > > I think now is an appropriate time to say > > I HATE FSCKING MTAB > > (in three-part harmony, probably) Everybody hates /etc/mtab. It doesn't work if you chroot. It can't handle --bind or --move mounts... Just symlink it to /proc/mounts and recognize that any tool that can't handle that is a buggy tool that needs to be fixed. > >> You've never used dar in infinint mode or watched large matrix maths > >> stuff churn through to completion :/ there really are things with insane > >> memory requirements and good locality of reference. (I think the most I > >> ever saw dar eat was 15Gb of swap. *gah*) > > > > Boy, be serious - we are talking about normal systems, and you know that > > you'd better run dar on properly sized systems... > > I still boggle that infinint mode is the default for that tool. First time I've heard of the tool, but then back under 2.4.7 I remember I had rsync regularly triggering the OOM killer. 
Not because rsync was leaking, but because the servers backing up only had 128 megs of memory and the balancing was _terrible_ so the dentry cache and page cache would squeeze out anonymous pages to the point where rsync itself got OOM killed... People who want truly insane amounts of memory these days (often for graphics or video editing) tend to mmap their data files directly and work in there. Once again rendering insane amounts of swap less useful... I'm under the vague impression there's some kind of madvise you can do that says "don't flush this before close unless you're responding to memory pressure". Hmmm... Closest I can find is MADV_RANDOM... If we had a "treat this like it's on tmpfs" madvise, that would be ideal... Rob -- Steve Ballmer: Innovation! Inigo Montoya: You keep using that word. I do not think it means what you think it means. |
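The two ideas in this message, choosing a temporary directory that falls back to /dev/shm, and then atomically creating a directory with a random name inside it, can be sketched together as follows. This is not the patch to find_tempdir() in arch/um/os-Linux/mem.c, just an illustration: the TMPDIR, TMP, TEMP priority order is only the order guessed at earlier in the thread, and both function names and the "uml-demo-" prefix are assumptions.

```python
import os
import tempfile


def find_tempdir(environ=None, fallbacks=("/dev/shm", "/tmp")):
    """Pick a base directory: TMPDIR, TMP, TEMP, then the fallbacks, in order.

    The ordering is an assumption, not UML's documented behavior.
    """
    environ = os.environ if environ is None else environ
    candidates = [environ.get(name) for name in ("TMPDIR", "TMP", "TEMP")]
    candidates.extend(fallbacks)
    for candidate in candidates:
        usable = (
            candidate
            and os.path.isdir(candidate)
            and os.access(candidate, os.W_OK | os.X_OK)
        )
        if usable:
            return candidate
    raise FileNotFoundError("no usable temporary directory found")


def make_private_workdir(base):
    """Atomically create a mode-0700 directory with an unpredictable name."""
    # mkdtemp() retries on name collisions, so a pre-created predictable
    # name cannot be used to hijack or block the directory.
    return tempfile.mkdtemp(prefix="uml-demo-", dir=base)


if __name__ == "__main__":
    base = find_tempdir()
    print("working in", make_private_workdir(base))
```

Putting /dev/shm ahead of /tmp in the fallback list reflects the proposal made above; any error message printed later should name the directory that was actually chosen rather than a hardwired "/tmp".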
From: Blaisorblade <bla...@ya...> - 2005-11-27 16:39:01
|
On Saturday 26 November 2005 11:44, Rob Landley wrote:
> On Friday 25 November 2005 17:33, Blaisorblade wrote:
> > - back in 2.4, tmpfs on /tmp broke mkinitrd since it tried to loop-mount
> > the new initrd, which was in /tmp. And loop-mount over tmpfs didn't work.
>
> I vaguely remember this being fixed. :)

Yep, it was, but at 2.4.18 (my first kernel) it wasn't yet.

> > He had never known that "mount" lists mounts...
>
> I try not to assume that people know everything there is to know about
> Unix.

I've done my share of stupid things, but he's _paid_ as a sysadmin (though he's actually a programmer). I learned this a long time ago, and since I started using Linux <=3.5 years ago, "a long time ago" means at the very beginning... he _should_ know that. However, he had backups, which shows he's well intentioned.

(As an aside, I do despise learning *so* fast... I wouldn't be able to use socket() without manuals, and there's a lot of stuff I'd like to learn well. And I'd really like to _start_ and finish even a little project on my own... it's been years since I started coding up some fun project.)

> And the default seems to be that /tmp ain't tmpfs, but /dev/shm is.

I argue for that too...

> > (Btw, the problem was that he added a new external disk, but labeled it
> > /boot, like an existing /boot partition , so mount -a choked with
> > "duplicate label '/boot'" and it stopped before mounting /home).
> He's using Red Hat, isn't he? :)

Yes...

> (Been there, done that, moved the darn labels to /dev/hda4 and such.
> Wouldn't recommend that with SCSI because the scsi bus detects devices via
> chicken entrails and then enumerates them in sequence (with no gaps) on a
> first come first served basis. With ATA, /dev/hdd3 means second
> controller, slave device, third partition, and that doesn't move unless you
> physically unplug it from its connector cable, no matter what else you plug
> in. The whole _reason_ Red hat has this boot label stuff is some people
> have an unreasoning love of SCSI devices. :)

Well, it makes sense anyhow, and though it's unusual and it sucks for the user, it would be much more meaningful to use labels rather than partitions (when repartitioning the same can happen - and I've seen the bloody hell happen with partition tables

> google dar linux:
> Some french disk archiving tool, apparently. I generally just use tarballs
> or rsync.

It's clear Nix is using some calculation program (not sure which one).

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade
___________________________________
Yahoo! Mail: free 1GB for messages and 10MB attachments http://mail.yahoo.it |
From: Nix <ni...@es...> - 2005-11-27 18:49:38
|
On Sun, 27 Nov 2005, bla...@ya... whispered secretively:
> (as an aside, I do despise learning *so* fast... I wouldn't be able to use
> socket() without manuals, and there's a lot of stuff I'd like to learn well.
> And I'd really like to _start_ and finish even a little project on my own...
> it's years I don't start coding some fun project up).

The BSD socket layer is so irregular that I think *everyone* needs to refer to the manuals when using it, unless they write networking apps in C every few days.

>> google dar linux:
>> Some french disk archiving tool, apparently. I generally just use tarballs
>> or rsync.
>
> It's clear Nix is using some calculation program (not sure what's it).

I'm using both matlab/octave *and*, when running backups, said French disk archiver. The source is gradually being Anglicised so that the developer base can rise a bit :)

It has numerous advantages over tar and rsync if, like me, you're stuck using a pile of CD-R[W]s as your backup medium.

--
`Y'know, London's nice at this time of year. If you like your cities freezing cold and full of surly gits.' --- David Damerell |
From: Rob L. <ro...@la...> - 2005-11-28 00:37:07
|
On Sunday 27 November 2005 12:49, Nix wrote:
> I'm using both matlab/octave *and*, when running backups, said French disk
> archiver. The source is gradually being Anglicised so that the developer
> base can rise a bit :)
>
> It has numerous advantages over tar and rsync if, like me, you're stuck
> using a pile of CD-R[W]s as your backup medium.

I invested in a DVD burner a couple years ago. My laptop's hard drive is bigger than that, but not bigger enough to make the stack unmanageable.

Rob
--
Steve Ballmer: Innovation! Inigo Montoya: You keep using that word. I do not think it means what you think it means. |
From: Blaisorblade <bla...@ya...> - 2005-11-27 17:37:41
|
On Saturday 26 November 2005 12:47, Rob Landley wrote: > On Friday 25 November 2005 20:12, Nix wrote: > > If it's a problem you have both hostile users and no size limits on /tmp > > and you therefore have bigger problems anyway. :) > My original point was that the semantics of what UML wants is shared > memory. It's trusting /tmp to provide different behavior than simply using > ~, and this turns out to be a very unreliable assumption. There is a > directory (/dev/shm) whose entire definition is to provide those semantics, > and shouldn't even _exist_ if it doesn't. I believe that would be a better > directory to use. > I can submit a patch for this. It's arch/um/os-Linux/mem.c, line 37, in > find_tempdir(). > And while I'm at it, os-Linux/start_up.c has a check_tmpexec() that has > "/tmp" hardwired into its messages, even if that's not what find_tempdir() > returned... Good note... I'd gladly accept that. > Like "/tmp/uml.ctl" in arch/um/drivers/daemon_kern.c, line 70? > > (It's not obvious where this file is actually created, it's one of those > funky callback things where data in a structure is used somewhere else...) It's not a file, it's a AF_UNIX socket bound there - and bind() fails if the file exists. So it's a different story (I was puzzled by a missing bind(O_EXCL), but I learned with trial there's no need). It's created at uml_switch (not setuid) startup, which can be done by anybody. Btw, Debian moves that socket to something under /var/run/uml-utilities or something like that. > > Oh. I do it all the time. I prefer not to work under the assumption that > > I'm more brilliant than thirty years of Unix hackers and spotted > > something none of them did, but so be it... I recently realized that even the mktemp(1) utility works - it creates the file and returns the pathname. I kept wondering "but what if an attacker alters the file afterward", but I forgot the sticky bit - nobody else can delete my file. 
> And the reason they duplicated /bin > and /sbin and /lib under /usr is that they ran out of space on the root > disk and had to leak the OS into the second disk pack which had previously > held all the user home directories. Seen this argumentation for Hurd systems... However until LVM2 (and-all-the-rest)-on-root works out of the box, I'll call anything else crap. > I agree initrd is kinda pointless, but initramfs isn't. The kernel guys > are moving towards initramfs being required someday. These are still > nebulous future plans with no actual deadline, but they include moving to > dynamically assigned major/minor numbers (so you need something like udev > to > populate /dev), Nice move to disable init=/bin/sh. Really. Next one is moving kdelibs into the kernel? > They'd proposed a serious kernel crapectomy Yep, I remember. > "for 2.7" back before 2.7 got > put on indefinite hold. How they're rolling it out now, we dunno. They > seem to be happy chewing their current mouthful, at the moment... > > What does it look like in this brave new world of shared > > subtrees? > > Obviously /etc/mtab *must* be a symlink to /proc/mounts, now, > > only oops that breaks the quota tools...) > I rewrote busybox mount so that things work properly with /proc/mounts. > And I vaguely remember coming up with an in-house patch to fix the quota > tools (they were upset by rootfs) something like four years ago. > > I HATE FSCKING MTAB > > > > (in three-part harmony, probably) > > Everybody hates /etc/mtab. It doesn't work if you chroot. Right. > It can't handle > --bind or --move mounts... 
In my experience it does (I have 2/3 distros and use chrooting often, so I loop-mount half my disk):

$ grep bind /etc/mtab
/home /mnt/gen32/home none rw,bind 0 0
/var/spool/wwwoffle /mnt/gen32/var/spool/wwwoffle none rw,bind 0 0
/mnt/gen32/var/tmp/portage /var/tmp/portage none rw,bind 0 0
/home /mnt/mdk/home none rw,bind 0 0
/mnt/win_c /mnt/gen32/mnt/win_c none rw,bind 0 0

Don't know for shared mounts...

> Just symlink it to /proc/mounts and recognize
> that any tool that can't handle that is a buggy tool that needs to be
> fixed.

No - the kernel doesn't allow storing the full set of infos which are added by mount there. And frankly I don't want the kernel to do that.

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade |
From: Nix <ni...@es...> - 2005-11-27 18:36:12
|
On Sun, 27 Nov 2005, bla...@ya... whispered secretively:
> It's not a file, it's a AF_UNIX socket bound there - and bind() fails if the
> file exists. So it's a different story (I was puzzled by a missing
> bind(O_EXCL), but I learned with trial there's no need).

There's an (optional) abstract namespace for AF_UNIX sockets now. It's Linux-only, but UML isn't going to care about that :)

>> > Oh. I do it all the time. I prefer not to work under the assumption that
>> > I'm more brilliant than thirty years of Unix hackers and spotted
>> > something none of them did, but so be it...
>
> I recently realized that even the mktemp(1) utility works - it creates the
> file and returns the pathname. I kept wondering "but what if an attacker
> alters the file afterward", but I forgot the sticky bit - nobody else can
> delete my file.

If that utility exists :( an *awful* lot of Linux systems don't have it, and of course in the howling wilderness that is proprietary Unix, nobody has it at all.

>> And the reason they duplicated /bin
>> and /sbin and /lib under /usr is that they ran out of space on the root
>> disk and had to leak the OS into the second disk pack which had previously
>> held all the user home directories.
>
> Seen this argumentation for Hurd systems... However until LVM2
> (and-all-the-rest)-on-root works out of the box, I'll call anything else
> crap.

That's one of the jobs of the initramfs :) and it's even kept up to date for you with new versions of the tools whenever you rebuild the kernel.

>> I agree initrd is kinda pointless, but initramfs isn't. The kernel guys
>> are moving towards initramfs being required someday. These are still
>> nebulous future plans with no actual deadline, but they include moving to
>> dynamically assigned major/minor numbers (so you need something like udev
>> to
>> populate /dev),
>
> Nice move to disable init=/bin/sh. Really. Next one is moving kdelibs into the
> kernel?

Nah, AIUI the initramfs runs *first*; it's its job to parse those parts of the kernel parameters. (I just hope it gets it right. A lot of initrd scripts I've seen just ignore init=, leading to much pain later on.)

> Don't know for shared mounts...

/etc/mtab assumes *one single* canonical filesystem view, so shared or private mounts or anything smacking of them will break it completely.

(Indeed in my experience breathing heavily near it will break it completely...)

>> Just symlink it to /proc/mounts and recognize
>> that any tool that can't handle that is a buggy tool that needs to be
>> fixed.
>
> No - the kernel doesn't allow storing the full set of infos which are added by
> mount there. And frankly I don't want the kernel to do that.

Why not? It should. Only root can call mount(), so there's no real danger that some attacker will stick megabytes of stuff in there.

--
`Y'know, London's nice at this time of year. If you like your cities freezing cold and full of surly gits.' --- David Damerell |
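The abstract AF_UNIX namespace mentioned at the top of this message can be tried out with a few lines. This is a Linux-only sketch: the function name is hypothetical, and the choice of a datagram socket is arbitrary and says nothing about what uml_switch actually uses.

```python
import socket


def bind_abstract(name):
    """Bind a datagram socket in the Linux abstract AF_UNIX namespace.

    Hypothetical helper for illustration; Linux-only.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # A leading NUL byte selects the abstract namespace: no file is
        # created on disk, and the name disappears when the socket closes.
        sock.bind("\0" + name)
    except OSError:
        sock.close()
        raise
    return sock
```

As with a path-bound socket, a second bind() to a name that is already in use fails (with EADDRINUSE), so the "fails if it exists" behavior is preserved. Unlike a path in /tmp there is no leftover file to clean up after a crash, though abstract names are also not protected by filesystem permissions.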
From: Rob L. <ro...@la...> - 2005-11-28 00:37:07
|
On Sunday 27 November 2005 12:35, Nix wrote: > On Sun, 27 Nov 2005, bla...@ya... whispered secretively: > > It's not a file, it's a AF_UNIX socket bound there - and bind() fails if > > the file exists. So it's a different story (I was puzzled by a missing > > bind(O_EXCL), but I learned with trial there's no need). > > There's an (optional) abstract namespace for AF_UNIX sockets now. It's > Linux-only, but UML isn't going to care about that :) I dunno if that's any less susceptible to attack, but it's probably "the right thing" anyway... > > Don't know for shared mounts... > > /etc/mtab assumes *one single* canonical filesystem view, so shared or > private mounts or anything smacking of them will break it completely. > > (Indeed in my experience breathing heavily near it will break it > completely...) I once asked a couple of lute players what would put a lute out of tune. "Large flowers" and "brightly colored wallpaper" were the immediate answers. (Very delicate instrument your average lute; thin wood, lightly strung, so it's basically out of tune by the end of any given song...) It came to mind thinking about the reliability of mtab, for some reason... > >> Just symlink it to /proc/mounts and recognize > >> that any tool that can't handle that is a buggy tool that needs to be > >> fixed. > > > > No - the kernel doesn't allow storing the full set of infos which are > > added by mount there. And frankly I don't want the kernel to do that. > > Why not? It should. Only root can call mount(), so there's no real > danger that some attacker will stick megabytes of stuff in there. And only the kernel really knows what's mounted. Userspace can try to keep track, assuming every program that calls the mount or umount syscall (mount, autofs, nfsmount, smbmount) remembers to update /etc/mtab, agrees on how, never has a race condition with any other instance. But kernel is still the ultimate authority here. 
If the _kernel_ doesn't know something's mounted, then it's not mounted. Period. Rob -- Steve Ballmer: Innovation! Inigo Montoya: You keep using that word. I do not think it means what you think it means. |
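For anyone who wants to ask the kernel directly rather than trust /etc/mtab, here is a minimal sketch in Python. It assumes the usual fstab-style layout of /proc/self/mounts (device, mount point, filesystem type, comma-separated options, then two numeric fields) and the kernel's octal escaping of spaces and similar characters in paths. The names parse_mounts and kernel_mounts are made up for illustration.

```python
import re

# The kernel writes awkward characters in paths as three-digit octal
# escapes, e.g. a space becomes \040.
_OCTAL = re.compile(r"\\([0-7]{3})")


def _unescape(field):
    """Turn \\040-style octal escapes back into the characters they stand for."""
    return _OCTAL.sub(lambda m: chr(int(m.group(1), 8)), field)


def parse_mounts(text):
    """Parse mount-table text into (device, mountpoint, fstype, options) tuples.

    Lines with fewer than four whitespace-separated fields are skipped.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mountpoint, fstype, options = (_unescape(f) for f in fields[:4])
        entries.append((device, mountpoint, fstype, options.split(",")))
    return entries


def kernel_mounts(path="/proc/self/mounts"):
    """Read the mount table as seen by the calling process (Linux only)."""
    with open(path) as f:
        return parse_mounts(f.read())


if __name__ == "__main__":
    # List every tmpfs mount visible to this process.
    for device, mountpoint, fstype, options in kernel_mounts():
        if fstype == "tmpfs":
            print(mountpoint, ",".join(options))
```

Reading /proc/self/mounts rather than /etc/mtab means the answer reflects the caller's own view of the filesystem, which is the property the thread is arguing for. Options that only the userspace mount tool knows about will not show up here.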
From: Nix <ni...@es...> - 2005-11-27 18:31:37
|
On Sat, 26 Nov 2005, Rob Landley murmured woefully: > On Friday 25 November 2005 20:12, Nix wrote: >> If it's a problem you have both hostile users and no size limits on /tmp >> and you therefore have bigger problems anyway. :) > > The size limits on /tmp aren't per-user. True. TODO: add tmpfs quota support. :) >> > However, the idea of an OOM could be made to work, if you can kill an app >> > based on the derivative of its memory usage (i.e. how fast usage has >> > increased over the last moments). >> >> ... and the VM appears to be growing things that might help in that area :) > > We get better as time goes on. > > My original point was that the semantics of what UML wants is shared memory. > It's trusting /tmp to provide different behavior than simply using ~, and > this turns out to be a very unreliable assumption. There is a directory > (/dev/shm) whose entire definition is to provide those semantics, and > shouldn't even _exist_ if it doesn't. I believe that would be a better > directory to use. I have to agree, not least because it is counterintuitive to set a strict size limit on /tmp on the host and then find you can't start big UMLs. >> ... with a random name, obviously. :) > > Like "/tmp/uml.ctl" in arch/um/drivers/daemon_kern.c, line 70? Um. :/ > I personally symlink /bin, /sbin, and /lib to the > corresponding /usr directories and consolidate the whole mess, myself. Yes, > you have to patch gcc's paths (in collect2) to not search _both_ /lib > and /usr/lib because if gnu's linker finds the same symbols in two different > libraries it statically links them in rather than trying to figure out which > one is right, resulting in executables as big as if they're statically linked > but still refusing to run if they can't find their shared libraries at run > time. That's a bug in ld. I'll say! I'll see if I can fix that (if it isn't already fixed: I'm having trouble reproducing it here, with binutils 2.16.91.0.2...) 
>> > - back in 2.4, tmpfs on /tmp broke mkinitrd since it tried to loop-mount >> > the new initrd, which was in /tmp. And loop-mount over tmpfs didn't work. >> >> Ah, well, I never use initrd if I can avoid it, and a bug in one tool is >> a reason to *fix that tool*, not rejig the whole damn system. > > I agree initrd is kinda pointless, but initramfs isn't. The kernel guys are > moving towards initramfs being required someday. I didn't properly understand the difference (using rootfs versus not) when I wrote that email. I do now thanks to your nice little document recently mentioned on l-k, and I have to agree that initramfs seems a whole lot nicer. > These are still nebulous > future plans with no actual deadline, but they include moving to dynamically > assigned major/minor numbers (so you need something like udev to > populate /dev), How terrible. :) > having userspace find and mount the real root partition (so > when you're booting from a USB key but your root partition lives on an NFS > server that in order to access it you have to dhcp yourself an address, > nslookup the server name, and then login with a public key from said USB > stick...) All the various partitioning schemes could be moved over to device > mapper. And so on. It's a little annoying for those of us *without* horribly complex boot schemes; I guess there'll be a `default initramfs' which replicates the current behaviour. > They'd proposed a serious kernel crapectomy "for 2.7" back before 2.7 got put > on indefinite hold. How they're rolling it out now, we dunno. They seem to > be happy chewing their current mouthful, at the moment... Yeah, the change rate of the kernel doesn't exactly seem to be at an all-time low :) >> What does it look like in this brave new world of shared >> subtrees? > > I had this discussion on the kernel list a week or so back: namespaces are > reference counted so as soon as the last process that can see a mount goes > away, umount happens. 
This means that umount -a should only zap everything > in your current namespace, so that after init kills all sub-processes it can > then run umount -a for pid 1, life is good. Yeah, but what does /proc/mounts say? Does it show only references that the querying process can see? ... actually, hey, yes, it's a symlink to /proc/self/mounts, so it does the right thing already. Nifty. >> Obviously /etc/mtab *must* be a symlink to /proc/mounts, now, >> only oops that breaks the quota tools...) > > I rewrote busybox mount so that things work properly with /proc/mounts. And I > vaguely remember coming up with an in-house patch to fix the quota tools > (they were upset by rootfs) something like four years ago. Please feed it upstream to the quota tools people before I have to write the same damn patch ;))) >> I HATE FSCKING MTAB >> >> (in three-part harmony, probably) > > Everybody hates /etc/mtab. It doesn't work if you chroot. It can't handle > --bind or --move mounts... Just symlink it to /proc/mounts and recognize > that any tool that can't handle that is a buggy tool that needs to be fixed. Well, ideally the kernel should allow mount(2) to feed it *arbitrary* options in the `data' argument, reflecting those it doesn't understand back into /proc/mounts. That would avoid breaking the quota tools and, um, whatever else depends on this (I've seen distributed administration tools that mark up filesystems with custom options in the expectation that they'll land in mtab, too: I think there's some automated fstab editor in HAL that does the same thing). [...] > First time I've heard of the tool, but then back under 2.4.7 I remember I had > rsync regularly triggering the OOM killer. Not because rsync was leaking, > but because the servers backing up only had 128 megs of memory and the > balancing was _terrible_ so the dentry cache and page cache would squeeze out > anonymous pages to the point where rsync itself got OOM killed... Ick, yes. 
I switched to 2.4 around that time and switched right back to 2.2 again because the MM had so many problems... > People who want truly insane amounts of memory these days (often for graphics > or video editing) tend to mmap their data files directly and work in there. > Once again rendering insane amounts of swap less useful... Not necessarily, given the existence of MAP_PRIVATE. (The problem with working directly in data files without MAP_PRIVATE is that if you lose power at *any* time, your data file is toast.) > If we had a "treat this like it's on tmpfs" madvise, that would be ideal... Agreed. Combine that with per-user filesystems and, well, give every user a small tmpfs mount of their own on /tmp and let apps use suitably advised mmaps for everything else :) (security holes? but other users can't *see* that /tmp, which is why it's mode 640, just like their $HOME...) -- `Y'know, London's nice at this time of year. If you like your cities freezing cold and full of surly gits.' --- David Damerell |
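To make the MAP_PRIVATE point concrete, here is a small sketch in Python. It relies on mmap.ACCESS_COPY, which as far as I know is implemented as a MAP_PRIVATE mapping on Unix: writes land in copy-on-write pages that belong to the process and are never written back to the file. The helper name scribble_privately and the demo file contents are invented for the example.

```python
import mmap
import os
import tempfile


def scribble_privately(path, offset, data):
    """Overwrite part of a private (copy-on-write) mapping of `path`.

    Returns the modified in-memory view as bytes. The file on disk is
    left untouched because the mapping is private to this process.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
            m[offset:offset + len(data)] = data
            return m[:]


def demo():
    """Show that edits in the private mapping never reach the file."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")
        os.close(fd)
        view = scribble_privately(path, 0, b"HELLO")
        with open(path, "rb") as f:
            on_disk = f.read()
        return view, on_disk
    finally:
        os.remove(path)


if __name__ == "__main__":
    view, on_disk = demo()
    print("in memory:", view)
    print("on disk:  ", on_disk)
```

The flip side is that the dirtied pages are anonymous memory, so they have to live in RAM or swap, which is why private mappings keep swap relevant for this kind of workload.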