> As far as I can tell, there are three primary things that make SF slow:
> 1. It's implemented in Python. If it were rewritten mostly or entirely in C,
> it would be about as fast as strace (a bit faster to somewhat slower,
> depending on what tricks were in use).
> I'm surprised that the problem is so bad, and I'm reluctant to go to C,
> but there doesn't seem to be any other way to really fix the performance
> problem. I would like to keep the trick interface accessible to Python
> (probably in addition to C), but I believe this should be possible without
> a reduction in performance.
How much would you have to rewrite?
There's something amusing and twisted about the notion of a Python program
(very high level) using all sorts of weird ptrace techniques to swizzle
system calls around (very low level).
> 2. The ptrace interface requires four process context-switches per syscall.
> I haven't tried to measure this, but this probably accounts for a fair
> amount of the feeling of sluggishness of strace, and SF suffers from it
> too. This could be improved by, say, having a mask in the kernel of the
> syscalls being traced, so that the context switches would not be required
> for untraced calls. (I wouldn't look forward to trying to get this into
> the kernel, though.)
"Medusa DS9" (ick) seems to be heading in this direction.
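For a rough feel for the baseline that those four context switches get
multiplied into, here's a crude micro-benchmark (Python; numbers are
machine-dependent and the 100,000-iteration count is arbitrary):

```python
import os
import time

def bench(fn, n=100_000):
    """Time n calls of fn; return the average per-call cost in microseconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1e6

syscall_us = bench(os.getpid)      # one kernel round-trip per call
user_us = bench(lambda: None)      # pure user-space call, for comparison

print(f"syscall: {syscall_us:.3f} us/call, user-space: {user_us:.3f} us/call")
```

Running the same loop under strace typically inflates the syscall figure
by an order of magnitude or more; that gap is the context-switch tax the
quoted point is about.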
> Okay, so suppose SF becomes as fast as strace. I'm interested to hear your
> ideas about killer apps. What would you do with SF then?
Well, um, uh. Start here:
> > (Location transparency! There is one computer, with lots of devices,
> > which we all happen to have accounts on... software should never be
> > installed, users shouldn't have to manage the fact that their hard drive
> > is mostly a cache of network data... oops, I'm foaming at the mouth again.)
So. Excuse me while I rant about far-off impossibilities.
Many of these scenarios may actually require speed significantly in
excess of what even 'strace' provides.
I wish there were a distributed filesystem that worked, so that rather than
using FTP or SCP to send files and HTTP to retrieve them, and rather than
putting files into little "bundles" (tarballs, binary packages, ...) and
forcing users to move those bundles around, unpack them when they need them,
delete them when they're done, and rather than building umpteen little
badly-designed frameworks for notification and update and caching and
mirroring, we could just have one big seamless global filesystem.
Then, if I discover a program called 'fubtersugue' that looks really cool,
and I want to try it, I don't have to point a Web browser at
http://fubtersugue.sourceforge.net/ and download a binary package and
become root and (hold my breath and) install that package and then see
what binaries that package installed and ...
I would just say
% /sourceforge.net/fubtersugue/$ARCH/bin/fs --help
It would take a little while the first time (loading over the Internet), but
then it would be fast, at least until my system decided my disk space was
better spent on something else. (My system does a fine job of managing
the first five or six levels of caching, from register allocation in gcc
down to VM and I/O page buffering in the kernel; why is the last level of
caching -- my hard disk vs. network resources -- suddenly up to me?)
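To make the cache framing concrete, here's a toy sketch (every name here is
made up) of the semantics that last level of caching would have, with a dict
standing in for my hard disk:

```python
cache = {}   # stand-in for "my hard drive is mostly a cache of network data"

def open_global(path, fetch):
    """Hypothetical open() on the one big seamless global filesystem:
    global-namespace paths are served from the local cache, going over
    the network only on a miss."""
    if path not in cache:
        cache[path] = fetch(path)   # slow the first time (loading over the net)
    return cache[path]              # fast afterwards, until evicted

fetched = []
def fake_fetch(path):
    """Pretend network fetch; records each round-trip it has to make."""
    fetched.append(path)
    return b"\x7fELF..."

open_global("/sourceforge.net/fubtersugue/bin/fs", fake_fetch)
open_global("/sourceforge.net/fubtersugue/bin/fs", fake_fetch)
print(len(fetched))   # -> 1: the second open never touches the network
```

Eviction (the system deciding my disk space is better spent on something
else) would just be deletion from the cache; correctness comes from being
able to re-fetch.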
This is not easy, and lots of people have tried (and almost all failed);
each system learns from the last, but it's slow going. (Amoeba, Plan 9,
NFS, AFS, DFS, CODA, InterMezzo, SFS, ...) I think one big reason
development in this area is so slow is that creating a filesystem that
works transparently with any application currently requires Deep Kernel
Magic to implement, and when the implementation is flaky, you're completely
hosed. (Anyone who has struggled with NFS knows this; when things start
to go wrong, it's not long before your system is close to useless as more
and more programs get stuck in the "tar pit" of a nonresponding mount.
And NFS is by far the most mature and stable such filesystem...)
Furthermore, Deep Kernel Magic is by its nature a system-wide, root-user,
administrative-privilege sort of thing. You can't just try running it as
a user; you have to commit your system to it in some way.
SF offers an escape, by letting the filesystem be written with ordinary
user-mode tools, be run by ordinary user-mode users, and support "soft
failure". Neither I nor SF has magic solutions to all the hard problems
that have plagued universal distributed filesystems in the past... but
the hope is that, by breaking this domain out of kernel-land (where it's
really, really hard to make any progress), things would go better.
Systems should have global undo/redo. Ideally, this would totally subsume
all application undo/redo (though that's terribly ambitious!). Right now,
using an operating system is much like using an application before they
invented undo. Some operations are "dangerous" and the system confirms that
you Really Want to do them, but of course you get so used to the confirmation
step that it becomes useless, and it's not always possible to predict what a
"dangerous" operation is. So, you have to tread warily.
With a modern application, there's no confirmation step to get in your way
and no attempt to segregate operations into "dangerous" and "safe". Instead,
you can do anything you want, and if you don't like it, you just undo it.
This gives you a lot more freedom to explore; in fact, many users
end up using undo/redo as a kind of navigation (which brings up more UI
issues that are out of scope here).
The OS should be the same way. No more "rm -i", no more "noclobber", just
a big "oops" button somewhere. (There are lots and lots of finicky details
here. What if you want to undo something that touched the network? What
if there are multiple users on the system?)
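As a sketch of the mechanism (all names hypothetical), the core of that
"oops" button is just an inverse-operation log, the same command-log trick
applications use for undo:

```python
class UndoLog:
    """Minimal undo/redo: every operation logs its inverse, and the big
    "oops" button just replays inverses in reverse order."""
    def __init__(self):
        self._undo, self._redo = [], []

    def do(self, forward, inverse):
        forward()
        self._undo.append((forward, inverse))
        self._redo.clear()            # a new action invalidates the redo branch

    def undo(self):                   # the "oops" button
        forward, inverse = self._undo.pop()
        inverse()
        self._redo.append((forward, inverse))

    def redo(self):
        forward, inverse = self._redo.pop()
        forward()
        self._undo.append((forward, inverse))

# Toy "filesystem": rm moves the file aside instead of destroying it,
# which is what makes an inverse possible at all.
fs = {"thesis.tex": "important data"}
trash = {}
log = UndoLog()
log.do(forward=lambda: trash.update({"thesis.tex": fs.pop("thesis.tex")}),
       inverse=lambda: fs.update({"thesis.tex": trash.pop("thesis.tex")}))
log.undo()    # oops -- the file is back, no "rm -i" needed
print(fs)     # -> {'thesis.tex': 'important data'}
```

The finicky details above (network effects, multiple users) are exactly the
operations for which no honest inverse exists; this sketch dodges all of them.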
Matt Kimball's "libundo" (seen it?) is tangentially related here.
You've mentioned this one yourself, in the past.
> Those are probably the best ones I know about. If you haven't used Python
> before, well, it's a gem hiding in plain sight. 'lsof' is kind of
> interesting. 'vnc' is cool, you seem to have found that.
Sure. Everyone knows about those. :)
A friend of mine once started writing a Python script, similar in spirit
to 'lsof', that walked Linux' /proc tables to find open handles. It
differed in that, rather than simply dumping open files, it would figure
out what all the connections *between* programs were (i.e. it could tell
that the two endpoints of a pipe or socket were connected to each other),
and it would display a diagram showing the relationships between all the
programs running on your system (or across multiple systems!).
It was pretty spiffy, but he never really finished it.
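The matching trick was presumably small: on Linux, the /proc/<pid>/fd
symlinks for pipe ends read back as "pipe:[inode]", so any two processes
sharing an inode hold ends (or inherited copies) of the same pipe. A sketch
of just the matching step, over canned data (the /proc walking itself is
omitted, and the sample values are invented):

```python
import re
from collections import defaultdict

def pipe_connections(fd_tables):
    """Given {pid: [symlink targets from /proc/<pid>/fd]}, group the
    processes that share each pipe inode, i.e. the connected endpoints."""
    by_inode = defaultdict(set)
    for pid, targets in fd_tables.items():
        for target in targets:
            m = re.match(r"pipe:\[(\d+)\]", target)
            if m:
                by_inode[m.group(1)].add(pid)
    # only inodes seen in more than one process represent a connection
    return {inode: sorted(pids)
            for inode, pids in by_inode.items() if len(pids) > 1}

# Canned example: the sort of thing readlink() on /proc/<pid>/fd/* returns.
tables = {
    101: ["/dev/pts/0", "pipe:[4242]"],   # e.g. 'ls' writing into a pipe
    102: ["pipe:[4242]", "/dev/pts/0"],   # e.g. 'less' reading from it
    103: ["/var/log/syslog"],
}
print(pipe_connections(tables))   # -> {'4242': [101, 102]}
```

Sockets work the same way via "socket:[inode]"; the diagram part is then
just graph layout over these edges.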
> It would be kind of nice if there were some sort of organized catalog of the
> wheat and the chaff, but I don't know how it could be maintained. Presence in
> the debian distribution seems to correlate reasonably well with being
> wheat.
A futures market, perhaps?
> > OK, here's a *really* random question...
> > Could you (or some sufficiently wizardly hacker) build "SF" *into* an
> > application process? "Of course not," you splutter, but wait.
> Dan, this is a very interesting idea and very original. After considering it
> a while, though, I see some pretty major downsides:
> - That "trap" instruction (INT 0x80, I think) is one (32-bit) word, which can
> appear elsewhere in the program text (e.g., as an immediate operand).
> Sifting "real" INTs from other occurrences looks very hard, maybe
> practically undoable. You'd have to be able to check text, including text
> introduced later by dynamic loading, and you'd have to either change all INT
> words (real and otherwise), in which case you have a problem when non-real
> occurrences get used (because they're an operand to another instruction), or
> you have to be able to guarantee that the program can never jump to a
> non-real occurrence. I think this is a show-stopper.
Bummer that it's only a single word. What is the opcode representation?
Is it likely to occur in an immediate operand more often than the 1 in 2^32
you'd expect? In the worst case you have to do basic-block tracing (see below).
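For what it's worth, INT 0x80 encodes as the two-byte sequence CD 80
(opcode CD, immediate operand 80), not a full 32-bit word, so accidental
collisions should run nearer 1 in 2^16 than 1 in 2^32, which makes the
sifting problem worse, not better. A sketch of the naive scan, over a
canned byte string (names made up):

```python
INT_0x80 = b"\xcd\x80"   # opcode CD, immediate operand 80

def find_candidate_traps(text):
    """Return every offset where the INT 0x80 byte pattern appears.
    Without tracing execution from a known entry point, there is no way
    to tell a real trap from the same bytes sitting inside another
    instruction's immediate operand."""
    offsets, start = [], 0
    while (i := text.find(INT_0x80, start)) != -1:
        offsets.append(i)
        start = i + 1
    return offsets

# A real syscall stub (mov eax, 1; int 0x80) followed by the same two
# bytes buried inside a mov immediate (mov eax, 0x80cd):
code = b"\xb8\x01\x00\x00\x00\xcd\x80" + b"\xb8\xcd\x80\x00\x00"
print(find_candidate_traps(code))   # -> [5, 8]
```

Offset 5 is a real trap; offset 8 is a false positive, exactly the
ambiguity the quoted objection raises.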
> - It seems like you have to draw a lot more support stuff into the traced
> process if you want to do this, increasing the chances that you'll trip the
> Heisenberg effect (program behavior will differ). Probably this is the
> reason no one's ever seriously worked on an in-process debugger for Unix,
It is certainly true that the in-process trusted code is living in a
pretty hostile environment. It can't exactly just call printf(). I
imagine that it would mostly exist to decide when something needed to be
"handled" by a separate process, and would use some IPC if so.
> > In the limit, we have VMware, and we know that works, so it must be
> > possible. The question is just how painful it would be. :)
> Hmm. I'm not sure exactly how VMware does its thing, though I believe it
> requires a special kernel module. There might be something interesting here.
I think the kernel module is mostly to make networking go, though I'm not
sure what exactly 'vmmon' does.
VMware's exact techniques are proprietary, of course, but the basic
techniques can be gleaned by reading the papers published by the company's
founders when they were grad students. (I bet they also have patents on
file, though I haven't looked.)
AIUI, the basic idea in VMware is to find all instructions that could
possibly give away the fact that the software is running in a virtual box
and not on "real metal". (Some processors have no such instructions, and
are "fully virtualizable", but we're not so lucky with ia32.) There are
quite a few such instructions, and the problems you point out above come
up for every one of them.
So, suppose you have a bunch of code in memory (e.g. the Phoenix BIOS which
is present when VMware boots), but you don't know anything about any of it.
You do know the entry point. So you start there and scan instructions
looking for a JMP (conditional or otherwise) that could transfer control
somewhere else. (You also look for non-virtualizable instructions and
replace them with breakpoint traps of some kind.)
When you get to a JMP, you examine the target. If it's a fixed address
(by far the most common case), you start examining instructions at the
target. If the JMP is conditional, you now have to follow two possible
execution paths. You end up tracing out the graph of possible control flow
through the program. If the JMP target isn't a fixed address but instead a
computed value (this is relatively rare; switch/case JMP tables are the most
common case), you replace it with a trap.
Eventually, you've characterized every possible place the instruction pointer
could possibly reach, examined every instruction that could possibly be
executed, and placed breakpoint traps everywhere the instruction pointer
could possibly "jump the tracks" and go somewhere wild. You mark all those
pages executable but read-only.
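The scan described above can be sketched against a toy instruction set
(decoding real ia32 is the genuinely hard part, elided here; all names
and the sample program are made up):

```python
def scan(program, entry=0):
    """Trace every statically reachable instruction from the entry point.
    Returns the set of reachable addresses and the addresses that need
    breakpoint traps (computed jumps, whose targets aren't known until
    run time)."""
    reachable, traps, worklist = set(), set(), [entry]
    while worklist:
        pc = worklist.pop()
        if pc in reachable or pc >= len(program):
            continue
        reachable.add(pc)
        op, *args = program[pc]
        if op == "jmp":            # unconditional JMP to a fixed address
            worklist.append(args[0])
        elif op == "cjmp":         # conditional: follow both possible paths
            worklist.extend([args[0], pc + 1])
        elif op == "ijmp":         # computed target: replace with a trap
            traps.add(pc)
        elif op != "halt":         # ordinary instruction: fall through
            worklist.append(pc + 1)
    return reachable, traps

prog = [
    ("op",),          # 0
    ("cjmp", 4),      # 1: control may go to 4 or fall through to 2
    ("op",),          # 2
    ("jmp", 6),       # 3
    ("ijmp",),        # 4: e.g. a switch/case JMP table -> trap
    ("op",),          # 5: statically unreachable, never scanned
    ("halt",),        # 6
]
reachable, traps = scan(prog)
print(sorted(reachable), sorted(traps))   # -> [0, 1, 2, 3, 4, 6] [4]
```

The worklist doubles as the "two possible execution paths" bookkeeping:
a conditional JMP just pushes both successors.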
Then, you let the program run.
When it ever runs into a trap, you emulate the instruction that the trap
replaced. If the instruction causes execution to transfer someplace you
haven't already checked, then you get to start the checking process all over
again. If the process ever writes to a page you've scanned, the write
faults (because the page is read-only); you mark the page non-executable,
and re-scan it the next time execution jumps there.
In practice, I'm glossing over all sorts of details. First of all, you do
this lazily, on an incremental basis, marking pages executable as you go
(and catching the faults which tell you when you have to do more processing).
There's also the issue that you're modifying code, but the app may try to
read its code and be confused that it's changed (so you actually modify a
copy of the code, which you also need to relocate, but luckily you're
already tracing all the JMPs, so relocation is easy enough). You really
want to do this *fast*, so there's lots of hand-tuned assembly code, lookup
tables and packed data structures.
But the end result really is very, very fast. VMware's slowness is almost
entirely due to hardware emulation; the actual JMP tracing adds trivial
overhead to most programs. (If you run a CPU-bound program under VMware,
it runs at nearly full speed.)
So... you could imagine doing this all within a process. Rather than
emulating physical machine hardware, you're just emulating the kernel
trap interface. Simpler than VMware, but still more complex than anything
I'm going to write any time soon :).