[Syllable-kernel] more on EROS and KeyKOS..

Since I have read a lot of the EROS documentation and found it pretty 
cool, I thought I'd write a quick summary.  As a working open-source OS, 
it is still basically limited to "Hello, world", but the concepts behind 
EROS are very cool and were in fact proven by KeyKOS, which was a real 
commercial OS for S/370 mainframes sold by a company called Key Logic 
during the 1980s.  You can read more on EROS concepts here:

http://www.eros-os.org/devel/00Devel.html

The two best things about EROS/KeyKOS:

  * The default access to any resource is no access (EPERM) unless the 
process has explicit permission.  That permission comes in the form of a 
key held in OS-controlled memory which can be manipulated or passed 
around to other processes via a secure IPC mechanism.  Keys are used to 
allocate all sorts of resources, including memory pages (up to the 
maximum VM size allowed to the process), network sockets, read-only or 
r/w access to a "file" (really, an anonymous shared region of VM), or 
even classes of CPU priority.  Each end of a pipe acts like a 
"capability" in that any process can pass its end of the pipe to any 
other process and, in so doing, transfer the "privilege" of that stream. 
This allows, for example, a debugger or tracing utility to function by 
inserting itself between two processes having a "conversation", and it 
would be impossible for either side to detect that there was an 
eavesdropper.  The middle process could pass data (messages and keys) 
across the pipe unchanged, or it could alter them (it couldn't send any 
keys it didn't already possess or receive, of course) or save the data 
and keys for later.  That sounds like a huge security flaw but there is 
a reason behind it which I don't recall at the moment.  :-) Anyway, I 
believe it is different from the Linux/BSD passing of unspoofable 
uid/gid/pid identities, which is after all a very primitive form of 
access control that ties a process to a single user and a fixed number 
of groups.  Here, instead, you have a bunch of "tokens" that are tied to 
particular resources, of which a single process can collect a large 
(but not too large, for performance reasons) number.  In KeyKOS/EROS, 
you can store a small number of keys within the process's key 
"registers", or swap keys between the registers and special key pages 
in virtual memory that are held by the kernel and only accessible 
through yet another capability key.

  * There is no filesystem!  None at all.  Better yet, the system is 
completely checkpointed every 30 seconds so you can literally pull the 
plug on the PC and when you reboot you have lost at most 30 seconds 
worth of work.  Sound too good to be true?  KeyKOS made the bold 
decision to use almost the entire disk partition for a paging file. 
Every 30 seconds the system is paused, all of the memory pages are 
marked read-only, and the processes are resumed while the checkpoint 
thread simultaneously copies the contents of RAM to the checkpoint 
journal.  The running processes slow down a bit during the checkpoint as 
every memory write operation triggers a page fault and the kernel has to 
block the app until the page becomes ready for r/w, but after each page 
is written to the journal, it can be returned to its previous read/write 
privilege and the system returns to full speed.  Presumably 
higher-priority regions of memory, e.g. a video streaming buffer, could 
be handled without blocking the app:  swap out some other page (one 
that is already checkpointed), mark the original page r/w, and schedule 
the checkpoint of that page from the r/o copy.  Meanwhile, any other 
page-outs would go to the checkpoint instead of to the normal 
destination on the disk, so there is always at least one clean 
checkpoint that can be booted into.
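
To make the key model from the first bullet a bit more concrete, here is a 
toy sketch in Python.  It is NOT how EROS or KeyKOS actually implement 
keys:  the names Kernel, Key, grant and send_key are all made up for 
illustration, and a real kernel enforces the isolation in hardware rather 
than by convention.  It only shows the default-deny rule and the fact 
that a middle process can forward only the keys it holds.

```python
class Key:
    """Hypothetical capability: names one resource plus the rights it grants."""
    def __init__(self, resource, rights):
        self.resource = resource
        self.rights = frozenset(rights)


class Kernel:
    """Toy kernel: keys live here, processes only ever hold slot numbers."""
    def __init__(self):
        self.files = {}   # resource name -> contents
        self._keys = {}   # process name -> list of Key (the "key registers")

    def spawn(self, proc):
        self._keys[proc] = []   # a new process starts with no keys at all

    def grant(self, proc, key):
        self._keys[proc].append(key)
        return len(self._keys[proc]) - 1   # slot index the process may use

    def _check(self, proc, slot, right):
        keys = self._keys[proc]
        # Default is no access: a missing key or missing right means EPERM.
        if not 0 <= slot < len(keys) or right not in keys[slot].rights:
            raise PermissionError("EPERM")
        return keys[slot]

    def read(self, proc, slot):
        return self.files[self._check(proc, slot, "r").resource]

    def write(self, proc, slot, data):
        self.files[self._check(proc, slot, "w").resource] = data

    def send_key(self, sender, slot, receiver):
        keys = self._keys[sender]
        # A process can only pass on keys it already possesses.
        if not 0 <= slot < len(keys):
            raise PermissionError("EPERM")
        return self.grant(receiver, keys[slot])


def demo():
    k = Kernel()
    k.files["notes"] = "hello"
    for proc in ("editor", "tracer", "viewer"):
        k.spawn(proc)
    rw = k.grant("editor", Key("notes", {"r", "w"}))
    try:
        k.read("viewer", 0)      # viewer holds no key yet
        denied = False
    except PermissionError:
        denied = True
    t = k.send_key("editor", rw, "tracer")   # tracer sits in the middle...
    v = k.send_key("tracer", t, "viewer")    # ...and forwards the key unchanged
    k.write("viewer", v, "hello, world")
    return denied, k.read("editor", rw)
```

Calling demo() shows the viewer being refused before it receives the key 
and succeeding afterwards, with the tracer acting as the go-between.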

Instead of storing user data in a traditional filesystem, you have one 
or more database server processes that hold the ENTIRE filesystem in 
virtual memory.  Since Syllable currently only runs on 32-bit 
architectures and we already have hard drives (and individual files!) 
much bigger than 4GB, this is not exactly practical for today's large 
media files and big hard drives.  OTOH, the Athlon 64 and PowerMac G5 
and upcoming Intel x86-64 chip are all 64-bit so perhaps this 
architecture would begin to make more sense in another five years.
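
The address-space arithmetic behind that point is quick to check.  The 
snippet below is just a back-of-the-envelope sketch:  the 6 GiB file is a 
hypothetical example, and real 64-bit chips typically implement fewer 
virtual address bits than the full 64, so treat the 64-bit figure as an 
upper bound.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def address_space_bytes(bits):
    """Bytes addressable with a flat pointer of the given width."""
    return 2 ** bits


def fits(size_bytes, bits):
    """True if an object of size_bytes could be mapped whole into the space."""
    return size_bytes <= address_space_bytes(bits)


big_file = 6 * GIB  # hypothetical media file, larger than 4 GiB

print(address_space_bytes(32) // GIB)            # 4
print(fits(big_file, 32), fits(big_file, 64))    # False True
```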

Anyway, if the "big VM filestore" is going to be no different in 
practice from what we have now with AFS, and considering that AFS would 
likely have much better performance and reliability due to being 
optimized for on-disk layout, I would of course prefer to stick with AFS 
for our primary file storage, especially since it's already been written 
and works pretty well, and the POSIX semantics are well understood. 
OTOH, the KeyKOS concept of checkpointing the entire system state is 
still cool even in conjunction with a traditional FS, when you consider 
that you would NEVER have to worry about losing your e-mail composition 
or spreadsheet or source code file because of a power outage or system 
crash when you hadn't saved. And no need even to have a separate 
"Hibernate" option, just pull the plug!  The alternative is of course to 
build an explicit "auto-save" feature into every application on the 
desktop, and that is not going to happen, nor is it nice to require 
all developers to handle this explicitly when the entire system could 
be checkpointed atomically, once, in the kernel.

So the idea of checkpointing a very large virtual memory would still be 
a very cool thing for Syllable to be able to do at some unknown time in 
the future, even in addition to the traditional Linux/BeOS paradigm of 
having a separate filesystem accessible through read/write system calls 
(or mmap) with files and disks that can be larger in size than the 
address space.  For implementing something like the registrar or 
d-bus, you then don't have to figure out how to serialize your in-memory 
data structures to persistent storage:  you simply do nothing (or hint 
to the kernel what you are doing with certain pages, or when you're 
doing an operation that must be atomic and can't be interrupted half-way 
by a checkpoint) and rely on the system to keep your in-memory trees 
and hashtables checkpointed.  You can afford to give all running 
processes the chance to give their approval before atomically 
checkpointing the RAM since it only has to run approximately every 30 
seconds.
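
Here is a rough simulation of how the pieces described so far might fit 
together:  pages are write-protected at checkpoint time, a write to a 
protected page forces that page into the journal first, and a checkpoint 
is deferred while a process is inside an atomic section.  This is only a 
sketch under my own assumptions.  The class and method names 
(CheckpointedMemory, begin_atomic, checkpoint_step and so on) are 
invented, the "atomic" hint is modeled as a simple counter, and the 
blocking page fault is modeled as a synchronous copy.

```python
class CheckpointedMemory:
    """Toy model of write-protect checkpointing (hypothetical API)."""

    def __init__(self, num_pages):
        self.pages = [0] * num_pages           # live RAM contents
        self.read_only = [False] * num_pages   # per-page write-protect bit
        self.journal = {}                      # pages saved by the checkpoint in progress
        self.committed = list(self.pages)      # last complete checkpoint
        self.in_progress = False
        self.atomic_depth = 0                  # >0 means "don't checkpoint now"

    def begin_atomic(self):
        self.atomic_depth += 1

    def end_atomic(self):
        self.atomic_depth -= 1

    def start_checkpoint(self):
        """Mark every page read-only, unless someone asked us to wait."""
        if self.atomic_depth or self.in_progress:
            return False
        self.read_only = [True] * len(self.pages)
        self.journal = {}
        self.in_progress = True
        return True

    def _journal_page(self, n):
        self.journal[n] = self.pages[n]   # save the pre-write contents
        self.read_only[n] = False         # page is r/w again

    def write(self, n, value):
        if self.read_only[n]:
            # "Page fault": the page must reach the journal before it changes.
            self._journal_page(n)
        self.pages[n] = value

    def checkpoint_step(self):
        """Background work: journal one protected page; False when finished."""
        if not self.in_progress:
            return False
        for n, protected in enumerate(self.read_only):
            if protected:
                self._journal_page(n)
                return True
        # Every page is in the journal, so the checkpoint becomes the clean one.
        self.committed = [self.journal[n] for n in range(len(self.pages))]
        self.in_progress = False
        return False

    def crash_and_reboot(self):
        """Pull the plug: RAM comes back as of the last complete checkpoint."""
        self.pages = list(self.committed)
        self.read_only = [False] * len(self.pages)
        self.in_progress = False
```

A process that wraps a multi-page update in begin_atomic()/end_atomic() 
never sees a checkpoint taken half-way through it, and writes made after 
start_checkpoint() do not leak into the snapshot being journaled.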

Another follow-on consequence of this design is that you NEVER reboot 
the OS.  You might have to reboot the _computer_ after a power crash or 
to install a new kernel or new drivers but the cool thing is that after 
rebooting, the new kernel loads in the EXACT same big VM image from the 
previous checkpoint, updating any in-memory structures as necessary if 
the kernel was upgraded.  The flip side of this is that if something 
gets really screwed up, it is a pain to do a "hard reset" and the 
installation process is also a lot different.  Again, we avoid all of 
that nonsense by sticking with the conventional wisdom and retaining the 
ability to do a conventional boot from a conventional filesystem, even 
though we will only be suspending or hibernating, never shutting down or 
cold-booting, under normal circumstances!

-Jake