From: Jake H. <jh...@po...> - 2005-01-28 16:32:06
Since I have read a lot of the EROS documentation and found it pretty cool, I thought I'd write a quick summary. As a working open-source OS, it is still basically limited to "Hello, world", but the concepts behind EROS are very cool and were in fact proven by KeyKOS, which was a real commercial OS for S/370 mainframes sold by a company called Key Logic during the 1980s. You can read more on EROS concepts here:

http://www.eros-os.org/devel/00Devel.html

The two best things about EROS/KeyKOS:

* The default access to any resource is no access (EPERM) unless the process has explicit permission. That permission comes in the form of a key held in OS-controlled memory, which can be manipulated or passed around to other processes via a secure IPC mechanism. Keys are used to allocate all sorts of resources, including memory pages (up to the maximum VM size allowed to the process), network sockets, read-only or r/w access to a "file" (really, an anonymous shared region of VM), or even classes of CPU priority.

Each end of a pipe acts like a "capability" in that any process can pass its end of the pipe to any other process and, in so doing, transfer the "privilege" of that stream. This allows, for example, a debugger or tracing utility to function by inserting itself in between two processes having a "conversation", and it would be impossible for either side to detect that there was an eavesdropper. The middle process could pass data (messages and keys) across the pipe unchanged, or it could alter them (it couldn't send any keys it didn't already possess or receive, of course), or save the data and keys for later. That sounds like a huge security flaw, but there is a reason behind it which I don't recall at the moment.
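To make the key idea a bit more concrete, here is a toy sketch in Python. It is not how EROS or KeyKOS actually implement keys; every name in it (Key, Process, AccessDenied, send, forward) is made up for illustration, and it assumes that passing a key copies it, so the sender keeps its own.

```python
class AccessDenied(Exception):
    """Plays the role of EPERM: with no suitable key, the answer is no."""


class Key:
    """A reference to one resource plus the rights it grants."""

    def __init__(self, resource, rights):
        self.resource = resource          # here just a dict with a "data" entry
        self.rights = frozenset(rights)   # e.g. {"r"} or {"r", "w"}


class Process:
    def __init__(self, name):
        self.name = name
        self.keys = []    # keys held on behalf of this process
        self.inbox = []   # (message, key) pairs delivered by IPC

    def holds(self, key):
        return any(k is key for k in self.keys)


def read(process, key):
    """Read a resource; allowed only through a held key with the 'r' right."""
    if not process.holds(key) or "r" not in key.rights:
        raise AccessDenied(process.name)
    return key.resource["data"]


def write(process, key, data):
    """Write a resource; allowed only through a held key with the 'w' right."""
    if not process.holds(key) or "w" not in key.rights:
        raise AccessDenied(process.name)
    key.resource["data"] = data


def send(sender, receiver, message, key=None):
    """IPC: deliver a message and optionally pass a key the sender holds."""
    if key is not None:
        if not sender.holds(key):
            raise AccessDenied(sender.name)   # cannot pass what you lack
        receiver.keys.append(key)
    receiver.inbox.append((message, key))


def forward(middle, receiver):
    """A tracer in the middle relays what it last received, unchanged."""
    message, key = middle.inbox.pop()
    send(middle, receiver, message, key)
    return message   # the tracer has seen (and could have saved) the message


# Usage: alice hands a read-only key to bob by way of a tracer.
page = {"data": "hello"}
ro_key = Key(page, {"r"})
alice, tracer, bob = Process("alice"), Process("tracer"), Process("bob")
alice.keys.append(ro_key)
send(alice, tracer, "here is the page", ro_key)
forward(tracer, bob)
```

From bob's side, the message and key look the same whether or not the tracer sat in the middle, which is the eavesdropping property described above. A process that never received the key gets AccessDenied on both read and send.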
:-) Anyway, I believe key passing is different from the Linux/BSD passing of unspoofable uid/gid/pid identities, which is after all a very primitive form of access control: it ties a process to a single user and a fixed number of groups, as opposed to having a bunch of "tokens" that are tied to particular resources and of which a single process can collect a large (but not too large, for performance reasons) number.

In KeyKOS/EROS, you can store a small number of keys within the process's key "registers", or swap keys between those registers and special key pages in virtual memory that are held by the kernel and only accessible through yet another capability key.

* There is no filesystem! None at all. Better yet, the system is completely checkpointed every 30 seconds, so you can literally pull the plug on the PC and when you reboot you have lost at most 30 seconds' worth of work.

Sound too good to be true? KeyKOS made the bold decision to use almost the entire disk partition for a paging file. Every 30 seconds the system is paused, all of the memory pages are marked read-only, and the processes are resumed, while simultaneously the checkpoint thread copies the contents of RAM to the checkpoint journal. The running processes slow down a bit during the checkpoint, as every memory write operation triggers a page fault and the kernel has to block the app until the page becomes ready for r/w. After each page is written to the journal, though, it can be returned to its previous read/write privilege, and the system returns to full speed.

Presumably higher-priority regions of memory, e.g. a video streaming buffer, could be handled, instead of blocking the app, by swapping out some other page (one that is already checkpointed), marking the original page r/w, and scheduling the checkpoint of that page from the r/o copy. Meanwhile, any other page-outs would go to the checkpoint instead of to the normal destination on the disk, so there is always at least one clean checkpoint that can be booted into.
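The checkpoint cycle described above can be sketched as a small simulation. This is only a model of the simple "block the writer until its page is flushed" case, not actual KeyKOS or EROS code; the class and method names are invented, and page contents are assumed to be immutable values (strings) so that storing one in the journal counts as a copy.

```python
class CheckpointedMemory:
    """Toy model: 'RAM' pages, a journal in progress, and the last good one."""

    def __init__(self, pages):
        self.pages = dict(pages)   # page number -> contents
        self.read_only = set()     # pages not yet copied to the journal
        self.journal = {}          # checkpoint currently being written
        self.last_good = {}        # most recent complete checkpoint

    def begin_checkpoint(self):
        """Pause point: mark every page read-only and start a new journal."""
        self.read_only = set(self.pages)
        self.journal = {}

    def _flush(self, page_no):
        """Copy one page to the journal, then restore write access to it."""
        self.journal[page_no] = self.pages[page_no]
        self.read_only.discard(page_no)
        if not self.read_only:
            # Every page is in the journal, so it becomes the clean checkpoint.
            self.last_good = self.journal

    def checkpoint_step(self):
        """Background work: flush one pending page. False when none remain."""
        if not self.read_only:
            return False
        self._flush(min(self.read_only))
        return True

    def write(self, page_no, data):
        """A store to a read-only page 'faults' and waits for its flush."""
        if page_no in self.read_only:
            self._flush(page_no)
        self.pages[page_no] = data


# Usage: a write during the checkpoint does not leak into the snapshot.
mem = CheckpointedMemory({0: "a", 1: "b", 2: "c"})
mem.begin_checkpoint()
mem.write(1, "B")                 # page 1 is journaled as "b" first
while mem.checkpoint_step():
    pass
```

The point of the model is that the journal always holds the contents as of the moment of the pause, and that last_good is only replaced once a journal is complete, so an interrupted checkpoint never overwrites the previous clean one.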
Instead of storing user data in a traditional filesystem, you have one or more database server processes that hold the ENTIRE filesystem in virtual memory. Since Syllable currently only runs on 32-bit architectures and we already have hard drives (and individual files!) much bigger than 4GB, this is not exactly practical for today's large media files and big hard drives. OTOH, the Athlon 64 and PowerMac G5 and upcoming Intel x86-64 chip are all 64-bit, so perhaps this architecture would begin to make more sense in another five years.

Anyway, if the "big VM filestore" is going to be no different in practice from what we have now with AFS, and considering that AFS would likely have much better performance and reliability due to being optimized for on-disk layout, I would of course prefer to stick with AFS for our primary file storage, especially since it's already been written and works pretty well, and the POSIX semantics are well understood.

OTOH, the KeyKOS concept of checkpointing the entire system state is still cool even in conjunction with a traditional FS, when you consider that you would NEVER have to worry about losing your e-mail composition or spreadsheet or source code file because of a power outage or system crash when you hadn't saved. And no need even to have a separate "Hibernate" option, just pull the plug! The alternative is of course to have an explicit "auto-save" feature in every application on the desktop. That is not going to happen, nor is it nice to require that all developers handle this explicitly when the entire system could be checkpointed atomically, once, in the kernel.

So the idea of checkpointing a very large virtual memory would still be a very cool thing for Syllable to be able to do at some unknown time in the future, even in addition to the traditional Linux/BeOS paradigm of having a separate filesystem accessible through read/write system calls (or mmap) with files and disks that can be larger in size than the address space.
But for implementing something like the registrar or d-bus, you don't have to figure out how to serialize your in-memory data structures to persistent storage: you simply do nothing and rely on the system to keep your in-memory trees and hashtables checkpointed. At most, you hint to the kernel what you are doing with certain pages, or when you're doing an operation that must be atomic and can't be interrupted half-way by a checkpoint. You can afford to give all running processes the chance to give their approval before atomically checkpointing the RAM, since it only has to run approximately every 30 seconds.

Another follow-on consequence of this design is that you NEVER reboot the OS. You might have to reboot the _computer_ after a power outage or to install a new kernel or new drivers, but the cool thing is that after rebooting, the new kernel loads in the EXACT same big VM image from the previous checkpoint, updating any in-memory structures as necessary if the kernel was upgraded. The flip side of this is that if something gets really screwed up, it is a pain to do a "hard reset", and the installation process is also a lot different.

Again, we avoid all of that nonsense by sticking with the conventional wisdom and retaining the ability to do a conventional boot from a conventional filesystem, even though we will only be suspending or hibernating, never shutting down or cold-booting, under normal circumstances!

-Jake