From: Paul J. <pj...@en...> - 2001-12-04 04:18:38
On Mon, 3 Dec 2001, Patrick Mochel wrote:

> ... This particular message seems to cover most of
> what I want to discuss.

I am honored, Pat, to have been part of the catalyst for your fine post.
I think you have just brought us a bunch closer to some good answers.

> It is in poor taste, so please forgive me.

Good food for thought is seldom in poor taste.

> http://kernel.org/pub/linux/kernel/people/mochel/device/

Good stuff - thanks. Last week, I pointed out this same web directory to
the gentle readers of this unduly long email thread.

> (1) and (3) are the same thing with a procfs-like interface. The
> topology information is implicit with the device tree. And, when it is
> mounted, seeing the topology is as easy as doing something equivalent to
>
> $ find -type d
>
> There are very few instances when the kernel would need to
> know the topology of the system without any interaction with
> userspace. The one that I am familiar with is system-wide
> power management, and the need to put all the devices to
> sleep before you transition to a low power state.
>
> Even in that case, I don't think that the kernel should concern
> itself with either (a) recursive walks of the device tree, or
> (b) ugly inline code that simulates recursion. But, that's
> a radical POV that I've assumed lately.

Yes - exactly. The kernel should provide a minimal, orthogonal set of
primitive, flexible mechanisms. In this case, it should expose the
devices according to a single model, via a single means. It's up to
userland code to construct purposeful, holistic views of that data.

Just as you see no need for the kernel to have to traverse the topology,
I have yet to see any reason why the kernel scheduler and page allocator
need to know the topology and metrics of the system's cpus and memory.
The kernel should just be told to schedule a certain task on these cpus,
and to satisfy a page fault from these memory blocks.
The motivation for these choices - that these cpus and memory blocks are
close to each other - is not something the kernel scheduler or allocator
needs to know. Which is not to deny that some of the more dynamic
feedback mechanisms used to fine-tune the scheduler and allocator may
require kernel variables, but is to say that simple, relatively static
decisions can be dictated from user land. And since they can be, they
probably should be.

> As far as having a syscall interface - I think all you need
> are read(2) and write(2). If a device driver exports a file
> to get/set information, you can use read() and write() to
> deal with the data from userspace, and the kernel can use
> copy_{to,from}_user() to deal with it.
>
> There is no need to add a global ioctl()-like syscall to deal
> with devices and their associated data.

Ok - I had gone along with Paul Dorwin's proposal for using primarily
system calls (prctl or ioctl or some such) to walk the tree,
node-by-node. However, if you are prepared to deliver a driverfs file
system that delivers the required data, and if that is blessed by Linus,
so much the better.

In a separate message, Neils asked Pat:

    Just for the heck of it, please tell me why procfs is an abomination
    (which I tend to agree with) but driverfs is not? Why this fixation
    on making everything a file system?

The advantages of a file system, especially for non-trivial amounts of
relatively static data that can sensibly be organized in a tree, are
that dang near every possible userland programming environment knows how
to access it, and that the natural naming and permission structure of a
file system view are often a good fit to what's needed. For example, a
special file system encourages publicly visible short string names,
instead of obscure small-nonnegative-integers (recall <major, minor>?),
to name devices and attributes, which is the right direction. Publicly
visible as in "cd /driverfs; ls", which is one of the important apis to
support.
Perhaps access to a file system from the C runtime is the most
cumbersome of any popular environment, but if someone gets off their
duff and provides an open library that converts the file system api into
a C-friendly api, then that issue can be resolved once and for all.
hmmm ... looks below like Pat may just be doing this ...

> > Perhaps my ideal solution would be:
> >
> > 1) A classic system call API to walk the system hardware ...
> >
> > 2) A user daemon that inputs (1), and outputs a special file
> > system view of the topology ...
> >
> > 3) A kernel facility to notify any requesting user process
> > of any changes in (1). ...
> >
> > But this is a rather radical proposal, I'll grant.
>
> It's not radical, and has been somewhat implemented, even before my
> changes. However, I think things can be simplified quite a bit.
>
> The system call in (1) is not needed. Let userspace walk the tree and
> figure everything out, since:
>
> The topology in (2) is gained by simply mounting the driverfs
> 'partition' (it's an in-memory fs, so 'partition' doesn't sound quite
> right..)

Both the C-friendly API and the file system view are desirable, for
different users. The kernel should export one, and the other should be
fabricated in user land. That much we clearly agree on. And since you
are providing what looks to be a promising driverfs file system
mechanism for this, we agree on it all - just like you said. good.

> For hotplug events, there exists on most distributions '/sbin/hotplug'
> for exactly what you describe in (3).

Excellent - thanks for the pointer.

> Every system will have the driver model, and a driverfs hierarchy.
> Linus said he had no problem making it unconditional.

good.

[ ... how devices are identified with structs and ascii strings, not
integers ... ]

I wish I'd said that ;). I think you're right.

[ ... discussion of device name stability across hot swap and reboot ...
]

> However, across reboots, things are much more likely to show up in the
> same place. If a user space program saved the device tree before going
> down, it could feed that back into the kernel on system start up and
> recreate the device tree. Of course, if there was a hardware reconfig,
> you would have to trigger a re-probe, but that shouldn't be that hard.

If the system's device naming mechanism is going to deal with this, I
suspect that this means a user level program, in early boot, after /
(root) is mounted, that can perform primitive pattern matching on
various device attributes, in order to identify which is which, and then
assign them their persistent names.

Better to punt, I suspect. Devices (or parts thereof) that really care
about this, such as file system partitions, can develop their own
mechanisms, such as file system labels, in order to cope with unstable
device names.

> Most languages have the ability to access shared libraries. There
> should be one created for accessing the device tree. It should be as
> simple and extensible as possible. And, there should be a wrapper
> application for it that allows languages that cannot easily access
> shared library routines (shell scripts) to use its functionality.

Bingo. And a Perl module. And a Python module.

> http://kernel.org/pub/linux/kernel/people/mochel/powerctl/

cool.

> Except for the things I mentioned above, the common hierarchial
> view is a good and needed thing.

Hmmm ... "hierarchical" ... How do you handle systems that are ring
shaped or hypercubes or some other shape not isomorphic to a rooted
tree?

> For those reasons, I would really like to see support for CPUs and
> memory banks be integrated into the driver model and filesystem.

Please do - I vote yes.

> > Would a more generic named attribute mechanism be appropriate?
> > Perhaps a linked list of structs hanging off each object, where
> > each struct has three members - a label (nul-term string), a
> > type and a value of that type, for type one of string or int.
> > Or would that be overdesign?
>
> Sort of like the Windows Registry?

ouch - you know how to sting <grin>.

> With a directory per device, any ancestor of the device may
> add files (controls) to the directory and implement handlers
> in kernel space, and deal with them appropriately.
>
> Make everything ascii, so anything can get/set the
> attributes. Type conversion may seem like a lot of overhead,
> but it's really not that bad. Esp. on large, fast systems.

Ah - so you weren't agreeing with the less generic specific named
attributes, with a pair of specifically named C routines to set/get each
attribute. Rather, you were presenting an improved way to support
flexible attributes - expressed in terms of the driverfs file system
view, and eschewing integers for ascii everything. I can buy that.

> > In other words, I am imagining a system as a collection of
> > partially connected parts, such as CPUs, caches, buses, routers,
> > memory, i/o boards, graphics boards. Here, "connected" means
> > directly attached to or plugged into -- electrically adjacent.
>
> Yes. Exactly.

Ah - finally - a direct hit ;).

> I am very interested in integrating features and changes that will
> benefit the lse effort. In fact, it is the large-system integration
> that I will be dealing with in the near future, so please help me get
> it right..

Will you be maintaining patches against 2.4, or only against 2.5?

Is Gooch's devfs essentially a competitor to driverfs? If so, how is
devfs better, and how is driverfs better?

I won't rest till it's the best ...

Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373