Paul Jackson wrote:
> On Mon, 14 Jan 2002, Paul Dorwin wrote:
> > I now have a new document, which can be found at
> > http://lse.sourceforge.net/numa/Topology/Topology.html#REVIEW
> Ah - you've been busy. Good to see you back on this list.
I've always been here - I'm just a little quiet :)
> I will confess to having a little trouble with this design.
> Since it may well have been in part some of my suggestions from
> last Nov/Dec that led to this current design, I'm conflicted.
I will confess that this design is based mainly on the general theme
from feedback that hardcoding the attributes as in the first design
was less than desirable.
> I share your concern, expressed in Section 1.2, that the new
> design is "much larger".
> Anyhow ... to specifics:
> 1) As with the first design, I have trouble with trying to
> present the same API to both the user level and the kernel.
> This API in the second design strikes me as likely suitable
> for use by C code in user land, possibly suitable for passing
> across the user-kernel boundary, and rather unsuitable for
> an internal kernel API. Even you note that this API may
> "cause unacceptable overhead for the kernel".
> I am open to a relatively large API for userland (as can be
> seen from my CpuMemSet work) [;)]. But internal kernel APIs
> should be, in my view, minimal and more demand driven --
> little more than what is needed by the code that depends
> on it.
Hmm, sounds kinda like where my first design was headed. The API was
minimal, but this required knowledge of the data structures by users.
In this case the user was the architecture specific boot code. The
API to access the data would then explode into a zillion functions
which hide the dependency on the data structures. I thought that this
was what you described as a brittle interface.
Do you have any suggestions on how to merge the two designs into one?
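To make the "zillion functions" point concrete, the accessor layer
from the first design would have looked roughly like this (every name
here is hypothetical):

    /*
     * Hypothetical accessor-style API from the first design: one
     * function per attribute, so callers never touch the underlying
     * data structures.  Multiply by every attribute of every object
     * type and the interface explodes.
     */
    int topo_num_nodes(void);
    int topo_num_cpus(void);
    int topo_cpu_to_node(int cpu);           /* node containing cpu */
    unsigned long topo_node_mem_size(int node);
    int topo_cpu_cache_levels(int cpu);
    unsigned long topo_cpu_cache_size(int cpu, int level);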
> If you decided that this latest API was just for use by
> userland code, not also exposed within the kernel, then
> I would be more comfortable.
> 2) Do you intend to add a notion of "distance" to your topology?
Distances are there. They are attributes of the cpu type.
See section 3.2.3: Cpu Attributes. I probably need to start
a separate thread on the notion of distance. Some think
hops would work. Others think that it is a product
of latency and bandwidth. What type of values make sense for distance?
How are they generated?
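Just to make the question concrete, here is one purely illustrative
shape the answer could take, normalizing the local distance to 10 so
that remote distances read as relative multiples (all values invented):

    /*
     * Hypothetical node-to-node distance table.  Whether entries
     * like these should come from hop counts or from measured
     * latency/bandwidth is exactly the open question.
     */
    #define MAX_NODES 4

    static unsigned char node_distance[MAX_NODES][MAX_NODES] = {
            { 10, 20, 20, 30 },
            { 20, 10, 30, 20 },
            { 20, 30, 10, 20 },
            { 30, 20, 20, 10 },
    };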
> For simpler systems of perhaps 16 to 64 cpus or less, depending
> on topology, all distances are trivially obvious from the
> abstract topology. But for more complex geometries seen in
> larger systems, distances (cpu to memory and cpu to cpu)
> can become non-obvious.
> I appreciate however that you might not think that distance
> fit well within your structure. We'll need it somehow.
> Any suggestions how?
> 3) As was commented on by a couple of us in the lengthy email
> thread that followed up your first design, perhaps the best
> way to pass this topology across the kernel-user boundary
> would be by a /proc-like (or driverfs or whatever like) display
> of files containing ascii text lines, in a directory tree.
My prototype does just that. In fact, the second design makes
it much simpler to iterate first through the object types
creating /proc directories. As the directories are created,
each attribute of the object is presented as a file.
And it follows the /proc mantra: 1 file 1 value. That value
is a long, a character string, or an array of bytes.
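In rough strokes, and with every type and name invented for the sketch
(error handling trimmed), the population step amounts to:

    #include <linux/kernel.h>
    #include <linux/proc_fs.h>

    /* Hypothetical types, stand-ins for the real ones. */
    struct topo_attr {
            const char       *name;
            long              value;
            struct topo_attr *next;
    };
    struct topo_obj {
            const char       *name;    /* e.g. "cpu0", "node1" */
            struct topo_attr *attrs;
    };

    /* 1 file, 1 value: each attribute file prints a single long. */
    static int topo_attr_read(char *page, char **start, off_t off,
                              int count, int *eof, void *data)
    {
            struct topo_attr *attr = data;

            *eof = 1;
            return sprintf(page, "%ld\n", attr->value);
    }

    /* One /proc directory per object, one read-only file per attribute. */
    static void topo_proc_populate(struct topo_obj *obj,
                                   struct proc_dir_entry *parent)
    {
            struct proc_dir_entry *dir = proc_mkdir(obj->name, parent);
            struct topo_attr *attr;

            if (!dir)
                    return;
            for (attr = obj->attrs; attr; attr = attr->next)
                    create_proc_read_entry(attr->name, 0444, dir,
                                           topo_attr_read, attr);
    }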
Actually, I am looking forward to seeing your python code
to see just how you iterate through your graph. Then, I would
like to try to adapt it to the /proc/topology tree.
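Until then, even a trivial user-land walker gets something usable out
of the one-file-one-value layout. A minimal sketch, assuming the tree
is rooted at /proc/topology:

    /* Minimal user-land walker: recurse through the tree and print
     * each single-value attribute file.  /proc/topology is assumed. */
    #include <stdio.h>
    #include <string.h>
    #include <limits.h>
    #include <dirent.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    static void walk(const char *path)
    {
            DIR *dir = opendir(path);
            struct dirent *de;
            char sub[PATH_MAX];
            struct stat st;

            if (!dir)
                    return;
            while ((de = readdir(dir)) != NULL) {
                    if (!strcmp(de->d_name, ".") ||
                        !strcmp(de->d_name, ".."))
                            continue;
                    snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
                    if (stat(sub, &st) != 0)
                            continue;
                    if (S_ISDIR(st.st_mode)) {
                            walk(sub);          /* an object: recurse */
                    } else {
                            char val[128];
                            FILE *f = fopen(sub, "r");

                            if (f && fgets(val, sizeof(val), f))
                                    printf("%s: %s", sub, val);
                            if (f)
                                    fclose(f);
                    }
            }
            closedir(dir);
    }

    int main(void)
    {
            walk("/proc/topology");
            return 0;
    }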
> Then the C-friendly API in your second design would be
> implemented purely in user land, depending on reading the
> file-system display and converting it to linked C structures.
> Once again, I recommend to your consideration SGI's hwgraph
> interface, as described on the man page:
> 4) You state that there are no plans at this time to map hardware
> beyond the I/O controller. This seems to mean we are headed
> toward the following collection of display mechanisms in Linux:
Note that the key words there are 'at this time'. In the future it may
make sense to map devices beyond the controller, and I hope to make
this design flexible enough to accommodate that possibility.
> a] Your topology subsystem displays cpu/mem/node/cache, with
> future extensions to the I/O bus and controller level.
> b] Gooch's much "loved" devfs displays block and char devices
> in a namespace that is a modern replacement to /dev.
> c] Mochel's recent driverfs exposure of the device driver tree.
I don't particularly want to get into a devfs/driverfs dispute. Nor am
I particularly interested in trying to replace either. As I see it,
here is the problem I am trying to solve: Every architecture relies on
the BIOS to present machine topology to the kernel in order to
boot. For the most part, each architecture/BIOS does it differently, and
for some architectures there may be multiple ways to do it. However,
much of the information is either tossed into the bit bucket or not
available to applications which want it.
I am simply trying to build a subsystem which captures and stores this
data in a non-architecture-specific way. The subsystem should also
present an API which allows other kernel subsystems as well as
applications to get at this data.
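Concretely, I imagine the arch boot code feeding the subsystem through
a couple of calls like these (declarations only; every name here is
invented for illustration):

    /* Hypothetical capture API: the arch/BIOS-specific boot code
     * registers objects and attributes; everything above this line
     * sees only the architecture-neutral view. */
    struct topo_obj;    /* opaque to callers */

    struct topo_obj *topology_add_obj(const char *type, int id,
                                      struct topo_obj *parent);
    int topology_add_attr(struct topo_obj *obj, const char *name,
                          long value);

    /* e.g. in arch boot code, once the BIOS tables are parsed: */
    static void example_register(void)
    {
            struct topo_obj *node = topology_add_obj("node", 0, NULL);
            struct topo_obj *cpu  = topology_add_obj("cpu", 0, node);

            topology_add_attr(cpu,  "speed_mhz", 700);
            topology_add_attr(node, "mem_kb", 4 * 1024 * 1024);
    }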
> Gooch was motivated by the need to manage a larger /dev directory,
> where the number of special device files was exceeding both the
> simple limits of <major, minor> bit fields, and the human limits
> of managing a large list of changing device entries with static files.
> Mochel was motivated by the need to support Power Management,
> Plug and Play and hot plug, which required a better structure
> for connecting the device drivers in the kernel.
> This split is ok by me, more or less. Others no doubt are
> concerned that there is apparent duplication of similar
> effort here. But perhaps the focus is sufficiently different
> in each case, and the likelihood of genuine merger and
> sharing of effort sufficiently low, that this is how it
> is, and that's ok.
> Still, I don't understand why you would not anticipate adding
> I/O devices to your topology? They would seem like a natural
> addition to me.
> 5) How would you model a system where, say, each node had two
> dual-processor chips (4 processors per node), with cache
> at each level: per processor, per chip and per node?
I will add this diagram to the next revision of the document.
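Roughly, I would expect the hierarchy to come out like this (a rough
sketch only; the level names are illustrative):

    node 0 ----------------- node cache (shared by all 4 cpus)
      |
      +-- chip 0 ----------- chip cache (shared by cpus 0-1)
      |     +-- cpu 0 ------ cpu cache  (private)
      |     +-- cpu 1 ------ cpu cache  (private)
      |
      +-- chip 1 ----------- chip cache (shared by cpus 2-3)
            +-- cpu 2 ------ cpu cache  (private)
            +-- cpu 3 ------ cpu cache  (private)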
> Thanks for the good work -- hopefully I will get time soon
> to download your patch and have a look at the real code.
> -- I won't rest till it's the best ... Programmer, Linux Scalability
> Paul Jackson <pj@...> 1.650.933.1373
Paul Dorwin
IBM Linux Technology Center
ph: (503) 578-7786, tie: 775-7786