Hello, Paul!
The example -really- helped! I was missing the point of the
cms_memory_list data structure. This clears up most of my confusion,
I believe, anyway...
I believe that we will really want to have a default value (e.g., NULL)
for the pointer to the cms_memory_list, since for many applications
a default of either "search my node first, then search the rest however
you wish" or "search my node only" will suffice. I believe that it is
a bit cruel and unusual to force someone to specify either of these in
great detail. ;-)
Each "mirror" in my "hall" is in a separate process that used cpumemmap
to restrict things. So a given process normally cares only about its own
view and that of the underlying system. One common exception might be
debuggers, but, on the other hand, the multithreaded debuggers I have
used don't know a CPU from a hole in the ground. So, the two views
seemed likely to be sufficient.
The reason I keep wanting non-root processes to be able to restrict their
cpumemmaps is that the cpumemmap is the only thing in your proposal that
is capable of virtualizing CPUs and memory. So, an application that wishes
to partition itself needs either to track the CPU numbers and offsets
itself, or it needs to provide a different cpumemmap to each of its
children. Having done quite a bit of the former, the latter seems quite
attractive. ;-)
Also, I need to manipulate cpumemmaps to implement pieces of the
simple-binding API.
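
To make the "pure restriction" idea concrete, here is a rough sketch of the check that would be needed before letting a non-root process install a narrower cpumemmap: every physical CPU and memory in the new map must already appear in the current one. The example_map type and both function names are made up for illustration; they are not part of the cpumemset proposal.

```c
/* Hypothetical flattened map: arrays of physical CPU and memory numbers. */
typedef struct {
	int nr_cpus;
	const int *cpus;
	int nr_mems;
	const int *mems;
} example_map;

/* Return 1 if every entry of sub[] also appears in super[], else 0. */
int is_subset(const int *sub, int nr_sub, const int *super, int nr_super)
{
	int i, j;

	for (i = 0; i < nr_sub; i++) {
		for (j = 0; j < nr_super; j++) {
			if (sub[i] == super[j])
				break;
		}
		if (j == nr_super)
			return 0;	/* sub[i] is not in super[] */
	}
	return 1;
}

/* Return 1 if candidate only removes CPUs and memories from current. */
int is_pure_restriction(const example_map *current,
			const example_map *candidate)
{
	return is_subset(candidate->cpus, candidate->nr_cpus,
			 current->cpus, current->nr_cpus) &&
	       is_subset(candidate->mems, candidate->nr_mems,
			 current->mems, current->nr_mems);
}
```

With a current map of CPUs 4..11 and memories 1 and 2, a child map of CPUs 8..11 and memory 2 passes, while one naming CPU 3 does not.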
Manual manipulation of the cpumemset structures seems quite painful. Am
I right in assuming that one cannot invoke cmsFreeCMS on hand-built
cpumemset structures? If I use my crude trick of decrementing the
nr_cpus field, am I then prohibited from calling cmsFreeCMS(), or does
cmsFreeCMS() hide some additional information away somewhere?
So here is what I believe I need to do to write the simple-binding API in
terms of cpumemset and cpumemmap. I assume that cmsFreeCMS() is
telepathic, since I cannot think of a way that it could do what I want
it to do...
I also take the liberty of leaking memory right and left. The purpose of
this exercise is to see if I understand the cpumemsets. Once I find that
I do, I will try one of the existing proprietary-Unix APIs. And only
-then- will I worry about optimal, nice-looking code!!!
This exercise raised the following questions:
1. It seems to be possible to have a different cpumemmap for different
objects in a given virtual address space. The CPU portions of these
cpumemmaps are ignored, right?
2. The memory portion of the cpumemmap associated with a process
is used as a default for things like data, bss, and heap space?
Is it used as a default for objects created subsequently?
Or is it ignored?
3. The memory portion of the cpumemmap associated with a thread is
used as a default for that thread's stack? Is it used as a
default for objects created subsequently by this thread?
Or is it ignored?
If the answers to both #2 and #3 are "it is ignored", then I am not sure
how the various node-restriction APIs would be implemented.
4. I made up the makecmsmemset() API, since I did not want to
second-guess the layout. I know that this is not really what
you had in mind, so am looking for guidance here.
5. I used NULL for cpumemset mems, since the simple-binding API
does not care about the order of search.
6. We need to align the memory-binding behavior... One could
implement bindtonode() stepping through each page in the
process's virtual-address space, but this might be a bit
inefficient on 64-bit architectures. ;-)
typedef struct {
	cpumemset *set;
	cpumemmap *map;
} numasubset_t;
numasubset_t *restrictcpus(unsigned long cpus, numasubset_t *res)
{
	unsigned i;
	unsigned j;
	unsigned nbits;
	cms_pcpu_t *setcpus;
	cms_pcpu_t *mapcpus;
	cpumemmap *r = res->map;
	numasubset_t *newres;

	/* Fall back to the current map before r is dereferenced below. */
	if (r == NULL) {
		r = cmsQueryMAP(CMS_CURRENT, 0, (void *)0);
				/* Do I need getpid()? */
				/* Do I care about vaddr??? */
				/* should I substitute */
				/* &restrictcpus? */
		if (r == NULL) {
			goto die0;
		}
	}

	/* Count the requested CPUs that exist in the map. */
	nbits = 0;
	for (i = 0; i < sizeof(cpus) * 8; i++) {
		if (i >= r->nr_cpus) {
			break;
		}
		if (cpus & (1UL << i)) {
			nbits++;
		}
	}

	mapcpus = malloc(nbits * sizeof(*mapcpus));
	if (mapcpus == NULL) {
		errno = ENOMEM;
		goto die1;
	}
	setcpus = malloc(nbits * sizeof(*setcpus));
	if (setcpus == NULL) {
		errno = ENOMEM;
		goto die2;
	}

	/* New map keeps the selected physical CPUs; new set numbers them 0..nbits-1. */
	j = 0;
	for (i = 0; i < sizeof(cpus) * 8; i++) {
		if (i >= r->nr_cpus) {
			break;
		}
		if (cpus & (1UL << i)) {
			mapcpus[j] = r->cpus[i];
			setcpus[j] = j;
			j++;
		}
	}

	newres = malloc(sizeof(*newres));
	if (newres == NULL) {
		errno = ENOMEM;
		goto die3;
	}
	newres->map = makecmsmemmap(nbits,
				    mapcpus,
				    r->nr_mems,
				    r->mems);
	if (newres->map == NULL) {
		goto die4;
	}
	newres->set = makecmsmemset(nbits,
				    setcpus,
				    res->set->nr_mems,
				    res->set->mems);
	if (newres->set == NULL) {
		goto die5;
	}
	return (newres);

die5:
	cmsFreeCMM(newres->map);
die4:
	free(newres);
die3:
	free(setcpus);
die2:
	free(mapcpus);
die1:
	if (r != res->map) {
		cmsFreeCMM(r);
	}
die0:
	return (NULL);
}
numasubset_t *restrictnodes(unsigned long nodes, numasubset_t *res)
{
	unsigned i;
	unsigned j;
	unsigned nbits;
	cms_pmem_t *setnodes;
	cms_pmem_t *mapnodes;
	cpumemmap *r = res->map;
	numasubset_t *newres;

	/* Fall back to the current map before r is dereferenced below. */
	if (r == NULL) {
		r = cmsQueryMAP(CMS_CURRENT, 0, (void *)0);
				/* Do I need getpid()? */
				/* Do I care about vaddr??? */
				/* should I substitute */
				/* &restrictnodes? */
		if (r == NULL) {
			goto die0;
		}
	}

	/* Count the requested memory nodes that exist in the map. */
	nbits = 0;
	for (i = 0; i < sizeof(nodes) * 8; i++) {
		if (i >= r->nr_mems) {
			break;
		}
		if (nodes & (1UL << i)) {
			nbits++;
		}
	}

	mapnodes = malloc(nbits * sizeof(*mapnodes));
	if (mapnodes == NULL) {
		errno = ENOMEM;
		goto die1;
	}
	setnodes = malloc(nbits * sizeof(*setnodes));
	if (setnodes == NULL) {
		errno = ENOMEM;
		goto die2;
	}

	j = 0;
	for (i = 0; i < sizeof(nodes) * 8; i++) {
		if (i >= r->nr_mems) {
			break;
		}
		if (nodes & (1UL << i)) {
			mapnodes[j] = r->mems[i];
			setnodes[j] = j;
			j++;
		}
	}

	newres = malloc(sizeof(*newres));
	if (newres == NULL) {
		errno = ENOMEM;
		goto die3;
	}
	newres->map = makecmsmemmap(r->nr_cpus,
				    r->cpus,
				    nbits,
				    mapnodes);
	if (newres->map == NULL) {
		goto die4;
	}
	/* setnodes is not passed on: the set's mems stays NULL (question 5). */
	newres->set = makecmsmemset(res->set->nr_cpus,
				    res->set->cpus,
				    0,
				    NULL);
	if (newres->set == NULL) {
		goto die5;
	}
	return (newres);

die5:
	cmsFreeCMM(newres->map);
die4:
	free(newres);
die3:
	free(setnodes);
die2:
	free(mapnodes);
die1:
	if (r != res->map) {
		cmsFreeCMM(r);
	}
die0:
	return (NULL);
}
void freerestrict(numasubset_t *res)	/* "restrict" is a C99 keyword */
{
	cmsFreeCMM(res->map);
	cmsFreeCMS(res->set);
	free(res);
}
I don't know how to implement getcpu(), getnode(), cputonode(), or
nodetocpu() based on this API.
int bindtocpu(unsigned long cpus, numasubset_t *res)
{
	numasubset_t *r1;

	r1 = restrictcpus(cpus, res);
	if (r1 == NULL) {
		return (-1); /* adds ENOMEM */
	}
	cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
	cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map);
	freerestrict(r1);
	return (0);
}
int bindtonode(unsigned long nodes,
	       int behavior,
	       numasubset_t *res)
{
	numasubset_t *r1;

	r1 = restrictnodes(nodes, res);
	if (r1 == NULL) {
		return (-1); /* adds ENOMEM */
	}
	r1->set->cms_policy = behavior; /* assume we rationalize */
	cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
	cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map);
	/* Need to step through all VM objects... How? */
	freerestrict(r1);
	return (0);
}
int setlaunch(unsigned long cpus,
	      unsigned long nodes,
	      int behavior,
	      numasubset_t *res)
{
	numasubset_t *r1;
	numasubset_t *r2;

	r1 = restrictnodes(nodes, res);
	if (r1 == NULL) {
		return (-1); /* adds ENOMEM */
	}
	r2 = restrictcpus(cpus, r1);
	freerestrict(r1);
	if (r2 == NULL) {
		return (-1); /* adds ENOMEM */
	}
	r2->set->cms_policy = behavior; /* assume we rationalize */
	cmsSetCMS(CMS_CHILD, (void *)0, r2->set);
	cmsSetCMM(CMS_CHILD, 0, (void *)0, r2->map);
	freerestrict(r2);
	return (0);
}
int bindmemory(unsigned long start,
	       size_t len,
	       unsigned long nodes,
	       int behavior,
	       numasubset_t *res)
{
	numasubset_t *r1;
	unsigned long cur;

	r1 = restrictnodes(nodes, res);
	if (r1 == NULL) {
		return (-1); /* adds ENOMEM */
	}
	r1->set->cms_policy = behavior; /* assume we rationalize */
	cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
	/* Walk forward so the unsigned counter cannot wrap below start. */
	for (cur = start; cur < start + len; cur += PAGESIZE) {
		cmsSetCMM(CMS_VMAREA, 0, (void *)cur, r1->map);
	}
	freerestrict(r1);
	return (0);
}
> On Mon, 15 Oct 2001, Paul McKenney wrote:
> > > I expect that when someone needs a kernel for > 64 cpus, they
> > > will have to find an alternative to, or extension of, Ingo's
> > > cpus_allowed bit vector.
> >
> > I agree with this. My hope is that there will be a way to bury such
> > differences between 64-CPU and >64-CPU data structure manipulation
> > in a macro, an inlined function, or some such.
>
> Something like that, yes. This is just a concern to the schedule()
> code - nicely isolated.
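
A minimal sketch of how the 64-CPU versus >64-CPU difference might be buried behind small helpers, in the spirit of the word array behind select()'s fd_set. The ex_ names and the 256-CPU limit are invented for this illustration and are not taken from the kernel or from the CpuMemSets proposal.

```c
#include <limits.h>
#include <string.h>

#define EX_MAX_CPUS	256	/* arbitrary limit chosen for this sketch */
#define EX_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define EX_NR_WORDS	((EX_MAX_CPUS + EX_WORD_BITS - 1) / EX_WORD_BITS)

/* A CPU mask wide enough for EX_MAX_CPUS CPUs, one bit per CPU. */
typedef struct {
	unsigned long bits[EX_NR_WORDS];
} ex_cpumask_t;

/* Clear every CPU in the mask. */
void ex_cpumask_clear(ex_cpumask_t *m)
{
	memset(m, 0, sizeof(*m));
}

/* Mark one CPU as allowed; out-of-range CPUs are silently ignored. */
void ex_cpumask_set(ex_cpumask_t *m, unsigned cpu)
{
	if (cpu < EX_MAX_CPUS)
		m->bits[cpu / EX_WORD_BITS] |= 1UL << (cpu % EX_WORD_BITS);
}

/* Return 1 if the CPU is allowed, 0 otherwise. */
int ex_cpumask_test(const ex_cpumask_t *m, unsigned cpu)
{
	if (cpu >= EX_MAX_CPUS)
		return 0;
	return (int)((m->bits[cpu / EX_WORD_BITS] >> (cpu % EX_WORD_BITS)) & 1UL);
}
```

Callers only ever see the three helpers, so the word count can change without touching them.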
>
>
> > > The kernel code that implements the system calls to set
> > > CpuMemMaps and Sets will always have the responsibility for
> > > translating the stable, but cumbersome, API of these Maps and
> > > Sets into the efficient representation of the day required by
> > > the scheduling and allocation code.
> >
> > Yep! There may come a time when people want a short-form interface,
> > but I believe that what is good enough for select() vectors and for
> > signal masks is good enough for NUMA. ;-)
>
> My intention is that short-form interfaces are provided on top
> of CpuMemSets -- we (sgi) expect to do a few such ourselves,
> to emulate existing interfaces.
>
>
> > > Let me try again ...
> > >
> > > A CpuMemSet specifies two things. It specifies on which cpus
> > > (in the corresponding CpuMemMap) a task may be scheduled,
> > > and it specifies in what order to search for memory, per
> > > virtual memory area, depending on which cpu the request
> > > for memory was executed.
> >
> > OK... How is the CPU taken into account? Do you traverse the
> > list of memories until you find one that is closest to the current
> > CPU? If there is no memory to be had there, do you then go through
> > the list of memories in order starting at that point, and wrapping
> > around if necessary? Or are you using something similar to
> > classzones, which would be represented separately?
>
>
> Hmmm ... I have not yet adequately presented some aspects of
> this data structure. I must add an example ... that might
> connect with additional readers.
>
> Let's try this one:
>
> Example:
> ========
> One way to understand these data structures is to look at
> an example.
>
> Given the following hardware configuration:
>
> Let's say we have a four node system, with four CPUs
> per node, and one memory per node, named as follows:
>
> Name the 16 CPUs: c0, c1, ..., c15 # 'c' for CPU
> and number them: 0, 1, 2, ..., 15 # cms_pcpu_t
>
> Name the 4 memories: mn0, mn1, mn2, mn3 # 'mn' for memory node
> and number them: 0, 1, 2, 3 # cms_pmem_t
>
> CpuMemMap:
>
> Now let's say the administrator (root) chooses to set up a
> Map containing just the 2nd and 3rd node (CPUs and memory
> thereon). The cpumemmap for this would contain:
>
> {
>     8,   # nr_cpus (length of cpus array)
>     p1,  # cpus (ptr to array of cms_pcpu_t)
>     2,   # nr_mems (length of mems array)
>     p2   # mems (ptr to array of cms_pmem_t)
> }
>
> where p1, p2 point to arrays of physical cpu + mem numbers:
>
> p1 = [ 4,5,6,7,8,9,10,11 ] # cpus (array of cms_pcpu_t)
> p2 = [ 1,2 ] # mems (array of cms_pmem_t)
>
> This map shows, for example, that for this Map, logical cpu 0
> corresponds to physical cpu 4 (c4).
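
A minimal sketch of the lookup just described, using the p1 and p2 arrays from the example. The example_cpumemmap layout follows the four fields listed above, but the integer typedefs and the two helper functions are assumptions made for illustration only.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the types named in the example. */
typedef int cms_pcpu_t;	/* physical CPU number */
typedef int cms_pmem_t;	/* physical memory node number */

typedef struct {
	int nr_cpus;		/* length of cpus array */
	cms_pcpu_t *cpus;	/* logical CPU i is physical CPU cpus[i] */
	int nr_mems;		/* length of mems array */
	cms_pmem_t *mems;	/* logical memory i is physical memory mems[i] */
} example_cpumemmap;

/* Return the physical CPU for a logical CPU, or -1 if out of range. */
int logical_to_physical_cpu(const example_cpumemmap *map, int lcpu)
{
	if (map == NULL || lcpu < 0 || lcpu >= map->nr_cpus)
		return -1;
	return map->cpus[lcpu];
}

/* Return the logical CPU for a physical CPU, or -1 if not in the map. */
int physical_to_logical_cpu(const example_cpumemmap *map, int pcpu)
{
	int i;

	if (map == NULL)
		return -1;
	for (i = 0; i < map->nr_cpus; i++) {
		if (map->cpus[i] == pcpu)
			return i;
	}
	return -1;
}
```

With p1 = [ 4,5,6,7,8,9,10,11 ], logical cpu 0 maps to c4 and physical cpu 9 maps back to logical cpu 5.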
>
> CpuMemSet:
>
> Further, let's say that an application running within this map
> chooses to restrict itself to just the odd-numbered CPUs, and
> to search memory in the common "first-touch" manner (local
> node first). It would establish a CpuMemSet containing:
>
> {
>     CMS_DEFAULT,  # cms_policy
>     4,            # nr_cpus (length of cpus array)
>     q1,           # cpus (ptr to array of cms_lcpu_t)
>     2,            # nr_mems (length of mems array)
>     q2,           # mems (ptr to array of cms_memory_list)
> }
>
> where q1 points to an array of 4 logical cpu numbers and q2 to an
> array of 2 memory lists:
>
>
> q1 = [ 1,3,5,7 ], # cpus (array of cms_lcpu_t)
> q2 = [ # See "Verbalization example" below
> { 3, r1, 2, s1 }
> { 2, r2, 2, s2 }
> ]
> where r1, r2 are arrays of logical cpus:
> r1 = [1, 3, CMS_DEFAULT_CPU]
> r2 = [5, 7]
> and s1, s2 are arrays of memory nodes:
> s1 = [0, 1]
> s2 = [1, 0]
>
> Verbalization examples:
>
> To read item q1 out loud:
>
> Tasks in this CpuMemSet may be scheduled on any of
> the logical CPUs [ 1, 3, 5, 7 ], which correspond
> in the associated Map with physical CPUs c5, c7, c9
> and c11.
>
> To read item q2 out loud:
>
> If a fault occurs on any of the 2 explicit CPUs in
> r1, then search the 2 memory nodes in s1 in order,
> looking for available memory (mn1, then mn2).
>
> If a fault occurs on any of the 2 CPUs in r2, search
> the 2 memory nodes in s2 in order (mn2, then mn1).
>
> If a fault occurs on any other CPU, then since the
> CMS_DEFAULT_CPU value is listed in r1, search the
> 2 memory nodes in s1 in order (mn1, then mn2).
>
> Interpretations of the above:
>
> The meaning of "s1 = [0, 1]" is that if a page fault occurs on
> the logical CPUs in "r1 = [1, 3, CMS_DEFAULT_CPU]", then the
> allocator should search logical memory node 0 first (that's
> the memory on physical node 1 - mn1), then search logical
> memory node 1 second (the memory on physical node 2 - mn2).
>
> The meaning of "s2 = [1, 0]" is that if a page fault occurs
> on the logical CPUs listed in "r2 = [5, 7]", then the same
> memory nodes are searched, but in the other order, mn2 then mn1.
>
> In particular, if a vm area using the above CpuMemSet was
> also shared with an application running on some other Map,
> and that application faulted while running on some CPU not
> explicitly listed in the above CpuMemSet (item r1 or r2),
> then the allocator would search mn1 first, then mn2, for
> available memory. This is because CMS_DEFAULT_CPU is listed
> amongst the CPUs in r1, and the corresponding s1 is equivalent
> to the ordered array of physical memory nodes [mn1, mn2].
>
> Observation:
>
> The allocator need have _no_ notion of distance. It just
> searches, in order specified, the memory list prescribed
> for that vm area, for a fault on the specified CPU (or the
> CMS_DEFAULT_CPU). To provide the usual first-touch, distance
> ordered memory search, some system service or utility must
> sort the memory lists in distance order.
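
The search rule in the example can be sketched in a few lines of C: pick the memory list whose cpu list names the faulting logical CPU, fall back to the list that names the default CPU, then translate that list's logical memory nodes through the map. The struct layout mirrors the { nr_cpus, cpus, nr_mems, mems } entries of q2; EX_DEFAULT_CPU and the function names are placeholders invented here, and the real allocator may well differ.

```c
#include <stddef.h>

#define EX_DEFAULT_CPU	(-1)	/* hypothetical stand-in for CMS_DEFAULT_CPU */

typedef struct {
	int nr_cpus;		/* length of cpus array */
	const int *cpus;	/* logical CPUs this list applies to */
	int nr_mems;		/* length of mems array */
	const int *mems;	/* logical memory nodes, in search order */
} example_memory_list;

/*
 * Pick the memory list for a faulting logical CPU: an explicit match wins,
 * otherwise the first list that names EX_DEFAULT_CPU, otherwise NULL.
 */
const example_memory_list *
pick_memory_list(const example_memory_list *lists, int nr_lists, int lcpu)
{
	const example_memory_list *fallback = NULL;
	int i, j;

	for (i = 0; i < nr_lists; i++) {
		for (j = 0; j < lists[i].nr_cpus; j++) {
			if (lists[i].cpus[j] == lcpu)
				return &lists[i];
			if (lists[i].cpus[j] == EX_DEFAULT_CPU && fallback == NULL)
				fallback = &lists[i];
		}
	}
	return fallback;
}

/*
 * Write the physical memory nodes to search, in order, into out[] and
 * return how many were written (0 if no list applies).
 */
int search_order(const example_memory_list *lists, int nr_lists,
		 const int *map_mems, int map_nr_mems,
		 int lcpu, int *out, int out_len)
{
	const example_memory_list *ml = pick_memory_list(lists, nr_lists, lcpu);
	int i, n = 0;

	if (ml == NULL)
		return 0;
	for (i = 0; i < ml->nr_mems && n < out_len; i++) {
		int lmem = ml->mems[i];

		/* Translate logical memory node to physical via the map. */
		if (lmem >= 0 && lmem < map_nr_mems)
			out[n++] = map_mems[lmem];
	}
	return n;
}
```

Fed the r1, r2, s1, s2 and p2 arrays from the example, a fault on logical cpu 3 yields mn1 then mn2, a fault on logical cpu 7 yields mn2 then mn1, and a fault on an unlisted cpu falls back to mn1 then mn2. Note that no distance information appears anywhere in the code.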
>
> ========
>
> I should add the above example to my Design Notes.
>
> My apologies for the tediousness of this example. I realize
> that the above data structure is a layer or two deeper than
> intuitions expect. However when I methodically strip all
> (most?) higher level policy from the various CPU and memory
> API's we need to support, the above is what I am left with, as
> the necessary generic multiplexor between a variety of API's
> and the specific needs of the static placement logic in the
> kernel allocation and scheduling code.
>
> For example, observe that no notion of locality domain or node
> exists here - it has been disassembled into simple lists of
> CPUs and chunks of memory, called here 'memory nodes' only
> because there tends to be one chunk per node, and I couldn't
> find a better noun to name a maximal contiguous (?) chunk of
> memory that is equidistant from all processing elements.
>
>
> > > Seems that I can either (1) require that a memory list _always_
> > > be specified for the CMS_DEFAULT_CPU, or (2) mandate that
> > > attempts to allocate memory when the CpuMemSet does not specify
> > > (even by CMS_DEFAULT_CPU) any memory list for the currently
> > > executing cpu must fail. Choice (2) would cause nasty, obscure
> > > and intermittent errors. So it must be choice (1).
> >
> > Agreed that deterministic errors are much better than non-deterministic
> > errors! But how do you indicate which memory is associated with
> > CMS_DEFAULT_CPU?
>
> The value CMS_DEFAULT_CPU is included on the cpu list
> (r1 or r2, in the above example) in one of the memory lists.
>
> ==> Memory lists have a list of memory nodes, sorted in search
> order, _and_ a list of CPUs to which that memory list applies.
>
> > Must the CPU and the memory lists be the same size?
> > My guess was "no", since there are separate fields for the length of
> > each.
>
> Correct - they need not be, and in the case of architectures
> with multiple CPUs per memory node, typically are not the same size.
>
> > Or does CMS_DEFAULT_CPU just start the search of memory from the
> > first element of the array?
>
> er eh no. This question confuses me.
>
> CMS_DEFAULT_CPU chooses which memory list to search if the
> faulting cpu is not explicitly listed.
>
> The search order is by default (CMS_DEFAULT) always from the
> first element of the memory node array (s1 and s2, above), unless
> the CMS_ROUND_ROBIN policy is specified for that CpuMemSet.
>
>
> > > What is a "least-memory utilization policy"? I am not familiar
> > > with that term.
> >
> > It means that, at page-fault time, you allocate memory from the
> > node/memory with the least utilization.
>
> This is clearly then a dynamic placement policy, not a static
> one. I would expect the implementation of a "least-memory
> utilization policy" to add code to the allocator and elsewhere
> to track memory utilization. And I would expect the CpuMemSet
> (or at least CpuMemMap) to control which memory nodes could
> be searched. But the order of search would depend on more
> dynamic code outside the domain of CpuMemSets.
>
> >
> > > ====
> > > > A cpumemset has no meaning except in the context of a cpumemmap,
> > > > right? A cpumemmap maps the CPU numbers, while a cpumemset simply
> > > > restricts them, right? (Could make it work either way, but things
> > > > like getcpu() and getnode() need to be in sync with the choice.)
> > > > For example, suppose that the CPUs are numbered 100, 101, 102, 103,
> > > > and 104. Suppose that the cpumemmap is {100,102,103}, and that the
> > > > cpumemset is {0,2}. What is the physical CPU corresponding to
> > > > logical CPUs 0, 1, and 2?
> > >
> > > Yes, Yes, tell me about getcpu/getnode, and {100,102,103}.
> > >
> > > My presumption, from reading this, is that getcpu(), executed in this
> > > map and set, would return either 100 or 103, rather oblivious to the
> > > CpuMemMap. I have no clue what getnode() would or should do that
> > > relates to CpuMemSets. Perhaps we need a "getcmscpu()"?
> >
> > I believe that we need a way to get the physical CPU ID. So, at this
> > point, do we have two levels of ID, or three? In case of diagnostics,
> > you want to identify the physical CPU. So we quickly get into the
> > issue that Martin Bligh raised earlier. ;-)
>
> When you want the physical CPU ID, as with diagnostics, then getcpu()
> provides the physical CPU ID, which should be no more ambiguous than
> it was before CpuMemSets. I only see one level of ID here. I see
> nothing logical about this <grin>.
>
>
> > For related sets of processes running as root, you can end up with many
> > more levels of ID. Process 100 maps to exclude the first CPU, then
> > forks process 101, which maps to exclude its idea of the first CPU, and
> > so on. This problem is inherent in the notion of virtualizing the CPUs.
> > We could try to eliminate the middle level (cms_pcpu_t), but that could
> > require using whatever ugly IDs the hardware wanted to provide. Maybe
> > the name of the cms_pcpu_t level should be "complete" instead of
> > "physical"? Or some other naming?
>
> No - cms_pcpu_t is the ugly hardware ID (well one of them -
> seems that these too come in a couple forms, such as compact
> or not). This is not a hall of infinitely regressing mirrors.
> There are only two levels, the ugly physical ID level, and the
> logical mapping to the integers 0..N-1 (logical ID) for any
> given subset of size N CPUs or memory nodes.
>
> I suspect that CpuMemMaps are a degree less fancy than you are
> presuming, while CpuMemSets are a couple degrees more fancy,
> as in the above example ;).
>
>
> > So the relevant numbering schemes are the following: (1) the numbering
> > that the current process sees, as mapped by the CpuMemMap that controls
> > it, and (2) the underlying physical identifiers that would make sense
> > to someone servicing the hardware.
>
> Yes - exactly. How we ended up at this same point, after the
> above confusions, baffles me a bit. oh well.
>
>
>
> > > > HP's launch policies include policies for CPU allocation (round
> > > > robin and the like). My guess is that this requires either
> > > > additional bits in cms_policy or another field for CPU (as
> > > > opposed to memory) policies.
> > >
> > > Aha - might be so. I should investigate this further.
> >
> > http://lse.sourceforge.net/numa/manpages/hpux-mm.txt, then search for
> > "Launch".
>
> Here I see several dynamic scheduling policies:
> ROUND_ROBIN
> PACKED
> FILL
> LEASTLOAD
>
> As with the dynamic "least-memory utilization" policy above,
> I would expect CpuMemSets to control which CPUs were eligible
> for scheduling, but that the control of details of scheduling
> would be left to other mechanisms.
>
>
> > > Non-root processes can restrict where they execute tasks or allocate
> > > memory by altering their CpuMemSet, within the confines of the
> > > CpuMemMap established for them. Why is that not enough?
> >
> > My concern is that non-root processes cannot virtualize their children.
> > If you want to divide the CPUs and memory available to you into two
> > pieces, and run a child in each piece, you can do so, but you cannot
> > have both children thinking that they have their own CPU 0.
>
> Yes, this limitation exists. Below you suggest letting non-root
> processes further restrict their map. That would be doable -
> a modest wart, in that it complicates what was a simple story:
> root manipulates maps, anyone manipulates sets. How serious
> is this limitation?
>
>
> > > > (At this point, I am favoring letting non-root processes
> > > > manipulate cpumemmaps, with the restriction that any that
> > > > are installed must be pure restrictions of the current
> > > > cpumemmap.)
> > Please let me know whether my non-root virtualization example
> > above seems reasonable to you.
>
> I am still lacking an appreciation of the benefit of this
> sufficient to justify the extra half-twist of logic. I'm
> open to hearing more.
>
>
> > > > Is the user supposed to directly manipulate the fields of the
> > > > cpumemmap and cpumemset?
> > >
> > > Yes, via the cmsSet*() system calls. Well, as direct as
> > > anything in a protected operating system -- the user politely
> > > asks the kernel to set a map or set, and doesn't directly write
> > > kernel memory.
> > >
> > > My hunch here is that I missed the real point of this question ...
> >
> > Should the interfaces be used as follows?
> >
> > p = cmsQueryCMS(CMS_CURRENT, (void *)0);
> > if (p->nr_cpus > 1) {
> > p->nr_cpus--;
> > }
> > cmsSetCMS(CMS_CURRENT, (void *)0, p);
>
> Ah ... that gets dicey. What I hear you asking is whether it
> is appropriate for an application (more likely, some library
> supporting a friendlier interface on top of CpuMemSets) to (1)
> get a response from a cmsQuery*(), (2) munge it in place, and
> (3) push it back down a cmsSet*() call.
>
> As we have already seen in the Example above, the key data
> structure is a tad nasty to deal with in C, with its nested
> variable length arrays. It is especially nasty across the system
> call boundary -- the kernel can't exactly malloc and assemble
> multiple small chunks of user memory during the response to a
> single system call.
>
> This is in good part the motivation for the cmsFree*() calls,
> to isolate the caller of these routines from knowing just how
> the memory for them is allocated.
>
> For a change such as you give in your example above, changing
> a value in place, that's no problem .. because it makes no
> assumptions as to how the memory was allocated. Changing a
> Policy flag would be easy, for the same reason. Shortening
> one of the arrays, and even overwriting the values in them,
> is fine.
>
> Any lengthening of an array should be done with a deep copy
> into memory that is managed in ways known to the caller, so
> that the caller knows how to free that memory, when finished.
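
A rough sketch of the "deep copy before lengthening" rule, shown for a single array. Plain int stands in for the cpu number type, and copy_and_lengthen() is a name invented for this example. The array that came back from a cmsQuery*() call is left untouched for the matching cmsFree*() call; the copy comes from malloc(), so the caller knows to release it with free().

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc()ed copy of the first old_len entries of old[], with room
 * for new_len entries (new_len >= old_len). The extra entries are set to
 * fill. The caller owns the result and releases it with free().
 */
int *copy_and_lengthen(const int *old, size_t old_len, size_t new_len, int fill)
{
	int *copy;
	size_t i;

	if (new_len < old_len || new_len == 0)
		return NULL;
	copy = malloc(new_len * sizeof(*copy));
	if (copy == NULL)
		return NULL;
	if (old_len > 0)
		memcpy(copy, old, old_len * sizeof(*copy));
	for (i = old_len; i < new_len; i++)
		copy[i] = fill;
	return copy;
}
```

A full cpumemset would need the same treatment for each nested array (the cpus array, the mems array, and the arrays inside each memory list), which is why this is a deep copy rather than a single memcpy().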
>
>
> > One thing we need to hash out is what the unit of memory control is.
> > In the simple-binding proposal, it is an arbitrary range of virtual
> > memory (similar to what madvise() might do), while in the Process
> > Scheduling and Memory Placement proposal, it appears to be an object.
> > Conflicts are handled in a last-change-wins manner?
>
> Yes - I see that section in the "Proposed NUMA API" now.
>
> There are a couple things here we need to hash out. My reading
> of this section is that bindmemory() has a couple of capabilities
> that are not a natural fit with CpuMemSets:
>
> 1) Does bindmemory (vaddr, len, ...) apply to:
> a) already mapped and allocated memory, or
> b) already mapped, but not allocated, memory, or
> c) also extended mappings in already known vm areas
> d) any future mapping or remapping in the vaddr range?
>
> Since these bindings are inherited across fork/exec, I guess
> it must be (d). This means that the binding request has
> to be kept as an inherited property of the process, and
> recalculated for applicability any time that any mapping
> is changed, whether to grow or shrink an existing vm area
> or to add or remove a vm area. CpuMemSets are, as you note,
> per object (per vm area), not for an arbitrary address range
> and whatever portions of whatever future vm areas might
> happen to overlap that range.
>
> 2) bindmemory() requires per-process specifications as to
> where to find memory. CpuMemSets supports per-cpu
> specifications (based on which cpu is executing the
> page_alloc request).
>
> My first inclination is to recommend against supporting
> capability (1) above, on the grounds that its semantics don't
> respect very well the existing objects that the kernel supports.
> That, or I failed to understand it, quite possible.
>
> The second capability, per-process rather than (or in addition
> to) per-cpu memory search specifications is at least more
> doable, but still seems not quite right to me.
>
> I am of course open to hearing the motivation for these
> capabilities. I may be a purist at heart, but I am a pragmatist
> in my wallet.
>
> Also perhaps you could comment on the status of the Simple
> Binding proposal. Whether it is in use, or is required to
> support API's in use. What level of flexibility exists on such
> significant semantics as the two capabilities just above.
> I won't rest till it's the best ...
> Manager, Linux Scalability
> Paul Jackson <pj@...> 1.650.933.1373