I agree with most of this, but I still think it would be a valuable (and
easy to implement) first step... I like to work with the design philosophy:
make it work, make it work right, make it work fast. The "make it work" bit
would be to get a running system as fast as possible, so that other people
can be encouraged to patch/fix/improve it as soon as possible. With this
approach the easiest thing to get working would be to pretend it's an SMP
machine, with complete shared memory and enumerated CPUs... moving on from
this we can look at modelling a 32-bit shared address space with caching on
each node, etc.
The simplest implementation I can think of to start with is to run the
kernel and apps in a shared 32-bit address space, with all pages paged in
on one CPU to start with. When another CPU needs a page it faults, and the
page will be stolen from whichever node currently has it; this will involve
an interrupt, and then calling the scheduler. Initially I might just let
the page bounce between nodes: very inefficient, but it should work.
Actually, network latency may help here, as it might let a thread have a
decent run-time before the page is stolen.
This would get the cheapo-NUMA to the point of looking like an SMP
machine, at which point the real NUMA work would start, and hopefully
borrow a lot from the NUMA support code. In fact the system should be a
superset of NUMA architectures, as there will be a range of access times,
from a local page through to a machine several switches away. Also, as the
difference in access times is exaggerated compared to a true NUMA machine,
it may prove a good testbed for scheduling/allocation algorithms.
Martin J. Bligh wrote:
>I think unconquerable dragons lie that way (well, it'll work, but the
>performance will suck). For one, if you're going to cope with 32-bit
>architectures, you'll run up against the fundamental limitation that you
>only get about 1GB of kernel virtual address space, and pretty much
>everything in Linux is global. Believe me, I have a very sore head from
>beating it continuously against this problem for the last couple of years.
>Of course, you could limit yourself to 64-bit architectures, which might be
>a good plan, long term. But it severely limits the developer base.
>However, the fundamental thing you don't have that NUMA boxes do is the
>cache coherency hardware ... bouncing a page would mean doing a
>cluster-wide (to every CPU) global cache invalidate for every cache line
>within that page. Which basically means a huge TLB flushing storm going on
>continuously (the cheap ia32 boxes don't have a tlb_flush_range even). We
>need an architecture that encourages locality wherever possible.