I'm interested in using this library for my C++ simulation codes. I think the library has great potential to improve performance of OpenMP-parallel C++ codes on ccNUMA machines (the SGI Altix is the main target).
We find that data placement is critical to the performance and scaling of our simulations. To address this, we have developed a simple allocator that manages a chunk of memory on each physical processor. The main objects are shared, but the allocator keeps each one resident on a particular physical processor. We can structure our code so that objects are read-only for remote processors, while the processor that owns an object can both read and write it. However, our classes are not generic enough to use the allocator with arbitrary objects, nor are they STL-compatible.
My question is: would it be possible to use your library (with some modifications if necessary) to control data placement on each physical processor and implement the ideas described above?
Thanks in advance.
Perhaps I do not understand your question.
Effectively, this allocator works by calling shm_open(), a POSIX system call, to open a UNIX shared memory segment. Then mmap() is used to map that shared memory segment to an address specified by the allocator's parameter keys. The rest of the allocator deterministically divides up the memory and hands element-sized portions to the STL container. Shared access is provided by allowing secondary processes to use an identical key to open the existing shared memory segment and attach to the containers placed in it.
What you could perhaps do, assuming you have enough control, is give the per-processor processes separate keys. Say you have 4 processors: 1, 2, 3, and 4. Processor 1 could create allocators using keys 1, 5, 9, 13, ...; Processor 2 could create allocators using keys 2, 6, 10, 14, ...; and so forth. Processor 2 could then access the elements Processor 1 created and stored under key 1, and Processor 1 could access the elements stored in the allocator Processor 2 created under key 6. The remaining issue is getting the underlying system calls to create the shared memory segments in memory controlled by specific processors. This assumes you are describing a single multiprocessor machine. It also assumes that the operating system does not migrate the processes among the processors, in which case the situation is obviously more dynamic. If system memory is allocated to processors based on physical address, then your goal seems reasonable.
Also, if you are thinking about memory control in terms of, say, how the elements of a matrix are aligned in memory, you do not have that much control with C++ STL allocators. The allocators simply request blocks of adjacent memory cells for their containers. The container elements therefore end up in adjacent cells, but that is not the same as choosing a row-major versus column-major array layout, as in Fortran; there is no control at that level.
Please accept my apologies if I have misunderstood
your question. Thank you for your interest.