|
From: Riccardo M. <ric...@gm...> - 2011-08-04 15:42:25
|
Hello, I see that each UML instance starts a variable number of threads/processes. I'm using UML in a batch system (Sun Grid Engine 6.2); SGE kills my jobs because they exceed the allowed memory reservation. My guess is that SGE miscomputes the memory usage by computing the total over all threads/processes without accounting for shared pages. Is there a way to limit the number of threads/processes in a UML instance? (This would allow me to request an amount of memory equal to N*M, where N is the max number of UML threads and M is the memory allocated to UML.) Best regards, Riccardo |
|
From: richard -r. w. <ric...@gm...> - 2011-08-04 17:10:00
|
On Thu, Aug 4, 2011 at 5:42 PM, Riccardo Murri <ric...@gm...> wrote:
> Hello,
>
> I see that each UML instance starts a variable number of threads/processes.
>
> I'm using UML in a batch system (Sun Grid Engine 6.2); SGE kills my
> jobs because they exceed the allowed memory reservation. My guess is
> that SGE miscomputes the memory usage by computing the total over all
> threads/processes without accounting for shared pages.
>
> Is there a way to limit the number of threads/processes in a UML instance?
> (This would allow me to request an amount of memory equal to N*M,
> where N is the max number of UML threads and M is the memory allocated
> to UML.)

UML starts one helper thread on the host side per guest process.
(In SKAS0 mode, which is the default.)
So, you can limit the number of host threads by starting fewer
processes within UML. ;)

Most likely SGE does not detect them as threads because UML uses
clone() to create them...

--
Thanks,
//richard
|
From: Riccardo M. <ric...@gm...> - 2011-10-12 23:37:34
|
Hello,
sorry for resurrecting this old thread, but I need to test my
understanding of the problem and I'd like to ask for a clarification.
On Thu, Aug 4, 2011 at 7:09 PM, richard -rw- weinberger
<ric...@gm...> wrote:
> On Thu, Aug 4, 2011 at 5:42 PM, Riccardo Murri
<ric...@gm...> wrote:
>>
>> I see that each UML instance starts a variable number of threads/processes.
>>
>> I'm using UML in a batch system (Sun Grid Engine 6.2); SGE kills my
>> jobs because they exceed the allowed memory reservation. My guess is
>> that SGE miscomputes the memory usage by computing the total over all
>> threads/processes without accounting for shared pages.
>> [...]
>
> UML starts on the host side per process one helper thread.
> (In SKAS0 mode, which is the default.)
> So, you can limit the number of host threads by starting less
> processes within UML. ;)
>
> Most likely SGE does not detect them as threads because UML uses
> clone() to create them...
Actually, we've seen the same behavior in TORQUE as well, so this is
becoming a major issue for us.
The question is this: I see in the libc sources that clone() is used
to create threads as well. So I guess the difference is in the flags
that are passed to clone() in the two cases?
Now, libc's create_thread() uses (lines 182-188 of the file
"nptl/sysdeps/pthread/createthread.c"):
int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
| CLONE_SETTLS | CLONE_PARENT_SETTID
| CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
#if __ASSUME_NO_CLONE_DETACHED == 0
| CLONE_DETACHED
#endif
| 0);
whereas, if I'm not mistaken, UML uses (file "kernel/skas/clone.c"):
err = stub_syscall2(__NR_clone, CLONE_PARENT | CLONE_FILES | SIGCHLD,
STUB_DATA + UM_KERN_PAGE_SIZE / 2 - sizeof(void *));
But then this means that the additional processes created by UML do
not share the memory space (no CLONE_VM), correct?
Thus:
- batch system schedulers rightly consider each UML "thread" to be
a separate process;
- however, UML "threads" do share a large portion of their memory, as
can be seen from this "top" output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6467 admin 15 0 32.0g 13g 13g S 0.0 27.7 0:00.00 kernel64-3.0.4
6466 admin 16 0 32.0g 13g 13g S 0.0 27.7 0:00.15 kernel64-3.0.4
6465 admin 22 0 32.0g 13g 13g S 0.0 27.7 0:00.00 kernel64-3.0.4
6458 admin 15 0 32.0g 13g 13g S 39.2 27.7 37:00.04 kernel64-3.0.4
7437 admin 15 0 12.0g 12g 12g T 52.9 25.6 70:54.39 kernel64-3.0.4
- so the problem lies in the algorithm that SGE and TORQUE apply for
computing the amount of memory used, which apparently just sums up
the total VSZ for each process (fast), instead of counting the
number of pages while ensuring that each shared page is counted only
once (slow)?
Thanks for any clarification!
Riccardo
|
|
From: Jeff D. <jd...@ad...> - 2011-10-13 01:48:19
|
On Thu, Oct 13, 2011 at 01:37:24AM +0200, Riccardo Murri wrote:
> Thus:
>
> - batch system schedulers do righteously consider each UML "thread" as
>   a separate process;
>
> - however, UML "threads" do share a large portion of the memory, as
>   can be seen from this "ps" output:
> [...]
>
> - so the problem lies in the algorithm that SGE and TORQUE apply for
>   computing the amount of memory used, which apparently just sums up
>   the total VSZ for each process (fast), instead of counting the
>   number of pages while ensuring that each shared page is counted only
>   once (slow)?

Correct on all counts (the first two anyway, and I bet you're right
on the third). UML uses separate address spaces for its processes,
thus they don't look like threads to anything else, but the bulk of
the memory (the UML kernel) in those address spaces is shared.

If you look at /proc/<pid>/smaps for a couple of UML processes, you
should see the sharing.

Jeff
|
From: Riccardo M. <ric...@gm...> - 2011-10-13 07:34:12
|
Hi Jeff, all,

On Thu, Oct 13, 2011 at 3:35 AM, Jeff Dike <jd...@ad...> wrote:
> UML uses separate address spaces for its processes, thus
> they don't look like threads to anything else, but the bulk of the
> memory (the UML kernel) in those address spaces is shared.

Is it technically feasible to modify UML so that it uses "real"
threads instead? (Perhaps at the cost of giving up real process
separation in the UML and assuming processes are "good citizens".)

If yes, do you think it would be within the reach of someone who has
no kernel or UML hacking experience? I presume the bulk of the work
would be re-implementing in UML the memory management that is
currently offloaded to the host kernel?

Thanks for your help!

Riccardo
|
From: Riccardo M. <ric...@gm...> - 2011-12-06 18:48:48
|
Hello,
Sorry again for resurrecting an old thread, but each time I look
into this issue I realize that I haven't quite understood the details...
On Thu, Oct 13, 2011 at 03:35, Jeff Dike <jd...@ad...> wrote:
> On Thu, Oct 13, 2011 at 01:37:24AM +0200, Riccardo Murri wrote:
>> - however, UML "threads" do share a large portion of the memory, as
>> can be seen from this "ps" output:
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 6467 admin 15 0 32.0g 13g 13g S 0.0 27.7 0:00.00 kernel64-3.0.4
>> 6466 admin 16 0 32.0g 13g 13g S 0.0 27.7 0:00.15 kernel64-3.0.4
>> 6465 admin 22 0 32.0g 13g 13g S 0.0 27.7 0:00.00 kernel64-3.0.4
>> 6458 admin 15 0 32.0g 13g 13g S 39.2 27.7 37:00.04 kernel64-3.0.4
>> 7437 admin 15 0 12.0g 12g 12g T 52.9 25.6 70:54.39 kernel64-3.0.4
>
> UML uses separate address spaces for its processes, thus
> they don't look like threads to anything else, but the bulk of the
> memory (the UML kernel) in those address spaces is shared.
>
I couldn't find many explanations of the SKAS0 mode, and the source
code requires too much kernel-fu for me, so I'm trying to understand
it from Jeff Dike's original announcement.
If I got it right:
- The UML kernel runs in its own process (hence kernel space
separation, enforced by the host kernel), which is the parent of
all the UML processes (one per guest process).
- The UML kernel process ptrace()'s its child processes. (just like
in TT mode, right?)
- Two extra memory pages are allocated per child process, which
are to communicate with the kernel process during syscalls.
Since these pages need to be shared among two host processes (the
UML kernel and its child), they are allocated through mmap()
backed by a temporary file.
Actually, I guess that the whole UML memory is allocated as mmap()'ed
pages from a temporary file: the UML kernel creates a file the size of the
requested memory, and when it has to satisfy a memory allocation it
just mmap()'s a page from that file. Correct?
In addition, *every* syscall generates a SIGTRAP to the UML kernel
process, which handles it. The advantage of SKAS0 over TT is that
memory management syscalls allow the separation of kernel and process
address space, but every other syscall needs to be handled exactly as
in TT: e.g., open() needs to map paths using the UML filesystem, etc.
Right?
Now a final question: according to the above `ps` output, the shared
memory among UML processes is ~13GB each. If the above is correct,
only the UML kernel process should have large shared memory. Is this
due to `top` misreporting shared memory occupation? (CentOS 5.x w/
stock kernel) Or could it be rather a feature of the program that was
running in the UML? (a data-intensive scientific application)
Thank you very much for any explanation.
Best regards,
Riccardo
|
|
From: richard -r. w. <ric...@gm...> - 2011-12-06 19:33:39
|
Hi,

On Tue, Dec 6, 2011 at 7:48 PM, Riccardo Murri <ric...@gm...> wrote:
>
> If I got it right:
>
> - The UML kernel runs in its own process (hence kernel space
>   separation, enforced by the host kernel), which is the parent of
>   all the UML processes (one per guest process).

The separation is enforced by memory mappings and mprotect().
A strict separation via processes would make UML very slow.
(Yes, even slower :P)

> - The UML kernel process ptrace()'s its child processes. (just like
>   in TT mode, right?)

Yep. UML is a system call emulator. Thus, to the guest processes UML
looks like a real kernel...

> - Two extra memory pages are allocated per child process, which
>   are to communicate with the kernel process during syscalls.

Yes. This is where the black magic happens. UML installs hooks into
the guest processes such that they cannot remove or modify memory
mappings.

>   Since these pages need to be shared among two host processes (the
>   UML kernel and its child), they are allocated through mmap()
>   backed by a temporary file.
>
> Actually, I guess that the whole UML memory is allocated as mmap()'ed
> pages from a temporary file: the UML kernel creates a file the size of
> the requested memory, and when it has to satisfy a memory allocation it
> just mmap()'s a page from that file. Correct?

Correct. Using this technique the kernel is able to share only some
parts with other guest processes.

> In addition, *every* syscall generates a SIGTRAP to the UML kernel
> process, which handles it. The advantage of SKAS0 over TT is that
> memory management syscalls allow the separation of kernel and process
> address space, but every other syscall needs to be handled exactly as
> in TT: e.g., open() needs to map paths using the UML filesystem, etc.
> Right?

Correct. As I said, UML is a system call emulator. It uses ptrace()
to get notified of every executed system call and emulates it.
IOW, UML is a ptrace()-based Linux sandbox...

> Now a final question: according to the above `ps` output, the shared
> memory among UML processes is ~13GB each. If the above is correct,
> only the UML kernel process should have large shared memory. Is this
> due to `top` misreporting shared memory occupation? (CentOS 5.x w/
> stock kernel) Or could it be rather a feature of the program that was
> running in the UML? (a data-intensive scientific application)

If your UML instance has 512MiB, all UML processes (kernel and guest)
together use 512MiB. To tools like "top" it looks as if each process
used 512MiB on its own. "top" cannot know that these processes are
threads (constructed using clone()) and share all memory. That would
only work if UML used pthreads. Using clone() you can create nearly
any kind of (unportable) thread.

I don't know whether it's possible to implement SKAS0 using pthreads.

--
Thanks,
//richard
|
From: Jeff D. <jd...@ad...> - 2011-12-06 20:50:04
|
On Tue, Dec 06, 2011 at 07:48:40PM +0100, Riccardo Murri wrote:
> Sorry again for resurrecting an old thread, but I each time I look
> into this issue I realize that I haven't quite understood the details...

You basically have it all right.

> In addition, *every* syscall generates a SIGTRAP to the UML kernel
> process, which handles it. The advantage of SKAS0 over TT is that
> memory management syscalls allow the separation of kernel and process
> address space, but every other syscall needs to be handled exactly as
> in TT: e.g., open() needs to map paths using the UML filesystem, etc.
> Right?

A little off the rails here - in TT mode, there is one address space
in which userspace runs; on every context switch, that address space
needs to be completely remapped in order to become the memory of the
switched-in process.

In SKAS, every UML process has a host address space, and UML process
context switching is done by the host, at hardware speed.

> Now a final question: according to the above `ps` output, the shared
> memory among UML processes is ~13GB each. If the above is correct,
> only the UML kernel process should have large shared memory. Is this
> due to `top` misreporting shared memory occupation? (CentOS 5.x w/
> stock kernel) Or could it be rather a feature of the program that was
> running in the UML? (a data-intensive scientific application)

I can't find any documentation of the exact meaning of SHR, but I'd
guess that it's looking at MAP_SHARED pages, which for a UML process
is everything. No utilities are good at accounting for shared memory.
If you just add up the numbers, you end up far away from reality.

Jeff
|
From: Riccardo M. <ric...@gm...> - 2011-12-06 22:05:59
|
Hi Jeff, Richard,

many thanks for your explanations! I think I got it now...
One more question:

On Tue, Dec 6, 2011 at 21:49, Jeff Dike <jd...@ad...> wrote:
> A little off the rails here - in TT mode, there is one address space
> in which userspace runs, on every context switch, that address space
> needs to be completely remapped in order to become the memory of the
> switched-in process.

Does this mean that in TT mode all UML "guest processes" are really
threads of a single host process? I.e., they are created with
clone(CLONE_VM|...) so they literally share every single page of
memory? (So it's the job of the UML kernel to mprotect() all the
pages upon every in-UML context switch?)

Best regards,
Riccardo
|
From: <cl...@cl...> - 2011-12-06 21:51:32
|
Is there still a chance that the skas0 patch will end up in the mainline?

> On Tue, Dec 06, 2011 at 07:48:40PM +0100, Riccardo Murri wrote:
> [...]
>
> Jeff
|
From: richard -r. w. <ric...@gm...> - 2011-12-06 21:56:44
|
On Tue, Dec 6, 2011 at 10:31 PM, <cl...@cl...> wrote: > > Is there still a chance that the skas0 patch will end up in the mainline? > It is already in mainline. -- Thanks, //richard |
|
From: <cl...@cl...> - 2011-12-06 23:35:40
|
I did not read everything carefully, but I thought that with skas0 I
would see only one UML Linux process on the host for the whole UML
machine, and also that it would not take /dev/shm memory anymore.

Sometimes, with a weak PC or a bad /dev/shm config, if you start too
many machines, the /dev/shm memory reaches its limit and new machines
crash.

Do I get skas0 with mainline kernel 3.1.1 without any more options at
UML launch than before - I mean, is it the default?

> On Tue, Dec 6, 2011 at 10:31 PM, <cl...@cl...> wrote:
>>
>> Is there still a chance that the skas0 patch will end up in the
>> mainline?
>>
>
> It is already in mainline.
>
> --
> Thanks,
> //richard
|
From: richard -r. w. <ric...@gm...> - 2011-12-06 22:44:34
|
On Tue, Dec 6, 2011 at 11:40 PM, <cl...@cl...> wrote:
>
> I did not read everything carefully but I thought that with the skas0,
> I would see only one uml linux process in the host for the whole uml
> machine, and also that it would not take the /dev/shm memory anymore.

No. Maybe you are referring to SKAS3/4?

> Sometimes, with some weak PC or a bad /dev/shm config, if you put too
> many machines, the /dev/shm memory reaches its limit and new machines
> crash.
>
> Do I have the skas0 with mainline kernel 3.1.1 and without any more
> options at uml launch than before, I mean is it the default value?

SKAS0 is the default.

--
Thanks,
//richard
|
From: <cl...@cl...> - 2011-12-07 01:35:26
|
Sorry! I was referring to SKAS3/4; I forgot the number.

So are SKAS3/4 still alive? I have never tried them; I was waiting
for them to get into the mainline. The normal UML machine is already
very good, but if it can be even better, that would be good news.

> On Tue, Dec 6, 2011 at 11:40 PM, <cl...@cl...> wrote:
>>
>> I did not read everything carefully but I thought that with the skas0,
>> I would see only one uml linux process in the host for the whole uml
>> machine, and also that it would not take the /dev/shm memory anymore.
>
> No.
> Maybe your are referring to SKAS3/4?
>
> SKAS0 is default.
>
> --
> Thanks,
> //richard