From: Riccardo M. <ric...@gm...> - 2011-10-12 23:37:34
Hello,
sorry for resurrecting this old thread, but I need to test my
understanding of the problem and I'd like to ask for a clarification.
On Thu, Aug 4, 2011 at 7:09 PM, richard -rw- weinberger
<ric...@gm...> wrote:
> On Thu, Aug 4, 2011 at 5:42 PM, Riccardo Murri
> <ric...@gm...> wrote:
>>
>> I see that each UML instance starts a variable number of threads/processes.
>>
>> I'm using UML in a batch system (Sun Grid Engine 6.2); SGE kills my
>> jobs because they exceed the allowed memory reservation. My guess is
>> that SGE miscomputes the memory usage by computing the total over all
>> threads/processes without accounting for shared pages.
>> [...]
>
> UML starts on the host side per process one helper thread.
> (In SKAS0 mode, which is the default.)
> So, you can limit the number of host threads by starting less
> processes within UML. ;)
>
> Most likely SGE does not detect them as threads because UML uses
> clone() to create them...
Actually we've seen the same behavior also in TORQUE, so this is
becoming a major issue for us.
The question is this: I see in the libc sources that clone() is used
to create threads as well. So I guess the difference is in the flags
that are passed to clone() in the two cases?
Now, libc create_thread() uses (lines 182--188 of file
"nptl/sysdeps/pthread/createthread.c"):
    int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
                       | CLONE_SETTLS | CLONE_PARENT_SETTID
                       | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
    #if __ASSUME_NO_CLONE_DETACHED == 0
                       | CLONE_DETACHED
    #endif
                       | 0);
whereas, if I'm not mistaken, UML uses (file "kernel/skas/clone.c"):
    err = stub_syscall2(__NR_clone, CLONE_PARENT | CLONE_FILES | SIGCHLD,
                        STUB_DATA + UM_KERN_PAGE_SIZE / 2 - sizeof(void *));
But then this means that the additional processes created by UML do
not share the memory space (no CLONE_VM), correct?
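
To make the comparison concrete, here is a minimal sketch of the check I have in mind. It is my own illustration, not code from libc or UML: the K_ prefixed names and the three helper functions are hypothetical, and the numeric values are the clone flag values from the Linux kernel headers, copied here so that the snippet builds without <sched.h> and _GNU_SOURCE. CLONE_SIGNAL is expanded to CLONE_SIGHAND | CLONE_THREAD, which is how glibc defines it, and the optional CLONE_DETACHED is left out.

```c
#include <signal.h>   /* SIGCHLD */

/* Linux clone flag values, under local (hypothetical) K_ names. */
enum {
    K_CLONE_VM             = 0x00000100,
    K_CLONE_FS             = 0x00000200,
    K_CLONE_FILES          = 0x00000400,
    K_CLONE_SIGHAND        = 0x00000800,
    K_CLONE_PARENT         = 0x00008000,
    K_CLONE_THREAD         = 0x00010000,
    K_CLONE_SYSVSEM        = 0x00040000,
    K_CLONE_SETTLS         = 0x00080000,
    K_CLONE_PARENT_SETTID  = 0x00100000,
    K_CLONE_CHILD_CLEARTID = 0x00200000
};

/* Flag set quoted above from NPTL create_thread(), without CLONE_DETACHED. */
static unsigned long nptl_thread_flags(void)
{
    return K_CLONE_VM | K_CLONE_FS | K_CLONE_FILES
         | K_CLONE_SIGHAND | K_CLONE_THREAD      /* CLONE_SIGNAL in glibc */
         | K_CLONE_SETTLS | K_CLONE_PARENT_SETTID
         | K_CLONE_CHILD_CLEARTID | K_CLONE_SYSVSEM;
}

/* Flag set quoted above from UML's kernel/skas/clone.c. */
static unsigned long uml_stub_flags(void)
{
    return K_CLONE_PARENT | K_CLONE_FILES | SIGCHLD;
}

/* Nonzero if the child would run in the caller's address space. */
static int shares_address_space(unsigned long flags)
{
    return (flags & K_CLONE_VM) != 0;
}

/* Nonzero if the child would join the caller's thread group. */
static int joins_thread_group(unsigned long flags)
{
    return (flags & K_CLONE_THREAD) != 0;
}
```

With these definitions, shares_address_space() and joins_thread_group() are both nonzero for the NPTL flag set and both zero for the UML stub flag set, which is the difference I am asking about.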
Thus:
- batch system schedulers rightly consider each UML "thread" as
a separate process;
- however, UML "threads" do share a large portion of the memory, as
can be seen from this "top" output:
   PID USER   PR  NI   VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
  6467 admin  15   0  32.0g  13g  13g S  0.0 27.7   0:00.00 kernel64-3.0.4
  6466 admin  16   0  32.0g  13g  13g S  0.0 27.7   0:00.15 kernel64-3.0.4
  6465 admin  22   0  32.0g  13g  13g S  0.0 27.7   0:00.00 kernel64-3.0.4
  6458 admin  15   0  32.0g  13g  13g S 39.2 27.7  37:00.04 kernel64-3.0.4
  7437 admin  15   0  12.0g  12g  12g T 52.9 25.6  70:54.39 kernel64-3.0.4
- so the problem lies in the algorithm that SGE and TORQUE apply for
computing the amount of memory used, which apparently just sums up
the total VSZ for each process (fast), instead of counting the
number of pages while ensuring that each shared page is counted only
once (slow)?
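
To illustrate the difference between the two ways of counting, here is a rough sketch that works on the text of /proc/<pid>/smaps. It assumes a kernel recent enough to report a "Pss:" (proportional set size) line per mapping, where each shared page is divided among the processes that map it. The helper name sum_smaps_field is my own, and I do not know whether SGE or TORQUE read smaps at all; this only shows why summing a per-process figure over all UML processes overcounts, while a sum of Pss would not.

```c
#include <stdio.h>
#include <string.h>

/* Sum every "<field>: <n> kB" line found in smaps-formatted text.
   Returns the total in kB; lines that do not start with exactly
   "<field>:" (mapping headers, other fields) are skipped. */
static long sum_smaps_field(const char *text, const char *field)
{
    size_t flen = strlen(field);
    long total = 0;
    const char *line = text;

    while (line != NULL && *line != '\0') {
        long kb;

        if (strncmp(line, field, flen) == 0 && line[flen] == ':' &&
            sscanf(line + flen + 1, "%ld", &kb) == 1)
            total += kb;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return total;
}
```

Reading /proc/<pid>/smaps for each UML process into a buffer and adding up sum_smaps_field(buf, "Rss") counts every shared page once per process, whereas adding up sum_smaps_field(buf, "Pss") counts each shared page roughly once in total.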
Thanks for any clarification!
Riccardo