From: Terry H. <ter...@gm...> - 2013-04-06 19:23:57
Hello guys, is there any available resource that explains how user-mode-linux maps the pages of a task in UML to the host kernel?

In my UML, I modified a task's page table when forking it. I then ran into a situation where the page fault happens over and over again for the same address in the forked task. Using gdb, I found that when the page fault happens for the first time, the kernel calls do_wp_page() to fault in the page and marks the page present. This should prevent the next page fault for the same address from happening again. I checked the PTEs in UML and they are marked as present. So is it possible that the page is not being allocated properly on the host kernel, so that the page fault keeps happening for the same address even though UML thinks the page is present?

Any suggestions? Thanks!
From: richard -r. w. <ric...@gm...> - 2013-04-07 16:52:29
On Sat, Apr 6, 2013 at 9:23 PM, Terry Hsu <ter...@gm...> wrote:
> Is there any available resource that explains how user-mode-linux maps the
> pages of a task in UML to the host kernel?

The code...? ;)
UML receives a SIGSEGV on the host side if a page is not mapped.
The SIGSEGV handler then installs the mapping using mmap().

> In my UML, I modified a task's page table when forking it. Then I ran into a
> situation where the page fault happens over and over again for the same
> address in the forked task. I use gdb debugger and find out that when the
> page fault happens for the first time, the kernel calls do_wp_page() to
> fault in the page and marks the page present. This should prevent the next
> page fault for the same address from happening again. I checked the PTE in
> UML, they are marked as present so is it possible that the page is not being
> allocated properly on the host kernel so that the page fault keeps happening
> for the same address even though UML thinks the page is present.
>
> Any suggestions?

If the same fault happens over and over, UML (on the host side) seems unable to fix the fault. Check the return values of mmap()....

Thanks,
//richard
From: Peter B. <pb...@pt...> - 2013-04-07 18:30:54
Here's one more example, still the same setup, but this time crashing at the same place as the original bug report (BUG: failure at block/blk-core.c:2978/blk_flush_plug_list()!). See below for output.

BTW my host setup is Linux Mint 14:

Linux ufo 3.5.0-17-generic #28-Ubuntu SMP Tue Oct 9 19:31:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

bash $ ./linux ubd0=Fedora18-AMD64-root_fs rw mem=4096M con0=fd:0,fd:1
Core dump limits :
        soft - 0
        hard - NONE
Checking that ptrace can change system call numbers...OK
Checking syscall emulation patch for ptrace...OK
Checking advanced syscall emulation patch for ptrace...OK
Checking for tmpfs mount on /dev/shm...nothing mounted on /dev/shm
Checking PROT_EXEC mmap in /tmp/...OK
Checking for the skas3 patch in the host:
  - /proc/mm...not found: No such file or directory
  - PTRACE_FAULTINFO...not found
  - PTRACE_LDT...not found
UML running in SKAS0 mode
Adding 26181632 bytes to physical memory to account for exec-shield gap
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 3.9.0-rc5 (pbutler@ufo) (gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1) ) #1 Sat Apr 6 13:15:06 EDT 2013
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 1040544
Kernel command line: ubd0=Fedora18-AMD64-root_fs rw mem=4096M con0=fd:0,fd:1 root=98:0
PID hash table entries: 4096 (order: 3, 32768 bytes)
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Memory: 4068740k available
NR_IRQS:15
Calibrating delay loop... 3800.26 BogoMIPS (lpj=19001344)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 256
Initializing cgroup subsys cpuacct
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys blkio
Checking that host ptys support output SIGIO...Yes
Checking that host ptys support SIGIO on close...No, enabling workaround
devtmpfs: initialized
Using 2.6 host AIO
NET: Registered protocol family 16
bio: create slab <bio-0> at 0
Switching to clocksource itimer
NET: Registered protocol family 2
TCP established hash table entries: 32768 (order: 7, 524288 bytes)
TCP bind hash table entries: 32768 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 32768 bind 32768)
TCP: reno registered
UDP hash table entries: 2048 (order: 4, 65536 bytes)
UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes)
NET: Registered protocol family 1
mconsole (version 2) initialized on /home/pbutler/.uml/ovuM3w/mconsole
Checking host MADV_REMOVE support...OK
VFS: Disk quotas dquot_6.5.2
Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
msgmni has been set to 7946
io scheduler noop registered
io scheduler deadline registered (default)
TCP: cubic registered
NET: Registered protocol family 17
Initialized stdio console driver
Console initialized on /dev/tty0
console [tty0] enabled
Initializing software serial port version 1
console [mc-1] enabled
ubda: unknown partition table
EXT4-fs (ubda): couldn't mount as ext3 due to feature incompatibilities
EXT4-fs (ubda): couldn't mount as ext2 due to feature incompatibilities
EXT4-fs (ubda): warning: maximal mount count reached, running e2fsck is recommended
EXT4-fs (ubda): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext4 filesystem) on device 98:0.
devtmpfs: mounted
systemd[1]: systemd 197 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Welcome to Fedora 18 (Spherical Cow)!
systemd[1]: Failed to insert module 'autofs4'
systemd[1]: No hostname configured.
systemd[1]: Set hostname to <localhost>.
systemd[1]: Failed to enable kbrequest handling: Inappropriate ioctl for device
systemd[1]: Cannot add dependency job for unit display-manager.service, ignoring: Unit display-manager.service failed to load: No such file or directory. See system logs and 'systemctl status display-manager.service' for details.
systemd[1]: Started Replay Read-Ahead Data.
systemd[1]: Starting Collect Read-Ahead Data...
         Starting Collect Read-Ahead Data...
systemd[1]: Starting Forward Password Requests to Wall Directory Watch.
systemd[1]: Started Forward Password Requests to Wall Directory Watch.
systemd[1]: Starting Remote File Systems.
[  OK  ] Reached target Remote File Systems.
systemd-readahead[228]: Failed to create fanotify object: Function not implemented
systemd[1]: Reached target Remote File Systems.
systemd[1]: Starting Syslog Socket.
[  OK  ] Listening on Syslog Socket.
systemd[1]: Listening on Syslog Socket.
systemd[1]: Starting /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
systemd[1]: Listening on /dev/initctl Compatibility Named Pipe.
systemd[1]: Starting Delayed Shutdown Socket.
[  OK  ] Listening on Delayed Shutdown Socket.
systemd[1]: Listening on Delayed Shutdown Socket.
systemd[1]: Starting Encrypted Volumes.
[  OK  ] Reached target Encrypted Volumes.
systemd[1]: Reached target Encrypted Volumes.
systemd[1]: Starting Arbitrary Executable File Formats File System Automount Point.
systemd[1]: Failed to open /dev/autofs: No such file or directory
systemd[1]: Failed to initialize automounter: No such file or directory
[FAILED] Failed to set up automount Arbitrary Executable File...utomount Point.
See 'systemctl status proc-sys-fs-binfmt_misc.automount' for details.
systemd[1]: Failed to set up automount Arbitrary Executable File Formats File System Automount Point.
systemd[1]: Unit proc-sys-fs-binfmt_misc.automount entered failed state
systemd[1]: Starting LVM2 metadata daemon socket.
[  OK  ] Listening on LVM2 metadata daemon socket.
systemd[1]: Listening on LVM2 metadata daemon socket.
systemd[1]: Starting Device-mapper event daemon FIFOs.
[  OK  ] Listening on Device-mapper event daemon FIFOs.
systemd[1]: Listening on Device-mapper event daemon FIFOs.
systemd[1]: Starting Swap.
[  OK  ] Reached target Swap.
systemd[1]: Reached target Swap.
systemd[1]: Starting udev Kernel Socket.
[  OK  ] Listening on udev Kernel Socket.
systemd[1]: Listening on udev Kernel Socket.
systemd[1]: Starting udev Control Socket.
[  OK  ] Listening on udev Control Socket.
systemd[1]: Listening on udev Control Socket.
systemd[1]: Starting Journal Socket.
[  OK  ] Listening on Journal Socket.
systemd[1]: Listening on Journal Socket.
systemd[1]: Starting Syslog.
[  OK  ] Reached target Syslog.
systemd[1]: Reached target Syslog.
systemd[1]: Mounting Temporary Directory...
         Mounting Temporary Directory...
systemd[1]: tmp.mount: Directory /tmp to mount over is not empty, mounting anyway.
systemd[1]: Started Import network configuration from initramfs.
systemd[1]: Starting Configure read-only root support...
         Starting Configure read-only root support...
systemd[1]: Mounted Huge Pages File System.
systemd[1]: Starting Journal Service...
         Starting Journal Service...
[  OK  ] Started Journal Service.
systemd[1]: Started Journal Service.
systemd[1]: Mounted Debug File System.
systemd[1]: Mounting POSIX Message Queue File System...
         Mounting POSIX Message Queue File System...
systemd[1]: Starting udev Kernel Device Manager...
         Starting udev Kernel Device Manager...
systemd[1]: Starting udev Coldplug all Devices...
         Starting udev Coldplug all Devices...
systemd[1]: systemd-readahead-collect.service: main process exited, code=exited, status=1/FAILURE
[  OK  ] Started Collect Read-Ahead Data.
systemd[1]: Started Collect Read-Ahead Data.
[  OK  ] Mounted Temporary Directory.
systemd[1]: Mounted Temporary Directory.
systemd[1]: Started Load legacy module configuration.
systemd[1]: Started Load Kernel Modules.
systemd[1]: Mounted Configuration File System.
systemd[1]: Mounted FUSE Control File System.
systemd[1]: Started File System Check on Root Device.
systemd[1]: Starting Remount Root and Kernel File Systems...
         Starting Remount Root and Kernel File Systems...
systemd[1]: Started Set Up Additional Binary Formats.
systemd[1]: Starting Apply Kernel Variables...
         Starting Apply Kernel Variables...
systemd[1]: Starting Setup Virtual Console...
         Starting Setup Virtual Console...
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Started Remount Root and Kernel File Systems.
[  OK  ] Reached target Local File Systems (Pre).
         Starting Load Random Seed...
[  OK  ] Started Apply Kernel Variables.
[FAILED] Failed to start Setup Virtual Console.
See 'systemctl status systemd-vconsole-setup.service' for details.
[  OK  ] Started Load Random Seed.
[  OK  ] Started Configure read-only root support.
[  OK  ] Started udev Kernel Device Manager.
systemd-udevd[238]: starting version 197
[  OK  ] Started udev Coldplug all Devices.
         Starting udev Wait for Complete Device Initialization...
         Starting Show Plymouth Boot Screen...
BUG: failure at block/blk-core.c:2978/blk_flush_plug_list()!
Kernel panic - not syncing: BUG!
Call Trace:
160477d70:  [<6024be78>] panic+0x145/0x2a7
160477da8:  [<6024bd33>] panic+0x0/0x2a7
160477de8:  [<6024bfda>] printk+0x0/0xa0
160477e60:  [<600182c0>] _init+0x7e0/0x8b0
160477e80:  [<6018c15d>] blk_flush_plug_list+0x191/0x252
160477ec0:  [<60046970>] sigsuspend+0x0/0x9e
160477ed0:  [<600182c0>] _init+0x7e0/0x8b0
160477ef0:  [<602503c0>] schedule+0x6a/0x78
160477f00:  [<6004579c>] set_current_blocked+0x17/0x19
160477f10:  [<600469cc>] sigsuspend+0x5c/0x9e
160477f30:  [<6001e6da>] winch_thread+0x204/0x242
160477fd0:  [<6001e4d6>] winch_thread+0x0/0x242
Modules linked in:
Pid: 1615311232, comm: Not tainted 3.9.0-rc5
RIP: 12f0:[<0000000160476e50>]
RSP: 0000000000000000  EFLAGS: 00000000
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000016047b3a8
RDX: 000000016047b3a8 RSI: 000000016047b3b8 RDI: 000000016047b3b8
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 000000016047b380 R11: 000000016047b380 R12: 0000000000000000
R13: 0000000c00000000 R14: 000000000c731a70 R15: 0000000051618cac
Call Trace:
160477cc8:  [<6006b2d6>] __module_text_address+0x14/0x5a
160477ce0:  [<6001c48f>] panic_exit+0x3a/0x58
160477cf0:  [<6004eba2>] __kernel_text_address+0x30/0x5c
160477d10:  [<60055b34>] notifier_call_chain+0x32/0x5c
160477d38:  [<600182c0>] _init+0x7e0/0x8b0
160477d50:  [<60055b6e>] __atomic_notifier_call_chain+0x10/0x12
160477d60:  [<60055b86>] atomic_notifier_call_chain+0x16/0x18
160477d70:  [<6024beab>] panic+0x178/0x2a7
160477da8:  [<6024bd33>] panic+0x0/0x2a7
160477de8:  [<6024bfda>] printk+0x0/0xa0
160477e60:  [<600182c0>] _init+0x7e0/0x8b0
160477e80:  [<6018c15d>] blk_flush_plug_list+0x191/0x252
160477ec0:  [<60046970>] sigsuspend+0x0/0x9e
160477ed0:  [<600182c0>] _init+0x7e0/0x8b0
160477ef0:  [<602503c0>] schedule+0x6a/0x78
160477f00:  [<6004579c>] set_current_blocked+0x17/0x19
160477f10:  [<600469cc>] sigsuspend+0x5c/0x9e
160477f30:  [<6001e6da>] winch_thread+0x204/0x242
160477fd0:  [<6001e4d6>] winch_thread+0x0/0x242
systemd-journald[233]: Received SIGUSR1
From: richard -r. w. <ric...@gm...> - 2013-04-07 21:55:30
Attachments:
no_winch.diff
On Sun, Apr 7, 2013 at 8:30 PM, Peter Butler <pb...@pt...> wrote:
> Here's one more example, still the same setup, but this time crashing at
> the same place as the original bug report. (BUG: failure at
> block/blk-core.c:2978/blk_flush_plug_list()!) See below for output.
>
> BTW my host setup is Linux Mint 14:
>
> Linux ufo 3.5.0-17-generic #28-Ubuntu SMP Tue Oct 9 19:31:23 UTC 2012
> x86_64 x86_64 x86_64 GNU/Linux
>
> bash $ ./linux ubd0=Fedora18-AMD64-root_fs rw mem=4096M con0=fd:0,fd:1

Please don't post into unrelated threads.

Anyway, all your crashes share one thing: before the crash, UML did a sigsuspend(). It only does so in the SIGWINCH irq path. To verify my theory, please apply the attached patch. It disables the feature whereby you can resize a UML window on the host side and UML will change the terminal size in the guest.

Thanks,
//richard
From: Terry H. <ter...@gm...> - 2013-04-11 04:16:35
Hi Richard, thanks for replying. I did go back to the code and tried to understand what exactly is going on in UML, but still no luck.

The faulted address is covered by one of the vm areas of the task, so it passed the vma sanity check at the beginning of handle_page_fault(). I printed out the PTEs of the task and noticed one strange thing: when the fault happens for the first time, the PTE does not exist; the PTE is present when the second fault happens for the same address (but it is still a page fault); and on the third page fault (same address), the PTE does not exist anymore.

So in my case, the faulted address does not require a new vma to be installed.

Also, I've looked into copy_mm() to see how pages are copied from a parent task to its child. I do not understand the purpose of the special mapping installed by UML. It seems that every new task with a new mm_struct will have one special mapping at the head of its vma list.

Thanks.

On Sun, Apr 7, 2013 at 12:52 PM, richard -rw- weinberger <ric...@gm...> wrote:
> On Sat, Apr 6, 2013 at 9:23 PM, Terry Hsu <ter...@gm...> wrote:
> > Is there any available resource that explains how user-mode-linux maps the
> > pages of a task in UML to the host kernel?
>
> The code...? ;)
> UML receives a SIGSEGV on the host side if a page is not mapped.
> The SIGSEGV handler then installs the mapping using mmap().
>
> > In my UML, I modified a task's page table when forking it. Then I ran into a
> > situation where the page fault happens over and over again for the same
> > address in the forked task. [...]
> >
> > Any suggestions?
>
> If the same fault happens over and over UML (on the host side) seems
> unable to fix the fault.
> Check the return values of mmap()....
>
> Thanks,
> //richard
From: richard -r. w. <ric...@gm...> - 2013-04-11 13:05:06
On Thu, Apr 11, 2013 at 6:15 AM, Terry Hsu <ter...@gm...> wrote:
> Hi Richard, thanks for replying. I did go back to see the code and try to
> understand what exactly is going on in UML, but still no luck.
>
> The faulted address is covered by one of the vm areas of the task, so it
> passed the vma sanity check at the beginning of handle_page_fault(). [...]
>
> So in my case, the faulted address does not require a new vma to be
> installed.

But this is a feature added by you?
We are not talking about a mainline kernel, right?

> Also I've looked into copy_mm() to see how pages are copied from parent task
> to its child. I do not understand the purpose of the special mapping
> installed by UML. It seems that every new task with a new mm_struct will
> have one special mapping at the head of its vma list.

The special mapping (the SKAS stub) is needed to install new mappings from the host side of UML. Currently the stub pages have a vma; this will go away such that they have only a PTE.

--
Thanks,
//richard
From: Terry H. <ter...@gm...> - 2013-04-11 20:15:03
The page fault loop for the same address happens in my UML. But for both my UML and the mainline (I am using 3.7.1) kernel, the addresses that trigger the page fault (in the child thread) are covered by certain vm areas. I used gdb to trace the function calls and noticed that mmap_region() is never called during the execution of the child task. I am guessing it's because the child task does not use a large enough memory space to have the UML-installed mapping for it.

The major change I made to my kernel is to modify the vm area pointers of certain child tasks to share the vm area structures of their parent task. So the parent task's vm areas are shared (as long as VM_DONTCOPY is not set) among some of its child tasks.

On Thu, Apr 11, 2013 at 9:04 AM, richard -rw- weinberger <ric...@gm...> wrote:
> On Thu, Apr 11, 2013 at 6:15 AM, Terry Hsu <ter...@gm...> wrote:
> > [...]
> >
> > So in my case, the faulted address does not require a new vma to be
> > installed.
>
> But this is a feature added by you?
> We are not talking about a mainline kernel, right?
>
> > Also I've looked into copy_mm() to see how pages are copied from parent task
> > to its child. I do not understand the purpose of the special mapping
> > installed by UML. [...]
>
> The special mapping (the SKAS stub) is needed to install new mappings
> from the host side of UML.
> Currently the stub pages have a vma; this will go away such that they
> have only a PTE.
>
> --
> Thanks,
> //richard
From: richard -r. w. <ric...@gm...> - 2013-04-11 21:19:08
On Thu, Apr 11, 2013 at 10:14 PM, Terry Hsu <ter...@gm...> wrote:
> The page fault loop for the same address happens in my UML. But for both my
> UML and the mainline (I am using 3.7.1) kernel, the addresses that trigger
> the page fault (in the child thread) are covered by certain vm areas. I use
> gdb to trace the function call and notice that mmap_region() is never called
> during the execution of the child task. I am guessing it's because the child
> task does not use large enough memory space to have the UML installed
> mapping for it.

Okay, let's try to figure out what happens here.

The UML _guest_ process has some vmas installed; upon access, the host kernel finds out that there is no memory mapping installed on the _host_ side of UML and sends SIGSEGV to the process. UML's host part catches the SIGSEGV and tries to fix it. Usually it does so by mmap()'ing the faulting page into the UML guest process. This is where the SKAS stub magic happens. It writes the to-be-fixed address into STUB_DATA and sets EIP/RIP to STUB_CODE such that the process itself calls mmap(). After the stub has finished, it traps itself and the UML emulation continues.

Now we need to figure out:
a) What address is faulting and why?
b) What does the UML _host_ side do to fix it? i.e. what are the mmap() parameters?
c) Does this mmap() fail?

To me it looks like UML is unable to fix the fault and therefore it faults over and over again.

--
Thanks,
//richard
From: Terry H. <ter...@gm...> - 2013-04-11 23:00:58
In the unmodified kernel, I did not see the kernel call mmap (which in turn calls mmap_region()) to install the mapping for the faulting page in the child task. The child task does not have the UML-invoked mmap to install the mapping, so I could not examine the parameters passed to mmap, nor its return value.

Thanks for the explanation of the special mapping. After reading your comment I went to Jeff Dike's website to find out more about SKAS: http://user-mode-linux.sourceforge.net/old/skas.html

handle_pte_fault() calls __do_fault(), which in turn invokes filemap_fault() through vma->vm_ops->fault(vma, &vmf). How do I find out exactly what the missed address is for? I am posting the log I printed out below. This is from the unmodified kernel, so the page is faulted in correctly without calling mmap for the forked child task.

Note: this is the correct version of the page fault in the unmodified kernel.

[segv_handler] Caller is userspace+0x25d/0x44c, pid 598 a.out
[segv] Caller is segv_handler+0xb1/0xbb, pid 598 a.out
[handle_page_fault] Caller is segv+0xfa/0x324, pid 598 a.out
[handle_page_fault] fault address: 0x400e9cc8
[handle_page_fault] page walk for 0x400e9cc8
[handle_page_fault] pte does not exist!
[handle_page_fault] before handle_page_fault
[print_mm_rss_stat] mm->rss_stat for mm id: 673
[print_mm_rss_stat] mm->rss_stat.count[0] = 0
[print_mm_rss_stat] mm->rss_stat.count[1] = 27
[print_mm_rss_stat] mm->rss_stat.count[2] = 0
[find_vma] Caller is handle_page_fault+0x1ca/0x957, pid 598 a.out
[handle_mm_fault] Caller is handle_page_fault+0x50d/0x957, pid 598 a.out
[handle_mm_fault] pgd: 295944192
[handle_mm_fault] pud: 295944192
[handle_mm_fault] pmd: 294746112
[handle_mm_fault] pte: 295581512
[handle_pte_fault] calling do_linear_fault
[__do_fault] __do_fault for 0x400e9cc8
[__do_fault] line 3292 of file mm/memory.c, pid 598
[filemap_fault] line 1604 of file mm/filemap.c, pid 598
[filemap_fault] line 1622 of file mm/filemap.c, pid 598
[filemap_fault] line 1654 of file mm/filemap.c, pid 598
[filemap_fault] line 1680 of file mm/filemap.c, pid 598
[__do_fault] line 3312 of file mm/memory.c, pid 598
[__do_fault] line 3367 of file mm/memory.c, pid 598
[__do_fault] line 3395 of file mm/memory.c, pid 598
[__do_fault] line 3408 of file mm/memory.c, pid 598
[__do_fault] line 3425 of file mm/memory.c, pid 598
[__do_fault] line 3458 of file mm/memory.c, pid 598
[__do_fault] __do_fault for 0x400e9cc8 returning 512
[handle_page_fault] line 205 of file arch/um/kernel/trap.c, pid 598
[handle_page_fault] mm->mm_id: 673
[flush_tlb_page] Caller is handle_page_fault+0x7f5/0x957, pid 598 a.out
[flush_tlb_page] mm->mm_id: 673
[handle_page_fault] page walk for 0x400e9cc8
[handle_page_fault] pte for 0x400e9cc8: 0x119e3748
[handle_page_fault] after handle_page_fault
[print_mm_rss_stat] mm->rss_stat for mm id: 673
[print_mm_rss_stat] mm->rss_stat.count[0] = 1
[print_mm_rss_stat] mm->rss_stat.count[1] = 27
[print_mm_rss_stat] mm->rss_stat.count[2] = 0

On Thu, Apr 11, 2013 at 5:19 PM, richard -rw- weinberger <ric...@gm...> wrote:
> On Thu, Apr 11, 2013 at 10:14 PM, Terry Hsu <ter...@gm...> wrote:
> > [...]
>
> Okay, let's try to figure out what happens here.
> The UML _guest_ process has some vmas installed; upon access the host
> kernel finds out that there is no memory mapping installed in the _host_
> side of UML and sends SIGSEGV to the process. UML's host part catches the
> SIGSEGV and tries to fix it. Usually it does so by mmap()'ing the faulting
> page into the UML guest process.
> This is where the SKAS stub magic happens. It writes the to-be-fixed
> address into STUB_DATA and sets EIP/RIP to STUB_CODE such that the process
> itself calls mmap(). After the stub has finished it traps itself and the
> UML emulation continues.
>
> Now we need to figure out: a) What address is faulting and why? b) What
> does the UML _host_ side do to fix it? i.e. What are the mmap() parameters?
> c) Does this mmap() fail?
>
> To me it looks like UML is unable to fix the fault and therefore it
> faults over and over again.
>
> --
> Thanks,
> //richard
From: Terry H. <ter...@gm...> - 2013-04-12 05:15:41
Okay, so I looked into the faultinfo structure and was able to obtain the faulting address, error code, and trap number(?). From my understanding, the error code is the bottom 3 bits of the exception code, but I sometimes see error code "20" and do not know what it means.

I am now looking at how the special mapping works with the host kernel. I think this might lead me to the solution of my problem. It sounds like the special mapping is not installed correctly, so UML was not able to fix the fault.

On Thu, Apr 11, 2013 at 7:00 PM, Terry Hsu <ter...@gm...> wrote:
> [...]
From: Terry H. <ter...@gm...> - 2013-04-12 19:59:43
Do you know which functions are used by UML to write the to be fixed address into SKAS stub? In the handle_page_fault(), the stub mapping is never referenced. I print out vm area (in find_vma()) if the address cover by the stub mapping is referenced, and it prints nothing there. I want to know when/where the UML writes the to be fixed address into SKAS stub so I can fix the problem accordingly. I think my UML is using the wrong SKAS stub to fixed the fault... Thanks! On Fri, Apr 12, 2013 at 1:14 AM, Terry Hsu <ter...@gm...> wrote: > okay so I looked into the faultinfo structure and was able to obtain the > faulting address, error code, and trap number(?). From my understanding the > error code is the bottom 3 bits of the exception code. But I see error code > "20" sometimes and do not what it means. > I am now looking at how the special mapping works with the host kernel. I > think this might lead me to the solution of my problem. It sounds like the > special mapping is not installed correctly so that the UML was not able to > fix the fault. > > > > > On Thu, Apr 11, 2013 at 7:00 PM, Terry Hsu <ter...@gm...> wrote: > >> In the unmodified kernel, I did not see the kernel call mmap (which in >> turn calls mmap_region) to install the mapping for the faulting page in >> child task. The child task does not have the UML invoked mmap to install >> mapping. So I could not examine the parameters passed to mmap neither the >> return value of it. >> >> Thanks for the explanation of the special mapping. After reading your >> comment I went to Jeff Dike's website to find out more about skas: >> http://user-mode-linux.sourceforge.net/old/skas.html >> >> The handle_pte_fault() calls __do_fault(), which in turn invokes >> filemap_fault() through >> vma->vm_ops->fault(vma, &vmf). How do I find out exactly what the miss >> address is for? I am posting the log I print out here. This is the >> unmodified kernel version. 
So the page is faulted in correctly without >> calling mmap for the forked child task. >> >> *Note: this is the correct version of page fault in the unmodified >> kernel.* >> [segv_handler] Caller is userspace+0x25d/0x44c, pid 598 a.out >> [segv] Caller is segv_handler+0xb1/0xbb, pid 598 a.out >> [handle_page_fault] Caller is segv+0xfa/0x324, pid 598 a.out >> [handle_page_fault] fault address: 0x400e9cc8 >> [handle_page_fault] page walk for 0x400e9cc8 >> [handle_page_fault] pte does not exist! >> [handle_page_fault] before handle_page_fault >> [print_mm_rss_stat] mm->rss_stat for mm id: 673 >> [print_mm_rss_stat] mm->rss_stat.count[0] = 0 >> [print_mm_rss_stat] mm->rss_stat.count[1] = 27 >> [print_mm_rss_stat] mm->rss_stat.count[2] = 0 >> [find_vma] Caller is handle_page_fault+0x1ca/0x957, pid 598 a.out >> [handle_mm_fault] Caller is handle_page_fault+0x50d/0x957, pid 598 a.out >> [handle_mm_fault] pgd: 295944192 >> [handle_mm_fault] pud: 295944192 >> [handle_mm_fault] pmd: 294746112 >> [*handle_mm_fault*] pte: 295581512 >> [*handle_pte_fault*] calling do_linear_fault >> [*__do_fault*] __do_fault for 0x400e9cc8 >> [__do_fault] line 3292 of file mm/memory.c, pid 598 >> [*filemap_fault*] line 1604 of file mm/filemap.c, pid 598 >> [filemap_fault] line 1622 of file mm/filemap.c, pid 598 >> [filemap_fault] line 1654 of file mm/filemap.c, pid 598 >> [filemap_fault] line 1680 of file mm/filemap.c, pid 598 >> [__do_fault] line 3312 of file mm/memory.c, pid 598 >> [__do_fault] line 3367 of file mm/memory.c, pid 598 >> [__do_fault] line 3395 of file mm/memory.c, pid 598 >> [__do_fault] line 3408 of file mm/memory.c, pid 598 >> [__do_fault] line 3425 of file mm/memory.c, pid 598 >> [__do_fault] line 3458 of file mm/memory.c, pid 598 >> [__do_fault] __do_fault for 0x400e9cc8 returning 512 >> [handle_page_fault] line 205 of file arch/um/kernel/trap.c, pid 598 >> [handle_page_fault] mm->mm_id: 673 >> [flush_tlb_page] Caller is handle_page_fault+0x7f5/0x957, pid 598 a.out >> 
[flush_tlb_page] mm->mm_id: 673 >> [handle_page_fault] page walk for 0x400e9cc8 >> [handle_page_fault] pte for 0x400e9cc8: 0x119e3748 >> [handle_page_fault] after handle_page_fault >> [print_mm_rss_stat] mm->rss_stat for mm id: 673 >> [print_mm_rss_stat] mm->rss_stat.count[0] = 1 >> [print_mm_rss_stat] mm->rss_stat.count[1] = 27 >> [print_mm_rss_stat] mm->rss_stat.count[2] = 0 >> >> [snip: richard's reply of Thu, Apr 11, quoted in full above] >> > |
From: Terry H. <ter...@gm...> - 2013-04-13 03:00:19
|
On Fri, Apr 12, 2013 at 1:14 AM, Terry Hsu <ter...@gm...> wrote: > okay so I looked into the faultinfo structure and was able to obtain the > faulting address, error code, and trap number(?). From my understanding the > error code is the bottom 3 bits of the exception code. But I see error code > "20" sometimes and do not what it means. > According to p.6-55 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3: System Programming Guide<http://download.intel.com/design/processor/manuals/253668.pdf>, the lower 5 bits of the page-fault error code are the Present, Read/Write, User/supervisor, RSVD, and Instruction/Data bits respectively. So error code 20 (binary 10100) means the fault was caused by an instruction fetch from a non-present page in user mode. I found the reason why the fault cannot be fixed by UML. It is probably because UML puts the faultinfo into the wrong stub: since I changed the vm area pointers of the child process, when the fault happens UML incorrectly finds the parent process's stub pages and puts the faultinfo there. Therefore, when the child process tries to access its own SKAS stub to fix the fault, it cannot find the correct instruction pointers, and the fault happens endlessly. Why does every process that runs in UML need its own stub for page fault handling? It seems to me they could have shared the SIGSEGV signal handler and the function that invokes mmap, munmap, and mprotect. That way, only two pages would be needed for all the processes. I am not sure if I understand the whole thing correctly. Please correct me if it's not right. Thanks! I am now looking at how the special mapping works with the host kernel. I > think this might lead me to the solution of my problem. It sounds like the > special mapping is not installed correctly so that the UML was not able to > fix the fault. 
> > [snip: earlier messages and debug log, quoted in full above] |
From: richard -r. w. <ric...@gm...> - 2013-04-13 09:22:59
|
On Sat, Apr 13, 2013 at 4:59 AM, Terry Hsu <ter...@gm...> wrote: > > On Fri, Apr 12, 2013 at 1:14 AM, Terry Hsu <ter...@gm...> wrote: >> >> okay so I looked into the faultinfo structure and was able to obtain the >> faulting address, error code, and trap number(?). From my understanding the >> error code is the bottom 3 bits of the exception code. But I see error code >> "20" sometimes and do not what it means. > > > According to p.6-55 in Intel® 64 and IA-32 Architectures Software > Developer’s Manual, Volume 3: System Programming Guide, the lower 5 bits are > Present, Read/Write, User/supervisor, RSVD, and Instruction/Data bit > respectively. So error code 20 means the fault is caused by an instruction > read to a non-present page in user mode. > > I found the the reason why the fault cannot be fixed by UML. It is probably > because UML puts the faultinfo in the wrong stub, since I changed the vm > area pointers of the child process, when the fault happens, UML incorrectly > finds its parent process's stub pages and puts the faultinfo in it. > Therefore when the child process tries to access its own skas stub and fix > the fault, it still cannot find the correct instruction pointers hence the > fault happens endlessly. Can you share an example with us that triggers the issue? > Why does every process that runs in UML need its own stub for page fault > handling? It seems to me they could've shared the SIGSEGV signal handler and > the function that invokes mmap, munmap, mprotect. In this way only two pages > are needed for all the processes. > > I am not sure if I understand the whole thing correctly. Please correct me > if it's not right. We need a stub per process because the stub installs a mapping into the process (on the host side). As mmap() always operates on current, we need a way to make the process call mmap() itself. The stub does this. -- Thanks, //richard |