From: Henrik N. <hn...@ma...> - 2003-11-22 13:10:03
|
We have found what looks like a UML-specific problem when running the iptables-restore command.

Environment:
  Linux-2.4.22 + current UML from CVS, iptables compiled as modules.
  iptables (tested with both 1.2.8 and current CVS).
  Host: RedHat 7.1, 7.3 and 9 with current updates.

To reproduce:
  - boot without any iptables loaded
  - load the iptables modules:
      iptables -t mangle -L
      iptables -t nat -L
      iptables -t filter -L
  - try running iptables-restore:
      iptables-save | iptables-restore

If no problem is seen, try adding a few rules or chains to the table and repeat the command, or try loading additional iptables tables such as filter and/or nat. In our tests this very reliably makes iptables-restore fail after at most a few attempts. In many cases a cycle of 4 is seen, where the command works three times and then fails once, or fails with one error three times and another error once.

This problem is only seen when using UML, not when running on "real" hardware, which makes me suspect the problem is to be found somewhere within UML. We have also seen occasions where the kernel copy of the iptables table (usually the nat table) gets corrupted in the few cases where iptables-restore runs without error, later causing a kernel panic inside iptables.

The problem only seems to occur when using iptables-restore to load a new iptables definition, not when using iptables to modify individual rules. But I cannot understand why, as the kernel operations are almost identical in both cases: iptables reads the current table from the kernel, modifies it and writes the new table back; iptables-restore compiles a new table from the source rules and writes it into the kernel.

Any hints on how to pinpoint whether this is a UML or an iptables problem would be appreciated.

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-11-23 23:37:24
|
On Sat, 22 Nov 2003, Henrik Nordstrom wrote:

> We have found what looks like a UML-specific problem when running the
> iptables-restore command.

Have been looking closely at this one, and from what I can tell everything indicates there is something within UML which corrupts the kernel memory allocated by iptables.

The culprit code from net/ipv4/netfilter/ip_tables.c:do_replace(), with comments:

	newinfo = vmalloc(sizeof(struct ipt_table_info)
			  + SMP_ALIGN(tmp.size) * smp_num_cpus);
	if (!newinfo)
		return -ENOMEM;

[The allocated size is verified to be correct]

	if (copy_from_user(newinfo->entries, user + sizeof(tmp),
			   tmp.size) != 0) {
		ret = -EFAULT;
		goto free_newinfo;
	}

[So far everything looks fine, and what gdb finds in newinfo->entries matches what looks like was sent from userspace. The sizes of the copied data have been verified, so there is no buffer overflow. newinfo->entries is correct even if it looks odd at first sight (an array, not a pointer).]

	counters = vmalloc(tmp.num_counters * sizeof(struct ipt_counters));
	if (!counters) {
		ret = -ENOMEM;
		goto free_newinfo;
	}

[Still everything looks fine, and the contents are identical to just after the copy_from_user() call above.]

	memset(counters, 0, tmp.num_counters * sizeof(struct ipt_counters));

[Here, randomly (not always), the copied data has been corrupted in semi-random ways. Sometimes this corruption can be seen before the call to translate_table(), sometimes early in the start of translate_table() before the memory in question is referenced. There seem to be only two or three different patterns of corruption, all spread out a little here and a little there in the memory area, but that pattern may also be an artefact of the repetitive nature of the test used to trigger the error.]

	ret = translate_table(tmp.name, tmp.valid_hooks,
			      newinfo, tmp.size, tmp.num_entries,
			      tmp.hook_entry, tmp.underflow);

I have tried using a watch expression on the data which gets corrupted, but GDB does not seem to detect any writes to this area. The memory just "magically" changes content. Using watch expressions on data changed within the relevant code flow works fine, however.

This is seen with Linux-2.4.22 + current UML CVS in TT mode, without any other kernel patches.

The only good thing is that I can very easily reproduce the problem within a few seconds, and also reliably detect when it happens by using a few breakpoints spread out along the code path, checksumming the relevant memory area for changes. The bad thing is that I am really out of options for what to look for next. To me it looks like my UML is simulating badly broken memory, but I am pretty sure there is some other magic messing around somewhere.. The exact same error is seen on multiple different hosts, so I am fully confident it is not caused by broken host memory.

Hmm.. I just found one odd thing which might hint at where the problem may be: if I break just after vmalloc() but before copy_from_user() then GDB can't read the memory area:

(gdb) x/1024b newinfo
0xa2819000:     Cannot access memory at address 0xa2819000

The same addresses just after copy_from_user() work just fine, however.

Regards
Henrik |
From: Adam H. <ad...@do...> - 2003-11-26 00:47:52
|
On Mon, 24 Nov 2003, Henrik Nordstrom wrote:

> The culprit code from net/ipv4/netfilter/ip_tables.c:do_replace() with
> comments:
>
>	newinfo = vmalloc(sizeof(struct ipt_table_info)
>			  + SMP_ALIGN(tmp.size) * smp_num_cpus);
>	if (!newinfo)
>		return -ENOMEM;
>
> [The allocated size is verified to be correct]
>
>	if (copy_from_user(newinfo->entries, user + sizeof(tmp),
>			   tmp.size) != 0) {
>		ret = -EFAULT;
>		goto free_newinfo;
>	}

Er, I don't see newinfo being initialized with anything. Did you exclude this in your email? If not, then newinfo->entries will randomly point to some other block of memory.

> Hmm.. I just found one odd thing which might hint to where the problem may
> be.. if I break just after vmalloc() but before copy_from_user() then GDB
> can't read the memory area:
>
> (gdb) x/1024b newinfo
> 0xa2819000: Cannot access memory at address 0xa2819000
>
> the same addresses just after copy_from_user() works just fine however..

This kind of error goes hand-in-hand with my observation above. |
From: Henrik N. <hn...@ma...> - 2003-11-26 01:42:27
|
On Tue, 25 Nov 2003, Adam Heath wrote:

> Er, I don't see newinfo being initialized with anything. Did you exclude
> this in your email?
>
> If not, then newinfo->entries will randomly point to some other block of
> memory.

As commented in my message, newinfo->entries is not a pointer: it is an array within newinfo, and this reference is correct even if it looks odd at first sight. The name of an array evaluates to the address of the array. Due to this oddity of the C language,

  newinfo->entries == &newinfo->entries == &newinfo->entries[0]

and in this case is the address 0x40 bytes into newinfo (0xa2819040 if newinfo is 0xa2819000).

> > (gdb) x/1024b newinfo
> > 0xa2819000: Cannot access memory at address 0xa2819000
> >
> > the same addresses just after copy_from_user() works just fine however..
>
> This kind of error goes hand-in-hand with my observation above.

I don't see the relation. Please elaborate. Please note that the command above is reading the memory area as returned by vmalloc(), not directly the newinfo->entries member. How I read the above is that the UML kernel has not yet set up the VM mapping of this newly vmalloc():ed area.

The real problem as I see it is that the content of the memory area changes suddenly without this kernel touching it, at least not in this code thread or in any other code thread visible to gdb. The exact place where this happens seems to be timing related.

What I suspect is that the kernel virtual memory area gets remapped somehow, causing it to suddenly point to another "physical" memory area, and due to the repetitive nature of my tests the contents of this other memory usually looks similar but not identical to the correct data.

It would be great if someone (Jeff?) could explain how the kernel virtual->physical memory mappings are managed in TT mode, and whether there are any such operations which are not visible to a GDB session attached via the ptproxy interface of tt mode UML kernels. This is still mostly magic to me.. Before looking into this I did not even know the kernel had virtual memory; I thought it was always using "physical" memory statically mapped, except when referring to user memory.

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-11-26 04:11:40
|
On Wed, 26 Nov 2003, Henrik Nordstrom wrote:

> What I suspect is that the kernel virtual memory area gets remapped
> somehow causing it to suddenly point to another "physical" memory area,
> and due to the repetitive nature of my tests the contents of this other
> memory usually looks similar but not identical to the correct data.

Have been digging a little, and indeed the problem is in this area.

The corruption always occurs when the memset of the next vmalloc() area executes, triggering a segv()->...->flush_tlb_kernel_vm_tt()->flush_kernel_vm_range() call sequence to set up the VM map of that other area.

Below is a trace of what happens in flush_tlb_kernel_vm_range() when the relevant address is reached and gets corrupted. $addr is a pointer to the memory area that gets corrupted:

97              for(addr = start; addr < end;){
1: $addr = 0xa2811040 ""
(gdb) p/x addr
$7 = 0xa2811000
(gdb) n
98                      pgd = pgd_offset(mm, addr);
1: $addr = 0xa2811040 ""
(gdb)
100                     if(pmd_present(*pmd)){
1: $addr = 0xa2811040 ""
(gdb)
101                             pte = pte_offset(pmd, addr);
1: $addr = 0xa2811040 ""
(gdb)
102                             if(!pte_present(*pte) || pte_newpage(*pte)){
1: $addr = 0xa2811040 ""
(gdb)
104                                     err = os_unmap_memory((void *) addr,
1: $addr = 0xa2811040 ""
(gdb)
106                                     if(err < 0)
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
103                                     updated = 1;
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
106                                     if(err < 0)
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
109                                     if(pte_present(*pte))
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
110                                             map_memory(addr,
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
113                             }
1: $addr = 0xa2811040 "p"
(gdb)
117                     }
1: $addr = 0xa2811040 "p"
(gdb)
118                     addr += PAGE_SIZE;

Here it can be clearly seen that the page for some reason gets remapped.

Note that the corrupted area was allocated earlier using vmalloc() and has already been mapped and filled with content by copy_from_user(). The call to flush_tlb_kernel_vm_tt() is due to a write to another vmalloc()-allocated area, allocated just after the corrupted area.

So a number of questions arise here:

a) Why the heck does it unmap the already mapped and used pages? This page was mapped earlier on a write fault by copy_from_user() and has not been freed yet.

b) What is the content it gets mapped to?

c) Why does it only fail some times and not always? I guess it often just happens to remap the page to the same backing location.

Full code listing of the failing section of flush_tlb_kernel_vm_range():

 98         pgd = pgd_offset(mm, addr);
 99         pmd = pmd_offset(pgd, addr);
100         if(pmd_present(*pmd)){
101                 pte = pte_offset(pmd, addr);
102                 if(!pte_present(*pte) || pte_newpage(*pte)){
103                         updated = 1;
104                         err = os_unmap_memory((void *) addr,
105                                               PAGE_SIZE);
106                         if(err < 0)
107                                 panic("munmap failed, errno = %d\n",
108                                       -err);
109                         if(pte_present(*pte))
110                                 map_memory(addr,
111                                            pte_val(*pte) & PAGE_MASK,
112                                            PAGE_SIZE, 1, 1, 1);
113                 }
114                 else if(pte_newprot(*pte)){
115                         updated = 1;
116                         protect_memory(addr, PAGE_SIZE, 1, 1, 1, 1);
117                 }
118                 addr += PAGE_SIZE;
119         }
120         else {
121                 if(pmd_newpage(*pmd)){
122                         updated = 1;
123                         err = os_unmap_memory((void *) addr, PMD_SIZE);
124                         if(err < 0)
125                                 panic("munmap failed, errno = %d\n",
126                                       -err);
127                 }
128                 addr += PMD_SIZE;
129         }

I can only assume the problem is that pte_newpage() is still true for the page in question.

The skas code looks identical, so probably the same problem exists there.

What is responsible for marking the kernel vm pages as up to date? Shouldn't there be pte_mkuptodate()/pmd_mkuptodate() calls just like there are in the same loop of fix_range()?

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-12-01 10:44:34
|
Can I please get a little help with debugging this one? I tend to get a little lost in all the page map translations.

I have got so far that it looks like the pte is not modified, making me suspect that there is an old page mapping around, initially maybe not matching the pte after the call to vmalloc().

Another question: why is the kernel vm always remapped in its entirety? Would it not be possible to remember the current kernel vm mapping state in the process page table somehow, avoiding having to continuously munmap()/mmap() the same data?

Regards
Henrik

On Wed, 26 Nov 2003, Henrik Nordstrom wrote:

> Have been digging a little and indeed the problem is in this area.
>
> The corruption always occurs when the memset of the next vmalloc() area
> executes, triggering a
> segv()->...->flush_tlb_kernel_vm_tt()->flush_kernel_vm_range() call
> sequence to set up the VM map of that other area.
>
> [gdb trace and full code listing of flush_tlb_kernel_vm_range() snipped]
>
> I can only assume the problem is that pte_newpage() still is true for the
> page in question.
>
> The skas code looks identical so probably the same problem is found there.
>
> What is responsible for marking the kernel vm pages as up to date?
> Shouldn't there be pte_mkuptodate/pmd_mkuptodate() calls just like there
> is in the same loop of fix_range()?
 |
From: Henrik N. <hn...@ma...> - 2003-12-02 04:14:32
|
On Mon, 1 Dec 2003, Henrik Nordstrom wrote:

> I have got so far that it looks like the pte is not modified making me
> suspect that there is a old page mapping around initially maybe not
> maching the pte after the call to vmalloc().

This has now been confirmed.

I instrumented the kernel with a few printk() calls, both in the iptables code (telling when the new table is vmalloc():ed, copied into, etc.) and in the mmap/munmap calls to the host. When the iptables command fails, there is no mmap of the newly vmalloc():ed area while the table is being copied from userspace, and the area only gets correctly mapped to the current pte value on the next tlb flush.

Now it remains to figure out why the mapping of the old page is still around and why the tlb mappings have not been synched.

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-12-02 03:57:20
Attachments:
tlb_flush.patch
|
On Tue, 2 Dec 2003, Henrik Nordstrom wrote:

> On Mon, 1 Dec 2003, Henrik Nordstrom wrote:
>
> > I have got so far that it looks like the pte is not modified making me
> > suspect that there is a old page mapping around initially maybe not
> > maching the pte after the call to vmalloc().
>
> This has now been confirmed.
>
> Now remains to figure out why the mapping of the old page is still
> around and why the tlb mappings has not been synched.

This too has now been identified, but I am not sure what the best way to fix it is.

vfree() calls flush_tlb_all(), but as this does not update the vm_seq number, the old mapping is still there unless there is another page fault before the page is referenced again. Because of this there is a race if vmalloc() returns the same area that was last vfree():d, causing that area to temporarily refer to the old physical location until the next kernel page fault, quickly resulting in very odd results.

What I wonder is if this can be fixed without dropping the vm_seq optimization of tt kernel virtual memory updates. But from looking at the skas implementation I suppose dropping the vm_seq optimization is the correct way (skas does not have this optimization and should thus be safe from the issue).

I have now removed this vm_seq optimization from my kernel sources and have verified that the iptables-restore problem is nowhere to be seen with the optimization disabled. [silly patch attached just to illustrate the problem area]

What was the design thought behind the vm_seq optimization? I understand the principle, but not the conditions under which it can be safely deduced that init_mm has not been updated since the last flush_kernel_vm_range().

I still think the kernel vm pte mappings should be mirrored into the current process, and flush_kernel_vm_range() changed to do incremental remaps where the kernel vm pte mappings of init_mm differ from the current process. This applies to both tt and skas mode. With the kernel vm area being very limited in size, such an optimization should not be very hard or costly to accomplish, and will by far outperform the vm_seq optimization of tt.. but my understanding of this is still a little limited, so it might well be the case that this is not worth bothering with.

Regards
Henrik |
From: Jeff D. <jd...@ad...> - 2003-12-02 16:30:45
|
On Tue, Dec 02, 2003 at 01:55:25AM +0100, Henrik Nordstrom wrote:

> vfree() calls flush_tlb_all(), but as this does not update the vm_seq
> number the old mapping is still there unless there is another page fault
> before the page is referenced again. Because of this there is a race if
> vmalloc() returns the same area that was last vfree():d causing that area
> to temporarily refer to the old physical location until the next kernel
> page fault and quickly resulting in very odd results..

Sorry about the delay. I've been in Tokyo for the last week.

I haven't stared at the code yet, but your analysis looks reasonable.

> What I wonder is if this can be fixed without dropping the vm_seq
> optimization of tt kernel virtual memory updates. But from looking at the
> skas implementation I suppose dropping the vm_seq optimization is the
> correct way.. (skas does not have this optimization and should thus be
> safe from the issue)

Right, this is because of the kernel existing in multiple host address spaces in tt mode, and the kernel VM area needing to be kept in sync between them.

> What was the design thought behind the vm_seq optimization? I understand
> the principle, but not the conditions when it can be safely deduced that
> the init_mm has not been updated since the last flush_kernel_vm_range().

The idea is to avoid unnecessary mmapping and munmapping on context switches if the kernel VM area hasn't changed since the process last ran.

> I still think the kernel vm pte mappings should be mirrored into the
> current process and flush_kernel_vm_range() changed to do incremental
> remaps where the kernel vm pte mappings of init_mm differs from the
> current process.

How?

> This applies to both tt and skas mode.

No, it doesn't. Those two are completely different in this case.

				Jeff |
From: Henrik N. <hn...@ma...> - 2003-12-02 16:50:04
|
On Tue, 2 Dec 2003, Jeff Dike wrote:

> > This applies to both tt and skas mode.
>
> No, it doesn't. Those two are completely different in this case.

Not from what I can tell. From what I can see when staring at the code, the code path for flushing the tlb is the same, and the frequency at which this will happen is also the same: both completely remap the kernel vm area on a tlb flush. The only difference I see in this respect is that tt tries (but fails) to guess when the vm has not been changed and then does nothing.

How I propose this would be optimized differs slightly between tt and skas mode. What I propose is a page table mirror of the kernel vm area related to each host vm used by the UML. This page table mirror is simply an array of pte_t the size of the kernel vm area. Only if there is any bitwise difference between the pte in init_mm and the page table mirror is the kernel vm page remapped, with the page table mirror then updated with the current pte for that page. Doing a lot of munmap/mmap calls is quite expensive for the host.

But as this field of the kernel is relatively new to me, I suspect this kind of optimization is actually overkill in skas mode. I suppose the kernel vm area is meant to be unmapped when returning to user context, and in that case the number of tlb flushes compared to the total amount of remaps is very low.

Regards
Henrik |
From: Jeff D. <jd...@ad...> - 2003-12-02 18:46:19
|
On Tue, Dec 02, 2003 at 05:49:54PM +0100, Henrik Nordstrom wrote:

> Not from what I can tell.. from what I can see when staring at the code
> the code path for flushing the tlb is the same, and the frequency this
> will happen is also the same.. in both it completely remaps the kernel vm
> area on a tlb flush.

Once you're in the tlb flush code, you're right. I was talking about the problem you're seeing, which is very tt mode-specific.

> How I propose this would be optimized differs slightly in tt or skas mode.
> What I propose is a page table mirror of the kernel vm area related to
> each host vm used by the UML. This page table mirror is simply an array of
> pte_t to the size of the kernel vm area. And only if there is any bitwise
> difference between the pte in init_mm and the page table mirror is the
> kernel vm page remapped and the page table mirror updated with the current
> pte for that page.

I used to do that. The problem with it is that doing lookups and updates of the shadow page tables involves allocating memory, which can sleep. So, when you sleep inside schedule, you schedule again, and sleep in get_free_page again, and that gets ugly rather quickly.

> Doing a lot of munmap/mmap calls is quite expensive for the host.

Yup. This can be optimized somewhat by eliminating some of the munmap parts of the munmap/mmap pairs.

				Jeff |
From: Henrik N. <hn...@ma...> - 2003-12-02 21:37:55
|
On Tue, 2 Dec 2003, Jeff Dike wrote:

> I used to do that. The problem with it is that doing lookups and updates
> of the shadow page tables involves allocating memory, which can sleep. So,
> when you sleep inside schedule, you schedule again, and sleep in
> get_free_page again, and that gets ugly rather quickly.

Right.

> Yup. This can be optimized somewhat by eliminating some of the munmap
> parts of the munmap/mmap pairs.

According to my stats, about 99% of the kernel vm mmap/munmap calls are in vain, at least in TT mode, and this was observed with the vm_seq optimization enabled. But to optimize these away one needs to know what has changed.

What I also noticed was that there seem to be a lot of vm_seq updates even without any kernel vm area modifications (at least not any I could notice). So not only did this miss updates on vfree(), it also triggered a flush even when there were no updates.

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-12-02 18:39:09
|
On Tue, 2 Dec 2003, Jeff Dike wrote:

> Right, this is because of the kernel existing in multiple host address
> spaces on the host in tt mode, and the kernel VM area needing to be kept
> in sync between them.

I suppose this is done on every context switch? How is SMP handled?

> The idea is to avoid unnecessary mmapping and munmapping on context
> switches if the kernel VM area hasn't changed since the process last ran.

This I understood, but it seems the assumptions about how to detect whether the VM area has changed when vmalloc/vfree is used are false, at least in the case of vfree().

Regards
Henrik |