From: Henrik N. <hn...@ma...> - 2003-11-22 13:10:03
|
We have found what looks like a UML-specific problem when running the iptables-restore command.

Environment:
  Linux-2.4.22 + current UML from CVS, iptables compiled as modules.
  iptables (tested with both 1.2.8 and current CVS).
  Host: RedHat 7.1, 7.3 and 9 with current updates.

To reproduce:
  - boot without any iptables loaded
  - load the iptables modules:
      iptables -t mangle -L
      iptables -t nat -L
      iptables -t filter -L
  - try running iptables-restore:
      iptables-save | iptables-restore

If no problem is seen, try adding a few rules or chains to the table and repeat the command, or try loading additional iptables tables such as filter and/or nat. In our tests this very reliably makes iptables-restore fail after at most a few attempts. In many cases a cycle of 4 is seen, where the command works three times and then fails once, or fails with one error three times and another error once.

This problem is only seen when using UML, not when running on "real" hardware, which makes me suspect the problem is to be found somewhere within UML. We have also seen occasions where the kernel copy of the iptables table (usually the nat table) gets corrupted in the few cases where iptables-restore runs without error, later causing a kernel panic inside iptables.

The problem only seems to occur when using iptables-restore to load a new iptables definition, not when using iptables to modify individual rules. But I cannot understand why, as the kernel operations are almost identical in both cases: iptables reads the current table from the kernel, modifies it and writes the new table back; iptables-restore compiles a new table from the source rules and writes it into the kernel.

Any hints on how to pinpoint whether this is a UML or an iptables problem would be appreciated.

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-11-23 23:37:24
|
On Sat, 22 Nov 2003, Henrik Nordstrom wrote:

> We have found what looks like a UML-specific problem when running the
> iptables-restore command.

Have been looking closely at this one, and from what I can tell everything indicates there is something within UML which corrupts the kernel memory allocated by iptables.

The culprit code from net/ipv4/netfilter/ip_tables.c:do_replace(), with comments:

	newinfo = vmalloc(sizeof(struct ipt_table_info)
			  + SMP_ALIGN(tmp.size) * smp_num_cpus);
	if (!newinfo)
		return -ENOMEM;

[The allocated size is verified to be correct]

	if (copy_from_user(newinfo->entries, user + sizeof(tmp),
			   tmp.size) != 0) {
		ret = -EFAULT;
		goto free_newinfo;
	}

[So far everything looks fine, and what gdb finds in newinfo->entries matches what looks like was sent from userspace. The sizes of the copied data have been verified, so there is no buffer overflow. newinfo->entries is correct even if it looks odd at first sight (an array, not a pointer).]

	counters = vmalloc(tmp.num_counters * sizeof(struct ipt_counters));
	if (!counters) {
		ret = -ENOMEM;
		goto free_newinfo;
	}

[Still everything looks fine, and the contents are identical to just after the copy_from_user() call above.]

	memset(counters, 0, tmp.num_counters * sizeof(struct ipt_counters));

[Here, randomly (not always), the copied data has been corrupted in semi-random ways. Sometimes this corruption can be seen before the call to translate_table(), sometimes early in the start of translate_table() before the memory in question is referenced. There seem to be only two or three different patterns of corruption, all spread out a little here and a little there in the memory area, but that pattern may also be an artefact of the repetitive nature of the test used to trigger the error.]

	ret = translate_table(tmp.name, tmp.valid_hooks,
			      newinfo, tmp.size, tmp.num_entries,
			      tmp.hook_entry, tmp.underflow);

I have tried using a watch expression on the data which gets corrupted, but GDB does not seem to detect any writes to this area. The memory just "magically" changes content. Using watch expressions on data changed within the relevant code flow works fine, however.

This is seen with Linux-2.4.22 + current UML CVS in TT mode, without any other kernel patches.

The only good thing is that I can very easily reproduce the problem within a few seconds, and also reliably detect when it happens by using a few breakpoints spread out along the code path, checksumming the relevant memory area for changes. The bad thing is that I am really out of options for what to look for next. To me it looks like my UML is simulating badly broken memory, but I am pretty sure there is some other magic messing around somewhere.. The exact same error is seen on multiple different hosts, so I am fully confident it is not caused by broken host memory.

Hmm.. I just found one odd thing which might hint at where the problem may be: if I break just after vmalloc() but before copy_from_user() then GDB can't read the memory area:

(gdb) x/1024b newinfo
0xa2819000:     Cannot access memory at address 0xa2819000

The same addresses just after copy_from_user() work just fine, however.

Regards
Henrik |
From: Adam H. <ad...@do...> - 2003-11-26 00:47:52
|
On Mon, 24 Nov 2003, Henrik Nordstrom wrote:

> The culprit code from net/ipv4/netfilter/ip_tables.c:do_replace() with
> comments:
>
>	newinfo = vmalloc(sizeof(struct ipt_table_info)
>			  + SMP_ALIGN(tmp.size) * smp_num_cpus);
>	if (!newinfo)
>		return -ENOMEM;
>
> [The allocated size is verified to be correct]
>
>	if (copy_from_user(newinfo->entries, user + sizeof(tmp),
>			   tmp.size) != 0) {
>		ret = -EFAULT;
>		goto free_newinfo;
>	}

Er, I don't see newinfo being initialized with anything. Did you exclude this in your email? If not, then newinfo->entries will randomly point to some other block of memory.

> Hmm.. I just found one odd thing which might hint to where the problem may
> be.. if I break just after vmalloc() but before copy_from_user() then GDB
> can't read the memory area:
>
> (gdb) x/1024b newinfo
> 0xa2819000: Cannot access memory at address 0xa2819000
>
> the same addresses just after copy_from_user() works just fine however..

This kind of error goes hand-in-hand with my observation above. |
From: Henrik N. <hn...@ma...> - 2003-11-26 01:42:27
|
On Tue, 25 Nov 2003, Adam Heath wrote:

> Er, I don't see newinfo being initialized with anything. Did you exclude
> this in your email?
>
> If not, then newinfo->entries will randomly point to some other block of
> memory.

As commented in my message, newinfo->entries is not a pointer: it is an array within newinfo, and this reference is correct even if it looks odd at first sight. The name of an array evaluates to the address of the array. Due to this oddity of the C language,

  newinfo->entries == &newinfo->entries == &newinfo->entries[0]

and in this case is the address 0x40 bytes into newinfo (0xa2819040 if newinfo is 0xa2819000).

> > (gdb) x/1024b newinfo
> > 0xa2819000: Cannot access memory at address 0xa2819000
> >
> > the same addresses just after copy_from_user() works just fine however..
>
> This kind of error goes hand-in-hand with my observation above.

I don't see the relation. Please elaborate. Please note that the command above is reading the memory area as returned by vmalloc(), not directly the newinfo->entries member. How I read the above is that the UML kernel has not yet set up the VM mapping of this newly vmalloc():ed area.

The real problem as I see it is that the content of the memory area changes suddenly without this kernel touching it, at least not in this code thread or in any other code thread visible to gdb. The exact place where this happens seems to be timing related.

What I suspect is that the kernel virtual memory area gets remapped somehow, causing it to suddenly point to another "physical" memory area, and due to the repetitive nature of my tests the contents of this other memory usually looks similar but not identical to the correct data.

It would be great if someone (Jeff?) could explain how the kernel virtual->physical memory mappings are managed in TT mode, and whether there are any such operations which are not visible to a GDB session attached via the ptproxy interface of tt mode UML kernels. This is still mostly magic to me.. Before looking into this I did not even know the kernel had virtual memory; I thought it was always using "physical" memory statically mapped, except when referring to user memory.

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-11-26 04:11:40
|
On Wed, 26 Nov 2003, Henrik Nordstrom wrote:

> What I suspect is that the kernel virtual memory area gets remapped
> somehow causing it to suddenly point to another "physical" memory area,
> and due to the repetitive nature of my tests the contents of this other
> memory usually looks similar but not identical to the correct data.

Have been digging a little, and indeed the problem is in this area.

The corruption always occurs when the memset of the next vmalloc() area executes, triggering a segv()->...->flush_tlb_kernel_vm_tt()->flush_kernel_vm_range() call sequence to set up the VM map of that other area.

Below is a trace of what happens in flush_tlb_kernel_vm_range() when the relevant address is reached and gets corrupted. $addr is a pointer to the memory area that gets corrupted:

97              for(addr = start; addr < end;){
1: $addr = 0xa2811040 ""
(gdb) p/x addr
$7 = 0xa2811000
(gdb) n
98                      pgd = pgd_offset(mm, addr);
1: $addr = 0xa2811040 ""
(gdb)
100                     if(pmd_present(*pmd)){
1: $addr = 0xa2811040 ""
(gdb)
101                             pte = pte_offset(pmd, addr);
1: $addr = 0xa2811040 ""
(gdb)
102                             if(!pte_present(*pte) || pte_newpage(*pte)){
1: $addr = 0xa2811040 ""
(gdb)
104                                     err = os_unmap_memory((void *) addr,
1: $addr = 0xa2811040 ""
(gdb)
106                                     if(err < 0)
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
103                                     updated = 1;
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
106                                     if(err < 0)
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
109                                     if(pte_present(*pte))
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
110                                             map_memory(addr,
1: $addr = 0xa2811040 <Address 0xa2811040 out of bounds>
(gdb)
113                             }
1: $addr = 0xa2811040 "p"
(gdb)
117                     }
1: $addr = 0xa2811040 "p"
(gdb)
118                     addr += PAGE_SIZE;

Here it can be clearly seen that the page for some reason gets remapped.

Note that the corrupted area was allocated earlier using vmalloc() and has already been mapped and filled with content by copy_from_user(). The call to flush_tlb_kernel_vm_tt() is due to a write to another vmalloc()-allocated area, allocated just after the corrupted area.

So a number of questions arise here:

a) Why the heck does it unmap the already mapped and used pages? This page was mapped earlier on a write fault by copy_from_user() and has not been freed yet.

b) What is the content it gets mapped to?

c) Why does it only fail some times and not always? I guess it often just happens to remap the page to the same backing location.

Full code listing of the failing section of flush_tlb_kernel_vm_range():

 98         pgd = pgd_offset(mm, addr);
 99         pmd = pmd_offset(pgd, addr);
100         if(pmd_present(*pmd)){
101                 pte = pte_offset(pmd, addr);
102                 if(!pte_present(*pte) || pte_newpage(*pte)){
103                         updated = 1;
104                         err = os_unmap_memory((void *) addr,
105                                               PAGE_SIZE);
106                         if(err < 0)
107                                 panic("munmap failed, errno = %d\n",
108                                       -err);
109                         if(pte_present(*pte))
110                                 map_memory(addr,
111                                            pte_val(*pte) & PAGE_MASK,
112                                            PAGE_SIZE, 1, 1, 1);
113                 }
114                 else if(pte_newprot(*pte)){
115                         updated = 1;
116                         protect_memory(addr, PAGE_SIZE, 1, 1, 1, 1);
117                 }
118                 addr += PAGE_SIZE;
119         }
120         else {
121                 if(pmd_newpage(*pmd)){
122                         updated = 1;
123                         err = os_unmap_memory((void *) addr, PMD_SIZE);
124                         if(err < 0)
125                                 panic("munmap failed, errno = %d\n",
126                                       -err);
127                 }
128                 addr += PMD_SIZE;
129         }

I can only assume the problem is that pte_newpage() is still true for the page in question.

The skas code looks identical, so probably the same problem exists there.

What is responsible for marking the kernel vm pages as up to date? Shouldn't there be pte_mkuptodate()/pmd_mkuptodate() calls just like there are in the same loop of fix_range()?

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-12-01 10:44:34
|
Can I please get a little help with debugging this one? I tend to get a little lost in all the page map translations.

I have got so far that it looks like the pte is not modified, making me suspect that there is an old page mapping around, initially maybe not matching the pte after the call to vmalloc().

Another question: why is the kernel vm always remapped in its entirety? Would it not be possible to remember the current kernel vm mapping state in the process page table somehow, avoiding having to continuously munmap()/mmap() the same data?

Regards
Henrik

On Wed, 26 Nov 2003, Henrik Nordstrom wrote:

> Have been digging a little and indeed the problem is in this area.
>
> The corruption always occurs when the memset of the next vmalloc() area
> executes, triggering a
> segv()->...->flush_tlb_kernel_vm_tt()->flush_kernel_vm_range() call
> sequence to set up the VM map of that other area.
>
> [gdb trace and full code listing of flush_tlb_kernel_vm_range() snipped]
>
> I can only assume the problem is that pte_newpage() still is true for the
> page in question.
>
> The skas code looks identical so probably the same problem is found there.
>
> What is responsible for marking the kernel vm pages as up to date?
> Shouldn't there be pte_mkuptodate/pmd_mkuptodate() calls just like there
> is in the same loop of fix_range()?
 |
From: Henrik N. <hn...@ma...> - 2003-12-02 04:14:32
|
On Mon, 1 Dec 2003, Henrik Nordstrom wrote:

> I have got so far that it looks like the pte is not modified making me
> suspect that there is a old page mapping around initially maybe not
> maching the pte after the call to vmalloc().

This has now been confirmed.

I instrumented the kernel with a few printk() calls, both in the iptables code (telling when the new table is vmalloc():ed, copied into, etc.) and in the mmap/munmap calls to the host. When the iptables command fails, there is no mmap of the newly vmalloc():ed area while the table is being copied from userspace, and the area only gets correctly mapped to the current pte value on the next tlb flush.

Now it remains to figure out why the mapping of the old page is still around and why the tlb mappings have not been synched.

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-12-02 03:57:20
Attachments:
tlb_flush.patch
|
On Tue, 2 Dec 2003, Henrik Nordstrom wrote:

> On Mon, 1 Dec 2003, Henrik Nordstrom wrote:
>
> > I have got so far that it looks like the pte is not modified making me
> > suspect that there is a old page mapping around initially maybe not
> > maching the pte after the call to vmalloc().
>
> This has now been confirmed.
>
> Now remains to figure out why the mapping of the old page is still
> around and why the tlb mappings has not been synched.

This too has now been identified, but I am not sure what the best way to fix it is.

vfree() calls flush_tlb_all(), but as this does not update the vm_seq number, the old mapping is still there unless there is another page fault before the page is referenced again. Because of this there is a race if vmalloc() returns the same area that was last vfree():d, causing that area to temporarily refer to the old physical location until the next kernel page fault, quickly resulting in very odd results.

What I wonder is if this can be fixed without dropping the vm_seq optimization of tt kernel virtual memory updates. But from looking at the skas implementation I suppose dropping the vm_seq optimization is the correct way (skas does not have this optimization and should thus be safe from the issue).

I have now removed this vm_seq optimization from my kernel sources and have verified that the iptables-restore problem is nowhere to be seen with the optimization disabled. [silly patch attached just to illustrate the problem area]

What was the design thought behind the vm_seq optimization? I understand the principle, but not the conditions under which it can be safely deduced that init_mm has not been updated since the last flush_kernel_vm_range().

I still think the kernel vm pte mappings should be mirrored into the current process, and flush_kernel_vm_range() changed to do incremental remaps where the kernel vm pte mappings of init_mm differ from the current process. This applies to both tt and skas mode. With the kernel vm area being very limited in size, such an optimization should not be very hard or costly to accomplish, and will by far outperform the vm_seq optimization of tt.. but my understanding of this is still a little limited, so it might well be the case that this is not worth bothering with.

Regards
Henrik |
From: Jeff D. <jd...@ad...> - 2003-12-02 16:30:45
|
On Tue, Dec 02, 2003 at 01:55:25AM +0100, Henrik Nordstrom wrote:

> vfree() calls flush_tlb_all(), but as this does not update the vm_seq
> number the old mapping is still there unless there is another page fault
> before the page is referenced again. Because of this there is a race if
> vmalloc() returns the same area that was last vfree():d causing that area
> to temporarily refer to the old physical location until the next kernel
> page fault and quickly resulting in very odd results..

Sorry about the delay. I've been in Tokyo for the last week.

I haven't stared at the code yet, but your analysis looks reasonable.

> What I wonder is if this can be fixed without dropping the vm_seq
> optimization of tt kernel virtual memory updates. But from looking at the
> skas implementation I suppose dropping the vm_seq optimization is the
> correct way.. (skas does not have this optimization and should thus be
> safe from the issue)

Right, this is because of the kernel existing in multiple host address spaces in tt mode, and the kernel VM area needing to be kept in sync between them.

> What was the design thought behind the vm_seq optimization? I understand
> the principle, but not the conditions when it can be safely deduced that
> the init_mm has not been updated since the last flush_kernel_vm_range().

The idea is to avoid unnecessary mmapping and munmapping on context switches if the kernel VM area hasn't changed since the process last ran.

> I still think the kernel vm pte mappings should be mirrored into the
> current process and flush_kernel_vm_range() changed to do incremental
> remaps where the kernel vm pte mappings of init_mm differs from the
> current process.

How?

> This applies to both tt and skas mode.

No, it doesn't. Those two are completely different in this case.

				Jeff |
From: Henrik N. <hn...@ma...> - 2003-12-02 16:50:04
|
On Tue, 2 Dec 2003, Jeff Dike wrote:

> > This applies to both tt and skas mode.
>
> No, it doesn't. Those two are completely different in this case.

Not from what I can tell. From what I can see when staring at the code, the code path for flushing the tlb is the same, and the frequency at which this will happen is also the same: both completely remap the kernel vm area on a tlb flush. The only difference I see in this respect is that tt tries (but fails) to guess when the vm has not been changed and then does nothing.

How I propose this would be optimized differs slightly between tt and skas mode. What I propose is a page table mirror of the kernel vm area related to each host vm used by the UML. This page table mirror is simply an array of pte_t the size of the kernel vm area. Only if there is any bitwise difference between the pte in init_mm and the page table mirror is the kernel vm page remapped, with the page table mirror then updated with the current pte for that page. Doing a lot of munmap/mmap calls is quite expensive for the host.

But as this field of the kernel is relatively new to me, I suspect this kind of optimization is actually overkill in skas mode. I suppose the kernel vm area is meant to be unmapped when returning to user context, and in that case the number of tlb flushes compared to the total amount of remaps is very low.

Regards
Henrik |
From: Jeff D. <jd...@ad...> - 2003-12-02 18:46:19
|
On Tue, Dec 02, 2003 at 05:49:54PM +0100, Henrik Nordstrom wrote:

> Not from what I can tell.. from what I can see when staring at the code
> the code path for flushing the tlb is the same, and the frequency this
> will happen is also the same.. in both it completely remaps the kernel vm
> area on a tlb flush.

Once you're in the tlb flush code, you're right. I was talking about the problem you're seeing, which is very tt mode-specific.

> How I propose this would be optimized differs slightly in tt or skas mode.
> What I propose is a page table mirror of the kernel vm area related to
> each host vm used by the UML. This page table mirror is simply an array of
> pte_t to the size of the kernel vm area. And only if there is any bitwise
> difference between the pte in init_mm and the page table mirror is the
> kernel vm page remapped and the page table mirror updated with the current
> pte for that page.

I used to do that. The problem with it is that doing lookups and updates of the shadow page tables involves allocating memory, which can sleep. So, when you sleep inside schedule, you schedule again, and sleep in get_free_page again, and that gets ugly rather quickly.

> Doing a lot of munmap/mmap calls is quite expensive for the host.

Yup. This can be optimized somewhat by eliminating some of the munmap parts of the munmap/mmap pairs.

				Jeff |
From: Henrik N. <hn...@ma...> - 2003-12-02 21:37:55
|
On Tue, 2 Dec 2003, Jeff Dike wrote:

> I used to do that. The problem with it is that doing lookups and updates
> of the shadow page tables involves allocating memory, which can sleep. So,
> when you sleep inside schedule, you schedule again, and sleep in
> get_free_page again, and that gets ugly rather quickly.

Right.

> Yup. This can be optimized somewhat by eliminating some of the munmap
> parts of the munmap/mmap pairs.

According to my stats, about 99% of the kernel vm mmap/munmap calls are in vain, at least in TT mode, and this was observed with the vm_seq optimization enabled. But to optimize these away one needs to know what has changed.

What I also noticed was that there seem to be a lot of vm_seq updates even without any kernel vm area modifications (at least not any I could notice). So not only did this miss updates on vfree(), it also triggered a flush even when there were no updates.

Regards
Henrik |
From: Henrik N. <hn...@ma...> - 2003-12-02 18:39:09
|
On Tue, 2 Dec 2003, Jeff Dike wrote:

> Right, this is because of the kernel existing in multiple host address
> spaces on the host in tt mode, and the kernel VM area needing to be kept
> in sync between them.

I suppose this is done on every context switch? How is SMP handled?

> The idea is to avoid unnecessary mmapping and munmapping on context
> switches if the kernel VM area hasn't changed since the process last ran.

This I understood, but it seems the assumptions about how to detect whether the VM area has changed when vmalloc/vfree is used are false, at least in the case of vfree().

Regards
Henrik |