From: Blaisorblade <bla...@ya...> - 2005-08-26 18:30:34
Attachments:
fremap-test-complete.c.bz2
|
This is a followup to my post of last week (Aug 12) about remap_file_pages protection support. I've improved and consolidated the patches and updated them against 2.6.13-rc6/rc7 (the same patches apply against both versions). I'm sending the full patch series only to akpm, mingo and LKML. I've also reduced them to only 18, and made the splitting more significant. I'm not resending all the patches for foreign architectures, because they're almost unchanged since last time (there's just a trivial reject from ppc32, because one change has already been done after -rc4). I'm working on this to provide support for UML, which currently easily creates more than 64K (the default limit) vma's for a single process. Actually, it needs one VMA per each page. So, with this patch and specific UML support, which Ingo wrote and which I'm porting to recent UMLs. Some highlights: * The first 2 patches modify the PTE encoding macros and start preparing the VM for the new situation (i.e. VMA which have variable protections, which are called VM_NONUNIFORM. I dropped the early VM_MANYPROTS name). Patch number 2 will require fixing up all arches like in 2.6.4-rc2-mm1, to provide the new PTE encoding macros. * Patch 5 allows the syscall to actually create such VMAs. Before that, there's no difference in behaviour with the current kernel (except that there's less space for file offset encoding in PTEs). And even here, the new operations are only enabled for arch explicitly supporting it (see patch #7). * Patch 8 and 9 change the path for handling page faults, since the permission checking on nonuniform vmas cannot be done until the PTE entry has been read. This is the most intrusive part, but a) archs are not required to adequate to this immediately b) it isn't so difficult in practice. * Patch 11 is a big simplification. Since we must encode the PTE's on swapout like in VM_NONLINEAR vmas, the simplest way to reuse the existing code is to make sure that VM_NONUNIFORM vmas are also marked as VM_NONLINEAR. It is possible to avoid this, as in patch #18, but it's just a bit scary, and Then there are 4 optimization patches and 3 fixups for some odd cases that we maybe won't support. They are namely: *) vmas with default PROT_NONE protection (I actually feel we're going to support this, the only patch which has problems is an optimization) *) MAP_POPULATE on private VMA (no problem on this) and consequently remap_file_pages on private VMA to install linear uniform mappings (since MAP_POPULATE is implemented in terms of remap_file_pages): there's a patch to stop this from truncating COW pages away, but I don't think it's worth it. *) linear nonuniform vmas. I initially created them because there's no relation between being nonlinear and nonuniform, but it later turned out supporting them is intrusive. I have improved even more the patches, and understood better some changes from Ingo which I didn't last time, and fixed their bugs. I hope these changes can be reviewed, and included inside -mm, even if they'll conflict with pagefault scalability patches (even if I think the conflicts are not difficult to solve). Still, the patch is IMHO in better shape, in many ways, than when it was in -mm last time. To handle properly all possibilities it has become a bit more intrusive. The original one was designed to handle only the simpler needs of UML (an mmap'ing with PROT_NONE followed by nonlinear and nonuniform remappings), but it still failed in some cases. I've taken original Ingo's test-program and significantly extended it, it's attached to this patch. I'll appreciate any comments. ============== Changes from 2.6.5-mm1/dropped version of the patches: ============== *) Actually implemented _real_ and _anal_ protection support, safe against swapout; programs get SIGSEGV *always* when they should. I've used the attached test program (an improved version of Ingo's one) to check that. I tested just until patch 25, onto UML. The subsequent ones are either patches for foreign archs or proposed *) Fixed many changes present in the patches. *) Fixed UML bits *) Added some headaches for arches ports. I've also included some patches which reduce this. *) No more usage of a new syscall slot: to use the new interface, application will use the new MAP_NOINHERIT flag I've added. I've still the patches to use the old -mm ABI, if there's any reason they're needed. *) Fixed a regression wrt using mprotect() against remapped area (see patch 15) ====== Changes from my last patch-bomb of the patches: ====== *) fixed mprotect VS remap_file_pages(MAP_NOINHERIT) interaction *) fixed truncation (with madvise_dontneed or truncate()) of nonuniform but linear vmas. Either with patch 11, by removing "nonuniform but linear VMAs", or with patch 18. ====== Still todo ====== *) ->populate flushes each TLB individually, instead of using mmu_gathers as it should; this was suggested even by Ingo when sending the patch, but it seems he didn't get the time to finish this. And I'm now wondering how would that relate with I/O... at each I/O point we should finish and regather the mmu_gather, as in zap_page_range. But here we are reading pages, not the reverse! Seems rewriting the kernel locking is a quite time-consuming task! -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade |
From: Hugh D. <hu...@ve...> - 2005-08-26 19:09:34
|
On Fri, 26 Aug 2005, Blaisorblade wrote: > This is a followup to my post of last week (Aug 12) about remap_file_pages > protection support. I've improved and consolidated the patches and updated > them against 2.6.13-rc6/rc7 (the same patches apply against both versions). > I'm sending the full patch series only to akpm, mingo and LKML. > > I've also reduced them to only 18, and made the splitting more significant. > I'm not resending all the patches for foreign architectures, because they're > almost unchanged since last time (there's just a trivial reject from ppc32, > because one change has already been done after -rc4). > > I'm working on this to provide support for UML, which currently easily creates > more than 64K (the default limit) vma's for a single process. Actually, it > needs one VMA per each page. So, with this patch and specific UML support, > which Ingo wrote and which I'm porting to recent UMLs. I'll try to take a look sometime next week - or, if I wait until next Friday, can we expect it to have come down to 9 patches ;-? I should say, my initial reaction is very much like Andi's last week. sys_remap_file_pages solves a real problem, but it does so by breaking lots of rules. For more than a year after it came in, almost every development we tried in mm would come up against "but then what do we do about the nonlinear mappings?". That has settled down now, but I don't look forward to extending it. On the other hand, UML does deserve better support. Hugh |
From: Blaisorblade <bla...@ya...> - 2005-08-26 20:02:21
|
On Friday 26 August 2005 21:11, Hugh Dickins wrote: > On Fri, 26 Aug 2005, Blaisorblade wrote: > > This is a followup to my post of last week (Aug 12) about > > remap_file_pages protection support. I've improved and consolidated the > > patches and updated them against 2.6.13-rc6/rc7 (the same patches apply > > against both versions). I'm sending the full patch series only to akpm, > > mingo and LKML. > > > > I've also reduced them to only 18, and made the splitting more > > significant. I'm not resending all the patches for foreign architectures, > > because they're almost unchanged since last time (there's just a trivial > > reject from ppc32, because one change has already been done after -rc4). > > > > I'm working on this to provide support for UML, which currently easily > > creates more than 64K (the default limit) vma's for a single process. > > Actually, it needs one VMA per each page. So, with this patch and > > specific UML support, which Ingo wrote and which I'm porting to recent > > UMLs. > > I'll try to take a look sometime next week - or, if I wait until > next Friday, can we expect it to have come down to 9 patches ;-? :-) Don't think so, unless you want just me to join patches together. However, there are still some oneliners, so the patchset is not so huge. Well, diffstat seems to contradict me (on the joined-up patch): Documentation/feature-removal-schedule.txt | 12 + arch/i386/mm/fault.c | 19 ++ arch/um/kernel/trap_kern.c | 52 ++++++- arch/x86_64/mm/fault.c | 6 include/asm-i386/mman.h | 1 include/asm-i386/pgtable-2level.h | 15 +- include/asm-i386/pgtable-3level.h | 11 + include/asm-i386/pgtable.h | 3 include/asm-ia64/mman.h | 1 include/asm-ppc/mman.h | 1 include/asm-ppc64/mman.h | 1 include/asm-s390/mman.h | 1 include/asm-um/pgtable-2level.h | 15 +- include/asm-um/pgtable-3level.h | 21 ++- include/asm-um/pgtable.h | 3 include/asm-x86_64/mman.h | 1 include/asm-x86_64/pgtable.h | 12 + include/linux/mm.h | 40 ++++-- include/linux/pagemap.h | 32 ++++ mm/filemap.c | 18 ++ mm/fremap.c | 135 +++++++++++++++----- mm/madvise.c | 2 mm/memory.c | 193 ++++++++++++++++++++++------- mm/mmap.c | 14 +- mm/mprotect.c | 3 mm/rmap.c | 6 mm/shmem.c | 13 + > I should say, my initial reaction is very much like Andi's last week. > sys_remap_file_pages solves a real problem, but it does so by breaking > lots of rules. For more than a year after it came in, almost every > development we tried in mm would come up against "but then what do we > do about the nonlinear mappings?". Nonuniform mappings are much less of a problem. Really. The reason nonlinear mappings reached mainline before nonuniform ones (and I don't know if they willl ever) is not simplicity, but the miss of uses until now. And also the fact that Ingo hadn't the time to finish it. In fact, I've been playing a lot with the GIT history during this month of development, particularly with objrmap, so I've come across those problems quite a lot, but what I noticed is that you mostly don't care about the VMA to be uniform, just it to be linear or not (because nonlinear VMAs break objrmap). This patch, in comparison, is just: *) check permissions in the generic VM when faulting in pages, if and only if the vma is nonuniform (yes, nontrivial at all). *) be anally picky to save the PTE protections together with the rest, and do it *everywhere*; but if you say "nonuniform implies linear", you lose this problem almost completely. The only exception is that when you remap an address range with PROT_NONE (thus effectively unmapping those addressed), you can't clear the PTEs but you must use pfn_pte(0, __S000) to fill them in (this is done in the optimization-fixup patch #15). > That has settled down now, but I don't look forward to extending it. > On the other hand, UML does deserve better support. > Hugh -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB http://mail.yahoo.it |
From: Hugh D. <hu...@ve...> - 2005-09-02 21:00:28
|
On Fri, 26 Aug 2005, Blaisorblade wrote: > > * The first 2 patches modify the PTE encoding macros and start preparing the > VM for the new situation (i.e. VMA which have variable protections, which are > called VM_NONUNIFORM. I dropped the early VM_MANYPROTS name). What a pity: please revert. VM_NONUNIFORM sounds impressive, but might mean all kinds of things, maybe to do with NUMA. VM_MANYPROTS is good, it says what it means. > * Patch 11 is a big simplification. Since we must encode the PTE's on swapout > like in VM_NONLINEAR vmas, the simplest way to reuse the existing code is to > make sure that VM_NONUNIFORM vmas are also marked as VM_NONLINEAR. In some places you seem to say that you (UML) only need VM_MANYPROTS vmas linear, in other places you seem to say that your VM_MANYPROTS vmas will be nonlinear. I've no idea which way round it is. Perhaps the "non" sometimes goes missing (another reason to avoid NONUNIFORM). I wrote that yesterday. Thanks, you've cleared it up in private mail: the VM_MANYPROTS vmas that UML wants are VM_NONLINEAR anyway. Yes, I see your dilemma there: you rightly want to avoid adding bloat by distinguishing cases that you don't need distinguished; but equally rightly fear that someone somewhere will start using the VM_MANYPROTS for other reasons, and hit the inefficiencies of VM_NONLINEAR unnecessarily. I share your uncertainty, I don't have an immediate feel for the right direction on that. > *) No more usage of a new syscall slot: to use the new interface, application > will use the new MAP_NOINHERIT flag I've added. I've still the patches to use > the old -mm ABI, if there's any reason they're needed. I'm glad you've scrapped the new syscall slot, that really put me off the old patch (though I was probably being silly about it). This way is much better, but again I quarrel with your naming. "Inherit" is about parents and children, this is not; and furthermore, some UNIXes had a MAP_INHERIT (see asm-alpha/mman.h) which was about passing an mmap across exec. Your MAP_NOINHERIT has nothing to do with that. MAP_MANYPROTS would help us to follow the trail more easily (though it's true that you can't actually pass many prots in to a single remap_file_pages call). > Subject: [patch 01/18] remap_file_pages protection support: uml, i386, x64 bits > > Update pte encoding macros for UML, i386 and x86-64. > Also, add the MAP_NOINHERIT flag to arch headers. Well, I don't find your patch division very helpful, since you introduce these without us seeing what use is made of them. And the MAP_NOINHERIT additions cover a different subset of arches (ppc, ppc64, s390 in there): those should be in some other patch. Usually we just do the i386 arch first, and supply some other patch(es) for all the others. But you've good reason to start with UML too, and it makes sense to include x86_64 along too if you're happy to do so. But it'll probably waste your time and mine to go on discussng patch division, let's leave it at that. > *** remap_file_pages protection support: improvement for UML bits > > Recover one bit by additionally using _PAGE_NEWPROT. Since I wasn't sure this > would work, I've split this out, but it has worked well. We rely on the fact > that pte_newprot always checks first if the PTE is marked present. And we never hear of _PAGE_NEWPROT or pte_newprot again. Ah, they're already defined in and peculiar to UML, I see. Well, if this some UML improvement change, please put that in a separate UML patch. > -#define pte_to_pgoff(pte) (pte_val(pte) >> 4) > +#define pte_to_pgoff(pte) (((pte_val(pte) >> 6) << 1) | ((pte_val(pte) >> 2) & 0x1)) > +#define pte_to_pgprot(pte) \ > + __pgprot((pte_val(pte) & (_PAGE_RW | _PAGE_PROTNONE)) \ > + | ((pte_val(pte) & _PAGE_PROTNONE) ? 0 : \ > + (_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED) It took me a long time to get past this! But it's not your fault at all, you are just tweaking what's there. In the end I decided I'd do better to take it on trust and move along. (I realize that PROTNONE is a specially awkward case, but I keep feeling that we make it even more awkward than it need be. But again, if so, that's not your fault. And I've never been forced to think it through here, as you have.) > -#define PTE_FILE_MAX_BITS 29 > +#define PTE_FILE_MAX_BITS 28 That being the i386 2-level case. I think that one less bit there is probably okay, since wli changed the 3-level case to use all 32 bits of the high word. The people needing 29 file offset bits are probably already using PAE, and can be directed to it if not. But there may be other 32-bit arches, without a 3-level alternative, which are impacted. (I don't think I'm telling you anything you don't already know, just highlighting a point for others to comment.) > +#define MAP_NOINHERIT 0x20000 /* don't inherit the protection bits of the underlying vma*/ As above, MAP_MANYPROTS or something better please. And please make it clear in the comment that it's a flag to remap_file_pages and nothing else. > Subject: [patch 02/18] remap_file_pages protection support: handle nonuniform VMAs > > * (TODO?) avoid regression in max file offset with r_f_p() for older mappings; > we could use either the old offset encoding or the new offset-prot encoding > depending on this flag. > It's trivial to do, just I don't know whether existing apps will overflow > the new limits. They go down from 2Tb to 1Tb on i386 and 512G on PPC, and > from 256G to 128G on S390/31 bits. Give me a call in case. Spare me! That would make those macros even worse. I hope noone will demand it. > mprotect alters the VMA prots and walks each present PTE, ignoring installed > ones; their saved prots will be restored on faults, ignoring VMA ones and > losing the mprotect() on them. So, we must restore VMA prots when the VMA is > uniform, as we used to do. If I understand it, you're here commenting on the history of these remap_file_pages patches. That's okay in the 0/18 introducing a new version, but please, in the comment on the patch itself, explain what it's doing, not how you've been changing it around. Oh, mm/mprotect.c doesn't even come in this patch, you're writing about the next one. > + pgprot = vma->vm_flags & VM_NONUNIFORM ? pte_to_pgprot(*pte): vma->vm_page_prot; Personally I find "(vma->vm_flags & VM_NONUNIFORM)" more reassuring there. > Subject: [patch 03/18] remap_file_pages protection support: make mprotect skip pagetables on nonuniform > > There is IMHO no reason to support using mprotect on non-uniform VMAs. The > only exception is to change the VMA's default protection (which is used for > non-individually remapped pages), but it must still ignore the page tables, as > done in this patch. > > The only unsatisfied need is if I want to change protections without changing > the indexes, which with remap_file_pages you must do one page at a time and > re-specifying the indexes. > > It is more reasonable to allow remap_file_pages to change protections on a PTE > range without changing the offsets. I've not implemented this, but if wanted I > can. For sure, UML doesn't currently need this interface. > > However, for now I've implemented only this change to mprotect(), I'd like to > get some feedback about this choice. No, I think I disagree with your choice here. A reasonable person, doing a successful, prohibitive (e.g. remove write access) mprotect on a range, would expect those prohibitions then to be enforced across the range. Whereas you're letting existing entries (whether present or paged out) retain the permissions they were given earlier with remap_file_pages. I think mprotect should remove the VM_MANYPROTS attribute of each vma in the range, and give all the present entries the new pgprot (in the same way that it normally does). Or, if that would really bloat the implementation (I don't see why, but perhaps the encoding of absent entries, maybe manyprots, maybe nonlinear, might do so), just fail with -EACCES when sys_mprotect meets a manyprots vma, until someone asks for better. As to letting remap_file_pages change permissions without changing file offsets somehow (another MAP_ flag I suppose), yes, it could be done but don't bother until/unless there is such need. > [PATCH] remove implied vm_ops check This one was missing from the sequence, and you supplied privately. Remove implied vm_ops check?? No, that belongs to something else. > Even now, we'll sometimes take the write lock, and maybe go to sleep with it > for I/O. You're writing about sys_remap_file_pages. > So, in that case, we could downgrade it; after a tiny bit of thought, I've > choosen doing that when we'll either do any I/O or we'll alter a lot of PTEs. > About how much "a lot" is, I've copied the values from this code in > mm/memory.c, but note they're about mm->page_table_lock, a spinlock, not a > semaphore: > > #ifdef CONFIG_PREEMPT > # define ZAP_BLOCK_SIZE (8 * PAGE_SIZE) > #else > /* No preempt: go for improved straight-line efficiency */ > # define ZAP_BLOCK_SIZE (1024 * PAGE_SIZE) > #endif > > I'm not sure about the trade-offs - we used to have a down_write, now we have > a down_read() and a possible up_read()-down_write(), and with this patch, the > fast-path still takes only down_read, but the slow path will do down_read(), > up_read(), down_write(), downgrade_write(). This will increase by one the > number of atomic operation but increase concurrency wrt mmap and similar > operations - I don't know how much contention there is on that lock. Please drop all this, it's overengineered. A semaphore is not a spinlock, it's not adding to the latency of the system as a whole, just preventing concurrent page faults on that mm (if you downgrade_write, you're still excluding concurrent mmaps, mprotects etc., and necessarily so). Originally every sys_remap_file_pages did down_write, and there was a downgrade_write before the populate. Doing down_write every time did hurt, so it was changed only to do that when first going nonlinear (to respect the locking on vm_flags), and we didn't bother to downgrade in that one instance. I don't mind you just adding a simple if (has_write_lock) downgrade_write(&mm->mmap_sem); before the populate (and getting rid of the has_write_lock condition further down), but anything more seems overkill to me. > Also, drop a bust comment: we cannot clear VM_NONLINEAR simply because code > elsewhere is going to use it. At the very least, madvise_dontneed() relies on > that flag being set (remaining non-linear truncation read the mapping > list), but the list is probably longer, and going to increase. You misunderstand the comment (it could indeed be clearer), but you're right that the bit about downgrading the lock is out-of-date. I suggest you insert this comment instead. /* * We would like to clear VM_NONLINEAR, in the case when * sys_remap_file_pages covers the whole vma, so making * it linear again. But cannot do so until after a * successful populate, and have no way to upgrade sem. */ > Subject: [patch 04/18] remap_file_pages protection support: cleanup syscall checks > > This patch reorganizes the code only, without differences in behaviour. It > makes the code more readable on its own, and is needed for next patches. I've > split this out to avoid cluttering real patches. Okay, you're breaking up conditions nicely, but please break this one up too > > + if (!vma->vm_ops || !vma->vm_ops->populate || end <= start || start < > + vma->vm_start || end > vma->vm_end) > + goto out_unlock; into a vm_ops part and a start/end part. > Subject: [patch 05/18] remap_file_pages protection support: enhance syscall interface > > @@ -203,7 +203,7 @@ asmlinkage long sys_remap_file_pages(uns > /* Can we represent this offset inside this architecture's pte's? */ > #if PTE_FILE_MAX_BITS < BITS_PER_LONG > if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS)) > - return err; > + return -EOVERFLOW; > #endif In the last patch you replaced all(?) the early returns by gotos; now in this one you reintroduce an early return. Better be consistent and make this one a goto too. > + if (flags & MAP_NOINHERIT) { > + err = -EPERM; > + if (((prot & PROT_READ) && !(vma->vm_flags & VM_MAYREAD))) > + goto out_unlock; > + if (((prot & PROT_WRITE) && !(vma->vm_flags & VM_MAYWRITE))) > + goto out_unlock; > + if (((prot & PROT_EXEC) && !(vma->vm_flags & VM_MAYEXEC))) > + goto out_unlock; > + err = -EINVAL; > + pgprot = protection_map[calc_vm_prot_bits(prot) | VM_SHARED]; Please follow mprotect's way of doing this, calc_vm_prot_bits first then if ((newflags & ~(newflags >> 4)) & 0xf) { Except please do NOT say 0xf! Use VM_READ|VM_WRITE|VM_EXEC instead (yeah, yeah, I know that only comes to 7). > Subject: [patch 06/18] remap_file_pages protection support: support private vma for MAP_POPULATE > > Fix MAP_POPULATE | MAP_PRIVATE. We don't need the VMA to be shared if we don't > rearrange pages around. And it's trivial to do. This seems reasonable to me. I was afraid you were going to extend VM_NONLINEAR to private maps, but no, you're just letting private maps be populated linearly, that's fine. (Really we ought then also to allow anonymous maps be pre-populated also, but that's a different patch, which wouldn't touch sys_remap_file_pages at all: forget it.) > --- linux-2.6.git/mm/mmap.c~rfp-private-vma-2 2005-08-24 20:57:13.000000000 +0200 > +++ linux-2.6.git-paolo/mm/mmap.c 2005-08-24 20:57:13.000000000 +0200 > @@ -1124,6 +1124,10 @@ out: > } > if (flags & MAP_POPULATE) { > up_write(&mm->mmap_sem); > + /* > + * remap_file_pages() works even if the mapping is private, > + * in the linearly-mapped case: > + */ But that's one of your historical comments. The point of the patch is that it's peculiar not to populate the private mappings, so why comment when we're no longer doing something peculiar? Please leave it out. > sys_remap_file_pages(addr, len, 0, > pgoff, flags & MAP_NONBLOCK); > down_write(&mm->mmap_sem); > Subject: [patch 07/18] remap_file_pages protection support: safety net for lazy arches > > Since proper support requires that the arch at the very least handles > VM_FAULT_SIGSEGV, as in next patch (otherwise the arch may BUG), and things > are even more complex (see next patches), and it's triggerable only with > VM_NONUNIFORM vma's, simply refuse creating them if the arch doesn't declare > itself ready. > > This is a very temporary hack, so I've clearly marked it as such. And, with > current rythms, I've given about 6 months for arches to get ready. Reducing > this time is perfectly ok for me. > > +#ifndef __ARCH_SUPPORTS_VM_NONUNIFORM > + if (flags & MAP_NOINHERIT) > + goto out; > +#endif Any arch that doesn't have a definition of MAP_MANYPROTS in its asm/mman.h has its build broken anyways. While you're giving them all MAP_MANYPROTS definitions, please just modify their fault handlers, as simply as possible. So skip the __ARCH_SUPPORTS_VM_NONUNIFORM business. > > +++ linux-2.6.git-paolo/Documentation/feature-removal-schedule.txt 2005-08-24 20:57:18.000000000 +0200 > + > +--------------------------- > + > +What: __ARCH_SUPPORTS_VM_NONUNIFORM > +When: December 2005 > +Files: mm/fremap.c, include/asm-*/pgtable.h > +Why: It's just there to allow arches to update their page fault handlers to > + support VM_FAULT_SIGSEGV, for remap_file_pages protection support. > + Since they may BUG if this support is not added, the syscall code > + refuses this new operation mode unless the arch declares itself as > + "VM_FAULT_SIGSEGV-aware" with this macro. > +Who: Paolo 'Blaisorblade' Giarrusso <bla...@ya...> This is very thorough and well-intentioned, but skip it - we do that for old features from which many drivers need converting, but it's not like that with the architectures. Just supply the simple patches. > Subject: [patch 08/18] remap file pages protection support: use FAULT_SIGSEGV for protection checking Ah ha, now we get to the heart of it. (And nothing wrong with such a lead up, I too like a number of patches to set the stage before the big one.) Well, this is two patches. First there's handling VM_FAULT_SIGSEGV in various arches. Please minimize those patches for now: VM_FAULT_SIGSEGV is very sensible in itself, and with its help we might be able to move more code from arches to common in future; but for now, try to keep it to adding the case VM_FAULT_SIGSEGV: goto bad_area; in each arch fault handler. Resist changing "survive" to "handle_fault". Resist adding "access_mask" and passing it in place of "write": that may be necessary later to handle VM_EXEC properly on a few architectures, but I suggest (particularly given the subset of arches of interest) that you start with just the read/write distinction, and if exec needs more add it later (even if it's just a later patch of the same series). And resist adding a preliminary VM_MANYPROTS check to each fault handler. Let's say instead that the mmap/mprotect permissions are the maximum that the area can take (and therefore modify the check done on prots in sys_remap_file_pages, to make sure they do not exceed "default"). Would that work out? And then there's the core patch, to mm/memory.c. But I'm afraid at this point, just when it gets interesting, my time and my patience run out. I still haven't come to a conclusion on the most interesting lines of all, do_no_page's if (write_access || unlikely(vma->vm_flags & VM_NONUNIFORM)) entry = maybe_mkwrite(pte_mkdirty(entry), vma); but I suspect you're wrong to be avoiding do_wp_page at all costs, should instead be adapting do_wp_page to do the right thing with the faults which arrive there. And I think there's a serious flaw in handle_pte_fault, where it says > + /* VM_NONUNIFORM vma's have PTE's always installed with the correct > + * protection. So, generate a SIGSEGV if a fault is caught there. */ > + if (unlikely(vma->vm_flags & VM_NONUNIFORM)) > + goto out_segv; Consider two threads faulting on the same pte on different cpus. One of them fixes it up, you're giving the other SIGSEGV? I think this runs quite deep and will need a rework. Sorry, I do not know when I can next face going over this, it's a hard task for me: perhaps someone else can take over. Hugh |
From: Blaisorblade <bla...@ya...> - 2005-09-04 19:13:15
|
On Friday 02 September 2005 23:02, Hugh Dickins wrote: > On Fri, 26 Aug 2005, Blaisorblade wrote: > > * The first 2 patches modify the PTE encoding macros and start preparing > > the VM for the new situation (i.e. VMA which have variable protections, > > which are called VM_NONUNIFORM. I dropped the early VM_MANYPROTS name). > What a pity: please revert. VM_NONUNIFORM sounds impressive, but might > mean all kinds of things, maybe to do with NUMA. VM_MANYPROTS is good, > it says what it means. Ok. Btw, before I forget: I assume I should redo the patches rather than fix what you say on top of mine, (at least when not changing behaviour), right? > > * Patch 11 is a big simplification. Since we must encode the PTE's on > > swapout like in VM_NONLINEAR vmas, the simplest way to reuse the existing > > code is to make sure that VM_NONUNIFORM vmas are also marked as > > VM_NONLINEAR. > In some places you seem to say that you (UML) only need VM_MANYPROTS vmas > linear, in other places you seem to say that your VM_MANYPROTS vmas will > be nonlinear. I've no idea which way round it is. Perhaps the "non" > sometimes goes missing (another reason to avoid NONUNIFORM). > I wrote that yesterday. Thanks, you've cleared it up in private mail: > the VM_MANYPROTS vmas that UML wants are VM_NONLINEAR anyway. > Yes, I see your dilemma there: you rightly want to avoid adding bloat > by distinguishing cases that you don't need distinguished; but equally > rightly fear that someone somewhere will start using the VM_MANYPROTS > for other reasons, and hit the inefficiencies of VM_NONLINEAR > unnecessarily. I share your uncertainty, I don't have an immediate > feel for the right direction on that. We'll see later if we can cater to this case without messing up zap_pte_range as I did in last patch (that is the only one with which I was able to break something - not in the version I sent, however). > > *) No more usage of a new syscall slot: to use the new interface, > > application will use the new MAP_NOINHERIT flag I've added. I've still > > the patches to use the old -mm ABI, if there's any reason they're needed. > I'm glad you've scrapped the new syscall slot, that really put me off > the old patch (though I was probably being silly about it). This way > is much better, but again I quarrel with your naming. > "Inherit" is about parents and children, this is not; and furthermore, > some UNIXes had a MAP_INHERIT (see asm-alpha/mman.h) which was about > passing an mmap across exec. Your MAP_NOINHERIT has nothing to do > with that. MAP_MANYPROTS would help us to follow the trail more > easily (though it's true that you can't actually pass many prots > in to a single remap_file_pages call). MAP_CHGPROT? MAP_CHANGEPROT? MAP_REPROT? VM_MANYPROTS is internal name, so there's no reason to have the same name either. > > Subject: [patch 01/18] remap_file_pages protection support: uml, i386, > > x64 bits > > > > Update pte encoding macros for UML, i386 and x86-64. > > Also, add the MAP_NOINHERIT flag to arch headers. > > Well, I don't find your patch division very helpful, since you introduce > these without us seeing what use is made of them. And the MAP_NOINHERIT > additions cover a different subset of arches (ppc, ppc64, s390 in there): > those should be in some other patch. For this patch, I joined up everything because people get scared when they see 39 patches (and I've not really removed code, apart for things which were introduced and later rewritten, just changed the presentation between the two sends). > Usually we just do the i386 arch first, and supply some other patch(es) > for all the others. But you've good reason to start with UML too, and > it makes sense to include x86_64 along too if you're happy to do so. > But it'll probably waste your time and mine to go on discussng patch > division, let's leave it at that. > > *** remap_file_pages protection support: improvement for UML bits > > Recover one bit by additionally using _PAGE_NEWPROT. Since I wasn't sure > > this would work, I've split this out, but it has worked well. We rely on > > the fact that pte_newprot always checks first if the PTE is marked > > present. > And we never hear of _PAGE_NEWPROT or pte_newprot again. Ah, they're > already defined in and peculiar to UML, I see. Well, if this some > UML improvement change, please put that in a separate UML patch. As above, I joined altogether more patches to reduce noise. And after proper unit testing and checking, it was safe anyway to join it. > > -#define pte_to_pgoff(pte) (pte_val(pte) >> 4) > > +#define pte_to_pgoff(pte) (((pte_val(pte) >> 6) << 1) | ((pte_val(pte) > > >> 2) & 0x1)) +#define pte_to_pgprot(pte) \ > > + __pgprot((pte_val(pte) & (_PAGE_RW | _PAGE_PROTNONE)) \ > > + | ((pte_val(pte) & _PAGE_PROTNONE) ? 0 : \ > > + (_PAGE_USER | _PAGE_PRESENT)) | _PAGE_ACCESSED) > It took me a long time to get past this! But it's not your fault > at all, you are just tweaking what's there. In the end I decided > I'd do better to take it on trust and move along. > (I realize that PROTNONE is a specially awkward case, but I keep > feeling that we make it even more awkward than it need be. But > again, if so, that's not your fault. And I've never been forced > to think it through here, as you have.) Actually I first just included this piece from Ingo, while changing shifts - one week later I was astonished about where we put _PAGE_USER, until realizing it's not read from the PTE but synthetized above. I should put a comment there, anyway. > > -#define PTE_FILE_MAX_BITS 29 > > +#define PTE_FILE_MAX_BITS 28 > > That being the i386 2-level case. I think that one less bit there is > probably okay, since wli changed the 3-level case to use all 32 bits > of the high word. The people needing 29 file offset bits are probably > already using PAE, and can be directed to it if not. > But there may be other 32-bit arches, without a 3-level alternative, > which are impacted. (I don't think I'm telling you anything you > don't already know, just highlighting a point for others to comment.) s390/31 bit would suffer even more (they go 26->25). But on this, it can be avoided (for who doesn't use the new API) as explained below. > > +#define MAP_NOINHERIT 0x20000 /* don't inherit the protection bits of > > the underlying vma*/ > As above, MAP_MANYPROTS or something better please. And please make it > clear in the comment that it's a flag to remap_file_pages and nothing else. Ok. > > Subject: [patch 02/18] remap_file_pages protection support: handle > > nonuniform VMAs > > * (TODO?) avoid regression in max file offset with r_f_p() for older > > mappings; we could use either the old offset encoding or the new > > offset-prot encoding depending on this flag. > > It's trivial to do, just I don't know whether existing apps will > > overflow the new limits. They go down from 2Tb to 1Tb on i386 and 512G on > > PPC, and from 256G to 128G on S390/31 bits. Give me a call in case. > Spare me! That would make those macros even worse. > I hope noone will demand it. It's just what you remarked above. But we'd have separate macros and code paths (not hugely separate): use the old macro version if VM_MANYPROTS clear, use the new one if VM_MANYPROTS set. > > mprotect alters the VMA prots and walks each present PTE, ignoring > > installed ones; their saved prots will be restored on faults, ignoring > > VMA ones and losing the mprotect() on them. So, we must restore VMA prots > > when the VMA is uniform, as we used to do. > If I understand it, you're here commenting on the history of these > remap_file_pages patches. That's okay in the 0/18 introducing > a new version, but please, in the comment on the patch itself, > explain what it's doing, not how you've been changing it around. > Oh, mm/mprotect.c doesn't even come in this patch, you're writing > about the next one. "We used to do" refers to before the patches themselves, i.e. with current r_f_p() code. > > + pgprot = vma->vm_flags & VM_NONUNIFORM ? pte_to_pgprot(*pte): > > vma->vm_page_prot; > > Personally I find "(vma->vm_flags & VM_NONUNIFORM)" more reassuring there. Ok. > > Subject: [patch 03/18] remap_file_pages protection support: make mprotect > > skip pagetables on nonuniform > > There is IMHO no reason to support using mprotect on non-uniform VMAs. > > The only exception is to change the VMA's default protection (which is > > used for non-individually remapped pages), but it must still ignore the > > page tables, as done in this patch. > > The only unsatisfied need is if I want to change protections without > > changing the indexes, which with remap_file_pages you must do one page at > > a time and re-specifying the indexes. > > It is more reasonable to allow remap_file_pages to change protections on > > a PTE range without changing the offsets. I've not implemented this, but > > if wanted I can. For sure, UML doesn't currently need this interface. > > However, for now I've implemented only this change to mprotect(), I'd > > like to get some feedback about this choice. > No, I think I disagree with your choice here. A reasonable person, > doing a successful, prohibitive (e.g. remove write access) mprotect > on a range, would expect those prohibitions then to be enforced across > the range. Whereas you're letting existing entries (whether present > or paged out) retain the permissions they were given earlier with > remap_file_pages. A reasonable person using VM_MANYPROTS must study the new API anyway - and if he went even to the trouble of writing twice the code to support even older kernels (UML does), and he wants to prohibit access to a range, he'd rather go with remap_file_pages (PROT_NONE), which does exactly what you expect, or it would waste the advantage of avoiding VMA splitting. My point is that after I set some PTEs individually with RFP, while some others are installed by do_no_page() with the default VMA prot, how can the kernel distinguish between them? Because (say) I don't want to change the ones I installed separately. Should the kernel cater to this need? However, changing the default VMA prot is very reasonable. And with the way I chose the user is given more flexibility. Random example: a profiler/debugger lets the app run for a while and fault in some pages, on a nonuniform vma, and/or maps some ones by hand, and then wants to protect the rest without touching present ones - it'd use mprotect(), but if that's done your way it becomes impossible. Don't know if this will ever be useful, but that's why the API is more powerful. However without that new "remap range changing permissions without changing offset" we have a problem on the other side. So, since we have no use for it currently, I'll choose -EINVAL. > I think mprotect should remove the VM_MANYPROTS attribute of each > vma in the range, and give all the present entries the new pgprot > (in the same way that it normally does). > Or, if that would really bloat the implementation (I don't see why, > but perhaps the encoding of absent entries, maybe manyprots, maybe > nonlinear, might do so), just fail with -EACCES when sys_mprotect > meets a manyprots vma, until someone asks for better. That is reasonable; -EINVAL makes more sense for me in this case (why permission denied?) > As to letting remap_file_pages change permissions without changing > file offsets somehow (another MAP_ flag I suppose), yes, it could > be done but don't bother until/unless there is such need. Agreed. > > [PATCH] remove implied vm_ops check > This one was missing from the sequence, and you supplied privately. > Remove implied vm_ops check?? No, that belongs to something else. Ahr - sorry, that's an old bug of my SCM scripts. > > Even now, we'll sometimes take the write lock, and maybe go to sleep with > > it for I/O. > You're writing about sys_remap_file_pages. Yes, the real title is more meaningful. > > So, in that case, we could downgrade it; after a tiny bit of thought, > > I've choosen doing that when we'll either do any I/O or we'll alter a lot > > of PTEs. About how much "a lot" is, I've copied the values from this code > > in mm/memory.c, but note they're about mm->page_table_lock, a spinlock, > > not a semaphore: > > > > #ifdef CONFIG_PREEMPT > > # define ZAP_BLOCK_SIZE (8 * PAGE_SIZE) > > #else > > /* No preempt: go for improved straight-line efficiency */ > > # define ZAP_BLOCK_SIZE (1024 * PAGE_SIZE) > > #endif > > I'm not sure about the trade-offs - we used to have a down_write, now we > > have a down_read() and a possible up_read()-down_write(), and with this > > patch, the fast-path still takes only down_read, but the slow path will > > do down_read(), up_read(), down_write(), downgrade_write(). This will > > increase by one the number of atomic operation but increase concurrency > > wrt mmap and similar operations - I don't know how much contention there > > is on that lock. > Please drop all this, it's overengineered. A semaphore is not a spinlock, > it's not adding to the latency of the system as a whole, just preventing > concurrent page faults on that mm Don't know if this is casual - but have you anything personal against Christoph Lameter ;-)? > (if you downgrade_write, you're still > excluding concurrent mmaps, mprotects etc., and necessarily so). Yes, correct, I missed that. > Originally every sys_remap_file_pages did down_write, and there was a > downgrade_write before the populate. Doing down_write every time did > hurt, so it was changed only to do that when first going nonlinear > (to respect the locking on vm_flags), and we didn't bother to downgrade > in that one instance. > I don't mind you just adding a simple > if (has_write_lock) > downgrade_write(&mm->mmap_sem); > before the populate (and getting rid of the has_write_lock condition > further down), but anything more seems overkill to me. Ok, will see with Ingo. The fact that it's done only on first time and it's not a spinlock agrees with you, anyway. > > Also, drop a bust comment: we cannot clear VM_NONLINEAR simply because > > code elsewhere is going to use it. At the very least, madvise_dontneed() > > relies on that flag being set (remaining non-linear truncation read the > > mapping list), but the list is probably longer, and going to increase. > You misunderstand the comment (it could indeed be clearer), but you're > right that the bit about downgrading the lock is out-of-date. > I suggest you insert this comment instead. > /* > * We would like to clear VM_NONLINEAR, in the case when > * sys_remap_file_pages covers the whole vma, so making > * it linear again. But cannot do so until after a > * successful populate, and have no way to upgrade sem. > */ Aaaaaaaaah, ok. > > Subject: [patch 04/18] remap_file_pages protection support: cleanup > > syscall checks > > This patch reorganizes the code only, without differences in behaviour. > > It makes the code more readable on its own, and is needed for next > > patches. I've split this out to avoid cluttering real patches. > Okay, you're breaking up conditions nicely, but please break this one up > too > > + if (!vma->vm_ops || !vma->vm_ops->populate || end <= start || start < > > + vma->vm_start || end > vma->vm_end) > > + goto out_unlock; > > into a vm_ops part and a start/end part. Ok. > > Subject: [patch 05/18] remap_file_pages protection support: enhance > > syscall interface > > @@ -203,7 +203,7 @@ asmlinkage long sys_remap_file_pages(uns > > /* Can we represent this offset inside this architecture's pte's? */ > > #if PTE_FILE_MAX_BITS < BITS_PER_LONG > > if (pgoff + (size >> PAGE_SHIFT) >= (1UL << PTE_FILE_MAX_BITS)) > > - return err; > > + return -EOVERFLOW; > > #endif > In the last patch you replaced all(?) the early returns by gotos; > now in this one you reintroduce an early return. Better be consistent > and make this one a goto too. Ok. > > + if (flags & MAP_NOINHERIT) { > > + err = -EPERM; > > + if (((prot & PROT_READ) && !(vma->vm_flags & VM_MAYREAD))) > > + goto out_unlock; > > + if (((prot & PROT_WRITE) && !(vma->vm_flags & VM_MAYWRITE))) > > + goto out_unlock; > > + if (((prot & PROT_EXEC) && !(vma->vm_flags & VM_MAYEXEC))) > > + goto out_unlock; > > + err = -EINVAL; > > + pgprot = protection_map[calc_vm_prot_bits(prot) | VM_SHARED]; > Please follow mprotect's way of doing this, calc_vm_prot_bits first then > if ((newflags & ~(newflags >> 4)) & 0xf) { > Except please do NOT say 0xf! Use VM_READ|VM_WRITE|VM_EXEC instead > (yeah, yeah, I know that only comes to 7). Ok - nice trick hardcoding the shift between MAY and base flags.... when constants are hardcoded in assembler, nobody forgets comments. > > Subject: [patch 06/18] remap_file_pages protection support: support > > private vma for MAP_POPULATE > > Fix MAP_POPULATE | MAP_PRIVATE. We don't need the VMA to be shared if we > > don't rearrange pages around. And it's trivial to do. > This seems reasonable to me. I was afraid you were going to extend > VM_NONLINEAR to private maps, but no, you're just letting private maps > be populated linearly, that's fine. Would that be a real problem, when limited to readonly mappings? There's some interest in that, for library loading - a binary with 100 dso's has 300 vmas... I see the problem with anonymous memory is bigger (not entirely different from the current situation). > (Really we ought then also to > allow anonymous maps be pre-populated also, but that's a different > patch, which wouldn't touch sys_remap_file_pages at all: forget it.) > > --- linux-2.6.git/mm/mmap.c~rfp-private-vma-2 2005-08-24 > > 20:57:13.000000000 +0200 +++ linux-2.6.git-paolo/mm/mmap.c 2005-08-24 > > 20:57:13.000000000 +0200 @@ -1124,6 +1124,10 @@ out: > > } > > if (flags & MAP_POPULATE) { > > up_write(&mm->mmap_sem); > > + /* > > + * remap_file_pages() works even if the mapping is private, > > + * in the linearly-mapped case: > > + */ > But that's one of your historical comments. The point of the patch is > that it's peculiar not to populate the private mappings, so why comment > when we're no longer doing something peculiar? Please leave it out. Ok. > > sys_remap_file_pages(addr, len, 0, > > pgoff, flags & MAP_NONBLOCK); > > down_write(&mm->mmap_sem); > > Subject: [patch 07/18] remap_file_pages protection support: safety net > > for lazy arches > > Since proper support requires that the arch at the very least handles > > VM_FAULT_SIGSEGV, as in next patch (otherwise the arch may BUG), and > > things are even more complex (see next patches), and it's triggerable > > only with VM_NONUNIFORM vma's, simply refuse creating them if the arch > > doesn't declare itself ready. > > This is a very temporary hack, so I've clearly marked it as such. And, > > with current rythms, I've given about 6 months for arches to get ready. > > Reducing this time is perfectly ok for me. > > +#ifndef __ARCH_SUPPORTS_VM_NONUNIFORM > > + if (flags & MAP_NOINHERIT) > > + goto out; > > +#endif > Any arch that doesn't have a definition of MAP_MANYPROTS in its asm/mman.h > has its build broken anyways. While you're giving them all MAP_MANYPROTS > definitions, please just modify their fault handlers, as simply as > possible. So skip the __ARCH_SUPPORTS_VM_NONUNIFORM business. Argh, ok. > > +++ > > linux-2.6.git-paolo/Documentation/feature-removal-schedule.txt 2005-08-24 > > 20:57:18.000000000 +0200 + [...] > This is very thorough and well-intentioned, but skip it - we do that > for old features from which many drivers need converting, but it's > not like that with the architectures. Just supply the simple patches. > > Subject: [patch 08/18] remap file pages protection support: use > > FAULT_SIGSEGV for protection checking > Ah ha, now we get to the heart of it. (And nothing wrong with such a lead > up, I too like a number of patches to set the stage before the big one.) > Well, this is two patches. First there's handling VM_FAULT_SIGSEGV in > various arches. Please minimize those patches for now: VM_FAULT_SIGSEGV > is very sensible in itself, and with its help we might be able to move > more code from arches to common in future; but for now, try to keep it > to adding the > case VM_FAULT_SIGSEGV: > goto bad_area; > in each arch fault handler. Resist changing "survive" to "handle_fault". > Resist adding "access_mask" and passing it in place of "write": that may > be necessary later to handle VM_EXEC properly on a few architectures, > but I suggest (particularly given the subset of arches of interest) > that you start with just the read/write distinction, and if exec needs > more add it later (even if it's just a later patch of the same series). > And resist adding a preliminary VM_MANYPROTS check to each fault > handler. Let's say instead that the mmap/mprotect permissions are the > maximum that the area can take (and therefore modify the check done on > prots in sys_remap_file_pages, to make sure they do not exceed "default"). > Would that work out? No, absolutely not, I'm sorry. Since I may have sparse mappings (I'll use this to emulate TLBs), we have a huge PROT_NONE mapping and then remap individual pages, after their allocation, at fault time, with more permissive settings. That check may be moved later, to beginning of bad_area, if you say "default prots are the minimum one, I can only remap with more permissive settings". That would avoid affecting the fast path - even if the branch is clearly an "unlikely" one, so shouldn't give mispredictions at least. If your problem is just avoiding changing all arch handlers, that amounts to letting the new API misbehave on them (a bit of an hack, but with case VM_FAULT_SIGSEGV: goto bad_area; everywhere it's still safe) or using my arch_supports trick. > And then there's the core patch, to mm/memory.c. But I'm afraid at this > point, just when it gets interesting, my time and my patience run out. > I still haven't come to a conclusion on the most interesting lines of > all, do_no_page's > if (write_access || unlikely(vma->vm_flags & VM_NONUNIFORM)) > entry = maybe_mkwrite(pte_mkdirty(entry), vma); > but I suspect you're wrong to be avoiding do_wp_page at all costs, > should instead be adapting do_wp_page to do the right thing with the > faults which arrive there. Actually, today (in your discussion with "Heff" Dike) I realized the above change is redundant. pte_mkdirty is only needed if VM_WRITE (and not even there), and in that case, since we are doing shared mappings, the write access is already set in vma->vm_page_prot. > And I think there's a serious flaw in handle_pte_fault, where it says > > + /* VM_NONUNIFORM vma's have PTE's always installed with the correct > > + * protection. So, generate a SIGSEGV if a fault is caught there. */ > > + if (unlikely(vma->vm_flags & VM_NONUNIFORM)) > > + goto out_segv; > Consider two threads faulting on the same pte on different cpus. > One of them fixes it up, you're giving the other SIGSEGV? > I think this runs quite deep and will need a rework. Not so deep. Weeeell, I was ready to ripping this out (even if for other reasons), I assume that comparing write_access/access_mask and the protections in the PTE is perfectly ok and fixes this. Luckily, even here we have no regression wrt other apps. > Sorry, I do not know when I can next face going over this, > it's a hard task for me: perhaps someone else can take over. For me is ok - just tell Andrew who should be appointed at this. There's not an "official" list of VM maintainers anywhere, even if I'm under the impression you're currently the top one. > Hugh -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Messenger: chiamate gratuite in tutto il mondo http://it.beta.messenger.yahoo.com |
From: Hugh D. <hu...@ve...> - 2005-09-07 12:01:00
|
On Sun, 4 Sep 2005, Blaisorblade wrote: > On Friday 02 September 2005 23:02, Hugh Dickins wrote: > > On Fri, 26 Aug 2005, Blaisorblade wrote: > > > * The first 2 patches modify the PTE encoding macros and start preparing > > > the VM for the new situation (i.e. VMA which have variable protections, > > > which are called VM_NONUNIFORM. I dropped the early VM_MANYPROTS name). > > > What a pity: please revert. VM_NONUNIFORM sounds impressive, but might > > mean all kinds of things, maybe to do with NUMA. VM_MANYPROTS is good, > > it says what it means. > Ok. Btw, before I forget: I assume I should redo the patches rather than fix > what you say on top of mine, (at least when not changing behaviour), right? If it's something pervasive, like such naming, then yes you should redo the patches. Minor local corrections can be appended as an additional patch, so long as they don't make the original patch ridiculous. I don't understand your "(at least when not changing behaviour)": if you mean that significant changes of behaviour should be done as extra patches at the end, no: surely they should be patches of the main sequence. All of this supposes that my opinion is to be followed. I've given my opinions, Andi gave his, I don't remember others. It doesn't mean any one of us is right. Did you agree with mine? > > "Inherit" is about parents and children, this is not; and furthermore, > > some UNIXes had a MAP_INHERIT (see asm-alpha/mman.h) which was about > > passing an mmap across exec. Your MAP_NOINHERIT has nothing to do > > with that. MAP_MANYPROTS would help us to follow the trail more > > easily (though it's true that you can't actually pass many prots > > in to a single remap_file_pages call). > MAP_CHGPROT? MAP_CHANGEPROT? MAP_REPROT? > VM_MANYPROTS is internal name, so there's no reason to have the same name > either. It doesn't have to be the same name, true, but I find it a lot easier to follow the trail of functionality when similar naming is used. Perhaps the "PROT" part is enough: MAP_SETPROT or MAP_CHGPROT or MAP_CHANGEPROT if you don't like MAP_MANYPROTS. > It's just what you remarked above. But we'd have separate macros and code > paths (not hugely separate): use the old macro version if VM_MANYPROTS clear, > use the new one if VM_MANYPROTS set. I think those macros are grotesque enough already. But I don't have a constructive suggestion. > > No, I think I disagree with your choice here. A reasonable person, > > doing a successful, prohibitive (e.g. remove write access) mprotect > > on a range, would expect those prohibitions then to be enforced across > > the range. Whereas you're letting existing entries (whether present > > or paged out) retain the permissions they were given earlier with > > remap_file_pages. > A reasonable person using VM_MANYPROTS must study the new API anyway - and if > he went even to the trouble of writing twice the code to support even older > kernels (UML does), and he wants to prohibit access to a range, he'd rather > go with remap_file_pages (PROT_NONE), which does exactly what you expect, or > it would waste the advantage of avoiding VMA splitting. We disagree quite strongly on what's reasonable for mprotect to do, if it's to do anything interesting at all: we could indeed define it to do special new things for this case (sorry, I've cut your suggestions), but I'd much prefer not. However, it looks like we can agree that for now it needn't do anything on these areas but give an error. > > Or, if that would really bloat the implementation (I don't see why, > > but perhaps the encoding of absent entries, maybe manyprots, maybe > > nonlinear, might do so), just fail with -EACCES when sys_mprotect > > meets a manyprots vma, until someone asks for better. > That is reasonable; -EINVAL makes more sense for me in this case (why > permission denied?) But can we agree on which error? I avoided -EINVAL, partly because it's a very overloaded error code (perhaps the most overloaded of all?), so often unhelpful to the user; and partly because for mprotect it appears to relate to invalid PROT flags rather than anything else. Whereas if mprotect is applied to a HUGETLB area which it can't do anything with, -EACCES is currently returned. So I followed that precedent. > > > Subject: [patch 06/18] remap_file_pages protection support: support > > > private vma for MAP_POPULATE > > > > Fix MAP_POPULATE | MAP_PRIVATE. We don't need the VMA to be shared if we > > > don't rearrange pages around. And it's trivial to do. > > > This seems reasonable to me. I was afraid you were going to extend > > VM_NONLINEAR to private maps, but no, you're just letting private maps > > be populated linearly, that's fine. > Would that be a real problem, when limited to readonly mappings? > There's some interest in that, for library loading - a binary with 100 dso's > has 300 vmas... You seem to see vmas as nothing but obstacles. Perhaps there is a reason we divide the mm into vmas? But perhaps there is a better way of doing it. Regarding nonlinear readonly. I never asked Ingo why he excluded it - suspect he didn't intend to, but missed the peculiar treatment of VM_SHARED versus VM_MAYSHARE - my apologies, Ingo, if I'm underestimating you! But I was glad he had because it demands write access to the file being mapped nonlinear. Therefore the ordinary user cannot map libc.so nonlinear, and condemn all users to the sledgehammer fashion of try_to_unmap_cluster. Though thinking through that again now, the user of the nonlinear vma is penalized, and the whole system is penalized by the difficulty in reclaiming efficiently, but I don't see the other users of the library particularly penalized (they might be unfairly advantaged by having its pages stay unnaturally long in memory). Either I was wrong before, or I'm missing another aspect of it now: I don't know which. > > And resist adding a preliminary VM_MANYPROTS check to each fault > > handler. Let's say instead that the mmap/mprotect permissions are the > > maximum that the area can take (and therefore modify the check done on > > prots in sys_remap_file_pages, to make sure they do not exceed "default"). > > > Would that work out? > No, absolutely not, I'm sorry. Since I may have sparse mappings (I'll use this > to emulate TLBs), we have a huge PROT_NONE mapping and then remap individual > pages, after their allocation, at fault time, with more permissive settings. I think I get the picture now: you start with a dark sky, then insert the stars one by one. Yes, I accept what I proposed would be of no use to you. Okay, so as well as the VM_FAULT_SIGSEGV case, you do need to add that preliminary check to each arch. So far as I can see (I may have missed it), you really don't need to change from the write boolean (perhaps -1 for exec in one arch??) to the write_access mask: it's not a bad change, but the less you have to modify each of the architectures, the better for now. > That check may be moved later, to beginning of bad_area, if you say "default > prots are the minimum one, I can only remap with more permissive settings". > That would avoid affecting the fast path - even if the branch is clearly an > "unlikely" one, so shouldn't give mispredictions at least. Would it? Anyway, although it fits with UML's particular usage (starting from PROT_NONE), your minimum permissions idea seems neither natural nor useful to me (and I don't think you're convinced by it either, just trying to offer a constructive alternative). No, just stick with the "unlikely" test in the obvious place where you already put it, and leave it for later optimization to rearrange that if necessary: perhaps Christoph L will want modify the ia64 path, to move that test down into the bad_area handling, to jump back if manyprots; but no need for spaghetti in the initial implementation. > > And I think there's a serious flaw in handle_pte_fault, where it says > > > > + /* VM_NONUNIFORM vma's have PTE's always installed with the correct > > > + * protection. So, generate a SIGSEGV if a fault is caught there. */ > > > + if (unlikely(vma->vm_flags & VM_NONUNIFORM)) > > > + goto out_segv; > > > Consider two threads faulting on the same pte on different cpus. > > One of them fixes it up, you're giving the other SIGSEGV? > > I think this runs quite deep and will need a rework. > Not so deep. > Weeeell, I was ready to ripping this out (even if for other reasons), I assume > that comparing write_access/access_mask and the protections in the PTE is > perfectly ok and fixes this. Probably, yes. Let's see what you come up with next and think it over. > > Sorry, I do not know when I can next face going over this, > > it's a hard task for me: perhaps someone else can take over. > For me is ok - just tell Andrew who should be appointed at this. There's not > an "official" list of VM maintainers anywhere, even if I'm under the > impression you're currently the top one. Mercifully for you, me and Linux, no. Andrew was mm maintainer when he started the -mm tree, and remains so. But obviously he has a very heavy load, and may look to others for opinions. He didn't ask me on this, but remap_file_pages does fall into one of my areas of concern, so I felt obliged to give my opinions. I guess that I shall have to look at a later version, but I badly need to focus on the page_table_lock again before spending time elsewhere: I get very sick of holding up other people's requirements, but yours is not the first in line. Hugh |
From: Blaisorblade <bla...@ya...> - 2005-09-13 18:28:34
|
On Wednesday 07 September 2005 14:00, Hugh Dickins wrote: > On Sun, 4 Sep 2005, Blaisorblade wrote: > > On Friday 02 September 2005 23:02, Hugh Dickins wrote: > > > On Fri, 26 Aug 2005, Blaisorblade wrote: > > > > * The first 2 patches modify the PTE encoding macros and start > > > > preparing the VM for the new situation (i.e. VMA which have variable > > > > protections, which are called VM_NONUNIFORM. I dropped the early > > > > VM_MANYPROTS name). > If it's something pervasive, like such naming, then yes you should redo > the patches. Minor local corrections can be appended as an additional > patch, so long as they don't make the original patch ridiculous. > > I don't understand your "(at least when not changing behaviour)": > if you mean that significant changes of behaviour should be done as extra > patches at the end, no: surely they should be patches of the main sequence. Perfectly fine. > All of this supposes that my opinion is to be followed. > I've given my opinions, Andi gave his, I don't remember others. > It doesn't mean any one of us is right. Did you agree with mine? For me it's no problem renaming, I see your point, have no particular reason to insist with mine. > > > "Inherit" is about parents and children, this is not; [...] > > MAP_CHGPROT? MAP_CHANGEPROT? MAP_REPROT? > > VM_MANYPROTS is internal name, so there's no reason to have the same name > > either. > It doesn't have to be the same name, true, but I find it a lot easier > to follow the trail of functionality when similar naming is used. > Perhaps the "PROT" part is enough: MAP_SETPROT or MAP_CHGPROT or > MAP_CHANGEPROT if you don't like MAP_MANYPROTS. MAP_CHGPROT is perfectly ok... > > It's just what you remarked above. But we'd have separate macros and code > > paths (not hugely separate): use the old macro version if VM_MANYPROTS > > clear, use the new one if VM_MANYPROTS set. > I think those macros are grotesque enough already. That change wouldn't be in the macro themselves. We'd have a slightly different path in the caller to handle that. > But I don't have a constructive suggestion. -- Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!". Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894) http://www.user-mode-linux.org/~blaisorblade ___________________________________ Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB http://mail.yahoo.it |