From: Konstantin K. <khl...@op...> - 2012-03-31 09:29:40
|
Currently the kernel sets mm->exe_file during sys_execve() and then tracks number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon as this counter drops to zero kernel resets mm->exe_file to NULL. Plus it resets mm->exe_file at last mmput() when mm->mm_users drops to zero. Vma with VM_EXECUTABLE flag appears after mapping file with flag MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma splitting, because sys_mmap ignores this flag. Usually binfmt module sets mm->exe_file and mmaps some executable vmas with this file, they hold mm->exe_file while task is running. comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), where all this stuff was introduced: > The kernel implements readlink of /proc/pid/exe by getting the file from > the first executable VMA. Then the path to the file is reconstructed and > reported as the result. > > Because of the VMA walk the code is slightly different on nommu systems. > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of > walking the VMAs to find the first executable file-backed VMA we store a > reference to the exec'd file in the mm_struct. > > That reference would prevent the filesystem holding the executable file > from being unmounted even after unmapping the VMAs. So we track the number > of VM_EXECUTABLE VMAs and drop the new reference when the last one is > unmapped. This avoids pinning the mounted filesystem. So, this logic is hooked into every file mmap/unmmap and vma split/merge just to fix some hypothetical pinning fs from umounting by mm which already unmapped all its executable files, but still alive. Does anyone know any real world example? mm can be borrowed by swapoff or some get_task_mm() user, but it's not a big problem. Thus, we can remove all this stuff together with VM_EXECUTABLE flag and keep mm->exe_file alive till final mmput(). After that we can access current->mm->exe_file without any locks (after checking current->mm and mm->exe_file for NULL) Some code around security and oprofile still uses VM_EXECUTABLE for retrieving task's executable file, after this patch they will use mm->exe_file directly. In tomoyo and audit mm is always current->mm, oprofile uses get_task_mm(). Signed-off-by: Konstantin Khlebnikov <khl...@op...> Cc: Matt Helsley <mat...@us...> Cc: Al Viro <vi...@ze...> Cc: Eric Paris <ep...@re...> Cc: Oleg Nesterov <ol...@re...> Cc: lin...@vg... Cc: opr...@li... --- arch/powerpc/oprofile/cell/spu_task_sync.c | 15 ++++---------- arch/tile/mm/elf.c | 12 ++++-------- drivers/oprofile/buffer_sync.c | 17 +++------------- include/linux/mm.h | 4 ---- include/linux/mm_types.h | 1 - include/linux/mman.h | 1 - kernel/auditsc.c | 17 ++-------------- kernel/fork.c | 29 ++++------------------------ mm/mmap.c | 27 +++++--------------------- mm/nommu.c | 11 +---------- security/tomoyo/util.c | 14 +++----------- 11 files changed, 26 insertions(+), 122 deletions(-) diff --git a/arch/powerpc/oprofile/cell/spu_task_sync.c b/arch/powerpc/oprofile/cell/spu_task_sync.c index 642fca1..28f1af2 100644 --- a/arch/powerpc/oprofile/cell/spu_task_sync.c +++ b/arch/powerpc/oprofile/cell/spu_task_sync.c @@ -304,7 +304,7 @@ static inline unsigned long fast_get_dcookie(struct path *path) return cookie; } -/* Look up the dcookie for the task's first VM_EXECUTABLE mapping, +/* Look up the dcookie for the task's mm->exe_file, * which corresponds loosely to "application name". Also, determine * the offset for the SPU ELF object. If computed offset is * non-zero, it implies an embedded SPU object; otherwise, it's a @@ -321,7 +321,6 @@ get_exec_dcookie_and_offset(struct spu *spu, unsigned int *offsetp, { unsigned long app_cookie = 0; unsigned int my_offset = 0; - struct file *app = NULL; struct vm_area_struct *vma; struct mm_struct *mm = spu->mm; @@ -330,16 +329,10 @@ get_exec_dcookie_and_offset(struct spu *spu, unsigned int *offsetp, down_read(&mm->mmap_sem); - for (vma = mm->mmap; vma; vma = vma->vm_next) { - if (!vma->vm_file) - continue; - if (!(vma->vm_flags & VM_EXECUTABLE)) - continue; - app_cookie = fast_get_dcookie(&vma->vm_file->f_path); + if (mm->exe_file) { + app_cookie = fast_get_dcookie(&mm->exe_file->f_path); pr_debug("got dcookie for %s\n", - vma->vm_file->f_dentry->d_name.name); - app = vma->vm_file; - break; + mm->exe_file->f_dentry->d_name.name); } for (vma = mm->mmap; vma; vma = vma->vm_next) { diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c index 758b603..43e5279 100644 --- a/arch/tile/mm/elf.c +++ b/arch/tile/mm/elf.c @@ -39,16 +39,12 @@ static void sim_notify_exec(const char *binary_name) static int notify_exec(void) { int retval = 0; /* failure */ - struct vm_area_struct *vma = current->mm->mmap; - while (vma) { - if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file) - break; - vma = vma->vm_next; - } - if (vma) { + struct mm_struct *mm = current->mm; + + if (mm->exe_file) { char *buf = (char *) __get_free_page(GFP_KERNEL); if (buf) { - char *path = d_path(&vma->vm_file->f_path, + char *path = d_path(&mm->exe_file->f_path, buf, PAGE_SIZE); if (!IS_ERR(path)) { sim_notify_exec(path); diff --git a/drivers/oprofile/buffer_sync.c b/drivers/oprofile/buffer_sync.c index f34b5b2..d93b2b6 100644 --- a/drivers/oprofile/buffer_sync.c +++ b/drivers/oprofile/buffer_sync.c @@ -216,7 +216,7 @@ static inline unsigned long fast_get_dcookie(struct path *path) } -/* Look up the dcookie for the task's first VM_EXECUTABLE mapping, +/* Look up the dcookie for the task's mm->exe_file, * which corresponds loosely to "application name". This is * not strictly necessary but allows oprofile to associate * shared-library samples with particular applications @@ -224,21 +224,10 @@ static inline unsigned long fast_get_dcookie(struct path *path) static unsigned long get_exec_dcookie(struct mm_struct *mm) { unsigned long cookie = NO_COOKIE; - struct vm_area_struct *vma; - - if (!mm) - goto out; - for (vma = mm->mmap; vma; vma = vma->vm_next) { - if (!vma->vm_file) - continue; - if (!(vma->vm_flags & VM_EXECUTABLE)) - continue; - cookie = fast_get_dcookie(&vma->vm_file->f_path); - break; - } + if (mm && mm->exe_file) + cookie = fast_get_dcookie(&mm->exe_file->f_path); -out: return cookie; } diff --git a/include/linux/mm.h b/include/linux/mm.h index 553d134..3a4d721 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -88,7 +88,6 @@ extern unsigned int kobjsize(const void *objp); #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */ -#define VM_EXECUTABLE 0x00001000 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000 /* Memory mapped I/O or similar */ @@ -1374,9 +1373,6 @@ extern void exit_mmap(struct mm_struct *); extern int mm_take_all_locks(struct mm_struct *mm); extern void mm_drop_all_locks(struct mm_struct *mm); -/* From fs/proc/base.c. callers must _not_ hold the mm's exe_file_lock */ -extern void added_exe_file_vma(struct mm_struct *mm); -extern void removed_exe_file_vma(struct mm_struct *mm); extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file); extern struct file *get_mm_exe_file(struct mm_struct *mm); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 3cc3062..b480c06 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -378,7 +378,6 @@ struct mm_struct { /* store ref to file /proc/<pid>/exe symlink points to */ struct file *exe_file; - unsigned long num_exe_file_vmas; #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_mm *mmu_notifier_mm; #endif diff --git a/include/linux/mman.h b/include/linux/mman.h index 8b74e9b..77cec2f 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -86,7 +86,6 @@ calc_vm_flag_bits(unsigned long flags) { return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) | - _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) | _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ); } #endif /* __KERNEL__ */ diff --git a/kernel/auditsc.c b/kernel/auditsc.c index af1de0f..aa27a00 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1164,21 +1164,8 @@ static void audit_log_task_info(struct audit_buffer *ab, struct task_struct *tsk get_task_comm(name, tsk); audit_log_format(ab, " comm="); audit_log_untrustedstring(ab, name); - - if (mm) { - down_read(&mm->mmap_sem); - vma = mm->mmap; - while (vma) { - if ((vma->vm_flags & VM_EXECUTABLE) && - vma->vm_file) { - audit_log_d_path(ab, " exe=", - &vma->vm_file->f_path); - break; - } - vma = vma->vm_next; - } - up_read(&mm->mmap_sem); - } + if (mm && mm->exe_file) + audit_log_d_path(ab, " exe=", &mm->exe_file->f_path); audit_log_task_context(ab); } diff --git a/kernel/fork.c b/kernel/fork.c index b9372a0..40e4b49 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -587,26 +587,6 @@ void mmput(struct mm_struct *mm) } EXPORT_SYMBOL_GPL(mmput); -/* - * We added or removed a vma mapping the executable. The vmas are only mapped - * during exec and are not mapped with the mmap system call. - * Callers must hold down_write() on the mm's mmap_sem for these - */ -void added_exe_file_vma(struct mm_struct *mm) -{ - mm->num_exe_file_vmas++; -} - -void removed_exe_file_vma(struct mm_struct *mm) -{ - mm->num_exe_file_vmas--; - if ((mm->num_exe_file_vmas == 0) && mm->exe_file) { - fput(mm->exe_file); - mm->exe_file = NULL; - } - -} - void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) { if (new_exe_file) @@ -614,20 +594,19 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) if (mm->exe_file) fput(mm->exe_file); mm->exe_file = new_exe_file; - mm->num_exe_file_vmas = 0; } +/* + * Caller must have mm->mm_users reference, + * for example current->mm or acquired by get_task_mm(). + */ struct file *get_mm_exe_file(struct mm_struct *mm) { struct file *exe_file; - /* We need mmap_sem to protect against races with removal of - * VM_EXECUTABLE vmas */ - down_read(&mm->mmap_sem); exe_file = mm->exe_file; if (exe_file) get_file(exe_file); - up_read(&mm->mmap_sem); return exe_file; } diff --git a/mm/mmap.c b/mm/mmap.c index 3d254ca..2647bb7 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -230,11 +230,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma) might_sleep(); if (vma->vm_ops && vma->vm_ops->close) vma->vm_ops->close(vma); - if (vma->vm_file) { + if (vma->vm_file) fput(vma->vm_file); - if (vma->vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(vma->vm_mm); - } mpol_put(vma_policy(vma)); kmem_cache_free(vm_area_cachep, vma); return next; @@ -616,11 +613,8 @@ again: remove_next = 1 + (end > next->vm_end); mutex_unlock(&mapping->i_mmap_mutex); if (remove_next) { - if (file) { + if (file) fput(file); - if (next->vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(mm); - } if (next->anon_vma) anon_vma_merge(vma, next); mm->map_count--; @@ -1293,8 +1287,6 @@ munmap_back: error = file->f_op->mmap(file, vma); if (error) goto unmap_and_free_vma; - if (vm_flags & VM_EXECUTABLE) - added_exe_file_vma(mm); /* Can addr have changed?? * @@ -1969,11 +1961,8 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma, if (anon_vma_clone(new, vma)) goto out_free_mpol; - if (new->vm_file) { + if (new->vm_file) get_file(new->vm_file); - if (vma->vm_flags & VM_EXECUTABLE) - added_exe_file_vma(mm); - } if (new->vm_ops && new->vm_ops->open) new->vm_ops->open(new); @@ -1991,11 +1980,8 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma, /* Clean everything up if vma_adjust failed. */ if (new->vm_ops && new->vm_ops->close) new->vm_ops->close(new); - if (new->vm_file) { - if (vma->vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(mm); + if (new->vm_file) fput(new->vm_file); - } unlink_anon_vmas(new); out_free_mpol: mpol_put(pol); @@ -2377,11 +2363,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, new_vma->vm_start = addr; new_vma->vm_end = addr + len; new_vma->vm_pgoff = pgoff; - if (new_vma->vm_file) { + if (new_vma->vm_file) get_file(new_vma->vm_file); - if (vma->vm_flags & VM_EXECUTABLE) - added_exe_file_vma(mm); - } if (new_vma->vm_ops && new_vma->vm_ops->open) new_vma->vm_ops->open(new_vma); vma_link(mm, new_vma, prev, rb_link, rb_parent); diff --git a/mm/nommu.c b/mm/nommu.c index afa0a15..d617d5c 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -789,11 +789,8 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma) kenter("%p", vma); if (vma->vm_ops && vma->vm_ops->close) vma->vm_ops->close(vma); - if (vma->vm_file) { + if (vma->vm_file) fput(vma->vm_file); - if (vma->vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(mm); - } put_nommu_region(vma->vm_region); kmem_cache_free(vm_area_cachep, vma); } @@ -1287,10 +1284,6 @@ unsigned long do_mmap_pgoff(struct file *file, get_file(file); vma->vm_file = file; get_file(file); - if (vm_flags & VM_EXECUTABLE) { - added_exe_file_vma(current->mm); - vma->vm_mm = current->mm; - } } down_write(&nommu_region_sem); @@ -1443,8 +1436,6 @@ error: kmem_cache_free(vm_region_jar, region); if (vma->vm_file) fput(vma->vm_file); - if (vma->vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(vma->vm_mm); kmem_cache_free(vm_area_cachep, vma); kleave(" = %d", ret); return ret; diff --git a/security/tomoyo/util.c b/security/tomoyo/util.c index 867558c..b929dd3 100644 --- a/security/tomoyo/util.c +++ b/security/tomoyo/util.c @@ -949,19 +949,11 @@ bool tomoyo_path_matches_pattern(const struct tomoyo_path_info *filename, const char *tomoyo_get_exe(void) { struct mm_struct *mm = current->mm; - struct vm_area_struct *vma; const char *cp = NULL; - if (!mm) - return NULL; - down_read(&mm->mmap_sem); - for (vma = mm->mmap; vma; vma = vma->vm_next) { - if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file) { - cp = tomoyo_realpath_from_path(&vma->vm_file->f_path); - break; - } - } - up_read(&mm->mmap_sem); + if (mm && mm->exe_file) + cp = tomoyo_realpath_from_path(&mm->exe_file->f_path); + return cp; } |
From: Oleg N. <ol...@re...> - 2012-03-31 20:13:42
|
On 03/31, Konstantin Khlebnikov wrote: > > comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), > where all this stuff was introduced: > > > ... > > This avoids pinning the mounted filesystem. > > So, this logic is hooked into every file mmap/unmmap and vma split/merge just to > fix some hypothetical pinning fs from umounting by mm which already unmapped all > its executable files, but still alive. Does anyone know any real world example? This is the question to Matt. > keep mm->exe_file alive till final mmput(). Please see the recent discussion, http://marc.info/?t=133096188900012 (just in case, the patch itself was deadly wrong, don't look at it ;) > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -378,7 +378,6 @@ struct mm_struct { > > /* store ref to file /proc/<pid>/exe symlink points to */ > struct file *exe_file; > - unsigned long num_exe_file_vmas; Add Cyrill. This conflicts with c-r-prctl-add-ability-to-set-new-mm_struct-exe_file.patch in -mm. Oleg. |
From: Cyrill G. <gor...@op...> - 2012-03-31 20:39:24
|
On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: > > Add Cyrill. This conflicts with > c-r-prctl-add-ability-to-set-new-mm_struct-exe_file.patch in -mm. Thanks for CC'ing, Oleg. I think if thise series go in it won't be a problem to update my patch accordingly. Cyrill |
From: Konstantin K. <khl...@op...> - 2012-04-02 09:46:14
|
Cyrill Gorcunov wrote: > On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: >> >> Add Cyrill. This conflicts with >> c-r-prctl-add-ability-to-set-new-mm_struct-exe_file.patch in -mm. > > Thanks for CC'ing, Oleg. I think if thise series go in it won't > be a problem to update my patch accordingly. In this patch I leave mm->exe_file lockless. After exec/fork we can change it only for current task and only if mm->mm_users == 1. something like this: task_lock(current); if (atomic_read(¤t->mm->mm_users) == 1) set_mm_exe_file(current->mm, new_file); else ret = -EBUSY; task_unlock(current); task_lock() protect this code against get_task_mm() |
From: Cyrill G. <gor...@op...> - 2012-04-02 09:54:54
|
On Mon, Apr 02, 2012 at 01:46:03PM +0400, Konstantin Khlebnikov wrote: > Cyrill Gorcunov wrote: > >On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: > >> > >>Add Cyrill. This conflicts with > >>c-r-prctl-add-ability-to-set-new-mm_struct-exe_file.patch in -mm. > > > >Thanks for CC'ing, Oleg. I think if thise series go in it won't > >be a problem to update my patch accordingly. > > In this patch I leave mm->exe_file lockless. > After exec/fork we can change it only for current task and only if mm->mm_users == 1. > > something like this: > > task_lock(current); > if (atomic_read(¤t->mm->mm_users) == 1) > set_mm_exe_file(current->mm, new_file); > else > ret = -EBUSY; > task_unlock(current); > > task_lock() protect this code against get_task_mm() I see. Konstantin, the question is what is more convenient way to update the patch in linux-next. The c-r-prctl-add-ability-to-set-new-mm_struct-exe_file.patch is in -mm already, so I either should wait until Andrew pick your series up and send updating patch on top, or I could fetch your series, update my patch and send it here as reply. Hmm? Cyrill |
From: Konstantin K. <khl...@op...> - 2012-04-02 10:13:58
|
Cyrill Gorcunov wrote: > On Mon, Apr 02, 2012 at 01:46:03PM +0400, Konstantin Khlebnikov wrote: >> Cyrill Gorcunov wrote: >>> On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: >>>> >>>> Add Cyrill. This conflicts with >>>> c-r-prctl-add-ability-to-set-new-mm_struct-exe_file.patch in -mm. >>> >>> Thanks for CC'ing, Oleg. I think if thise series go in it won't >>> be a problem to update my patch accordingly. >> >> In this patch I leave mm->exe_file lockless. >> After exec/fork we can change it only for current task and only if mm->mm_users == 1. >> >> something like this: >> >> task_lock(current); >> if (atomic_read(¤t->mm->mm_users) == 1) >> set_mm_exe_file(current->mm, new_file); >> else >> ret = -EBUSY; >> task_unlock(current); >> >> task_lock() protect this code against get_task_mm() > > I see. Konstantin, the question is what is more convenient way to update the > patch in linux-next. The c-r-prctl-add-ability-to-set-new-mm_struct-exe_file.patch > is in -mm already, so I either should wait until Andrew pick your series up and > send updating patch on top, or I could fetch your series, update my patch and > send it here as reply. Hmm? Let's wait for Andrew's response. And maybe somebody disagree with my changes. |
From: Matt H. <mat...@us...> - 2012-04-02 23:06:16
|
On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: > On 03/31, Konstantin Khlebnikov wrote: > > > > comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), > > where all this stuff was introduced: > > > > > ... > > > This avoids pinning the mounted filesystem. > > > > So, this logic is hooked into every file mmap/unmmap and vma split/merge just to > > fix some hypothetical pinning fs from umounting by mm which already unmapped all > > its executable files, but still alive. Does anyone know any real world example? > > This is the question to Matt. This is where I got the scenario: https://lkml.org/lkml/2007/7/12/398 Cheers, -Matt Helsley PS: I seem to keep coming back to this so I hope folks don't mind if I leave some more references to make (re)searching this topic easier: Thread with Cyrill Gorcunov discussing c/r of symlink: https://lkml.org/lkml/2012/3/16/448 Thread with Oleg Nesterov re: cleanups: https://lkml.org/lkml/2012/3/5/240 Thread with Alexey Dobriyan re: cleanups: https://lkml.org/lkml/2009/6/4/625 mainline commit 925d1c401fa6cfd0df5d2e37da8981494ccdec07 Date: Tue Apr 29 01:01:36 2008 -0700 procfs task exe symlink |
From: Matt H. <mat...@us...> - 2012-04-02 23:18:53
|
On Sat, Mar 31, 2012 at 01:29:29PM +0400, Konstantin Khlebnikov wrote: > Currently the kernel sets mm->exe_file during sys_execve() and then tracks > number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon as > this counter drops to zero kernel resets mm->exe_file to NULL. Plus it resets > mm->exe_file at last mmput() when mm->mm_users drops to zero. > > Vma with VM_EXECUTABLE flag appears after mapping file with flag MAP_EXECUTABLE, > such vmas can appears only at sys_execve() or after vma splitting, because > sys_mmap ignores this flag. Usually binfmt module sets mm->exe_file and mmaps > some executable vmas with this file, they hold mm->exe_file while task is running. > > comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), > where all this stuff was introduced: > > > The kernel implements readlink of /proc/pid/exe by getting the file from > > the first executable VMA. Then the path to the file is reconstructed and > > reported as the result. > > > > Because of the VMA walk the code is slightly different on nommu systems. > > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of > > walking the VMAs to find the first executable file-backed VMA we store a > > reference to the exec'd file in the mm_struct. > > > > That reference would prevent the filesystem holding the executable file > > from being unmounted even after unmapping the VMAs. So we track the number > > of VM_EXECUTABLE VMAs and drop the new reference when the last one is > > unmapped. This avoids pinning the mounted filesystem. > > So, this logic is hooked into every file mmap/unmmap and vma split/merge just to > fix some hypothetical pinning fs from umounting by mm which already unmapped all > its executable files, but still alive. Does anyone know any real world example? > mm can be borrowed by swapoff or some get_task_mm() user, but it's not a big problem. > > Thus, we can remove all this stuff together with VM_EXECUTABLE flag and > keep mm->exe_file alive till final mmput(). > > After that we can access current->mm->exe_file without any locks > (after checking current->mm and mm->exe_file for NULL) > > Some code around security and oprofile still uses VM_EXECUTABLE for retrieving > task's executable file, after this patch they will use mm->exe_file directly. > In tomoyo and audit mm is always current->mm, oprofile uses get_task_mm(). Perhaps I'm missing something but it seems like you ought to split this into two patches. The first could fix up the cell, tile, etc. arch code to use the exe_file reference rather than walk the VMAs. Then the second patch could remove the unusual logic used to allow userspace to unpin the mount and we could continue to discuss that separately. It would also make the git log somewhat cleaner I think... Cheers, -Matt Helsley <mat...@us...> > > Signed-off-by: Konstantin Khlebnikov <khl...@op...> > Cc: Matt Helsley <mat...@us...> > Cc: Al Viro <vi...@ze...> > Cc: Eric Paris <ep...@re...> > Cc: Oleg Nesterov <ol...@re...> > Cc: lin...@vg... > Cc: opr...@li... > --- > arch/powerpc/oprofile/cell/spu_task_sync.c | 15 ++++---------- > arch/tile/mm/elf.c | 12 ++++-------- > drivers/oprofile/buffer_sync.c | 17 +++------------- > include/linux/mm.h | 4 ---- > include/linux/mm_types.h | 1 - > include/linux/mman.h | 1 - > kernel/auditsc.c | 17 ++-------------- > kernel/fork.c | 29 ++++------------------------ > mm/mmap.c | 27 +++++--------------------- > mm/nommu.c | 11 +---------- > security/tomoyo/util.c | 14 +++----------- > 11 files changed, 26 insertions(+), 122 deletions(-) > > diff --git a/arch/powerpc/oprofile/cell/spu_task_sync.c b/arch/powerpc/oprofile/cell/spu_task_sync.c > index 642fca1..28f1af2 100644 > --- a/arch/powerpc/oprofile/cell/spu_task_sync.c > +++ b/arch/powerpc/oprofile/cell/spu_task_sync.c > @@ -304,7 +304,7 @@ static inline unsigned long fast_get_dcookie(struct path *path) > return cookie; > } > > -/* Look up the dcookie for the task's first VM_EXECUTABLE mapping, > +/* Look up the dcookie for the task's mm->exe_file, > * which corresponds loosely to "application name". Also, determine > * the offset for the SPU ELF object. If computed offset is > * non-zero, it implies an embedded SPU object; otherwise, it's a > @@ -321,7 +321,6 @@ get_exec_dcookie_and_offset(struct spu *spu, unsigned int *offsetp, > { > unsigned long app_cookie = 0; > unsigned int my_offset = 0; > - struct file *app = NULL; > struct vm_area_struct *vma; > struct mm_struct *mm = spu->mm; > > @@ -330,16 +329,10 @@ get_exec_dcookie_and_offset(struct spu *spu, unsigned int *offsetp, > > down_read(&mm->mmap_sem); > > - for (vma = mm->mmap; vma; vma = vma->vm_next) { > - if (!vma->vm_file) > - continue; > - if (!(vma->vm_flags & VM_EXECUTABLE)) > - continue; > - app_cookie = fast_get_dcookie(&vma->vm_file->f_path); > + if (mm->exe_file) { > + app_cookie = fast_get_dcookie(&mm->exe_file->f_path); > pr_debug("got dcookie for %s\n", > - vma->vm_file->f_dentry->d_name.name); > - app = vma->vm_file; > - break; > + mm->exe_file->f_dentry->d_name.name); > } > > for (vma = mm->mmap; vma; vma = vma->vm_next) { > diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c > index 758b603..43e5279 100644 > --- a/arch/tile/mm/elf.c > +++ b/arch/tile/mm/elf.c > @@ -39,16 +39,12 @@ static void sim_notify_exec(const char *binary_name) > static int notify_exec(void) > { > int retval = 0; /* failure */ > - struct vm_area_struct *vma = current->mm->mmap; > - while (vma) { > - if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file) > - break; > - vma = vma->vm_next; > - } > - if (vma) { > + struct mm_struct *mm = current->mm; > + > + if (mm->exe_file) { > char *buf = (char *) __get_free_page(GFP_KERNEL); > if (buf) { > - char *path = d_path(&vma->vm_file->f_path, > + char *path = d_path(&mm->exe_file->f_path, > buf, PAGE_SIZE); > if (!IS_ERR(path)) { > sim_notify_exec(path); > diff --git a/drivers/oprofile/buffer_sync.c b/drivers/oprofile/buffer_sync.c > index f34b5b2..d93b2b6 100644 > --- a/drivers/oprofile/buffer_sync.c > +++ b/drivers/oprofile/buffer_sync.c > @@ -216,7 +216,7 @@ static inline unsigned long fast_get_dcookie(struct path *path) > } > > > -/* Look up the dcookie for the task's first VM_EXECUTABLE mapping, > +/* Look up the dcookie for the task's mm->exe_file, > * which corresponds loosely to "application name". This is > * not strictly necessary but allows oprofile to associate > * shared-library samples with particular applications > @@ -224,21 +224,10 @@ static inline unsigned long fast_get_dcookie(struct path *path) > static unsigned long get_exec_dcookie(struct mm_struct *mm) > { > unsigned long cookie = NO_COOKIE; > - struct vm_area_struct *vma; > - > - if (!mm) > - goto out; > > - for (vma = mm->mmap; vma; vma = vma->vm_next) { > - if (!vma->vm_file) > - continue; > - if (!(vma->vm_flags & VM_EXECUTABLE)) > - continue; > - cookie = fast_get_dcookie(&vma->vm_file->f_path); > - break; > - } > + if (mm && mm->exe_file) > + cookie = fast_get_dcookie(&mm->exe_file->f_path); > > -out: > return cookie; > } > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 553d134..3a4d721 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -88,7 +88,6 @@ extern unsigned int kobjsize(const void *objp); > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ > #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */ > > -#define VM_EXECUTABLE 0x00001000 > #define VM_LOCKED 0x00002000 > #define VM_IO 0x00004000 /* Memory mapped I/O or similar */ > > @@ -1374,9 +1373,6 @@ extern void exit_mmap(struct mm_struct *); > extern int mm_take_all_locks(struct mm_struct *mm); > extern void mm_drop_all_locks(struct mm_struct *mm); > > -/* From fs/proc/base.c. callers must _not_ hold the mm's exe_file_lock */ > -extern void added_exe_file_vma(struct mm_struct *mm); > -extern void removed_exe_file_vma(struct mm_struct *mm); > extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file); > extern struct file *get_mm_exe_file(struct mm_struct *mm); > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 3cc3062..b480c06 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -378,7 +378,6 @@ struct mm_struct { > > /* store ref to file /proc/<pid>/exe symlink points to */ > struct file *exe_file; > - unsigned long num_exe_file_vmas; > #ifdef CONFIG_MMU_NOTIFIER > struct mmu_notifier_mm *mmu_notifier_mm; > #endif > diff --git a/include/linux/mman.h b/include/linux/mman.h > index 8b74e9b..77cec2f 100644 > --- a/include/linux/mman.h > +++ b/include/linux/mman.h > @@ -86,7 +86,6 @@ calc_vm_flag_bits(unsigned long flags) > { > return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | > _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) | > - _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) | > _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ); > } > #endif /* __KERNEL__ */ > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > index af1de0f..aa27a00 100644 > --- a/kernel/auditsc.c > +++ b/kernel/auditsc.c > @@ -1164,21 +1164,8 @@ static void audit_log_task_info(struct audit_buffer *ab, struct task_struct *tsk > get_task_comm(name, tsk); > audit_log_format(ab, " comm="); > audit_log_untrustedstring(ab, name); > - > - if (mm) { > - down_read(&mm->mmap_sem); > - vma = mm->mmap; > - while (vma) { > - if ((vma->vm_flags & VM_EXECUTABLE) && > - vma->vm_file) { > - audit_log_d_path(ab, " exe=", > - &vma->vm_file->f_path); > - break; > - } > - vma = vma->vm_next; > - } > - up_read(&mm->mmap_sem); > - } > + if (mm && mm->exe_file) > + audit_log_d_path(ab, " exe=", &mm->exe_file->f_path); > audit_log_task_context(ab); > } > > diff --git a/kernel/fork.c b/kernel/fork.c > index b9372a0..40e4b49 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -587,26 +587,6 @@ void mmput(struct mm_struct *mm) > } > EXPORT_SYMBOL_GPL(mmput); > > -/* > - * We added or removed a vma mapping the executable. The vmas are only mapped > - * during exec and are not mapped with the mmap system call. > - * Callers must hold down_write() on the mm's mmap_sem for these > - */ > -void added_exe_file_vma(struct mm_struct *mm) > -{ > - mm->num_exe_file_vmas++; > -} > - > -void removed_exe_file_vma(struct mm_struct *mm) > -{ > - mm->num_exe_file_vmas--; > - if ((mm->num_exe_file_vmas == 0) && mm->exe_file) { > - fput(mm->exe_file); > - mm->exe_file = NULL; > - } > - > -} > - > void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) > { > if (new_exe_file) > @@ -614,20 +594,19 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) > if (mm->exe_file) > fput(mm->exe_file); > mm->exe_file = new_exe_file; > - mm->num_exe_file_vmas = 0; > } > > +/* > + * Caller must have mm->mm_users reference, > + * for example current->mm or acquired by get_task_mm(). > + */ > struct file *get_mm_exe_file(struct mm_struct *mm) > { > struct file *exe_file; > > - /* We need mmap_sem to protect against races with removal of > - * VM_EXECUTABLE vmas */ > - down_read(&mm->mmap_sem); > exe_file = mm->exe_file; > if (exe_file) > get_file(exe_file); > - up_read(&mm->mmap_sem); > return exe_file; > } > > diff --git a/mm/mmap.c b/mm/mmap.c > index 3d254ca..2647bb7 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -230,11 +230,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma) > might_sleep(); > if (vma->vm_ops && vma->vm_ops->close) > vma->vm_ops->close(vma); > - if (vma->vm_file) { > + if (vma->vm_file) > fput(vma->vm_file); > - if (vma->vm_flags & VM_EXECUTABLE) > - removed_exe_file_vma(vma->vm_mm); > - } > mpol_put(vma_policy(vma)); > kmem_cache_free(vm_area_cachep, vma); > return next; > @@ -616,11 +613,8 @@ again: remove_next = 1 + (end > next->vm_end); > mutex_unlock(&mapping->i_mmap_mutex); > > if (remove_next) { > - if (file) { > + if (file) > fput(file); > - if (next->vm_flags & VM_EXECUTABLE) > - removed_exe_file_vma(mm); > - } > if (next->anon_vma) > anon_vma_merge(vma, next); > mm->map_count--; > @@ -1293,8 +1287,6 @@ munmap_back: > error = file->f_op->mmap(file, vma); > if (error) > goto unmap_and_free_vma; > - if (vm_flags & VM_EXECUTABLE) > - added_exe_file_vma(mm); > > /* Can addr have changed?? > * > @@ -1969,11 +1961,8 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma, > if (anon_vma_clone(new, vma)) > goto out_free_mpol; > > - if (new->vm_file) { > + if (new->vm_file) > get_file(new->vm_file); > - if (vma->vm_flags & VM_EXECUTABLE) > - added_exe_file_vma(mm); > - } > > if (new->vm_ops && new->vm_ops->open) > new->vm_ops->open(new); > @@ -1991,11 +1980,8 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma, > /* Clean everything up if vma_adjust failed. */ > if (new->vm_ops && new->vm_ops->close) > new->vm_ops->close(new); > - if (new->vm_file) { > - if (vma->vm_flags & VM_EXECUTABLE) > - removed_exe_file_vma(mm); > + if (new->vm_file) > fput(new->vm_file); > - } > unlink_anon_vmas(new); > out_free_mpol: > mpol_put(pol); > @@ -2377,11 +2363,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, > new_vma->vm_start = addr; > new_vma->vm_end = addr + len; > new_vma->vm_pgoff = pgoff; > - if (new_vma->vm_file) { > + if (new_vma->vm_file) > get_file(new_vma->vm_file); > - if (vma->vm_flags & VM_EXECUTABLE) > - added_exe_file_vma(mm); > - } > if (new_vma->vm_ops && new_vma->vm_ops->open) > new_vma->vm_ops->open(new_vma); > vma_link(mm, new_vma, prev, rb_link, rb_parent); > diff --git a/mm/nommu.c b/mm/nommu.c > index afa0a15..d617d5c 100644 > --- a/mm/nommu.c > +++ b/mm/nommu.c > @@ -789,11 +789,8 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma) > kenter("%p", vma); > if (vma->vm_ops && vma->vm_ops->close) > vma->vm_ops->close(vma); > - if (vma->vm_file) { > + if (vma->vm_file) > fput(vma->vm_file); > - if (vma->vm_flags & VM_EXECUTABLE) > - removed_exe_file_vma(mm); > - } > put_nommu_region(vma->vm_region); > kmem_cache_free(vm_area_cachep, vma); > } > @@ -1287,10 +1284,6 @@ unsigned long do_mmap_pgoff(struct file *file, > get_file(file); > vma->vm_file = file; > get_file(file); > - if (vm_flags & VM_EXECUTABLE) { > - added_exe_file_vma(current->mm); > - vma->vm_mm = current->mm; > - } > } > > down_write(&nommu_region_sem); > @@ -1443,8 +1436,6 @@ error: > kmem_cache_free(vm_region_jar, region); > if (vma->vm_file) > fput(vma->vm_file); > - if (vma->vm_flags & VM_EXECUTABLE) > - removed_exe_file_vma(vma->vm_mm); > kmem_cache_free(vm_area_cachep, vma); > kleave(" = %d", ret); > return ret; > diff --git a/security/tomoyo/util.c b/security/tomoyo/util.c > index 867558c..b929dd3 100644 > --- a/security/tomoyo/util.c > +++ b/security/tomoyo/util.c > @@ -949,19 +949,11 @@ bool tomoyo_path_matches_pattern(const struct tomoyo_path_info *filename, > const char *tomoyo_get_exe(void) > { > struct mm_struct *mm = current->mm; > - struct vm_area_struct *vma; > const char *cp = NULL; > > - if (!mm) > - return NULL; > - down_read(&mm->mmap_sem); > - for (vma = mm->mmap; vma; vma = vma->vm_next) { > - if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file) { > - cp = tomoyo_realpath_from_path(&vma->vm_file->f_path); > - break; > - } > - } > - up_read(&mm->mmap_sem); > + if (mm && mm->exe_file) > + cp = tomoyo_realpath_from_path(&mm->exe_file->f_path); > + > return cp; > } > > |
From: Konstantin K. <khl...@op...> - 2012-04-03 05:06:23
|
Matt Helsley wrote: > On Sat, Mar 31, 2012 at 01:29:29PM +0400, Konstantin Khlebnikov wrote: >> Currently the kernel sets mm->exe_file during sys_execve() and then tracks >> number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon as >> this counter drops to zero kernel resets mm->exe_file to NULL. Plus it resets >> mm->exe_file at last mmput() when mm->mm_users drops to zero. >> >> Vma with VM_EXECUTABLE flag appears after mapping file with flag MAP_EXECUTABLE, >> such vmas can appears only at sys_execve() or after vma splitting, because >> sys_mmap ignores this flag. Usually binfmt module sets mm->exe_file and mmaps >> some executable vmas with this file, they hold mm->exe_file while task is running. >> >> comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), >> where all this stuff was introduced: >> >>> The kernel implements readlink of /proc/pid/exe by getting the file from >>> the first executable VMA. Then the path to the file is reconstructed and >>> reported as the result. >>> >>> Because of the VMA walk the code is slightly different on nommu systems. >>> This patch avoids separate /proc/pid/exe code on nommu systems. Instead of >>> walking the VMAs to find the first executable file-backed VMA we store a >>> reference to the exec'd file in the mm_struct. >>> >>> That reference would prevent the filesystem holding the executable file >>> from being unmounted even after unmapping the VMAs. So we track the number >>> of VM_EXECUTABLE VMAs and drop the new reference when the last one is >>> unmapped. This avoids pinning the mounted filesystem. >> >> So, this logic is hooked into every file mmap/unmmap and vma split/merge just to >> fix some hypothetical pinning fs from umounting by mm which already unmapped all >> its executable files, but still alive. Does anyone know any real world example? >> mm can be borrowed by swapoff or some get_task_mm() user, but it's not a big problem. >> >> Thus, we can remove all this stuff together with VM_EXECUTABLE flag and >> keep mm->exe_file alive till final mmput(). >> >> After that we can access current->mm->exe_file without any locks >> (after checking current->mm and mm->exe_file for NULL) >> >> Some code around security and oprofile still uses VM_EXECUTABLE for retrieving >> task's executable file, after this patch they will use mm->exe_file directly. >> In tomoyo and audit mm is always current->mm, oprofile uses get_task_mm(). > > Perhaps I'm missing something but it seems like you ought to split > this into two patches. The first could fix up the cell, tile, etc. arch > code to use the exe_file reference rather than walk the VMAs. Then the > second patch could remove the unusual logic used to allow userspace to unpin > the mount and we could continue to discuss that separately. It would > also make the git log somewhat cleaner I think... Ok, I'll resend this patch as independent patch-set, anyway I need to return mm->mmap_sem locking back. |
From: Andrew M. <ak...@li...> - 2012-04-06 23:08:33
|
On Tue, 03 Apr 2012 09:06:12 +0400 Konstantin Khlebnikov <khl...@op...> wrote: > Ok, I'll resend this patch as independent patch-set, > anyway I need to return mm->mmap_sem locking back. We need to work out what to do with "c/r: prctl: add ability to set new mm_struct::exe_file". I'm still sitting on the 3.4 c/r patch queue for various reasons, one of which is that I need to go back and re-review all the discussion, which was lengthy. Early next week, hopefully. |
From: Konstantin K. <khl...@op...> - 2012-04-03 05:10:31
|
Matt Helsley wrote: > On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: >> On 03/31, Konstantin Khlebnikov wrote: >>> >>> comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), >>> where all this stuff was introduced: >>> >>>> ... >>>> This avoids pinning the mounted filesystem. >>> >>> So, this logic is hooked into every file mmap/unmmap and vma split/merge just to >>> fix some hypothetical pinning fs from umounting by mm which already unmapped all >>> its executable files, but still alive. Does anyone know any real world example? >> >> This is the question to Matt. > > This is where I got the scenario: > > https://lkml.org/lkml/2007/7/12/398 Cyrill Gogcunov's patch "c/r: prctl: add ability to set new mm_struct::exe_file" gives userspace ability to unpin vfsmount explicitly. https://lkml.org/lkml/2012/3/16/449 > > Cheers, > -Matt Helsley > > PS: I seem to keep coming back to this so I hope folks don't mind if I leave > some more references to make (re)searching this topic easier: > > Thread with Cyrill Gorcunov discussing c/r of symlink: > https://lkml.org/lkml/2012/3/16/448 > > Thread with Oleg Nesterov re: cleanups: > https://lkml.org/lkml/2012/3/5/240 > > Thread with Alexey Dobriyan re: cleanups: > https://lkml.org/lkml/2009/6/4/625 > > mainline commit 925d1c401fa6cfd0df5d2e37da8981494ccdec07 > Date: Tue Apr 29 01:01:36 2008 -0700 > > procfs task exe symlink > |
From: Matt H. <mat...@us...> - 2012-04-03 18:17:22
|
On Tue, Apr 03, 2012 at 09:10:20AM +0400, Konstantin Khlebnikov wrote: > Matt Helsley wrote: > >On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: > >>On 03/31, Konstantin Khlebnikov wrote: > >>> > >>>comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), > >>>where all this stuff was introduced: > >>> > >>>>... > >>>>This avoids pinning the mounted filesystem. > >>> > >>>So, this logic is hooked into every file mmap/unmmap and vma split/merge just to > >>>fix some hypothetical pinning fs from umounting by mm which already unmapped all > >>>its executable files, but still alive. Does anyone know any real world example? > >> > >>This is the question to Matt. > > > >This is where I got the scenario: > > > >https://lkml.org/lkml/2007/7/12/398 > > Cyrill Gogcunov's patch "c/r: prctl: add ability to set new mm_struct::exe_file" > gives userspace ability to unpin vfsmount explicitly. Doesn't that break the semantics of the kernel ABI? Cheers, -Matt Helsley |
From: Cyrill G. <gor...@op...> - 2012-04-03 19:32:17
|
On Tue, Apr 03, 2012 at 11:16:31AM -0700, Matt Helsley wrote: > On Tue, Apr 03, 2012 at 09:10:20AM +0400, Konstantin Khlebnikov wrote: > > Matt Helsley wrote: > > >On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: > > >>On 03/31, Konstantin Khlebnikov wrote: > > >>> > > >>>comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), > > >>>where all this stuff was introduced: > > >>> > > >>>>... > > >>>>This avoids pinning the mounted filesystem. > > >>> > > >>>So, this logic is hooked into every file mmap/unmmap and vma split/merge just to > > >>>fix some hypothetical pinning fs from umounting by mm which already unmapped all > > >>>its executable files, but still alive. Does anyone know any real world example? > > >> > > >>This is the question to Matt. > > > > > >This is where I got the scenario: > > > > > >https://lkml.org/lkml/2007/7/12/398 > > > > Cyrill Gogcunov's patch "c/r: prctl: add ability to set new mm_struct::exe_file" > > gives userspace ability to unpin vfsmount explicitly. > > Doesn't that break the semantics of the kernel ABI? Which one? exe_file can be changed iif there is no MAP_EXECUTABLE left. Still, once assigned (via this prctl) the mm_struct::exe_file can't be changed again, until program exit. Cyrill |
From: Matt H. <mat...@us...> - 2012-04-05 20:29:17
|
On Tue, Apr 03, 2012 at 11:32:04PM +0400, Cyrill Gorcunov wrote: > On Tue, Apr 03, 2012 at 11:16:31AM -0700, Matt Helsley wrote: > > On Tue, Apr 03, 2012 at 09:10:20AM +0400, Konstantin Khlebnikov wrote: > > > Matt Helsley wrote: > > > >On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: > > > >>On 03/31, Konstantin Khlebnikov wrote: > > > >>> > > > >>>comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), > > > >>>where all this stuff was introduced: > > > >>> > > > >>>>... > > > >>>>This avoids pinning the mounted filesystem. > > > >>> > > > >>>So, this logic is hooked into every file mmap/unmmap and vma split/merge just to > > > >>>fix some hypothetical pinning fs from umounting by mm which already unmapped all > > > >>>its executable files, but still alive. Does anyone know any real world example? > > > >> > > > >>This is the question to Matt. > > > > > > > >This is where I got the scenario: > > > > > > > >https://lkml.org/lkml/2007/7/12/398 > > > > > > Cyrill Gogcunov's patch "c/r: prctl: add ability to set new mm_struct::exe_file" > > > gives userspace ability to unpin vfsmount explicitly. > > > > Doesn't that break the semantics of the kernel ABI? > > Which one? exe_file can be changed iif there is no MAP_EXECUTABLE left. > Still, once assigned (via this prctl) the mm_struct::exe_file can't be changed > again, until program exit. The prctl() interface itself is fine as it stands now. As far as I can tell Konstantin is proposing that we remove the unusual counter that tracks the number of mappings of the exe_file and require userspace use the prctl() to drop the last reference. That's what I think will break the ABI because after that change you *must* change userspace code to use the prctl(). It's an ABI change because the same sequence of system calls with the same input bits produces different behavior. Cheers, -Matt |
From: Cyrill G. <gor...@op...> - 2012-04-05 20:53:16
|
On Thu, Apr 05, 2012 at 01:29:04PM -0700, Matt Helsley wrote: ... > > > Doesn't that break the semantics of the kernel ABI? > > > > Which one? exe_file can be changed iif there is no MAP_EXECUTABLE left. > > Still, once assigned (via this prctl) the mm_struct::exe_file can't be changed > > again, until program exit. > > The prctl() interface itself is fine as it stands now. > > As far as I can tell Konstantin is proposing that we remove the unusual > counter that tracks the number of mappings of the exe_file and require > userspace use the prctl() to drop the last reference. That's what I think > will break the ABI because after that change you *must* change userspace > code to use the prctl(). It's an ABI change because the same sequence of > system calls with the same input bits produces different behavior. Hi Matt, I see what you mean (I misread your email at first, sorry). Sure it's impossible to patch already existing programs (and btw, this prctl code actually won't help a program to drop symlink completely and live without it then, because old one will gone but new one will be assigned) so personally I can't answer here on Konstantin's behalf, but I guess the main question is -- which programs use this 'drop-all-MAP_EXECUTABLE' feature? Cyrill |
From: Konstantin K. <khl...@op...> - 2012-04-05 21:04:57
|
Matt Helsley wrote: > On Tue, Apr 03, 2012 at 11:32:04PM +0400, Cyrill Gorcunov wrote: >> On Tue, Apr 03, 2012 at 11:16:31AM -0700, Matt Helsley wrote: >>> On Tue, Apr 03, 2012 at 09:10:20AM +0400, Konstantin Khlebnikov wrote: >>>> Matt Helsley wrote: >>>>> On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: >>>>>> On 03/31, Konstantin Khlebnikov wrote: >>>>>>> >>>>>>> comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), >>>>>>> where all this stuff was introduced: >>>>>>> >>>>>>>> ... >>>>>>>> This avoids pinning the mounted filesystem. >>>>>>> >>>>>>> So, this logic is hooked into every file mmap/unmmap and vma split/merge just to >>>>>>> fix some hypothetical pinning fs from umounting by mm which already unmapped all >>>>>>> its executable files, but still alive. Does anyone know any real world example? >>>>>> >>>>>> This is the question to Matt. >>>>> >>>>> This is where I got the scenario: >>>>> >>>>> https://lkml.org/lkml/2007/7/12/398 >>>> >>>> Cyrill Gogcunov's patch "c/r: prctl: add ability to set new mm_struct::exe_file" >>>> gives userspace ability to unpin vfsmount explicitly. >>> >>> Doesn't that break the semantics of the kernel ABI? >> >> Which one? exe_file can be changed iif there is no MAP_EXECUTABLE left. >> Still, once assigned (via this prctl) the mm_struct::exe_file can't be changed >> again, until program exit. > > The prctl() interface itself is fine as it stands now. > > As far as I can tell Konstantin is proposing that we remove the unusual > counter that tracks the number of mappings of the exe_file and require > userspace use the prctl() to drop the last reference. That's what I think > will break the ABI because after that change you *must* change userspace > code to use the prctl(). It's an ABI change because the same sequence of > system calls with the same input bits produces different behavior. But common software does not require this at all. I did not found real examples, only hypothesis by Al Viro: https://lkml.org/lkml/2007/7/12/398 libhugetlbfs isn't good example too, the man proc says: /proc/[pid]/exe is alive until main thread is alive, but in case libhugetlbfs /proc/[pid]/exe disappears too early. Also I would not call it ABI, this corner-case isn't documented, I'm afraid only few people in the world knows about it =) |
From: Matt H. <mat...@us...> - 2012-04-05 21:45:01
|
On Fri, Apr 06, 2012 at 01:04:43AM +0400, Konstantin Khlebnikov wrote: > Matt Helsley wrote: > >On Tue, Apr 03, 2012 at 11:32:04PM +0400, Cyrill Gorcunov wrote: > >>On Tue, Apr 03, 2012 at 11:16:31AM -0700, Matt Helsley wrote: > >>>On Tue, Apr 03, 2012 at 09:10:20AM +0400, Konstantin Khlebnikov wrote: > >>>>Matt Helsley wrote: > >>>>>On Sat, Mar 31, 2012 at 10:13:24PM +0200, Oleg Nesterov wrote: > >>>>>>On 03/31, Konstantin Khlebnikov wrote: > >>>>>>> > >>>>>>>comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"), > >>>>>>>where all this stuff was introduced: > >>>>>>> > >>>>>>>>... > >>>>>>>>This avoids pinning the mounted filesystem. > >>>>>>> > >>>>>>>So, this logic is hooked into every file mmap/unmmap and vma split/merge just to > >>>>>>>fix some hypothetical pinning fs from umounting by mm which already unmapped all > >>>>>>>its executable files, but still alive. Does anyone know any real world example? > >>>>>> > >>>>>>This is the question to Matt. > >>>>> > >>>>>This is where I got the scenario: > >>>>> > >>>>>https://lkml.org/lkml/2007/7/12/398 > >>>> > >>>>Cyrill Gogcunov's patch "c/r: prctl: add ability to set new mm_struct::exe_file" > >>>>gives userspace ability to unpin vfsmount explicitly. > >>> > >>>Doesn't that break the semantics of the kernel ABI? > >> > >>Which one? exe_file can be changed iif there is no MAP_EXECUTABLE left. > >>Still, once assigned (via this prctl) the mm_struct::exe_file can't be changed > >>again, until program exit. > > > >The prctl() interface itself is fine as it stands now. > > > >As far as I can tell Konstantin is proposing that we remove the unusual > >counter that tracks the number of mappings of the exe_file and require > >userspace use the prctl() to drop the last reference. That's what I think > >will break the ABI because after that change you *must* change userspace > >code to use the prctl(). It's an ABI change because the same sequence of > >system calls with the same input bits produces different behavior. > > But common software does not require this at all. I did not found real examples, > only hypothesis by Al Viro: https://lkml.org/lkml/2007/7/12/398 > libhugetlbfs isn't good example too, the man proc says: /proc/[pid]/exe is alive until > main thread is alive, but in case libhugetlbfs /proc/[pid]/exe disappears too early. *shrug* Where did you look for real examples? chroot? pivot_root? various initrd systems? Which versions? This sort of argument brings up classic questions. How do we know when to stop looking given the incredible amount of obscure code that's out there -- most of which we're unlikely to even be aware of? Even if we only look at "popular" distros how far back do we go? etc. Perhaps before going through all that effort it would be better to verify that removing that code impacts performance enough to care. Do you have numbers? If the numbers aren't there then why bother with exhaustive and exhausting code searches? > > Also I would not call it ABI, this corner-case isn't documented, I'm afraid only few > people in the world knows about it =) I don't think the definition of an ABI is whether there's documentation for it. It's whether the interface is used or not. At least that's the impression I've gotten from reading Linus' rants over the years. I think of the ABI as bits input versus behavior (including bits) out. If the input bits remain the same the qualitative behavior should remain the same unless there is a bug. Here, roughly speaking, the input bits are the arguments passed to a sequence of one or more munmap() calls followed by a umount(). The output is a 0 return value from the umount. Your proposal would change that output value to -1 -- different bits and different behavior. Cheers, -Matt Helsley |
From: Linus T. <tor...@li...> - 2012-04-05 21:55:59
|
On Thu, Apr 5, 2012 at 2:44 PM, Matt Helsley <mat...@us...> wrote: > > I don't think the definition of an ABI is whether there's documentation > for it. It's whether the interface is used or not. At least that's the > impression I've gotten from reading Linus' rants over the years. Yes. That said, I *do* have some very dim memory of us having had real issues with the /proc/<pid>/exe thing and having regressions due to holding refcounts to executables that were a.out binaries and not demand-loaded. And people wanting to unmount filesystems despite the binaries being live. That said, I suspect that whatever issues we used to have with that are pretty long gone. I don't think people use non-mmap'ed binaries any more. So I think we can try it and see. And revert if somebody actually notices and has problems. Linus |
From: Konstantin K. <khl...@op...> - 2012-04-06 04:36:17
|
Linus Torvalds wrote: > On Thu, Apr 5, 2012 at 2:44 PM, Matt Helsley<mat...@us...> wrote: >> >> I don't think the definition of an ABI is whether there's documentation >> for it. It's whether the interface is used or not. At least that's the >> impression I've gotten from reading Linus' rants over the years. > > Yes. > > That said, I *do* have some very dim memory of us having had real > issues with the /proc/<pid>/exe thing and having regressions due to > holding refcounts to executables that were a.out binaries and not > demand-loaded. And people wanting to unmount filesystems despite the > binaries being live. > > That said, I suspect that whatever issues we used to have with that > are pretty long gone. I don't think people use non-mmap'ed binaries > any more. So I think we can try it and see. And revert if somebody > actually notices and has problems. Instead of tracking count of vma with VM_EXECUTABLE bit we can track count of vma with vma->vm_file == mm->exe_file, this will be nearly the same behaviour. This was in early version of my patch, but I prefer to go deeper. So, we can revert it without introducing VM_EXECUTABLE again. > > Linus |