From: Avi K. <av...@qu...> - 2008-04-24 13:00:23
Yang, Sheng wrote:
> On Thursday 24 April 2008 19:37:03 Avi Kivity wrote:
>> Yunfeng Zhao wrote:
>>> Hi All,
>>>
>>> This is today's KVM test result against kvm.git
>>> 873c05fa7e6fea27090b1bf0f67a073eadb04782 and kvm-userspace.git
>>> d102d750f397b543fe620a3c77a7e5e42c483865.
>>
>> I suspect 873c05fa7e6fea27090b1bf0f67a073eadb04782 itself, it's the only
>> thing that has any chance of badness.
>>
>> Marcelo, any idea?  Perhaps due to load, interrupts accumulate and can't
>> be injected fast enough?
>>
>> These tests are run on a 2.6.22 host, which has a hacked
>> smp_call_function_single() in external-module-compat.h, which may
>> exacerbate the problem.
>
> Yeah, I suspect the commit too (I tried tip without it, and things were
> mostly all right).  In fact, I didn't use kvm_vcpu_kick() because I found
> that this function could hang my host...  I didn't investigate further, so
> I can't tell what's wrong; I just chose a way to keep it working...  Sorry
> for not clarifying.

I think smp_call_function_single() is miscompiled when using the
compatibility code.  I took it out-of-line to be sure (it is now in
kernel/external-module-compat.c).  No evidence, but...

--
error compiling committee.c: too many arguments to function
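On hosts where smp_call_function_single() is not usable from an external module, the compat layer has to provide its own. As a rough illustration only (the names, the broadcast-and-filter approach, and the 2.6.22-era four-argument smp_call_function() signature are assumptions here, not the actual kernel/external-module-compat.c code), such a shim can look like this:

#include <linux/kernel.h>
#include <linux/smp.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>

struct compat_call_data {
	void (*func)(void *info);
	void *info;
	int cpu;
};

/* runs on every CPU; only the requested one actually calls func */
static void compat_call_wrapper(void *data)
{
	struct compat_call_data *d = data;

	if (raw_smp_processor_id() == d->cpu)
		d->func(d->info);
}

static int compat_smp_call_function_single(int cpu, void (*func)(void *info),
					   void *info, int wait)
{
	struct compat_call_data d = { .func = func, .info = info, .cpu = cpu };
	int ret = 0;

	preempt_disable();
	if (cpu == raw_smp_processor_id()) {
		/* mimic IPI context for the local case */
		local_irq_disable();
		func(info);
		local_irq_enable();
	} else {
		/* 2.6.22 signature: smp_call_function(func, info, nonatomic, wait) */
		ret = smp_call_function(compat_call_wrapper, &d, 0, wait);
	}
	preempt_enable();
	return ret;
}

Keeping such a helper out-of-line in a .c file, rather than in a header, also makes it easy to inspect the generated code when a miscompilation like the one suspected above needs to be confirmed.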
From: Yang, S. <she...@in...> - 2008-04-24 12:58:48
On Thursday 24 April 2008 20:54:10 Avi Kivity wrote:
> I propose moving the kvm lists to vger.kernel.org, for the following
> benefits:
>
> - better spam control
> - faster service (I see significant lag with the sourceforge lists)
> - no ads appended to the end of each email
>
> If no objections are raised, and if the vger postmasters agree, I will
> mass subscribe the current subscribers so that there will be no service
> interruption.
>
> Opinions?

Yeah, finally we won't have to see something like this every day:

"This SF.net email is sponsored by: Microsoft Defy all challenges.
Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/"

;)

--
Thanks
Yang, Sheng
From: Yang, S. <she...@in...> - 2008-04-24 12:55:05
On Thursday 24 April 2008 19:37:03 Avi Kivity wrote:
> Yunfeng Zhao wrote:
>> Hi All,
>>
>> This is today's KVM test result against kvm.git
>> 873c05fa7e6fea27090b1bf0f67a073eadb04782 and kvm-userspace.git
>> d102d750f397b543fe620a3c77a7e5e42c483865.
>
> I suspect 873c05fa7e6fea27090b1bf0f67a073eadb04782 itself, it's the only
> thing that has any chance of badness.
>
> Marcelo, any idea?  Perhaps due to load, interrupts accumulate and can't
> be injected fast enough?
>
> These tests are run on a 2.6.22 host, which has a hacked
> smp_call_function_single() in external-module-compat.h, which may
> exacerbate the problem.

Yeah, I suspect the commit too (I tried tip without it, and things were mostly
all right).  In fact, I didn't use kvm_vcpu_kick() because I found that this
function could hang my host...  I didn't investigate further, so I can't tell
what's wrong; I just chose a way to keep it working...  Sorry for not
clarifying.

--
Thanks
Yang, Sheng
From: Avi K. <av...@qu...> - 2008-04-24 12:54:17
I propose moving the kvm lists to vger.kernel.org, for the following
benefits:

- better spam control
- faster service (I see significant lag with the sourceforge lists)
- no ads appended to the end of each email

If no objections are raised, and if the vger postmasters agree, I will
mass subscribe the current subscribers so that there will be no service
interruption.

Opinions?

--
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-04-24 12:47:48
Dor Laor wrote:
> while investigating the revert of "fix sci irq set when acpi timer" I
> discovered the reason. Please also re-revert the original patch.
>

Applied, but system_powerdown still doesn't work with the sci acpi timer fix.

--
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-04-24 12:38:20
Jerone Young wrote:
> 1 file changed, 37 insertions(+), 21 deletions(-)
> kernel/Makefile | 58 +++++++++++++++++++++++++++++++++++--------------------
>
> - This adapts the previously sent patch to new changes in kernel/Makefile
> - Fixes improper check in conditional
>
> This patch adds the ability for "make sync" in the kernel directory to work
> for multiple architectures and not just x86.
>

I addressed this in a different way by always syncing headers from all
architectures.  This means the tarballs can be used to build userspace on any
arch (though kernel modules are limited to x86, mostly due to arch
limitations).  In addition, we no longer refer to KERNELDIR when building
userspace, so it ought to be easier to build on random kernels.  Patches to be
pushed shortly...

--
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-04-24 11:37:06
Yunfeng Zhao wrote:
> Hi All,
>
> This is today's KVM test result against kvm.git
> 873c05fa7e6fea27090b1bf0f67a073eadb04782 and kvm-userspace.git
> d102d750f397b543fe620a3c77a7e5e42c483865.

I suspect 873c05fa7e6fea27090b1bf0f67a073eadb04782 itself, it's the only thing
that has any chance of badness.

Marcelo, any idea?  Perhaps due to load, interrupts accumulate and can't be
injected fast enough?

These tests are run on a 2.6.22 host, which has a hacked
smp_call_function_single() in external-module-compat.h, which may exacerbate
the problem.

--
error compiling committee.c: too many arguments to function
From: Yunfeng Z. <yun...@in...> - 2008-04-24 10:30:27
Hi All, This is today's KVM test result against kvm.git 873c05fa7e6fea27090b1bf0f67a073eadb04782 and kvm-userspace.git d102d750f397b543fe620a3c77a7e5e42c483865. In today's nightly testing, we meet host hang while booting multiple guests several times. This issue could be easily reproduced. Two Old Issues: ================================================ 1. Booting four guests likely fails https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1919354&group_id=180599 2. Cannot boot guests with hugetlbfs https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1941302&group_id=180599 Test environment ================================================ Platform Woodcrest CPU 4 Memory size 8G' Details ================================================ IA32-pae: 1. boot guest with 256M memory PASS 2. boot two windows xp guest PASS 3. boot 4 same guest in parallel PASS 4. boot linux and windows guest in parallel PASS 5. boot guest with 1500M memory PASS 6. boot windows 2003 with ACPI enabled PASS 7. boot Windows xp with ACPI enabled PASS 8. boot Windows 2000 without ACPI PASS 9. kernel build on SMP linux guest PASS 10. LTP on linux guest PASS 11. boot base kernel linux PASS 12. save/restore 32-bit HVM guests PASS 13. live migration 32-bit HVM guests PASS 14. boot SMP Windows xp with ACPI enabled PASS 15. boot SMP Windows 2003 with ACPI enabled PASS 16. boot SMP Windows 2000 with ACPI enabled PASS ================================================ IA32e: 1. boot four 32-bit guest in parallel PASS 2. boot four 64-bit guest in parallel PASS 3. boot 4G 64-bit guest PASS 4. boot 4G pae guest PASS 5. boot 32-bit linux and 32 bit windows guest in parallel PASS 6. boot 32-bit guest with 1500M memory PASS 7. boot 64-bit guest with 1500M memory PASS 8. boot 32-bit guest with 256M memory PASS 9. boot 64-bit guest with 256M memory PASS 10. boot two 32-bit windows xp in parallel PASS 11. boot four 32-bit different guest in para PASS 12. save/restore 64-bit linux guests PASS 13. save/restore 32-bit linux guests PASS 14. boot 32-bit SMP windows 2003 with ACPI enabled PASS 15. boot 32-bit SMP Windows 2000 with ACPI enabled PASS 16. boot 32-bit SMP Windows xp with ACPI enabled PASS 17. boot 32-bit Windows 2000 without ACPI PASS 18. boot 64-bit Windows xp with ACPI enabled PASS 19. boot 32-bit Windows xp without ACPI PASS 20. boot 64-bit UP vista PASS 21. boot 64-bit SMP vista PASS 22. kernel build in 32-bit linux guest OS PASS 23. kernel build in 64-bit linux guest OS PASS 24. LTP on 32-bit linux guest OS PASS 25. LTP on 64-bit linux guest OS PASS 26. boot 64-bit guests with ACPI enabled PASS 27. boot 32-bit x-server PASS 28. boot 64-bit SMP windows XP with ACPI enabled PASS 29. boot 64-bit SMP windows 2003 with ACPI enabled PASS 30. live migration 64bit linux guests PASS 31. live migration 32bit linux guests PASS 32. reboot 32bit windows xp guest PASS 33. 
reboot 32bit windows xp guest PASS Report Summary on IA32-pae Summary Test Report of Last Session ===================================================================== Total Pass Fail NoResult Crash ===================================================================== control_panel 7 7 0 0 0 Restart 2 2 0 0 0 gtest 15 15 0 0 0 ===================================================================== control_panel 7 7 0 0 0 :KVM_LM_PAE_gPAE 1 1 0 0 0 :KVM_four_sguest_PAE_gPA 1 1 0 0 0 :KVM_256M_guest_PAE_gPAE 1 1 0 0 0 :KVM_linux_win_PAE_gPAE 1 1 0 0 0 :KVM_1500M_guest_PAE_gPA 1 1 0 0 0 :KVM_SR_PAE_gPAE 1 1 0 0 0 :KVM_two_winxp_PAE_gPAE 1 1 0 0 0 Restart 2 2 0 0 0 :GuestPAE_PAE_gPAE 1 1 0 0 0 :BootTo32pae_PAE_gPAE 1 1 0 0 0 gtest 15 15 0 0 0 :ltp_nightly_PAE_gPAE 1 1 0 0 0 :boot_up_acpi_PAE_gPAE 1 1 0 0 0 :reboot_xp_PAE_gPAE 1 1 0 0 0 :boot_up_vista_PAE_gPAE 1 1 0 0 0 :boot_up_acpi_xp_PAE_gPA 1 1 0 0 0 :boot_up_acpi_win2k3_PAE 1 1 0 0 0 :boot_base_kernel_PAE_gP 1 1 0 0 0 :boot_smp_acpi_win2k3_PA 1 1 0 0 0 :boot_smp_acpi_win2k_PAE 1 1 0 0 0 :boot_up_acpi_win2k_PAE_ 1 1 0 0 0 :boot_smp_acpi_xp_PAE_gP 1 1 0 0 0 :boot_up_noacpi_win2k_PA 1 1 0 0 0 :boot_smp_vista_PAE_gPAE 1 1 0 0 0 :bootx_PAE_gPAE 1 1 0 0 0 :kb_nightly_PAE_gPAE 1 1 0 0 0 ===================================================================== Total 24 24 0 0 0 Report Summary on IA32e Summary Test Report of Last Session ===================================================================== Total Pass Fail NoResult Crash ===================================================================== control_panel 15 14 1 0 0 Restart 3 3 0 0 0 gtest 25 25 0 0 0 ===================================================================== control_panel 15 14 1 0 0 :KVM_LM_64_g64 1 1 0 0 0 :KVM_four_sguest_64_gPAE 1 1 0 0 0 :KVM_4G_guest_64_g64 1 1 0 0 0 :KVM_four_sguest_64_g64 1 1 0 0 0 :KVM_linux_win_64_gPAE 1 1 0 0 0 :KVM_1500M_guest_64_gPAE 1 1 0 0 0 :KVM_SR_64_g64 1 0 1 0 0 :KVM_LM_64_gPAE 1 1 0 0 0 :KVM_256M_guest_64_g64 1 1 0 0 0 :KVM_1500M_guest_64_g64 1 1 0 0 0 :KVM_4G_guest_64_gPAE 1 1 0 0 0 :KVM_SR_64_gPAE 1 1 0 0 0 :KVM_256M_guest_64_gPAE 1 1 0 0 0 :KVM_two_winxp_64_gPAE 1 1 0 0 0 :KVM_four_dguest_64_gPAE 1 1 0 0 0 Restart 3 3 0 0 0 :GuestPAE_64_gPAE 1 1 0 0 0 :BootTo64_64_gPAE 1 1 0 0 0 :Guest64_64_gPAE 1 1 0 0 0 gtest 25 25 0 0 0 :boot_up_acpi_64_gPAE 1 1 0 0 0 :boot_up_noacpi_xp_64_gP 1 1 0 0 0 :boot_smp_acpi_xp_64_g64 1 1 0 0 0 :boot_base_kernel_64_gPA 1 1 0 0 0 :boot_smp_acpi_win2k3_64 1 1 0 0 0 :boot_smp_acpi_win2k_64_ 1 1 0 0 0 :boot_base_kernel_64_g64 1 1 0 0 0 :bootx_64_gPAE 1 1 0 0 0 :kb_nightly_64_gPAE 1 1 0 0 0 :ltp_nightly_64_g64 1 1 0 0 0 :boot_up_acpi_64_g64 1 1 0 0 0 :boot_up_noacpi_win2k_64 1 1 0 0 0 :boot_smp_acpi_xp_64_gPA 1 1 0 0 0 :boot_smp_vista_64_gPAE 1 1 0 0 0 :boot_up_acpi_win2k3_64_ 1 1 0 0 0 :reboot_xp_64_gPAE 1 1 0 0 0 :bootx_64_g64 1 1 0 0 0 :boot_up_vista_64_g64 1 1 0 0 0 :boot_smp_vista_64_g64 1 1 0 0 0 :boot_up_acpi_xp_64_g64 1 1 0 0 0 :boot_up_vista_64_gPAE 1 1 0 0 0 :ltp_nightly_64_gPAE 1 1 0 0 0 :boot_smp_acpi_win2k3_64 1 1 0 0 0 :boot_up_noacpi_win2k3_6 1 1 0 0 0 :kb_nightly_64_g64 1 1 0 0 0 ===================================================================== Total 43 42 1 0 0 Best Regards, Yunfeng |
From: Robin H. <ho...@sg...> - 2008-04-24 09:52:42
I am not certain of this, but it seems like this patch leaves things in a somewhat asymetric state. At the very least, I think that asymetry should be documented in the comments of either mmu_notifier.h or .c. Before I do the first mmu_notifier_register, all places that test for mm_has_notifiers(mm) will return false and take the fast path. After I do some mmu_notifier_register()s and their corresponding mmu_notifier_unregister()s, The mm_has_notifiers(mm) will return true and the slow path will be taken. This, despite all registered notifiers having unregistered. It seems to me the work done by mmu_notifier_mm_destroy should really be done inside the mm_lock()/mm_unlock area of mmu_unregister and mm_notifier_release when we have removed the last entry. That would give the users job the same performance after they are done using the special device that they had prior to its use. On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote: ... > diff --git a/mm/memory.c b/mm/memory.c > --- a/mm/memory.c > +++ b/mm/memory.c ... > @@ -603,25 +605,39 @@ > * readonly mappings. The tradeoff is that copy_page_range is more > * efficient than faulting. > */ > + ret = 0; > if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { > if (!vma->anon_vma) > - return 0; > + goto out; > } > > - if (is_vm_hugetlb_page(vma)) > - return copy_hugetlb_page_range(dst_mm, src_mm, vma); > + if (unlikely(is_vm_hugetlb_page(vma))) { > + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma); > + goto out; > + } > > + if (is_cow_mapping(vma->vm_flags)) > + mmu_notifier_invalidate_range_start(src_mm, addr, end); > + > + ret = 0; I don't think this is needed. ... > +/* avoid memory allocations for mm_unlock to prevent deadlock */ > +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) > +{ > + if (mm->map_count) { > + if (data->nr_anon_vma_locks) > + mm_unlock_vfree(data->anon_vma_locks, > + data->nr_anon_vma_locks); > + if (data->i_mmap_locks) I think you really want data->nr_i_mmap_locks. ... > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > new file mode 100644 > --- /dev/null > +++ b/mm/mmu_notifier.c ... > +/* > + * This function can't run concurrently against mmu_notifier_register > + * or any other mmu notifier method. mmu_notifier_register can only > + * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is > + * zero). All other tasks of this mm already quit so they can't invoke > + * mmu notifiers anymore. This can run concurrently only against > + * mmu_notifier_unregister and it serializes against it with the > + * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go > + * away from under us as the exit_mmap holds a mm_count pin itself. > + * > + * The ->release method can't allow the module to be unloaded, the > + * module can only be unloaded after mmu_notifier_unregister run. This > + * is because the release method has to run the ret instruction to > + * return back here, and so it can't allow the ret instruction to be > + * freed. > + */ The second paragraph of this comment seems extraneous. ... > + /* > + * Wait ->release if mmu_notifier_unregister run list_del_rcu. > + * srcu can't go away from under us because one mm_count is > + * hold by exit_mmap. > + */ These two sentences don't make any sense to me. ... 
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + int before_release = 0, srcu; > + > + BUG_ON(atomic_read(&mm->mm_count) <= 0); > + > + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); > + spin_lock(&mm->mmu_notifier_mm->unregister_lock); > + if (!hlist_unhashed(&mn->hlist)) { > + hlist_del_rcu(&mn->hlist); > + before_release = 1; > + } > + spin_unlock(&mm->mmu_notifier_mm->unregister_lock); > + if (before_release) > + /* > + * exit_mmap will block in mmu_notifier_release to > + * guarantee ->release is called before freeing the > + * pages. > + */ > + mn->ops->release(mn, mm); I am not certain about the need to do the release callout when the driver has already told this subsystem it is done. For XPMEM, this callout would immediately return. I would expect it to be the same or GRU. Thanks, Robin |
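For readers following the thread, this is roughly how a driver consumes the interface under review: embed a struct mmu_notifier, register it against an mm, tear down secondary mappings from the callbacks, and unregister when done. The signatures follow the patch being discussed here (they changed again before merging), and the toy_* names are invented for illustration; treat this as a sketch, not the final API.

#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/sched.h>

struct toy_mirror {
	struct mmu_notifier mn;
	/* driver-private shadow state, e.g. secondary page tables */
};

static void toy_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/* last chance to zap secondary mappings before the mm is freed */
}

static void toy_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end)
{
	/* drop shadow ptes covering [start, end) */
}

static const struct mmu_notifier_ops toy_ops = {
	.release              = toy_release,
	.invalidate_range_end = toy_invalidate_range_end,
};

static int toy_attach(struct toy_mirror *t)
{
	t->mn.ops = &toy_ops;
	/* must run against current->mm (or a get_task_mm() reference) */
	return mmu_notifier_register(&t->mn, current->mm);
}

static void toy_detach(struct toy_mirror *t, struct mm_struct *mm)
{
	/* safe even after exit_mmap(); ->release may already have run */
	mmu_notifier_unregister(&t->mn, mm);
}

Robin's asymmetry point above concerns what happens between toy_attach() and toy_detach(): once a notifier has ever been registered, mm_has_notifiers(mm) keeps the slow path enabled for the rest of the mm's life, even after the last unregister.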
From: Gerd H. <kr...@re...> - 2008-04-24 08:37:15
Signed-off-by: Gerd Hoffmann <kr...@re...> --- arch/x86/Kconfig | 4 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/pvclock.c | 146 +++++++++++++++++++++++++++++++++++++++++++++ include/asm-x86/pvclock.h | 6 ++ 4 files changed, 157 insertions(+), 0 deletions(-) create mode 100644 arch/x86/kernel/pvclock.c create mode 100644 include/asm-x86/pvclock.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index a22be4a..fe73d38 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -400,6 +400,10 @@ config PARAVIRT over full virtualization. However, when run without a hypervisor the kernel is theoretically slower and slightly larger. +config PARAVIRT_CLOCK + bool + default n + endif config MEMTEST_BOOTPARAM diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index fa19c38..ab7999c 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -83,6 +83,7 @@ obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o obj-$(CONFIG_KVM_GUEST) += kvm.o obj-$(CONFIG_KVM_CLOCK) += kvmclock.o obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o +obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o ifdef CONFIG_INPUT_PCSPKR obj-y += pcspeaker.o diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c new file mode 100644 index 0000000..fecf17a --- /dev/null +++ b/arch/x86/kernel/pvclock.c @@ -0,0 +1,146 @@ +/* paravirtual clock -- common code used by kvm/xen + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +*/ + +#include <linux/kernel.h> +#include <linux/percpu.h> +#include <asm/pvclock.h> + +/* + * These are perodically updated + * xen: magic shared_info page + * kvm: gpa registered via msr + * and then copied here. + */ +struct pvclock_shadow_time { + u64 tsc_timestamp; /* TSC at last update of time vals. */ + u64 system_timestamp; /* Time, in nanosecs, since boot. */ + u32 tsc_to_nsec_mul; + int tsc_shift; + u32 version; +}; + +/* + * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, + * yielding a 64-bit result. + */ +static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift) +{ + u64 product; +#ifdef __i386__ + u32 tmp1, tmp2; +#endif + + if (shift < 0) + delta >>= -shift; + else + delta <<= shift; + +#ifdef __i386__ + __asm__ ( + "mul %5 ; " + "mov %4,%%eax ; " + "mov %%edx,%4 ; " + "mul %5 ; " + "xor %5,%5 ; " + "add %4,%%eax ; " + "adc %5,%%edx ; " + : "=A" (product), "=r" (tmp1), "=r" (tmp2) + : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) ); +#elif __x86_64__ + __asm__ ( + "mul %%rdx ; shrd $32,%%rdx,%%rax" + : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) ); +#else +#error implement me! 
+#endif + + return product; +} + +static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) +{ + u64 delta = native_read_tsc() - shadow->tsc_timestamp; + return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift); +} + +/* + * Reads a consistent set of time-base values from hypervisor, + * into a shadow data area. + */ +static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, + struct kvm_vcpu_time_info *src) +{ + do { + dst->version = src->version; + rmb(); /* fetch version before data */ + dst->tsc_timestamp = src->tsc_timestamp; + dst->system_timestamp = src->system_time; + dst->tsc_to_nsec_mul = src->tsc_to_system_mul; + dst->tsc_shift = src->tsc_shift; + rmb(); /* test version after fetching data */ + } while ((src->version & 1) | (dst->version ^ src->version)); + + return dst->version; +} + +/* + * This is our read_clock function. The host puts an tsc timestamp each time + * it updates a new time. Without the tsc adjustment, we can have a situation + * in which a vcpu starts to run earlier (smaller system_time), but probes + * time later (compared to another vcpu), leading to backwards time + */ +cycle_t pvclock_clocksource_read(struct kvm_vcpu_time_info *src) +{ + struct pvclock_shadow_time shadow; + unsigned version; + cycle_t ret; + + do { + version = pvclock_get_time_values(&shadow, src); + barrier(); + ret = shadow.system_timestamp + pvclock_get_nsec_offset(&shadow); + barrier(); + } while (version != src->version); + + return ret; +} + +void pvclock_read_wallclock(struct kvm_wall_clock *wall_clock, + struct kvm_vcpu_time_info *vcpu_time, + struct timespec *ts) +{ + u32 version; + u64 delta; + struct timespec now; + + /* get wallclock at system boot */ + do { + version = wall_clock->wc_version; + rmb(); /* fetch version before time */ + now.tv_sec = wall_clock->wc_sec; + now.tv_nsec = wall_clock->wc_nsec; + rmb(); /* fetch time before checking version */ + } while ((wall_clock->wc_version & 1) || (version != wall_clock->wc_version)); + + delta = pvclock_clocksource_read(vcpu_time); /* time since system boot */ + delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec; + + now.tv_nsec = do_div(delta, NSEC_PER_SEC); + now.tv_sec = delta; + + set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); +} diff --git a/include/asm-x86/pvclock.h b/include/asm-x86/pvclock.h new file mode 100644 index 0000000..2b9812f --- /dev/null +++ b/include/asm-x86/pvclock.h @@ -0,0 +1,6 @@ +#include <linux/clocksource.h> +#include <asm/kvm_para.h> +cycle_t pvclock_clocksource_read(struct kvm_vcpu_time_info *src); +void pvclock_read_wallclock(struct kvm_wall_clock *wall, + struct kvm_vcpu_time_info *vcpu, + struct timespec *ts); -- 1.5.4.1 |
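The inline assembly in scale_delta() above implements a fixed-point conversion: ns = (tsc_delta << tsc_shift) * tsc_to_nsec_mul / 2^32, keeping the low 64 bits of the result. A portable userspace illustration of the same arithmetic (using gcc's unsigned __int128, so 64-bit hosts only; meant to show the math, not to replace the asm):

#include <stdint.h>

static uint64_t scale_delta_portable(uint64_t delta, uint32_t mul_frac,
				     int shift)
{
	/* apply the binary shift first, exactly as scale_delta() does */
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* 64x32 -> 96 bit multiply, then drop the low 32 fractional bits */
	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

How the hypervisor picks tsc_to_nsec_mul and tsc_shift is shown by kvm_set_time_scale() in the x86.c patch elsewhere in this series.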
From: Gerd H. <kr...@re...> - 2008-04-24 08:37:15
Signed-off-by: Gerd Hoffmann <kr...@re...> --- arch/x86/Kconfig | 1 + arch/x86/kernel/kvmclock.c | 66 ++++++++++--------------------------------- 2 files changed, 17 insertions(+), 50 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index fe73d38..ed1a679 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -373,6 +373,7 @@ config VMI config KVM_CLOCK bool "KVM paravirtualized clock" select PARAVIRT + select PARAVIRT_CLOCK depends on !(X86_VISWS || X86_VOYAGER) help Turning on this option will allow you to run a paravirtualized clock diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index ddee040..476b7c7 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -18,6 +18,7 @@ #include <linux/clocksource.h> #include <linux/kvm_para.h> +#include <asm/pvclock.h> #include <asm/arch_hooks.h> #include <asm/msr.h> #include <asm/apic.h> @@ -37,17 +38,9 @@ early_param("no-kvmclock", parse_no_kvmclock); /* The hypervisor will put information about time periodically here */ static DEFINE_PER_CPU_SHARED_ALIGNED(struct kvm_vcpu_time_info, hv_clock); -#define get_clock(cpu, field) per_cpu(hv_clock, cpu).field - -static inline u64 kvm_get_delta(u64 last_tsc) -{ - int cpu = smp_processor_id(); - u64 delta = native_read_tsc() - last_tsc; - return (delta * get_clock(cpu, tsc_to_system_mul)) >> KVM_SCALE; -} static struct kvm_wall_clock wall_clock; -static cycle_t kvm_clock_read(void); + /* * The wallclock is the time of day when we booted. Since then, some time may * have elapsed since the hypervisor wrote the data. So we try to account for @@ -55,35 +48,19 @@ static cycle_t kvm_clock_read(void); */ unsigned long kvm_get_wallclock(void) { - u32 wc_sec, wc_nsec; - u64 delta; + struct kvm_vcpu_time_info *vcpu_time; struct timespec ts; - int version, nsec; int low, high; low = (int)__pa(&wall_clock); high = ((u64)__pa(&wall_clock) >> 32); - - delta = kvm_clock_read(); - native_write_msr(MSR_KVM_WALL_CLOCK, low, high); - do { - version = wall_clock.wc_version; - rmb(); - wc_sec = wall_clock.wc_sec; - wc_nsec = wall_clock.wc_nsec; - rmb(); - } while ((wall_clock.wc_version != version) || (version & 1)); - - delta = kvm_clock_read() - delta; - delta += wc_nsec; - nsec = do_div(delta, NSEC_PER_SEC); - set_normalized_timespec(&ts, wc_sec + delta, nsec); - /* - * Of all mechanisms of time adjustment I've tested, this one - * was the champion! - */ - return ts.tv_sec + 1; + + vcpu_time = &get_cpu_var(hv_clock); + pvclock_read_wallclock(&wall_clock, vcpu_time, &ts); + put_cpu_var(hv_clock); + + return ts.tv_sec; } int kvm_set_wallclock(unsigned long now) @@ -91,28 +68,17 @@ int kvm_set_wallclock(unsigned long now) return 0; } -/* - * This is our read_clock function. The host puts an tsc timestamp each time - * it updates a new time. Without the tsc adjustment, we can have a situation - * in which a vcpu starts to run earlier (smaller system_time), but probes - * time later (compared to another vcpu), leading to backwards time - */ static cycle_t kvm_clock_read(void) { - u64 last_tsc, now; - int cpu; + struct kvm_vcpu_time_info *src; + cycle_t ret; - preempt_disable(); - cpu = smp_processor_id(); - - last_tsc = get_clock(cpu, tsc_timestamp); - now = get_clock(cpu, system_time); - - now += kvm_get_delta(last_tsc); - preempt_enable(); - - return now; + src = &get_cpu_var(hv_clock); + ret = pvclock_clocksource_read(src); + put_cpu_var(hv_clock); + return ret; } + static struct clocksource kvm_clock = { .name = "kvm-clock", .read = kvm_clock_read, -- 1.5.4.1 |
From: Gerd H. <kr...@re...> - 2008-04-24 08:37:15
Hi folks,

My first attempt to send out a patch series with git ...

The patches fix the kvm paravirt clocksource code to be compatible with xen,
and they also factor out some code which can be shared into a separate source
file used by both kvm and xen.

cheers,
  Gerd
From: Gerd H. <kr...@re...> - 2008-04-24 08:37:12
Signed-off-by: Gerd Hoffmann <kr...@re...> --- arch/x86/xen/Kconfig | 1 + arch/x86/xen/time.c | 110 +++++--------------------------------------------- 2 files changed, 12 insertions(+), 99 deletions(-) diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig index 4d5f264..47f0cdc 100644 --- a/arch/x86/xen/Kconfig +++ b/arch/x86/xen/Kconfig @@ -5,6 +5,7 @@ config XEN bool "Xen guest support" select PARAVIRT + select PARAVIRT_CLOCK depends on X86_32 depends on X86_CMPXCHG && X86_TSC && !NEED_MULTIPLE_NODES && !(X86_VISWS || X86_VOYAGER) help diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index c39e1a5..3d5f945 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -13,6 +13,7 @@ #include <linux/clockchips.h> #include <linux/kernel_stat.h> +#include <asm/pvclock.h> #include <asm/xen/hypervisor.h> #include <asm/xen/hypercall.h> @@ -30,17 +31,6 @@ static cycle_t xen_clocksource_read(void); -/* These are perodically updated in shared_info, and then copied here. */ -struct shadow_time_info { - u64 tsc_timestamp; /* TSC at last update of time vals. */ - u64 system_timestamp; /* Time, in nanosecs, since boot. */ - u32 tsc_to_nsec_mul; - int tsc_shift; - u32 version; -}; - -static DEFINE_PER_CPU(struct shadow_time_info, shadow_time); - /* runstate info updated by Xen */ static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate); @@ -230,95 +220,14 @@ unsigned long xen_cpu_khz(void) return xen_khz; } -/* - * Reads a consistent set of time-base values from Xen, into a shadow data - * area. - */ -static unsigned get_time_values_from_xen(void) -{ - struct vcpu_time_info *src; - struct shadow_time_info *dst; - - /* src is shared memory with the hypervisor, so we need to - make sure we get a consistent snapshot, even in the face of - being preempted. */ - src = &__get_cpu_var(xen_vcpu)->time; - dst = &__get_cpu_var(shadow_time); - - do { - dst->version = src->version; - rmb(); /* fetch version before data */ - dst->tsc_timestamp = src->tsc_timestamp; - dst->system_timestamp = src->system_time; - dst->tsc_to_nsec_mul = src->tsc_to_system_mul; - dst->tsc_shift = src->tsc_shift; - rmb(); /* test version after fetching data */ - } while ((src->version & 1) | (dst->version ^ src->version)); - - return dst->version; -} - -/* - * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, - * yielding a 64-bit result. - */ -static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift) -{ - u64 product; -#ifdef __i386__ - u32 tmp1, tmp2; -#endif - - if (shift < 0) - delta >>= -shift; - else - delta <<= shift; - -#ifdef __i386__ - __asm__ ( - "mul %5 ; " - "mov %4,%%eax ; " - "mov %%edx,%4 ; " - "mul %5 ; " - "xor %5,%5 ; " - "add %4,%%eax ; " - "adc %5,%%edx ; " - : "=A" (product), "=r" (tmp1), "=r" (tmp2) - : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) ); -#elif __x86_64__ - __asm__ ( - "mul %%rdx ; shrd $32,%%rdx,%%rax" - : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) ); -#else -#error implement me! 
-#endif - - return product; -} - -static u64 get_nsec_offset(struct shadow_time_info *shadow) -{ - u64 now, delta; - now = native_read_tsc(); - delta = now - shadow->tsc_timestamp; - return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift); -} - static cycle_t xen_clocksource_read(void) { - struct shadow_time_info *shadow = &get_cpu_var(shadow_time); + struct vcpu_time_info *src; cycle_t ret; - unsigned version; - - do { - version = get_time_values_from_xen(); - barrier(); - ret = shadow->system_timestamp + get_nsec_offset(shadow); - barrier(); - } while (version != __get_cpu_var(xen_vcpu)->time.version); - - put_cpu_var(shadow_time); + src = &get_cpu_var(xen_vcpu)->time; + ret = pvclock_clocksource_read((void*)src); + put_cpu_var(xen_vcpu); return ret; } @@ -349,9 +258,14 @@ static void xen_read_wallclock(struct timespec *ts) unsigned long xen_get_wallclock(void) { + const struct shared_info *s = HYPERVISOR_shared_info; + struct kvm_wall_clock *wall_clock = (void*)&(s->wc_version); + struct vcpu_time_info *vcpu_time; struct timespec ts; - xen_read_wallclock(&ts); + vcpu_time = &get_cpu_var(xen_vcpu)->time; + pvclock_read_wallclock(wall_clock, (void*)vcpu_time, &ts); + put_cpu_var(xen_vcpu); return ts.tv_sec; } @@ -576,8 +490,6 @@ __init void xen_time_init(void) { int cpu = smp_processor_id(); - get_time_values_from_xen(); - clocksource_register(&xen_clocksource); if (HYPERVISOR_vcpu_op(VCPUOP_stop_periodic_timer, cpu, NULL) == 0) { -- 1.5.4.1 |
From: Gerd H. <kr...@re...> - 2008-04-24 08:37:12
Signed-off-by: Gerd Hoffmann <kr...@re...> --- arch/x86/kvm/x86.c | 63 +++++++++++++++++++++++++++++++++++++++++++-------- 1 files changed, 53 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0ce5563..45b71c6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -493,7 +493,7 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock) { static int version; struct kvm_wall_clock wc; - struct timespec wc_ts; + struct timespec now,sys,boot; if (!wall_clock) return; @@ -502,9 +502,16 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock) kvm_write_guest(kvm, wall_clock, &version, sizeof(version)); - wc_ts = current_kernel_time(); - wc.wc_sec = wc_ts.tv_sec; - wc.wc_nsec = wc_ts.tv_nsec; +#if 0 + /* Hmm, getboottime() isn't exported to modules ... */ + getboottime(&boot); +#else + now = current_kernel_time(); + ktime_get_ts(&sys); + boot = ns_to_timespec(timespec_to_ns(&now) - timespec_to_ns(&sys)); +#endif + wc.wc_sec = boot.tv_sec; + wc.wc_nsec = boot.tv_nsec; wc.wc_version = version; kvm_write_guest(kvm, wall_clock, &wc, sizeof(wc)); @@ -537,20 +544,58 @@ static void kvm_write_guest_time(struct kvm_vcpu *v) /* * The interface expects us to write an even number signaling that the * update is finished. Since the guest won't see the intermediate - * state, we just write "2" at the end + * state, we just increase by 2 at the end. */ - vcpu->hv_clock.version = 2; + vcpu->hv_clock.version += 2; shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0); memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock, - sizeof(vcpu->hv_clock)); + sizeof(vcpu->hv_clock)); kunmap_atomic(shared_kaddr, KM_USER0); mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT); } +static uint32_t div_frac(uint32_t dividend, uint32_t divisor) +{ + uint32_t quotient, remainder; + + __asm__ ( "divl %4" + : "=a" (quotient), "=d" (remainder) + : "0" (0), "1" (dividend), "r" (divisor) ); + return quotient; +} + +static void kvm_set_time_scale(uint32_t tsc_khz, struct kvm_vcpu_time_info *hv_clock) +{ + uint64_t nsecs = 1000000000LL; + int32_t shift = 0; + uint64_t tps64; + uint32_t tps32; + + tps64 = tsc_khz * 1000LL; + while (tps64 > nsecs*2) { + tps64 >>= 1; + shift--; + } + + tps32 = (uint32_t)tps64; + while (tps32 <= (uint32_t)nsecs) { + tps32 <<= 1; + shift++; + } + + hv_clock->tsc_shift = shift; + hv_clock->tsc_to_system_mul = div_frac(nsecs, tps32); + +#if 0 + printk(KERN_DEBUG "%s: tsc_khz %u, tsc_shift %d, tsc_mul %u\n", + __FUNCTION__, tsc_khz, hv_clock->tsc_shift, + hv_clock->tsc_to_system_mul); +#endif +} int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) { @@ -599,9 +644,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) /* ...but clean it before doing the actual write */ vcpu->arch.time_offset = data & ~(PAGE_MASK | 1); - vcpu->arch.hv_clock.tsc_to_system_mul = - clocksource_khz2mult(tsc_khz, 22); - vcpu->arch.hv_clock.tsc_shift = 22; + kvm_set_time_scale(tsc_khz, &vcpu->arch.hv_clock); down_read(¤t->mm->mmap_sem); vcpu->arch.time_page = -- 1.5.4.1 |
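kvm_set_time_scale() above picks tsc_shift and tsc_to_system_mul so that the guest-side computation ns = (tsc_delta << tsc_shift) * mul / 2^32 comes out in nanoseconds. A standalone userspace re-implementation of the same computation, with div_frac() written as a plain 64-bit divide instead of the divl asm, makes it easy to check concrete values; for tsc_khz = 2000000 (a 2 GHz TSC) it yields shift = 0 and mul = 0x80000000, i.e. ns = tsc_delta / 2 as expected. This is a sketch for verification, not kernel code.

#include <stdint.h>
#include <stdio.h>

static void set_time_scale(uint32_t tsc_khz, int *shift, uint32_t *mul)
{
	uint64_t nsecs = 1000000000ULL;
	uint64_t tps64 = (uint64_t)tsc_khz * 1000;	/* ticks per second */
	uint32_t tps32;
	int s = 0;

	/* normalize the tick rate into (nsecs, nsecs*2], tracking the shift */
	while (tps64 > nsecs * 2) {
		tps64 >>= 1;
		s--;
	}
	tps32 = (uint32_t)tps64;
	while (tps32 <= (uint32_t)nsecs) {
		tps32 <<= 1;
		s++;
	}

	*shift = s;
	/* div_frac(nsecs, tps32) == (nsecs << 32) / tps32 */
	*mul = (uint32_t)((nsecs << 32) / tps32);
}

int main(void)
{
	int shift;
	uint32_t mul;

	set_time_scale(2000000, &shift, &mul);
	printf("shift=%d mul=0x%08x\n", shift, mul);
	return 0;
}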
From: Avi K. <av...@qu...> - 2008-04-24 07:59:23
Yang, Sheng wrote:
> On Thursday 24 April 2008 15:37:53 Avi Kivity wrote:
>> Yang, Sheng wrote:
>>>> Why not use ept_identity_pagetable != NULL to encode
>>>> ept_identity_pagetable_done?
>>>
>>> ept_identity_pagetable_done was used to indicate whether the pagetable
>>> was set up, and ept_identity_pagetable was used to indicate whether the
>>> page used for the pagetable was allocated...  I don't want to run
>>> alloc_identity_pagetable() again and again and again...  Another method
>>> is to read a few bits at the front of the page to tell whether the
>>> pagetable was set up, but that's somewhat tricky...
>>
>> No, better to avoid tricks.  But rmode_tss is only allocated once, so if
>> you unify the allocations, the identity table will also only be
>> allocated once.
>
> But set_tss_addr() is a kvm_x86_ops member; if the identity mapping
> allocation used the same approach, another null function would have to be
> added on the SVM side, and this is a VMX-specific thing...

We could rename set_tss_addr() to mean set_tss_and_ept_identity_pt_addr(), but
it looks like the ioctl expects just three pages... so let's keep the flag for
now.

--
error compiling committee.c: too many arguments to function
From: Yang, S. <she...@in...> - 2008-04-24 07:52:50
On Thursday 24 April 2008 15:37:53 Avi Kivity wrote:
> Yang, Sheng wrote:
> >> Why not use ept_identity_pagetable != NULL to encode
> >> ept_identity_pagetable_done?
> >
> > ept_identity_pagetable_done was used to indicate whether the pagetable
> > was set up, and ept_identity_pagetable was used to indicate whether the
> > page used for the pagetable was allocated...  I don't want to run
> > alloc_identity_pagetable() again and again and again...  Another method
> > is to read a few bits at the front of the page to tell whether the
> > pagetable was set up, but that's somewhat tricky...
>
> No, better to avoid tricks.  But rmode_tss is only allocated once, so if
> you unify the allocations, the identity table will also only be
> allocated once.

But set_tss_addr() is a kvm_x86_ops member; if the identity mapping allocation
used the same approach, another null function would have to be added on the
SVM side, and this is a VMX-specific thing...

--
Thanks
Yang, Sheng
From: Avi K. <av...@qu...> - 2008-04-24 07:37:53
Yang, Sheng wrote:
>> Why not use ept_identity_pagetable != NULL to encode
>> ept_identity_pagetable_done?
>
> ept_identity_pagetable_done was used to indicate whether the pagetable was
> set up, and ept_identity_pagetable was used to indicate whether the page
> used for the pagetable was allocated...  I don't want to run
> alloc_identity_pagetable() again and again and again...  Another method is
> to read a few bits at the front of the page to tell whether the pagetable
> was set up, but that's somewhat tricky...

No, better to avoid tricks.  But rmode_tss is only allocated once, so if you
unify the allocations, the identity table will also only be allocated once.

--
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-04-24 07:35:49
Chris Lalancette wrote:
> Avi, Joerg,
>     While trying to boot a RHEL-4 guest on latest KVM tip on an AMD
> machine, I found that the guest would consistently crash when trying to
> set up the NMI watchdog.  I traced it down to the following commit:
>
> 51ef1ac7b23ee32bfcc61c229d634fdc1c68b38a
>
> It seems that in that commit, the K7_EVNTSEL MSRs were set to fail if the
> data != 0.  That test is actually fine; the problem is how the code around
> it is generated.  That is, we are only supposed to go to unhandled if
> data != 0, but for some reason we are *always* going to unhandled, even
> when the data == 0.  That causes the RHEL-4 kernel to crash.  If I
> rearrange the code to look like this:
>
>     case MSR_K7_EVNTSEL0:
>     case MSR_K7_EVNTSEL1:
>     case MSR_K7_EVNTSEL2:
>     case MSR_K7_EVNTSEL3:
>
>         if (data != 0)
>             return kvm_set_msr_common(vcpu, ecx, data);
>
>     default:
>         return kvm_set_msr_common(vcpu, ecx, data);
>     }
>
> Then everything works again.  A patch that does just this is attached.  It
> might be slightly nicer to say "if (data == 0) return 0" and then just fall
> through to the default case, but I don't much care either way.

You mean the gcc generates wrong code?  It seems fine here (though wonderfully
obfuscated).  Can you attach an objdump -Sr svm.o?  Also, what gcc version are
you using?

--
error compiling committee.c: too many arguments to function
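To make the intended control flow concrete, here is a toy userspace model of the dispatch Chris describes: a write of 0 to one of the K7 event-select MSRs is accepted, and everything else falls through to the common handler. The MSR numbers are the architectural ones; the function names and the printf stand-in for kvm_set_msr_common() are invented for illustration, so this is not the actual svm.c code.

#include <stdint.h>
#include <stdio.h>

#define MSR_K7_EVNTSEL0 0xc0010000
#define MSR_K7_EVNTSEL1 0xc0010001
#define MSR_K7_EVNTSEL2 0xc0010002
#define MSR_K7_EVNTSEL3 0xc0010003

static int set_msr_common(uint32_t ecx, uint64_t data)
{
	printf("common handler: msr=%#x data=%#llx\n",
	       ecx, (unsigned long long)data);
	return 0;
}

static int toy_svm_set_msr(uint32_t ecx, uint64_t data)
{
	switch (ecx) {
	case MSR_K7_EVNTSEL0:
	case MSR_K7_EVNTSEL1:
	case MSR_K7_EVNTSEL2:
	case MSR_K7_EVNTSEL3:
		if (data == 0)
			return 0;	/* accept "disable counter" writes */
		/* fall through for any other value */
	default:
		return set_msr_common(ecx, data);
	}
}

int main(void)
{
	toy_svm_set_msr(MSR_K7_EVNTSEL0, 0);	/* accepted silently */
	toy_svm_set_msr(MSR_K7_EVNTSEL0, 0x42);	/* reaches the common handler */
	return 0;
}

This models the "if (data == 0) return 0 and fall through" variant Chris mentions at the end; whether the real problem is a miscompile or a source-level slip is exactly what the requested objdump output would show.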
From: Yang, S. <she...@in...> - 2008-04-24 07:24:57
On Thursday 24 April 2008 15:15:30 Avi Kivity wrote: > Yang, Sheng wrote: > > On Tuesday 22 April 2008 18:16:41 Avi Kivity wrote: > >> Yang, Sheng wrote: > >>> From 73c33765f3d879001818cd0719038c78a0c65561 Mon Sep 17 00:00:00 2001 > >>> From: Sheng Yang <she...@in...> > >>> Date: Fri, 18 Apr 2008 17:15:39 +0800 > >>> Subject: [PATCH] kvm: qemu: Enable EPT support for real mode > >>> > >>> This patch build a identity page table on the last page of VGA bios, > >>> and use it as the guest page table in nonpaging mode for EPT. > >> > >> Doing this in qemu means older versions of qemu can't work with an > >> ept-enabled kernel. Also, placing the table in the vga bios might > >> conflict with video card assignment to a guest. > >> > >> Suggest placing this near the realmode tss (see vmx.c:init_rmode_tss()) > >> which serves a similar function. > > > > Something like this? (along with one page reserved in e820 table) > > > > I put the page it into 0xfffbc000 now. But I think the following > > implement is not very elegant... Too complex compared to the qemu one. > > > > BTW: The S/R and live migration problem was fixed. > > Ah, good. > > > +static int init_rmode_identity_map(struct kvm *kvm) > > +{ > > + int i, r, ret; > > + pfn_t identity_map_pfn; > > + u32 table[PT32_ENT_PER_PAGE]; > > That's 4KB. On i386 with 4K stacks, this may cause a stack overflow. > Even with 8K stacks you're on thin ice here, with the temperature > rapidly rising. Oops... I forgot that... > > + > > + if (kvm->arch.ept_identity_pagetable_done) > > + return 1; > > + ret = 0; > > + identity_map_pfn = VMX_EPT_IDENTITY_PAGETABLE_ADDR >> PAGE_SHIFT; > > + r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE); > > + if (r < 0) > > + goto out; > > + /* > > + * Set up identity-mapping pagetable for EPT in real mode, also verify > > + * the contain of page > > s/contain/contents/ > > > + * 0xe7 = _PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | > > + * _PAGE_DIRTY | _PAGE_PSE > > + */ > > + for (i = 0; i < PT32_ENT_PER_PAGE; i++) > > + table[i] = (i << 22) + 0xe7; > > Instead of the comment, you can put the identifiers into the code > instead of 0xe7. And, to avoid the stack overflow, simply use > kvm_write_guest() here. OK. > > +static int alloc_identity_pagetable(struct kvm *kvm) > > +{ > > + struct kvm_userspace_memory_region kvm_userspace_mem; > > + int r = 0; > > + > > + down_write(&kvm->slots_lock); > > + if (kvm->arch.ept_identity_pagetable) > > + goto out; > > + kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT; > > + kvm_userspace_mem.flags = 0; > > + kvm_userspace_mem.guest_phys_addr = VMX_EPT_IDENTITY_PAGETABLE_ADDR; > > + kvm_userspace_mem.memory_size = PAGE_SIZE; > > + r = __kvm_set_memory_region(kvm, &kvm_userspace_mem, 0); > > + if (r) > > + goto out; > > + > > + down_read(¤t->mm->mmap_sem); > > + kvm->arch.ept_identity_pagetable = gfn_to_page(kvm, > > + VMX_EPT_IDENTITY_PAGETABLE_ADDR >> PAGE_SHIFT); > > + up_read(¤t->mm->mmap_sem); > > +out: > > + up_write(&kvm->slots_lock); > > + return r; > > +} > > There's already a memslot for the tss, no? Why not expand it by a page? Agree. 
> > + > > static void allocate_vpid(struct vcpu_vmx *vmx) > > { > > int vpid; > > @@ -1904,6 +1960,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) > > return 0; > > } > > > > +static int init_rmode(struct kvm *kvm) > > +{ > > + if (!init_rmode_tss(kvm)) > > + return 0; > > + if (!init_rmode_identity_map(kvm)) > > + return 0; > > + return 1; > > +} > > + > > static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) > > { > > struct vcpu_vmx *vmx = to_vmx(vcpu); > > @@ -1911,7 +1976,7 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) > > int ret; > > > > down_read(&vcpu->kvm->slots_lock); > > - if (!init_rmode_tss(vmx->vcpu.kvm)) { > > + if (!init_rmode(vmx->vcpu.kvm)) { > > ret = -ENOMEM; > > goto out; > > } > > @@ -2967,6 +3032,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm > > *kvm, unsigned int id) > > if (alloc_apic_access_page(kvm) != 0) > > goto free_vmcs; > > > > + if (vm_need_ept()) > > + if (alloc_identity_pagetable(kvm) != 0) > > + goto free_vmcs; > > + > > return &vmx->vcpu; > > > > free_vmcs: > > diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h > > index 8f662e3..469a107 100644 > > --- a/arch/x86/kvm/vmx.h > > +++ b/arch/x86/kvm/vmx.h > > @@ -340,6 +340,7 @@ enum vmcs_field { > > #define MSR_IA32_FEATURE_CONTROL_VMXON_ENABLED 0x4 > > > > #define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT 9 > > +#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT 10 > > > > #define VMX_NR_VPIDS (1 << 16) > > #define VMX_VPID_EXTENT_SINGLE_CONTEXT 1 > > @@ -362,4 +363,6 @@ enum vmcs_field { > > #define VMX_EPT_FAKE_ACCESSED_MASK (1ul << 62) > > #define VMX_EPT_FAKE_DIRTY_MASK (1ul << 63) > > > > +#define VMX_EPT_IDENTITY_PAGETABLE_ADDR 0xfffbc000ul > > + > > #endif > > diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h > > index 003bc0e..69afbab 100644 > > --- a/include/asm-x86/kvm_host.h > > +++ b/include/asm-x86/kvm_host.h > > @@ -314,6 +314,9 @@ struct kvm_arch{ > > struct page *apic_access_page; > > > > gpa_t wall_clock; > > + > > + struct page *ept_identity_pagetable; > > + bool ept_identity_pagetable_done; > > Why not use ept_identity_pagetable != NULL to encode > ept_identity_pagetable_done? ept_identity_pagetable_done was used to indicate if the pagetable was setted up, and ept_identity_pagetable was used to indicate if the page used for pagetable was allocated... I don't want to run alloc_identity_pagetable() again and again and again... Another method is read several bits at the front of page to tell if the pagetable was setted up, but somehow tricky... |
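Putting the two agreed review points together — spell out the PDE bits instead of the bare 0xe7, and write the entries through kvm_write_guest_page() one at a time so no page-sized array sits on the kernel stack — the setup loop can look roughly like the sketch below. The flag values are the standard x86 page-table bits (they sum to 0xe7); the helper names and error handling are simplified here and this is not the final patch.

/* bits as in asm/pgtable.h: present | rw | user | accessed | dirty | pse */
#define IDENTITY_PDE_FLAGS	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
				 _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE)

/* one 32-bit PSE PDE maps 4 MB, so entry i covers guest physical i << 22 */
static int write_identity_map(struct kvm *kvm, gfn_t identity_map_gfn)
{
	u32 pde;
	int i, r;

	for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
		pde = (i << 22) | IDENTITY_PDE_FLAGS;
		/* 4-byte write per entry instead of a 4 KB on-stack table */
		r = kvm_write_guest_page(kvm, identity_map_gfn, &pde,
					 i * sizeof(pde), sizeof(pde));
		if (r < 0)
			return r;
	}
	return 0;
}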
From: Avi K. <av...@qu...> - 2008-04-24 07:15:31
Yang, Sheng wrote: > On Tuesday 22 April 2008 18:16:41 Avi Kivity wrote: > >> Yang, Sheng wrote: >> >>> From 73c33765f3d879001818cd0719038c78a0c65561 Mon Sep 17 00:00:00 2001 >>> From: Sheng Yang <she...@in...> >>> Date: Fri, 18 Apr 2008 17:15:39 +0800 >>> Subject: [PATCH] kvm: qemu: Enable EPT support for real mode >>> >>> This patch build a identity page table on the last page of VGA bios, and >>> use it as the guest page table in nonpaging mode for EPT. >>> >> Doing this in qemu means older versions of qemu can't work with an >> ept-enabled kernel. Also, placing the table in the vga bios might >> conflict with video card assignment to a guest. >> >> Suggest placing this near the realmode tss (see vmx.c:init_rmode_tss()) >> which serves a similar function. >> > > Something like this? (along with one page reserved in e820 table) > > I put the page it into 0xfffbc000 now. But I think the following implement is > not very elegant... Too complex compared to the qemu one. > > BTW: The S/R and live migration problem was fixed. > > Ah, good. > > +static int init_rmode_identity_map(struct kvm *kvm) > +{ > + int i, r, ret; > + pfn_t identity_map_pfn; > + u32 table[PT32_ENT_PER_PAGE]; > That's 4KB. On i386 with 4K stacks, this may cause a stack overflow. Even with 8K stacks you're on thin ice here, with the temperature rapidly rising. > + > + if (kvm->arch.ept_identity_pagetable_done) > + return 1; > + ret = 0; > + identity_map_pfn = VMX_EPT_IDENTITY_PAGETABLE_ADDR >> PAGE_SHIFT; > + r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE); > + if (r < 0) > + goto out; > + /* > + * Set up identity-mapping pagetable for EPT in real mode, also verify > + * the contain of page > s/contain/contents/ > + * 0xe7 = _PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | > + * _PAGE_DIRTY | _PAGE_PSE > + */ > + for (i = 0; i < PT32_ENT_PER_PAGE; i++) > + table[i] = (i << 22) + 0xe7; > Instead of the comment, you can put the identifiers into the code instead of 0xe7. And, to avoid the stack overflow, simply use kvm_write_guest() here. > +static int alloc_identity_pagetable(struct kvm *kvm) > +{ > + struct kvm_userspace_memory_region kvm_userspace_mem; > + int r = 0; > + > + down_write(&kvm->slots_lock); > + if (kvm->arch.ept_identity_pagetable) > + goto out; > + kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT; > + kvm_userspace_mem.flags = 0; > + kvm_userspace_mem.guest_phys_addr = VMX_EPT_IDENTITY_PAGETABLE_ADDR; > + kvm_userspace_mem.memory_size = PAGE_SIZE; > + r = __kvm_set_memory_region(kvm, &kvm_userspace_mem, 0); > + if (r) > + goto out; > + > + down_read(¤t->mm->mmap_sem); > + kvm->arch.ept_identity_pagetable = gfn_to_page(kvm, > + VMX_EPT_IDENTITY_PAGETABLE_ADDR >> PAGE_SHIFT); > + up_read(¤t->mm->mmap_sem); > +out: > + up_write(&kvm->slots_lock); > + return r; > +} > There's already a memslot for the tss, no? Why not expand it by a page? 
> + > static void allocate_vpid(struct vcpu_vmx *vmx) > { > int vpid; > @@ -1904,6 +1960,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) > return 0; > } > > +static int init_rmode(struct kvm *kvm) > +{ > + if (!init_rmode_tss(kvm)) > + return 0; > + if (!init_rmode_identity_map(kvm)) > + return 0; > + return 1; > +} > + > static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) > { > struct vcpu_vmx *vmx = to_vmx(vcpu); > @@ -1911,7 +1976,7 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) > int ret; > > down_read(&vcpu->kvm->slots_lock); > - if (!init_rmode_tss(vmx->vcpu.kvm)) { > + if (!init_rmode(vmx->vcpu.kvm)) { > ret = -ENOMEM; > goto out; > } > @@ -2967,6 +3032,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm > *kvm, unsigned int id) > if (alloc_apic_access_page(kvm) != 0) > goto free_vmcs; > > + if (vm_need_ept()) > + if (alloc_identity_pagetable(kvm) != 0) > + goto free_vmcs; > + > return &vmx->vcpu; > > free_vmcs: > diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h > index 8f662e3..469a107 100644 > --- a/arch/x86/kvm/vmx.h > +++ b/arch/x86/kvm/vmx.h > @@ -340,6 +340,7 @@ enum vmcs_field { > #define MSR_IA32_FEATURE_CONTROL_VMXON_ENABLED 0x4 > > #define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT 9 > +#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT 10 > > #define VMX_NR_VPIDS (1 << 16) > #define VMX_VPID_EXTENT_SINGLE_CONTEXT 1 > @@ -362,4 +363,6 @@ enum vmcs_field { > #define VMX_EPT_FAKE_ACCESSED_MASK (1ul << 62) > #define VMX_EPT_FAKE_DIRTY_MASK (1ul << 63) > > +#define VMX_EPT_IDENTITY_PAGETABLE_ADDR 0xfffbc000ul > + > #endif > diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h > index 003bc0e..69afbab 100644 > --- a/include/asm-x86/kvm_host.h > +++ b/include/asm-x86/kvm_host.h > @@ -314,6 +314,9 @@ struct kvm_arch{ > struct page *apic_access_page; > > gpa_t wall_clock; > + > + struct page *ept_identity_pagetable; > + bool ept_identity_pagetable_done; Why not use ept_identity_pagetable != NULL to encode ept_identity_pagetable_done? -- error compiling committee.c: too many arguments to function |
From: Yang, S. <she...@in...> - 2008-04-24 06:54:09
On Tuesday 22 April 2008 18:16:41 Avi Kivity wrote: > Yang, Sheng wrote: > > From 73c33765f3d879001818cd0719038c78a0c65561 Mon Sep 17 00:00:00 2001 > > From: Sheng Yang <she...@in...> > > Date: Fri, 18 Apr 2008 17:15:39 +0800 > > Subject: [PATCH] kvm: qemu: Enable EPT support for real mode > > > > This patch build a identity page table on the last page of VGA bios, and > > use it as the guest page table in nonpaging mode for EPT. > > Doing this in qemu means older versions of qemu can't work with an > ept-enabled kernel. Also, placing the table in the vga bios might > conflict with video card assignment to a guest. > > Suggest placing this near the realmode tss (see vmx.c:init_rmode_tss()) > which serves a similar function. Something like this? (along with one page reserved in e820 table) I put the page it into 0xfffbc000 now. But I think the following implement is not very elegant... Too complex compared to the qemu one. BTW: The S/R and live migration problem was fixed. From b1836738e82ed416c8fb43cffd85b3d17ab10260 Mon Sep 17 00:00:00 2001 From: Sheng Yang <she...@in...> Date: Thu, 24 Apr 2008 14:23:50 +0800 Subject: [PATCH] KVM: VMX: Perpare a identity page table for EPT in real mode Signed-off-by: Sheng Yang <she...@in...> --- arch/x86/kvm/vmx.c | 75 ++++++++++++++++++++++++++++++++++++++++++-- arch/x86/kvm/vmx.h | 3 ++ include/asm-x86/kvm_host.h | 3 ++ 3 files changed, 78 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 588b9ea..b19b2b2 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -87,7 +87,7 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu) return container_of(vcpu, struct vcpu_vmx, vcpu); } -static int init_rmode_tss(struct kvm *kvm); +static int init_rmode(struct kvm *kvm); static DEFINE_PER_CPU(struct vmcs *, vmxarea); static DEFINE_PER_CPU(struct vmcs *, current_vmcs); @@ -1345,7 +1345,7 @@ static void enter_rmode(struct kvm_vcpu *vcpu) fix_rmode_seg(VCPU_SREG_FS, &vcpu->arch.rmode.fs); kvm_mmu_reset_context(vcpu); - init_rmode_tss(vcpu->kvm); + init_rmode(vcpu->kvm); } #ifdef CONFIG_X86_64 @@ -1707,6 +1707,37 @@ out: return ret; } +static int init_rmode_identity_map(struct kvm *kvm) +{ + int i, r, ret; + pfn_t identity_map_pfn; + u32 table[PT32_ENT_PER_PAGE]; + + if (kvm->arch.ept_identity_pagetable_done) + return 1; + ret = 0; + identity_map_pfn = VMX_EPT_IDENTITY_PAGETABLE_ADDR >> PAGE_SHIFT; + r = kvm_clear_guest_page(kvm, identity_map_pfn, 0, PAGE_SIZE); + if (r < 0) + goto out; + /* + * Set up identity-mapping pagetable for EPT in real mode, also verify + * the contain of page + * 0xe7 = _PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | + * _PAGE_DIRTY | _PAGE_PSE + */ + for (i = 0; i < PT32_ENT_PER_PAGE; i++) + table[i] = (i << 22) + 0xe7; + r = kvm_write_guest_page(kvm, identity_map_pfn, + table, 0, PAGE_SIZE); + if (r < 0) + goto out; + kvm->arch.ept_identity_pagetable_done = true; + ret = 1; +out: + return ret; +} + static void seg_setup(int seg) { struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg]; @@ -1741,6 +1772,31 @@ out: return r; } +static int alloc_identity_pagetable(struct kvm *kvm) +{ + struct kvm_userspace_memory_region kvm_userspace_mem; + int r = 0; + + down_write(&kvm->slots_lock); + if (kvm->arch.ept_identity_pagetable) + goto out; + kvm_userspace_mem.slot = IDENTITY_PAGETABLE_PRIVATE_MEMSLOT; + kvm_userspace_mem.flags = 0; + kvm_userspace_mem.guest_phys_addr = VMX_EPT_IDENTITY_PAGETABLE_ADDR; + kvm_userspace_mem.memory_size = PAGE_SIZE; + r = 
__kvm_set_memory_region(kvm, &kvm_userspace_mem, 0); + if (r) + goto out; + + down_read(¤t->mm->mmap_sem); + kvm->arch.ept_identity_pagetable = gfn_to_page(kvm, + VMX_EPT_IDENTITY_PAGETABLE_ADDR >> PAGE_SHIFT); + up_read(¤t->mm->mmap_sem); +out: + up_write(&kvm->slots_lock); + return r; +} + static void allocate_vpid(struct vcpu_vmx *vmx) { int vpid; @@ -1904,6 +1960,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) return 0; } +static int init_rmode(struct kvm *kvm) +{ + if (!init_rmode_tss(kvm)) + return 0; + if (!init_rmode_identity_map(kvm)) + return 0; + return 1; +} + static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -1911,7 +1976,7 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu) int ret; down_read(&vcpu->kvm->slots_lock); - if (!init_rmode_tss(vmx->vcpu.kvm)) { + if (!init_rmode(vmx->vcpu.kvm)) { ret = -ENOMEM; goto out; } @@ -2967,6 +3032,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id) if (alloc_apic_access_page(kvm) != 0) goto free_vmcs; + if (vm_need_ept()) + if (alloc_identity_pagetable(kvm) != 0) + goto free_vmcs; + return &vmx->vcpu; free_vmcs: diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h index 8f662e3..469a107 100644 --- a/arch/x86/kvm/vmx.h +++ b/arch/x86/kvm/vmx.h @@ -340,6 +340,7 @@ enum vmcs_field { #define MSR_IA32_FEATURE_CONTROL_VMXON_ENABLED 0x4 #define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT 9 +#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT 10 #define VMX_NR_VPIDS (1 << 16) #define VMX_VPID_EXTENT_SINGLE_CONTEXT 1 @@ -362,4 +363,6 @@ enum vmcs_field { #define VMX_EPT_FAKE_ACCESSED_MASK (1ul << 62) #define VMX_EPT_FAKE_DIRTY_MASK (1ul << 63) +#define VMX_EPT_IDENTITY_PAGETABLE_ADDR 0xfffbc000ul + #endif diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index 003bc0e..69afbab 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -314,6 +314,9 @@ struct kvm_arch{ struct page *apic_access_page; gpa_t wall_clock; + + struct page *ept_identity_pagetable; + bool ept_identity_pagetable_done; }; struct kvm_vm_stat { -- 1.5.4.5 |
From: Andrea A. <an...@qu...> - 2008-04-24 06:49:49
|
On Thu, Apr 24, 2008 at 12:19:28AM +0200, Andrea Arcangeli wrote:
> /dev/kvm closure. Given this can be considered a bugfix to
> mmu_notifier_unregister I'll apply it to 1/N and I'll release a new

I'm no longer sure this can be considered a bugfix, given how large a change
it turned into in the locking and the register/unregister/release behavior.
Here is a full draft patch for review and testing. Works great with KVM so
far, at least...

- mmu_notifier_register has to run on current->mm or on get_task_mm() (in
  the latter case mmput can be called after mmu_notifier_register returns)
- mmu_notifier_register in turn can't race against mmu_notifier_release, as
  that runs in exit_mmap after the last mmput
- mmu_notifier_unregister can run at any time, even after exit_mmap has
  completed. No mm_count pin is required; it's taken automatically by
  register and released by unregister
- mmu_notifier_unregister serializes against all mmu notifiers with SRCU,
  and it serializes especially against a concurrent mmu_notifier_unregister
  with a mix of a spinlock and SRCU
- the spinlock lets us track which of mmu_notifier_unregister and
  mmu_notifier_release ran first; this makes life much easier for the
  driver, as the driver is then guaranteed that ->release will run
- whichever of the two runs first also executes the ->release method, after
  dropping the spinlock but before releasing the SRCU lock
- it was unsafe to unpin the module count from ->release, as release itself
  has to run the 'ret' instruction to return back to the mmu notifier code
- the ->release method is mandatory, as it has to run before the pages are
  freed in order to zap all existing sptes
- the one that arrives second between mmu_notifier_unregister and
  mmu_notifier_register waits for the first with SRCU

As said, this is a much larger change than I hoped, but as usual it can only
affect KVM/GRU/XPMEM if something is wrong with it. I don't exclude that
we'll have to back off to the previous mm_users model. The main issue with
taking an mm_users pin is that filehandles associated with vmas aren't
closed by exit() if mm_users is pinned (that simply leaks RAM with kvm). It
looks more correct not to rely on mm_users being > 0 anywhere except in
mmu_notifier_register.

The other big change is that ->release is mandatory and is always called by
whichever of mmu_notifier_unregister or mmu_notifier_release runs first.
Both mmu_notifier_unregister and mmu_notifier_release are slow paths, so
taking a spinlock there is no big deal. The impact when the mmu notifiers
are disarmed is unchanged.

The interesting part of the kvm patch to test this change is below. After
this last bit the KVM patch status is almost final, assuming this new mmu
notifier update is remotely ok; I have another patch that does the locking
change to remove the page pin.
+static void kvm_free_vcpus(struct kvm *kvm); +/* This must zap all the sptes because all pages will be freed then */ +static void kvm_mmu_notifier_release(struct mmu_notifier *mn, + struct mm_struct *mm) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + BUG_ON(mm != kvm->mm); + kvm_free_pit(kvm); + kfree(kvm->arch.vpic); + kfree(kvm->arch.vioapic); + kvm_free_vcpus(kvm); + kvm_free_physmem(kvm); + if (kvm->arch.apic_access_page) + put_page(kvm->arch.apic_access_page); +} + +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { + .release = kvm_mmu_notifier_release, + .invalidate_page = kvm_mmu_notifier_invalidate_page, + .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end, + .clear_flush_young = kvm_mmu_notifier_clear_flush_young, +}; + struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } @@ -3899,13 +3967,12 @@ static void kvm_free_vcpus(struct kvm *kvm) void kvm_arch_destroy_vm(struct kvm *kvm) { - kvm_free_pit(kvm); - kfree(kvm->arch.vpic); - kfree(kvm->arch.vioapic); - kvm_free_vcpus(kvm); - kvm_free_physmem(kvm); - if (kvm->arch.apic_access_page) - put_page(kvm->arch.apic_access_page); + /* + * kvm_mmu_notifier_release() will be called before + * mmu_notifier_unregister returns, if it didn't run + * already. + */ + mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm); kfree(kvm); } Let's call this mmu notifier #v14-test1. Signed-off-by: Andrea Arcangeli <an...@qu...> Signed-off-by: Nick Piggin <np...@su...> Signed-off-by: Christoph Lameter <cla...@sg...> diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1050,6 +1050,27 @@ unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); +/* + * mm_lock will take mmap_sem writably (to prevent all modifications + * and scanning of vmas) and then also takes the mapping locks for + * each of the vma to lockout any scans of pagetables of this address + * space. This can be used to effectively holding off reclaim from the + * address space. + * + * mm_lock can fail if there is not enough memory to store a pointer + * array to all vmas. + * + * mm_lock and mm_unlock are expensive operations that may take a long time. 
+ */ +struct mm_lock_data { + spinlock_t **i_mmap_locks; + spinlock_t **anon_vma_locks; + size_t nr_i_mmap_locks; + size_t nr_anon_vma_locks; +}; +extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); +extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); + extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -19,6 +19,7 @@ #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) struct address_space; +struct mmu_notifier_mm; #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS typedef atomic_long_t mm_counter_t; @@ -225,6 +226,9 @@ #ifdef CONFIG_CGROUP_MEM_RES_CTLR struct mem_cgroup *mem_cgroup; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct mmu_notifier_mm *mmu_notifier_mm; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,251 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/mm_types.h> + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER +#include <linux/srcu.h> + +struct mmu_notifier_mm { + struct hlist_head list; + struct srcu_struct srcu; + /* to serialize mmu_notifier_unregister against mmu_notifier_release */ + spinlock_t unregister_lock; +}; + +struct mmu_notifier_ops { + /* + * Called after all other threads have terminated and the executing + * thread is the only remaining execution thread. There are no + * users of the mm_struct remaining. + * + * If the methods are implemented in a module, the module + * can't be unloaded until release() is called. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed by the Linux pte + * because the old page hasn't been freed yet. If required + * set_page_dirty has to be called internally to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem is held and/or + * the semaphores protecting the reverse maps. Both functions + * may sleep. The subsystem must guarantee that no additional + * references to the pages in the range established between + * the call to invalidate_range_start() and the matching call + * to invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be permitted + * by the driver or the driver may exclude other invalidation + * from proceeding by blocking on new invalidate_range_start() + * callback that overlap invalidates that are already in + * progress. 
Either way the establishment of sptes to the + * range can only be allowed if all invalidate_range_stop() + * function have been called. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end(). If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * droppped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem). + * 3. 
No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify 
ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -53,6 +53,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/mmu_notifier.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -362,6 +363,7 @@ if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -395,6 +397,7 @@ BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,3 +193,7 @@ config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -194,7 +194,7 @@ if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include <linux/rmap.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/mmu_context.h> #include <asm/cacheflush.h> @@ -214,7 +215,9 @@ spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include <linux/mempolicy.h> #include <linux/cpuset.h> #include <linux/mutex.h> +#include <linux/mmu_notifier.h> #include <asm/page.h> #include <asm/pgtable.h> @@ -799,6 +800,7 @@ BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -819,6 +821,7 @@ } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include <linux/init.h> #include <linux/writeback.h> #include <linux/memcontrol.h> +#include <linux/mmu_notifier.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -596,6 +597,7 @@ unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -603,25 +605,39 @@ * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. 
*/ + ret = 0; if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { if (!vma->anon_vma) - return 0; + goto out; } - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (unlikely(is_vm_hugetlb_page(vma))) { + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma); + goto out; + } + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); +out: + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -825,7 +841,9 @@ unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -876,6 +894,7 @@ } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1463,10 +1482,11 @@ { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1474,6 +1494,7 @@ if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1675,7 +1696,7 @@ * seen in the presence of one thread doing SMC and another * thread doing COW. 
*/ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,9 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/vmalloc.h> +#include <linux/sort.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -2038,6 +2041,7 @@ /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2242,3 +2246,144 @@ return 0; } + +static int mm_lock_cmp(const void *a, const void *b) +{ + unsigned long _a = (unsigned long)*(spinlock_t **)a; + unsigned long _b = (unsigned long)*(spinlock_t **)b; + + cond_resched(); + if (_a < _b) + return -1; + if (_a > _b) + return 1; + return 0; +} + +static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, + int anon) +{ + struct vm_area_struct *vma; + size_t i = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (anon) { + if (vma->anon_vma) + locks[i++] = &vma->anon_vma->lock; + } else { + if (vma->vm_file && vma->vm_file->f_mapping) + locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + } + } + + if (!i) + goto out; + + sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + +out: + return i; +} + +static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 1); +} + +static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 0); +} + +static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +{ + spinlock_t *last = NULL; + size_t i; + + for (i = 0; i < nr; i++) + /* Multiple vmas may use the same lock. */ + if (locks[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) locks[i]); + last = locks[i]; + if (lock) + spin_lock(last); + else + spin_unlock(last); + } +} + +static inline void __mm_lock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 1); +} + +static inline void __mm_unlock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 0); +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. The holder + * must not hold any mm related lock. A single task can't take more + * than one mm lock in a row or it would deadlock. 
+ */ +int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) +{ + spinlock_t **anon_vma_locks, **i_mmap_locks; + + down_write(&mm->mmap_sem); + if (mm->map_count) { + anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!anon_vma_locks)) { + up_write(&mm->mmap_sem); + return -ENOMEM; + } + + i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!i_mmap_locks)) { + up_write(&mm->mmap_sem); + vfree(anon_vma_locks); + return -ENOMEM; + } + + data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); + data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + + if (data->nr_anon_vma_locks) { + __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); + data->anon_vma_locks = anon_vma_locks; + } else + vfree(anon_vma_locks); + + if (data->nr_i_mmap_locks) { + __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); + data->i_mmap_locks = i_mmap_locks; + } else + vfree(i_mmap_locks); + } + return 0; +} + +static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +{ + __mm_unlock(locks, nr); + vfree(locks); +} + +/* avoid memory allocations for mm_unlock to prevent deadlock */ +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) +{ + if (mm->map_count) { + if (data->nr_anon_vma_locks) + mm_unlock_vfree(data->anon_vma_locks, + data->nr_anon_vma_locks); + if (data->i_mmap_locks) + mm_unlock_vfree(data->i_mmap_locks, + data->nr_i_mmap_locks); + } + up_write(&mm->mmap_sem); +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,241 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <cla...@sg...> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/err.h> +#include <linux/srcu.h> +#include <linux/rcupdate.h> +#include <linux/sched.h> + +/* + * This function can't run concurrently against mmu_notifier_register + * or any other mmu notifier method. mmu_notifier_register can only + * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is + * zero). All other tasks of this mm already quit so they can't invoke + * mmu notifiers anymore. This can run concurrently only against + * mmu_notifier_unregister and it serializes against it with the + * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go + * away from under us as the exit_mmap holds a mm_count pin itself. + * + * The ->release method can't allow the module to be unloaded, the + * module can only be unloaded after mmu_notifier_unregister run. This + * is because the release method has to run the ret instruction to + * return back here, and so it can't allow the ret instruction to be + * freed. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_lock(&mm->mmu_notifier_mm->unregister_lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister so + * mmu_notifier_unregister will do nothing else than + * to wait ->release to finish and + * mmu_notifier_unregister to return. 
+ */ + hlist_del_init(&mn->hlist); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes before the pages in the + * mm are freed. + */ + spin_unlock(&mm->mmu_notifier_mm->unregister_lock); + /* SRCU will block mmu_notifier_unregister */ + mn->ops->release(mn, mm); + spin_lock(&mm->mmu_notifier_mm->unregister_lock); + } + spin_unlock(&mm->mmu_notifier_mm->unregister_lock); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + /* + * Wait ->release if mmu_notifier_unregister run list_del_rcu. + * srcu can't go away from under us because one mm_count is + * hold by exit_mmap. + */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending if the mapping previously + * existed or not. + */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0, srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely like + * with get_task_mm(). mmput can be called after mmu_notifier_register + * returns. mmu_notifier_unregister must be always called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. 
+ */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + struct mm_lock_data data; + int ret; + + BUG_ON(atomic_read(&mm->mm_users) <= 0); + + ret = mm_lock(mm, &data); + if (unlikely(ret)) + goto out; + + if (!mm_has_notifiers(mm)) { + mm->mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), + GFP_KERNEL); + ret = -ENOMEM; + if (unlikely(!mm_has_notifiers(mm))) + goto out_unlock; + + ret = init_srcu_struct(&mm->mmu_notifier_mm->srcu); + if (unlikely(ret)) { + kfree(mm->mmu_notifier_mm); + mmu_notifier_mm_init(mm); + goto out_unlock; + } + INIT_HLIST_HEAD(&mm->mmu_notifier_mm->list); + spin_lock_init(&mm->mmu_notifier_mm->unregister_lock); + } + atomic_inc(&mm->mm_count); + + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list); +out_unlock: + mm_unlock(mm, &data); +out: + BUG_ON(atomic_read(&mm->mm_users) <= 0); + return ret; +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* this is called after the last mmu_notifier_unregister() returned */ +void __mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); + cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu); + kfree(mm->mmu_notifier_mm); + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ +} + +/* + * This releases the mm_count pin automatically and frees the mm + * structure if it was the last user of it. It serializes against + * running mmu notifiers with SRCU and against mmu_notifier_unregister + * with the unregister lock + SRCU. All sptes must be dropped before + * calling mmu_notifier_unregister. ->release or any other notifier + * method may be invoked concurrently with mmu_notifier_unregister, + * and only after mmu_notifier_unregister returned we're guaranteed + * that ->release or any other method can't run anymore. + */ +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + int before_release = 0, srcu; + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_lock(&mm->mmu_notifier_mm->unregister_lock); + if (!hlist_unhashed(&mn->hlist)) { + hlist_del_rcu(&mn->hlist); + before_release = 1; + } + spin_unlock(&mm->mmu_notifier_mm->unregister_lock); + if (before_release) + /* + * exit_mmap will block in mmu_notifier_release to + * guarantee ->release is called before freeing the + * pages. 
+ */ + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + /* wait any running method to finish, including ->release */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + mmdrop(mm); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include <linux/syscalls.h> #include <linux/swap.h> #include <linux/swapops.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/cacheflush.h> @@ -198,10 +199,12 @@ dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include <linux/highmem.h> #include <linux/security.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -74,7 +75,11 @@ struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include <linux/module.h> #include <linux/kallsyms.h> #include <linux/memcontrol.h> +#include <linux/mmu_notifier.h> #include <asm/tlbflush.h> @@ -287,7 +288,7 @@ if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -456,7 +457,7 @@ pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) |
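[Editorial note: to make the register/unregister/release rules described above concrete, here is a hedged sketch of a minimal secondary-MMU driver written against the interface exactly as posted in this #v14-test1 patch (the API that was eventually merged differs in details). my_dev, my_zap_all_sptes() and my_zap_spte() are placeholders invented for the illustration, not real KVM or GRU code.]

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct my_dev {
	struct mm_struct *mm;
	struct mmu_notifier mn;
};

/* Placeholders for whatever secondary-MMU teardown the driver needs. */
static void my_zap_all_sptes(struct my_dev *dev)
{
	printk(KERN_INFO "dropping all sptes for mm %p\n", dev->mm);
}

static void my_zap_spte(struct my_dev *dev, unsigned long address)
{
	printk(KERN_INFO "dropping spte for %lx\n", address);
}

/* Runs exactly once, from exit_mmap or from mmu_notifier_unregister. */
static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	my_zap_all_sptes(container_of(mn, struct my_dev, mn));
}

static void my_invalidate_page(struct mmu_notifier *mn,
			       struct mm_struct *mm, unsigned long address)
{
	my_zap_spte(container_of(mn, struct my_dev, mn), address);
}

static const struct mmu_notifier_ops my_ops = {
	.release	 = my_release,		/* mandatory in this version */
	.invalidate_page = my_invalidate_page,
};

static int my_dev_attach(struct my_dev *dev)
{
	dev->mm = current->mm;
	dev->mn.ops = &my_ops;
	/* Call without mmap_sem held, while mm_users is still > 0. */
	return mmu_notifier_register(&dev->mn, dev->mm);
}

static void my_dev_detach(struct my_dev *dev)
{
	/* Safe even after exit_mmap; drops the mm_count pin taken above. */
	mmu_notifier_unregister(&dev->mn, dev->mm);
}

[The property the driver relies on is the one spelled out in the mail: ->release is guaranteed to run exactly once, whether exit_mmap or mmu_notifier_unregister gets there first, so all sptes are dropped before the pages backing them are freed.]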
From: Jerone Y. <jy...@us...> - 2008-04-24 04:56:46
|
 2 files changed, 6 insertions(+), 1 deletion(-)
 configure       |    6 ++++++
 libkvm/Makefile |    1 -

This is a relic of the big userspace refactoring, but today libkvm should
not include settings from the test suite. This patch resolves this and
removes the overwriting of settings from the main config.mak with test
suite settings.

Signed-off-by: Jerone Young <jy...@us...>

diff --git a/configure b/configure
--- a/configure
+++ b/configure
@@ -2,6 +2,9 @@
 
 prefix=/usr/local
 kerneldir=/lib/modules/$(uname -r)/build
+cc=cc
+ld=ld
+objcopy=objcopy
 want_module=1
 qemu_cc=
 qemu_cflags=
@@ -131,4 +134,7 @@ KERNELDIR=$kerneldir
 KERNELDIR=$kerneldir
 WANT_MODULE=$want_module
 CROSS_COMPILE=$cross_prefix
+CC=$cross_prefix$cc
+LD=$cross_prefix$ld
+OBJCOPY=$cross_prefix$objcopy
 EOF
diff --git a/libkvm/Makefile b/libkvm/Makefile
--- a/libkvm/Makefile
+++ b/libkvm/Makefile
@@ -1,5 +1,4 @@
 include ../config.mak
 include ../config.mak
-include ../user/config.mak
 include config-$(ARCH).mak
 # cc-option
|