From: Martin J. B. <mb...@ar...> - 2003-02-20 04:25:58
|
The performance delta between 2.5.62-mjb1 and 2.5.62-mjb2 is caused by the
partial object-based rmap patch (written by Dave McCracken). I expect this
patch to have an increasing impact on workloads with more processes, and it
should give a substantial space saving as well as a performance increase.
Results are from a 16x NUMA-Q system ...

Profile comparison:

before
15525 page_remove_rmap
 6415 page_add_rmap

after
 2055 page_add_rmap
 1983 page_remove_rmap

(performance of 62 was equivalent to 61)

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
              Elapsed    User  System      CPU
2.5.61          45.77  561.71  118.87  1486.50
2.5.62-mjb1     45.81  564.41  112.76  1478.00
2.5.62-mjb2     44.88  563.64   98.29  1474.50

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
              Elapsed    User  System      CPU
2.5.61          47.46  565.70  144.77  1496.33
2.5.62-mjb1     47.21  569.17  139.55  1500.67
2.5.62-mjb2     46.09  568.19  121.83  1496.67

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not comparable with any published results. Results
are shown as percentages of the first set displayed.

SDET 1 (see disclaimer)
              Throughput  Std. Dev
2.5.61            100.0%      4.5%
2.5.62-mjb1        95.2%      3.0%
2.5.62-mjb2        97.1%      4.6%

SDET 2 (see disclaimer)
              Throughput  Std. Dev
2.5.61            100.0%      4.3%
2.5.62-mjb1        93.4%      6.6%
2.5.62-mjb2        99.4%      6.1%

SDET 4 (see disclaimer)
              Throughput  Std. Dev
2.5.61            100.0%      2.4%
2.5.62-mjb1       101.9%      7.1%
2.5.62-mjb2       116.9%      1.5%

SDET 8 (see disclaimer)
              Throughput  Std. Dev
2.5.61            100.0%      7.4%
2.5.62-mjb1       111.1%      1.8%
2.5.62-mjb2       121.8%      5.9%

SDET 16 (see disclaimer)
              Throughput  Std. Dev
2.5.61            100.0%      2.0%
2.5.62-mjb1       105.0%      2.2%
2.5.62-mjb2       107.0%      2.1%

SDET 32 (see disclaimer)
              Throughput  Std. Dev
2.5.61            100.0%      7.9%
2.5.62-mjb1       112.4%      1.2%
2.5.62-mjb2       116.8%      0.9%

SDET 64 (see disclaimer)
              Throughput  Std. Dev
2.5.61            100.0%      1.2%
2.5.62-mjb1       111.0%      0.7%
2.5.62-mjb2       116.0%      0.9%
|
From: William L. I. I. <wl...@ho...> - 2003-02-20 04:49:47
|
On Wed, Feb 19, 2003 at 08:25:52PM -0800, Martin J. Bligh wrote:
> Profile comparison:
> before
> 15525 page_remove_rmap
>  6415 page_add_rmap
> after
>  2055 page_add_rmap
>  1983 page_remove_rmap

Could I get a larger, multiplicative differential profile? i.e. ratios of
the fractions of profile hits?

If you have trouble generating such I can do so myself from fuller profile
results. Thanks.

-- wli
|
From: Martin J. B. <mb...@ar...> - 2003-02-20 05:14:16
|
>> Profile comparison:
>> before
>> 15525 page_remove_rmap
>>  6415 page_add_rmap
>> after
>>  2055 page_add_rmap
>>  1983 page_remove_rmap
>
> Could I get a larger, multiplicative differential profile?
> i.e. ratios of the fractions of profile hits?
>
> If you have trouble generating such I can do so myself from
> fuller profile results.

before:
187256 total

after:
170196 total

M.
|
From: William L. I. I. <wl...@ho...> - 2003-02-20 05:18:13
|
At some point in the past, I wrote:
>> Could I get a larger, multiplicative differential profile?
>> i.e. ratios of the fractions of profile hits?
>> If you have trouble generating such I can do so myself from
>> fuller profile results.

On Wed, Feb 19, 2003 at 09:14:10PM -0800, Martin J. Bligh wrote:
> before:
> 187256 total
> after:
> 170196 total

Well, I was trying to get an idea of what got slower to compensate for the
improvement in page_(add|remove)_rmap() times.

-- wli
|
From: Martin J. B. <mb...@ar...> - 2003-02-20 05:27:47
|
>>> Could I get a larger, multiplicative differential profile?
>>> i.e. ratios of the fractions of profile hits?
>>> If you have trouble generating such I can do so myself from
>>> fuller profile results.
>
> On Wed, Feb 19, 2003 at 09:14:10PM -0800, Martin J. Bligh wrote:
>> before:
>> 187256 total
>> after:
>> 170196 total
>
> Well, I was trying to get an idea of what got slower to compensate
> for the improvement in page_(add|remove)_rmap() times.

diffprofile:

  1514 default_idle
   921 .text.lock.file_table
   145 vm_enough_memory
   131 release_pages
    93 file_move
    68 __copy_to_user_ll
    62 fd_install
    60 buffered_rmqueue
    58 vma_merge
    52 __fput
   ...
   -65 kmem_cache_alloc
   -72 copy_page_range
   -73 do_anonymous_page
   -76 get_empty_filp
   -98 filemap_nopage
  -168 unmap_all_pages
  -352 find_get_page
  -402 kmem_cache_free
  -594 __pte_chain_free
 -4360 page_add_rmap
-13542 page_remove_rmap
-17060 total

I'll try out akpm's file table stuff later today.

M.
|
From: Martin J. B. <mb...@ar...> - 2003-02-21 01:53:12
|
> The performance delta between 2.5.62-mjb1 and 2.5.62-mjb2 is caused
> by the partial object-based rmap patch (written by Dave McCracken).
> I expect this patch to have an increasing impact on workloads with
> more processes, and it should give a substantial space saving as
> well as a performance increase. Results from 16x NUMA-Q system ...
>
> Profile comparison:
>
> before
> 15525 page_remove_rmap
>  6415 page_add_rmap
>
> after
>  2055 page_add_rmap
>  1983 page_remove_rmap

Did some space consumption comparisons on make -j 256:

before:
24116 pte_chain objects in slab cache

after:
  716 pte_chain objects in slab cache

The vast majority of anonymous pages (for which we're using the non
object-based method) are singletons, and hence use pte_direct ... hence
the massive space reduction.

M.
|
From: Andrew M. <ak...@di...> - 2003-02-21 03:06:50
|
"Martin J. Bligh" <mb...@ar...> wrote:
> ...
> Did some space consumption comparisons on make -j 256:
>
> before:
> 24116 pte_chain objects in slab cache
> after:
> 716 pte_chain objects in slab cache
>
> The vast majority of anonymous pages (for which we're using the non
> object based method) are singletons, and hence use pte_direct ...
> hence the massive space reduction.

OK. And that'll help the fork+exit CPU overhead too.

Let's talk about this a bit.

The problem remains that this design means that we need to walk all the
vma's which are attached to a page's address_space in an attempt to unmap
that page. Whereas with the existing pte_chain walk, we only need to visit
those pte's which are still mapping the page.

So with 1000 processes, each holding 1000 vma's against the same file, we
need to visit 1,000,000 vma's to unmap one page. If that unmap attempt is
unsuccessful, we *still* need to walk 1,000,000 vma's next time we try to
unmap it. Whereas with pte-based rmap, we do not need to walk over
already-unmapped pte's.

So it is easy to see how there is at least a theoretical collapse in search
complexity here, which affects all platforms. And, of course, there is a
demonstrated space consumption collapse with the existing code, which
affects big ia32 machines.

It is not clear how realistic the search complexity problem really is - the
2.4 virtual scan has basically the same problem and there is not a lot of
evidence of people hurting from it. In fact we see more hurt from full rmap
than from the 1000x1000 thing.

The search complexity _may_ affect less exotic things than databases - many
processes mapping libc, say. But in that case the number of VMA's is very
small - if this was a realistic problem then 2.4 would be in real strife.

Dave's patch is simple enough to merit serious consideration.

<scratches head>

I think the guiding principle here is that we should not optimise for the
uncommon case (as rmap is doing), and we should not allow the uncommon case
to be utterly terrible (as Dave's patch can do).

It would be overly optimistic to assume that hugetlb pages will save us
here. Any bright ideas?
|
From: Rik v. R. <ri...@im...> - 2003-02-21 03:26:03
|
On Thu, 20 Feb 2003, Andrew Morton wrote:
> I think the guiding principle here is that we should not optimise for the
> uncommon case (as rmap is doing), and we should not allow the uncommon case
> to be utterly terrible (as Dave's patch can do).

This "guiding principle" appears to be the primary reason why we've taken
over a year to stabilise linux 2.0 and linux 2.2 and linux 2.4 ... or at
least, too much of a focus on the first half of this guiding principle. ;)

We absolutely have to take care in avoiding the worst case scenarios, since
statistics pretty much guarantee that somebody will run into nothing but
that scenario ...

cheers,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/		http://kernelnewbies.org/
|
From: Andrew M. <ak...@di...> - 2003-02-21 03:46:29
|
Rik van Riel <ri...@im...> wrote:
> On Thu, 20 Feb 2003, Andrew Morton wrote:
>> I think the guiding principle here is that we should not optimise for the
>> uncommon case (as rmap is doing), and we should not allow the uncommon case
>> to be utterly terrible (as Dave's patch can do).
>
> This "guiding principle" appears to be the primary reason why
> we've taken over a year to stabilise linux 2.0 and linux 2.2
> and linux 2.4 ... or at least, too much of a focus on the first
> half of this guiding principle. ;)
>
> We absolutely have to take care in avoiding the worst case
> scenarios, since statistics pretty much guarantee that somebody
> will run into nothing but that scenario ...

We see things like the below report, which realistically shows the problems
with the reverse map. I have yet to see _any_ report that the problems to
which you refer are causing difficulty in the field.

I think there's a middle ground. Hint: MAP_ORACLE.

Begin forwarded message:

Date: Tue, 17 Sep 2002 13:30:42 -0500
From: "Peter Wong" <wp...@us...>
To: lin...@kv..., lse...@li...
Cc: ri...@nl..., ak...@zi..., mj...@us..., wl...@ho..., dm...@us...,
    gh...@us..., "Bill Hartner" <bha...@us...>,
    "Troy C Wilson" <wi...@us...>
Subject: Examining the Performance and Cost of Reverse Maps on 2.5.26
    Under Heavy DB Workload

I measured a decision support workload using two 2.5.26-based kernels, one
of which does NOT have the rmap rollup patch while the other HAS. The
database size is 100GB. The 2.5.26 rmap rollup patch was created by Dave
McCracken.

Based upon the throughput rate and CPU utilization, the two kernels achieve
similar performance. Based upon /proc/meminfo, the maximum reverse map size
is 22,817,885. Based upon /proc/slabinfo, the maximum number of active
pte_chain objects is 3,411,018 with 32 bytes each. It consumes about
104 MB. The maximum number of slabs allocated for pte_chains is 30,186 with
4KB each, corresponding to about 118 MB.

Similar memory consumption for rmaps is observed while running the same
workload and using Andrew Morton's mm2 patch under 2.5.32. Andrew's patch
can be found at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.32/2.5.32-mm2/.

Note that since readv is not available on 2.5.26, the runs above used
"read" instead of "readv". My previous note (Sept. 10, 2002) indicated that
the memory consumption for rmaps under 2.5.32 using "readv" is about
223 MB. And "readv" is the preferred method for this workload. The
difference in memory consumption between read (118 MB) and readv (223 MB)
is likely due to the different I/O algorithms used by the DBMS. When there
are multiple instances of this workload running concurrently, the amount of
memory allocated to rmaps is even more significant. More data will be
provided later.

----------------------------------------------------------------------
System Information:
- 8-way 700MHz Pentium III Xeon processors, 2MB L2 Cache
- 4GB RAM
- 6 SCSI controllers connected to 120 9.1 GB disks with 10K RPM
----------------------------------------------------------------------

Regards,
Peter

Peter Wai Yee Wong
IBM Linux Technology Center, Performance Analysis
email: wp...@us...
|
From: Rik v. R. <ri...@im...> - 2003-02-21 04:26:08
|
On Thu, 20 Feb 2003, Andrew Morton wrote:
> We see things like the below report, which realistically shows
> the problems with the reverse map.

Indeed, these need to be fixed.

> I have yet to see _any_ report that the problems to which you refer
> are causing difficulty in the field.

Well, Dave's patch has only been out for a day ;))

> I think there's a middle ground. Hint: MAP_ORACLE.

I think Dave's current patch could be a good start. In many cases where
object-based mapping hurts, the system would use nonlinear VMAs anyway.
This means we can probably get away with using Dave's object-based rmaps in
the areas where they are currently supported, while using page-based rmaps
for anonymous memory and nonlinear vmas.

For object-based reverse mapping, the worst case is with large objects
which are sparsely mapped many times (nonlinear vmas) and deeply inherited
COW anonymous memory (apache w/ 300 children).

For pte chains, the worst case is with heavily shared, mostly read-only
areas (shared libraries) or heavily shared and used shared memory segments.

The hybrid scheme in Dave's patch looks like it uses the right code in the
right situation, avoiding these worst cases by using the other reverse
mapping scheme. The more I think about it, the more I think there are
reasons we might want to stick to a hybrid scheme forever...

regards,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/		http://kernelnewbies.org/
|
From: William L. I. I. <wl...@ho...> - 2003-02-21 10:02:15
|
On Fri, Feb 21, 2003 at 01:25:25AM -0300, Rik van Riel wrote:
> For object-based reverse mapping, the worst case is with large
> objects which are sparsely mapped many times (nonlinear vmas)
> and deeply inherited COW anonymous memory (apache w/ 300 children).

Actually few users of this care about time, only lowmem space. It was
(re)invented, after all, to save lowmem used for vma's.

The technical issues with 32-bit aren't anywhere near as difficult to deal
with as the massive amount of backlash anything targeted at addressing
32-bit issues gets. These extremely irritating attempts to marginalize
every line of code written to address 32-bit issues are probably best dealt
with by showing common benefits with the so-called "desktop" machines
and/or workloads.

So witness a 768MB, 600MHz Athlon, running xmms, xterms, and mozilla:

Mem:  767376k av, 761932k used,   5444k free,      0k shrd,  70044k buff
      619264k active, 104160k inactive
Swap: 506008k av, 155860k used, 350148k free              197644k cached

pte_chain               5182K   5188K   99.88%
radix_tree_node         1338K   3088K   43.32%
dentry_cache            1591K   2970K   53.59%
reiser_inode_cache       480K   1226K   39.17%
buffer_head             1040K   1043K   99.79%
size-4096                828K    828K  100.00%
size-32                  796K    800K   99.54%
biovec-BIO_MAX_PAGES     768K    780K   98.46%
pgd                      704K    704K  100.00%

nr_page_table_pages 689

which amounts to 2756KB RAM used for PTE's, demonstrating large amounts of
internal fragmentation within the pte_chain slab. Reducing kernel memory
consumption would reduce swapping requirements, which generally speeds
things up on the precious "desktop". The vfs should probably get its act
together too, since normal usage regularly triggers explosive dentry_cache
and reiser_inode_cache space usage.

-- wli
|
From: William L. I. I. <wl...@ho...> - 2003-02-21 19:17:41
|
On Fri, Feb 21, 2003 at 02:00:10AM -0800, William Lee Irwin III wrote:
> nr_page_table_pages 689
> which amounts to 2756KB RAM used for PTE's, demonstrating large

Pagetables are also grossly fragmented. From the same sample:

nr_reverse_maps 156027

To preempt the FAQ, this is the number of reverse mappings performed, where
one is done for every instantiated non-swap PTE.

$ echo $(( (689*1024)/156027.0 ))
4.5218840328917427

So the space flushed down the toilet there is also relatively large.
Approximately 4.5 PTE's per pagetable page, or a utilization of:

$ echo $(( (100*689)/156027.0 ))
0.44159023758708427

percent.

-- wli
|
From: William L. I. I. <wl...@ho...> - 2003-02-21 19:30:11
|
On Fri, Feb 21, 2003 at 02:00:10AM -0800, William Lee Irwin III wrote:
>> nr_page_table_pages 689
>> which amounts to 2756KB RAM used for PTE's, demonstrating large

> Pagetables are also grossly fragmented. From the same sample:
> nr_reverse_maps 156027
> To preempt the FAQ, this is the number of reverse mappings performed,
> where one is done for every instantiated non-swap PTE.

ARGH, too early in the morning.

$ echo $(( (100*156027.0)/(689*1024) ))
22.11467593432511

percent utilization, or

$ echo $(( 156027.0/689 ))
226.45428156748912

PTE's per pagetable page.

-- wli
|
From: Martin J. B. <mb...@ar...> - 2003-02-21 03:32:43
|
>> ...
>> Did some space consumption comparisons on make -j 256:
>>
>> before:
>> 24116 pte_chain objects in slab cache
>> after:
>> 716 pte_chain objects in slab cache
>>
>> The vast majority of anonymous pages (for which we're using the non
>> object based method) are singletons, and hence use pte_direct ...
>> hence the massive space reduction.
>
> OK. And that'll help the fork+exit CPU overhead too.
>
> Let's talk about this a bit.
>
> The problem remains that this design means that we need to walk all the
> vma's which are attached to a page's address_space in an attempt to
> unmap that page.
>
> Whereas with the existing pte_chain walk, we only need to visit those
> pte's which are still mapping the page.
>
> So with 1000 processes, each holding 1000 vma's against the same file,
> we need to visit 1,000,000 vma's to unmap one page. If that unmap
> attempt is unsuccessful, we *still* need to walk 1,000,000 vma's next
> time we try to unmap it.

That's completely true, but I don't accept the 1000*1000 as a realistic
scenario. I do acknowledge that it will perform worse on pageout operations
in some cases ... there's certainly a tradeoff here, and we need to measure
things on the other side of the scenario.

> Whereas with pte-based rmap, we do not need to walk over already-unmapped
> pte's.
>
> So it is easy to see how there is at least a theoretical collapse in
> search complexity here. Which affects all platforms. And, of course
> there is a demonstrated space consumption collapse with the existing
> code which affects big ia32 machines.

The benefits come in both space and performance ... the space issue is,
admittedly, most important for ia32, but will to some extent affect all
machines.

> It is not clear how realistic the search complexity problem really is -
> the 2.4 virtual scan has basically the same problem and there is not a
> lot of evidence of people hurting from it. In fact we see more hurt from
> full rmap than from the 1000x1000 thing.
>
> The search complexity _may_ affect less exotic things than databases.
> Many processes mapping libc, say. But in that case the number of VMA's
> is very small - if this was a realistic problem then 2.4 would be in
> real strife.
>
> Dave's patch is simple enough to merit serious consideration.

Indeed, it's really astoundingly small.

> <scratches head>
>
> I think the guiding principle here is that we should not optimise for
> the uncommon case (as rmap is doing), and we should not allow the
> uncommon case to be utterly terrible (as Dave's patch can do).

Absolutely, as long as the uncommon case is at all realistic.

> It would be overly optimistic to assume that hugetlb pages will save us
> here.
>
> Any bright ideas?

I'd like to see a semi-realistic workload that demonstrates the downside of
this method ... both to prove that it can exceed the upside, and so that we
can use it to analyse any possible fixes. All I have at the moment is some
theory - I can perfectly understand the problem in theory, and see that it
exists ... but I'd like at least a synthetic microbenchmark that pretends
to simulate something vaguely realistic.

I'm perfectly prepared to write it myself if someone can show some
pseudo-code or whatever that gives a rough scenario, and some description
of why they think it's realistic. Personally, I don't really see it
happening in reality.

We definitely need to do some UP tests under reasonable memory pressure,
swapping to disk ... I'll try to get that underway soon.

M.
|