|
From: Philippe W. <phi...@sk...> - 2013-09-04 22:31:57
|
At least on x86 (an old Pentium, and on gcc20), I observe a performance regression (e.g. on perf/bz2) between valgrind 3.8.1 and trunk.
Also, I have the feeling that the regression tests are slower.

Does anybody share this impression?

Better still, does anybody have measurements, e.g. on mips, s390 or arm, as I no longer have access to such platforms?

E.g. assuming you have a trunk and a 3.8.1 build, you could run
perl perf/vg_perf --vg=../trunk --vg=../3.8.1 --reps=5 perf
(even if these perf tests are not very reliable).

Philippe |
|
From: Maran P. <ma...@li...> - 2013-09-05 06:09:24
Attachments:
perf-s390.out
|
On 09/05/2013 04:02 AM, Philippe Waroquiers wrote:
> At least on x86 (old pentium, and on gcc20), I observe some
> performance regression (e.g. on perf/bz2) between valgrind 3.8.1
> and trunk.
> Also, I have the feeling that regression tests are slower.
>
> Does anybody share such feelings ?
>
> And better, some measurements e.g. on mips or s390 or arm,
> as I have no access (anymore) to such platforms ?
>
> e.g. assuming you have a trunk and a 3.8.1 build, you could do
> perl perf/vg_perf --vg=../trunk --vg=../3.8.1 --reps=5 perf
>

Attached is the output of the command on s390 (z196, gcc 4.3.4).
Overall, there is a performance regression on s390 as well, specifically a 39% slowdown in bz2:

-- bz2 --
bz2 trunk :0.70s no: 6.6s ( 9.4x, -----) me:21.5s (30.7x, -----)
bz2 valgrind-3.8.1:0.70s no: 6.4s ( 9.1x, 2.9%) me:13.1s (18.8x, 38.8%)

--Maran |
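The percentage quoted for bz2 can be read in two ways, which matters when comparing it with numbers from other platforms. A minimal Python sketch, assuming the percentage is the relative change against the first build listed (vg_perf's exact formula is not shown in this thread, so this is an assumption that matches the reported 38.8% only to within rounding), using the Memcheck ("me") times from the two bz2 rows:

```python
# Memcheck ("me") times for perf/bz2 on s390, copied from the two rows above.
trunk_s = 21.5
v381_s = 13.1

def percent_less_time(baseline_s, other_s):
    """How much less time `other` takes, relative to `baseline` (assumed formula)."""
    return (baseline_s - other_s) / baseline_s * 100.0

def percent_more_time(baseline_s, other_s):
    """How much more time `other` takes, relative to `baseline`."""
    return (other_s - baseline_s) / baseline_s * 100.0

saved = percent_less_time(trunk_s, v381_s)   # 3.8.1 measured against trunk
extra = percent_more_time(v381_s, trunk_s)   # trunk measured against 3.8.1

print(f"3.8.1 takes {saved:.1f}% less time than trunk")
print(f"trunk takes {extra:.1f}% more time than 3.8.1")
```

Both lines describe the same pair of measurements: 3.8.1 needs roughly 39% less time than trunk, which is the same as trunk needing roughly 64% more time than 3.8.1.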
|
From: Julian S. <js...@ac...> - 2013-09-06 10:13:54
|
On 09/05/2013 12:32 AM, Philippe Waroquiers wrote:
> At least on x86 (old pentium, and on gcc20), I observe some
> performance regression (e.g. on perf/bz2) between valgrind 3.8.1
> and trunk.
I can't reproduce much of a problem, running 32-bit perf/bz2.c with
argument "x", either time wise or when profiling with cachegrind.
What I do see is that the debuginfo reader is slower with the trunk,
which is an expected outcome of the DiImage changes earlier this year.
But that's a one-time (startup) cost. In particular from the numbers
below, there is no change in the number of JIT generated insns run
(indeed, the trunk runs about 0.1% fewer insns) and also the number
of insns in the Memcheck helper functions,
vgMemCheck_helperc_{LOAD,STORE}*.
That said, the trunk does use about 389 million more insns than
3_8_BRANCH for this case (7446 MI vs 7057 MI). A detailed breakdown
is below. ???:??? is JIT-generated code. I have marked the differences
on a per-function basis with "**". I can account for 286 million
of the 389 million extra insns like that.
It looks like there's more expense from the debuginfo reader's
new DiImage abstraction -- that's expected. It also looks like
there is more VG_(arena_malloc)/VG_(arena_free) activity, and I am
not sure why that is, but I suspect that is also debuginfo-reader
related.
Can you get some comparable figures for the slowdown cases you have?
The command line I am using is:
vTRUNK --smc-check=all-non-file --sim-hints=enable-outer \
--tool=cachegrind -v --trace-children=yes \
./i-38branch/vg-in-place -v ./trunk/bz2-32 x
etc
J
38 (VALGRIND_BRANCH_3_8)
I refs: 7,057,290,720
I1 misses: 7,824,272
LLi misses: 15,473
I1 miss rate: 0.11%
LLi miss rate: 0.00%
D refs: 3,127,582,246 (1,754,466,827 rd + 1,373,115,419 wr)
D1 misses: 9,468,214 ( 5,176,692 rd + 4,291,522 wr)
LLd misses: 1,445,493 ( 369,350 rd + 1,076,143 wr)
D1 miss rate: 0.3% ( 0.2% + 0.3% )
LLd miss rate: 0.0% ( 0.0% + 0.0% )
LL refs: 17,292,486 ( 13,000,964 rd + 4,291,522 wr)
LL misses: 1,460,966 ( 384,823 rd + 1,076,143 wr)
LL miss rate: 0.0% ( 0.0% + 0.0% )
tr (trunk)
I refs: 7,446,972,329
I1 misses: 8,131,984
LLi misses: 16,967
I1 miss rate: 0.10%
LLi miss rate: 0.00%
D refs: 3,344,138,213 (1,885,318,942 rd + 1,458,819,271 wr)
D1 misses: 9,781,723 ( 5,366,971 rd + 4,414,752 wr)
LLd misses: 1,508,829 ( 369,808 rd + 1,139,021 wr)
D1 miss rate: 0.2% ( 0.2% + 0.3% )
LLd miss rate: 0.0% ( 0.0% + 0.0% )
LL refs: 17,913,707 ( 13,498,955 rd + 4,414,752 wr)
LL misses: 1,525,796 ( 386,775 rd + 1,139,021 wr)
LL miss rate: 0.0% ( 0.0% + 0.0% )
7,446,972,329 tr PROGRAM TOTALS
7,057,290,720 38 PROGRAM TOTALS
2,304,640,154 tr ???:???
2,314,330,603 38 ???:???
812,116,345 tr/memcheck/mc_main.c:vgMemCheck_helperc_LOADV32le
812,116,308 38/memcheck/mc_main.c:vgMemCheck_helperc_LOADV32le
437,757,807 tr/VEX/priv/host_generic_reg_alloc2.c:doRegisterAllocation
441,968,566 38/VEX/priv/host_generic_reg_alloc2.c:doRegisterAllocation
417,179,379 tr/memcheck/mc_main.c:vgMemCheck_helperc_STOREV32le
417,178,554 38/memcheck/mc_main.c:vgMemCheck_helperc_STOREV32le
260,456,841 tr/memcheck/mc_main.c:vgMemCheck_helperc_LOADV8
260,457,673 38/memcheck/mc_main.c:vgMemCheck_helperc_LOADV8
201,871,752 tr/memcheck/mc_main.c:vgMemCheck_helperc_STOREV8
201,871,568 38/memcheck/mc_main.c:vgMemCheck_helperc_STOREV8
** 136,129,108 tr/coregrind/m_debuginfo/image.c:get
111,178,911 tr/coregrind/m_libcbase.c:bm_qsort
106,745,540 38/coregrind/m_libcbase.c:bm_qsort
98,783,956 tr/memcheck/mc_main.c:vgMemCheck_helperc_STOREV16le
98,783,872 38/memcheck/mc_main.c:vgMemCheck_helperc_STOREV16le
95,418,150 tr/coregrind/m_debuginfo/storage.c:compare_DiLoc
91,288,890 38/coregrind/m_debuginfo/storage.c:compare_DiLoc
87,177,022 tr/memcheck/mc_main.c:vgMemCheck_helperc_LOADV16le
87,176,987 38/memcheck/mc_main.c:vgMemCheck_helperc_LOADV16le
86,779,773 tr/VEX/priv/ir_defs.c:sanityCheckIRSB
87,054,730 38/VEX/priv/ir_defs.c:sanityCheckIRSB
73,738,760 tr/VEX/priv/host_generic_reg_alloc2.c:sortRRLRarray
74,097,480 38/VEX/priv/host_generic_reg_alloc2.c:sortRRLRarray
72,615,700 tr/VEX/priv/host_generic_regs.c:addHRegUse
75,023,955 38/VEX/priv/host_generic_regs.c:addHRegUse
71,861,152 tr/coregrind/m_libcbase.c:vgPlain_memset
63,402,325 38/coregrind/m_libcbase.c:vgPlain_memset
68,147,584 tr/VEX/priv/ir_opt.c:ado_treebuild_BB
62,918,278 38/VEX/priv/ir_opt.c:ado_treebuild_BB
63,971,363 tr/coregrind/m_seqmatch.c:vgPlain_generic_match
60,850,900 38/coregrind/m_seqmatch.c:vgPlain_generic_match
62,063,269 tr/memcheck/mc_main.c:mc_STOREVn_slow
62,084,101 38/memcheck/mc_main.c:mc_STOREVn_slow
** 60,510,400 tr/coregrind/m_debuginfo/image.c:ensure_valid
56,873,073 tr/VEX/priv/ir_defs.c:addStmtToIRSB
56,875,973 38/VEX/priv/ir_defs.c:addStmtToIRSB
55,888,006 tr/coregrind/m_libcbase.c:vgPlain_strlen
53,063,600 38/coregrind/m_libcbase.c:vgPlain_strlen
** 45,747,346 tr/coregrind/m_mallocfree.c:vgPlain_arena_malloc
36,967,725 38/coregrind/m_mallocfree.c:vgPlain_arena_malloc
45,085,199 tr/VEX/priv/ir_defs.c:typeOfIRExpr
44,287,428 38/VEX/priv/ir_defs.c:typeOfIRExpr
43,175,340 tr/coregrind/m_libcbase.c:vgPlain_memcpy
42,725,797 38/coregrind/m_libcbase.c:vgPlain_memcpy
42,969,895 tr/VEX/priv/ir_defs.c:tcExpr
42,915,212 38/VEX/priv/ir_defs.c:tcExpr
42,439,360 tr/coregrind/m_oset.c:avl_lookup
42,419,764 38/coregrind/m_oset.c:avl_lookup
37,080,429 tr/coregrind/m_debuginfo/storage.c:vgModuleLocal_addLineInfo
35,809,595 38/coregrind/m_debuginfo/storage.c:vgModuleLocal_addLineInfo
36,488,566 tr/coregrind/m_debuginfo/readdwarf.c:vgModuleLocal_read_debuginfo_dwarf3
34,279,124 38/coregrind/m_debuginfo/readdwarf.c:vgModuleLocal_read_debuginfo_dwarf3
** 36,271,473 tr/coregrind/m_debuginfo/image.c:vgModuleLocal_img_get_UChar
33,610,964 tr/coregrind/m_libcbase.c:bm_swapfunc
32,434,962 38/coregrind/m_libcbase.c:bm_swapfunc
33,578,981 tr/VEX/priv/host_x86_defs.c:getRegUsage_X86Instr
33,444,576 38/VEX/priv/host_x86_defs.c:getRegUsage_X86Instr
33,390,654 tr/VEX/priv/ir_opt.c:cprop_BB
32,559,432 38/VEX/priv/ir_opt.c:cprop_BB
30,437,042 tr/VEX/priv/ir_opt.c:do_deadcode_BB
30,407,458 38/VEX/priv/ir_opt.c:do_deadcode_BB
29,588,775 tr/VEX/priv/ir_defs.c:useBeforeDef_Expr
29,564,448 38/VEX/priv/ir_defs.c:useBeforeDef_Expr
26,803,264 tr/coregrind/m_translate.c:get_SP_delta
26,860,296 38/coregrind/m_translate.c:get_SP_delta
26,501,047 tr/VEX/priv/ir_opt.c:subst_Expr
26,453,483 38/VEX/priv/ir_opt.c:subst_Expr
26,329,365 tr/coregrind/m_debuginfo/storage.c:compare_DiCfSI
25,147,065 38/coregrind/m_debuginfo/storage.c:compare_DiCfSI
26,241,102 tr/memcheck/mc_main.c:mc_LOADVn_slow
26,250,724 38/memcheck/mc_main.c:mc_LOADVn_slow
25,535,225 tr/coregrind/m_libcbase.c:vgPlain_strcmp
25,646,207 38/coregrind/m_libcbase.c:vgPlain_strcmp
** 25,102,518 tr/coregrind/m_mallocfree.c:vgPlain_arena_free
17,308,847 38/coregrind/m_mallocfree.c:vgPlain_arena_free
** 23,670,379 tr/coregrind/m_debuginfo/readdwarf.c:step_leb128
26,618,544 38/coregrind/m_debuginfo/readdwarf.c:read_leb128
22,930,457 tr/VEX/priv/ir_opt.c:addUses_Expr
22,964,761 38/VEX/priv/ir_opt.c:addUses_Expr
22,522,597 tr/VEX/priv/main_main.c:LibVEX_Translate
22,629,945 38/VEX/priv/main_main.c:LibVEX_Translate
22,245,402 tr/VEX/priv/ir_defs.c:typeOfPrimop
22,214,549 38/VEX/priv/ir_defs.c:typeOfPrimop
21,987,289 tr/coregrind/m_debuginfo/storage.c:vgModuleLocal_canonicaliseTables
21,143,977 38/coregrind/m_debuginfo/storage.c:vgModuleLocal_canonicaliseTables
19,269,294 tr/VEX/priv/ir_opt.c:fold_Expr
19,066,110 38/VEX/priv/ir_opt.c:fold_Expr
18,788,278 tr/coregrind/m_debuginfo/readdwarf.c:run_CF_instruction.isra.19
17,294,121 38/coregrind/m_debuginfo/readdwarf.c:run_CF_instruction.isra.10
18,768,679 tr/coregrind/m_translate.c:vg_SP_update_pass
18,789,947 38/coregrind/m_translate.c:vg_SP_update_pass
18,762,360 tr/coregrind/m_seqmatch.c:vgPlain_string_match
17,599,620 38/coregrind/m_seqmatch.c:vgPlain_string_match
18,612,347 tr/coregrind/m_xarray.c:vgPlain_addToXA
18,595,595 38/coregrind/m_xarray.c:vgPlain_addToXA
18,496,986 tr/memcheck/mc_main.c:set_address_range_perms
18,498,033 38/memcheck/mc_main.c:set_address_range_perms
17,841,966 tr/VEX/priv/ir_defs.c:isFlatIRStmt
17,654,999 38/VEX/priv/ir_defs.c:isFlatIRStmt
17,436,987 tr/VEX/priv/ir_opt.c:atbSubst_Expr
17,701,033 38/VEX/priv/ir_opt.c:atbSubst_Expr
17,357,130 tr/coregrind/m_debuginfo/readdwarf.c:index_WordArray.isra.9
16,593,930 38/coregrind/m_debuginfo/readdwarf.c:index_WordArray.isra.2
17,173,815 tr/VEX/priv/host_generic_regs.c:addHInstr
17,417,145 38/VEX/priv/host_generic_regs.c:addHInstr
** 17,145,796 tr/coregrind/m_mallocfree.c:blockSane.isra.10
11,861,896 38/coregrind/m_mallocfree.c:blockSane.isra.10
** 17,140,668 tr/coregrind/m_mallocfree.c:mkFreeBlock
12,190,038 38/coregrind/m_mallocfree.c:mkFreeBlock
** 17,120,340 tr/coregrind/m_debuginfo/image.c:vgModuleLocal_img_get
16,714,522 tr/VEX/priv/ir_opt.c:addToHHW
16,710,864 38/VEX/priv/ir_opt.c:addToHHW
16,606,069 tr/coregrind/m_mallocfree.c:pszB_to_listNo
15,022,533 38/coregrind/m_mallocfree.c:pszB_to_listNo
16,576,319 tr/VEX/priv/ir_opt.c:invalidateOverlaps
16,570,209 38/VEX/priv/ir_opt.c:invalidateOverlaps
** 16,124,374 tr/coregrind/m_mallocfree.c:unlinkBlock
11,605,359 38/coregrind/m_mallocfree.c:unlinkBlock
15,593,494 tr/coregrind/m_debuginfo/debuginfo.c:vgModuleLocal_find_rx_mapping
14,923,244 38/coregrind/m_debuginfo/debuginfo.c:vgModuleLocal_find_rx_mapping
15,480,727 tr/coregrind/m_redir.c:generate_and_add_actives
14,460,381 38/coregrind/m_redir.c:generate_and_add_actives
15,026,454 tr/coregrind/m_debuginfo/storage.c:vgModuleLocal_addDiCfSI
13,740,477 38/coregrind/m_debuginfo/storage.c:vgModuleLocal_addDiCfSI
** 14,831,773 tr/coregrind/m_debuginfo/readdwarf.c:vgModuleLocal_read_callframe_info_dwarf3
12,966,381 38/coregrind/m_debuginfo/readdwarf.c:vgModuleLocal_read_callframe_info_dwarf3
14,634,988 tr/memcheck/mc_main.c:set_sec_vbits8
14,634,988 38/memcheck/mc_main.c:set_sec_vbits8
14,536,545 38/memcheck/mc_translate.c:isAlwaysDefd.isra.7
14,541,239 tr/memcheck/mc_translate.c:isAlwaysDefd.isra.8
** 13,971,306 tr/VEX/priv/host_x86_isel.c:iselSB_X86
13,075,995 38/VEX/priv/host_x86_isel.c:iselSB_X86
** 13,814,584 tr/coregrind/m_mallocfree.c:mkInuseBlock
9,714,720 38/coregrind/m_mallocfree.c:mkInuseBlock
12,944,056 tr/VEX/priv/host_x86_defs.c:emit_X86Instr
12,986,740 38/VEX/priv/host_x86_defs.c:emit_X86Instr
12,664,489 tr/memcheck/mc_translate.c:vgMemCheck_instrument
12,542,508 38/memcheck/mc_translate.c:vgMemCheck_instrument
12,485,517 tr/VEX/priv/ir_defs.c:newIRTemp
12,471,403 38/VEX/priv/ir_defs.c:newIRTemp
11,575,473 tr/coregrind/m_seqmatch.c:char_p_EQ_i
11,049,636 38/coregrind/m_seqmatch.c:char_p_EQ_i
11,525,758 tr/VEX/priv/host_generic_regs.h:doRegisterAllocation
12,032,581 38/VEX/priv/host_generic_regs.h:doRegisterAllocation
11,272,953 tr/VEX/priv/ir_opt.c:redundant_put_removal_BB
11,582,604 38/VEX/priv/ir_opt.c:redundant_put_removal_BB
10,017,301 tr/VEX/priv/ir_opt.c:lookupHHW
10,013,083 38/VEX/priv/ir_opt.c:lookupHHW
10,007,386 tr/memcheck/mc_main.c:get_secmap_for_writing
10,011,994 38/memcheck/mc_main.c:get_secmap_for_writing
|
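The per-function comparison in the message above can be automated. The following Python sketch is only an illustration, not the tooling used in the thread: it assumes each entry sits on one line in the form "count build/path:function", with "tr" and "38" as the build prefixes and an optional "**" marker, and it pairs entries by identical path and function name. The sample lines are copied from the listing; the helper name instruction_deltas is hypothetical.

```python
import re
from collections import defaultdict

# A few lines copied from the breakdown above; the full listing can be pasted in.
SAMPLE = """\
** 45,747,346 tr/coregrind/m_mallocfree.c:vgPlain_arena_malloc
36,967,725 38/coregrind/m_mallocfree.c:vgPlain_arena_malloc
** 25,102,518 tr/coregrind/m_mallocfree.c:vgPlain_arena_free
17,308,847 38/coregrind/m_mallocfree.c:vgPlain_arena_free
** 60,510,400 tr/coregrind/m_debuginfo/image.c:ensure_valid
812,116,345 tr/memcheck/mc_main.c:vgMemCheck_helperc_LOADV32le
812,116,308 38/memcheck/mc_main.c:vgMemCheck_helperc_LOADV32le
"""

# Optional "*" markers, an instruction count with thousands separators,
# then the build prefix ("tr" or "38") and the path:function symbol.
LINE_RE = re.compile(r"^\**\s*([\d,]+)\s+(tr|38)/(\S+)$")

def instruction_deltas(text):
    """Return {symbol: trunk_insns - branch38_insns}; a missing side counts as 0."""
    counts = defaultdict(lambda: {"tr": 0, "38": 0})
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a single-line entry
        insns, build, symbol = m.groups()
        counts[symbol][build] = int(insns.replace(",", ""))
    return {sym: c["tr"] - c["38"] for sym, c in counts.items()}

deltas = instruction_deltas(SAMPLE)
for symbol, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{delta:>+15,} {symbol}")
```

Functions that exist only on trunk (such as the image.c entries) show up with their full count as the delta. Functions that were renamed between the two builds, or whose compiler-generated suffix differs (the .isra.N variants), will not pair up under this simple matching and need to be reconciled by hand.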
|
From: Florian K. <fl...@ei...> - 2013-09-06 13:42:04
|
On 09/05/2013 12:32 AM, Philippe Waroquiers wrote:
> At least on x86 (old pentium, and on gcc20), I observe some
> performance regression (e.g. on perf/bz2) between valgrind 3.8.1
> and trunk.
> Also, I have the feeling that regression tests are slower.
>
> Does anybody share such feelings ?
>
> And better, some measurements e.g. on mips or s390 or arm,
> as I have no access (anymore) to such platforms ?
>

As Maran already reported there is performance regression on s390. It was guessed by Julian that r13278 might be to blame (at least partially). That was a good guess. Here is what I see on a z10-EC:

tr-13278 is trunk with r13278 (only) backed out.

perl perf/vg_perf --vg=../3.8.1 --vg=../trunk --reps=5 perf
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 3.8.1 :0.49s no: 4.6s ( 9.3x, -----) me: 8.1s (16.6x, -----)
bigcode1 trunk :0.50s no: 4.8s ( 9.5x, -3.5%) me: 8.4s (16.8x, -3.7%)
bigcode1 tr-13278 :0.49s no: 4.8s ( 9.7x, -3.7%) me: 8.3s (16.9x, -2.1%)
-- bigcode2 --
bigcode2 3.8.1 :0.49s no: 8.9s (18.2x, -----) me:17.1s (34.9x, -----)
bigcode2 trunk :0.51s no: 9.0s (17.7x, -1.5%) me:17.8s (34.9x, -4.0%)
bigcode2 tr-13278 :0.49s no: 9.1s (18.5x, -1.9%) me:17.4s (35.5x, -1.9%)
-- bz2 --
bz2 3.8.1 :1.07s no: 5.5s ( 5.1x, -----) me:24.2s (22.6x, -----)
bz2 trunk :1.07s no: 5.7s ( 5.3x, -2.7%) me:29.0s (27.1x,-19.7%)
bz2 tr-13278 :1.07s no: 5.5s ( 5.2x, -1.1%) me:21.5s (20.1x, 11.1%)
-- fbench --
fbench 3.8.1 :0.80s no: 2.8s ( 3.5x, -----) me:10.6s (13.3x, -----)
fbench trunk :0.80s no: 3.6s ( 4.5x,-26.2%) me:12.2s (15.3x,-15.2%)
fbench tr-13278 :0.80s no: 3.5s ( 4.3x,-22.7%) me:10.5s (13.1x, 0.8%)
-- ffbench --
ffbench 3.8.1 :0.52s no: 1.5s ( 2.8x, -----) me: 5.7s (11.0x, -----)
ffbench trunk :0.52s no: 2.3s ( 4.4x,-57.5%) me: 7.7s (14.9x,-35.6%)
ffbench tr-13278 :0.52s no: 2.3s ( 4.4x,-54.4%) me: 6.2s (11.9x, -8.1%)
-- heap --
heap 3.8.1 :0.33s no: 1.8s ( 5.5x, -----) me:14.4s (43.7x, -----)
heap trunk :0.33s no: 2.2s ( 6.7x,-20.2%) me:16.3s (49.5x,-13.2%)
heap tr-13278 :0.33s no: 1.9s ( 5.7x, -3.3%) me:14.4s (43.7x, 0.1%)
-- heap_pdb4 --
heap_pdb 3.8.1 :0.33s no: 2.0s ( 6.1x, -----) me:21.2s (64.2x, -----)
heap_pdb trunk :0.32s no: 2.3s ( 7.2x,-11.7%) me:23.0s (71.8x, -8.6%)
heap_pdb tr-13278 :0.33s no: 2.1s ( 6.4x, -5.0%) me:21.2s (64.2x, 0.0%)
-- many-loss-records --
many-los 3.8.1 :0.03s no: 0.4s (14.7x, -----) me: 3.7s (122.0x, ----)
many-los trunk :0.04s no: 0.5s (13.0x,-15.6%) me: 3.9s (96.2x, -5.2%)
many-los tr-13278 :0.03s no: 0.5s (17.3x,-18.2%) me: 3.6s (119.7x, 1.9%)
-- many-xpts --
many-xpt 3.8.1 :0.06s no: 0.7s (11.5x, -----) me: 5.0s (83.5x, -----)
many-xpt trunk :0.08s no: 0.8s (10.1x,-20.9%) me: 5.4s (67.9x, -9.7%)
many-xpt tr-13278 :0.06s no: 0.8s (12.5x, -8.7%) me: 5.0s (83.0x, 0.6%)
-- sarp --
sarp 3.8.1 :0.04s no: 0.5s (13.0x, -----) me: 5.7s (142.8x, -----)
sarp trunk :0.04s no: 0.7s (16.8x,-26.4%) me: 7.2s (179.8x,-26.1%)
sarp tr-13278 :0.04s no: 0.6s (14.0x, -7.7%) me: 6.3s (158.0x,-10.7%)
-- tinycc --
tinycc 3.8.1 :0.35s no: 3.9s (11.1x, -----) me:26.3s (75.1x, -----)
tinycc trunk :0.35s no: 4.1s (11.8x, -6.7%) me:30.3s (86.7x,-15.5%)
tinycc tr-13278 :0.35s no: 3.9s (11.2x, -1.0%) me:25.3s (72.3x, 3.7%)
-- Finished tests in perf ------------------------------------

sarp and ffbench still regress notably.

Florian |
|
From: Philippe W. <phi...@sk...> - 2013-09-07 18:35:46
|
On Fri, 2013-09-06 at 15:52 +0200, Julian Seward wrote:
> Also, r13278 has effect for 64 bit targets but not for 32 bit targets
> since 32 bit targets deal with all 4GB "fast" anyway. So:
>
> - it can't account for any slowdowns we see on x86-linux
> - it should also give visible slowdowns on other 64 bit targets,
> not just s390
>
> We need more numbers :)

Below, you will find numbers for ppc64 and ppc32 (gcc110).

Basically,
ppc32 trunk is (slightly) faster than 3.8.1
ppc64 trunk is similar to 3.8.1, sometimes slightly faster, sometimes slightly slower. sarp trunk is however significantly slower than 3.8.1
(for the record, the below also contains numbers for 3.7.0, which is significantly slower for all tests).

So, nothing to worry about on ppc except maybe for sarp in 64 bits.

Philippe

ppc32:

-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 32bits_trunk_untouched:0.15s no: 2.1s (13.9x, -----) me: 4.4s (29.3x, -----)
bigcode1 valgrind-3.8.1:0.15s no: 2.1s (14.2x, -1.9%) me: 4.6s (30.7x, -4.8%)
bigcode1 valgrind-3.7.0:0.15s no: 2.8s (18.8x,-34.9%) me: 5.5s (36.6x,-25.1%)
-- bigcode2 --
bigcode2 32bits_trunk_untouched:0.19s no: 4.9s (26.0x, -----) me:11.3s (59.6x, -----)
bigcode2 valgrind-3.8.1:0.19s no: 5.0s (26.4x, -1.4%) me:11.5s (60.4x, -1.4%)
bigcode2 valgrind-3.7.0:0.19s no: 5.7s (29.9x,-15.2%) me:12.4s (65.1x, -9.2%)
-- bz2 --
bz2 32bits_trunk_untouched:0.70s no: 3.9s ( 5.6x, -----) me:10.9s (15.6x, -----)
bz2 valgrind-3.8.1:0.70s no: 3.9s ( 5.6x, 0.3%) me:11.2s (16.1x, -2.9%)
bz2 valgrind-3.7.0:0.70s no: 4.6s ( 6.6x,-16.8%) me:12.0s (17.1x, -9.7%)
-- fbench --
fbench 32bits_trunk_untouched:0.33s no: 2.1s ( 6.2x, -----) me: 4.8s (14.6x, -----)
fbench valgrind-3.8.1:0.33s no: 2.0s ( 6.1x, 2.9%) me: 4.9s (14.8x, -1.4%)
fbench valgrind-3.7.0:0.33s no: 2.2s ( 6.6x, -5.3%) me: 5.1s (15.4x, -5.4%)
-- ffbench --
ffbench 32bits_trunk_untouched:0.36s no: 0.9s ( 2.6x, -----) me: 2.2s ( 6.2x, -----)
ffbench valgrind-3.8.1:0.36s no: 0.9s ( 2.6x, 0.0%) me: 2.4s ( 6.8x, -8.0%)
ffbench valgrind-3.7.0:0.36s no: 1.0s ( 2.9x, -9.6%) me: 2.4s ( 6.6x, -5.3%)
-- heap --
heap 32bits_trunk_untouched:0.44s no: 2.0s ( 4.6x, -----) me: 8.2s (18.6x, -----)
heap valgrind-3.8.1:0.44s no: 1.9s ( 4.4x, 4.0%) me: 8.7s (19.8x, -6.6%)
heap valgrind-3.7.0:0.44s no: 2.3s ( 5.3x,-14.9%) me: 8.7s (19.7x, -5.9%)
-- heap_pdb4 --
heap_pdb4 32bits_trunk_untouched:0.45s no: 2.2s ( 4.9x, -----) me:11.9s (26.5x, -----)
heap_pdb4 valgrind-3.8.1:0.45s no: 2.1s ( 4.6x, 5.9%) me:12.5s (27.9x, -5.1%)
heap_pdb4 valgrind-3.7.0:0.45s no: 2.5s ( 5.5x,-11.3%) me:15.6s (34.8x,-31.1%)
-- many-loss-records --
many-loss-records 32bits_trunk_untouched:0.04s no: 0.4s (10.5x, -----) me: 1.7s (42.8x, -----)
many-loss-records valgrind-3.8.1:0.04s no: 0.4s ( 9.5x, 9.5%) me: 1.8s (45.0x, -5.3%)
many-loss-records valgrind-3.7.0:0.04s no: 0.4s ( 9.5x, 9.5%) me: 1.9s (47.8x,-11.7%)
-- many-xpts --
many-xpts 32bits_trunk_untouched:0.08s no: 0.8s ( 9.9x, -----) me: 2.1s (26.5x, -----)
many-xpts valgrind-3.8.1:0.08s no: 0.8s ( 9.4x, 5.1%) me: 2.1s (26.4x, 0.5%)
many-xpts valgrind-3.7.0:0.08s no: 0.7s ( 9.0x, 8.9%) me: 2.1s (26.6x, -0.5%)
-- sarp --
sarp 32bits_trunk_untouched:0.02s no: 0.3s (15.5x, -----) me: 2.5s (126.5x, -----)
sarp valgrind-3.8.1:0.02s no: 0.3s (14.5x, 6.5%) me: 2.5s (125.5x, 0.8%)
sarp valgrind-3.7.0:0.02s no: 0.3s (13.5x, 12.9%) me: 2.7s (133.5x, -5.5%)
-- tinycc --
tinycc 32bits_trunk_untouched:0.32s no: 2.9s ( 9.2x, -----) me:10.5s (32.9x, -----)
tinycc valgrind-3.8.1:0.32s no: 2.9s ( 9.2x, 0.0%) me:10.9s (34.0x, -3.2%)
tinycc valgrind-3.7.0:0.32s no: 3.0s ( 9.5x, -3.4%) me:11.1s (34.5x, -4.8%)
-- Finished tests in perf ----------------------------------------------

== 11 programs, 66 timings =================

ppc64:

perl perf/vg_perf --vg=../trunk_untouched --vg=../valgrind-3.8.1 --vg=../valgrind-3.7.0 --reps=5 perf 2>&1 | tee perf.out
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 trunk_untouched:0.21s no: 1.6s ( 7.5x, -----) me: 3.0s (14.1x, -----)
bigcode1 valgrind-3.8.1:0.21s no: 1.7s ( 8.0x, -5.7%) me: 2.8s (13.3x, 5.4%)
bigcode1 valgrind-3.7.0:0.21s no: 2.3s (10.8x,-43.7%) me: 3.6s (17.3x,-23.0%)
-- bigcode2 --
bigcode2 trunk_untouched:0.21s no: 1.5s ( 7.3x, -----) me: 2.9s (14.0x, -----)
bigcode2 valgrind-3.8.1:0.21s no: 1.5s ( 7.2x, 0.7%) me: 2.9s (13.8x, 1.0%)
bigcode2 valgrind-3.7.0:0.21s no: 2.3s (11.0x,-50.3%) me: 3.7s (17.7x,-27.0%)
-- bz2 --
bz2 trunk_untouched:0.72s no: 4.6s ( 6.3x, -----) me:11.8s (16.4x, -----)
bz2 valgrind-3.8.1:0.72s no: 4.5s ( 6.3x, 1.1%) me:12.3s (17.1x, -4.0%)
bz2 valgrind-3.7.0:0.72s no: 5.4s ( 7.6x,-19.0%) me:13.2s (18.3x,-11.4%)
-- fbench --
fbench trunk_untouched:0.34s no: 2.1s ( 6.3x, -----) me: 5.2s (15.4x, -----)
fbench valgrind-3.8.1:0.34s no: 2.1s ( 6.2x, 2.3%) me: 5.3s (15.5x, -0.6%)
fbench valgrind-3.7.0:0.34s no: 2.3s ( 6.8x, -7.9%) me: 5.5s (16.1x, -4.0%)
-- ffbench --
ffbench trunk_untouched:0.44s no: 1.0s ( 2.3x, -----) me: 2.5s ( 5.7x, -----)
ffbench valgrind-3.8.1:0.44s no: 1.0s ( 2.2x, 4.9%) me: 2.6s ( 5.8x, -2.4%)
ffbench valgrind-3.7.0:0.44s no: 1.2s ( 2.6x,-13.7%) me: 2.5s ( 5.7x, 0.8%)
-- heap --
heap trunk_untouched:0.41s no: 2.4s ( 5.9x, -----) me: 9.9s (24.1x, -----)
heap valgrind-3.8.1:0.41s no: 2.4s ( 5.8x, 2.1%) me: 9.9s (24.1x, -0.3%)
heap valgrind-3.7.0:0.41s no: 3.1s ( 7.6x,-28.6%) me: 9.7s (23.7x, 1.6%)
-- heap_pdb4 --
heap_pdb4 trunk_untouched:0.41s no: 2.6s ( 6.3x, -----) me:13.9s (34.0x, -----)
heap_pdb4 valgrind-3.8.1:0.41s no: 2.5s ( 6.2x, 1.6%) me:13.8s (33.6x, 1.2%)
heap_pdb4 valgrind-3.7.0:0.41s no: 3.4s ( 8.2x,-30.6%) me:17.4s (42.4x,-24.9%)
-- many-loss-records --
many-loss-records trunk_untouched:0.03s no: 0.6s (18.3x, -----) me: 2.2s (74.0x, -----)
many-loss-records valgrind-3.8.1:0.03s no: 0.5s (17.3x, 5.5%) me: 2.2s (74.3x, -0.5%)
many-loss-records valgrind-3.7.0:0.03s no: 0.6s (18.3x, 0.0%) me: 2.3s (77.0x, -4.1%)
-- many-xpts --
many-xpts trunk_untouched:0.07s no: 0.7s (10.6x, -----) me: 3.4s (48.1x, -----)
many-xpts valgrind-3.8.1:0.07s no: 0.7s ( 9.9x, 6.8%) me: 3.4s (47.9x, 0.6%)
many-xpts valgrind-3.7.0:0.07s no: 0.7s (10.4x, 1.4%) me: 3.3s (47.6x, 1.2%)
-- sarp --
sarp trunk_untouched:0.02s no: 0.4s (21.0x, -----) me: 3.9s (195.5x, -----)
sarp valgrind-3.8.1:0.02s no: 0.4s (18.5x, 11.9%) me: 3.2s (160.5x, 17.9%)
sarp valgrind-3.7.0:0.02s no: 0.3s (16.5x, 21.4%) me: 3.1s (155.5x, 20.5%)
-- tinycc --
tinycc trunk_untouched:0.26s no: 3.0s (11.5x, -----) me:14.3s (54.8x, -----)
tinycc valgrind-3.8.1:0.26s no: 2.9s (11.3x, 1.3%) me:14.3s (54.9x, -0.1%)
tinycc valgrind-3.7.0:0.26s no: 3.1s (12.0x, -4.7%) me:14.2s (54.7x, 0.2%)
-- Finished tests in perf ----------------------------------------------

== 11 programs, 66 timings ================= |
|
From: Petar J. <mip...@gm...> - 2013-09-08 02:06:45
|
Here are the numbers for MIPS32.

$ perl perf/vg_perf --vg=../3.8.1 --vg=../trunk --reps=5 perf 2>&1 | tee perf_mips.txt
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 3.8.1 :0.47s no: 9.3s (19.9x, -----) me:16.5s (35.1x, -----)
bigcode1 trunk :0.47s no: 9.8s (20.8x, -4.8%) me:14.2s (30.3x, 13.9%)
-- bigcode2 --
bigcode2 3.8.1 :0.49s no:16.4s (33.4x, -----) me:30.8s (62.8x, -----)
bigcode2 trunk :0.49s no:16.8s (34.2x, -2.6%) me:28.8s (58.7x, 6.5%)
-- bz2 --
bz2 3.8.1 :2.40s no:13.6s ( 5.7x, -----) me:47.3s (19.7x, -----)
bz2 trunk :2.40s no:13.9s ( 5.8x, -2.4%) me:39.7s (16.5x, 16.0%)
-- fbench --
fbench 3.8.1 :1.79s no:40.0s (22.4x, -----) me:54.9s (30.7x, -----)
fbench trunk :1.79s no:32.4s (18.1x, 19.0%) me:41.6s (23.3x, 24.1%)
-- ffbench --
ffbench 3.8.1 :0.84s no:23.3s (27.7x, -----) me:37.4s (44.5x, -----)
ffbench trunk :0.84s no:22.8s (27.1x, 2.2%) me:32.1s (38.2x, 14.0%)
-- heap --
heap 3.8.1 :1.45s no: 5.9s ( 4.1x, -----) me:34.7s (23.9x, -----)
heap trunk :1.45s no: 6.5s ( 4.5x,-10.3%) me:34.4s (23.7x, 0.7%)
-- heap_pdb4 --
heap_pdb4 3.8.1 :1.52s no: 6.4s ( 4.2x, -----) me:52.7s (34.7x, -----)
heap_pdb4 trunk :1.52s no: 7.3s ( 4.8x,-13.7%) me:55.3s (36.4x, -4.9%)
-- many-loss-records --
many-loss-records 3.8.1 :0.16s no: 1.3s ( 7.9x, -----) me: 6.7s (41.9x, -----)
many-loss-records trunk :0.16s no: 1.7s (10.6x,-33.9%) me: 7.2s (44.9x, -7.2%)
-- many-xpts --
many-xpts 3.8.1 :0.30s no: 2.3s ( 7.6x, -----) me: 7.2s (24.1x, -----)
many-xpts trunk :0.30s no: 2.7s ( 9.1x,-20.3%) me: 7.7s (25.8x, -6.9%)
-- sarp --
sarp 3.8.1 :0.06s no: 1.4s (24.0x, -----) me: 8.8s (147.2x, -----)
sarp trunk :0.06s no: 1.9s (32.2x,-34.0%) me:10.0s (167.2x,-13.6%)
-- tinycc --
tinycc 3.8.1 :1.00s no: 9.9s ( 9.9x, -----) me:39.7s (39.7x, -----)
tinycc trunk :1.00s no:10.3s (10.3x, -3.7%) me:37.8s (37.8x, 4.7%)
-- Finished tests in perf ----------------------------------------------

== 11 programs, 44 timings =================

Regards,
Petar |
|
From: Julian S. <js...@ac...> - 2013-09-06 13:53:17
|
On 09/06/2013 03:41 PM, Florian Krohm wrote:
> As Maran already reported there is performance regression on s390. It
> was guessed by Julian that r13278 might be to blame (at least

Thanks for the numbers.

> -- bz2 --
> bz2 3.8.1 :1.07s no: 5.5s ( 5.1x, -----) me:24.2s (22.6x, -----)
> bz2 trunk :1.07s no: 5.7s ( 5.3x, -2.7%) me:29.0s (27.1x,-19.7%)
> bz2 tr-13278 :1.07s no: 5.5s ( 5.2x, -1.1%) me:21.5s (20.1x, 11.1%)

Something clearly bad happened. But I would be hard put to guess why
r13278 has had such a big effect. The only thing it does is to double
the size of Memcheck's primary map, so we can deal "fast" with up to
32GB of address space rather than 16GB. Maybe it caused a lot more
L2/L3 misses -- but the number of instructions should be the same,
unless it caused the s390 code generator or chaining-patchers to
generate worse code somehow. Without profiling with cachegrind, it's
impossible to find out.

Also, r13278 has effect for 64 bit targets but not for 32 bit targets
since 32 bit targets deal with all 4GB "fast" anyway. So:

- it can't account for any slowdowns we see on x86-linux
- it should also give visible slowdowns on other 64 bit targets,
  not just s390

We need more numbers :)

J
|
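To put rough numbers on the primary map doubling described above, here is a minimal sketch of the arithmetic. It assumes that each primary map entry is one pointer covering a 64 KiB chunk of address space; that chunk size is not stated in this thread, and the function names are made up for illustration.

```python
# Rough sizing of a flat primary map that covers an address range "fast".
# ASSUMPTION (not from the thread): one pointer-sized entry per 64 KiB chunk.
CHUNK_BYTES = 64 * 1024
GIB = 1024 ** 3


def primary_map_entries(covered_bytes, chunk_bytes=CHUNK_BYTES):
    """Number of entries needed to cover covered_bytes of address space."""
    return covered_bytes // chunk_bytes


def primary_map_bytes(covered_bytes, pointer_bytes):
    """Size of the primary map itself, given the host pointer size."""
    return primary_map_entries(covered_bytes) * pointer_bytes


if __name__ == "__main__":
    cases = [
        ("32 bit target, 4GB", 4 * GIB, 4),
        ("64 bit target, 16GB", 16 * GIB, 8),
        ("64 bit target, 32GB", 32 * GIB, 8),
    ]
    for label, covered, pointer_bytes in cases:
        entries = primary_map_entries(covered)
        size_kib = primary_map_bytes(covered, pointer_bytes) // 1024
        print(f"{label}: {entries} entries, {size_kib} KiB")
```

Under that assumption the map grows from 2 MiB to 4 MiB on a 64 bit target, so any code that walks the whole map touches twice as many entries, while a lookup of a single address still executes the same number of instructions.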
|
From: Philippe W. <phi...@sk...> - 2013-09-07 01:10:21
|
On Fri, 2013-09-06 at 15:52 +0200, Julian Seward wrote:
> Also, r13278 has effect for 64 bit targets but not for 32 bit targets
> since 32 bit targets deal with all 4GB "fast" anyway. So:
>
> - it can't account for any slowdowns we see on x86-linux
> - it should also give visible slowdowns on other 64 bit targets,
>   not just s390
>
> We need more numbers :)

Yes. I looked a little bit more in depth at the x86 slowdown.
This seems to be limited to the old pentium 4.
It looks like there is no slowdown for a 32 bit bz2 with Valgrind
on a 64 bit OS/modern cpu.
(I have already encountered inexplicable performance variations
on a P4 in the past.)

It would be nice to have mips or arm (also 32 bit) performance numbers.

Philippe
|
|
From: Florian K. <fl...@ei...> - 2013-09-13 17:32:23
|
On 09/06/2013 03:52 PM, Julian Seward wrote:
>
> On 09/06/2013 03:41 PM, Florian Krohm wrote:
>> As Maran already reported there is performance regression on s390. It
>> was guessed by Julian that r13278 might be to blame (at least
>>
>> -- bz2 --
>> bz2 3.8.1 :1.07s no: 5.5s ( 5.1x, -----) me:24.2s (22.6x, -----)
>> bz2 trunk :1.07s no: 5.7s ( 5.3x, -2.7%) me:29.0s (27.1x,-19.7%)
>> bz2 tr-13278 :1.07s no: 5.5s ( 5.2x, -1.1%) me:21.5s (20.1x, 11.1%)
>
I ran an outer cachegrind on an inner memcheck for this test and these
are the numbers: (sorted by decreasing value of Ir for 3.8.1):
            Ir        I1mr   ILmr              Dr       D1mr    DLmr              Dw       D1mw     DLmw
----------------------------------------------------------------------------------------------------------------------
72,997,397,630  12,797,190  2,274  22,347,380,826  2,371,512  26,461  17,403,922,739  1,534,747  266,262  3.8.1 PROGRAM TOTALS
69,333,739,511   3,511,368  2,312  21,740,631,694  2,909,031  36,899  17,764,621,510  1,533,404  231,036  trunk PROGRAM TOTALS
35,608,425,742   7,562,756      1   6,976,846,407    283,066   3,044  11,858,485,037    147,238   10,104  3.8.1/???:???
31,128,957,860     642,860      1   6,089,014,798    291,818   3,043  12,095,896,510    147,855   10,105  trunk/???:???
10,034,568,249       1,354      1   4,362,833,755     57,902       5   1,308,861,804          0        0  3.8.1/memcheck/mc_main.c:vgMemCheck_helperc_LOADV32be
10,034,568,717       1,353      1   4,362,833,950     58,048       4   1,308,861,867          0        0  trunk/memcheck/mc_main.c:vgMemCheck_helperc_LOADV32be
 8,559,800,930       3,663      1   3,231,420,385     25,856       3   1,053,083,991          0        0  3.8.1/memcheck/mc_main.c:vgMemCheck_helperc_LOADV8
 8,559,797,583       4,348      1   3,231,419,110     24,830       3   1,053,083,538          0        0  trunk/memcheck/mc_main.c:vgMemCheck_helperc_LOADV8
 4,072,857,138       2,087      1   1,851,233,620      3,321       0     555,388,164          0        0  3.8.1/memcheck/mc_main.c:vgMemCheck_helperc_LOADV64be
 4,072,856,570       1,477      1   1,851,233,540      3,319       0     555,388,056          0        0  trunk/memcheck/mc_main.c:vgMemCheck_helperc_LOADV64be
 3,841,038,415       2,389      2   1,501,089,815      4,826       0     500,875,259          0        0  3.8.1/memcheck/mc_main.c:vgMemCheck_helperc_STOREV32be
 3,842,564,343       1,955      2   1,501,852,779      5,517       1     500,875,259          0        0  trunk/memcheck/mc_main.c:vgMemCheck_helperc_STOREV32be
 2,216,590,810       1,670      2     636,072,059     11,428       1     203,838,900          0        0  3.8.1/memcheck/mc_main.c:vgMemCheck_helperc_STOREV8
 2,320,601,318       2,401      2     688,077,737     11,606       1     203,838,708          0        0  trunk/memcheck/mc_main.c:vgMemCheck_helperc_STOREV8
 1,016,610,024       3,779      3     257,248,286        336       0     128,222,600         32        0  3.8.1/memcheck/mc_main.c:mc_STOREVn_slow
 1,016,633,940       3,250      4     257,260,244        798       0     128,222,600         33        0  trunk/memcheck/mc_main.c:mc_STOREVn_slow
   858,943,054       1,478      3     397,509,173      6,917       0     230,818,125          0        0  3.8.1/memcheck/mc_main.c:vgMemCheck_helperc_STOREV16be
   901,694,116       1,006      2     418,884,704      8,902       0     230,818,125          0        0  trunk/memcheck/mc_main.c:vgMemCheck_helperc_STOREV16be
   827,544,222       8,913      2     388,568,411     18,172       2      84,117,990        108        0  3.8.1/coregrind/m_oset.c:avl_lookup
   827,545,408       8,278      2     388,569,016     16,714       2      84,118,130         94        0  trunk/coregrind/m_oset.c:avl_lookup
   756,414,389       1,771      2     526,198,587      5,736       0     328,874,130          0        0  3.8.1/memcheck/mc_main.c:vgMemCheck_helperc_LOADV16be
   756,414,389       1,425      2     526,198,587      8,056       0     328,874,130          0        0  trunk/memcheck/mc_main.c:vgMemCheck_helperc_LOADV16be
   529,328,963      88,096     32     180,711,106        101       0      15,451,624     39,686      328  3.8.1/VEX/priv/host_generic_reg_alloc2.c:doRegisterAllocation
   434,023,470      72,418     27     147,728,895      2,534       1      15,115,589     37,159      330  trunk/VEX/priv/host_generic_reg_alloc2.c:doRegisterAllocation
   399,498,327     286,213      2     156,865,141      1,080       0      57,037,400          0        0  3.8.1/memcheck/mc_main.c:vgMemCheck_helperc_STOREV64be
   428,017,399       1,752      2     171,124,656      1,147       1      57,037,428          0        0  trunk/memcheck/mc_main.c:vgMemCheck_helperc_STOREV64be
   337,993,314       6,079      4      94,303,881      2,791      58      55,437,763         71        0  3.8.1/memcheck/mc_main.c:mc_LOADVn_slow
   337,980,469       4,022      3      94,300,384      3,392      59      55,436,640        100        0  trunk/memcheck/mc_main.c:mc_LOADVn_slow
   238,512,464         103      7      55,165,051    563,560  16,442           1,437         34        0  3.8.1/memcheck/mc_main.c:mc_expensive_sanity_check
   470,254,654         103      7     108,645,874  1,120,560  32,804           1,437         34        0  trunk/memcheck/mc_main.c:mc_expensive_sanity_check
   236,458,849       1,011      3      92,128,136      3,992       0      61,210,467          3        0  3.8.1/memcheck/mc_main.c:set_sec_vbits8
   236,458,849       1,274      4      92,128,136      4,386       0      61,210,467         14        0  trunk/memcheck/mc_main.c:set_sec_vbits8
   219,237,214         528      0     116,066,130         68       0      51,584,978          0        0  3.8.1/memcheck/mc_main.c:get_secmap_for_writing
   245,028,998         597      1     128,962,022         73       0      51,584,978          0        0  trunk/memcheck/mc_main.c:get_secmap_for_writing
   175,278,743       4,570      6      50,133,077      2,755       3      25,067,980    136,917        0  3.8.1/memcheck/mc_main.c:set_address_range_perms
   175,361,385       4,207      6      50,174,443      2,715       2      25,067,980    138,111        0  trunk/memcheck/mc_main.c:set_address_range_perms
   154,140,408       1,288      4     104,295,120        379       1      74,355,320      5,979        0  3.8.1/coregrind/m_oset.c:avl_insert
   154,142,722       1,288      4     104,296,639          0       0      74,356,377      5,695        0  trunk/coregrind/m_oset.c:avl_insert
   141,926,115       4,203      1     125,102,517          0       0      84,117,990         40        0  3.8.1/coregrind/m_oset.c:vgPlain_OSetGen_Lookup
   141,926,301       4,097      1     125,102,675          0       0      84,118,130         94        0  trunk/coregrind/m_oset.c:vgPlain_OSetGen_Lookup
   136,250,533     946,565      4      49,545,425      3,174       3      24,772,678         68        0  3.8.1/VEX/priv/guest_s390_helpers.c:s390_calculate_cc
   136,251,842       1,980      4      55,738,612      1,613       3      30,965,857          0        0  trunk/VEX/priv/guest_s390_helpers.c:s390_calculate_cc
   100,868,160         375      1      63,042,600         73       0      46,231,240          7        0  3.8.1/memcheck/mc_main.c:get_sec_vbits8
   100,868,160           6      0      63,042,600        153       0      46,231,240          7        0  trunk/memcheck/mc_main.c:get_sec_vbits8
   110,499,099      59,512     14      47,944,007     17,548       2       5,100,072      9,640       29  3.8.1/VEX/priv/ir_defs.c:sanityCheckIRSB
   110,635,475      65,148     17      46,003,128     14,053       2       5,153,814      8,758       35  trunk/VEX/priv/ir_defs.c:sanityCheckIRSB
    92,879,250   1,366,062      1      74,303,400          0       0      61,919,500          0        0  3.8.1/VEX/priv/guest_s390_helpers.c:s390_calculate_cond
    92,879,280       1,519      1      74,303,424          0       0      61,919,520          0        0  trunk/VEX/priv/guest_s390_helpers.c:s390_calculate_cond
    92,657,442          34      7       4,627,102        365       0       1,966,023        469        0  3.8.1/coregrind/m_libcbase.c:bm_qsort
    97,665,418          56      5       4,753,215        356       0       2,015,684        564        0  trunk/coregrind/m_libcbase.c:bm_qsort
    82,156,834      19,100      3       2,227,760        852       1      41,101,853    388,930  170,240  3.8.1/coregrind/m_libcbase.c:vgPlain_memset
    87,343,362      19,144      2       2,659,170      1,238       1      42,769,382    399,225  124,215  trunk/coregrind/m_libcbase.c:vgPlain_memset
    81,386,818      36,941     10      57,753,517      6,164       1      24,241,126      6,445       45  3.8.1/VEX/priv/ir_opt.c:ado_treebuild_BB
    94,454,223      55,204     15      66,588,822      6,290       1      33,799,942      6,020       46  trunk/VEX/priv/ir_opt.c:ado_treebuild_BB
I would not expect a slowdown of ~20% from this.
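For readers who want to check that conclusion against the cachegrind figures above, here is a small sketch that computes trunk/3.8.1 ratios for two of the entries. The counter values are copied from the listing; the dictionary layout and the `ratios` helper are purely illustrative.

```python
# Counters copied from the cachegrind listing above, as (Ir, Dr, D1mr, DLmr).
EVENTS = ("Ir", "Dr", "D1mr", "DLmr")

COUNTS = {
    "PROGRAM TOTALS": {
        "3.8.1": (72_997_397_630, 22_347_380_826, 2_371_512, 26_461),
        "trunk": (69_333_739_511, 21_740_631_694, 2_909_031, 36_899),
    },
    "mc_expensive_sanity_check": {
        "3.8.1": (238_512_464, 55_165_051, 563_560, 16_442),
        "trunk": (470_254_654, 108_645_874, 1_120_560, 32_804),
    },
}


def ratios(name):
    """Return trunk/3.8.1 ratios per event for one entry of COUNTS."""
    old = COUNTS[name]["3.8.1"]
    new = COUNTS[name]["trunk"]
    return {event: n / o for event, o, n in zip(EVENTS, old, new)}


if __name__ == "__main__":
    for name in COUNTS:
        line = ", ".join(f"{ev} x{r:.2f}" for ev, r in ratios(name).items())
        print(f"{name}: {line}")
```

On these numbers the total instruction count for trunk is slightly lower than for 3.8.1, while every counter for mc_expensive_sanity_check roughly doubles. That function accounts for well under 1% of the total Ir, which fits the remark that these figures alone would not explain a ~20% slowdown.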
Out of curiosity I ran bz2 again with trunk and 3.8.1 both compiled with
and without --enable-inner:
with --enable-inner
bz2 3.8.1: 1.09s no: 5.6s ( 5.1x, -----) me:29.6s (27.1x, -----)
bz2 trunk: 1.04s no: 5.6s ( 5.4x, -----) me:29.6s (28.4x, -----)
without --enable-inner
bz2 3.8.1: 1.05s no: 5.5s ( 5.2x, -----) me:24.2s (23.1x, -----)
bz2 trunk: 1.05s no: 5.6s ( 5.3x, -----) me:29.0s (27.6x, -----)
Note, that with --enable-inner (which is what those cachegrind numbers
above are reflecting) there is no performance regression. And the
interesting case (without --enable-inner) is not observable this way.
Florian
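The comparison above can be restated as a percentage with a few lines of arithmetic. This is a sketch only: the formula (relative change against the first build listed) is inferred from the figures quoted in this thread, not taken from the vg_perf source, and `speedup_percent` is a made-up name.

```python
def speedup_percent(baseline_s, other_s):
    """Relative change of other_s against baseline_s, in percent.

    Negative values mean other_s is slower than the baseline.
    ASSUMPTION: this mirrors the percentage column printed by vg_perf.
    """
    return 100.0 * (baseline_s - other_s) / baseline_s


if __name__ == "__main__":
    # Memcheck ("me") wall-clock times for bz2, 3.8.1 first, then trunk.
    runs = {
        "with --enable-inner": (29.6, 29.6),
        "without --enable-inner": (24.2, 29.0),
    }
    for label, (old, new) in runs.items():
        print(f"{label}: {speedup_percent(old, new):+.1f}%")
```

With the rounded times shown here the second case comes out close to the -19.7% printed in the earlier run; the small difference is presumably because the tool works from unrounded timings.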
|
|
From: Philippe W. <phi...@sk...> - 2013-09-13 20:37:13
|
On Fri, 2013-09-13 at 19:31 +0200, Florian Krohm wrote:
> Out of curiosity I ran bz2 again with trunk and 3.8.1 both compiled with
> and without --enable-inner:
>
> with --enable-inner
>
> bz2 3.8.1: 1.09s no: 5.6s ( 5.1x, -----) me:29.6s (27.1x, -----)
> bz2 trunk: 1.04s no: 5.6s ( 5.4x, -----) me:29.6s (28.4x, -----)
>
> without --enable-inner
>
> bz2 3.8.1: 1.05s no: 5.5s ( 5.2x, -----) me:24.2s (23.1x, -----)
> bz2 trunk: 1.05s no: 5.6s ( 5.3x, -----) me:29.0s (27.6x, -----)

I also observed that an inner valgrind is slower than a normal valgrind.
I did not investigate why (an inner valgrind does some extra work here
and there, such as inner requests to inform the outer of e.g. the
allocations that are done).
This should not have much impact when not running under an outer.

>
> Note, that with --enable-inner (which is what those cachegrind numbers
> above are reflecting) there is no performance regression. And the
> interesting case (without --enable-inner) is not observable this way.

I did some more measurements on amd64.
No slowdown observed, except sarp.

I then used outer callgrind/inner memcheck on bz2 and sarp.
On sarp, the only significant change in the numbers was the
mc_expensive_sanity_check data miss rate, due to reading more
primary map entries.

On my old pentium x86 (where I observed the initial (significant)
slowdown), I ran "make clean" on trunk and recompiled everything.
The slowdown disappeared.
In summary, it is not very clear what the cause was (I do not remember
having done anything special to compile the trunk without optimisation).

At work, I also did performance measurements on a big application (x86).
No significant difference detected.

So, at this stage, at least on x86, this perf degradation
looks to be a red herring.

Philippe
|