Menu

#5000 trace - inexplicable large overhead by ensemble commands

obsolete: 8.5.11
open-fixed
46. Traces (50)
3
2012-02-10
2012-02-08
No

Hello,

I've found that by "execution" trace some ensemble commands produce a lot of overhead...
without trace it is comparable with not ensemble commands
i have own profiler (to check which code works slowly), that calls somthing like this :

Tcl_CreateObjTrace( interp, level, TCL_ALLOW_INLINE_COMPILATION | ((TCL_TRACE_ANY_EXEC) >> 2),
&tclexProfilerTraceProc, inDbgObj, NULL );
The C procedure "tclexProfilerTraceProc" i have rewritten to test this, that it performs "return TCL_OK;" only.

Strange is the execution times of "string length" and "::tcl::string::length" are not different,
but "string range" or "string first" are 100 times slower as its not ensemble commands "::tcl::string::range" or "::tcl::string::first".

That overhead make me not possible, to profile anything code, using ensembles.

Tcl code to be "profiled":
%%%%
set s [string repeat A 50000]; puts ok
puts "--- string range ----"
puts [time {::tcl::string::range $s 2000 2001} 100000]
puts [time {::string range $s 2000 2001} 10000]
puts "--- string first ----"
puts [time {::tcl::string::first A $s 2000} 100000]
puts [time {::string first A $s 2000} 10000]
puts "--- string length ---"
puts [time {::tcl::string::length $s} 100000]
puts [time {::string length $s} 10000]
puts "--- test ------------"
proc ::test1_string_range {s from to} {set a 5}
proc ::test2_string_range {ens s from to} {set a 5}
puts [time {::test1_string_range $s 2000 2001} 10000]
puts [time {::test2_string_range range $s 2000 2001} 10000]
%%%%

Result is :
ok
--- string range ----
1.4 microseconds per iteration
168.8 microseconds per iteration !!!
--- string first ----
1.09 microseconds per iteration
168.8 microseconds per iteration !!!
--- string length ---
0.31 microseconds per iteration
0.0 microseconds per iteration
--- test ------------
1.6 microseconds per iteration
1.5 microseconds per iteration

Discussion

1 2 > >> (Page 1 of 2)
  • Serg G. Brester

    Serg G. Brester - 2012-02-08

    the difference between execution times of "ensemble" and "not ensemble" is then greater if string "s" is longer, and vice versa.

     
  • Don Porter

    Don Porter - 2012-02-08

    Does the 8.5.11 release have this problem,
    or only the tip of the core-8-5-branch ?

     
  • Serg G. Brester

    Serg G. Brester - 2012-02-08

    > Don: original release 8.5.11

    What is funny - but even tcl proc is "in trace" faster than "ensemble":
    %
    if {[info command ::org_string] == {}} {
    rename ::string ::org_string
    proc ::string {args} {{*}[lreplace $args 0 0 "::tcl::string::[lindex $args 0]"]}
    }
    .... the same test code ....
    %

    The result is :
    ok
    --- string range ----
    2.04 microseconds per iteration
    7.8 microseconds per iteration
    --- string first ----
    1.87 microseconds per iteration
    7.8 microseconds per iteration
    --- string length ---
    0.32 microseconds per iteration
    6.2 microseconds per iteration
    --- test ------------
    3.1 microseconds per iteration
    1.6 microseconds per iteration

     
  • Don Porter

    Don Porter - 2012-02-08

    ok, reverse the question then. Is the problem
    already corrected on the tip of the core-8-5-branch?

     
  • Don Porter

    Don Porter - 2012-02-08

    Another releated Q, is it all ensembles
    that have this issue, or only those that
    are byte compiled?

     
  • Don Porter

    Don Porter - 2012-02-08

    The bytecompiled subcommand [string length] is good.

    The subcommands that don't compile [string range,first] show
    the problem.

    I suspect that every iteration of [time] is re-attempting the
    compile. No hard evidence yet.

     
  • Don Porter

    Don Porter - 2012-02-08
    • labels: 104240 --> 105662
    • assigned_to: dgp --> dkf
     
  • Don Porter

    Don Porter - 2012-02-08

    Reproduction using just what's in the
    source distro:

    $ make runtest
    ...
    % set s [string repeat A 50000]; puts ok
    ok
    % testcmdtrace resulttest {time {string range $s 2000 2001} 1000}
    171.787 microseconds per iteration
    % testcmdtrace resulttest {time {tcl::string::range $s 2000 2001} 1000}
    2.357 microseconds per iteration

     
  • Serg G. Brester

    Serg G. Brester - 2012-02-08

    > The subcommands that don't compile [string range,first] show
    > the problem.
    But how you explain, that pure tcl proc is faster as "ensemble" by non compiled subcommands? Anything is wrong in ensemble call.
    %
    if {[info command ::org_string] == {}} {
    rename ::string ::org_string
    proc ::string {args} {{*}[lreplace $args 0 0 "::tcl::string::[lindex $args 0]"]}
    }
    .... the same test code ....
    %

     
  • Serg G. Brester

    Serg G. Brester - 2012-02-09

    by the way, I could not make runtest with windows/MVS 6.0.

    ..\generic\tclTest.c(5347) : error C2065: 'intptr_t' : undeclared identifier
    ..\generic\tclTest.c(5347) : error C2146: syntax error : missing ')' before identifier 'Tcl_GetCurrentThread'

    my quick workaround :

    #if (defined( _MSC_VER ))
    typedef void * intptr_t;
    #endif

     
  • Serg G. Brester

    Serg G. Brester - 2012-02-09

    No hard evidence, but in debugger by brreakpoint in NsEnsembleImplementationCmd
    it breaks 5 times by "testcmdtrace resulttest {time {string range $s 2000 2001} 5}"
    and zero times by "testcmdtrace resulttest {time {string length $s} 5}"

    Don, you got right: ensemble "string range" will be recompiled each time in "time" body.
    But why "range" only?

     
  • Serg G. Brester

    Serg G. Brester - 2012-02-09

    Yep,
    compared to "length", by the "range" it returns in TclCompileEnsemble with TCL_ERROR after checking (... cmdPtr->compileProc == NULL), and go further to recompile in TclCompileScript.

    So, either EACH compiled ensemble should implements ALL compiled subcommands also, or in TclCompileEnsemble (or hereafter in TclCompileScript) should be supported a direct call "cmdPtr->objProc" of the subcommand.
    At least by DONT_COMPILE_CMDS_INLINE or TCL_ALLOW_INLINE_COMPILATION.

    I will try to build it later, but each quick solution or workaround are welcome...
    PS. cause I should as soon as possible complete my profiling :)

     
  • Serg G. Brester

    Serg G. Brester - 2012-02-09

    I noticed something else: why recompile or parse is dependent on length of string "s"?
    It should still parse the token "$s" and not the value of "s". Or I'm way off here?

     
  • Don Porter

    Don Porter - 2012-02-09

    Further probing demonstrates the
    {string range ....} script is getting compiled
    only one time. It compiles to direct invocation
    of the string command procedure, which is
    NsEnsembleImplementationCmd, because "string"
    is implemented as an ensemble. This points to
    performance problems there.

    A possible alternative would be to have TclCompileEnsemble()
    compile to direct invocation of ::tcl::string::range instead. To
    go that way would required an examination of the epoch implications,
    since changes in any ensemble dispatch map would then have the
    potential to invalidate bytecode. (That may already be true).

    If we go that way, there may be refactoring opportunities in breaking
    out from TclCompileScript() the central engine that compiles a single
    command so that it could be called by TCE as well.

     
  • Donal K. Fellows

    Most of cost is due to TEOV_RunEnterTraces, which spends a lot of time converting something to a string.

    Timings:

    % testcmdtrace resulttest {time {string range $s 2000 2001} 100000}
    590.6396843900001 microseconds per iteration
    % time {string range $s 2000 2001} 100000
    24.18864962 microseconds per iteration
    % time {::tcl::string::range $s 2000 2001} 100000
    12.64128215 microseconds per iteration

     
  • Donal K. Fellows

    profiler sample during slow execution run

     
  • Donal K. Fellows

    I've attached a profiler trace captured while doing a run of the slowest version from my previous comment. ($s is 50k chars)

     
  • Don Porter

    Don Porter - 2012-02-09

    Ok, that's it. This oldest style of execution tracing
    wants to have available the command it is tracing
    in string form. Most of the time this isn't a big demand,
    because the command started in string form, and we just
    have to get to it.

    But for ensemble subcommand dispatch, the redirected
    subcommand execution is done by building an objv
    array and directly calling to Tcl_EvalObjv(). When the
    the string version is needed, a list is built and the string
    rep of that list is generated, costing time proportional to
    the total length of all the argument values.

     
  • Don Porter

    Don Porter - 2012-02-09

    The usual opinion on this has been that
    when this level of tracing is active, yes
    performance will be harmed. But if you
    have this level of tracing turned on, we
    assume performance is not the concern,
    but rather some sort of debugging or profiling. Optimizing
    under these trace conditions hasn't been a priority.

     
  • Don Porter

    Don Porter - 2012-02-09
    • labels: 105662 --> 46. Traces
    • priority: 5 --> 3
    • assigned_to: dkf --> dgp
     
  • Don Porter

    Don Porter - 2012-02-09

    The cost of TclScanElement() in dkf's profile
    caught my eye, notably as compared to
    TclConvertElement(). We can't do anything
    about the inherent O(N) nature of the scan,
    but the constant seemed higher than needed.

    On the dgp-scan-element branch I have some
    revisions aimed at brining down the constant.
    On the test at issue in this ticket, it knocks about
    half off the tracing cost. No new test failures.
    Still running tclbench to see if there's some
    countervailing cost somewhere.

     
  • Don Porter

    Don Porter - 2012-02-09

    If the dgp-scan-element branch has significant
    performance pros or cons, they are not clearly
    demonstrated by tclbench.

     
  • Serg G. Brester

    Serg G. Brester - 2012-02-09

    the patch is greate...
    > they are not clearly demonstrated by tclbench.
    There are probably large effect on some things such initial compile and loading (source etc).

     
  • Donal K. Fellows

    I'm not sure that this is a cost that tclbench is set up to spot; it's largely focussed on bytecode/command costs as they bite most people much harder. Still, this seems to make quite a bit of difference to the speed of ensemble command dispatch with 8.5 (around 40% faster with an --enable-symbols=all build) which is very good, if naively surprising. In a fully optimized build, it makes 17.5% difference when traces are turned on, and seems to be irrelevant without traces.

    Make it so.

     
  • Serg G. Brester

    Serg G. Brester - 2012-02-10

    In my tcl fork I have midified a some routines such Tcl_CreateEnsemble, Tcl_SetEnsembleSubcommandList, Tcl_SetEnsembleMappingDict etc.

    So after ensemble namespace and subcommands are created, the main ensemle command will be a "dummy" command without compileProc (TclCompileEnsemble), that simple forwards a first arg to corresponding command in ensemble namespace.

    I try later to create a patch for core-8-5-11...

    My performance now w/o tracing:
    --- string range ----
    0.62 microseconds per iteration
    0.78 microseconds per iteration
    --- string first ----
    0.63 microseconds per iteration
    0.78 microseconds per iteration
    --- string length ---
    0.31 microseconds per iteration
    0.16 microseconds per iteration

    Performance with tracing:
    --- string range ----
    2.16 microseconds per iteration
    2.8 microseconds per iteration
    --- string first ----
    1.78 microseconds per iteration
    1.8 microseconds per iteration
    --- string length ---
    0.32 microseconds per iteration
    0.32 microseconds per iteration

    By tclbench it seems to be equal.
    But strange is, that some my own test scripts running up to one-fifth faster as earlier.

     
1 2 > >> (Page 1 of 2)
Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.