#261 Performance of the list comprehension predicates

Status: open
Group: Performance problem
Created: 2023-06-26
Last Updated: 2023-07-18
Owner: nobody
Labels: None
Priority: 5
Private: No

I have ported a machine learning application from SWI-Prolog to XSB, hoping to make use of more advanced tabling and the better performance of compiled code.
The former is working beautifully.
However, there is a serious performance bottleneck in counting operations.
Although the individual pure Prolog queries are all faster than in SWI-Prolog, counting aggregation is very slow, apparently because of very poor performance of the list comprehension predicates.
For instance, with the following code, where childof simply has 16 ground facts listed in the file, test(100000) takes 2.83 s of CPU time in XSB but only 0.35 s in SWI-Prolog, although evaluating childof(X,Y) itself in the same loop is actually around twice as fast in XSB as in SWI-Prolog. bagof performs equally poorly.

test(M) :-
    between(1,M,_),
    findall((X,Y),childof(X,Y),_),
    fail.
test(_).

Is this a known problem?
I have tried various more or less creative alternatives for counting that avoid explicit list aggregation, such as writing a C process and calling it recursively, or using po-tabling with a counting predicate with set semantics, but neither of them improved performance considerably, due to C-interface and tabling overhead respectively.
Since I also have several students working on statistical relational AI applications in Python, in which we were hoping to use XSB (via px) as a deductive database engine for solution counting, we would be very happy to contribute what we can to a solution!

Related

Bugs: #261

Discussion

  • David S. Warren

    David S. Warren - 2023-06-26

    Hi Felix,
    We have noticed no particular performance issues with findall in XSB. Our implementation is in C and was contributed by Bart Demoen.
    I have run your test on my laptop, having instantiated childof to childof(i,i+1) for i=1..16. (I did have to add :- import between/3 from basics.) I get

    sh-4.4$ xsb
    [xsb_configuration loaded]
    [sysinitrc loaded]
    [xsbbrat loaded]

    XSB Version 5.0.0 (Green Tea) of May 15, 2022
    [x64-pc-windows; mode: optimal; engine: slg-wam; scheduling: local]
    [Build date: 2022-07-06]

    | ?- [findalltest].
    [Compiling .\findalltest]
    [findalltest compiled, cpu time used: 0.031 seconds]
    [findalltest loaded]

    yes
    | ?- time(test(100000)).
    % 0.219 CPU in 0.219 seconds (100% CPU)

    yes
    | ?-

    My laptop processor is: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 1.69 GHz, and RAM is 16.0 GB (15.7 GB usable). It's running Windows 10.

    I doubt your processor is 10 times slower than mine, so I can't explain what you are seeing.

    BUT, now that I think about it, there was a time period during which findall was slow. I had inadvertently introduced a performance bug while fixing another bug. I have since fixed that performance bug. Your version of XSB might be from that period, and that is most likely the explanation for what you are seeing. I'm pretty sure the most recent release of XSB has the fixed version. Certainly the current git sources on sourceforge have the fixed version.
    If your version of emu/findall.c does not define (and use) a function incr_memset, then you have the bug.

    I strongly suspect that this is the problem. Sorry for the inconvenience.
    -David

     
  • Felix Weitkämper

    Very strange! My students and I have the same experience on our various systems (all Linux, though), despite using current XSB versions compiled from source.
    I will do some more experiments as soon as I get to it, but since I will be attending the ICLP in London next week, it might be a little while.

     
  • David S. Warren

    David S. Warren - 2023-07-14

    As discussed, a special hack for counting the number of times a goal succeeds should be more efficient:

    :- import term_arg/3,term_set_arg/4 from machine.

    count(Goal,Cnt) :-
        Term = cnt(_),
        term_set_arg(Term,1,0,0),
        (do_all
            call(Goal),
            term_arg(Term,1,Cnt0),
            Cnt1 is Cnt0+1,
            term_set_arg(Term,1,Cnt1,1)
        ),
        term_arg(Term,1,Cnt).

    This uses term_set_arg with a fourth argument of 1 so that the assignment is not trailed, which keeps the value over backtracking. (I'm not sure these predicates are documented, so I tracked them down and tested them, and thought I'd send them along.)

     
  • David S. Warren

    David S. Warren - 2023-07-15

    Hi Felix,
    Again, I am stumped. With the following file, compiled:

    :- import term_arg/3,term_set_arg/4 from machine.
    :- import for/3 from basics.

    test(K) :-
        for(_,1,K),
        count(childof(_,_),Cnt),
        Cnt =\= 16,   % check we got the right answer
        writeln(oops),
        fail.

    count(Goal,Cnt) :-
        Term = cnt(_),
        term_set_arg(Term,1,0,0),
        (do_all
            call(Goal),
            term_arg(Term,1,Cnt0),
            Cnt1 is Cnt0+1,
            term_set_arg(Term,1,Cnt1,1)
        ),
        term_arg(Term,1,Cnt).

    childof(a1,a2).
    childof(a2,a3).
    childof(a4,a5).
    childof(a6,a7).
    childof(a8,a9).
    childof(a10,a11).
    childof(a12,a13).
    childof(a14,a15).
    childof(a16,a17).
    childof(a18,a19).
    childof(a20,a21).
    childof(a22,a13).
    childof(a24,a25).
    childof(a26,a27).
    childof(a28,a29).
    childof(a30,a31).

    I get:

    sh-4.4$ xsb
    [xsb_configuration loaded]
    [sysinitrc loaded]
    [xsbbrat loaded]

    XSB Version 5.0.0 (Green Tea) of May 15, 2022
    [x64-pc-windows; mode: optimal; engine: slg-wam; scheduling: local]
    [Build date: 2022-07-06]

    | ?- [term_set_arg_test].
    [Compiling .\term_set_arg_test]
    [term_set_arg_test compiled, cpu time used: 0.047 seconds]
    [term_set_arg_test loaded]

    yes
    | ?- time(test(1000000)).
    % 1.281 CPU in 1.273 seconds (100% CPU)

    no
    | ?- halt.

    End XSB (cputime 1.34 secs, elapsetime 41.54 secs)
    sh-4.4$ xsb
    [xsb_configuration loaded]
    [sysinitrc loaded]
    [xsbbrat loaded]

    XSB Version 5.0.0 (Green Tea) of May 15, 2022
    [x64-pc-windows; mode: optimal; engine: slg-wam; scheduling: local]
    [Build date: 2022-07-06]

    | ?- [term_set_arg_test].
    [term_set_arg_test loaded]

    yes
    | ?- time(test(100000)).
    % 0.125 CPU in 0.118 seconds (105% CPU)

    no
    | ?-

    This is on my laptop, a reasonably fast Dell running Windows. So the first run does a million repetitions and takes 1.281 secs; the second does 100K reps and takes 0.125 secs. So those times are consistent with each other, and I think much faster than what you get? So I'm stumped.

    I have Ubuntu on Windows on my machine, but am now unable to compile XSB on it (some weird problem with line endings, I think, since it shares files with Windows; I've compiled it before, but it will take time for me to track down the silly problems again and fix them).

     
  • David S. Warren

    David S. Warren - 2023-07-15

    (It looks like the underscore variables got lost in the call to count(childof(_,_),Cnt) in test in the above code. They were there when I executed the code (otherwise it would have been a syntax error). Sorry if that was confusing.)

     
  • David S. Warren

    David S. Warren - 2023-07-15

    Well, I got XSB compiled on my ubuntu for windows system. And indeed, when I run that same program, I get:

    warren@xsb-pc62:/mnt/c/xsb-tests$ xsb
    [xsb_configuration loaded]
    [sysinitrc loaded]
    [xsbbrat loaded]

    XSB Version 5.0.0 (Green Tea) of May 15, 2022
    [x86_64-pc-linux-gnu 64 bits; mode: optimal; engine: slg-wam; scheduling: local]
    [Build date: 2023-07-15]

    | ?- [term_set_arg_test].
    [Compiling ./term_set_arg_test]
    [term_set_arg_test compiled, cpu time used: 0.015 seconds]
    [term_set_arg_test loaded]

    yes
    | ?- time(test(100000)).
    % 2.547 CPU in 2.541 seconds (100% CPU)

    no
    | ?-

    I.e., it is very slow. Now someone (I guess me?) has to figure out why this program is more than an order of magnitude slower on Ubuntu Linux than on Windows....

     
  • Felix Weitkämper

    Well, I am very grateful at least that it has been possible to reproduce the phenomenon, and confirm that it is indeed occasioned by the choice of operating system.

     
  • David S. Warren

    David S. Warren - 2023-07-17

    The problem is in how gcc compiles the builtin_call function, which is very large, so that it is very expensive to call. I suspect it has to do with how it allocates (and initializes?) local variables. I will find some time to try to factor it so that components that need significant space have their own sub-function. I'll let you know what I find.
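
    A minimal C sketch of the kind of factoring meant here (this is not XSB's actual builtin.c; MAX_ARITY, rare_case_with_big_frame and builtin_dispatch are invented names for illustration): the space-hungry case gets its own non-inlined helper, so its large stack frame is only set up when that case actually runs.

    /* Illustrative sketch only, not XSB source.  Assumes gcc, where
       __attribute__((noinline)) keeps the helper's big frame out of
       the dispatcher.  MAX_ARITY is a made-up bound standing in for
       "one slot per possible arity". */
    #define MAX_ARITY 65535

    static long rare_case_with_big_frame(void) __attribute__((noinline));

    static long rare_case_with_big_frame(void)
    {
        long vars_a[MAX_ARITY + 1];   /* the huge locals live only here */
        long vars_b[MAX_ARITY + 1];
        vars_a[0] = 0;
        vars_b[0] = 0;
        return vars_a[0] + vars_b[0];
    }

    long builtin_dispatch(int op)     /* small frame, so entry stays cheap */
    {
        if (op == 1000)               /* stand-in for the rare big case */
            return rare_case_with_big_frame();
        return op;                    /* common cases never pay for the arrays */
    }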

     
  • Felix Weitkämper

    Would it help if one/I tried compiling it with clang instead of gcc to see whether the issue persists?

     
  • David S. Warren

    David S. Warren - 2023-07-17

    OK. I found and have fixed the problem. A few more tests and then I'll commit the changes to sourceforge.
    I dumped the assembler code that gcc generates for the builtin.c file. I know little about x86_64 assembler code, but it seems that gcc generates code such that, on entry to a function, it loops through all the pages of the frame that will contain local variables. For every page, it seems to "or" a 0 into one (double) word in that page (every 4096 words(?)). The only reason I can think of for it to do this is to ensure that all the pages of the current frame can be written to and have been paged into memory. I'm guessing here.
    So the problem was that one case in the large builtin case statement (the one for excess_vars) reserves an inordinate amount of space (a word for every possible arity, which is now 0-64K, and it allocates 2 such arrays!). So every entry to any builtin had to run through all those pages, and that seems to be where the time goes.
    I just moved the 2 huge local variables out of the function, making them global, which works, if perhaps not so elegantly. And with that I get very close to the same times for MSVC-compiled and GCC-compiled XSB.
    So this should speed up all calls to builtins in XSB under Ubuntu, so you might see other performance improvements as well.
    Thanks for finding this bug and hanging with us long enough for us to figure it out. I know other users of Ubuntu will be happy, too.
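
    For comparison with the factoring sketch above, here is the same illustrative dispatcher with the fix described here: the two huge arrays hoisted out of the function to file scope, so the frame stays small and function entry no longer has to touch every page (again, invented names, not XSB's actual builtin.c).

    /* Illustrative sketch only, not XSB source.  With the arrays at
       file scope, the dispatcher's stack frame is tiny, so the
       per-page probe gcc emits for large frames (the "or a 0 into one
       word of each page" pattern described above, quite possibly from
       stack-clash protection) has essentially nothing left to probe. */
    #define MAX_ARITY 65535

    static long g_vars_a[MAX_ARITY + 1];   /* formerly huge locals,        */
    static long g_vars_b[MAX_ARITY + 1];   /* now file-scope as in the fix */

    long builtin_dispatch(int op)
    {
        if (op == 1000) {                  /* stand-in for the rare big case */
            g_vars_a[0] = 0;
            g_vars_b[0] = 0;
            return g_vars_a[0] + g_vars_b[0];
        }
        return op;
    }

    If the probes do come from gcc's stack-clash protection, rebuilding just that one file with -fno-stack-clash-protection would be another way to confirm the diagnosis, at the cost of losing that hardening for the file.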

     
    • Theresa Swift

      Theresa Swift - 2023-07-17

      Congratulations!


    • Michael Kifer

      Michael Kifer - 2023-07-18

      Impressive!

      Interesting that MSVC apparently does it better.


       
  • Felix Weitkämper

    Fantastic work, thank you so much! Now, using the set_arg hack, XSB actually outperforms YAP and SWI-Prolog's set_arg versions on the counting benchmark by 15% on my setup, which is very gratifying. The performance benefit of using XSB rather than YAP on my actual application is even greater (30%, probably due to superior tabling performance).
    For an algorithm that runs in approximately linear time in data size, this makes a substantial difference!
    This fix also makes XSB and hence Janus usable for the applications I am developing with my students. This means a lot to us, as we really rely on the Python integration for all the generic machine learning analysis stuff.

     
  • David S. Warren

    David S. Warren - 2023-07-18

    That's great, Felix. And thanks again for bringing this to our attention. Too many users would see the results and just give up. It's users like you who help make XSB better (and faster!)

    Michael, I don't understand enough about operating system architecture and hardware to know why GCC put this in. I'm assuming there are circumstances in which this initial touching of every page in the activation record is a good idea. Obviously MSVC doesn't do this at all. (Well, I haven't looked, but I think "obviously".) For our example, MSVC's choice is clearly better; I wonder under what circumstances GCC's is.

     
    • Felix Weitkämper

      I believe it may have to do with gcc's stack checking, which is basically what you described above: https://gcc.gnu.org/onlinedocs/gccint/Stack-Checking.html
      I have already deleted my pre-fix XSB sources, but I wonder whether setting the appropriate macros could also have helped with the issue?

       

      Last edit: Felix Weitkämper 2023-07-18
    • Michael Kifer

      Michael Kifer - 2023-07-20

      David,

      I ran the Flora/ErgoAI testsuite and do not see any obvious speedup.

      The tests in that testsuite do not call XSB built-ins a huge number of times, though.



       
  • David S. Warren

    David S. Warren - 2023-07-18

    That looks like it. (You just gotta know where to look :-). When I get a chance, I'll experiment with defining that macro and see if that fixes the problem as well. Thanks!

     
