#261 Performance of the list comprehension predicates

Status: open
Group: Performance problem
Created: 2023-06-26
Last Updated: 2023-07-18
Owner: nobody
Labels: None
Priority: 5
Private: No

I have ported a machine learning application from SWI-Prolog to XSB, hoping to make use of more advanced tabling and the better performance of compiled code.
The former is working beautifully.
However, there is a serious performance bottleneck in counting operations.
Although the individual pure Prolog queries are all faster than in SWI-Prolog, counting aggregation is very slow, apparently because of very poor performance of the list comprehension predicates.
For instance, with the following code, where childof simply has 16 ground facts listed in the file, test(100000) takes 2.83 s of CPU time in XSB but only 0.35 s in SWI-Prolog, although evaluating childof(X,Y) itself in the same loop is actually around twice as fast in XSB as in SWI-Prolog. bagof performs equally poorly.

test(M) :-
    between(1,M,_),
    findall((X,Y),childof(X,Y),_),
    fail.
test(_).

Is this a known problem?
I have tried various more or less creative alternatives for counting that avoid explicit list aggregation, such as writing a C process and calling it recursively, or using po-tabling with a counting predicate with set semantics, but neither of them improved performance considerably, due to C-interface and tabling overhead respectively.
Since I also have several students working on statistical relational AI applications in Python, in which we were hoping to use XSB (via px) as a deductive database engine for solution counting, we would be very happy to contribute what we can to a solution!

Related

Bugs: #261

Discussion

  • David S. Warren

    David S. Warren - 2023-06-26

    Hi Felix,
    We have noticed no particular performance issues with findall in XSB. Our implementation is in C and was contributed by Bart Demoen.
    I have run your test on my laptop, having instantiated childof to childof(i,i+1) for i=1..16. (I did have to add :- import between/3 from basics.) I get

    sh-4.4$ xsb
    [xsb_configuration loaded]
    [sysinitrc loaded]
    [xsbbrat loaded]

    XSB Version 5.0.0 (Green Tea) of May 15, 2022
    [x64-pc-windows; mode: optimal; engine: slg-wam; scheduling: local]
    [Build date: 2022-07-06]

    | ?- [findalltest].
    [Compiling .\findalltest]
    [findalltest compiled, cpu time used: 0.031 seconds]
    [findalltest loaded]

    yes
    | ?- time(test(100000)).
    % 0.219 CPU in 0.219 seconds (100% CPU)

    yes
    | ?-

    My laptop processor is: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 1.69 GHz, and RAM is 16.0 GB (15.7 GB usable). It's running Windows 10.

    I doubt your processor is 10 times slower than mine, so I can't explain what you are seeing.

    BUT, now that I think about it, there was a time period during which findall was slow. I had inadvertently introduced a performance bug while fixing another bug. I have since fixed that performance bug. Your version of XSB might be from that period, and that is most likely the explanation for what you are seeing. I'm pretty sure the most recent release of XSB has the fixed version. Certainly the current git sources on sourceforge have the fixed version.
    If your version of emu/findall.c does not define (and use) a function incr_memset, then you have the bug.

    I strongly suspect that this is the problem. Sorry for the inconvenience.
    -David

     
  • Felix Weitkämper

    Very strange! My students and I have the same experience on our various systems (all Linux, though), despite using current XSB versions compiled from source.
    I will do some more experiments as soon as I get to it, but since I will be attending the ICLP in London next week, it might be a little while.

     
  • David S. Warren

    David S. Warren - 2023-07-14

    As discussed, a special hack for counting the number of times a goal succeeds should be more efficient:

    :- import term_arg/3,term_set_arg/4 from machine.

    count(Goal,Cnt) :-
        Term = cnt(_),
        term_set_arg(Term,1,0,0),
        (do_all
            call(Goal),
            term_arg(Term,1,Cnt0),
            Cnt1 is Cnt0+1,
            term_set_arg(Term,1,Cnt1,1)
        ),
        term_arg(Term,1,Cnt).

    This uses term_set_arg with a fourth argument of 1 so that the assignment is not trailed, which keeps the value over backtracking. (I'm not sure these predicates are documented, so I tracked them down and tested them, and thought I'd send them along.)

     
  • David S. Warren

    David S. Warren - 2023-07-15

    Hi Felix,
    Again, I am stumped. With the following file, compiled:

    :- import term_arg/3,term_set_arg/4 from machine.
    :- import for/3 from basics.

    test(K) :-
        for(_,1,K),
        count(childof(_,_),Cnt),
        Cnt =\= 16,   % check we got the right answer
        writeln(oops),
        fail.

    count(Goal,Cnt) :-
        Term = cnt(_),
        term_set_arg(Term,1,0,0),
        (do_all
            call(Goal),
            term_arg(Term,1,Cnt0),
            Cnt1 is Cnt0+1,
            term_set_arg(Term,1,Cnt1,1)
        ),
        term_arg(Term,1,Cnt).

    childof(a1,a2).
    childof(a2,a3).
    childof(a4,a5).
    childof(a6,a7).
    childof(a8,a9).
    childof(a10,a11).
    childof(a12,a13).
    childof(a14,a15).
    childof(a16,a17).
    childof(a18,a19).
    childof(a20,a21).
    childof(a22,a13).
    childof(a24,a25).
    childof(a26,a27).
    childof(a28,a29).
    childof(a30,a31).

    I get:

    sh-4.4$ xsb
    [xsb_configuration loaded]
    [sysinitrc loaded]
    [xsbbrat loaded]

    XSB Version 5.0.0 (Green Tea) of May 15, 2022
    [x64-pc-windows; mode: optimal; engine: slg-wam; scheduling: local]
    [Build date: 2022-07-06]

    | ?- [term_set_arg_test].
    [Compiling .\term_set_arg_test]
    [term_set_arg_test compiled, cpu time used: 0.047 seconds]
    [term_set_arg_test loaded]

    yes
    | ?- time(test(1000000)).
    % 1.281 CPU in 1.273 seconds (100% CPU)

    no
    | ?- halt.

    End XSB (cputime 1.34 secs, elapsetime 41.54 secs)
    sh-4.4$ xsb
    [xsb_configuration loaded]
    [sysinitrc loaded]
    [xsbbrat loaded]

    XSB Version 5.0.0 (Green Tea) of May 15, 2022
    [x64-pc-windows; mode: optimal; engine: slg-wam; scheduling: local]
    [Build date: 2022-07-06]

    | ?- [term_set_arg_test].
    [term_set_arg_test loaded]

    yes
    | ?- time(test(100000)).
    % 0.125 CPU in 0.118 seconds (105% CPU)

    no
    | ?-

    This is on my laptop, a reasonably fast Dell running Windows. So the first run does a million repetitions and takes 1.281 secs; the second does 100K reps and takes 0.125 secs. So those times are consistent with each other, and I think much faster than what you get? So I'm stumped.

    I have Ubuntu on Windows on my machine, but am now unable to compile XSB on it (some weird problem with line endings, I think, since it shares files with Windows; I've compiled it before, but it will take time for me to track down the silly problems again and fix them).

     
  • David S. Warren

    David S. Warren - 2023-07-15

    (It looks like the underscore variables got lost in the call to count(childof(_,_),Cnt) in test in the above code. They were there when I executed the code (otherwise it would have been a syntax error). Sorry if that was confusing.)

     
  • David S. Warren

    David S. Warren - 2023-07-15

    Well, I got XSB compiled on my ubuntu for windows system. And indeed, when I run that same program, I get:

    warren@xsb-pc62:/mnt/c/xsb-tests$ xsb
    [xsb_configuration loaded]
    [sysinitrc loaded]
    [xsbbrat loaded]

    XSB Version 5.0.0 (Green Tea) of May 15, 2022
    [x86_64-pc-linux-gnu 64 bits; mode: optimal; engine: slg-wam; scheduling: local]
    [Build date: 2023-07-15]

    | ?- [term_set_arg_test].
    [Compiling ./term_set_arg_test]
    [term_set_arg_test compiled, cpu time used: 0.015 seconds]
    [term_set_arg_test loaded]

    yes
    | ?- time(test(100000)).
    % 2.547 CPU in 2.541 seconds (100% CPU)

    no
    | ?-

    I.e., it is very slow. Now someone (I guess me?) has to figure out why this program is more than an order of magnitude slower on Ubuntu Linux than on Windows....

     
  • Felix Weitkämper

    Well, I am very grateful at least that it has been possible to reproduce the phenomenon, and confirm that it is indeed occasioned by the choice of operating system.

     
  • David S. Warren

    David S. Warren - 2023-07-17

    The problem is in how gcc compiles the builtin_call function, which is very large, so that it is very expensive to call. I suspect it has to do with how it allocates (and initializes?) local variables. I will find some time to try to factor it so that components that need significant space have their own sub-function. I'll let you know what I find.
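
    A minimal C sketch of the kind of factoring meant here (this is not XSB's actual builtin.c; MAX_ARITY, rare_case_with_big_frame and builtin_dispatch are invented names for illustration): the space-hungry case gets its own non-inlined helper, so its large stack frame is only set up when that case actually runs.

    /* Illustrative sketch only, not XSB source.  Assumes gcc, where
       __attribute__((noinline)) keeps the helper's big frame out of
       the dispatcher.  MAX_ARITY is a made-up bound standing in for
       "one slot per possible arity". */
    #define MAX_ARITY 65535

    static long rare_case_with_big_frame(void) __attribute__((noinline));

    static long rare_case_with_big_frame(void)
    {
        long vars_a[MAX_ARITY + 1];   /* the huge locals live only here */
        long vars_b[MAX_ARITY + 1];
        vars_a[0] = 0;
        vars_b[0] = 0;
        return vars_a[0] + vars_b[0];
    }

    long builtin_dispatch(int op)     /* small frame, so entry stays cheap */
    {
        if (op == 1000)               /* stand-in for the rare big case */
            return rare_case_with_big_frame();
        return op;                    /* common cases never pay for the arrays */
    }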

     
  • Felix Weitkämper

    Would it help if one/I tried compiling it with clang instead of gcc to see whether the issue persists?

     
  • David S. Warren

    David S. Warren - 2023-07-17

    OK. I found and have fixed the problem. A few more tests and then I'll commit the changes to sourceforge.
    I dumped the assembler code that gcc generates for the builtin.c file. I know little about x86_64 assembler code, but it seems that gcc generates code such that, on entry to a function, it loops through all the pages of the frame that will contain local variables. For every page, it seems to "or" a 0 into one (double) word in that page (every 4096 words(?)). The only reason I can think of for it to do this is to ensure that all the pages of the current frame can be written to and have been paged into memory. I'm guessing here.
    So the problem was that one case in the large builtin case statement (the one for excess_vars) reserves an inordinate amount of space (a word for every possible arity, which is now 0-64K, and it allocates 2 such arrays!). So every entry to any builtin had to run through all those pages, and that seems to be where the time goes.
    I just moved the 2 huge local variables out of the function, making them global, which works, if perhaps not so elegantly. And with that I get very close to the same times for MSVC-compiled and GCC-compiled XSB.
    So this should speed up all calls to builtins in XSB under Ubuntu, so you might see other performance improvements as well.
    Thanks for finding this bug and hanging with us long enough for us to figure it out. I know other users of Ubuntu will be happy, too.
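
    For comparison with the factoring sketch above, here is the same illustrative dispatcher with the fix described here: the two huge arrays hoisted out of the function to file scope, so the frame stays small and function entry no longer has to touch every page (again, invented names, not XSB's actual builtin.c).

    /* Illustrative sketch only, not XSB source.  With the arrays at
       file scope, the dispatcher's stack frame is tiny, so the
       per-page probe gcc emits for large frames (the "or a 0 into one
       word of each page" pattern described above, quite possibly from
       stack-clash protection) has essentially nothing left to probe. */
    #define MAX_ARITY 65535

    static long g_vars_a[MAX_ARITY + 1];   /* formerly huge locals,        */
    static long g_vars_b[MAX_ARITY + 1];   /* now file-scope as in the fix */

    long builtin_dispatch(int op)
    {
        if (op == 1000) {                  /* stand-in for the rare big case */
            g_vars_a[0] = 0;
            g_vars_b[0] = 0;
            return g_vars_a[0] + g_vars_b[0];
        }
        return op;
    }

    If the probes do come from gcc's stack-clash protection, rebuilding just that one file with -fno-stack-clash-protection would be another way to confirm the diagnosis, at the cost of losing that hardening for the file.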

     
    • Theresa Swift

      Theresa Swift - 2023-07-17

      Congratulations!


    • Michael Kifer

      Michael Kifer - 2023-07-18

      Impressive!

      Interesting that MSVC apparently does it better.


       
  • Felix Weitkämper

    Fantastic work, thank you so much! Now, using the set_arg hack, XSB actually outperforms YAP and SWI-Prolog's set_arg versions on the counting benchmark by 15% on my setup, which is very gratifying. The performance benefit of using XSB rather than YAP on my actual application is even greater (30%, probably due to superior tabling performance).
    For an algorithm that runs in approximately linear time in data size, this makes a substantial difference!
    This fix also makes XSB and hence Janus usable for the applications I am developing with my students. This means a lot to us, as we really rely on the Python integration for all the generic machine learning analysis stuff.

     
  • David S. Warren

    David S. Warren - 2023-07-18

    That's great, Felix. And thanks again for bringing this to our attention. Too many users would see the results and just give up. It's users like you who help make XSB better (and faster!)

    Michael, I don't understand enough about operating system architecture and hardware to know why GCC put this in. I'm assuming there are circumstances in which this initial touching of every page in the activation record is a good idea. Obviously MSVC doesn't do this at all. (Well, I haven't looked, but I think "obviously".) For our example, MSVC's choice is clearly better; I wonder under what circumstances GCC's is.

     
    • Felix Weitkämper

      I believe it may have to do with gcc's stack checking, which is basically what you described above: https://gcc.gnu.org/onlinedocs/gccint/Stack-Checking.html
      I have already deleted my pre-fix XSB sources, but I wonder whether setting the appropriate macros could also have helped with the issue?

       

      Last edit: Felix Weitkämper 2023-07-18
    • Michael Kifer

      Michael Kifer - 2023-07-20

      David,

      I ran the Flora/ErgoAI testsuite and do not see any obvious speedup.

      The tests in that testsuite do not call XSB built-ins a huge number of times, though.



       
  • David S. Warren

    David S. Warren - 2023-07-18

    That looks like it. (You just gotta know where to look :-). When I get a chance, I'll experiment with defining that macro and see if that fixes the problem as well. Thanks!

     
