[Bigdata-developers] performance question


I am doing a performance comparison between a bigdata-based solution and our previous solution, and I am getting *very* confused.

My question is: what time is bigdata using that is not being measured as either user or sys time?

The task is as follows:

I have 11 queries that can be answered by both systems and that, from a user's point of view, are identical.

I run the suite of 11 queries 6 times over. In the bigdata setup, I am using bigdata as a SPARQL endpoint, and the queries are passed over HTTP.
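A per-query breakdown of the wall time can help show whether the elapsed time is spread evenly or concentrated in a few queries. The following is only a sketch of such a harness: the class name SuiteTimer, the placeholder query strings, and the runQuery callback (which stands in for whatever code sends a query to the endpoint over HTTP) are all hypothetical and not part of bigdata.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical harness: times each query in each round with System.nanoTime().
class SuiteTimer {
    /** Returns wall-clock nanoseconds indexed as [round][query]. */
    static long[][] time(List<String> queries, int rounds, Consumer<String> runQuery) {
        long[][] nanos = new long[rounds][queries.size()];
        for (int r = 0; r < rounds; r++) {
            for (int q = 0; q < queries.size(); q++) {
                long start = System.nanoTime();
                runQuery.accept(queries.get(q)); // stand-in for the real HTTP call
                nanos[r][q] = System.nanoTime() - start;
            }
        }
        return nanos;
    }

    /** Sums all recorded timings. */
    static long total(long[][] nanos) {
        long sum = 0;
        for (long[] round : nanos) {
            for (long n : round) {
                sum += n;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Placeholder query texts; a real run would use the 11 actual queries.
        List<String> queries = Arrays.asList("query-1", "query-2", "query-3");
        long[][] nanos = time(queries, 6, q -> { /* send q to the endpoint here */ });
        System.out.printf("total wall time: %d ms%n", total(nanos) / 1_000_000);
    }
}
```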

I am currently just doing this on my Mac (running Mountain Lion, with an SSD).

The wall time to run the queries is approx 30 seconds; however, the CPU time (both user and sys) recorded against the client and the server is a lot less, with about 1 second in the client and 5 seconds in the server.
I am having difficulty finding where the time is going: over 20 seconds is simply missing.
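The same kind of gap can be observed from inside a JVM by comparing elapsed time with the CPU time charged to a thread. The sketch below is illustrative only: the class name WallVsCpu and the sleeping task are made up, and it relies on the standard java.lang.management.ThreadMXBean API, not on anything in bigdata. A thread that is sleeping or blocked accrues wall time but almost no CPU time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: compares wall time and CPU time for one task.
class WallVsCpu {
    /** Returns {wallNanos, cpuNanos} for running the task on the current thread. */
    static long[] measure(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        // Returns -1 if CPU time measurement is disabled in this JVM.
        long cpuStart = threads.getCurrentThreadCpuTime();
        task.run();
        long cpuNanos = threads.getCurrentThreadCpuTime() - cpuStart;
        long wallNanos = System.nanoTime() - wallStart;
        return new long[] { wallNanos, cpuNanos };
    }

    public static void main(String[] args) {
        // The sleep stands in for any work that waits rather than computes.
        long[] t = measure(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.printf("wall=%d ms, cpu=%d ms%n", t[0] / 1_000_000, t[1] / 1_000_000);
    }
}
```

Note that this only covers the calling thread; work handed off to other threads would need to be measured on those threads.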

By running bigdata in the debugger and adding System.nanoTime() calls before and after QueryServlet.doQuery(), I have convinced myself that the issue is server-side, not client-side, and also not networking related.

When running inside YourKit, with the profiler set to measure wall time, the time seems to be explained by the following cryptic line:

java.lang.Thread.run() 88804ms Time,  84928ms Own Time
i.e. the vast bulk of the run time (approximately three times the experienced time of 30 seconds) is accounted for in the Thread.run() method, doing who knows what (waiting for thread scheduling?).
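A wall-time measurement generally counts time threads spend waiting as well as running, summed across threads, which would be consistent with a total above the elapsed 30 seconds. One way to check what the threads are doing is to sample their states while the queries run. The sketch below is an assumption-laden illustration: the class name ThreadStateSnapshot is made up, and the code would have to run inside the server JVM to be useful (the JDK's jstack tool gives a comparable view from outside the process).

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: counts live threads in this JVM by Thread.State.
class ThreadStateSnapshot {
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Take a few samples; a real investigation would sample during the query run.
        for (int i = 0; i < 5; i++) {
            System.out.println(countByState());
            Thread.sleep(100);
        }
    }
}
```

Many threads in WAITING or TIMED_WAITING and few in RUNNABLE would suggest the time is spent parked rather than computing, though it would not by itself say what they are waiting for.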

I am getting very similar results with either of the following changes:
- use a ramdisk rather than the SSD
- use only 1 CPU without hyper-threading, instead of the quad core with hyper-threading that my machine comes with.

(i.e. the actual execution time is the same with or without extra cores!)

===

I am continuing with testing; my next tests will be:
- parallelize the load and see if the quad core machine does better
- try on a Linux box in AWS
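The parallel-load test could be sketched roughly as follows, using a fixed thread pool from java.util.concurrent. The class name ParallelLoad and the sleepTask helper are hypothetical, and the sleeping tasks are stand-ins for real query requests; nothing here is bigdata-specific.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical harness: runs tasks on a fixed pool and reports elapsed wall time.
class ParallelLoad {
    /** Runs every task on `threads` workers and returns elapsed nanoseconds. */
    static long runAll(List<Runnable> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Future<?>> futures = new ArrayList<>();
            for (Runnable task : tasks) {
                futures.add(pool.submit(task));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface any failure
            }
            return System.nanoTime() - start;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in for one query request: just sleeps for the given time. */
    static Runnable sleepTask(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) {
        List<Runnable> tasks = Collections.nCopies(8, sleepTask(100));
        System.out.printf("1 thread:  %d ms%n", runAll(tasks, 1) / 1_000_000);
        System.out.printf("4 threads: %d ms%n", runAll(tasks, 4) / 1_000_000);
    }
}
```

If the real queries show little improvement with more client threads, that would point towards serialization or waiting on the server side rather than a shortage of CPU.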

Any thoughts would be appreciated.

===

I am making extensive use of named graphs, with the select queries starting with approx 40 FROM NAMED and FROM clauses, otherwise I don't think there is anything particularly funky about my queries.

Jeremy J Carroll
Principal Architect
Syapse, Inc.
