From: Bryan T. <br...@sy...> - 2010-08-06 16:41:29
Here is a point of comparison for the WORM Journal on BSBM 100M with the same query mix. The difference between this and case (C) below is entirely due to the data scale (100M triples versus 50M triples) and the consequent need to touch the disk more frequently. The machine was already at 98% disk utilization on 50M at 12000 QMpH. At 100M, the disk utilization can only increase slightly (to 99%), and consequently throughput drops to 6800 QMpH. In scale-out, of course, we have more spindles and more IO bandwidth, all of which will translate into increased query throughput once we eliminate the RMI overhead.
Both this trial and the ones that I reported below were with 8 concurrent clients.
QMpH: 6846.91 query mixes per hour
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 6 200 8754476 128936 4572628 0 0 6612 134 2365 12502 8 2 57 33 0
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdc 1.02 0.17 1241.35 0.33 6613.13 2.00 10.66 6.63 5.34 0.80 99.26
Thanks,
Bryan
________________________________________
From: Bryan Thompson [br...@sy...]
Sent: Friday, August 06, 2010 10:28 AM
To: big...@li...
Subject: [Bigdata-developers] Preliminary performance results for BSBM 50M comparing a single node cluster and the standalone Journal
All,
What follows are three bigdata runs on BSBM using the 50M triple data set (scale factor of 141624) and the reduced query mix, less the queries using optional joins (Q3 and Q7). These runs compare:
(A) a single node cluster;
(B) a single node cluster w/o RMI when pushing binding sets during joins; and
(C) a standalone WORM Journal.
All runs are on the same hardware (4-core AMD with 16G RAM), OS (CentOS 5.4), and disk (3 x 15k SAS disks in a striped RAID configuration). QMpH (Query Mixes per Hour) is the basic metric for BSBM. Higher is better.
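For reference, QMpH is just the count of completed query mixes normalized to an hour. A trivial illustration (the class name and the numbers here are made up, not taken from these runs):

```java
public class Qmph {

    // QMpH = query mixes completed, scaled to a one-hour window.
    static double qmph(int queryMixes, double elapsedSeconds) {
        return queryMixes * 3600.0 / elapsedSeconds;
    }

    public static void main(String[] args) {
        // e.g. 500 query mixes completed in 150 seconds
        System.out.println(qmph(500, 150)); // prints 12000.0
    }
}
```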
(A) Single node cluster: Cold OS cache; Hot JVM; Compact shards:
QMpH: 2129.57 query mixes per hour
(B) Single node cluster: Cold OS cache; Hot JVM; Compact shards; No RMI when pushing binding sets in joins.
QMpH: 3700.51 query mixes per hour
(C) WORM Journal: Cold OS cache; Hot JVM:
QMpH: 12132.81 query mixes per hour
Now the story. As I observed in an earlier email (yesterday?), a profiler shows that RMI accounts for a lot of the time, threads, and heap in the single node cluster. If you compare (A) vs (B) you can see exactly how much overhead RMI added for moving the binding sets: the single node cluster has approximately 50% of the throughput when it pushes the binding sets using RMI (2100 QMpH) versus doing direct method calls (3700 QMpH). This goes back to how the pipeline join operates: it pushes the intermediate solutions (binding sets) from shard to shard during join processing. Since this is a single node cluster and the machine was only running a single DataService, all binding sets are actually being moved within that data service. (I was able to make a one line change in DistributedJoinTask.getSink(PartitionLocator) to have it NOT do RMI when a DataService would push the binding sets to itself, which resulted in this 2x increase in throughput.)
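The shape of that one-line change is roughly the following. This is only a sketch; IJoinSink, JoinSinkResolver, and the method signatures are hypothetical stand-ins for the actual DistributedJoinTask internals, but the decision logic is the same: hand back the in-process buffer when the target shard lives on this DataService, and only fall back to the RMI proxy for a remote service.

```java
import java.util.UUID;

// Hypothetical sink abstraction: something that accepts a chunk of binding sets.
interface IJoinSink {
    void push(Object[] bindingSetChunk);
}

// Sketch of the local-vs-RMI decision made in getSink(PartitionLocator).
class JoinSinkResolver {

    private final UUID localServiceId;

    JoinSinkResolver(UUID localServiceId) {
        this.localServiceId = localServiceId;
    }

    IJoinSink getSink(UUID targetServiceId, IJoinSink localBuffer, IJoinSink rmiProxy) {
        if (localServiceId.equals(targetServiceId)) {
            // Target shard is hosted by this same DataService: skip
            // serialization and RMI entirely and use the in-process buffer.
            return localBuffer;
        }
        // Otherwise the binding sets really do go over the wire.
        return rmiProxy;
    }
}
```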
If you compare (B) versus (C) you can compare the performance of the single node cluster (3700 QMpH) versus the WORM Journal (12000 QMpH). The WORM Journal is approximately 3x faster. You can see this reflected in the disk utilization as well (30% vs 98%), IO Wait (4% vs 34%) and context switching (78k versus 16k). While I don't have the data below, context switching for condition (A) was approximately 140k -- another way in which using RMI as we are doing is introducing overhead due to increased thread count, synchronization points, etc.
The single node cluster (A and B) absorbed more of the RAM on the machine since it had to run 5 java processes, which of course left the OS with less RAM to buffer the disk. However, the real reason the disk utilization is so much higher for (C) is that the context switching is that much lower -- that is, the join processing is more likely to run without having to hit a synchronization point.
I expect that we can bring condition (B - the single node cluster) to nearly the same performance level as condition (C - the WORM Journal) by reviewing the additional synchronization points introduced by the distributed versions of the pipeline join tasks (DistributedJoinTask and DistributedMasterTask). There are a couple of methods in there, such as DistributedJoinTask.getSink(PartitionLocator), which are already flagged as likely thread contention hot spots, and some improvements could be introduced fairly easily (for example, using the Memoizer pattern in getSink()).
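For anyone unfamiliar with it, the Memoizer pattern (per Goetz et al., "Java Concurrency in Practice") caches a FutureTask per key so that at most one thread computes the value for a given key while concurrent callers block on the Future rather than contending on a monitor. A generic sketch, not the bigdata code -- in getSink() the key would be the PartitionLocator and the value the resolved sink:

```java
import java.util.concurrent.*;
import java.util.function.Function;

// Generic memoizer: at most one computation per key, even under concurrency.
class Memoizer<K, V> {

    private final ConcurrentMap<K, Future<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> compute;

    Memoizer(Function<K, V> compute) {
        this.compute = compute;
    }

    V get(K key) throws InterruptedException, ExecutionException {
        Future<V> f = cache.get(key);
        if (f == null) {
            FutureTask<V> ft = new FutureTask<>(() -> compute.apply(key));
            // Only one thread wins the race to install the task for this key.
            f = cache.putIfAbsent(key, ft);
            if (f == null) {
                f = ft;
                ft.run(); // the winner computes, once, on its own thread
            }
        }
        // Losers (and later callers) block here on the Future, not on a lock
        // guarding the whole map.
        return f.get();
    }
}
```

The point for getSink() is that the expensive resolution happens once per locator and the hot path degrades to a lock-free ConcurrentHashMap lookup.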
Once we address those hot spots and Mike finishes the work on native execution of optional joins we should be able to deliver some very impressive BSBM results for both the single-node cluster and the Journal. In order to have equally impressive results for a multi-machine cluster, we will need to streamline how we move binding sets among the nodes as I outlined in a different thread yesterday.
The raw data for the DISK and CPU for conditions (B) and (C) are below.
Thanks,
Bryan
The DISK IO profile (from iostat) for conditions (B) and (C):
(B) iostat disk profile for the single node cluster w/o RMI for the binding sets.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdc 1.35 1.42 304.52 0.43 3207.27 7.40 21.08 0.88 2.89 0.99 30.18
(C) iostat disk profile for the WORM Journal:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdc 0.97 0.18 1321.23 0.37 7376.93 2.20 11.17 6.02 4.55 0.75 98.78
The CPU profile (from vmstat) for conditions (B) and (C):
(B) vmstat CPU profile for the single node cluster w/o RMI for the binding sets.
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
11 0 0 5423780 7248 734812 0 0 3326 173 1438 78346 58 11 27 4 0
(C) vmstat CPU profile for the WORM Journal.
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 6 200 12293584 13648 1184924 0 0 7211 274 2499 16973 11 3 52 34 0