From: Bryan T. <br...@sy...> - 2010-09-10 17:02:12
All,
Martyn and I have been running some performance tests against the RW Journal mode and are currently running it through loads of some larger data sets (several billion triples). Things are looking good, and we plan to bring the RWStore into the trunk shortly.
The RW Journal mode provides a nice complement to the current WORM journal. Like the WORM, it is able to read from historical commit points (version history). However, once the history retention period expires, it will release older database revisions and reuse the disk space allocated to those revisions. When the retention time is set to zero milliseconds, the RW mode can efficiently recycle allocations on the disk and has a much smaller footprint than the WORM (6x less space on the disk in the table below). One advantage of the RW mode is that the recycling of allocations on the disk makes it practical to use much larger branching factors than with the WORM. This accounts for the query performance difference in the table below between the WORM and RW journal modes. We see nearly 2x the query performance in the cold disk condition when the branching factor is raised from m=32 to m=128. (We see this same effect in scale-out, which uses larger branching factors for the index segments generated by dynamic sharding.)
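For anyone who wants to try the RW mode, the journal mode and retention time are controlled through the journal's configuration properties. The following is a minimal sketch; the property names are as I recall them from the code base, so check your build before relying on them:

  # Select the RW store (the WORM store is BufferMode.DiskWORM).
  com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
  # Minimum age (in milliseconds) before a commit point may be released.
  # Zero lets the RW store recycle allocations aggressively.
  com.bigdata.service.AbstractTransactionService.minReleaseAge=0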
The table below gives some results on BSBM 100M for the RW and WORM journal modes. These runs use the BSBM reduced query mix with query 3 excluded; query 3 is currently delegated to Sesame and runs very slowly, which is why it is left out of these runs. We will publish official BSBM results once we run query 3 natively -- ideally sometime late this month, as part of evaluating the refactored query engine.
The BSBM metric is QMpH (Query Mixes per Hour); higher is better. The best published score on BSBM 100M for 4 concurrent clients using the reduced query mix is 2822. You cannot directly compare our performance against that published number because the hardware, the #of concurrent clients, and the query mixes are different. However, I think the scores below are easily seen to be highly competitive.
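For concreteness, QMpH is simply 3600 * (completed query mixes) / (elapsed seconds), aggregated across all clients. For example, the RW cold disk score of 7468 QMpH in the table below, spread over 8 clients, works out to roughly 933 mixes per client per hour, or one query mix about every 3.9 seconds per client.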
There are three QMpH columns. The cold disk condition was achieved by dropping the operating system file cache and is the baseline for this table. BSBM formulates the query mixes using a random number seed. The "Different Seed" condition shows what happens when the disk cache is warm and we change the seed, which changes the query mix. The "Same Seed" condition shows what happens when the disk cache is warm and we re-run the benchmark a second time using the same seed. Performance in the "Different Seed" column improves over the baseline due to overlap in the query requests between the two sets of query mixes. The "Same Seed" condition shows the potential throughput when most queries do not touch the disk. Disk utilization for all three conditions is 98%+ as measured by iostat.
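For anyone reproducing the cold disk condition on Linux, dropping the operating system file cache amounts to something like the following, run as root between trials:

  sync; echo 3 > /proc/sys/vm/drop_caches  # flush dirty pages, then drop the page cache, dentries, and inodes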
When running against Solid State Disk (SSD), we capture most of the "Same Seed" performance gain. SSD is 8x faster for the Cold Disk condition and 2x faster for the Different Seed condition.
Finally, BSBM has a lot of relatively large literals (greater than 1KB in length). It is interesting to note that reducing the branching factor on the lexicon indices improves the load performance, but results in poorer query performance.
Thanks,
Bryan
BSBM 100M with 8 concurrent clients. The last three columns are QMpH.

Journal Mode | Load Parameters                                  | Load Rate | Disk Space (GB) | Cold Disk | Different Seed | Same Seed
WORM         | q=8000, m=32                                     | 13,074    | 125             | 4234      | 8022           | 39467
RW           | q=4000, m=128                                    | 12,781    | 21              | 7468      | 12947          | 37804
RW           | q=4000, m=128; T2ID and ID2T overridden to m=32  | 16,307    | 21              | 5631      | 9740           | 36731
q is the write retention queue capacity (see the configuration sketch below).
m is the default branching factor.
The load rate is in triples per second.
Only 50 warmup trials were used.
The JVM is hot in all runs.
The "Same Seed" condition reports the result of running the benchmark twice in a row, using the same seed for the second run.
The machine is a quad core AMD Phenom II X4 with 8MB cache @ 3GHz running CentOS with 16GB of RAM and a striped RAID array with SATA 15k spindles (Seagate Cheetah with 16MB cache, 3.5").
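To map the load parameters above onto configuration, q and m correspond to the write retention queue capacity and the default B+Tree branching factor. A sketch of the relevant properties, again from memory, so verify against your build:

  # Default B+Tree branching factor (m).
  com.bigdata.btree.BTree.branchingFactor=128
  # Write retention queue capacity (q).
  com.bigdata.btree.writeRetentionQueue.capacity=4000

The per-index overrides for the lexicon (the T2ID and ID2T row in the table) are applied through the index metadata for those specific indices rather than through the defaults above.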