Recent changes to Home

Discussion for Home page

chris fabri — Wed, 12 Mar 2014 11:46:33 -0000

Final Conclusions

Impl3, the no lock implementation, is the fastest one but is most sensitive to jitter and GC operations.

GC and GC dedicated threads

As always, we need to keep at least 1 thread (and its associated core) free for GC work and operating system overhead.

Impact of concurrentHashMap Interaction

The max throughput with the BidKeeper active is 2.7M bids/sec. The performance with the BidKeeper inactive is 3M bids/sec.

Hence, the interaction with the BidKeeper slows the system down by 300k bids/sec - around 11% of the BidKeeper-active rate (0.3/2.7 ~ 11%), or 10% of the inactive rate.

ConcurrentHashMap & Jitter

With foreknowledge of the input data set, setting the map's initial capacity to match helps reduce 'jitter' on startup.
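A minimal sketch of pre-sizing, assuming the expected number of entries is known before the run. The class and method names are hypothetical and not taken from the project; the load factor and concurrency level are the JDK defaults.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration; not part of the code under test.
final class PresizedMaps {
    private static final float LOAD_FACTOR = 0.75f;   // JDK default load factor
    private static final int CONCURRENCY_LEVEL = 16;  // JDK default concurrency level

    // Table capacity that keeps expectedEntries under the load factor threshold.
    // On JDK 8+ the constructor already scales by the load factor, so this
    // over-provisions slightly there, which is harmless.
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / (double) LOAD_FACTOR);
    }

    static <K, V> ConcurrentHashMap<K, V> forExpectedEntries(int expectedEntries) {
        return new ConcurrentHashMap<K, V>(
                capacityFor(expectedEntries), LOAD_FACTOR, CONCURRENCY_LEVEL);
    }
}
```

Sizing up front trades some memory for fewer table resizes while the map fills, which is where the startup jitter is assumed to come from.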

Discussion for Home page

chris fabri — Wed, 12 Mar 2014 11:34:38 -0000

Impl3 - No Lock Implementation - Asymptotic behavior and GC

To better isolate the performance of the ConcurrentLinkedQueue from background effects like
1) GC operations
2) map resizing events

an additional test was performed with

1) 300 Items (vs 100)
2) BidKeeper turned off
3) Heap/Eden/Survivor spaces all increased

Adding more Items allows the test to probe for asymptotic behavior. Is the performance periodic? Does it tail off monotonically?

Turning off the BidKeeper means there is very little interaction with any ConcurrentHashMaps in the system; the performance then just reflects contended read/write operations on the concurrent linked queue and the CAS operation.

Adding more heap memory - and more memory to the Eden and survivor spaces - is an attempt at reducing the number of
1) Young generation collections - Parallel Scavenge
2) Tenured space collections - parallel mark and sweep

Results

Attached graphs show the resulting performance and jitter given these settings. The large spikes are identified as GC events (via VisualVM).

As expected, the throughput goes up a bit, peaking past 3M bids/sec. The startup is less jittery. GC events cause massive drops in performance.

Conclusion

As expected, we need to keep at least 1 thread (and associated core) for doing GC work.

Discussion for Home page

chris fabri — Tue, 11 Mar 2014 14:19:04 -0000

Test Settings

Concurrent Hash Map

Concurrent Hash Map 'concurrency level' was left at the default 16.

Blocking Priority Queue

The queue uses a comparator to sort the bids. The default size was used.
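For illustration, a queue set up this way might look like the sketch below. The `Bid` class and its fields are hypothetical, since the real bid type is not shown in these notes; 11 is the JDK's default initial capacity for `PriorityBlockingQueue`.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

final class BidQueueSketch {
    // Hypothetical bid type, assumed for this example only.
    static final class Bid {
        final String item;
        final long amount;
        Bid(String item, long amount) { this.item = item; this.amount = amount; }
    }

    // Orders bids highest amount first, so peek() returns the best bid.
    static final Comparator<Bid> HIGHEST_FIRST = new Comparator<Bid>() {
        @Override
        public int compare(Bid a, Bid b) {
            return Long.compare(b.amount, a.amount);
        }
    };

    static PriorityBlockingQueue<Bid> newQueue() {
        // Initial capacity 11 matches the JDK default; the queue grows as needed.
        return new PriorityBlockingQueue<Bid>(11, HIGHEST_FIRST);
    }
}
```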

GC

The simulations were run in eclipse with the following JVM arguments

-Xmx4096m -javaagent:classmexer.jar -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=1 -XX:SurvivorRatio=6

Eden space was made large - around 2 GB - to avoid triggering a collection during any given simulation. System.gc was called between simulations.
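As a rough check on those flags, the nominal generation sizes can be worked out from the ratios: NewRatio is old/young and SurvivorRatio is Eden/one survivor space. The sketch below is only arithmetic on the flag values; the class name is made up, and the real sizes can differ because of alignment, the initial heap size (no -Xms was given) and the parallel collector's adaptive sizing.

```java
final class HeapLayoutSketch {
    // Nominal sizes in MB implied by -Xmx, -XX:NewRatio and -XX:SurvivorRatio.
    // Returns {young, eden, eachSurvivor, old}.
    static long[] nominalSizesMb(long heapMb, int newRatio, int survivorRatio) {
        long young = heapMb / (newRatio + 1);         // NewRatio = old / young
        long survivor = young / (survivorRatio + 2);  // young = eden + 2 survivor spaces
        long eden = young - 2 * survivor;             // SurvivorRatio = eden / survivor
        long old = heapMb - young;
        return new long[] { young, eden, survivor, old };
    }

    public static void main(String[] args) {
        long[] s = nominalSizesMb(4096, 1, 6);
        System.out.println("young=" + s[0] + " eden=" + s[1]
                + " survivor=" + s[2] + " old=" + s[3]);
    }
}
```

With -Xmx4096m, NewRatio=1 and SurvivorRatio=6 this gives a 2048 MB young generation split into a 1536 MB Eden and two 256 MB survivor spaces, so the "around 2 gig" figure is closer to the whole young generation than to Eden alone.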

When triggered, the parallel collector operating on the Young collections (eden + survivor spaces) will run a Parallel Scavenge over Eden space - copying live objects over to one of the survivor spaces.

Generally, it will identify live objects by starting with the application threads and identifying associated objects.

Once complete, this GC operation leaves Eden 100% free. The Parallel Scavenge GC operation is done in non-application threads but is a 'stop the world' event, i.e. the application threads are stopped.

One can see 'dips' in performance on the attached charts for scenarios with a large number of clients and Items - these are likely GC events.

They could also be hash map table resizes and/or read/write locks being hit.
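One way to tell these causes apart, sketched here as an assumption rather than what the tests actually did, is to sample the JVM's collection counters around each measurement interval. A dip that coincides with a change in the counters points at GC; a dip without one points at resizes or lock contention. The class name is hypothetical; the MXBean calls are standard `java.lang.management` APIs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcProbe {
    // Total collections reported by all collectors since JVM start.
    // Collectors that report -1 (undefined) are skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count;
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds across all collectors.
    static long totalCollectionTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) total += time;
        }
        return total;
    }
}
```

Calling `totalCollections()` before and after a timed interval and recording the difference next to the throughput sample is enough to label each dip.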

Discussion for Home page

chris fabri — Mon, 10 Mar 2014 15:02:42 -0000

Jitter and Gapping

Impl3 uses two data structures
- concurrent hash map
- concurrent linked queue, with CAS semantics on an atomic reference to keep track of the 'peak bid' value

The latter is a full 'no lock' implementation. The former uses volatile, no lock read access in the common case; depending on the JDK version, a read may fall back to briefly locking the segment in rare cases.

The writes are lock protected.
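The queue-plus-CAS part can be sketched roughly as follows. This is an illustration of the technique, not the Impl3 source: the class, the `Bid` type and the method names are all assumed.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

final class LockFreeBidBook {
    // Hypothetical bid type, assumed for this example only.
    static final class Bid {
        final long amount;
        Bid(long amount) { this.amount = amount; }
    }

    private final ConcurrentLinkedQueue<Bid> bids = new ConcurrentLinkedQueue<Bid>();
    private final AtomicReference<Bid> peak = new AtomicReference<Bid>();

    // Records the bid, then tries to install it as the peak with a CAS retry loop.
    void submit(Bid bid) {
        bids.add(bid);
        while (true) {
            Bid current = peak.get();
            if (current != null && current.amount >= bid.amount) {
                return; // existing peak is at least as good
            }
            if (peak.compareAndSet(current, bid)) {
                return; // this thread won the race
            }
            // Another thread changed the peak in between; re-read and retry.
        }
    }

    Bid peak() { return peak.get(); }

    int size() { return bids.size(); }
}
```

No thread ever blocks here; a thread that loses the CAS simply retries, which is why stalls such as GC pauses show up so directly in the timings.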

This implementation is the most sensitive to thread blocking and starvation issues, since it doesn't benefit from any statistical averaging that would blur the effect.

Attached are plots of Throughput and Jitter for different initial capacity settings of the concurrent hash map: 16, 1k, 1M.

The plots show performance and jitter are both reduced as the initial capacity is increased and intermittent map resizing is reduced.

Home modified by chris fabri

chris fabri — Sat, 08 Mar 2014 11:51:41 -0000

--- v1
+++ v2
@@ -1,6 +1,8 @@
 Welcome to your wiki!

 This is the default page, edit it as you see fit. To add a new page simply reference it within brackets, e.g.: [SamplePage].
+
+[Test Settings]

 The wiki uses [Markdown](/p/elise-ebay/wiki/markdown_syntax/) syntax.

Discussion for Home page

chris fabri — Sat, 08 Mar 2014 11:43:56 -0000

Test Settings

Concurrent Hash Map

Concurrent Hash Map 'concurrency level' was left at the default 16.

Blocking Priority Queue

The queue uses a comparator to sort the bids. The default size was used.

Blocking Priority Bag

Of course, we don't fully need a priority queue for this implementation. If one can assume that added bids are rarely removed from the queue, then it's only necessary to check any new bid against the current 'best' or 'max' bid and swap them if necessary, an operation that is really just O(1).

GC

The simulations were run in eclipse with the following JVM arguments

-Xmx4096m -javaagent:classmexer.jar -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=1 -XX:SurvivorRatio=6

Eden space was made large - around 2 GB - to avoid triggering a collection during any given simulation. System.gc was called between simulations.

Generally, the parallel collector will identify live objects by starting with the application threads and identifying associated objects.

Once complete, this GC operation leaves Eden 100% free. The Parallel Scavenge GC operation is done in non-application threads but is a 'stop the world' event, i.e. the application threads are stopped.

One can see 'dips' in performance on the attached charts for scenarios with a large number of clients and Items - these are likely GC events.

They could also be hash map table resizes and/or read/write locks being hit.

Discussion for Home page

chris fabri — Fri, 07 Mar 2014 14:36:06 -0000

Better Performance - 'priority bag'

The main slowdown in the system is inserting Bids into the blocking priority queue. Each insert is O(log k), so inserting k Bids costs O(k log k), with k = number of Bids in the queue.

We don't need a fully sorted 'tree' of Bids however - we only need to keep track of the best bid. As such, we can implement a 'priority bag'; this bag keeps track of the highest value it has seen but doesn't sort or order the items past the max value.
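A minimal sketch of such a bag is below, assuming bids are comparable and are not removed once added. The class and method names are hypothetical, and the simple synchronized methods stand in for whatever locking the real implementation used.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative 'priority bag': tracks the max element, keeps the rest unordered.
final class PriorityBag<T extends Comparable<T>> {
    private final List<T> items = new ArrayList<T>();
    private T max; // highest element seen so far, or null when empty

    // Amortized O(1): append, then one comparison against the current max.
    synchronized void add(T item) {
        items.add(item);
        if (max == null || item.compareTo(max) > 0) {
            max = item;
        }
    }

    synchronized T peekMax() {
        return max;
    }

    synchronized int size() {
        return items.size();
    }
}
```

The trade-off is that removing the max would need an O(k) rescan to find the next best, which is acceptable only if removals are rare.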

Peak performance with this custom priority bag was 2.5M bids/sec - roughly 47% better than the peak of 1.7M bids/sec using the blocking priority queue.

As before, the peak throughput occurred when using 4 clients (and hence 4 threads). This is expected, since the test was run on a 4 core, single CPU machine.

Discussion for Home page

chris fabri — Thu, 06 Mar 2014 19:01:22 -0000

Deterministic Performance thread jitter

One of the hallmarks of 'no lock' coding is that performance is more deterministic. Given identical input conditions, the time taken to execute a given scenario stays constant.

We can quantify this measure by doing some stats on the times associated with each thread of execution.

Basically, if the threads were 100% deterministic, the times for each thread would be the same. Hence, the standard deviation of the thread times divided by their mean gives us a good measure of jitter.
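Read as the standard deviation of the per-thread times divided by their mean, the measure could be computed as in this sketch. The class and method names are made up, and the use of the population standard deviation is an assumption.

```java
final class JitterStats {
    // Jitter as population standard deviation divided by the mean of the thread times.
    // Returns 0.0 when every thread took exactly the same time.
    static double jitter(long[] threadTimesNanos) {
        int n = threadTimesNanos.length;
        if (n == 0) {
            throw new IllegalArgumentException("need at least one thread time");
        }
        double mean = 0;
        for (long t : threadTimesNanos) {
            mean += t;
        }
        mean /= n;

        double sumSquares = 0;
        for (long t : threadTimesNanos) {
            double d = t - mean;
            sumSquares += d * d;
        }
        return Math.sqrt(sumSquares / n) / mean;
    }
}
```

For example, two threads taking 90 and 110 time units have a mean of 100 and a standard deviation of 10, giving a jitter of 0.1.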

Discussion for Home page

chris fabri — Thu, 06 Mar 2014 14:42:35 -0000

Bids vs Clients

For a fixed number of items, what happens as we add more clients?

The graph shows that the throughput peaks as the client count matches the core count - then stays constant as we add more clients.

The fact that the throughput stays constant as we add more clients is a reflection of 'no lock' read/writes across a well distributed set of buckets in the concurrent hash map.
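A stripped-down harness for this kind of measurement might look like the sketch below. It is not the test code behind the graphs: the class, the counter-per-item map and the way bids are spread across items are all assumptions made to keep the example short.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

final class ThroughputSketch {
    // Runs `clients` threads, each posting `bidsPerClient` bids spread over `items` keys.
    // Returns {totalBidsRecorded, elapsedNanos}.
    static long[] run(int clients, int items, int bidsPerClient) {
        final ConcurrentHashMap<Integer, AtomicLong> bidsPerItem =
                new ConcurrentHashMap<Integer, AtomicLong>(items);
        for (int i = 0; i < items; i++) {
            bidsPerItem.put(i, new AtomicLong());
        }
        final CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[clients];
        for (int c = 0; c < clients; c++) {
            final int offset = c;
            threads[c] = new Thread(() -> {
                try {
                    start.await(); // release all clients at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int b = 0; b < bidsPerClient; b++) {
                    bidsPerItem.get((offset + b) % items).incrementAndGet();
                }
            });
            threads[c].start();
        }
        long t0 = System.nanoTime();
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for clients", e);
        }
        long elapsed = System.nanoTime() - t0;
        long total = 0;
        for (AtomicLong n : bidsPerItem.values()) {
            total += n.get();
        }
        return new long[] { total, elapsed };
    }
}
```

Dividing the first value by the second (converted to seconds) gives bids/sec; repeating the run for 1, 2, 4, 8, 16 and 32 clients produces the kind of curve discussed here. A real measurement would also need warm-up runs so JIT compilation does not distort the first samples.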

The dips on the 32 client plot are likely GC execution.

Bids vs Items

For a fixed number of clients, what happens as we add more items?

The graph shows that the throughput peaks for client count matching the core count - around 4-5 clients - staying very similar as we add more clients. The plot of 3 clients is basically the same as the plot of 32 clients. Adding 10x the clients doesn't change the throughput plot at all.

The fact that the throughput plots stay basically the same as we add more clients past 3 is a reflection of 'no lock' read/writes across a well distributed set of buckets in the concurrent hash map.

Discussion for Home page

chris fabri — Thu, 06 Mar 2014 10:41:08 -0000