The purpose of these tests is to compare 'blocking' and 'no lock' data structures in a small but 'realistic' system environment; realistic in the sense that the system contains keepers, factories, and inter-related entities interacting in some realistic fashion (e.g. not just doing puts into a map in a loop).
Overview of System & Simulation
The system simulates ebay processing using the following classes:
1) Client
2) Item (to bid on)
3) Bid (on an item from a client)
4) associated Keeper & Factory classes
The simulation is straightforward - Clients bid on Items. Each client runs in its own thread. Each client bids a 1k times against each Item in the system. Clients contend for read/write access to Items.
Items are the 'hottest' part of the system b/c each Client, in formulating a bid for an item, first gets the current 'best bid' for that item. It then puts the new bid back to the item.
Each client goes through the following steps each time it bids on an item:
Concurrent clients, each bidding on the same Item, will hence contend heavily for read/write access to this Item, and, hence to the container holding the bids in the Item e.g. the following operations will be highly contentious:
The goal is to identify which implementation of this bid container maximizes performance throughput (bids/sec) while minimizing thread jitter.
The more deterministic the system, the more each thread should take the same amount of time to run, the lower the thread jitter.
Since each client, running in its own thread, does the exact same amount of work e.g. bids 1k times on each item (interleaved), the expectation is that thread jitter should be low.
Possible Issues
Blocking, GC events, map resizing all lead to higher thread jitter and unpredicatable completion times, both of which are undesirable.
Questions
Which container<bid> implementation shows the highest throughput?
Which has the lowest jitter?
Does performance deteriorate monotonically or collapse as </bid>
1) clients are added
2) items are added
3) both
what's the impact of setting init capacity for the hash maps?
ConcurrentHashMap uses a mix of 'no lock' reads (if the map.get(key) points to an object directly) and 'blocking' writes.
If the read misses, then a read lock is initiated and the bucket is scanned for the key.
Additionally, each Item object holds the bids made on it. This Bid container needs to hold the bids and also reveal the 'best bid', in a thread safe way, when asked.
3 separate implementations of this container were tested:
Each implementation uses a slightly different concurrency approach.
1) uses blocking mutex lock which does not differentiate between reads and writes
2) uses blocking reentrant read/write lock which does differentiate between reads and writes
3) uses no locking at all; no lock access to concurrent linked queue; the peak value is maintained via a CAS comparison with each insert into the queue
Time complexity - Big O Analysis
N = total bids in system
K = total bids on any given item
ConcurrentHashMap.get() best case O(1); worse case O(N) + read lock
ConcurrentHashMap.put() best case O(1) + write lock; worse case O(N) + write lock
Concurrent hash map reads are O(1) and are concurrent except where 2 clients are accessing items from the same bucket - in which case, the bucket is read locked and a linked list is traversed ~O(N) + read lock
Concurrent hash map writes are similar except the cost is for an exclusive write lock.
1) get peak value ~ O(1)
2) add new bid to Item ~ O(1) or O(log(K))
3) put new bid to map ~ O(1)
k = bids per item
O(log(k)) or O(1)
System Performance - Best case - tends to O(log(k)), if the priority queue is used or O(1) if the either of the other two implementations is used - plus any locking costs.
Worse case, O(N) get/put on map and O(log(k)) add on the queue + read/write locks on both.
Limits to performance
1) concurrent hash map concurrency level - set to default 16
2) Log(k) insert into priority queue
3) CAS spinning
4) gc operations on Eden and survivor space
The first constraint can be controlled and should be set roughly in line with the number of concurrent clients. The clients here ranged from 1-32 so the default value, 16, seemed a reasonable choice.
The second constraint defines the system asymptotic performance if the priority queue is used to hold bids.
Basically, each bid has to be entered into a priority queue of Bids for that item - the bids are 'sorted' per the comparator, the 'max' bid is put at the top of the tree.
given 32 clients x 32 items x 1k bids per client per item = 1,024,000 bids on any given Item.
log2(10^6)~ 13 is unnecessarily expensive since the fact the fact that bids are organized per the comparator is not used in any way.
Basically, the throughput will climb, peak and trail off as a log(K) with k=number of bids on any given item
The other two implementations do not 'sort' on insert and hence are O(1).
The third constraint depends on the concurrency level in any given item, leading to CAS operation spinning as it attempts to update a value.
The last constraint is a function of the overall Eden space memory size since GC is triggered in Eden when it fills up.
Last edit: chris fabri 2014-03-12
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Implementation 1, the priority queue is the slowest, throughput is 1.7M bids/sec, due largely to the Log(k) insert cost but also due to the mutex lock which doesn't differentiate b/w reads and writes.
The blocking list and concurrentLinkedQueue implementations - though they use different 'concurrent' approaches - have similar performance.
The concurrentLinkedQueue implementation is however, generally, the faster of the two. That additional performance is likely 100% due to the different concurrency strategies.
All simulations peak for 4 Clients (and roughly 45 items) consistent with the fact that the tests were run on a single CPU 4 core machine.
Asymptotic behavior
Throughput increases as Items are added to the system and plateaus.
Adding more clients past 4 doesn't increase the performance.
Adding more Items past 45 doesn't increase the performance, showing the peak has already been established.
Performance drops monotonically as clients, items or both are added past the peak values.
Plots: Bids vs items
Plots show throughput as number of items increases, for fixed number of clients. Performance peaks quickly at 4 clients, reflecting the number of cores in the system, and trails monotonically as additional threads/clients are added.
Gapping
Gapping - the sudden drop in performance (as seen in the plots).
In general, gapping is the result of GC operations, blocking, and array resizing (ignoring operating system issues).
Gapping is worse
atstartupformanyclientsanditems
Gapping at startup is likely due, in part, to data structures resizing quickly (see section that runs tests with different map initial capacity map values).
To get a sense of scale, when there is only 1 item in the system, the bidKeeper map needs to hold 1K* number of clients. With 4 running clients, that's 4k bids that need to be held.
If the map starts with the default capacity of 16, it will need to resize 8+ times in quick succession to accommodate the inflow (e.g. 16*2^8=4096).
If this is accurate, the jitter on startup will increase as the number of clients increases, reflecting the resize operations.
Gapping at higher number of items is also possibly due to data structure resizing or GC operations.
Presizing the map and the queue or list could improve this issue.
Subsequent tests probe performance for different values of the map init capacity and seek to isolate GC events more clearly.
One of the hallmarks of 'no lock' coding is that performance is more deterministic. Given identical input conditions, the time taken to execute a given scenario should stay the same with each repetition.
We can quantify this measure by defining a measure called jitter as
Basically, if the threads were 100% deterministic, the times for each thread would be the same. hence, the stdDeviation of the completion times would be zero and the jitter would be zero.
Threads that encounter read/write locks tend to behave non-deterministic ally, leading to unpredictable jitter e.g. some threads finish quickly, but some threads take much longer to finish. This type of undeterministic behavior is undesirable.
Data structures which suddenly 'resize' also cause 'jitter'.
In these tests, when the number of clients (and hence threads) was less than or equal to 4, the number of cores, the jitter was generally between 0 adn 6% - meaning, each thread took nearly the same clock time to complete. The width of the distribution (stdDeviation) was a fraction (0-6%) of the mean.
Jitter was higher at startup and later in the test for higher number of Items, Clients or both.
All implementations show jitter to varying degrees.
Comparing the three implementations shows that Impl3 has roughly speaking the highest jitter, followed by Impl2 and then Impl1.
Impl3 seems most sensitive to environmental factors that cause jitter e.g. GC events, data structure resizes.
Special Test 1 - Different Map init Capacity values & Impl3 (no lock)
Impl3 uses two data sturctures
concurrent hash map
concurrent linked queue and CAS semantics on an atomic reference to keep track of 'peak bid' value
the later is a 100% 'no lock' implementation. The former uses volatile, no lock read access if the requested key hash maps directly to an element, else, it read locks the segment and searches for the element in the bucket.
map writes are lock protected.
Impl3 is most sensitive to any thread blocking, starvation issues (since it doesn't benefit from any statistics which would blur the effect).
Attached are plots of Throughput and Jitter for different initial capacity settings of the concurrent hash map: 16, 1k, 1M.
Conclusion
Init capacity of 1k+ seems optimal given the these test conditions.
The plots show performance increases and jitter decreases as the initial capacity is increased and intermittent map resizing is reduced.
Eden space was made large - around 2 gig to avoid triggering a collection during any given simulation. System.gc was called between collections.
When triggered, the parallel collector operating on the Young collections (eden + survivor spaces) will run a Parallel Scavenge over Eden space - copying live objects over to one of the survivor spaces.
Generally, it will identify live objects by starting with the application threads and identifying associated objects.
Once complete, this GC operation leaves Eden 100% free. Parallel Scavenge GC operation is done in non-application threads but is a 'stop the world' event e.g. the application threads are stopped.
One can see 'dips' in performance on the attached charts for large number of client-Item scenarios - these are likely GC events.
They are also possibly hashmap table resizes and or read/write locks being hit.
GC - Special test 2
pushing the boundaries of the box (8 gig ram), we increase Heap further, increasing Eden and increasing the 2 survivor spaces.
Special Test 2 - Turn off BidKeeper, add more Items, tweak GC settings
To better isolate the initial and asymptotic performance of the 'concurrentlinkedqueue' from background effects like
1) GC operations
2) map resizing events
an additional test was performed with
1) 300 Items (vs 100)
2) BidKeeper turned off
3) Heap/Eden/Survivor spaces all increased
Adding more Items allows the test to probe for asymptotic behavior. is the performance periodic? Does it tail off monotonically down?
Turning off the BidKeeper means there is very little interaction with any ConcurrenthHashMaps in the system; the performance then just reflects contentious read/write operations into the concurrent linked queue and the CAS operation.
Adding more heap memory - and more memory to Eden and survivor spaces is an attempt at reducing the number of
1) Young generation collections - Parallel Scavenge
2) Tenured space collection - parallel mark and sweep
Results
With the BidKeeper turned off, the performance peaks at 3M bids/sec.
Impact of concurrentHashMap Interaction
The max throughput with the BidKeeper active is 1.7M bids/sec. The performance with the BidKeeper inactive is 3M bids/sec.
Hence, the interaction with the BidKeeper slows the system down by 300k bids/sec. - around 11% (.3/2.7 ~ 11%).
Results
Attached graphs show results performance and jitter given these settings. The large spikes are identified as GC events (via VisualVM)
As expected, the throughput goes up a bit peaking past 3M bids/sec. The startup is less jittery. GC events cause massive drops in performance.
Impl3, the no lock implementation, is clearly the fastest but is the most sensitive to environmental factors (GC, data structure resize).
Server Tuning
Doing standard server capacity tuning - as one would do for any server - cleans up the performance.
1) dedicate 1+ threads (and core) for GC/Operating system
2) tune map sizes to target input data set at peak capacity
3) tune Heap/Eden/Survivor spaces to reflect churn; ideally, use 'pools' of pre-constructed entities to fix memory usage at start-up and avoid all GC calls during live run
This is standard server 'tunning' that one would do anyway - but it's particularly impactful for the no-lock implementation.
Last edit: chris fabri 2014-03-12
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Purpose of Tests
The purpose of these tests is to compare 'blocking' and 'no lock' data structures in a small but 'realistic' system environment; realistic in the sense that the system contains keepers, factories, and inter-related entities interacting in some realistic fashion (e.g. not just doing puts into a map in a loop).
Overview of System & Simulation
The system simulates ebay processing using the following classes:
The simulation is straightforward - Clients bid on Items. Each client runs in its own thread. Each client bids a 1k times against each Item in the system. Clients contend for read/write access to Items.
The following classes use containers:
Items are the 'hottest' part of the system b/c each Client, in formulating a bid for an item, first gets the current 'best bid' for that item. It then puts the new bid back to the item.
Each client goes through the following steps each time it bids on an item:
Concurrent clients, each bidding on the same Item, will hence contend heavily for read/write access to this Item, and, hence to the container holding the bids in the Item e.g. the following operations will be highly contentious:
Different implementations of this container were tested.
Goal
The goal is to identify which implementation of this bid container maximizes performance throughput (bids/sec) while minimizing thread jitter.
The more deterministic the system, the more each thread should take the same amount of time to run, the lower the thread jitter.
Since each client, running in its own thread, does the exact same amount of work e.g. bids 1k times on each item (interleaved), the expectation is that thread jitter should be low.
Possible Issues
Blocking, GC events, map resizing all lead to higher thread jitter and unpredicatable completion times, both of which are undesirable.
Questions
Which container<bid> implementation shows the highest throughput?
Which has the lowest jitter?
Does performance deteriorate monotonically or collapse as </bid>
1) clients are added
2) items are added
3) both
what's the impact of setting init capacity for the hash maps?
Last edit: chris fabri 2014-03-19
Concurrency, Blocking & Containers
The code uses concurrent hash maps to hold instances of Client, Bid, Item objects:
ConcurrentHashMap uses a mix of 'no lock' reads (if the map.get(key) points to an object directly) and 'blocking' writes.
If the read misses, then a read lock is initiated and the bucket is scanned for the key.
Additionally, each Item object holds the bids made on it. This Bid container needs to hold the bids and also reveal the 'best bid', in a thread safe way, when asked.
3 separate implementations of this container were tested:
Each implementation uses a slightly different concurrency approach.
1) uses blocking mutex lock which does not differentiate between reads and writes
2) uses blocking reentrant read/write lock which does differentiate between reads and writes
3) uses no locking at all; no lock access to concurrent linked queue; the peak value is maintained via a CAS comparison with each insert into the queue
Time complexity - Big O Analysis
N = total bids in system
K = total bids on any given item
ConcurrentHashMap.get() best case O(1); worse case O(N) + read lock
ConcurrentHashMap.put() best case O(1) + write lock; worse case O(N) + write lock
Concurrent hash map reads are O(1) and are concurrent except where 2 clients are accessing items from the same bucket - in which case, the bucket is read locked and a linked list is traversed ~O(N) + read lock
Concurrent hash map writes are similar except the cost is for an exclusive write lock.
1) get peak value ~ O(1)
2) add new bid to Item ~ O(1) or O(log(K))
3) put new bid to map ~ O(1)
k = bids per item
O(log(k)) or O(1)
System Performance - Best case - tends to O(log(k)), if the priority queue is used or O(1) if the either of the other two implementations is used - plus any locking costs.
Worse case, O(N) get/put on map and O(log(k)) add on the queue + read/write locks on both.
Limits to performance
1) concurrent hash map concurrency level - set to default 16
2) Log(k) insert into priority queue
3) CAS spinning
4) gc operations on Eden and survivor space
The first constraint can be controlled and should be set roughly in line with the number of concurrent clients. The clients here ranged from 1-32 so the default value, 16, seemed a reasonable choice.
The second constraint defines the system asymptotic performance if the priority queue is used to hold bids.
Basically, each bid has to be entered into a priority queue of Bids for that item - the bids are 'sorted' per the comparator, the 'max' bid is put at the top of the tree.
given 32 clients x 32 items x 1k bids per client per item = 1,024,000 bids on any given Item.
log2(10^6)~ 13 is unnecessarily expensive since the fact the fact that bids are organized per the comparator is not used in any way.
Basically, the throughput will climb, peak and trail off as a log(K) with k=number of bids on any given item
The other two implementations do not 'sort' on insert and hence are O(1).
The third constraint depends on the concurrency level in any given item, leading to CAS operation spinning as it attempts to update a value.
The last constraint is a function of the overall Eden space memory size since GC is triggered in Eden when it fills up.
Last edit: chris fabri 2014-03-12
Basic Test - Performance Results
The basic tests compare impl1/impl2/impl3 using default settings
1) map init capacity - default (16)
2) map concurrency level - default (16)
3) blocking queue init capacity - default
Throughput (bids/sec) - higher is better
Define throughput as 'bids/sec' e.g. the number of bids per second by a client.
Here are the peak throughput numbers (bids/sec) for different implementations of the container used to hold the bids on each Item being bid on.
Impl1 - Blocking Priority Queue - 1.7M Bids/Sec
Impl2 -Blocking Peak Value+List - 2.5M bids/Sec
Impl3 - CAS Peak Value + ConcurrentLQueue: 2.6M+ bids/Sec
Conclusions
Implementation 3 is the fastest.
Implementation 1, the priority queue is the slowest, throughput is 1.7M bids/sec, due largely to the Log(k) insert cost but also due to the mutex lock which doesn't differentiate b/w reads and writes.
The blocking list and concurrentLinkedQueue implementations - though they use different 'concurrent' approaches - have similar performance.
The concurrentLinkedQueue implementation is however, generally, the faster of the two. That additional performance is likely 100% due to the different concurrency strategies.
All simulations peak for 4 Clients (and roughly 45 items) consistent with the fact that the tests were run on a single CPU 4 core machine.
Asymptotic behavior
Throughput increases as Items are added to the system and plateaus.
Adding more clients past 4 doesn't increase the performance.
Adding more Items past 45 doesn't increase the performance, showing the peak has already been established.
Performance drops monotonically as clients, items or both are added past the peak values.
Plots: Bids vs items
Plots show throughput as number of items increases, for fixed number of clients. Performance peaks quickly at 4 clients, reflecting the number of cores in the system, and trails monotonically as additional threads/clients are added.
Gapping
Gapping - the sudden drop in performance (as seen in the plots).
In general, gapping is the result of GC operations, blocking, and array resizing (ignoring operating system issues).
Gapping is worse
Gapping at startup is likely due, in part, to data structures resizing quickly (see section that runs tests with different map initial capacity map values).
To get a sense of scale, when there is only 1 item in the system, the bidKeeper map needs to hold 1K* number of clients. With 4 running clients, that's 4k bids that need to be held.
If the map starts with the default capacity of 16, it will need to resize 8+ times in quick succession to accommodate the inflow (e.g. 16*2^8=4096).
If this is accurate, the jitter on startup will increase as the number of clients increases, reflecting the resize operations.
Gapping at higher number of items is also possibly due to data structure resizing or GC operations.
Presizing the map and the queue or list could improve this issue.
Subsequent tests probe performance for different values of the map init capacity and seek to isolate GC events more clearly.
Last edit: chris fabri 2014-03-13
Basic Test - thread jitter
One of the hallmarks of 'no lock' coding is that performance is more deterministic. Given identical input conditions, the time taken to execute a given scenario should stay the same with each repetition.
We can quantify this measure by defining a measure called jitter as
Basically, if the threads were 100% deterministic, the times for each thread would be the same. hence, the stdDeviation of the completion times would be zero and the jitter would be zero.
Threads that encounter read/write locks tend to behave non-deterministic ally, leading to unpredictable jitter e.g. some threads finish quickly, but some threads take much longer to finish. This type of undeterministic behavior is undesirable.
Data structures which suddenly 'resize' also cause 'jitter'.
In these tests, when the number of clients (and hence threads) was less than or equal to 4, the number of cores, the jitter was generally between 0 adn 6% - meaning, each thread took nearly the same clock time to complete. The width of the distribution (stdDeviation) was a fraction (0-6%) of the mean.
Jitter was higher at startup and later in the test for higher number of Items, Clients or both.
All implementations show jitter to varying degrees.
Comparing the three implementations shows that Impl3 has roughly speaking the highest jitter, followed by Impl2 and then Impl1.
Impl3 seems most sensitive to environmental factors that cause jitter e.g. GC events, data structure resizes.
Last edit: chris fabri 2014-03-12
Special Test 1 - Different Map init Capacity values & Impl3 (no lock)
Impl3 uses two data sturctures
the later is a 100% 'no lock' implementation. The former uses volatile, no lock read access if the requested key hash maps directly to an element, else, it read locks the segment and searches for the element in the bucket.
map writes are lock protected.
Impl3 is most sensitive to any thread blocking, starvation issues (since it doesn't benefit from any statistics which would blur the effect).
Attached are plots of Throughput and Jitter for different initial capacity settings of the concurrent hash map: 16, 1k, 1M.
Conclusion
Init capacity of 1k+ seems optimal given the these test conditions.
The plots show performance increases and jitter decreases as the initial capacity is increased and intermittent map resizing is reduced.
Last edit: chris fabri 2014-03-12
Test Settings
JDK 1.7
GC - Basic Tests
The simulations were run in eclipse with the following JVM arguments
-Xmx4096m -javaagent:classmexer.jar -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=1 -XX:SurvivorRatio=6
Eden space was made large - around 2 gig to avoid triggering a collection during any given simulation. System.gc was called between collections.
When triggered, the parallel collector operating on the Young collections (eden + survivor spaces) will run a Parallel Scavenge over Eden space - copying live objects over to one of the survivor spaces.
Generally, it will identify live objects by starting with the application threads and identifying associated objects.
Once complete, this GC operation leaves Eden 100% free. Parallel Scavenge GC operation is done in non-application threads but is a 'stop the world' event e.g. the application threads are stopped.
One can see 'dips' in performance on the attached charts for large number of client-Item scenarios - these are likely GC events.
They are also possibly hashmap table resizes and or read/write locks being hit.
GC - Special test 2
pushing the boundaries of the box (8 gig ram), we increase Heap further, increasing Eden and increasing the 2 survivor spaces.
-Xmx5000m -javaagent:classmexer.jar -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=1 -XX:SurvivorRatio=1
Tenured = 2.5g
Young = 2.5g
Eden = 833M
Survivor1=833M
Survivor2=833M
Last edit: chris fabri 2014-03-12
Special Test 2 - Turn off BidKeeper, add more Items, tweak GC settings
To better isolate the initial and asymptotic performance of the 'concurrentlinkedqueue' from background effects like
1) GC operations
2) map resizing events
an additional test was performed with
1) 300 Items (vs 100)
2) BidKeeper turned off
3) Heap/Eden/Survivor spaces all increased
Adding more Items allows the test to probe for asymptotic behavior. is the performance periodic? Does it tail off monotonically down?
Turning off the BidKeeper means there is very little interaction with any ConcurrenthHashMaps in the system; the performance then just reflects contentious read/write operations into the concurrent linked queue and the CAS operation.
Adding more heap memory - and more memory to Eden and survivor spaces is an attempt at reducing the number of
1) Young generation collections - Parallel Scavenge
2) Tenured space collection - parallel mark and sweep
Results
With the BidKeeper turned off, the performance peaks at 3M bids/sec.
Impact of concurrentHashMap Interaction
The max throughput with the BidKeeper active is 1.7M bids/sec. The performance with the BidKeeper inactive is 3M bids/sec.
Hence, the interaction with the BidKeeper slows the system down by 300k bids/sec. - around 11% (.3/2.7 ~ 11%).
Results
Attached graphs show results performance and jitter given these settings. The large spikes are identified as GC events (via VisualVM)
As expected, the throughput goes up a bit peaking past 3M bids/sec. The startup is less jittery. GC events cause massive drops in performance.
Each plot shows periodic behavior.
Last edit: chris fabri 2014-03-19
Final Conclusions
Impl3, the no lock implementation, is clearly the fastest but is the most sensitive to environmental factors (GC, data structure resize).
Server Tuning
Doing standard server capacity tuning - as one would do for any server - cleans up the performance.
1) dedicate 1+ threads (and core) for GC/Operating system
2) tune map sizes to target input data set at peak capacity
3) tune Heap/Eden/Survivor spaces to reflect churn; ideally, use 'pools' of pre-constructed entities to fix memory usage at start-up and avoid all GC calls during live run
This is standard server 'tunning' that one would do anyway - but it's particularly impactful for the no-lock implementation.
Last edit: chris fabri 2014-03-12