Re: [javagroups-users] "Golden" configuration for ISPN 5.1 (JGroups 3.x)
From: Bela B. <be...@ya...> - 2011-12-09 06:45:12
On 12/8/11 5:34 PM, Erik Salter wrote:
> Hi all,
>
> We are migrating from ISPN 5.0.x to ISPN 5.1.x, which effects a move from
> JGroups 2.x to 3.x. Since I think this would be beneficial to all users
> who need to migrate and have a "quick-start" configuration, I'm sending it
> to this list.

Hmm, maybe you should have posted this to the Infinispan list instead.
Anyway, I reply here and you can cross-post to ispn-dev if you want...
and maybe we should continue there.

> Currently, our configuration is a TCP-based stack that we're using for a
> 6-12 distributed node cluster initially, scaling linearly. We have two
> classifications of caches:
>
> - Non-transactional caches keyed by unique values. These record sizes
>   are typically larger. A write operation will write ~10K of data to
>   the grid for each owner.
>
> - Caches that have high lock contention. These will write ~7K of data
>   to the grid for each owner. These caches also use distributed
>   executor tasks that serialize the request (~20K) to be passed to the
>   key's data owner.
>
> The data grid nodes live behind a round-robin load balancer. We are trying
> to push the data grid as fast as it will go, which seems to be a total
> throughput of 120 writes/sec on a 6 node cluster.

- Are these writes/sec *per node*?
- Does the load balancer hit every node (I assume so)?
- What's the message size? 10K (non-transactional) or 20K/7K (transactional)?

120 writes/sec is bad, even if it is per node! Such a low number could
only occur if you hit the same key(s) on different nodes in a
*transactional* cache; TX collisions and subsequent rollbacks could
probably get you such a low write rate.

I ran some tests in our lab yesterday with 9 nodes, see my email below
for reference:

================================ forwarded email =========================

- To start the test, run "jt UnicastTestRpcDist -props /home/bela/fast.xml
  -name A (-I)" on 9 nodes. I call the members A-I.
- They should find each other. Note that 'jt' sets some JVM options and a
  max mem of 500m (which is not that much!). I could probably get some
  more perf out of this if I tuned the options better (e.g.
  ConcurrentMarkSweep for the old gen, or use of the G1 collector). I'm
  also using JDK 1.6 build 23, which doesn't use CCMS or compressed
  pointers (this is done automatically starting with build 29).
- fast.xml sets mcast_addr to 232.x.x.x, which makes JGroups use eth1.
- Once this is done, go to 1 node and press '1'.
- Do this a couple of times, to warm the cluster up.
- You can also change the read/write ratio, message size, number of
  messages to send, num-owners etc.
- I used JGroups 3.1.0.Alpha1 (master): /home/bela/JGroups

Here are some numbers (cluster size is 9 nodes: cluster01-09, number of
messages=20000 and anycast count (num-owners)=2; these are the defaults):

Message size:   Avg. message rate/sec/node:   Avg. throughput/sec/node:
 1000           19'000                         19.0MB
 2000           19'000                         38.0MB
 4000           18'400                         73.9MB
 8000           17'000                        135.0MB
16000           10'500                        168.0MB
32000            5'200                        167.0MB

Note that this is like Infinispan: a GET carries no payload but returns
the payload (e.g. 1000 bytes), and a PUT carries a payload and returns
nothing. Also, there is no L1 cache enabled (or anything similar).
Members are allowed to pick themselves for GETs and PUTs; that's why
we're getting throughputs of over 125MB/sec.

These numbers should be the baseline against which Infinispan can be
compared. As UnicastTestRpcDist doesn't do any real work (e.g. acquire
locks, place values in a hashmap, access cache loaders etc), it should
always be faster, say by 30%. But the difference in performance should
stay roughly constant, and not change for different cluster sizes,
message sizes etc.

============================== end of forwarded email =======================

As you can see, we get 10'500 messages/sec/node (168MB/sec/node) for a
payload of 16K. This is with a read/write ratio of 0.8; when I change
this to 0.20 (80% writes), I still get 3'400 messages/sec/node
(55MB/sec/node).

You could try to run UnicastTestRpcDist in your perf lab and see if you
get similar numbers. Take your existing config, then take my suggested
config, see which one gets you better results, and use that one.

> I've attached a sample configuration file for the 3.x version.

My preference is udp.xml or udp-largecluster.xml (both are shipped with
JGroups). A few comments regarding your 3.x config (a sketch applying
them follows the list):

- I recommend switching to UDP/PING
- The thread pools in the transport have min sizes which are too big.
  Also, the rejection policies are "run", which is something I don't
  recommend. I also recommend a queue for the default thread pool
- Use MERGE3 instead of MERGE2
- FD: increase the timeouts, or else you'll get false suspicions
  (consult the wiki for more details re FD versus FD_SOCK)
- NAKACK: use exponential_backoff, this saves memory
- UNICAST --> UNICAST2
- What's GMS doing *under* STABLE????
- GMS: the merge_timeout of 600s is too high
- STABLE should have max_bytes set
- FC --> MFC/UFC
- If you use STREAMING_STATE_TRANSFER, you should have BARRIER in your
  config! Note that SST doesn't exist anymore in 3.x; it's now called
  STATE or STATE_SOCK
- Remove FLUSH (I don't think you need it, based on our IRC conversation)
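To make this concrete, here is a rough sketch of a udp.xml-style stack
with the above applied. The attribute values are illustrative starting
points close to the stock 3.x defaults, not tuned recommendations for
this particular workload; start from the shipped udp.xml and diff
against it:

<config xmlns="urn:org:jgroups">
    <!-- Transport: smaller min pool sizes, a queue for the default pool,
         and "discard" instead of "run" as rejection policy -->
    <UDP mcast_port="${jgroups.udp.mcast_port:45588}"
         thread_pool.enabled="true"
         thread_pool.min_threads="2"
         thread_pool.max_threads="8"
         thread_pool.keep_alive_time="60000"
         thread_pool.queue_enabled="true"
         thread_pool.queue_max_size="10000"
         thread_pool.rejection_policy="discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="2"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.rejection_policy="discard"/>
    <PING/>
    <!-- MERGE3 instead of MERGE2 -->
    <MERGE3 min_interval="10000" max_interval="30000"/>
    <FD_SOCK/>
    <!-- FD with higher timeouts, to avoid false suspicions -->
    <FD timeout="6000" max_tries="5"/>
    <VERIFY_SUSPECT timeout="1500"/>
    <!-- BARRIER is needed for state transfer -->
    <BARRIER/>
    <!-- exponential_backoff saves memory -->
    <pbcast.NAKACK exponential_backoff="300" use_mcast_xmit="true"/>
    <!-- UNICAST2 instead of UNICAST -->
    <UNICAST2 stable_interval="5000" max_bytes="1M"/>
    <!-- STABLE with max_bytes set; GMS sits *above* it in the stack -->
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="4M"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                merge_timeout="5000" view_bundling="true"/>
    <!-- MFC/UFC replace FC -->
    <UFC max_credits="2M" min_threshold="0.4"/>
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60000"/>
    <!-- STREAMING_STATE_TRANSFER is now STATE (or STATE_SOCK) -->
    <pbcast.STATE/>
</config>

Whether FD/FD_SOCK or FD_ALL, and STATE versus STATE_SOCK, is the better
fit depends on your environment; the point of the sketch is only to show
where each of the suggestions above lands in the stack.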
Again, I suggest copying one of udp.xml, udp-largecluster.xml or tcp.xml
(if you must!), and using it with minor changes.

--
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat