Work at SourceForge, help us to make it a better place! We have an immediate need for a Support Technician in our San Francisco or Denver office.

Close

performance optimizations

Developers
2009-07-08
2013-05-14
  • So I'm an optimization nut, and I saw the eager claims of 60x performance in 1.6.  After reading through the source code, I didn't see a whole lot that was different, other than the java 1.5 dependency, and the AtomicXXX.

    So now that 1.5 is a dependency, I'd like to offer a dozen optimizations. But I'm not exactly sure the best way of going about this.  'feature request' or 'bug'.  Hense I'm starting here.

    I'll just list the general categories here - but will write/provide code as necessary.
    1) disk-store thread-hell (ehcache threading pisses me off). Executor pools or a single manager background thread should be more than sufficient for ehcache's needs.  I see 1.6 has thread-pools in other areas at least.

    2) disk-store 200ms wakeup. (when you have 100+ caches, there is horrid background activity).  The following basic algorithm should be less resource intensive for a thread-per-spooler/expiry:

    BlockingDequeue spool;
    run() {
    while (true) {
    Element e = spool.poll(expiryThreadInterval, TimeUnit.SECONDS);
    if (e == null) { expireElements(); continue; }
    tSpool = swapSpool();
    writeToDisk(e);
    writeToDisk(tSpoll);
    }
    }

    3) The disk-free allocator uses an ArrayList.remove(i) which does a System.copy to move everybody forward.  Use a LinkedList.

    4) The disk-free allocator is first-fit. This is the worst possible allocator, produces unbounded slack-space, and is O(n). Use a power-of-2 allocator, such that the nearest pwr2 is checked for free items. Being empty, larger blocks are checked and split until ultimately the file is grown (again to a power-of-2).  This is O(log(n)).

    5) disk-store Collections.synchronizedMap seems to be redundant with explicit calls to synchronized(mymap).  I haven't checked thoroughly to verify, but I think at least one of these wrappers can be removed.

    6) I've seen big issues with large keys running the JVM out of memory in disks-store, since the keys are cached in mem. There was a bug in hibernate where the query cache stored entity refs in the key, and thus stored the entire L1 cache in every key.  The bug was fixed, but it's not unreasonable to assume large keys.

    One possible solution is allow a user-enabled alternate key-mapper.which requires serializable keys (and are assumed to be large - say 128+ bytes). It would also require very long disk-expiry checking.  Basically keep the in-mem key-cache, BUT use Map<Integer,KeyBank>.  The key.hashCode() is used to map to a bank.  The bank contains actual Map<Object,DiskElement> records.  However, a Bank can optionally be passivated.  The entire map serialized to disk.. An LRU cache is used to keep say 16 banks of 64 entries in mem.  The rest stay on disk.

    7) As an extension to 6, for VERY large disk-store caches, the object references puts a LOT of pressure on the garbage collector (reference tree hell).. Yes minor collects can skip most of it. But in hibernate apps, there is already a lot of memory pressure, so full collects can happen often enough.  Basically you can switch out DiskElement into an index into int[] and boolean[] records.  It's VERY non-OO, but consider that 500,000 DiskElement's pointing to 500,000 Keys, all being pointed to by multiple 500,000 row arrays requires 1.5M ref traversals on every full GC.  And this is all just book-keeping for the supposedly spilled-to-disk manager.  The Bank'd key-sets from above becomes Map<Integer,Integer> (where key is hash-code, and value is index into a series of co-arrays).  Large contiguous arrays go into a special section of VM memory, so you've removed a non-trivial fraction of busy-work from the JVM.  Since we never throw away DiskElements, I think this is perfect.  Managing option 6 with this might be tricky, but I think it's still doable.

    8) In some file-systems, appending to a file is slower than pre-extending the file then updating blocks.  Thus I would recommend excessively growing files in multiples of say randomAccessFile.setLength(oldLength + Math.min(32k,align32k(obj.size())))

    As I said, I'm happy to code up the various changes, but wanted to get some initial feedback. I was a little dismayed by the new LRU algorithm (using 30 samples with map-walking).  I'm sure there's a way of creating a concurrent LRU mapper, taking the best of the old LinkedHashMap + LRU v.s. the new ConcurrentHashMap, but this might be littered with bugs, so I'm going to look into this more deeply only if these more obvious changes can successfully be integrated.

     
    • > I was a little dismayed by the new LRU algorithm (using 30 samples with map-walking). I'm
      > sure there's a way of creating a concurrent LRU mapper, taking the best of the old
      > LinkedHashMap + LRU v.s. the new ConcurrentHashMap, but this might be littered with bugs,

      I'll let Greg respond to your other ideas, but for this last one, I'd suggest looking at and perhaps contributing to the ConcurrentLinkedHashMap project on Google Code: http://code.google.com/p/concurrentlinkedhashmap/

       
    • I already noticed a few corrections needed from my post:

      2) current code uses a HashMap to continue operating while the background thread is sleeping/processing, thus a queue isn't appropriate for the elements.  I've found blocking-queues produce the simplest and most reliable code.  So one approach is to use a queue + synchronized HashMap, whereby you clear the queue instead of swapping, and swap the map as done currently. But unlike a nice blocking-queue which doesn't need external synchronizations, you'd still need to protect the atomic 2 part transaction.  Thus a more complex algorithm with a synchronize on a java 5 ReentrantLock with a notification and a timed wait are probably necessary.

      4) nitpic, the power-2 block allocator is O(k), not O(log(n)) where k is proportionate to  log2(max_size)

      Here is a composite idea addressing topic 1 and 6:
      9) The CacheManager starts a single background thread, to which all caches with DiskStores attach.  This uses a common disk-expiration-checker, but should be high performance

      class DiskStore:
      void put(Element e) {
      cacheManager.spoolLock.lock();
      try {
        this.spoolMap.put(e.key,e);
        if (!offered)  {
        cacheManager.spoolPutQueue.offer(this);
        offered = true;
        }
        cacheManager.spoolCondition.signal();
      } finally {cacheManager.spoolLock.unlock();}
      }

      void spoolThreadMain() {
        while (running) {
          Thread.sleep(200); // allow activity to build up
          cacheManager.spoolLock.lock();
          try {
              if (cacheManager.spoolPutQueue.isEmpty()
                  && cacheManager.spoolCondition.awaitUntil(nextDiskExpiry, TimeUnit.SECONDS)) {
                 cacheManager.spoolLock.unlock();
                 expireDiskElements();
                 nextDiskExpiry = new Date(System.currentTimeInMillis() + diskExpiryInterval);
              } else {
                 cacheManager.spoolLock.unlock();
                 while (!cacheManager.spoolPutQueue.isEmpty()) {
                   cacheManager.spoolLock.lock();
                   ldiskCache = cacheManager.spoolPutQueue.poll();
                   if (ldiskCache == null || !ldiskCache.isActive()) {
                       cacheManager.spoolLock.unlock();
                       continue;
                   }
                   lmap = ldiskCache.swapOutSpoolMap();
                   ldiskCache.offered = false;
                   cacheManager.spoolLock.unlock(); // don't block-the-world while we process
                   ldiskCache.writeElementsToDisk(lmap);
              }
           } catch (Exception e) { /* ignore */ }
          } finally { cacheManager.spoolLock.unlock(); }
        }
      }

      The basic idea being that each diskStore has a local Map<Key,Element> pair (which is swaped on disk-flushing), but there's an event queue which contains the diskcache that needs attention.  The event-queue + global lock lives in the cache manager.

      This algorithm requires a fixed disk-expiry interval. 

      9b) You could instead switch to a DelayQueue as with this truncated algorithm:

      class DiskStore:

      static class DelayQueueEvt implements Delayed {
      boolean put = true; DiskStore ds; long exp = System.currentTimeInMillis() + 200L;
      DelayQueueEvt(boolean put,DiskStore ds){this.put=put; this.ds = ds;}
      public long getDelay(TimeUnit unit) { return unit.excessNanos(System.currentTimeInMillis(),exp); }
      public int compareTo(DelayQueueEvt e) { return exp < e.exp ? -1 : exp == e.exp ? 0 : 1 }
      }

      void put(Element e) {
        synchronized (this.spoolMap) {
            if (!offered) {
            cacheManager.diskSpoolQueue.add(new DelayQueueEvt(true,this), 200ms);
            offered=true;
            }
            this.spoolMap.put(e.key,e);
        }
      }

      void init() {
        ...
        scheduleExpiry();
      }
      void scheduleExpiry() {
        cacheManager.diskSpoolQueue.add(new DelayQueueEvt(false,this), this.diskExpiryInterval);
      }

      void spoolThreadMain()
      {
        
         while (running) {
            DelayQueueEvt evt = cacheManager.diskSpoolQueue.take();
            if (!evt.put()) {
              evt.ds.expireEntries();
              evt.ds.scheduleExpiry();
            }
            else {evt.ds.flushToDisk(); }
         }
      }

      10) You can take 9b even further by making the 200ms delay configurable per cache (say to 1 hour).   This continues what is effectively happening now - an L1.5 cache sitting between L1 and L2.   To mitigate memory starvation with say 1M 'transient records', you can use a weak-reference with a finalizer that adds an immediate event to the delay-queue.  Thus a full GC triggers a semi-immediate flush.  Unfortunately, I don't know how reliable this technique is, I tend to avoid  weak-refs and finalizers with a passion (hell, even core sun-java fled them if you read the comment history).

       
      • Greg Luck
        Greg Luck
        2009-07-10

        Michael

        The performance claims are based on CacheTest.testConcurrentReadWriteRemoveLRU which you can readily re-run. Secondly they are based on the JBoss Cache performance suite, which is where the published numbers come from. These both attempt to show realistic workloads. So I think the performance numbers stand. And that they have been accomplished with a pretty minimal change set gives me comfort.

        The DiskStore has not had ANY performance work in 1.6. I saw this as too much change. However because cache.get/put are no longer synchronized, concurrent access to both the MemoryStore and the DiskStore are now possible. And concurrent reads to the MemoryStore will seldom be blocked.

        Version 1.7 will focus on the DiskStore. Your performance so your comments and energy are very welcome. Another idea is to try out JavaDB for DiskStore. Ehcache has a rich set of concurrent tests, so we should be able to evaluate your ideas. To get started can you pick your favourite idea and create a patch. I will then add it and evaluate it in detail. Sadly most patches are of poor quality and cannot be used as is requiring a lot of work on my part. If you can deliver a high quality first patch then we will see about your other ideas. Even though we have good test coverage, something that is often overlooked is to add further tests which probe the behaviour of the change. e.g. for ConcurrentHashMap I tested it with various sizes and came across the very slow iteration behavour versus HashMap. The LFU algorithm has used iteration for a couple of years now and worked well.

        The number of problems the DiskStore have goes on way beyond those you have identified. Using a real database which has had years of work put into it it worth a look. Incidentally that approach was tried right at the beginning with a java db I do not remember but it had lots of threading problems.  One problem you did not mention is that when we call back from the DiskStore, we put the element back in the MemoryStore which, if it is full, will cause the MemoryStore to overflow, so you get a constant churn. I would like to make the interactions on write and read a lot more configurable.

        The ConcurrentHashMap iterate is only done up to 5000 elements. After that we create an Array and sample it directly. Am I happy with this? No. I initially was using ConcurrentLinkedHashMap, from Ben Manes. In extensive testing I found it was not thread safe in all situations and turned my laptop into a fan heater. Thus what I have is the least worst option I could come up with. Ben is working on a new version which I will test in due course. Doug Lea is planning a CHM with eviction for Java 1.7. So we will see.

         
    • More DiskStore optimizations (item 6)

      It occurred to me that a disk-cache doesn't have to be as rigorous as an in-mem cache, and thus we can allow slightly random expirations at the expensive of memory space.  Namely work like a CPU cache which expires elements on collisions.

      Namely you have key.hashCode().. You also have whether the key can serialize or not.  If not, you have to keep the key in memory as currently happens.. Otherwise, store the int hashcode in a map, then null out the key.

      When you serialize, you serialize BOTH the key and the value.

      So now you have
      Map<Integer,DiskRecord>  hashKey2DiskRecord;

      so now your diskCache can safely prove that nothing matches on disk.. Or, on a match, you go load load the ObjectInputStream.. read the first object.. do loadedObj.equals(searchKey), and load the 'value' portion up if matched.

      There isn't much value if the serialized object is small.. Thus on passivation, you do the object-serialization:
      if (exception) keep key in mem
      if (buffSize < MIN_DISK_KEY_SIZE) keep key in mem

      Your diskelement then contains at least the following fields:

      class DiskElement {
      long diskOffset
      int diskSpace
      Object key;
      boolean key_passivated;
      ...

      Obviously it becomes VERY expensive to 'walk' the hash-linked-list, as you'd have to disk-fetch and deserialize every key on every search.  Thus you trigger an EVICT if putting over an existing item, or have subsequent overlapping hash-items keep their key in memory.  Since the probability of collisions is low, this should still mean most elements get to put their key on disk.

      Again this is targeting  large-key objects (such as hibernate CacheKey or QueryCacheKey)

      Now with persistent enabled, keys would already have to be valid to serialize, but due to legacy code, those that use diskStorage but not persistence would previously never have seen an inconsistent load error (if the key had transient fields for example that didn't properly re-initialize).. So I think this model needs to be off by default.  in other words, I think this CAN break existing code, so err on safety by default.

      You can add a persistentKeys=true option to the ehcache-failsafe.xml  Which is disabled by default (unless persistent=true).  Thus people have to enable this capability.

      In the code all it means is that we don't even try to passivate the keys, and thus the DiskStore.key_passivated is always false.  It also means that we are more eagerly evicting hash-collided disk elements, which personally I think is acceptable.  If there would have been thrashing, it would resolve itself in the in-mem cache.

       
    • Sub-subject:  LRU/LFU/FIFO 30-sample calculation performance issue.

      So I tested out the new 1.6 LRU with array-keys (cache-size > 5k) v.s. linear walks of the ConcurrentHashMap.  And it's performance collapsed as I speculated.

      ConcurrentHashMap is fast for gets/puts, but sucks a$$ for iterators.

      If I create ehcaches of max-size 100,   1k,   4k, 10k,      400k and 510k all with 500k sequential puts, here are the synthetic results:

      100:    8 sec
      1k:    17 sec
      4k:    62 sec
      10k:    8.8 sec
      400k:  6 sec
      510k:  5.6 sec
      HashMap: 0.94 sec
      ConcurrentHashMap: 1.02 sec

      So 510k EhCache should have been the same as CHM but it's 5x off.  Not the end of the world.
      BUT, the transition from below 5k dies exponentially.  Even the 100 record walk is slow compared to the 10k.  Plus what's the overhead of 5,000 key elements? nothing.

      I'm curious to see if there are any benefits whatsoever to the CHM iterator. Otherwise I recommend taking it out.

      This test isn't as synthetic as it might seem.  If you cache something with a near 0% hit rate (which happens a LOT in hibernate), then this test is applicable - though spread-out over time such that the JIT might not have optimized CHM or HM as much.

       
      • Greg Luck
        Greg Luck
        2009-07-20

        Michael

        You are absolutely correct about CHM iteration being too expensive even for small sizes. Here is my test and results:

            /**
             * Benchmark to test speed.
             *
             * With iteration up to 5000:
             * 100: Time for putSpeed: 2772
             * 1000: Time for putSpeed: 10943
             * 4000: Time for putSpeed: 42367
             * 10000: Time for putSpeed: 4179
             * 300000: Time for putSpeed: 6616
             *
             * With no iteration:
             * 100: Time for putSpeed: 2358
             * 1000: Time for putSpeed: 2692
             * 4000: Time for putSpeed: 2833
             * 10000: Time for putSpeed: 3630
             * 300000: Time for putSpeed: 6616
             */
            @Test
            public void testPutSpeed() throws Exception {
                cache = new Cache("testPutSpeed", 4000, false, true, 120, 120);
                manager.addCache(cache);
                store = cache.getMemoryStore();
                final Long key = 0L;
                byte[] value = new byte[1];
                StopWatch stopWatch = new StopWatch();

                // Add a bunch of entries
                for (int i = 0; i < 500000; i++) {
                    Element element = new Element(key + i, value);
                    store.put(element);
                }
                long time = stopWatch.getElapsedTime();
                LOG.log(Level.INFO, "Time for putSpeed: " + time);
                assertTrue("Too slow. Time was " + time, time < 4000);
            }

        I tried playing with the concurrency level setting in CHM but it did not make any difference. Based on my reproduction of you results I have ditched iteration completely. This will be in trunk and Maven snapshot as soon as I get a build and  1.6.1.

        Let me know if there is anything else that would be really good to get into 1.6.1 out of your list.

         
    • 11) Garbage collection
      It's possible to produce a thread-safe linked-list of of elments, a LinkedHashMap without all the complexity of the google LRU ConcurrentHashMap.  This isn't an LRU, it's just a LIFO-style LinkedHashMap.  Note, a FIFO could also be possible, but as you'll see, it's not necessary.

      AtomicReference<Element> head;
      while (true) {
      el.next = head.get();
      if (head.compareAndSet(el.next,el)) { break; }
      }
      Nice and efficient with a few spin-loops during heavy contention.  We already need to synchronize on the element for other purposes, so concurrent modification of the element shouldn't be a factor.

      This isn't useful by itself of course. The list is in reverse order.  Thus, every K puts, we move the head to a fresh row/bank/batch (whatever name best suits).  sqrt(maxSize), 32, 64 or 128 all could be good candidates for the row size....  Use an atomic int to identify when a rotatoin should occur. (I've issued a bug with the current keyArrayIndex iterator).

      while (true){
        int oldRowSize = rowSize.get();
        if (oldRowSize == maxRowSize) {
          if (!rowSize.compareAndSet(oldRowSize, 0) { continue; }
            synchronize (heads) {
              headTop++;
              if (headTop == heads.length) {
                 headTop = 0;
                 tryGc();
              }
               head = heads[headTop];
               return;
           }
          }
      } else if (rowSize.compareAndSet(oldRowSize, oldRowSize + 1){
          el.next = head.get();
           if (head.compareAndSet(el.next, el))
             return;
      }
      }

      The idea is that every time we do a full loop of all rows, we try to perform a garbage collect of the linked-list rows (defined below).  Otherwise, the guy that inserted the last element triggers a row transition.  Note this doesn't guarantee that each row is exactly maxRowSize, but we don't really care.  What we're trying to accomplish is check-points in time, such that a 'batch' of elements are grouped together moving forward in time.. This effectively allows us to have fully isolated bundles of elements that only one thread will EVER touch.  The only variable that is contended for by multiple threads is the head atomic-ref.

      Note evicts/removes/expires do NOT touch this list (as that would cause concurrent modification of a singlely linked list).  Instead there needs to be an el.deleted field which is set on expiration.  Only the 'GC-thread' can touch non-head rows (see below).  This somewhat solves the LinkedConcurrentHashMap concurrency issue.. And for the most part is instantaneous for puts (and is completely unaffected by gets, as only el.lastAccessed / el.numHits are updated).

      So now the gc function is where the devil lives.  It self-synchronizes, is triggered in one of many ways.  Either a max-cache-elements is getting near (say 75% full) or has been reached, or a 3x maxExpiration-time duration has traversed (since a first put).  The particular triggers are negotiable.  Either a background thread (I recommend a single executor-service pool (which facilitates scheduled jobs) of either 1 or 2 threads living in the CacheManager), OR the putting-thread with reason to trigger a GC grab the cache-specific lock and apply the following algorithm.

      Overall algorithm:  Walk 'old' rows searching for deleted,expired, idle.  If no obvoius choice can be made, pick a worst-candidate (an oldest, a least-frequently-used or a least-recently-used OF THE ROW).  walk rows until a desired number of candidates have been evicted.  For put-blocking threads, a single eviction is all that's required.  For background threads, a full row-length is desired (meaning the entire suite of memory MIGHT be walked in a single GC run).

      1) Find the row where the last GC run left off.  NEVER select the current head or the head's prior (as latent threads may still be putting).. The row-size should be large enough that it is unlikely that 3 levels deep of latent puts are ocuring (e.g. swapped-to-disk with lots of thread contention or whatever).  In any case, because the element should be locked, the worst case is that the GC through would stall when addressing the latent element.  But still, with hundreds of rows, it should be trivial to just say don't GC head-row - 2 or 3.

      2) Do an initial scan of the row, keep statistics for least-recently-used, least-frequently-used, oldest.  When recording, you need to keep a pointer to the parent, not the elment (because it's a singlelly linked list). Obviously expire any expired/idle/deleted records as we go.

      3) based on the policy (And I recommend that all 3 policy styles apply instead of just one.. We don't need a trivial A,B best-detection.. A cascade of the 3 policies seems to be most appropriate.  The stated policy merely implies the order of the application, but on ties, the alternate policies are also applied.  So LFU,LRU,FIFO  v.s. LRU,LFU,FIFO v.s. FIFO,LRU,LFU.

      4) For more aggressive expiration modes (say mem is full and we're the background thread), we identify multiple candidates.  In the simpliest case, we compute the median (accesstime,frequency,age) and blow away those below the average.

      4a) Alternatively, the gets can take on a slightly higher cost by doing two more steps.. In LFU mode, an 'el.adjustedHits' is incremented in addition to el.numHits.  In LRU mode, the put records the timestamp of the row transition. Then ALL records that haven't been accessed sooner than the row transition means that there have been zero hits since the put.  Other models might apply, but the idea is to identify records that haven't been hit recently (where recently would otherwise be hard to compute with multiple GC runs within the same second).  With the LFU adjustedHits, for example, every pass, we compute the average num-hits and subject that from each element (with a min of 1).  Thus a heavily hit items (say 200 hits) will live for at most 200 full GCs before having a zero hit-counter and thus be a candidate for 'aggressive' expiration.  But if many items in the row have many hits, the average hit-rate might instead be 25.. Thus only 4 full GCs would kill off the heavily hit item.  Again, by full GC, I mean we've walked the entire map via potentially multiple calls to tryGc().

      The advantage of row-transitions recording the time-stamp is that we don't have to do a pre-walk of a row to compute an average.. Thus we can decide to expire elements on the first pass.  For LFU, this unfortuntaely means mutating an adjustedHitCount on each element, and thus we're potentially triggering memory-dirties (which with multi-gig caches can mean disk-IO to prime the swap-space for potentially future swaps). ALTHOUGH, synchronized already mutates that memory spot.  This is really only needed if we have unlimited expiration and one item that was hit say 5,000 times and is never hit again.  Some sort of time-locality means that eventually the item would expire.

      5)  Similar to java's multi-stage GC, as you run GCs over and over, many mem-items are just never going to want to expire, and thus you're wasting CPU retesting them.  Thus the date-created column can be used to determine how many passes the element has had.. (you know the timestamp of each start-from-row-0 tryGc).  Thus if after 5, or even 1 GC-survival, the element can be moved to a 'tenured' list.  With survival-age == 1, this means every item is removed from the list on every pass.. They're either expired or moved to a tenured list.  This might be too aggressive though, will have to experiment.  The tenured list is only traversed one-row per full restart.  So with 100 tenure rows and a 900 main rows, by being tenured, you'll avoid 99 full GCs.  The max-expiration-time of the first/last item in the row is used to trigger GCs of that row to prevent wasted slots (a special variation of the GC).

      So the effect is that by removing tenureable elements early on, subsequent full passes (which are required to expire at LEAST 1 element from each row) will 'unfarily' expire new items.. But that's ok.. Most items should never be considered tenured unless they prove themself..

      The big advantage of this model over current and past ehcache eviction policies is that a background thread can pre-clean out expired elements instead of performing individual evictions (and forcing the put-thread to block until the eviction is cluster notified).  Evictions themselves can be enqueued and at the end of the put-tryGc, and the GC-thread can be notified that it has work to do (e.g. cleaning up the evicted els).

      The alternative is that the put-thread has to block until the GC thread wakes up and gets to you.  With CPU-load or simply competing caches with competing busy work, this might take seconds... Thus when the cache is truely full, the put-thread tries to perform the gc directly in aggresive but fail-fast mode, and again such that the eviction candidate is selected, but the eviction itself is still deferred to the GC thread (to minimize delay).

      The advantage is fast turn-around time.  Approximately as fast as the current LFU keyArray, except we're not using random samples, we're using grouped-in-time samples.  In normal mode, we like large rows (maybe 512 entries) which will give us better statistics than the current LFU/LRU 32 samples).  But in desperate mode, we can make decisions on smaller sample-sets.

      The other advantage is in bursty 'web traffic'.. When heavily loaded, the GC threads won't be able to keep up (all CPUs will already be busy).  But once there's a few moments of idleness, they'll have nicely scrubbed every single expired/idle element.  Thus the NEXT web hit will be fully responsive (with a fully scrubbed suite of caches - Assuming the caches were full to begin with).  The real question I need to test is if the overhead during the overloaded period is higher than the current model.  I'm speculating the answer is no - we'll see.

      So in summary, this isn't technically an LRU or FIFO, but neither is the current 1.6 implementation.  It has slightly worse statistical accuracy than the LFU sampler, but thanks to tenuring and potentially larger sample-sizes the accuracy is likely to be better.  This does'nt take up any more memory (just 4 bytes for a pointer in the element).  Doesn't incur any overhead on gets, and adds a SLIGHT spin-loop to the puts (though ANY AtomicXXX is a spin-loop, including numElements).

      I'm currently implementing some reference code and will try and utilize ehcache testing code to compare.

       
    • I've created an implementation of a 32 random sampling LRU,LFU,FIFO 'CompactConcurrentMap'  which is modeled on ConcurrentHashMap but adapted for non Object elements.

      Initial benchmarks put it at with 25% as fast as CHM.  Theoretically it can flat out replace half the ehcache operations, since expiration is embeded within the core code, though it has no disk-store capability.

      I still am writing unit-tests under a multitude of situations, so I don't have code to release yet. But I do have a blog summarizing the theory behind it:

      http://tech-maraist.blogspot.com/2009/08/alternate-mem-store-for-ehcache.html

      I'm actually trying to creating (or have historically created) a suite of data-structures optimized for various use-cases. So I'm looking to use this particular class for my own needs independent of ehcache.  However, it would be nice if this suites the needs of ehcache as well.  The main thing it provides is the ability to have multi-million element caches with minimial overhead.

      The code itself is stable for basic map operations, it's the expirators that I'm still tweaking. The main thing that's useful of this v.s. treating expiration separately is the fact that you don't need the external key-array - you perform direct entry-array sampling.

       
    • T C
      T C
      2009-08-28

      RE: ehcache threading...please consider using a small thread pool or a single dedicated thread for dispatching the asynchronous distributed remote peer notifications from a single queue rather than using a worker thread per cache.  The current implementation is not scalable at all.  If you have hundreds of caches you are forced to use synchronous notifications.

       
      • Greg Luck
        Greg Luck
        2009-08-28

        Can you please raise a feature request for this. I think we should do this.