Best Practices Help

  • Brian Frutchey
    2009-12-03

    I am hoping someone can help me with advice on large-scale usage of jdbm.  I have an application which stores smallish objects (let's say 20 string tokens) into a main BTree, with 4 other secondary BTrees used for index lookups.  There are 20M objects to store, and once stored, the application is primarily read-only.

    My questions include:
    1) How do I optimize the application for disk/RAM/performance?  I am currently using compression in the RecordManager, still using transactions (I found no performance difference when I turned them off), and have my own minimalist serializers.  Things work well below 1M objects, but at 2.5M objects insertions (each insertion goes to both the data BTree and the lookup BTrees) have dropped from 2K/sec to only 160/sec.  My previous run also hit an OutOfMemory error, but I have since pulled the latest trunk and made some mods to the TransactionManager based on forum comments (use buffered streams, make sure all streams get closed), and the latest run is still going (2.67M insertions so far).

    2) When does data get written to disk?  It seems that only RecordManager.commit() makes this happen - are there temp files being written somewhere?  If I call commit() after each insert instead of just prior to closing the RecordManager, disk usage goes up by several orders of magnitude (a 7MB db file suddenly becomes a 2GB db file - but contains the same data!).  I read in the forum, though, that behind the scenes some kind of auto-commit or auto-write happens every X operations…  I haven't traced any code to understand this - can someone explain?

    3) What is the best way to empty a BTree, or remove lots of values?  I am currently deleting the db files and re-creating the RecordManager, because using BTree.delete (and then recreating the BTree) throws all kinds of exceptions on future inserts.

    Thanks!

     
  • Brian Frutchey
    2009-12-03

    Now I am troubleshooting an OutOfMemory exception thrown at the 2.9M object insertion mark, when the insertion rate is down to only 33 obj/sec.  Help?  I assumed jdbm managed memory for me and backed it to disk when needed…

     
  • Brian Frutchey
    2009-12-03

    Figured seeing my code might help:

        private void init() throws IOException {
            Properties props = new Properties();
            props.setProperty(RecordManagerOptions.COMPRESSOR, RecordManagerOptions.COMPRESSOR_BEST_COMPRESSION);
            //props.setProperty(RecordManagerOptions.DISABLE_TRANSACTIONS, "TRUE");
            props.setProperty(RecordManagerOptions.CACHE_TYPE, RecordManagerOptions.NORMAL_CACHE);
            recman = RecordManagerFactory.createRecordManager(dbDirectory + File.separator + FILENAME, props);
            specTree = load("SpecTree", StringComparator.INSTANCE, StringSerializer.INSTANCE, new DimensionSerializer());
            idTree = load("IDTree", LongComparator.INSTANCE, LongSerializer.INSTANCE, StringSerializer.INSTANCE);
            nameExactTree = load("NameExactTree", StringComparator.INSTANCE, StringSerializer.INSTANCE, StringSerializer.INSTANCE);
            try {
                genTree = load("GenTree", IntegerComparator.INSTANCE, IntegerSerializer.INSTANCE,
                        new StringCollectionSerializer<StringSet>(StringSet.class.getName()));
                typeTree = load("RangeTree", StringComparator.INSTANCE, StringSerializer.INSTANCE,
                        new StringCollectionSerializer<StringSet>(StringSet.class.getName()));
            } catch (Exception e) {
                if (e instanceof IOException)
                    throw (IOException) e;
                IOException ioe = new IOException(e.getMessage());
                ioe.setStackTrace(e.getStackTrace());
                throw ioe;
            }
        }

        @SuppressWarnings("unchecked")
        private BTree load(String spec, Comparator comparator, Serializer keySerializer,
                Serializer valueSerializer) throws IOException {
            // try to reload an existing B+Tree
            long recid = recman.getNamedObject( spec );
            if ( recid != 0 ) {
                log.info("Loading existing BTree for " + spec);
                return BTree.load( recman, recid );
            } else {
                log.info("Creating new BTree for " + spec);
                BTree newTree = BTree.createInstance(recman, comparator, keySerializer, valueSerializer);
                recman.setNamedObject( spec, newTree.getRecid() );
                return newTree;
            }
        }

    Inserts are done using the BTree.insert() method. 

    Are there properties that I need to use?  How often should I call commit()?  Do I need to manage page size?  Should each BTree be in its own RecordManager?

    Thanks!

     
  • Kevin Day
    2009-12-03

    All data is held in memory until you hit commit().  Even after you call commit(), if you have transactions enabled, 10 commit()s worth of transactions will be held in memory before they are flushed to the database file.

    I suspect that the reason things are slowing down is that you are running out of heap space, which is forcing the GC to work overtime.  You can connect VisualVM (or another profiler) and see this.

    For high performance, you don't want to commit too frequently.  If you have transactions disabled, then a commit will be automatically fired every 10K page updates (this is roughly 80 MB of data in memory).  Otherwise, you can call commit() yourself when you want to (or change the jdbm.disableTransactions.autoCommitInterval property when you get the record manager).
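
    For example, a minimal sketch (the interval property is referenced here by its literal name as mentioned above; check RecordManagerOptions in your build for the matching constant):

        Properties props = new Properties();
        props.setProperty(RecordManagerOptions.DISABLE_TRANSACTIONS, "TRUE");
        // change the auto-commit interval used when transactions are disabled
        props.setProperty("jdbm.disableTransactions.autoCommitInterval", "10000");
        RecordManager recman = RecordManagerFactory.createRecordManager(dbDirectory + File.separator + FILENAME, props);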

    You can empty a BTree by deleting it from the record manager.  Removing lots of values is a bit harder (there isn't an efficient deletion algorithm that is able to do full BPage removes) - it may be faster to build a new tree, then delete the old one.  Or you can just remove keys, the trees will balance themselves.
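
    For example, a minimal sketch of that rebuild-and-swap approach (rebuildTree is just a made-up helper name; pass the same comparator/serializers you used when the tree was first created - and note this only deletes the old root record, it does not walk and reclaim every old BPage):

        private BTree rebuildTree(RecordManager recman, String name, Comparator comparator,
                Serializer keySer, Serializer valSer) throws IOException {
            long oldRecid = recman.getNamedObject(name);
            BTree fresh = BTree.createInstance(recman, comparator, keySer, valSer);
            recman.setNamedObject(name, fresh.getRecid());    // swap the name over to the empty tree
            if (oldRecid != 0) {
                recman.delete(oldRecid);                      // old root record goes back on the free list
            }
            recman.commit();
            return fresh;
        }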

    Note that deleting data from jdbm does *not* free disk space.  It puts the pages back into an available list.  Future inserts will pull from the available list before the file is grown.

    Which compressor are you using?  If your keys are Strings, you should definitely be calling the following on your BTrees:  .setKeyCompressionProvider(jdbm.helper.compression.LeadingValueCompressionProvider.STRING)
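
    For example (just a sketch - this assumes a tree with String keys, created the same way as in your load() method):

        BTree tree = BTree.createInstance(recman, StringComparator.INSTANCE,
                StringSerializer.INSTANCE, StringSerializer.INSTANCE);
        // enable leading-value (prefix) compression of the String keys on each BPage
        tree.setKeyCompressionProvider(jdbm.helper.compression.LeadingValueCompressionProvider.STRING);
        recman.setNamedObject("NameExactTree", tree.getRecid());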

    If your keys aren't simple strings, and you are providing your own serializer, make sure that the byte encoding you choose has common leading bytes (for example, in standard string encoding, the length of the string is encoded in the first 2 bytes - if your strings aren't all the same length, the leading value compressor won't do much for you)

    Anyway, I think the bulk of the issues you are experiencing are caused by running with transactions turned on, and not calling commit() on a periodic basis.

    For the use-case that you describe (primarily read), the transactions may not be doing anything for you (they protect against corruption of the database file in the event that a write fails)

     
  • Brian Frutchey
    2009-12-04

    Thanks for the help, but I seem to be making things worse.  I have disabled transactions (set to FALSE), and have tried adding commits after every 50K and 100K inserts (this is actually a per-BTree count), and while performance doesn't seem to degrade, I now run out of memory earlier than before - at 1.7M inserts instead of 2.9M.  Oddly, I am printing out available memory after each commit, and the free/used memory is LESS at this point than at some earlier checkpoints.  The OutOfMemory exception seems to be thrown when I call commit(), though…

    I have tried the WEAK and SOFT caches, but those seem to hit performance hard (at .5M inserts the rate has gone from the initial 2K/sec to 100/sec), so I am still using NORMAL cache.  Also, is CACHE_SIZE specified in bytes?

    When I don't call commit() myself and rely on the auto-commit for disabled transactions, I never see the data written to disk, even after 2M inserts.  I tried both not specifying AUTOCOMMIT_INTERVAL and setting it to 50K (prior to not specifying it - the default is 10K, right?).  I was expecting commits quite frequently, but never see one ("seeing" one means watching the db file size grow) - am I doing something wrong?

    I am now using COMPRESSOR_BEST_SPEED and setting the keyCompressionProvider.

    I am now going to run with smaller commit batches, but what else might I be doing wrong?  Thanks!

     
  • Brian Frutchey
    2009-12-04

    I tried commits every 10K inserts and ran out of memory at 1.77M inserts.  I then tried commits every 1K inserts and ran out of memory at 2.48M inserts - speed was unacceptable as well, at about 48 inserts/sec.  Again, I am not sure why I ran out of memory, as the memory-usage metrics I was printing out showed more free memory at the time (and the same total memory size) than at previous readings.  I may rerun the tests using VisualVM to get more details, but each run takes some time…  Another interesting point I would love some insight into is that the smaller the commit interval, the larger the database file (from 10K to 1K the file size grew by 6x).  However, after the file quickly grew up to about 2M inserts it then stopped growing for a bit - this leads me to believe that BPages that aren't yet full must be getting written to disk, with space allocated for future BPage entries.  Is that correct?

    Are there any objects that jdbm always keeps resident in memory that would be constantly growing?

    I am going to try playing with cache size next, with larger commit intervals (I need the speed of larger intervals and the smaller database files).  Next I will try working with BPage size.

    I am still very surprised that running with no commits allows me to process 2.9M object inserts while running with commits runs out of memory sooner - isn't the disk backing supposed to help with memory, not hurt?

     
  • Brian Frutchey
    2009-12-04

    I wish we could edit our posts… my post (#5) has an error - I have disabled transactions by using:

        props.setProperty(RecordManagerOptions.DISABLE_TRANSACTIONS, "TRUE");
       
        props.setProperty(RecordManagerOptions.DISABLE_TRANSACTIONS_PERFORMSYNCONCLOSE, "TRUE");

     
  • Brian Frutchey
    2009-12-04

    I just did some runs using VisualVM to track the JVM and found some interesting results.  Does anyone know how to post pictures here?  I will send an email to the mailing list with screen shots attached.

    Basically, though, I see the used heap constantly growing with a constant slope (people say "linearly", but I think that is mathematically overly broad), with the exception of a large jump every 4 minutes or so where it doubles in size.  If I GC after the large doubling in size, though, the used heap drops back to where it would have been if growth had remained constant.

    Heap dumps show that the bulk of the used heap is ALL IN ONE CHARACTER ARRAY.  Yes, a single char[] holds something like 90% of all the used memory.  So I am assuming that this array gets copied from time to time, and that is the source of the large jumps in used heap which disappear after GC.  This could explain my OOM error, if the doubling occurs as my application is nearing the limit of the JVM size.  Still, the fact that this array grows constantly also has me worried, as it suggests I will at some point run out of RAM just because the array is too large.

    I also am guessing that the array copy must be related to commits, which is why I run out of memory sooner when I commit periodically.  The OOM I get when not using commits must be due to the constantly growing char[].

    Thoughts?  See the email I am going to send for some screenshots of visual vm…

     
  • Brian Frutchey
    2009-12-04

    Great news for JDBM!  In further reviewing the VisualVM heap dump instance information, it would appear that the char[] which is chewing up memory is from the SAXParser that is pulling all my data out of a large XML file.  I will report back when I have that fixed.

    Thanks!

     
  • Brian Frutchey
    2009-12-05

    Turns out part of the problem was the Xerces SAX parser (the default JAXP parser), which has a bug if you set the http://apache.org/xml/features/nonvalidating/load-external-dtd feature to false.  I have now worked around that issue (by setting the SystemId on the InputSource to allow local DTDs to be found) and tested with the Crimson, Oracle, and Xerces parsers to see if there are any further items in the heap attached to the parser, of which I have found none.

    However, I am still running out of memory, now after 3.8M inserts (250K inserts between commits, default cache size, page size of 32).  Heap dumps now show most memory being held by byte arrays referenced by ObjectOutputStreams, or sometimes by nothing at all.  Since JDBM is the only thing doing any serialization, I am hoping someone can help me track down why these byte arrays are so large.

    Other things seem to be growing in memory too, as the used heap grows rather steadily as the application runs (just not as quickly as it used to).  The class types taking up the most memory are char[] (the largest from BufferedWriters - one seems to be a logger), TreeMap$Entry (referenced by CacheRecordManager), and String (most of the ones I looked at also referenced by CacheRecordManager).

    Please help!  I will send another email to mailing list with more screen shots of my Visual VM heap dumps…

     
  • Brian Frutchey
    2009-12-05

    Seems the large byte arrays are a big piece of the problem because ObjectOutputStream does a copy during its write method.  Here is my  OOM stack trace:

        Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
                at java.util.Arrays.copyOf(Arrays.java:2786)
                at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
                at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1814)
                at java.io.ObjectOutputStream.write(ObjectOutputStream.java:662)
                at jdbm.btree.BPage.writeByteArray(BPage.java:973)
                at jdbm.btree.BPage.serialize(BPage.java:1231)
                at jdbm.recman.BaseRecordManager.update(BaseRecordManager.java:497)
                at jdbm.recman.CacheRecordManager.updateCacheEntries(CacheRecordManager.java:465)
                at jdbm.recman.CacheRecordManager.commit(CacheRecordManager.java:392)

    I am going to research two directions - first, making the byte array smaller; second, extending ObjectOutputStream to do a series of subarray copies, sized based on available memory, instead of one large copy.

     
  • Kevin Day
    2009-12-07

    If I may make a suggestion: try putting together a test harness that does nothing except create your trees (i.e. get rid of the SAXParser and everything else except simple inserts into the tree).  Right now there is a lot of thrashing going on, probably caused by other issues.

    For example, just because the OOM exception happened to be thrown during the writing of a BPage, it is crazy unlikely that the writing of the BPage is what is causing your memory to get consumed.  BPage objects shouldn't be all that big (if they are, you should adjust the number of elements in your pages when you set the tree up).
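
    For example (a sketch only - the last argument of this createInstance overload is the page size):

        // smaller pages keep each serialized BPage small; the page size must be even
        BTree tree = BTree.createInstance(recman, StringComparator.INSTANCE,
                StringSerializer.INSTANCE, StringSerializer.INSTANCE, 16);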

    If you can come up with a test harness that emulates the amount and size of data that you are pushing into jdbm, then we can take a look and see what might be going on.  I know this isn't always easy, but some quick analysis of your inputs should give you approximate ranges of average key/value size, etc…

    I'm not going to say that there absolutely isn't a memory leak in jdbm, but we have a decent amount of experience with long running applications based on jdbm, and we don't run into this sort of thing.

     
  • Brian Frutchey
    2009-12-08

    I have done quite a bit of metrics tracking.  My average key size is 20 bytes, and the average data value size is 40 bytes (of course there are still the 4 other index BTrees).  The reason I suspect the BPage commit is that I have the BPage's writeByteArray method telling me the size of the byte array being written and giving me a stack trace every time it is larger than 50% of the free memory.  It turns out that the byte array being written gets very large - in my case it is 140MB at 2M inserts, and 500MB at 11M inserts.  And the only calling method seems to be the commit.  I have learned that the ObjectOutputStream literally copies the ENTIRE array during the write.

    While there could very well be other factors leading to the OOM (500MB of byte array is significant but only 10% of the 5GB max JVM memory size), solving the problem of this large byte array seems the best target, as no other objects in the heap are close to it in size, and the application seems to be able to dereference and GC well enough when not doing a commit (judging by free-memory charting).

    I have tried testing this theory by modifying the BPage's writeByteArray method as follows:

        static void writeByteArray( ObjectOutput out, byte[] buf )
            throws IOException
        {
            if ( buf == null ) {
                out.writeInt( -1 );
            } else {
                out.writeInt(buf.length);
                memorySensitiveWriteByteArray(out, buf, 0);
            }
        }

        // This is a naive way to specify how much memory growth the ObjectOutput.write will require.
        private static final double safetyGrowthFactor = .1;

        private static void memorySensitiveWriteByteArray(ObjectOutput out, byte[] buf, int offset) throws IOException {
            int chunksize = buf.length - offset;
            if (chunksize > Runtime.getRuntime().freeMemory() * safetyGrowthFactor) {
                Logger.getLogger("BPage").info("Serializing bytearray larger than freeMemory*"
                        + safetyGrowthFactor + ": " + chunksize + ". Checking if GC frees enough memory to fit.");
                /*
                Exception e = new RuntimeException();
                StringBuffer stack = new StringBuffer();
                for (StackTraceElement ste : e.getStackTrace()) {
                    stack.append(ste.getClassName() + ":" + ste.getMethodName() + ":" + ste.getLineNumber() + "\n");
                }
                Logger.getLogger("BPage").info(stack.toString());
                */

                // Try GC first
                System.gc();
            }
            if (chunksize > Runtime.getRuntime().freeMemory() * safetyGrowthFactor) {
                int freemem = (int) (Runtime.getRuntime().freeMemory() * safetyGrowthFactor);
                if (chunksize > freemem)
                    chunksize = freemem;
                Logger.getLogger("BPage").info("Serializing bytearray still larger than freeMemory*"
                        + safetyGrowthFactor + " after gc, only writing " + chunksize);
            }
            if (chunksize == buf.length)
                out.write(buf);
            else {
                byte[] chunk = new byte[chunksize];
                System.arraycopy(buf, offset, chunk, 0, chunksize);
                out.write(chunk); // ObjectOutput.write(byte[], int, int) still calls Arrays.copyOf on the entire buf; we only want a portion copied
            }
            offset += chunksize;
            if (offset < buf.length) {
                Logger.getLogger("BPage").info("Still need to serialize "
                        + (buf.length - offset) + " bytes");
                memorySensitiveWriteByteArray(out, buf, offset);
            }
        }

    Truly, I would love to understand what is in this byte array being written, and why it seems to get larger as more data is inserted.  There is a lot of code to trace, so I would appreciate any tips/explanations.

    I have run tests with page sizes ranging from 8 to 200.  16 seems to be optimal for insertion speed over long periods of time, but as the application grows toward the maximum, the machine seems to thrash, and often even after the application is shut down the machine won't recover until a reboot (it is Windows, though…).  With 200, the initial hundred thousand inserts are incredibly fast, but then performance drops off quickly (I assume pages are taking longer to balance?).

    I have tried cache sizes from 0 to 10000.  Up to 1000 this improves performance, but seems to slow down performance above that point.

    I have tried commits from every insert to every 1M inserts.  Between 200K and 500K seems to balance performance with small disk usage.  If I don't call commit myself and rely on the auto-commits from disabled transactions, writeByteArray seems to be called quite frequently, but the data files on disk never change size… why is that?

    I can easily provide a test harness and my code if someone wants to troubleshoot (just tell me where to send it).  Sorry I am so haphazard in my tests/questions; I was trying to make a fast delivery and really didn't have time to trace code until I understood it sufficiently.  I have it working well enough that if I set Xmx high enough I can insert the 12M objects I initially need, it only takes 5 hours to build the BTrees, and the resulting disk file is the same size as the flat XML representation of the data.  Better than Berkeley DB JE from Oracle!  But I still think we can do better.

     
  • Kevin Day
    2009-12-08

    If you have a single BPage writing 140MB worth of data to a single byte array, then something is horribly, horribly wrong.  The whole point of a B+Tree is that each BPage should fit approximately in a single disk page (usually 4 or 8KB, depending on file system).  The issue here is not that OOS is copying the byte array, it's that the byte array is ginormous when it shouldn't be.

    No amount of setting changes/etc… are going to address this - there's something fundamentally wrong with how the data is getting inserted into the tree.

    As a next step, I would recommend that you log the size of the byte arrays being generated by *your* serializers (for both key and value).  The BPage serialization is pretty simple: basically, it serializes all of your keys, then all of your values, plus a little bit of metadata.  Any significant byte array size is going to come from serialization of your keys or values.
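
    A quick way to do that is to wrap your existing serializers in a logging decorator - something along these lines (the class name and threshold are made up; drop it in wherever you currently pass a Serializer to BTree.createInstance):

        // Hypothetical decorator that reports the size of every record it serializes.
        public class SizeLoggingSerializer<T> implements Serializer<T> {
            private static final long serialVersionUID = 1L;
            private final Serializer<T> delegate;
            private final String label;

            public SizeLoggingSerializer(Serializer<T> delegate, String label) {
                this.delegate = delegate;
                this.label = label;
            }

            public byte[] serialize(T obj) throws IOException {
                byte[] bytes = delegate.serialize(obj);
                if (bytes.length > 1024) {   // arbitrary "this looks big" threshold
                    System.out.println(label + ": serialized " + bytes.length + " bytes");
                }
                return bytes;
            }

            public T deserialize(byte[] data) throws IOException {
                return delegate.deserialize(data);
            }
        }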

    If ALL of your serializers are generating small byte arrays, then go ahead and put together a *simple* test harness showing the issue and I'll take a look at what you are doing.  Just some code that inserts the same type of keys and values (with the same serializer).  Use a random number generator to create content, etc…  The key here is to keep it simple and remove all dependencies.  If I can load the code and run it, I can look at it quickly.  If I have to configure an Ant build and download 30 dependencies, it'll never happen.

    BTW - In looking at your code, memorySensitiveWriteByteArray is called recursively.  Probably not your immediate problem, but that can't be good.

     
  • Brian Frutchey
    2009-12-08

    I have indeed checked the size of my own serializers, hence the 20-byte key size and 40-byte payload metrics.  How can I get you the test code?  And thanks for reminding me about the recursion!  While I don't ever expect it to recurse more than two or three times, I need to set the byte array copies to null after use or the GC won't get rid of them!  That would explain why it doesn't seem to help…

     
  • Kevin Day
    2009-12-08

    If all you are talking about is inserting a ton of 20-byte keys and 40-byte values into a tree, that should be a pretty concise chunk of code.  Just post it here.

     
  • Brian Frutchey
    2009-12-11

    So my test code still shows the memory growth, but it doesn't grow as quickly as in my real code.  The main difference is that in the real code the payloads and keys are variable sizes (some keys as small as 10 bytes and as large as 28 bytes, some payloads as small as 35 bytes and as large as 60 bytes), and there are a few more overwrites and lookups.

    Here it is:

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.File;
        import java.io.IOException;
        import java.io.ObjectInputStream;
        import java.io.ObjectOutputStream;
        import java.io.Serializable;
        import java.util.Comparator;
        import java.util.Properties;
       
        import jdbm.RecordManager;
        import jdbm.RecordManagerFactory;
        import jdbm.RecordManagerOptions;
        import jdbm.btree.BTree;
        import jdbm.helper.IntegerComparator;
        import jdbm.helper.IntegerSerializer;
        import jdbm.helper.LongComparator;
        import jdbm.helper.LongSerializer;
        import jdbm.helper.Serializer;
        import jdbm.helper.StringCollectionSerializer;
        import jdbm.helper.StringComparator;
        import jdbm.helper.StringSerializer;
       
        public class StoreTest {
            private RecordManager recman;
            // BTree of actual dimensions keyed by DimSpec
            private BTree<String, TestPayload> specTree;
            // BTree of DimSpecs keyed by ID
            private BTree<Long, String> idTree;
            // BTree of DimSpec sets keyed by generation
            private BTree<Integer, StringSet> genTree;
            // BTree of DimSpecs keyed by <DimensionName>_<Exact classifying Synonym value>
            private BTree<String, String> nameExactTree;
            // BTree of Dimension DimSpec sets keyed by <DimensionName>_E (for Exact) or _R (for Ranged).
            // Exact DimSpec sets are empty, as the key is only needed for listing all DimensionNames in store.
            // Be careful of managing this BTree, as addDimensionName needs to add name to BTree but won't be able
            // to assign correct extension until a child dimension is added (or name already exists in BTree).
            private BTree<String, StringSet> typeTree;
            private String dbDirectory;
            private long numAdds = 0;
            private static final String FILENAME = "DimensionStore";
            // Number of inserts between commits
            public int COMMIT_INTERVAL = 500000;
            // Number of items on each BTree page, must be even (JDBM default is 16)
            public int PAGESIZE = 16;
            public int CACHESIZE = 1000;

            public static void main(String[] args) {
                if (args.length != 5) {
                    System.err.println("Usage: " + StoreTest.class.getName()
                            + " <dbDirectory> <numInserts> <pageSize> <commit interval> <cacheSize>");
                    return;
                }
                try {
                    System.out.println("Size of test key in bytes: "
                            + StringSerializer.INSTANCE.serialize(new TestPayload().spec).length);
                    System.out.println("Size of value in bytes: "
                            + new TestPayloadSerializer().serialize(new TestPayload()).length);
                    StoreTest testStore = new StoreTest(args[0]);
                    testStore.PAGESIZE = Integer.parseInt(args[2]);
                    testStore.COMMIT_INTERVAL = Integer.parseInt(args[3]);
                    testStore.CACHESIZE = Integer.parseInt(args[4]);
                    testStore.init();
                    int inserts = Integer.parseInt(args[1]);
                    for (int i = 0; i < inserts; i++) {
                        testStore.addNew();
                    }
                    testStore.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }

            public StoreTest(String dbDirectory) throws IOException {
                this.dbDirectory = dbDirectory;
            }

            @SuppressWarnings("unchecked")
            public void init() throws IOException {
                Properties props = new Properties();
                props.setProperty(RecordManagerOptions.COMPRESSOR, RecordManagerOptions.COMPRESSOR_BEST_COMPRESSION);
                props.setProperty(RecordManagerOptions.DISABLE_TRANSACTIONS, "TRUE");
                props.setProperty(RecordManagerOptions.DISABLE_TRANSACTIONS_PERFORMSYNCONCLOSE, "TRUE");
                props.setProperty(RecordManagerOptions.CACHE_TYPE, RecordManagerOptions.NORMAL_CACHE);
                props.setProperty(RecordManagerOptions.CACHE_SIZE, Integer.toString(CACHESIZE));
                props.setProperty(RecordManagerOptions.LAZY_INSERT, "TRUE");
                recman = RecordManagerFactory.createRecordManager(dbDirectory + File.separator + FILENAME, props);
                specTree = load("SpecTree", StringComparator.INSTANCE, StringSerializer.INSTANCE, new TestPayloadSerializer());
                idTree = load("IDTree", LongComparator.INSTANCE, LongSerializer.INSTANCE, StringSerializer.INSTANCE);
                nameExactTree = load("NameExactTree", StringComparator.INSTANCE, StringSerializer.INSTANCE, StringSerializer.INSTANCE);
                try {
                    genTree = load("GenTree", IntegerComparator.INSTANCE, IntegerSerializer.INSTANCE,
                            new StringCollectionSerializer<StringSet>(StringSet.class.getName()));
                    typeTree = load("RangeTree", StringComparator.INSTANCE, StringSerializer.INSTANCE,
                            new StringCollectionSerializer<StringSet>(StringSet.class.getName()));
                } catch (Exception e) {
                    if (e instanceof IOException)
                        throw (IOException) e;
                    IOException ioe = new IOException(e.getMessage());
                    ioe.setStackTrace(e.getStackTrace());
                    throw ioe;
                }

                System.out.println("BTrees initialized");
            }

            @SuppressWarnings("unchecked")
            private BTree load(String spec, Comparator comparator, Serializer keySerializer,
                    Serializer valueSerializer) throws IOException {
                // try to reload an existing B+Tree
                long recid = recman.getNamedObject( spec );
                if ( recid != 0 ) {
                    System.out.println("Loading existing BTree for " + spec);
                    return BTree.load( recman, recid );
                } else {
                    System.out.println("Creating new BTree for " + spec);
                    BTree newTree = BTree.createInstance(recman, comparator, keySerializer, valueSerializer, PAGESIZE);
                    if (keySerializer instanceof StringSerializer)
                        newTree.setKeyCompressionProvider(jdbm.helper.compression.LeadingValueCompressionProvider.STRING);

                    recman.setNamedObject( spec, newTree.getRecid() );
                    return newTree;
                }
            }

            public void addNew() throws IOException {
                countAdd();
                TestPayload tp = new TestPayload();
                specTree.insert(tp.spec, tp, true);
                idTree.insert(tp.id, tp.spec, true);
                if (tp.isClassify)
                    nameExactTree.insert(tp.spec.substring(0, 8) + "_" + tp.synonym, tp.spec, true);
                StringSet gens = genTree.find(0);
                if (gens == null)
                    gens = new StringSet();
                gens.add(tp.spec);
                genTree.insert(0, gens, true);
                StringSet types = typeTree.find(tp.type[0] + "_E");
                if (types == null)
                    types = typeTree.find(tp.type[0] + "_R");
                if (types == null)
                    types = new StringSet();
                types.add(tp.spec);
                if (tp.type[0] == 'A') {
                    typeTree.insert("A_E", types, true);
                } else {
                    typeTree.insert(tp.type[0] + "_R", types, true);
                }
            }

            public void close() throws IOException {
                System.out.println("Committing all transactions and closing RecordManager.");
                recman.commit();
                //specTree.dump(System.out);
                recman.close();
            }

            private void countAdd() throws IOException {
                numAdds++;
                if (numAdds % 10000 == 0)
                    System.out.println(System.currentTimeMillis() + ": " + numAdds + " inserts processed");
                if (numAdds % COMMIT_INTERVAL == 0) {
                    System.out.println("Calling commit");
                    recman.commit();
                }
            }

            public static class TestPayload implements Serializable {
                private static final long serialVersionUID = -2196499496091523455L;

                char[] type = { (char) (4 * Math.random() + 65) };
                String spec = buildRandString(15);
                long id = (long) (Math.random() * Long.MAX_VALUE);
                String synonym = buildRandString(8);
                boolean isClassify = Math.random() < .5;
                boolean isSearch = Math.random() < .5;
                String boundVal = buildRandString(4);
                boolean oneBound = Math.random() < .5;

                public String buildRandString(int size) {
                    StringBuilder sb = new StringBuilder();
                    for (int i = 0; i < size; i++)
                        sb.append((char) (57 * Math.random() + 65));
                    return sb.toString();
                }
            }

            public static class TestPayloadSerializer implements Serializer<TestPayload> {
                private static final long serialVersionUID = -153477714824811778L;

                public TestPayload deserialize(byte[] data) throws IOException {
                    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
                    TestPayload tp = new TestPayload();
                    tp.type = in.readUTF().toCharArray();
                    tp.spec = in.readUTF();
                    tp.id = in.readLong();
                    tp.synonym = in.readUTF();
                    int attrs = in.read();
                    if (attrs == 2 || attrs == 4)
                        tp.isClassify = true;
                    else
                        tp.isClassify = false;
                    if (attrs == 2 || attrs == 3)
                        tp.isSearch = true;
                    else
                        tp.isSearch = false;
                    tp.boundVal = in.readUTF();
                    if (in.read() == 0)
                        tp.oneBound = true;
                    else
                        tp.oneBound = false;
                    return tp;
                }

                public byte[] serialize(TestPayload obj) throws IOException {
                    ByteArrayOutputStream bout = new ByteArrayOutputStream();
                    ObjectOutputStream out = new ObjectOutputStream(bout);
                    out.writeUTF(new String(obj.type));
                    out.writeUTF(obj.spec);
                    out.writeLong(obj.id);
                    out.writeUTF(obj.synonym);
                    int attrs = obj.isClassify ? 0 : 1;
                    attrs += obj.isSearch ? 2 : 4;
                    out.write(attrs);
                    out.writeUTF(obj.boundVal);
                    out.write(obj.oneBound ? 0 : 1);
                    out.close();
                    return bout.toByteArray();
                }
            }
        }

    I did create a StringSet class to serialize a collection of Strings, which you will also need:

        import java.io.Externalizable;
        import java.io.IOException;
        import java.io.ObjectInput;
        import java.io.ObjectOutput;
        import java.util.Collection;
        import java.util.Iterator;
        import java.util.Set;
        import java.util.TreeSet;

        public class StringSet implements Set<String>, Externalizable {
            private transient TreeSet<String> _set = new TreeSet<String>();
            private static final long serialVersionUID = 2582265528817983138L;

            @Override
            public StringSet clone() {
                StringSet clone = new StringSet();
                clone.addAll(_set);
                return clone;
            }

            public void writeExternal(ObjectOutput out) throws IOException {
                for (String s : _set)
                    out.writeUTF(s);
            }

            public void readExternal(ObjectInput in) throws IOException {
                String s;
                while ((s = in.readUTF()) != null)
                    _set.add(s);
            }

            public boolean add(String e) {
                return _set.add(e);
            }

            public boolean addAll(Collection<? extends String> c) {
                return _set.addAll(c);
            }

            public void clear() {
                _set.clear();
            }

            public boolean contains(Object o) {
                return _set.contains(o);
            }

            public boolean containsAll(Collection<?> c) {
                return _set.containsAll(c);
            }

            public boolean isEmpty() {
                return _set.isEmpty();
            }

            public Iterator<String> iterator() {
                return _set.iterator();
            }

            public boolean remove(Object o) {
                return _set.remove(o);
            }

            public boolean removeAll(Collection<?> c) {
                return _set.removeAll(c);
            }

            public boolean retainAll(Collection<?> c) {
                return _set.retainAll(c);
            }

            public int size() {
                return _set.size();
            }

            public String[] toArray() {
                return _set.toArray(new String[0]);
            }

            public <T> T[] toArray(T[] a) {
                return _set.toArray(a);
            }

        }

    And the serializer for String collections (I added it to the jdbm.helper package):

        package jdbm.helper;

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;
        import java.lang.reflect.Constructor;
        import java.lang.reflect.InvocationTargetException;
        import java.util.Collection;

        /**
         * Efficiently serializes and deserializes basic String collections.
         *
         * The String collection must have a zero-argument constructor, and must be
         * completely restorable from the members of the collection, as no other
         * properties are serialized.
         * @param <C>
         */
        public class StringCollectionSerializer<C extends Collection<String>> implements Serializer<Collection<String>> {

            private static final long serialVersionUID = -5054668226991268996L;
            private String collectionImplName;

            StringCollectionSerializer() {
                // protected zero-arg constructor for serialization
            }

            public StringCollectionSerializer(String collectionImplName)
                    throws ClassNotFoundException, SecurityException, NoSuchMethodException {
                if (getZeroArgConstructor(collectionImplName) == null)
                    throw new NullPointerException("No accessible zero-arg constructor found declared in " + collectionImplName);
                this.collectionImplName = collectionImplName;
            }

            @SuppressWarnings("unchecked")
            private Constructor<C> getZeroArgConstructor(String className)
                    throws ClassNotFoundException, SecurityException, NoSuchMethodException {
                Class cls = Class.forName(className);
                Class[] params = new Class[0];
                Constructor<C> ctor = cls.getConstructor(params);
                return ctor;
            }

            public C deserialize(byte[] data) throws IOException {
                C c;
                try {
                    c = getZeroArgConstructor(collectionImplName).newInstance();
                } catch (InvocationTargetException ite) {
                    Throwable e = ite.getTargetException();
                    e.printStackTrace();
                    IOException ioe = new IOException(e.getMessage());
                    ioe.setStackTrace(e.getStackTrace());
                    throw ioe;
                } catch (Exception e) {
                    IOException ioe = new IOException(e.getMessage());
                    ioe.setStackTrace(e.getStackTrace());
                    throw ioe;
                }

                int off = 0;
                while (off < data.length) {
                    int len = Conversion.unpack2(data, off);
                    off += 2; // increment by the short just read
                    c.add(Conversion.convertToString(data, off, len));
                    off += len; // increment by the string just read
                }

                return c;
            }

            public byte[] serialize(Collection<String> c) throws IOException {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                DataOutputStream dos = new DataOutputStream(baos);
                for (String s : c)
                    dos.writeUTF(s);
                dos.close();
                return baos.toByteArray();
            }

            public Object clone() throws CloneNotSupportedException {
                throw new CloneNotSupportedException();
            }

        }

     
  • Kevin Day
    2009-12-14

    I've loaded your code and taken a quick look.  First off, you have a lot of moving parts here.  The above doesn't convince me (yet) that jdbm is leaking anything - it's quite likely that something in your code is leaking - we'll see as I dig a bit deeper.

    One thing that immediately comes to light is that you aren't creating a new data file for every test.  This means that all of your configuration settings that control BPage size do not take effect after the first run - so any parametric tests you've done involving page size were probably inconclusive.  Minor, but it's something you should be aware of.
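
    A quick way to guarantee a fresh store per run is to delete the data files before creating the record manager - something like this (assuming the default .db/.lg file extensions; adjust if your build names them differently):

        // delete any existing store so the BTree.createInstance() settings (page size, etc.) take effect
        File dir = new File(dbDirectory);
        for (String ext : new String[] { ".db", ".lg" }) {
            File f = new File(dir, FILENAME + ext);
            if (f.exists() && !f.delete()) {
                throw new IOException("Could not delete old store file: " + f);
            }
        }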

    I'm a bit puzzled by the presence of write/readExternal in class StringSet - unless that's used for something outside the scope of the code you've provided, I'm pretty sure it isn't doing anything for you.

    After running for 10 minutes (and seeing the heap continue to grow), I took a heap dump and see that there is one mongo byte[] - it's 524K in size, and associated with the 'BestCompressionRecordCompressor' class.  This is an area of jdbm that hasn't gotten a lot of use (I believe there is actually only one person who has ever used it), but the only thing that's a little bit questionable is the re-use of a compressor object.  And it's quite possible that a well-established, highly optimized compressor could be holding a 500KB array.  So I doubt that this is really the issue, but we can certainly turn it off and see (FYI - unless disk space is a big concern, I recommend turning off record compression - it will slow things down quite a bit).

    So I removed all of the record manager settings except for disabling transactions, and re-profiled.

    At this point, things are fairly clean, inserts are done in pretty much constant time, and I see a flat heap after a period of time (it goes flat after about 7 minutes, and I'm running at 26 minutes now with it still nice and flat).

    For reference, the only properties I'm setting in my test are: 

        props.setProperty(RecordManagerOptions.DISABLE_TRANSACTIONS, "TRUE");
        props.setProperty(RecordManagerOptions.DISABLE_TRANSACTIONS_PERFORMSYNCONCLOSE, "TRUE");

    Just for grins, I'm going to re-run the test overnight with the record compressor enabled.  If growth goes off the charts, then it's fair to say that the record compressor is a problem, otherwise I'd say the windmill we are fighting is dead ;-)

     
  • Kevin Day
    2009-12-14

    ok - overnight, I got the OOM error as well.  I would say that the record level compressor is the problem.  I suggest running without record compression (and honestly, record level compression almost never makes sense.  There's rarely enough bytes to allow the compressor to really be effective).

    I'll leave it to Bryan (who wrote this area of code) to determine if this is something that should be fixed, or eliminated from the code base.

     
  • Bryan Thompson
    2009-12-14

    Kevin,

    This could be a Java bug; see the links below.  It appears that the code needs to be invoking end() to ensure prompt release of memory.  However, end() only applies when you are done with a Deflater instance, and reuse of instances is encouraged - the code is in fact written to use only a single Deflater instance for the life cycle of the recman.  Is it possible that multiple instances of the record compressor, and hence multiple Deflater instances, are being created?  If so, then a native memory leak is likely.

    Bryan

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6293787

    http://www-01.ibm.com/support/docview.wss?uid=swg21227106

     
  • Kevin Day
    2009-12-14

    From code above, it looks unlikely that more than one compressor is being created (there's only one recman, and the compressor is being constructed during recman generation).

    Thanks for the reference to the Mustang bugs on Deflater - that's probably what's occurring here.  Given the poor performance of zip compression for small records, it seems like the most effective solution may be to not offer record-level compression.  Thoughts?

     
  • Bryan Thompson
    2009-12-14

    Kevin,

    Makes sense to me.  You could just deprecate the option with a reference to the memory management problems involved.  That way anyone running with the option would still be Ok.

    Bryan

     
  • Kevin Day
    2009-12-14

    OK - RecordManagerOptions has been updated.  frutchey, if you continue to have memory issues after turning off record level compression, let us know.

     
  • Bryan Thompson
    2009-12-15

    Kevin,

    It occurs to me that if this is running on Windows, then the problem may be the Windows file system cache filling up.  This will happen for an IO-intensive application.  I've inlined some performance tricks for Windows and Linux (which has a problematic kernel variable named 'swappiness').

    Bryan

    === Linux Performance Tips ===

    By default, Linux will start swapping out processes proactively once about 1/2 of the RAM has been allocated. This is done to ensure that RAM remains available for processes which might run, but it basically denies you 1/2 of your RAM for the processes that you want to run. Changing vm.swappiness from its default (60) to ZERO (0) fixes this.  This is NOT the same as disabling swap.  The host will just wait until the free memory is exhausted before it begins to swap.

    To change the swappiness kernel parameter, do the following as root on each host:
    <pre>
    sysctl -w vm.swappiness=0
    </pre>

    === Windows Performance Tips ===

    # Set to optimize scheduling for programs (rather than background processes) and memory for programs (rather than the system cache).
    # On win64 platforms, the kernel can wind up consuming all available memory for IO heavy applications, resulting in heavy swapping and poor performance.  The links below can help you to understand this issue and include a utility to specify the size of the file system cache.

          -
         
          -

     
  • Kevin Day
    2009-12-15

    I do see the Java heap growing without bound, but the OS memory also gets clobbered.  The issue is almost certainly related to holding the Deflater object for too long.

    But generic record level compression is really not desirable except for special circumstances (and in those cases, a custom serializer can do the trick).  Rather than going in and trying to change the mechanism for caching Deflater, I think deprecating the feature is the appropriate thing to do.

     