Hello,

I have written a utility to dump some data from a jdbm store.  Right now I am focused on
identifying in-use records with wasted space.  My approach is to scan the translation page list
and, for each translation page, to compute the offset of each possible logical row id within that
page.  I then convert the logical row id to a physical row id using the data in the translation page
and fetch the first block containing the record.  Once I have the record I compare the size and
the capacity and note the wasted space (if any).  I also collect a histogram of the in-use record
sizes and another of the free record capacities.

I am working on a histogram of the #of records of a given "type", and another of the sizes of
records of a given type, but that will require a heuristic technique to characterize record types,
e.g., by the first N bytes, since different records can use different serializers.

The basic approach appears to work fine.  Once the SourceForge CVS starts behaving again I will
commit what I've got as jdbm.recman.DumpUtility.  The output for our application and the code
are below.

-bryan

*** Record Waste Summary
Scanned 71 pages on the 'translation' page list.
Records: used=57215, free=0, total=57215
Capacity: used=132475097, free=0, total=132475097
Total waste of 1617993 bytes across 3420 records in which waste occurs.
% wasted capacity: 1.22
Average waste per record with waste: 473 bytes.
Average size of record with waste: 2464 bytes.
% waste per record with waste: 19.19981972501657
Average in use record size: 2287
In use size histogram: #samples=57215

binSize count   percent
128     1       0%
256     1       0%
512     5       0%
1024    4       0%
1536    924     1%
2048    8680    15%
2176    22700   39%
2304    5420    9%
2432    506     0%
2560    399     0%
2688    907     1%
2816    10217   17%
2944    879     1%
4096    6572    11%
Free capacity histogram: #samples=0
binSize count   percent

/**
 * JDBM LICENSE v1.00
 *
 * Redistribution and use of this software and associated documentation
 * ("Software"), with or without modification, are permitted provided
 * that the following conditions are met:
 *
 * 1. Redistributions of source code must retain copyright
 *    statements and notices.  Redistributions must also contain a
 *    copy of this document.
 *
 * 2. Redistributions in binary form must reproduce the
 *    above copyright notice, this list of conditions and the
 *    following disclaimer in the documentation and/or other
 *    materials provided with the distribution.
 *
 * 3. The name "JDBM" must not be used to endorse or promote
 *    products derived from this Software without prior written
 *    permission of Cees de Groot.  For written permission,
 *    please contact cg@cdegroot.com.
 *
 * 4. Products derived from this Software may not be called "JDBM"
 *    nor may "JDBM" appear in their names without prior written
 *    permission of Cees de Groot.
 *
 * 5. Due credit should be given to the JDBM Project
 *    (http://jdbm.sourceforge.net/).
 *
 * THIS SOFTWARE IS PROVIDED BY THE JDBM PROJECT AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT
 * NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
 * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL
 * CEES DE GROOT OR ANY CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
 * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
 * OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * Copyright 2000 (C) Cees de Groot. All Rights Reserved.
 * Contributions are Copyright (C) 2000 by their associated contributors.
 *
 * $Id: DataPage.java,v 1.1 2000/05/06 00:00:31 boisvert Exp $  */

package jdbm.recman;

import java.io.File;
import java.io.IOException;
import java.io.PrintStream;

//import java.util.Properties;

//import jdbm.RecordManager;
//import jdbm.RecordManagerFactory;

/**
 * A utility for dumping some aspects of the file structure of jdbm.
 *
 * @author thompsonbry
 */

public class DumpUtility
{
   
    /**
     * Underlying record file.
     */
    private RecordFile _file;

    /**
     * Physical row identifier manager.
     */
    private PhysicalRowIdManager _physMgr;

    /**
     * Logical to Physical row identifier manager.
     */
    private LogicalRowIdManager _logMgr;

    /**
     * Page manager.
     */
    private PageManager _pageman;

    /**
     * Opens the named jdbm file.  Since the fields for the {@link
     * RecordFile}, {@link PageManager}, etc. on the {@link BaseRecordManager}
     * are all private, I have opted to just duplicate the logic required to
     * open the store.<p>
     *
     * Note: While the {@link DumpUtility} does not modify the state of the
     * store, there is no way to declare to the {@link RecordFile} that the
     * store is being opened in a read-only state.<p>
     */
   
    public DumpUtility( String filename )
        throws IOException
    {
       
        if( filename == null ) {
           
            throw new IllegalArgumentException();
           
        }

        if( ! new File( filename+".db" ).exists() ) {
           
            throw new IllegalArgumentException
                ( "Could not find file: "+filename
                  );
               
        }
           
        _file = new RecordFile( filename );
        _pageman = new PageManager( _file );
        _physMgr = new PhysicalRowIdManager( _file, _pageman );
        _logMgr = new LogicalRowIdManager( _file, _pageman );
       
        _file.disableTransactions();
       
    }

//    /**
//     * Return a nice name for the indicated page list.
//     *
//     * @param type The type of page list from {@link Magic}.
//     */
//
//    private String getListName( short type )
//    {
//       
//        switch( type ) {
//   
//        case Magic.FREE_PAGE: return "free";
//        case Magic.USED_PAGE: return "used";
//        case Magic.TRANSLATION_PAGE: return "translation";
//        case Magic.FREELOGIDS_PAGE: return "free logical ids";
//        case Magic.FREEPHYSIDS_PAGE: return "free physical ids";
//        default:
//            return "List type: "+type+" is unknown.";
//        }
//       
//    }

   
    /**
     * Powers of 2 that are used to collect records into bins by their
     * size or capacity.<p>
     *
     * Note: jdbm records may be up to 2^31 bytes long since a signed
     * integer is used to record the length of the record.<p>
     *
     * @todo This actually contains some entries that are not powers of
     * 2 for a finer grained analysis of the jdbm records.  Make the choice
     * of the bins or the granularity something that can be done in main()
     * and via the constructor.<p>
     */
   
    static final int powersOf2[] = {
        2, 4, 8, 16, 32, 64, 128, 256, 512,
        1024, 1024+512,
        2048, 2048+128*1, 2048+128*2, 2048+128*3, 2048+128*4, 2048+128*5,
        2048+128*6, 2048+128*7,
        4096,
        8192,
        1<<14, // 16384 -- Note: "2^14" is XOR in Java, not exponentiation.
        1<<15, // 32768
        1<<16, // 65536
        Integer.MAX_VALUE // 2^31 - 1, the maximum record size.
    };

    /**
     * Returns an index into an array of the same dimension as {@link
     * #powersOf2} that is the first entry greater than or equal to
     * <i>n</i>.
     *
     * @param n The size or capacity of some record.  In jdbm the maximum
     * size or capacity of a record is 2^31.
     *
     * @return An index into an array of bins collecting the #of records
     * of some type.
     */
   
    private int getBin( int n )
    {

        for( int i=0; i<powersOf2.length; i++ ) {
           
            if( powersOf2[ i ] >= n ) return i;
           
        }
       
        throw new AssertionError();
       
    }
   
    /**
     * Writes out the non-zero entries from a histogram collected using {@link
     * #powersOf2}.
     *
     * @param ps
     * @param label
     * @param bins
     */
    protected void writeHistogram( PrintStream ps, String label, long[] bins )
    {
       
        long sum = 0L;
       
        for( int i=0; i<bins.length; i++ ) {
           
            sum += bins[ i ];
           
        }
       
        ps.println( label+": #samples="+sum );
       
        ps.println( "binSize\tcount\tpercent" );
       
        for( int i=0; i<bins.length; i++ ) {
           
            if( bins[ i ] == 0 ) continue;
           
            ps.println( powersOf2[ i ]+"\t"+bins[ i ]+"\t"+((int)(((double)bins[i]/(double)sum)*100d*100d))/100+"%" );
           
        }
       
    }
   
    /**
     * Summarizes the wasted space in the in use records within the
     * store.  Each jdbm record has both a "size" and a "capacity".
     * This method scans all in use records and notes when the "capacity"
     * is greater than the "size", which constitutes at least potential
     * waste of space in the store.<p>
     *
     * The approach is to scan the list of translation pages.  Each
     * translation page maps a logical rowid onto a physical rowid.  We
     * use this mapping to get the header of the record from the store and
     * then examine the size vs capacity fields in the record header.<p>
     *
     * @param ps Where to write the data.
     */
   
    public void writeRecordWasteSummary(PrintStream ps)
        throws IOException
    {
       
        checkIfClosed();
       
        ps.println( "*** Record Waste Summary" );
       
        // #of translation pages scanned.
        long npages = 0L;
       
        // #of records scanned.
        long nrecs = 0L;

        // #of "in use" records (non-zero size).
        long nuserecs = 0L;
       
        // #of "free" records (zero size, but non-zero capacity).
        long nfreerecs = 0L;
       
        // total of "size" across all in use records.
        long usesize = 0L;
       
        // #of in use records in which there is some waste.
        long nwasterecs = 0L;
       
        // #of bytes of waste across all in use records.
        long totalwaste = 0L;
       
        // #of bytes capacity across all in use records.
        long usecapacity = 0L;

        // #of bytes capacity of currently deleted records.
        long freecapacity = 0L;
       
        // #of bytes capacity of all records with any wasted capacity.
        long wastecapacity = 0L;
       
        // The total #of "in use" records aggregated into bins by record
        // size.
        long usesizebin[] = new long[ powersOf2.length ];

        // The total #of "free" records aggregated into bins by record
        // capacity.
        long freecapacitybin[] = new long[ powersOf2.length ];
       
//      Magic.USED_PAGE;
//      Magic.FREE_PAGE
//      Magic.USED_PAGE
        short type = Magic.TRANSLATION_PAGE;
//      Magic.FREELOGIDS_PAGE
//      Magic.FREEPHYSIDS_PAGE

        // Setup a cursor over the translate page list.
       
        PageCursor tcurs = new PageCursor
            ( _pageman,
              type
              );
       
        while( tcurs.next() != 0 ) {

            long transBlockId = tcurs.getCurrent();
           
//            ps.println( "Scanning translation page="+transBlockId );
           
            try {
               
            BlockIo transBlock = _file.get( transBlockId );

            TranslationPage xlatPage = TranslationPage.getTranslationPageView
                ( transBlock
                  );

            for( int i=0; i<TranslationPage.ELEMS_PER_PAGE; i++ ) {

                // Compute the offset to the ith entry on this translation page.  Normally the
                // blockId for the translate page is taken from the block component of the Location
                // and the offset into the translation page is taken from the offset component of
                // the Location.  Since we are just scanning the translation page list we get the
                // block from the page list and have to compute the offsets ourselves.
               
                short offset = (short) ( TranslationPage.O_TRANS + ( i * PhysicalRowId.SIZE ) );

                // Note: Slots in the translation page are NOT cleared when a record is deleted, which
                // means that this will also visit "deleted" records -- we identify a "deleted"
                // record by a record size (vs capacity) of zero.  However the slot is zero if
                // that logical record has never been allocated, in which case we skip it.
               
                Location physid = new Location( xlatPage.get( offset ) );

                if( physid.getBlock() == 0 ) {
                   
                    // This record was never allocated.
                   
                    // ps.println( "Skipping: "+physid );
                   
                    continue;
                   
                }
               
//                ps.println( "   Examining: "+physid );
               
                //
                // Fetch the record header, which is on the first page for the record.  The
                // page is identified by the block component of [physid]. The offset to the
                // record on that page is given by the offset component of [physid].
                //
               
                try {

                    BlockIo block2 = _file.get( physid.getBlock() );
               
                    RecordHeader head = new RecordHeader( block2, physid.getOffset() );
                   
//                    ps.println( head.toString() );
                   
                    int size = head.getCurrentSize();
                   
                    int capacity = head.getAvailableSize();

                    if( size <= 0 ) {
                       
                        // If the size is zero, then the record was "deleted".

                        freecapacity += capacity;
                       
                        nfreerecs++;

//                        ps.println( head+" : free." );

                        freecapacitybin[ getBin( capacity ) ]++;
                       
                    } else {

                        nuserecs++;
                       
                        usesize += size;
                       
                        usecapacity += capacity;
                       
                        usesizebin[ getBin( size ) ]++;

                        int waste = capacity - size;
                       
                        if( waste > 0 ) {

//                            ps.println( head+" : waste="+waste+" bytes." );

                            totalwaste += waste;

                            wastecapacity += capacity;
                           
                            nwasterecs++;
                           
                        }
                       
                    }
                   
                    nrecs++; // #of records scanned (free + in use).
                   
                }
               
                finally {
                    // release the record.
                    _file.release( physid.getBlock(), false );
                   
                }
               
                if( true ) {
                   
                    //
                    // @todo Fetch the record and attempt to categorize it as a member of
                    // some class (record type) and build up a distribution of the
                    // size of records of that class.
                    //
                   
                }

            }
           
            npages++; // #of translation pages scanned.

            }
           
            finally {
               
                // Release the translation page since we have finished scanning the
                // records linked from that page.
               
                _file.release( tcurs.getCurrent(), false );
               
            }
           
        }

        // #of bytes capacity across all records (in use and free).
        long totalcapacity = usecapacity + freecapacity;
       
        ps.println( "Scanned "+npages+" pages on the 'translation' page list." );
       
        ps.println( "Records: used="+nuserecs+", free="+nfreerecs+", total="+nrecs );
       
        ps.println( "Capacity: used="+usecapacity+", free="+freecapacity+", total="+totalcapacity );

        ps.println( "Total waste of "+totalwaste+" bytes across "+nwasterecs+" records in which waste occurs." );
       
        ps.println( "% wasted capacity: "+ ((int)(((double)totalwaste/(double)totalcapacity)*100d*100d))/100. );
       
        ps.println( "Average waste per record with waste: "+(totalwaste/nwasterecs)+" bytes." );

        ps.println( "Average size of record with waste: "+(wastecapacity/nwasterecs)+" bytes." );
       
        ps.println( "% waste per record with waste: "+((double)totalwaste/(double)wastecapacity)*100);
       
        ps.println( "Average in use record size: "+(usesize/nuserecs) );

        writeHistogram( ps, "In use size histogram", usesizebin );

        writeHistogram( ps, "Free capacity histogram", freecapacitybin );
       
    }

    /**
     *  Closes the record manager.
     *
     *  @throws IOException when one of the underlying I/O operations fails.
     */
    public synchronized void close()
        throws IOException
    {
        checkIfClosed();

        _pageman.close();
        _pageman = null;

        _file.close();
        _file = null;
    }

    /**
     * Check if the file has been closed.  If so, throw an
     * IllegalStateException.
     */

    private void checkIfClosed()
        throws IllegalStateException
    {

        if ( _file == null ) {
           
            throw new IllegalStateException
                ( "The file has been closed"
                  );
           
        }
       
    }

    /**
     * Writes a usage message on stderr and exits with a non-zero
     * status code.
     *
     * @param args The command line arguments.
     */
   
    public static final void usage( String[] args )
    {
   
        System.err.println
           ( "DumpUtility [options] filename"
             );
       
        System.exit( 1 );
       
    }
   
    /**
     * Command line utility for dumping the state of the jdbm store.
     *
     * @param args <pre>[options] <i>filename</i>
     *
     * options - none are defined yet.
     *
     * filename - the base filename of a pre-existing jdbm store.
     *
     * </pre>
     */
   
    public static void main( String[] args )
        throws IOException
    {

        String filename = null;
       
        for( int i=0; i<args.length; i++ ) {
       
            String arg = args[ i ];
           
            if( arg.startsWith( "-" ) ) {
               
                // No options are defined yet; ignore.
            } else {
               
                filename = arg;
               
            }
           
        }
       
        if( filename == null ) {
           
            usage( args );
           
        }

//        // Open the file.
//       
//        RecordManager tmp = RecordManagerFactory.createRecordManager
//            ( filename,
//              properties
//              );
//       
//        // Get the base record manager for the file.
//       
//        BaseRecordManager recman;
//       
//        while( true ) {
//           
//            if( tmp.getRecordManager() == tmp ) {
//               
//                recman = (BaseRecordManager) tmp;
//            
//                break;
//               
//            }
//           
//            tmp = tmp.getRecordManager();
//           
//        }

        // Instantiate the dump utility class.
       
        DumpUtility du = new DumpUtility
            ( filename
              );
       
        du.writeRecordWasteSummary( System.out );
       
        du.close();
       
        System.exit( 0 );
               
    }
   
}