I am adding a new database implementation. This is the 2nd simplest type of database. The simplest is a text file, which I already have, but that is not practical because you have to rewrite the whole file every time you make a change. This approach is to have a folder with a separate text file for every record. You rewrite small files when you change records. Since I let the filesystem handle the random access, I call it fsdb, filesystem database.
The only thing tricky so far is that I divide the record files into 3 subfolders for INDI, FAM, and other. That way I can easily create the list of all INDI and FAM xrefs just by listing the folder contents. It is worth treating these 2 record types as special because they are the normal starting points.
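To make the layout concrete, here is a minimal sketch of the file-per-record idea. It is not the actual fsdb code: the class name FsDbSketch, the method names, and the lowercase folder names are assumptions for illustration, and xrefs are used as file names without the surrounding @ signs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a file-per-record store; not the real gdbi/db/fsdb code.
class FsDbSketch {
    private final Path root;

    FsDbSketch(Path root) throws IOException {
        this.root = root;
        // One subfolder each for individuals, families, and everything else.
        for (String dir : new String[] {"indi", "fam", "other"}) {
            Files.createDirectories(root.resolve(dir));
        }
    }

    // Pick the subfolder from the GEDCOM record tag.
    private Path dirFor(String tag) {
        if (tag.equals("INDI")) return root.resolve("indi");
        if (tag.equals("FAM")) return root.resolve("fam");
        return root.resolve("other");
    }

    // Changing a record rewrites only the one small file that holds it.
    void write(String tag, String xref, String gedcomText) throws IOException {
        Files.write(dirFor(tag).resolve(xref), gedcomText.getBytes(StandardCharsets.UTF_8));
    }

    String read(String tag, String xref) throws IOException {
        return new String(Files.readAllBytes(dirFor(tag).resolve(xref)), StandardCharsets.UTF_8);
    }

    // The list of xrefs for a record type is just the folder listing.
    List<String> listXrefs(String tag) throws IOException {
        List<String> xrefs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dirFor(tag))) {
            for (Path p : stream) {
                xrefs.add(p.getFileName().toString());
            }
        }
        Collections.sort(xrefs); // directory order is not guaranteed
        return xrefs;
    }
}
```

The filesystem does the random access here: a lookup is one path resolution plus one small read, and no index file has to be maintained.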
The open question is whether this approach will scale. That depends on how well the OS handles folders with thousands of files. If it is really a problem, I could have it automatically create subfolders when it gets to a certain size. Another problem would be searches. That could be handled with a separate thread that caches the main info in the background. It would only be slow if the user searched immediately after starting the program.
There are 2 reasons for adding this database. One is that there is no native GDBI database, and without one GDBI can never be a stand-alone program. This may make jLifelines (JLL) obsolete, since its biggest value was having a local database. Some people may actually use the GUI it adds, so I won't remove JLL (yet).
The other reason for a local database is to create a cache for PGV. We want to be able to download the entire database, modify it locally, and upload the changes later. I am not sure how to handle all that, but getting a local database was a prerequisite.
Writing this database went pretty quickly because it is a combination of the PGV and text file databases. I copied those files as a starting point, and added the local file access. Now I have to figure out if I can refactor the duplicate code. I checked it in as a separate database, in gdbi/db/fsdb, but since it is just a cache of other databases, I should move it under util. The text file database is in gdbi/util/parse. (The GEDCOM parse code evolved into a database.)
I should also note that I looked around for real databases first, but they all add so much overhead. I can't force the user to set up a SQL database. I could not find anything like sqlite that does not require a database. I looked at the Berkeley database, because all I need is record look-up, but adding their code would double the size of GDBI. So I decided to just use the filesystem.
You will definitely start to run into trouble on unix file systems, where I think 69K is the limit on the number of files you can have in a folder. I'm not sure if there is a limit on NTFS, and FAT will depend on whether it is FAT16 or FAT32.
You might want to have a look at Derby http://db.apache.org/derby/
Derby is an embeddable Java SQL database that claims to be lightweight with a small footprint. ;)
Derby was formerly IBM's Cloudscape database, which they have made Open Source and donated to Apache.
Probably what you should do is make a generic JDBC database that you can then specify a configuration property for. The configuration property would be the class name of the JDBC connector for the database you want to connect to. The JDBC API is the same for all SQL databases so long as you stick to standard SQL. You could then include Derby by default, but if somebody wanted to use MySQL or another more robust system they could, simply by changing the JDBC connector configuration parameter. (You probably still have some of the old SQL code from when I first connected the GDBI to PGV using JDBC that you could use to get started).
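Roughly what I have in mind, as a sketch only: the property names gdbi.jdbc.driver and gdbi.jdbc.url and the class name JdbcDbConfig are made up for illustration, and the Derby defaults assume the embedded driver is on the classpath. Only Class.forName and DriverManager.getConnection are standard JDBC here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

// Hypothetical sketch: a generic JDBC back end selected by configuration.
class JdbcDbConfig {
    // Assumed property names; GDBI does not necessarily use these.
    static final String DRIVER_KEY = "gdbi.jdbc.driver";
    static final String URL_KEY = "gdbi.jdbc.url";

    final String driverClass;
    final String url;

    JdbcDbConfig(Properties props) {
        // Fall back to embedded Derby when nothing is configured.
        driverClass = props.getProperty(DRIVER_KEY, "org.apache.derby.jdbc.EmbeddedDriver");
        url = props.getProperty(URL_KEY, "jdbc:derby:gdbi;create=true");
    }

    // Loads the configured driver class, then opens a connection through
    // the standard JDBC API. Fails if the driver jar is not on the classpath.
    Connection open() throws ClassNotFoundException, SQLException {
        Class.forName(driverClass);
        return DriverManager.getConnection(url);
    }
}
```

Somebody who wanted MySQL would then only set the two properties, for example gdbi.jdbc.driver=com.mysql.jdbc.Driver and gdbi.jdbc.url=jdbc:mysql://localhost/gdbi, and put that connector jar on the classpath. The rest of the code would not change as long as it sticks to standard SQL.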
I already ran into a size limitation and modified the design. I was not able to grep the records when it got up around 4K entries; the error was "Argument list too long". So I added a subdir level under each record directory, e.g. the first INDI is indi/001/I000001. I create a new subdir after every 1000 records, and there can be 1000 subdirs, giving me 1M max records.
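The path calculation can be sketched as follows. This is an illustration, not the checked-in code: the class and method names are hypothetical, and I am assuming the subdirs are numbered from 001 (to match the indi/001/I000001 example) and that records get sequential six-digit numbers.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the two-level record layout.
class RecordPathSketch {
    static final int RECORDS_PER_SUBDIR = 1000;

    // typeDir is e.g. "indi", prefix is e.g. 'I', seq is the 1-based record number.
    static Path pathFor(Path typeDir, char prefix, int seq) {
        if (seq < 1 || seq > 999_999) {
            throw new IllegalArgumentException("record number out of range: " + seq);
        }
        // Assumption: records 1..1000 go in 001, 1001..2000 in 002, and so on.
        int bucket = (seq - 1) / RECORDS_PER_SUBDIR + 1;
        // %03d pads to at least three digits, so bucket 1000 would print as "1000".
        String subdir = String.format("%03d", bucket);
        String file = String.format("%c%06d", prefix, seq);
        return typeDir.resolve(subdir).resolve(file);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(Paths.get("indi"), 'I', 1));
        System.out.println(pathFor(Paths.get("indi"), 'I', 1001));
    }
}
```

With at most 1000 files per folder, a shell glob over one subdir stays well below the size that caused the grep failure.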
Thanks for the pointer to Derby. The footprint is not that small, though. It is 2M, compared to my current 1M, which would triple the size. I think adding SQL support is inevitable, but it is not practical yet.
Since you mention sqlite, what are your reasons for not using it? According to
the footprint it adds is between 150K and 250K. Sure, you will have to use SQL syntax, but it supports variable length records, unlike the xBase (dBase) libraries that are at every corner of the internet.
I don't think you can use SQLite with Java. It is a C library. You would have to create a Java add-on for every platform. They would have to re-implement the whole library in Java first.
I had not seen that you are using Java. There are Java bindings for Sqlite, see
This would still mean that you need the library for each platform.
For Java there is HSQLDB ( http://www.hsqldb.org/ ), a database implemented in Java. It is the database used by OpenOffice.org for its Base component. I have not found anything about its code size, but they describe it as a "Lightweight 100% Java SQL Database Engine".
Thanks for the info on Sqlite and HSQLDB, but it looks like Derby would be better than these 2. Adding Sqlite wrappers for each platform defeats the purpose of using Java. And HSQLDB is 3M instead of Derby's 2M. If these are lightweight, it makes me wonder what heavyweight is.
It is almost time for a GDBI release, or at least another patch, but I did not want to release this new database yet, so I am only enabling it if debugging is set. You have to define DEBUG_DEFAULT = true in GdbiDebug to test it.
John, should I do the same thing with PGV Importer?
Yes, we should disable the PGV Importer for the next release of GDBI, as it still needs some refining.
I am working on the issue of searching for names right after start-up. I have added a thread to read and cache all the records, but it takes a while to read everything in. I may need to snapshot the names.
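The background reader is along these lines. This is a simplified sketch, not the real code: the class name RecordCacheSketch and its methods are made up, it caches whole record texts keyed by file name, and it does not include the name snapshot.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

// Hypothetical sketch: read every record file on a background thread.
class RecordCacheSketch {
    // Thread-safe map so lookups can run while the loader is still filling it.
    private final Map<String, String> records = new ConcurrentHashMap<>();
    private final Thread loader;

    RecordCacheSketch(Path root) {
        loader = new Thread(() -> load(root), "fsdb-cache-loader");
        loader.setDaemon(true); // do not keep the program alive just for caching
    }

    void start() {
        loader.start();
    }

    private void load(Path root) {
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                try {
                    String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                    records.put(p.getFileName().toString(), text);
                } catch (IOException e) {
                    // Skip unreadable files; the cache is only an optimization.
                }
            });
        } catch (IOException | UncheckedIOException e) {
            // Leave the cache partially filled if the walk itself fails.
        }
    }

    // A search issued right after start-up has to wait here until loading ends.
    void awaitLoaded() throws InterruptedException {
        loader.join();
    }

    String get(String xref) {
        return records.get(xref);
    }

    int size() {
        return records.size();
    }
}
```

The wait in awaitLoaded is where the start-up delay shows up, which is why a snapshot of just the names may be needed to keep early searches fast.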