From: Peter C. <pet...@an...> - 2003-11-21 00:41:28
The ANU Data Mining Group is pleased to announce the release of Febrl 0.2.2, prototype program code intended to make probabilistic record linkage easier, faster and more accurate for biomedical and other researchers. The programs, known collectively as "Febrl" (Freely Extensible Biomedical Record Linkage), address the data cleaning and standardisation tasks which are essential first steps for most record linkage projects, and provide routines for probabilistic record linkage and record deduplication.

This fourth release, Febrl Version 0.2.2, is a bug-fix release for Version 0.2.1. We would like to thank everybody who sent us bug reports or other comments.

The main features of the current release are:

- Probabilistic and rules-based cleaning and standardisation routines for names, addresses and dates.
- A variety of supplied look-up and frequency tables for names and addresses.
- Various comparison functions for names, addresses, dates and localities, including approximate string comparisons, phonetic encodings, geographical distance comparisons, and time and age comparisons.
- Several blocking (indexing) methods, including the traditional compound key blocking used in many record linkage programs.
- Probabilistic record linkage routines based on the classical Fellegi and Sunter approach, as well as a 'flexible classifier' that allows a flexible definition of the weight calculation.
- Process indicators that give estimates of the remaining processing time.
- Access methods for fixed format and comma-separated value (CSV) text files, as well as SQL databases.
- An efficient temporary direct random access data set based on the Berkeley database library.
- A one-to-one assignment procedure for linked record pairs based on the 'Auction' algorithm.
- Support for parallelism for higher performance on parallel platforms, based on MPI (Message Passing Interface), a standard for parallel programming, and Pypar, an efficient and easy-to-use module that allows Python programs to run in parallel on multiple processors and communicate using MPI.
- A database generator which allows the creation of data sets of randomly created records (containing names, addresses and dates), with the option of including duplicate records with randomly introduced modifications. This allows for easy testing and evaluation of linkage (deduplication) processes.
- Example project modules and example data sets that allow Febrl projects to be run without any modifications.
- An extensive 147 page manual. Note that you might have problems with printing the 'febrldoc-0.2.2.pdf' manual; if so, use either the 'febrldoc-0.2.2.ps.gz' or the 'febrldoc-0.2.2-destilled.pdf' version instead.

Febrl, which is written in the free, open source Python programming language, is itself available under a free, open source license, which we hope will encourage others to contribute to its further development and support.

Contact details, background information, documentation and, of course, the program code are all available from the project Web site at

  http://datamining.anu.edu.au/linkage.html

as well as from 'sourceforge.net' at

  http://sourceforge.net/projects/febrl

We would like to stress that the programs are still in the early stages of development, and we do not yet recommend them for production use, but we encourage you to try them and to provide us with feedback. We particularly welcome bug reports and ideas for future development.

There are many ways to help with the project: testing, programming and software engineering, documentation and technical writing, translation, provision of (anonymous, non-confidential) training and example data sets, and testing (did we mention that already?).

We look forward to hearing from you.

Peter Christen and Tim Churches
Principal Developers of Febrl