From: Tim C. <tc...@op...> - 2003-07-10 21:29:47
On Fri, 2003-07-11 at 04:48, Marion STURTEVANT wrote:
> I work for the state of Oregon (USA), in public health. We are considering
> using febrl to deduplicate a client list with on the order of 1,000,000
> records. Do any of you have any relevant experience with using (or failing
> to use) febrl on data sets of that size? I do know that we will need to
> install bsddb3 to get shelve to work properly.

The largest semi-formal deduplication test we have run (in which we compared results against AutoMatch) is 100,000 records. We are planning to run a test on 1 million records in November when Peter Christen returns from his (long, well-deserved) holiday.

Just before he left, Peter was trying some larger deduplication runs using our main test dataset (about 950,000 real-life records from a population-based mothers-and-babies data collection), and was running into some problems with the auction assignment algorithm (which Febrl uses in place of AutoMatch's linear sum assignment algorithm) - it is clear that the auction code needs more work. Speed is also an issue at this stage - we haven't really begun to optimise Febrl by identifying which parts need to be re-written in C (as Python extension modules).

So in answer to your question, by all means try, but please understand that Febrl is still only at the development (and experimental) stage, and thus there can be no guarantee of success. I am happy to help as much as possible, but can't devote days or weeks to debugging code in the next few months. I could arrange to hire some Python programmers to assist if you have some funds available, but I would prefer not to have to do that at this stage. I would suggest having an alternative plan for doing the deduplication, or using another record linkage system in parallel to provide a comparison.

We (meaning the Public Health Division of the New South Wales Dept of Health) don't plan to start using Febrl on a production basis until some time in 2004 - probably late 2004. In the interim we are continuing to use AutoMatch and will be running Febrl in parallel on some jobs for testing purposes. Our eventual aim is to be able to deduplicate (or link, but deduplication tends to be a worst-case linkage scenario) 10 million record datasets on a few computers in a reasonable time (as in a few hours), and 100 million records on a small cluster of computers (20 to 100 PC workstations, not necessarily a dedicated parallel computer) overnight.

At this stage, we have definite funding to allow Peter to work on Febrl full-time for about 4 months starting in Nov 2003, and have applied for 3 years of Australian Research Council (ARC) funding. If we get that (fingers crossed) then Peter will be able to work almost full-time on Febrl (and still do some comp sci teaching, I think), plus there will be a full-time PhD student, plus we can involve Markus Hegland a bit more to bring additional, very high-level computational mathematics/data mining/machine learning expertise to the project.

All that by way of saying that we are committed to the Febrl project (and have some exciting ideas we want to trial), and it is probable that we will have funding to allow us to spend lots of time on it (after a brief hiatus for the next few months), but at this stage the software must still be regarded as experimental.
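On the bsddb3 point you raise: yes, at that scale the shelve files need a Berkeley DB backend rather than one of the size-limited dbm modules. A minimal sketch of wiring bsddb3 into the standard shelve module (the filename and record values here are made up for illustration - this is not Febrl code):

    import shelve
    import bsddb3

    # Open a Berkeley DB hash file and wrap it in a Shelf, so that
    # picklable record objects are stored on disk rather than in memory.
    db = bsddb3.hashopen('febrl-index.db', 'c')   # 'c' = create if absent
    index = shelve.BsdDbShelf(db)

    index['rec-0000001'] = ('smith', 'marion', '97201')   # key -> record
    rec = index['rec-0000001']
    index.close()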
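And for anyone curious about the auction assignment step mentioned above: Febrl's actual code is more involved (and, as noted, still being debugged), but the basic idea of a Bertsekas-style forward auction for a dense, square assignment problem fits in a few dozen lines of Python. Again, this is purely an illustrative sketch, not the Febrl implementation:

    def auction_assignment(benefit, eps=None):
        # benefit[i][j]: payoff for giving object j to person i; the aim is
        # a one-to-one assignment that maximises the total payoff.
        n = len(benefit)
        if eps is None:
            eps = 1.0 / (n + 1)   # eps < 1/n is optimal for integer payoffs
        prices = [0.0] * n
        owner = [None] * n        # owner[j]: person currently holding object j
        assigned = [None] * n     # assigned[i]: object currently held by person i
        unassigned = list(range(n))
        while unassigned:
            i = unassigned.pop()
            # Net value of each object to person i at the current prices.
            values = [benefit[i][j] - prices[j] for j in range(n)]
            # Find the best and second-best objects for this bidder.
            best, best_v, second_v = 0, values[0], None
            for j in range(1, n):
                if values[j] > best_v:
                    best, best_v, second_v = j, values[j], best_v
                elif second_v is None or values[j] > second_v:
                    second_v = values[j]
            if second_v is None:      # only one object
                second_v = best_v
            # Bid up the price of the best object until it is barely
            # preferable to the second-best, plus eps to force progress.
            prices[best] += best_v - second_v + eps
            # The bid wins; any previous owner goes back into the pool.
            if owner[best] is not None:
                assigned[owner[best]] = None
                unassigned.append(owner[best])
            owner[best] = i
            assigned[i] = best
        return assigned

For example, auction_assignment([[5, 1], [2, 4]]) returns [0, 1]. With eps driven towards zero (epsilon-scaling), the result converges on the optimum of the same linear sum assignment problem that AutoMatch solves directly.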
And of course we are delighted to collaborate with others who want to contribute to its development, by writing code or documentation or, perhaps even more importantly, by testing it with real-life data and providing feedback.

Regards,

-- 
Tim C

PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
or at http://members.optushome.com.au/tchur/pubkey.asc
Key fingerprint = 8C22 BF76 33BA B3B5 1D5B EB37 7891 46A9 EAF9 93D0