From: Tim C. <tc...@op...> - 2003-07-10 21:29:47
On Fri, 2003-07-11 at 04:48, Marion STURTEVANT wrote:
> I work for the state of Oregon (USA), in public health. We are considering
> using febrl to deduplicate a client list with on the order of 1,000,000
> records. Do any of you have any relevant experience with using (or failing
> to use) febrl on data sets of that size? I do know that we will need to
> install bsddb3 to get shelve to work properly.

The largest semi-formal deduplication test we have run (in which we compared results against AutoMatch) is 100,000 records. We are planning to run a test on 1 million records in November when Peter Christen returns from his (long, well-deserved) holiday.

Just before he left, Peter was trying some larger deduplication runs using our main test dataset (about 950,000 real-life records from a population-based mothers-and-babies data collection), and was running into some problems with the auction assignment algorithm (which Febrl uses in place of AutoMatch's linear sum assignment algorithm) - it is clear that the auction code needs more work. Speed is also an issue at this stage - we haven't really begun to optimise Febrl by identifying which parts need to be re-written in C (as Python extension modules).

So in answer to your question, by all means try, but please understand that Febrl is still only at the development (and experimental) stage, and thus there can be no guarantee of success. I am happy to help as much as possible, but can't devote days or weeks to debugging code in the next few months. I could arrange to hire some Python programmers to assist if you have some funds available, but I would prefer not to have to do that at this stage. I would suggest having an alternative plan for doing the deduplication, or using another record linkage system in parallel to provide a comparison.

We (meaning the Public Health Division of the New South Wales Dept of Health) don't plan to start using Febrl on a production basis until some time in 2004 - probably late 2004. In the interim we are continuing to use AutoMatch and will be running Febrl in parallel on some jobs for testing purposes. Our eventual aim is to be able to deduplicate (or link, but deduplication tends to be a worst-case linkage scenario) 10 million record datasets on a few computers in a reasonable time (as in a few hours), and 100 million records on a small cluster of computers (20 to 100 PC workstations, not necessarily a dedicated parallel computer) overnight.

At this stage, we have definite funding to allow Peter to work on Febrl full-time for about 4 months starting in Nov 2003, and have applied for 3 years of Australian Research Council (ARC) funding. If we get that (fingers crossed) then Peter will be able to work almost full-time on Febrl (and still do some comp sci teaching, I think), plus there will be a full-time PhD student, plus we can involve Markus Hegland a bit more to bring additional, very high-level computational mathematics/data mining/machine learning expertise to the project.

All that by way of saying that we are committed to the Febrl project (and have some exciting ideas we want to trial), and it is probable that we will have funding to allow us to spend lots of time on it (after a brief hiatus for the next few months), but at this stage the software must still be regarded as experimental.
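On the bsddb3 point you raise: yes, at that scale the shelve files need a Berkeley DB backend rather than one of the size-limited dbm modules. A minimal sketch of wiring bsddb3 into the standard shelve module (the filename and record values here are made up for illustration - this is not Febrl code):

    import shelve
    import bsddb3

    # Open a Berkeley DB hash file and wrap it in a Shelf, so that
    # picklable record objects are stored on disk rather than in memory.
    db = bsddb3.hashopen('febrl-index.db', 'c')   # 'c' = create if absent
    index = shelve.BsdDbShelf(db)

    index['rec-0000001'] = ('smith', 'marion', '97201')   # key -> record
    rec = index['rec-0000001']
    index.close()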
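And for anyone curious about the auction assignment step mentioned above: Febrl's actual code is more involved (and, as noted, still being debugged), but the basic idea of a Bertsekas-style forward auction for a dense, square assignment problem fits in a few dozen lines of Python. Again, this is purely an illustrative sketch, not the Febrl implementation:

    def auction_assignment(benefit, eps=None):
        # benefit[i][j]: payoff for giving object j to person i; the aim is
        # a one-to-one assignment that maximises the total payoff.
        n = len(benefit)
        if eps is None:
            eps = 1.0 / (n + 1)   # eps < 1/n is optimal for integer payoffs
        prices = [0.0] * n
        owner = [None] * n        # owner[j]: person currently holding object j
        assigned = [None] * n     # assigned[i]: object currently held by person i
        unassigned = list(range(n))
        while unassigned:
            i = unassigned.pop()
            # Net value of each object to person i at the current prices.
            values = [benefit[i][j] - prices[j] for j in range(n)]
            # Find the best and second-best objects for this bidder.
            best, best_v, second_v = 0, values[0], None
            for j in range(1, n):
                if values[j] > best_v:
                    best, best_v, second_v = j, values[j], best_v
                elif second_v is None or values[j] > second_v:
                    second_v = values[j]
            if second_v is None:      # only one object
                second_v = best_v
            # Bid up the price of the best object until it is barely
            # preferable to the second-best, plus eps to force progress.
            prices[best] += best_v - second_v + eps
            # The bid wins; any previous owner goes back into the pool.
            if owner[best] is not None:
                assigned[owner[best]] = None
                unassigned.append(owner[best])
            owner[best] = i
            assigned[i] = best
        return assigned

For example, auction_assignment([[5, 1], [2, 4]]) returns [0, 1]. With eps driven towards zero (epsilon-scaling), the result converges on the optimum of the same linear sum assignment problem that AutoMatch solves directly.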
And of course we are delighted to collaborate with others who want to contribute to its development, by writing code or documentation or, perhaps even more importantly, by testing it with real-life data and providing feedback.

Regards,

-- 
Tim C

PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
or at http://members.optushome.com.au/tchur/pubkey.asc
Key fingerprint = 8C22 BF76 33BA B3B5 1D5B EB37 7891 46A9 EAF9 93D0