Thanks for the reply - I haven't looked at BigMatch - I'll give it a try. Since I'm only using FellegiSunter, I simply generated the weight-vector file as normal and then processed it myself.

It would be easy enough to change the relevant algorithms to use generators, but I don't really want to stray too far from the standard distribution.
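For what it's worth, this is the kind of change I have in mind -- a minimal sketch only, where the weight-vector file format, the function names, and the threshold values are all placeholders of mine, not Febrl's actual API:

```python
# Sketch of a streaming Fellegi-Sunter style classification.
# NOTE: the file layout (rec_id_a, rec_id_b, weights...), the helper
# names and the thresholds are hypothetical, not Febrl's real API.

def read_weight_vectors(path):
    """Yield (rec_id_a, rec_id_b, total_weight) one line at a time."""
    with open(path) as f:
        for line in f:
            rec_a, rec_b, *weights = line.rstrip("\n").split(",")
            yield rec_a, rec_b, sum(float(w) for w in weights)

def classify_stream(weight_vectors, lower=0.0, upper=10.0):
    """Lazily classify each pair instead of building three big sets."""
    for rec_a, rec_b, w in weight_vectors:
        if w >= upper:
            yield rec_a, rec_b, "match"
        elif w <= lower:
            yield rec_a, rec_b, "non-match"
        else:
            yield rec_a, rec_b, "possible"
```

Consuming the generator keeps only one weight vector in memory at a time, e.g. `for a, b, decision in classify_stream(read_weight_vectors("wvecs.csv")): ...`, at the cost of not having the three result sets available for random access afterwards.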


2008/11/24 Albert-jan Roskam <>

Yep, I experienced the same thing. What I did was divide my dataset into twelve pieces, one for each month. That didn't lose any matches because the dob variable was a blocking variable anyway. Then I ran 12 analyses on 12 computers and glued the results back together. Fast!
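(A rough sketch of that splitting step, for anyone wanting to try it. The column name "dob" and the YYYY-MM-DD date format are my assumptions; since dob is a blocking variable, no candidate pair can span two output files, so each piece can be de-duplicated independently.)

```python
import csv

# Split a CSV into one file per birth month ("part_01.csv" .. "part_12.csv").
# ASSUMPTIONS: a header row, a column named "dob", dates as YYYY-MM-DD.
def split_by_month(path):
    files, writers = {}, {}
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            month = row["dob"].split("-")[1]   # "07" from "1980-07-15"
            if month not in writers:
                out = open(f"part_{month}.csv", "w", newline="")
                files[month] = out
                writers[month] = csv.DictWriter(out, fieldnames=reader.fieldnames)
                writers[month].writeheader()
            writers[month].writerow(row)
    for out in files.values():
        out.close()
```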

I am not proficient enough (yet!) in Python, but I agree that it would be nicer to come up with a real solution. I believe generator expressions could be used for this. Another thing: did you try using the BigMatch algorithm?


--- On Sun, 11/23/08, Adi Eyal <> wrote:

> From: Adi Eyal <>
> Subject: [Febrl-list] Memory usage
> To:
> Date: Sunday, November 23, 2008, 9:42 PM
> Hi All
> I'm currently experiencing memory problems when
> de-duping a 160Mb file with
> 1M records. The problem seems to be the design of the
> classifier classes.
> The classify method returns three sets, match, non-match
> and possible match,
> which in effect doubles the number of rows in memory. A
> more memory
> efficient solution would not hold all the data in memory
> but would rather
> iterate over the weight-vector file. The three returned
> datasets make this
> solution somewhat awkward. Has anyone encountered a similar
> problem? Any way
> to work around it without getting my hands messy?
> Adi