From: Adi E. <ad...@di...> - 2008-11-24 11:41:48
Thanks for the reply. I haven't looked at BigMatch yet; I'll give it a try.
Since I'm only using FellegiSunter, I simply generated the weight-vector file
as normal and then processed it myself. It would be easy enough to change the
relevant algorithms to use generators, but I don't really want to stray too
far from the standard distribution.

Adi

2008/11/24 Albert-jan Roskam <fo...@ya...>

> Hi,
>
> Yep, I experienced the same thing. What I did was to divide my dataset into
> twelve pieces, one for each month. That didn't matter because the dob
> variable was a blocking variable anyway. Then I ran 12 analyses on 12
> computers and glued the results back together. Fast!
>
> I am not proficient enough (yet!) in Python, but I agree that it would be
> nicer to come up with a real solution. I believe generator expressions could
> be used for this. Another thing: did you try using the BigMatch algorithm?
>
> Cheers!!
> Albert-Jan
>
>
> --- On Sun, 11/23/08, Adi Eyal <ad...@di...> wrote:
>
> > From: Adi Eyal <ad...@di...>
> > Subject: [Febrl-list] Memory usage
> > To: feb...@li...
> > Date: Sunday, November 23, 2008, 9:42 PM
> > Hi All
> >
> > I'm currently experiencing memory problems when de-duping a 160 MB file
> > with 1M records. The problem seems to be the design of the classifier
> > classes. The classify method returns three sets (match, non-match and
> > possible match), which in effect doubles the number of rows in memory. A
> > more memory-efficient solution would not hold all the data in memory but
> > would rather iterate over the weight-vector file. The three returned
> > datasets make this solution somewhat awkward. Has anyone encountered a
> > similar problem? Any way to work around the problem without getting my
> > hands messy?
> >
> > Adi
> > _______________________________________________
> > Febrl-list mailing list
> > Feb...@li...
> > https://lists.sourceforge.net/lists/listinfo/febrl-list
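
For anyone who finds this thread later: the approach discussed above (iterating over the weight-vector file with generators instead of holding three result sets in memory) might look roughly like the sketch below. This is not Febrl's API. The function names, the CSV layout (two record ids followed by numeric weights, no header row) and the threshold values are all assumptions made for illustration, and the two-threshold rule is only the usual Fellegi-Sunter style decision on the summed weight.

```python
import csv
import io
from collections import Counter


def read_weight_vectors(handle):
    """Lazily yield (rec_id_1, rec_id_2, summed_weight), one row at a time.

    Assumed (hypothetical) layout: two record ids, then numeric weights.
    """
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        rec1, rec2, *weights = row
        yield rec1.strip(), rec2.strip(), sum(float(w) for w in weights)


def classify_stream(vectors, lower, upper):
    """Lazily label each pair using two thresholds on the summed weight."""
    for rec1, rec2, total in vectors:
        if total > upper:
            label = "match"
        elif total < lower:
            label = "non-match"
        else:
            label = "possible"
        yield label, rec1, rec2


# Tiny in-memory stand-in for a weight-vector file on disk.
sample = io.StringIO("a,b,5.0,3.0\na,c,-2.0,-4.0\nb,d,1.0,0.5\n")
counts = Counter(
    label for label, _, _ in classify_stream(read_weight_vectors(sample), 0.0, 5.0)
)
print(dict(counts))
```

Because both functions are generators, only one row is held in memory at a time. In practice the labelled pairs could be written straight to three output files rather than counted, and the input would be an open file object instead of the in-memory sample.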