From: Albert-jan R. <fo...@ya...> - 2008-11-24 09:27:33
Hi,

Yep, I experienced the same thing. What I did was to divide my dataset into twelve pieces, one for each month. Splitting that way didn't affect the results, because the dob variable was used as a blocking variable anyway. Then I ran 12 analyses on 12 computers and glued the results back together. Fast! (A rough sketch of the splitting step is at the bottom of this mail.)

I am not proficient enough (yet!) in Python, but I agree that it would be nicer to come up with a real solution. I believe generator expressions could be used for this; there is a second sketch at the bottom showing the general idea.

Another thing: did you try using the BigMatch algorithm?

Cheers!!
Albert-Jan

--- On Sun, 11/23/08, Adi Eyal <ad...@di...> wrote:

> From: Adi Eyal <ad...@di...>
> Subject: [Febrl-list] Memory usage
> To: feb...@li...
> Date: Sunday, November 23, 2008, 9:42 PM
>
> Hi All
>
> I'm currently experiencing memory problems when de-duping a 160Mb file
> with 1M records. The problem seems to be the design of the classifier
> classes. The classify method returns three sets -- match, non-match and
> possible match -- which in effect doubles the number of rows in memory.
> A more memory-efficient solution would not hold all the data in memory
> but would rather iterate over the weight-vector file. The three returned
> data sets make this solution somewhat awkward. Has anyone encountered a
> similar problem? Any way to work around the problem without getting my
> hands messy?
>
> Adi
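
Sketch 1: the split-by-month step. This is only a minimal illustration, not anything from Febrl itself. It assumes a CSV input with a header row and a dob column formatted as YYYY-MM-DD; the file and column names are placeholders you would replace with your own. Because the month is the blocking variable, no candidate pairs cross the split, so the pieces can be de-duplicated independently.

import csv

def split_by_month(in_path="dataset.csv", out_prefix="dataset_month"):
    """Write one CSV per calendar month of the (assumed) dob column."""
    with open(in_path, newline="") as infile:
        reader = csv.DictReader(infile)
        writers = {}   # month string -> DictWriter
        files = []     # open file handles, closed at the end
        for row in reader:
            month = row["dob"].split("-")[1]   # '01' .. '12', assumes YYYY-MM-DD
            if month not in writers:
                f = open(f"{out_prefix}_{month}.csv", "w", newline="")
                files.append(f)
                w = csv.DictWriter(f, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[month] = w
            writers[month].writerow(row)
        for f in files:
            f.close()

if __name__ == "__main__":
    split_by_month()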
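
Sketch 2: the generator idea. This is not Febrl's actual classifier API; it assumes a plain comma-separated weight-vector file (record id pair followed by the weights) and a simple two-threshold rule, with all names made up. The point is only that the three result sets can be streamed to per-class output files one pair at a time instead of being built up in memory.

def classifications(weight_vector_path, lower=0.0, upper=10.0):
    """Yield (rec_id_a, rec_id_b, label) one pair at a time, so the three
    result sets never have to be held in memory at once."""
    with open(weight_vector_path) as infile:
        for line in infile:
            fields = line.rstrip("\n").split(",")
            rec_a, rec_b = fields[0], fields[1]
            total = sum(float(w) for w in fields[2:])   # summed matching weights
            if total >= upper:
                label = "match"
            elif total <= lower:
                label = "non-match"
            else:
                label = "possible"
            yield rec_a, rec_b, label

if __name__ == "__main__":
    # Stream results straight to three output files instead of three sets.
    outputs = {name: open(f"{name}.csv", "w")
               for name in ("match", "non-match", "possible")}
    for rec_a, rec_b, label in classifications("weight_vectors.csv"):
        outputs[label].write(f"{rec_a},{rec_b}\n")
    for f in outputs.values():
        f.close()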