From: Albert-jan R. <fo...@ya...> - 2008-11-24 09:27:33
Hi,

Yep, I experienced the same thing. What I did was to divide my dataset into twelve pieces, one for each month. Splitting that way didn't affect the results, because the dob variable was used as a blocking variable anyway. Then I ran 12 analyses on 12 computers and glued the results back together. Fast! (A rough sketch of the splitting step is at the bottom of this mail.)

I am not proficient enough (yet!) in Python, but I agree that it would be nicer to come up with a real solution. I believe generator expressions could be used for this; there is a second sketch at the bottom showing the general idea.

Another thing: did you try using the BigMatch algorithm?

Cheers!!
Albert-Jan

--- On Sun, 11/23/08, Adi Eyal <ad...@di...> wrote:

> From: Adi Eyal <ad...@di...>
> Subject: [Febrl-list] Memory usage
> To: feb...@li...
> Date: Sunday, November 23, 2008, 9:42 PM
>
> Hi All
>
> I'm currently experiencing memory problems when de-duping a 160Mb file
> with 1M records. The problem seems to be the design of the classifier
> classes. The classify method returns three sets -- match, non-match and
> possible match -- which in effect doubles the number of rows in memory.
> A more memory-efficient solution would not hold all the data in memory
> but would rather iterate over the weight-vector file. The three returned
> data sets make this solution somewhat awkward. Has anyone encountered a
> similar problem? Any way to work around the problem without getting my
> hands messy?
>
> Adi
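
Sketch 1: the split-by-month step. This is only a minimal illustration, not anything from Febrl itself. It assumes a CSV input with a header row and a dob column formatted as YYYY-MM-DD; the file and column names are placeholders you would replace with your own. Because the month is the blocking variable, no candidate pairs cross the split, so the pieces can be de-duplicated independently.

import csv

def split_by_month(in_path="dataset.csv", out_prefix="dataset_month"):
    """Write one CSV per calendar month of the (assumed) dob column."""
    with open(in_path, newline="") as infile:
        reader = csv.DictReader(infile)
        writers = {}   # month string -> DictWriter
        files = []     # open file handles, closed at the end
        for row in reader:
            month = row["dob"].split("-")[1]   # '01' .. '12', assumes YYYY-MM-DD
            if month not in writers:
                f = open(f"{out_prefix}_{month}.csv", "w", newline="")
                files.append(f)
                w = csv.DictWriter(f, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[month] = w
            writers[month].writerow(row)
        for f in files:
            f.close()

if __name__ == "__main__":
    split_by_month()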
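
Sketch 2: the generator idea. This is not Febrl's actual classifier API; it assumes a plain comma-separated weight-vector file (record id pair followed by the weights) and a simple two-threshold rule, with all names made up. The point is only that the three result sets can be streamed to per-class output files one pair at a time instead of being built up in memory.

def classifications(weight_vector_path, lower=0.0, upper=10.0):
    """Yield (rec_id_a, rec_id_b, label) one pair at a time, so the three
    result sets never have to be held in memory at once."""
    with open(weight_vector_path) as infile:
        for line in infile:
            fields = line.rstrip("\n").split(",")
            rec_a, rec_b = fields[0], fields[1]
            total = sum(float(w) for w in fields[2:])   # summed matching weights
            if total >= upper:
                label = "match"
            elif total <= lower:
                label = "non-match"
            else:
                label = "possible"
            yield rec_a, rec_b, label

if __name__ == "__main__":
    # Stream results straight to three output files instead of three sets.
    outputs = {name: open(f"{name}.csv", "w")
               for name in ("match", "non-match", "possible")}
    for rec_a, rec_b, label in classifications("weight_vectors.csv"):
        outputs[label].write(f"{rec_a},{rec_b}\n")
    for f in outputs.values():
        f.close()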