[Febrl-list] Memory usage

Hi All

I'm currently running into memory problems when de-duplicating a 160 MB file
with 1M records. The problem seems to be the design of the classifier classes.
The classify method returns three sets (match, non-match and possible match),
which in effect doubles the number of rows held in memory. A more
memory-efficient solution would not hold all the data in memory but would
instead iterate over the weight-vector file. The three returned data sets make
this solution somewhat awkward. Has anyone encountered a similar problem? Is
there any way to work around it without getting my hands dirty?
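
To make the idea concrete, here is a rough sketch of the kind of streaming
classification I have in mind. It is not Febrl's actual API: the function
name classify_stream, the CSV layout (two record ids followed by the
weights), the summed-weight rule and the two thresholds are all assumptions
for illustration. Each weight vector is read, classified and written
straight to one of three output files, so only one row is in memory at a time.

```python
import csv


def classify_stream(weight_file, lower, upper, out_prefix):
    """Classify weight vectors one row at a time (hypothetical helper).

    Assumed input layout: rec_id_a, rec_id_b, weight_1, ..., weight_n.
    Rows are written to <out_prefix>_match.csv, <out_prefix>_non_match.csv
    or <out_prefix>_possible_match.csv instead of being kept in three sets.
    """
    labels = ("match", "non_match", "possible_match")
    counts = dict.fromkeys(labels, 0)
    outs = {
        label: open("%s_%s.csv" % (out_prefix, label), "w", newline="")
        for label in labels
    }
    try:
        writers = {label: csv.writer(f) for label, f in outs.items()}
        with open(weight_file, newline="") as src:
            for row in csv.reader(src):
                rec_a, rec_b = row[0], row[1]
                # Assumed decision rule: compare the summed weight
                # against a lower and an upper threshold.
                total = sum(float(w) for w in row[2:])
                if total >= upper:
                    label = "match"
                elif total <= lower:
                    label = "non_match"
                else:
                    label = "possible_match"
                writers[label].writerow([rec_a, rec_b, total])
                counts[label] += 1
    finally:
        for f in outs.values():
            f.close()
    # Only the per-class counts are returned, not the rows themselves.
    return counts
```

The three result sets would then live on disk rather than in memory, and
anything downstream that needs them could iterate over those files in the
same way.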

Adi