From: Andrew B. <And...@sa...> - 2016-08-16 01:25:16
|
Hi everyone,

Just an update on this - I managed to get it to work on Windows 64-bit. Here's what I used:

  https://www.python.org/ftp/python/2.7.12/python-2.7.12.amd64.msi
  http://ftp.gnome.org/pub/gnome/binaries/win64/gtk+/2.22/gtk+-bundle_2.22.1-20101229_win64.zip
  matplotlib-1.5.1-cp27-none-win_amd64.whl from https://pypi.python.org/pypi/matplotlib/
  pygtk-2.22.0-cp27-none-win_amd64.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#pygtk
  pycairo_gtk-1.10.0-cp27-none-win_amd64.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#pygtk
  pygobject-2.28.6-cp27-none-win_amd64.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#pygtk
  libsvm-3.21-cp27-none-win_amd64.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#libsvm

(then use pip to install the .whl files)

The output looks ok and input files that were previously crashing with a memory error are now completing, but I'm not a statistician so I can't vouch for the correctness. I'm waiting on our end users to confirm the output is what was expected.

Regards,
Andrew |
From: Manal H. <man...@gm...> - 2016-08-08 10:44:16
|
Hi,

I am interested in doing probabilistic linkage using Febrl, and am having problems importing libraries on the command line and in Anaconda (Spyder). I managed to import all of:

    from dataset import *          # Data set routines
    from standardisation import *  # Standardisation routines
    from comparison import *       # Comparison functions
    from lookup import *           # Look-up table routines
    from indexing import *         # Indexing and blocking routines
    from simplehmm import *        # Hidden Markov model (HMM) routines
    from classification import *   # Classifiers for weight vectors

except:

    from febrl import *            # Main Febrl classes

    Traceback (most recent call last):
      File "<ipython-input-33-7d2e94d64e88>", line 1, in <module>
        from febrl import *            # Main Febrl classes
    ImportError: No module named febrl

I unzipped the febrl-0.4.2.tar.gz file and ran the tests successfully. I am working on CentOS, and pygtk and gtk seem to be Windows-only libraries:

    $ sudo pip install pygtk
    Collecting pygtk
      Using cached pygtk-2.24.0.tar.bz2
      Complete output from command python setup.py egg_info:
      ********************************************************************
      * Building PyGTK using distutils is only supported on windows.     *
      * To build PyGTK in a supported way, read the INSTALL file.        *
      ********************************************************************

    $ sudo pip install gtk
    Collecting gtk
      Could not find a version that satisfies the requirement gtk (from versions: )
    No matching distribution found for gtk

Otherwise, libsvm and matplotlib are working fine.

The GUI gives the following error:

    $ ./guiFebrl.py
    Febrl directory: /home/mhelal/febrl
    WARNING:root:Cannot import svm module
    WARNING:root:Matplotlib module not installed.
    GTK and PyGTK not installed.

When I run it as root:

    $ sudo ./guiFebrl.py
    WARNING:root:Cannot import svm module
    (guiFebrl.py:11169): libglade-WARNING **: could not look up stock id 'Data set generator'
    (guiFebrl.py:11169): libglade-WARNING **: could not look up stock id 'HMM training'
    (guiFebrl.py:11169): libglade-WARNING **: could not look up stock id 'String comparison training'
    (guiFebrl.py:11169): libglade-WARNING **: could not look up stock id 'Online help'

and more like that, but I still managed to see the GUI, load data and create a project file (attached). However, I cannot execute it. I am trying to follow the documentation at http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/node8.html and got stuck at importing the Febrl classes.

I would very much appreciate your help to keep going. Thanks in advance,

--
Kind Regards,
Manal Helal |
From: Albert-Jan R. <fo...@ya...> - 2016-07-03 07:57:19
|
Hi Andrew, You're welcome. Yes, I agree it's nicer to have a real, scalable solution. It may be an option to use the 'shelve' module in lieu of that builtin-dict. It might be an easy fix since the interface is the same. Not sure what the performance penalty is, but with SSD it might still be acceptable. OTOH, 64-bit *should* also work. AFAIK, no C libraries or something like that are used. Best wishes AJ |
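The 'shelve' suggestion above can be tried in isolation before touching Febrl. The sketch below is not Febrl code: `build_block_index` and the sample records are hypothetical, and it only illustrates that a function written against the dict interface can be handed either a plain dict or a disk-backed shelf.

```python
import os
import shelve
import tempfile


def build_block_index(records, store):
    """Fill any dict-like `store` with blocking key -> list of record ids.

    `records` is an iterable of (record_id, blocking_key) pairs. This is a
    hypothetical stand-in for an in-memory index, not Febrl's own structure.
    """
    for rec_id, block_key in records:
        # Re-assign the whole list instead of appending in place: a shelf
        # only persists a value when it is assigned (unless writeback=True).
        store[block_key] = store.get(block_key, []) + [rec_id]
    return store


records = [("r1", "smith"), ("r2", "smith"), ("r3", "jones")]

# In-memory version, limited by available RAM.
in_memory = build_block_index(records, {})

# Disk-backed version: same calling code, but the data lives in a file.
path = os.path.join(tempfile.mkdtemp(), "block_index")
shelf = shelve.open(path)
try:
    build_block_index(records, shelf)
    on_disk = dict(shelf)  # copied back only to show the contents match
finally:
    shelf.close()
```

Two caveats apply: shelf keys must be strings, and every access goes through pickling and disk I/O, so the performance penalty mentioned above would need to be measured on real data.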
From: Andrew B. <And...@sa...> - 2016-06-30 00:40:01
|
Thanks AJ.

That's what we're doing already, but we can foresee that we'll just have to keep on splitting things up as our dataset gets bigger and bigger. The idealist in me would like to have a solution that will work without modification for the years to come.

Regards,
Andrew |
From: Albert-Jan R. <fo...@ya...> - 2016-06-29 19:26:14
|
Hi, No experience with 64 bit Python, but I vaguely remember (it's been at least 5 years since I used Febrl!) experiencing MemoryErrors. Some dicts became too big. We used blocking variables so we decided to run separate Febrl sessions for each blocking value. Sorry if that might sound vague, it's just too long ago. AJ |
From: Andrew B. <And...@sa...> - 2016-06-27 00:31:59
|
Hi, We are running into out of memory issues when running Febrl. Our dataset has been growing organically over time so I suspect that we are hitting 32-bit Python's memory limit. Has anyone had any luck running Febrl in 64-bit Python (plus 64-bit versions of all of Febrl's dependencies)? Alternatively, is there a reason intrinsic to Febrl that would mean that it definitely won't work using a 64-bit Python stack? I just want to see if anyone has experience/knowledge in this area before I spend the time required to verify/validate this. Thanks, Andrew |
From: Peter C. <pet...@an...> - 2011-03-29 03:11:03
|
Dear Robin,

the problem with the Truncate string method is that it does not look at individual words in a string (and then truncate each of them), but only looks at the beginning of the full string.

For Albert-Jan's problem I am not aware of any off-the-shelf string comparison function that would be suitable. Ideally, a specific function that is customised to his data and his problem should be implemented. If anybody has (or plans to implement) such a function, please let me know and I will integrate it into Febrl.

Kind regards,
Peter

--
===========================================================
Dr Peter Christen
Associate Dean (Higher Degree Research) and Senior Lecturer
Research School of Computer Science
ANU College of Engineering and Computer Science
CSIT Building (108), North Road
The Australian National University
Canberra ACT 0200 Australia
T: +61 2 6125 5690
F: +61 2 6125 0010
W: http://cs.anu.edu.au/~Peter.Christen
AIAPA (Associate of the Institute of Analytics Professionals of Australia Limited)
CRICOS Provider #00120C |
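As an illustration of the word-by-word idea described above, here is a minimal sketch of a comparison that checks each word of the short name against the words of the long name. It is not part of Febrl and not a Febrl field comparator; the function name and the "abbreviation means prefix" rule are assumptions made for the example, and real school names would need more rules (for instance "St" for "Saint" is not a prefix).

```python
def token_prefix_similarity(long_name, short_name):
    """Fraction of words in short_name that abbreviate a distinct word
    in long_name.

    A short word counts as an abbreviation when, ignoring case and a
    trailing full stop, it is a prefix of a not-yet-used long word.
    """
    long_tokens = long_name.lower().split()
    short_tokens = [t.rstrip(".") for t in short_name.lower().split()]
    short_tokens = [t for t in short_tokens if t]
    if not long_tokens or not short_tokens:
        return 0.0

    unused = list(long_tokens)  # each long word may be matched only once
    matched = 0
    for short in short_tokens:
        for candidate in unused:
            if candidate.startswith(short):
                unused.remove(candidate)  # safe: we leave the loop right away
                matched += 1
                break
    return matched / float(len(short_tokens))


full = token_prefix_similarity("King William High School", "K. Will. High Sch.")
partial = token_prefix_similarity("King William High School", "K. William Primary")
none = token_prefix_similarity("King William High School", "Queen Mary Coll.")
```

The matching is greedy and order-insensitive, which keeps the sketch short but can pair a one-letter initial with the wrong word; a customised function for real data would probably weight longer prefixes more heavily.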
From: Robin G. <ro...@in...> - 2011-03-28 01:01:27
|
Hi Albert-Jan,

I would imagine "String Truncate" might be the most appropriate similarity measure for abbreviations.

Robin

--
Robin Gower
http://infonomics.ltd.uk
0791 255 3187

Lies, damn lies, and your evidence base:
http://clients.infonomics.ltd.uk/?q=statisticslie |
From: Albert-Jan R. <fo...@ya...> - 2011-03-24 09:32:36
|
Hello, I'm using Febrl 0.3 to match two datasets, using school names (among others) as linkage variables. Dataset A has long versions of school names, and dataset B has short(er) versions (e.g., partially abbreviated) of the school names. What is the best string similarity measure to use in such a case? Many similarity measures seem to be designed for typos, not for cases such as this. Cheers!! Albert-Jan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a fresh water system, and public health, what have the Romans ever done for us? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
From: Zem A. <ze...@wp...> - 2009-09-12 22:15:59
|
Hello,

I find Febrl a very interesting system, but I am stuck on an error. The problem is that I don't know Python and have no idea what to do next.

I'd like to use Febrl to deduplicate a database with about 700 000 records. To begin with, I started with a simple table consisting only of Id and Name (a string of up to 90 characters), with 13000 records. Because Name can contain almost any character, I had to use the COL data set type. I used FullIndex and wanted to check the results for the Damerau-Levenshtein algorithm. I tried different classification methods, but the result is always the same. The Python interpreter says:

    Error name: global name 'header_line_flag' is not defined !!!

What am I doing wrong? The examples described in the manual work perfectly; the only difference I found is the data set type (CSV versus COL). Could someone give me any ideas on how to solve this problem?

Thanks in advance

# Febrl log.

===============================================================================
Generated Febrl code for "data" on Sat Sep 12 23:56:57 2009
# -----------------------------------------------------------------------------
# Define input data set A:
#
data_set_a = dataset.DataSetCOL(description="Data set generated by Febrl GUI",
                                access_mode="read",
                                strip_fields=True,
                                miss_val=[''],
                                rec_ident="__rec_id_a__",
                                file_name="D:\Pentaho\out\TEST.txt",
                                header_line=True,
                                field_list = [("ID",10),
                                              ("NAZWA",90)])

===============================================================================
Generated Febrl code for "indexing" on Sat Sep 12 23:57:03 2009
# -----------------------------------------------------------------------------
# Define indices for "blocking"
#
index = indexing.FullIndex(dataset1 = data_set_a,
                           dataset2 = data_set_a,
                           rec_comparator = rec_comp,
                           index_sep_str = "",
                           skip_missing = True,
                           index_def = [])

===============================================================================
Generated Febrl code for "comparison" on Sat Sep 12 23:57:22 2009
# -----------------------------------------------------------------------------
# Define field comparison functions
#
fc_funct_1 = comparison.FieldComparatorDaLeDist(agree_weight = 1.0,
                                                description = "Dam-Le-Edit-Dist-NAZWA-NAZWA",
                                                disagree_weight = 0.0,
                                                missing_weight = 0.0,
                                                threshold = 0.8)

field_comp_list = [(fc_funct_1, "NAZWA", "NAZWA")]

rec_comp = comparison.RecordComparator(data_set_a, data_set_a, field_comp_list)

===============================================================================
Generated Febrl code for "classification" on Sat Sep 12 23:57:37 2009
# -----------------------------------------------------------------------------
# Define weight vector (record pair) classifier
#
classifier = classification.KMeans(dist_measure = mymath.distL2,
                                   max_iter_count = 1000,
                                   centroid_init = "min/max")

===============================================================================
Generated Febrl code for "output" on Sat Sep 12 23:57:41 2009
# -----------------------------------------------------------------------------
# Define output file options
#
histo_str_list = output.GenerateHistogram(class_w_vec_dict, 1.0)

for line in histo_str_list:
    print line |
From: Gaudenz S. <ga...@so...> - 2009-05-13 22:45:47
|
HMM improvements (febrl_simplehmm.patch)
****************************************

Sorry, I no longer remember exactly why this patch was needed. It apparently takes care of the case when a tag is not present in the trained observation list, or something similar. |
From: Gaudenz S. <ga...@so...> - 2009-05-13 22:45:03
|
Name standardisation improvements (febrl_standardisation.patch)
***************************************************************

This adds several improvements to name standardisation which are useful for standardising names gathered from internet sources (mostly email headers):

* Keep the case of individual tokens. In email headers the case of the tokens often contains significant information.
* Change NameStandardiser to tag names in all uppercase as SN if they were previously tagged as UN, SN, GM or GF. An often seen convention in email "From:" headers is to write the surname in all capitals. My tests showed that this is a far better indicator than any name list. This convention is mainly used by French-speaking and Asian persons.
* Add NameNicknameStandardiser as a derived class of NameStandardiser. This adds the ability to standardise names into an additional nickname component. Nicknames are frequently seen in internet communication and are a good indicator for later deduplicating records.

Plus there are some fixes for bugs I noticed during my work:

- allow datasets with 'readwrite' access
- correct logic for removing leading and trailing brackets
- add a check for empty token lists to __get_name_hmm__

--
Ever tried. Ever failed. No matter.
Try again. Fail again. Fail better.
~ Samuel Beckett ~ |
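The all-capitals surname rule from the list above can be sketched as a small stand-alone function. This is only an illustration of the rule as described, not the code from the patch: the function name is hypothetical, the tag meanings in the comment are my reading of the Febrl tags, and the "more than one letter" guard is an added assumption so that initials are left alone.

```python
# Tags that may be overridden (read here as: unknown, surname,
# male given name, female given name).
RETAGGABLE = {"UN", "SN", "GM", "GF"}


def retag_uppercase_as_surname(tokens, tags):
    """Return a copy of `tags` in which all-uppercase name tokens are 'SN'.

    `tokens` must keep their original case, which is why the patch also
    preserves the case of individual tokens.
    """
    new_tags = []
    for token, tag in zip(tokens, tags):
        letters = sum(1 for ch in token if ch.isalpha())
        # Assumption: single-letter tokens such as "J" or "J." are initials.
        is_all_caps = token.isupper() and letters > 1
        new_tags.append("SN" if is_all_caps and tag in RETAGGABLE else tag)
    return new_tags


# "TI" (title) is outside RETAGGABLE and is therefore never changed.
example_1 = retag_uppercase_as_surname(["Jean", "DUPONT"], ["GM", "UN"])
example_2 = retag_uppercase_as_surname(["DR", "Anna", "J.", "LI"],
                                       ["TI", "GF", "GM", "GM"])
```

Restricting the override to the four listed tags means a capitalised title or separator keeps its original tag, which matches the condition stated in the bullet.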
From: Gaudenz S. <ga...@so...> - 2009-05-13 22:43:58
|
Set comparison (febrl_set_comparison.patch)
*******************************************

Extends RecordComparator to compare sets. Each element of a set (field1) is compared with every element in the other set (field2). This is useful if, for example, you have multiple mail addresses already associated with the records under deduplication and want to compare the sets of addresses of the records. The changes to indexing.py implement indexing over fields which consist of sets. |
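The all-pairs comparison described above can be sketched in a few lines. This is not the patch itself: `compare_sets` and `same_address` are hypothetical names, and taking the best pairwise score as the result for the whole field is an assumption, since the message does not say how the pairwise results are combined.

```python
def compare_sets(values_a, values_b, compare):
    """Compare every element of values_a with every element of values_b.

    Returns the highest pairwise similarity, or 0.0 if either set is
    empty. Using the maximum as the aggregate is an assumption.
    """
    best = 0.0
    for a in values_a:
        for b in values_b:
            best = max(best, compare(a, b))
    return best


def same_address(a, b):
    """Toy element comparator: exact match ignoring case and whitespace."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


rec_1 = {"jd@example.org", "john.doe@example.com"}
rec_2 = {"John.Doe@example.com", "doe@example.net"}
rec_3 = {"someone.else@example.org"}

shared = compare_sets(rec_1, rec_2, same_address)
disjoint = compare_sets(rec_1, rec_3, same_address)
empty = compare_sets(rec_1, set(), same_address)
```

The cost is the product of the two set sizes per record pair, which is acceptable for a handful of addresses per person but worth keeping in mind for larger sets.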
From: Gaudenz S. <ga...@so...> - 2009-05-13 22:42:50
|
SQL Dataset (febrl_sql_dataset.patch)
*************************************

This adds two SQL dataset classes. The first one reads from and writes to a single table. The second one is read-only and uses two SQL queries to construct the dataset. The first query returns a list of keys and the second query is used to fetch the rows of the dataset based on this list of keys. The second class is mainly interesting because it can fetch several values for a single key and a single column. E.g. the key is a person identifier and you want to have multiple mail addresses as a set in a single row and a single column. As febrl currently does not support lists or sets as data types, this is only useful together with the set comparison feature (febrl_set_comparison.patch). |
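The two-query design described above can be illustrated with the standard sqlite3 module. The sketch below is not the patch: the table names, column names and the `read_records` function are all hypothetical, and an in-memory database stands in for whatever SQL backend the patch targets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (person_id INTEGER, email TEXT);
    INSERT INTO person VALUES (1, 'Ann Lee');
    INSERT INTO person VALUES (2, 'Bob Ray');
    INSERT INTO address VALUES (1, 'ann@example.org');
    INSERT INTO address VALUES (1, 'al@example.org');
    INSERT INTO address VALUES (2, 'bob@example.org');
""")


def read_records(conn):
    """Yield (key, record) pairs; multi-valued columns come back as sets."""
    # Query 1: the list of keys that defines the records of the data set.
    keys = [row[0] for row in
            conn.execute("SELECT person_id FROM person ORDER BY person_id")]
    for key in keys:
        # Query 2: all rows for one key; several rows collapse into one record.
        rows = conn.execute(
            "SELECT p.name, a.email FROM person p "
            "LEFT JOIN address a ON a.person_id = p.person_id "
            "WHERE p.person_id = ?", (key,)).fetchall()
        emails = set(email for _, email in rows if email is not None)
        yield key, {"name": rows[0][0], "email": emails}


records = dict(read_records(conn))
conn.close()
```

Each record ends up with one value per column, where the email column holds a set, which is the shape a set-aware comparison would then consume.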
From: Gaudenz S. <ga...@so...> - 2009-05-13 22:41:21
|
Hi

I'll send 4 patches for febrl in separate messages following this one. I first tried to send the patches in one mail, but it was too big for the mailing list software and never got approved.

I used febrl to standardise a very large set of email addresses as part of a research project on the Debian GNU/Linux distribution. I also did some experiments on deduplicating the dataset, but ultimately ended up doing most of the deduplication with a manually supervised script which used the lists produced by febrl as input. The main task was the deduplication of all the mail addresses and associated real names and nicknames of all the people who reported bugs to the Debian bug database. Therefore my patches add some improvements for this kind of data.

Please tell me if you need further information about any of the patches or if you would like them in another form (e.g. split up into smaller patches, ...). I would be happy if you could integrate some or all of these patches into the next release of febrl, but I leave it to your own judgement whether they are good enough to be part of a release. If you have any suggestions for improvements, I am certainly willing to update the patches.

Gaudenz

--
Ever tried. Ever failed. No matter.
Try again. Fail again. Fail better.
~ Samuel Beckett ~ |
From: Adi E. <ad...@di...> - 2008-12-03 18:16:01
|
Hi All

Could someone please point me in the right direction? I've been using Febrl for a little while now, and while the entity resolution works well, I'd like to know how to progress from there. According to my algorithm, I know that same(A, B) and same(B, C), but I also know that not same(A, C), so clearly I made a mistake somewhere in my entity resolution. The question is, how do I resolve this issue? I am currently considering some sort of min-cut algorithm that will split my graph in two (i.e. a fully connected graph where each person is a node and the edge weights are my match scores). I'm not clear how well this approach would work, and I suspect that I'm not the first person to come across this problem. Can anyone suggest any reading matter on the topic?

Thanks
Adi |
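Before deciding how to split such a graph, it may help to find out which groups of records are affected at all. The sketch below is not Febrl functionality and is not a min-cut; it only covers the detection step, using a union-find structure to build the transitive closure of the match pairs and then reporting every known non-match pair that ended up inside one cluster. All names are hypothetical.

```python
def find_conflicting_clusters(matches, non_matches):
    """Return [((a, b), cluster)] for non-match pairs joined by match links.

    `matches` and `non_matches` are iterables of (id, id) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Transitive closure: same(A, B) and same(B, C) puts A, B, C together.
    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)

    conflicts = []
    for a, b in non_matches:
        if a in parent and b in parent and find(a) == find(b):
            conflicts.append(((a, b), clusters[find(a)]))
    return conflicts


conflicts = find_conflicting_clusters(
    matches=[("A", "B"), ("B", "C"), ("D", "E")],
    non_matches=[("A", "C"), ("A", "D")],
)
```

Only the clusters returned here would need a further splitting step such as the min-cut idea; clusters without an internal contradiction can be accepted as they are.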
From: Adi E. <ad...@di...> - 2008-11-24 11:41:48
|
Thanks for the reply - I haven't looked at BigMatch - I'll give it a try. Since I'm only using FellegiSunter, I simply generated the weight-vector file as normal and then processed it myself. It would be easy enough to change the relevant algorithms to use generators, but I don't really want to stray too far away from the standard distribution.

Adi |
From: Albert-jan R. <fo...@ya...> - 2008-11-24 09:27:33
|
Hi,

Yep, I experienced the same thing. What I did was to divide my dataset into twelve pieces, one for each month. That didn't matter because the dob variable was a blocking variable anyway. Then I ran 12 analyses on 12 computers and glued the results back together. Fast!

I am not proficient enough (yet!) in Python, but I agree that it would be nicer to come up with a real solution. I believe generator expressions could be used for this. Another thing: did you try using the BigMatch algorithm?

Cheers!!
Albert-Jan |
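The month-by-month split described above could look roughly like this. It is a sketch under assumptions: the column name `dob`, the YYYY-MM-DD date format and the function name are hypothetical, and it is written for current Python 3 rather than the Python 2 that Febrl targets. The split is only safe because date of birth is a blocking variable, so records from different months would never be compared anyway.

```python
import csv
import io
from collections import defaultdict


def split_by_birth_month(rows, dob_field="dob"):
    """Group records by the month part of a YYYY-MM-DD date of birth."""
    pieces = defaultdict(list)
    for row in rows:
        month = row[dob_field][5:7]  # "1970-03-12" -> "03"
        pieces[month].append(row)
    return dict(pieces)


# A small in-memory stand-in for the real input file.
sample = io.StringIO(
    "rec_id,surname,dob\n"
    "1,smith,1970-03-12\n"
    "2,smyth,1970-03-12\n"
    "3,jones,1982-11-02\n"
)
pieces = split_by_birth_month(csv.DictReader(sample))
```

For a data set that does not fit in memory, the same loop would write each row straight to a per-month output file instead of collecting lists, and each file would then be processed in its own session.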
From: Adi E. <ad...@di...> - 2008-11-23 20:42:08
|
Hi All

I'm currently experiencing memory problems when de-duping a 160Mb file with 1M records. The problem seems to be the design of the classifier classes. The classify method returns three sets (match, non-match and possible match), which in effect doubles the number of rows in memory. A more memory-efficient solution would not hold all the data in memory but would rather iterate over the weight-vector file. The three returned data sets make this solution somewhat awkward. Has anyone encountered a similar problem? Is there any way to work around the problem without getting my hands messy?

Adi |
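The idea of iterating over the weight-vector file instead of building three sets can be sketched with a generator. This is not Febrl's classifier and does not read Febrl's actual weight-vector file format: the CSV layout (two record ids followed by the weights), the function name and the inclusive thresholds are assumptions made for the example, written for Python 3.

```python
import csv
import io


def classify_stream(lines, lower, upper):
    """Yield (rec_id_1, rec_id_2, label) one record pair at a time.

    A pair is a 'match' if its summed weight is >= upper, a 'non-match'
    if it is <= lower, and a 'possible' match otherwise.
    """
    for rec_id_1, rec_id_2, *weights in csv.reader(lines):
        total = sum(float(w) for w in weights)
        if total >= upper:
            label = "match"
        elif total <= lower:
            label = "non-match"
        else:
            label = "possible"
        yield rec_id_1, rec_id_2, label


# A small in-memory stand-in for a weight-vector file.
weight_file = io.StringIO(
    "a1,a2,1.0,1.0,0.9\n"
    "a1,a3,0.0,0.1,0.0\n"
    "a2,a4,1.0,0.4,0.2\n"
)
results = list(classify_stream(weight_file, lower=0.5, upper=2.5))
```

In real use the results would not be collected into a list as done here for display; each labelled pair would be written to the matching output file as it arrives, so memory use stays constant regardless of the number of record pairs.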
From: Adi E. <ad...@di...> - 2008-11-13 20:49:15
|
Has anyone considered Mural (https://mural.dev.java.net/) - what are the benefits of Febrl over that project? Febrl seems to contain many more algorithms but being new to the field of record linkage, I don't know which ones I'm more likely to use. Any comments would be appreciated. Adi |
From: Adi E. <ad...@di...> - 2008-11-11 12:54:19
|
Hi Where can I submit bug reports (and fixes)? Adi |
From: Albert-jan R. <fo...@ya...> - 2008-07-30 07:30:43
|
Hi Dinu,

I also had MemoryErrors before with Febrl 0.3, and I resolved them by dividing the data. One of my blocking variables was date of birth, so I chopped my data into 12 pieces, one per month. Then I wrote a script to put the results back together. In my case it was even a blessing in disguise, because it allowed me to run one session per piece on 12 different computers, slashing the run time from 24h to something like 3h.

HTH,
Albert-Jan

--- On Tue, 7/29/08, Dinu Corbu <d....@gr...> wrote:

> From: Dinu Corbu <d....@gr...>
> Subject: [Febrl-list] Issues with Febrl 4.02
> To: feb...@li...
> Date: Tuesday, July 29, 2008, 11:07 AM
>
> Hi there,
>
> I have been exploring Febrl version 0.4.02 and the record linkage
> literature in preparation for a deduplication project that I have to
> complete soon.
>
> Not everything went well, and I hope somebody can help me with the
> following issues:
>
> 1. The installation appeared to succeed. However, when Febrl starts,
>    the following message appears in the shell window that opens behind
>    the GUI:
>
>    WARNING: root: Cannot import Numeric and PyML modules
>
> 2. When I ran a deduplication project on a dataset of 1,069,472
>    records, it progressed to 7% and then Febrl stopped responding; the
>    shell behind the GUI showed the message "MemoryError".
>
> 3. When I ran a deduplication project on a dataset of 130,213 records,
>    it progressed to 61% and then gave the message "MemoryError".
>
> 4. With the 10,000-record data set "dataset_A_10000.csv" from the
>    installation folder "C:\Febrl4\febrl-0.4.02\data\dedup-dsgen",
>    everything worked. However, I checked the results of several
>    comparison functions (among them Jaro and Winkler) and found that
>    Febrl gave values that were lower than the ones I computed.
>
> I would be grateful to anyone who can give me some advice on how these
> issues can be solved.
>
> Regards,
>
> Dinu
>
> ---------------------------------------------------------------------
> Dinu Corbu
> Senior Research Assistant
>
> Ph. +61 (07) 373 55600
> Fax +61 (07) 373 56812
>
> Key Centre for Ethics, Law, Justice and Governance
> Griffith University
> Mt Gravatt campus
> Messines Ridge Road, Mt Gravatt, QLD, 4122, Australia
> ---------------------------------------------------------------------
> _______________________________________________
> Febrl-list mailing list
> Feb...@li...
> https://lists.sourceforge.net/lists/listinfo/febrl-list |
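The divide-and-recombine workaround described in this reply can be sketched roughly as below. This is only an illustration, not Albert-Jan's actual script: the field name `date_of_birth`, the `YYYY-MM-DD` date layout, and the `chunk_` file prefix are all assumptions, and the code targets Python 3 rather than the Python 2 that Febrl itself runs on. Splitting like this is only safe when the split key is also a blocking key, because records that land in different pieces are never compared with each other.

```python
import csv
from collections import defaultdict


def split_by_month(rows, dob_field="date_of_birth"):
    """Group records by birth month (assumes YYYY-MM-DD dates).

    `dob_field` is a hypothetical column name; adjust it to the data set.
    Records with a missing or malformed date go into an "unknown" group
    so that nothing is silently dropped.
    """
    groups = defaultdict(list)
    for row in rows:
        parts = row.get(dob_field, "").split("-")
        if len(parts) == 3 and parts[1].isdigit():
            month = parts[1]
        else:
            month = "unknown"
        groups[month].append(row)
    return dict(groups)


def write_chunks(groups, fieldnames, prefix="chunk_"):
    """Write one CSV file per group and return the file names."""
    paths = []
    for month, rows in sorted(groups.items()):
        path = "%s%s.csv" % (prefix, month)
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths


def merge_results(result_lists):
    """Concatenate the per-chunk result lists into one list."""
    merged = []
    for results in result_lists:
        merged.extend(results)
    return merged
```

Each chunk file can then be fed to a separate Febrl session, on the same machine one after another or on several machines in parallel, and the per-chunk outputs combined with something like `merge_results`.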
From: Dinu C. <d....@gr...> - 2008-07-29 09:07:39
|
Hi there,

I have been exploring Febrl version 0.4.02 and the record linkage literature in preparation for a deduplication project that I have to complete soon.

Not everything went well, and I hope somebody can help me with the following issues:

1. The installation appeared to succeed. However, when Febrl starts, the following message appears in the shell window that opens behind the GUI:

   WARNING: root: Cannot import Numeric and PyML modules

2. When I ran a deduplication project on a dataset of 1,069,472 records, it progressed to 7% and then Febrl stopped responding; the shell behind the GUI showed the message "MemoryError".

3. When I ran a deduplication project on a dataset of 130,213 records, it progressed to 61% and then gave the message "MemoryError".

4. With the 10,000-record data set "dataset_A_10000.csv" from the installation folder "C:\Febrl4\febrl-0.4.02\data\dedup-dsgen", everything worked. However, I checked the results of several comparison functions (among them Jaro and Winkler) and found that Febrl gave values that were lower than the ones I computed.

I would be grateful to anyone who can give me some advice on how these issues can be solved.

Regards,

Dinu

---------------------------------------------------------------------
Dinu Corbu
Senior Research Assistant

Ph. +61 (07) 373 55600
Fax +61 (07) 373 56812

Key Centre for Ethics, Law, Justice and Governance
Griffith University
Mt Gravatt campus
Messines Ridge Road, Mt Gravatt, QLD, 4122, Australia
--------------------------------------------------------------------- |
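For the comparison-function check in item 4, it may help to have a small reference implementation to compare against. The sketch below follows the commonly cited textbook definitions of the Jaro and Jaro-Winkler similarities; it is not taken from the Febrl source, and Febrl may well use a different variant (a different match window, prefix scaling, or one of Winkler's additional adjustments), which by itself could explain differing values. The parameter names `prefix_scale` and `max_prefix` are chosen here for illustration, and the code assumes Python 3.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1], textbook definition."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count the
    # positions where they disagree; two disagreements make one
    # transposition.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (Winkler's adjustment)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

When comparing against Febrl's output, it is worth checking whether both sides apply the same case folding and whitespace handling before the comparison, since that changes the result as much as the formula does.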
From: Ivan H. <Iva...@an...> - 2008-07-18 05:22:42
|
Hi Ana,

Unfortunately that didn't work. I unzipped the FEBRL folders to a local drive I have full access to and ran guiFebrl.py as user. The *.pyc files were created, but the GUI did not open.

My aim was to use FEBRL as a tool for standardising datasets for a multi-institution collaboration studying climate effects on health. Members of this collaboration are from other universities, mostly use Windows XP with restricted user profiles like mine, and have limited Python programming skills (so using FEBRL without the GUI is not really an option for them).

If the FEBRL GUI cannot be run on their machines, I will have to approach the problem differently. I would appreciate any advice list members have on alternatives. I am currently using R functions for string comparisons and linking datasets within a relational database and GIS framework, which works fine for me here but is not something I can easily set up at the partner institutions.

Thanks again for your response!

Ivan Hanigan
Data Management Officer
National Centre for Epidemiology and Population Health
Australian National University
Canberra, ACT, 0200
Ph: +61 2 6125 7767
Fax: +61 2 6125 0740
Mob: 0424 472 334
CRICOS provider #00120C

-----Original Message-----
From: feb...@li... [mailto:feb...@li...] On Behalf Of Ana Guerrero
Sent: Sunday, 13 July 2008 10:58 PM
To: Ivan Hanigan
Cc: feb...@li...
Subject: Re: [Febrl-list] FEBRL GUI won't open for user,but will for admin on windows machine

On Fri, Jul 11, 2008 at 03:25:08PM +1000, Ivan Hanigan wrote:
> Dear list,
> I work at a university and our IT support has been unsuccessful trying
> to get the FEBRL GUI (0.4.02) to open on my windows machine running XP.
>
>
> The GUI opens fine when the administrator is logged on but not when I
> log on. My IT support thinks this might be due to the restrictions of
> group policy. If so we probably need to know where the program is
> accessing files while it is running.
>
> Python and PyGTK are in a local folder where I have write access and
> python imports the pygtk module manually OK, however whatever we have
> tried the GUI only opens under the administrator profile.
>
> Can anyone help?
>

Sounds like you did the first run of FEBRL as admin; it created the *.pyc files read-only for admin, and they cannot be read by the users now.

I suggest you uncompress the febrl folder as user, then try running it.

Hope this helps,
Ana |
From: Ana G. <an...@de...> - 2008-07-13 12:48:37
|
On Fri, Jul 11, 2008 at 03:25:08PM +1000, Ivan Hanigan wrote:
> Dear list,
> I work at a university and our IT support has been unsuccessful trying
> to get the FEBRL GUI (0.4.02) to open on my windows machine running XP.
>
>
> The GUI opens fine when the administrator is logged on but not when I
> log on. My IT support thinks this might be due to the restrictions of
> group policy. If so we probably need to know where the program is
> accessing files while it is running.
>
> Python and PyGTK are in a local folder where I have write access and
> python imports the pygtk module manually OK, however whatever we have
> tried the GUI only opens under the administrator profile.
>
> Can anyone help?
>

Sounds like you did the first run of FEBRL as admin; it created the *.pyc files read-only for admin, and they cannot be read by the users now.

I suggest you uncompress the febrl folder as user, then try running it.

Hope this helps,
Ana |
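One way to test the stale-bytecode theory above before unpacking everything again is to check, while logged in as the restricted user, whether every `.pyc` file under the Febrl folder can actually be opened. The sketch below is a generic diagnostic written for this thread, not part of Febrl; the function name `scan_pyc` is made up, and the code assumes Python 3 (the idea carries over to Python 2). It opens each file rather than relying on `os.access`, since a real open attempt reflects whatever permission rules the operating system enforces.

```python
import os
import sys


def scan_pyc(folder):
    """Return (all_pyc_files, unreadable_pyc_files) found under `folder`."""
    found = []
    unreadable = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.endswith(".pyc"):
                continue
            path = os.path.join(root, name)
            found.append(path)
            try:
                # Actually try to read a byte; permission problems show
                # up here as an exception.
                with open(path, "rb") as handle:
                    handle.read(1)
            except (IOError, OSError):
                unreadable.append(path)
    return sorted(found), sorted(unreadable)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_pyc.py <path to the unpacked Febrl folder>
    all_pyc, bad = scan_pyc(sys.argv[1])
    print("%d .pyc files found, %d unreadable" % (len(all_pyc), len(bad)))
    for path in bad:
        print("cannot read: " + path)
```

If the script reports no unreadable files and the GUI still fails to open, the bytecode files are probably not the cause, and starting guiFebrl.py from a command prompt to capture any error output would be the next thing to try.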