dedupe-news Mailing List for Fuzzy Matching & Deduplication
Status: Pre-Alpha
Brought to you by:
ltickett
| 2006 | Jan | Feb | Mar | Apr | May (2) | Jun (1) | Jul (1) | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
From: Lee T. <lti...@gm...> - 2006-07-03 15:37:51
|
Good afternoon!

Again, thanks for your continued support, and welcome to Project Dedupe's 4th edition!

What's been going on since the last Newsletter;

My time has all been focussed on the PHP dedupe script, which is of course available on Subversion (link in the navigation panel of the wiki). I really feel this is progressing well and would appreciate anyone's input on the code so far. The files loop.php, prepare.php and function.php are all used in the identification of match_groups. elect_master.php and commit_dupe.php are then used to walk through the match_groups, flagging master records and false positives. There are a number of configurables, but some stuff still needs to be progressed. If you've got a dataset you wish to try running the script on but can't fathom how to apply these scripts, please give me a shout!

What's next;

I will continue to focus my attention/time on the PHP scripts, but all of the bits on the list below still need progressing!
Some meat should be added to the Web GUI page(s) on the wiki
Discussion should commence on what toolkit (if any) may be used for developing the Web GUI

I've been very busy this week and slightly sidetracked, so these are still 'to come'...
Hopefully v0.1 of the file in/out function for Visual Basic will be posted to Subversion
Hopefully v0.1 of the country extraction function for Visual Basic will be posted to Subversion
If you haven't already, please register on the wiki and write a few lines about yourself on your user page (http://tickett.net/dedupe/index.php?title=User:username&action=edit replacing username with your username). If you could, include where you're from (hopefully having members from different countries will enable us to develop a smarter application), what interests you about this project and what you may be able to contribute.

Some stats;

There are now 9 users registered on the wiki (welcome to the 2 new users since the last newsletter!) and 3 users registered as developers on SourceForge (if you need me to register you as a developer on SourceForge so you can post to Subversion etc., please drop me an e-mail)
The main page has seen over 2,900 hits since launch (it still seems like a lot of people are hitting the site, possibly even bookmarking it, but aren't yet drawn in enough to register/contribute)
The most recently edited pages are;
http://tickett.net/dedupe/index.php/Algorithms
http://tickett.net/dedupe/index.php/Programming_languages
http://tickett.net/dedupe/index.php/Existing_software

This, and all previously archived newsletters, can be viewed at http://sourceforge.net/mailarchive/forum.php?forum_id=48618

Thanks
Lee (ltickett)
Project Dedupe
http://dedupe.sourceforge.net
http://sourceforge.net/projects/dedupe |
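The walk through match groups described in this newsletter could look roughly like the sketch below. It is written in Python rather than the project's PHP, and it is only an illustration: the `Record` fields, the "most complete record wins" rule and the `flag_false_positives` helper are all assumptions, not taken from elect_master.php or commit_dupe.php.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical record layout; the real scripts' columns are not known.
    id: int
    name: str
    email: str = ""
    phone: str = ""
    is_master: bool = False
    false_positive: bool = False


def completeness(rec):
    """Count the non-empty fields; an assumed measure of record quality."""
    return sum(bool(v) for v in (rec.name, rec.email, rec.phone))


def elect_master(group):
    """Flag one record in a match group as master.

    Assumed rule: the most complete record wins, ties go to the lowest id.
    """
    master = max(group, key=lambda r: (completeness(r), -r.id))
    for rec in group:
        rec.is_master = rec is master
    return master


def flag_false_positives(group, rejected_ids):
    """Mark records that a reviewer decided do not belong in the group."""
    for rec in group:
        if rec.id in rejected_ids and not rec.is_master:
            rec.false_positive = True


group = [
    Record(1, "Acme"),
    Record(2, "Acme Ltd", "info@acme.example", "0123"),
    Record(3, "Acme", "info@acme.example"),
]
master = elect_master(group)
flag_false_positives(group, {3})
```

Keeping the master election separate from the false-positive flagging mirrors the two-script split mentioned above, so either rule can be changed without touching the other.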
|
From: Lee T. <lti...@gm...> - 2006-06-13 20:48:47
|
Good afternoon!

Again, thanks for your continued support, and welcome to Project Dedupe's 3rd edition! Still going to stick to the template...

What's been going on since the last Newsletter;

A few users have shown an interest in developing the http://tickett.net/dedupe/index.php/Web_GUI; this page hasn't got any meat/structure to it yet, so please feel free to add your $0.02
I've been tackling a new challenge to identify duplicate companies based on a fairly specific set of criteria; the approach has been to generate a key based on a primitive phonetic algorithm and some 'home-brew' regular expressions. I've managed to build a prototype using MySQL and PHP which will hopefully be available as an online demo soon (and the source code on Subversion of course!)

What's next;

As previously mentioned, the code and an online demo from my investigations this week will hopefully be posted
Some meat should be added to the Web GUI page(s) on the wiki
Discussion should commence on what toolkit (if any) may be used for developing the Web GUI

I've been very busy this week and slightly sidetracked, so these are still 'to come'...
Hopefully v0.1 of the file in/out function for Visual Basic will be posted to Subversion
Hopefully v0.1 of the country extraction function for Visual Basic will be posted to Subversion
If you haven't already, please register on the wiki and write a few lines about yourself on your user page (http://tickett.net/dedupe/index.php?title=User:username&action=edit replacing username with your username). If you could, include where you're from (hopefully having members from different countries will enable us to develop a smarter application), what interests you about this project and what you may be able to contribute.

Some stats;

There are now 7 users registered on the wiki (welcome to the 3 new users since the last newsletter!) and 3 users registered as developers on SourceForge (if you need me to register you as a developer on SourceForge so you can post to Subversion etc., please drop me an e-mail)
The main page has seen over 2,200 hits since launch (it still seems like a lot of people are hitting the site, possibly even bookmarking it, but aren't yet drawn in enough to register/contribute)
The most recently edited pages are;
http://tickett.net/dedupe/index.php/Web_GUI
http://tickett.net/dedupe/index.php?title=User:Macvijay1985
http://tickett.net/dedupe/index.php/Main_Page
http://tickett.net/dedupe/index.php/Programming_languages
http://tickett.net/dedupe/index.php/Talk:Match_Levels

This, and all previously archived newsletters, can be viewed at http://sourceforge.net/mailarchive/forum.php?forum_id=48618

Thanks
Lee (ltickett)
Project Dedupe
http://dedupe.sourceforge.net
http://sourceforge.net/projects/dedupe |
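To give a feel for the key-based approach to duplicate companies mentioned in this newsletter, here is a minimal sketch in Python (not the MySQL/PHP prototype itself, whose rules are not published here). The suffix list, the phonetic folding rules and the function names `company_key` and `group_by_key` are all assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form words to ignore; not taken from the project.
SUFFIXES = re.compile(
    r"\b(ltd|limited|plc|inc|incorporated|llc|co|corp|corporation|company)\b"
)


def company_key(name):
    """Build a crude match key from a company name."""
    s = name.lower().replace("&", " and ")
    s = re.sub(r"[^a-z0-9 ]", "", s)             # drop punctuation
    s = SUFFIXES.sub(" ", s)                     # drop legal-form words
    s = re.sub(r"\b(the|and)\b", " ", s)         # drop filler words
    s = s.replace("ph", "f").replace("ck", "k")  # primitive phonetic folding
    s = re.sub(r"\s+", "", s)                    # remove all spaces
    if not s:
        return ""
    # Keep the first character, drop later vowels, collapse repeated letters.
    key = s[0] + re.sub(r"[aeiouy]", "", s[1:])
    return re.sub(r"(.)\1+", r"\1", key)


def group_by_key(names):
    """Return only the keys shared by more than one name (candidate duplicates)."""
    groups = defaultdict(list)
    for name in names:
        groups[company_key(name)].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}


candidates = group_by_key(
    ["Acme Widgets Ltd.", "ACME Widgets Limited", "Smith & Sons", "Unrelated Trading"]
)
print(candidates)
```

A key this aggressive will merge some names that are not really the same company, so the groups it produces are best treated as candidates for review rather than confirmed duplicates.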
|
From: Lee T. <lti...@gm...> - 2006-05-30 19:56:03
|
Good afternoon!

Again, thanks for your continued support, and welcome to Project Dedupe's 2nd edition! Let's see if I can stick to the template...

What's been going on since the last Newsletter;

A little research on Microsoft's attempt at fuzzy grouping in SQL Server 2005 - http://tickett.net/dedupe/index.php/Existing_software
A draft Visual Basic module is available on Subversion for cleaning up contact numbers (phone, fax etc.) - http://svn.sourceforge.net/viewcvs.cgi/dedupe/vb_source
Work has begun on a method of dealing with data in Visual Basic (reading input files, working with those files and then writing output files)
Work has begun on extracting the country from "dirty data" in Visual Basic

What's next;

Hopefully v0.1 of the file in/out function for Visual Basic will be posted to Subversion
Hopefully v0.1 of the country extraction function for Visual Basic will be posted to Subversion
If you haven't already, please register on the wiki and write a few lines about yourself on your user page (http://tickett.net/dedupe/index.php?title=User:username&action=edit replacing username with your username). If you could, include where you're from (hopefully having members from different countries will enable us to develop a smarter application), what interests you about this project and what you may be able to contribute.

Some stats;

There are still only 4 users registered on the wiki (including me!) and 3 users registered as developers on SourceForge (again, including me!)
The main page has seen over 1,600 hits since launch (it seems like a lot of people are hitting the site, possibly even bookmarking it, but aren't yet drawn in enough to register/contribute)
The most recently edited pages are;
http://tickett.net/dedupe/index.php/Existing_software
http://tickett.net/dedupe/index.php/Data_I/O
http://tickett.net/dedupe/index.php/Programming_languages
http://tickett.net/dedupe/index.php/Talk:Match_Levels

This, and all previously archived newsletters, can be viewed at http://sourceforge.net/mailarchive/forum.php?forum_id=48618

Thanks
Lee (ltickett)
Project Dedupe
http://dedupe.sourceforge.net
http://sourceforge.net/projects/dedupe |
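For anyone wondering what cleaning up contact numbers might involve, the sketch below shows one possible set of rules. It is written in Python rather than Visual Basic and is not the module on Subversion; the function name `clean_number`, the default country code of 44 and the handling of extensions are all assumptions chosen for the example.

```python
import re


def clean_number(raw, default_country_code="44"):
    """Normalise a phone or fax number to '+<digits>', or '' if nothing usable.

    Assumed rules: strip a trailing extension, drop the optional '(0)',
    treat a leading '00' as an international prefix and a leading '0'
    as a national number in the default country.
    """
    s = raw.strip()
    # Drop a trailing extension such as 'ext. 12' or 'x12' that follows a digit.
    s = re.sub(r"(?<=\d)\s*(?:ext|x)\.?\s*\d+$", "", s, flags=re.IGNORECASE)
    s = s.replace("(0)", "")
    has_plus = s.startswith("+")
    digits = re.sub(r"\D", "", s)  # keep digits only
    if not digits:
        return ""
    if has_plus:
        return "+" + digits
    if digits.startswith("00"):
        return "+" + digits[2:]
    if digits.startswith("0"):
        return "+" + default_country_code + digits[1:]
    return "+" + default_country_code + digits


for sample in ["+44 (0)20 7946 0018", "020 7946 0018", "0044 20-7946-0018 ext. 12", "n/a"]:
    print(repr(sample), "->", repr(clean_number(sample)))
```

Reducing every number to a single canonical form first makes later comparisons a plain string match, which is usually cheaper than fuzzy matching on the raw values.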
|
From: Lee T. <lti...@gm...> - 2006-05-21 19:31:46
|
Good afternoon!

Firstly, thanks for your support, and welcome to Project Dedupe's 1st edition! I've tried to come up with a simple template (which I will try and stick to) for the relatively regular project updates;

What's been going on since the last Newsletter;

Well, as this is the first installment, a lot (I've mentioned just a few points).
The project is now live (registered on SourceForge and the wiki is up and running).
A handful of Visual Basic functions have been drafted for future use.

What's next;

As most of you already know... this still remains a question. I'm still gathering information and carrying out as much research as possible (populating the wiki etc.). I hope some decisions can be made amongst the contributors/developers within the coming weeks/months.
If you haven't already, please register on the wiki and write a few lines about yourself on your user page (http://tickett.net/dedupe/index.php?title=User:username&action=edit replacing username with your username). If you could, include where you're from, what interests you about this project and what you may be able to contribute.

Some stats;

There are currently just 4 users registered on the wiki (including me!) and 3 users registered as developers on SourceForge (again, including me!)
The main page has seen over 1,300 hits since launch (it seems like a lot of people are hitting the site, possibly even bookmarking it, but aren't yet drawn in enough to register/contribute)
The most recently edited pages are;
http://tickett.net/dedupe/index.php/Existing_software
http://tickett.net/dedupe/index.php/Talk:Main_Page
http://tickett.net/dedupe/index.php/Talk:Match_Levels

Thanks
Lee (ltickett)
Project Dedupe
http://dedupe.sourceforge.net
http://sourceforge.net/projects/dedupe |