Is it possible for Febrl to deduplicate dataset records that have a varying number of attributes? For example, any given company record may have:
- a set of phone numbers (one or many)
- a set of company names the company goes by
- a set of postal addresses
There is a new phone number standardisation module in Febrl-0.3.
If you have more than one phone number, business name, or address, then it becomes a bit tricky. You would most likely have to modify all the look-up tables provided and retrain the HMMs, and also modify the standardisation modules, as they currently assume there is only one name and one address (made of various components, though), etc.
Alternatively, you could create 'duplicate records' by splitting an original record so that you get a set of records with only one name etc. each (and with a record identifier which allows you to re-identify the original record).
I hope this helps a bit.
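The splitting idea could be sketched roughly as follows. This is plain Python and does not use any Febrl module; the function name `split_record` and the field names (`rec_id`, `orig_id`, `names`, `phones`, `addresses`) are made up for illustration, and taking every combination of values is just one possible way to split.

```python
from itertools import product

def split_record(record, multi_fields, id_field="rec_id"):
    """Expand one multi-valued record into several single-valued records.

    Hypothetical helper, not part of Febrl. Every output record keeps the
    original identifier in 'orig_id' so it can be traced back later.
    """
    # An empty or missing field contributes a single empty value,
    # so the record is not dropped by the product below.
    value_lists = [record.get(f) or [""] for f in multi_fields]
    fixed = {k: v for k, v in record.items() if k not in multi_fields}
    out = []
    for n, combo in enumerate(product(*value_lists)):
        row = dict(fixed)
        row["orig_id"] = record[id_field]                    # link to the source record
        row[id_field] = "%s-%d" % (record[id_field], n)      # new unique identifier
        row.update(zip(multi_fields, combo))                 # one value per field
        out.append(row)
    return out

# Made-up example record
company = {
    "rec_id": "c1",
    "names": ["Acme Pty Ltd", "Acme"],
    "phones": ["02 6125 0000"],
    "addresses": ["1 Main St", "PO Box 5"],
}
rows = split_record(company, ["names", "phones", "addresses"])
```

Note that the number of split records is the product of the number of values in each field, so it can grow quickly. Matches between split records that share the same `orig_id` would need to be ignored after linkage, since they come from the same original record.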
What parameters do you use to standardise company names? I've got a look-up table from my own attempt to do this through ETL. The names/phones/addresses options seem quite prescriptive.
I realise that I'll need to retrain the HMM.