Is it possible for Febrl to deduplicate dataset records that have a varying number of attributes? For example, any given company record may have:
- a set of phone numbers (one or many)
- a set of company names the company goes by
- a set of postal addresses
kf
There is a new phone number standardisation module in Febrl-0.3.
If you have more than one phone number, business name, or address, then it becomes a bit tricky. You would most likely have to modify all the look-up tables provided and retrain the HMMs, and also modify the standardisation modules, as they currently assume there is only one name and one address per record (each made up of various components, though).
Alternatively, you could create 'duplicate records' by splitting an original record so that you get a set of records with only one name, one phone number and one address each, plus a record identifier which allows you to re-identify the original record.
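A minimal sketch of this splitting idea in plain Python, not using any Febrl API. The function name `split_record`, the field names and the `orig_id` key are all made up for illustration; it assumes each multi-valued attribute is held as a list and emits one record per combination of values.

```python
from itertools import product

def split_record(rec_id, record, multi_fields):
    """Expand one record into several, each with a single value per multi-valued field.

    Hypothetical helper for illustration only; not part of Febrl.
    """
    # Use [None] for an empty or missing field so the record is not dropped.
    value_lists = [record.get(f) or [None] for f in multi_fields]
    # Single-valued attributes are copied unchanged into every split record.
    fixed = {k: v for k, v in record.items() if k not in multi_fields}
    rows = []
    for combo in product(*value_lists):
        row = dict(fixed)
        row["orig_id"] = rec_id  # lets you re-identify the original record
        row.update(zip(multi_fields, combo))
        rows.append(row)
    return rows

company = {
    "name": ["Acme Pty Ltd", "Acme"],
    "phone": ["02 1234 5678"],
    "address": ["1 Main St", "PO Box 9"],
}
rows = split_record("c42", company, ["name", "phone", "address"])
for row in rows:
    print(row)
```

The number of split records is the product of the list lengths, so it can grow quickly for records with many names and addresses. After deduplication, matches between split records that share the same `orig_id` can be ignored, and the remaining matches mapped back to the original records through that identifier.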
I hope this helps a bit..
Cheers,
Peter
What parameters do you use to standardise company names? I've got a look-up table from my own attempt to do this through ETL. The names/phones/addresses options seem quite prescriptive.
I realise that I'll need to retrain the HMM.