britske - 2008-01-24


I have to deduplicate hotel names and addresses from various sources in order to uniquely identify hotels.

Consider, for example, the following hotel names, which all identify the same hotel:

La Quinta Inn & Suites Houston Bush Intl Airport Hotel
La Quinta Inn and Suites Houston Bush Intl Airport Hotel
Houston Bush Intl Airport Hotel
Hotel Houston Bush International Airport

and possible variations thereof.

Without any prior knowledge of specialized similarity metrics, I'm thinking of handling the following (but I'm open to suggestions!):

- a list of hotel chain names, which are optional
- abbreviations (intl -> international, etc.)
- a list of optional words (hotel, inn, bed & breakfast, motel, etc.)
- the order of some words
- lower / upper casing
- some spelling corrections, for example when dealing with translated Thai / Chinese names
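
To make the list above concrete, here is a minimal sketch in Python of how I imagine the normalization could look. The word lists, the function names `normalize` and `jaccard`, and the choice of Jaccard overlap on token sets are all my own assumptions for illustration, not an established method; spelling correction is not covered at all.

```python
import re

# Hypothetical lookup tables for illustration; real ones would be far longer.
ABBREVIATIONS = {"intl": "international", "&": "and"}
CHAIN_NAMES = ["la quinta"]
OPTIONAL_WORDS = {"hotel", "inn", "motel", "suites", "and"}

def normalize(name):
    """Reduce a hotel name to an order-independent set of significant tokens."""
    # Lower-case and split into word tokens, keeping '&' as its own token.
    tokens = re.findall(r"\w+|&", name.lower())
    # Expand known abbreviations.
    text = " ".join(ABBREVIATIONS.get(t, t) for t in tokens)
    # Drop chain names, matched as whole phrases.
    for chain in CHAIN_NAMES:
        text = re.sub(r"\b" + re.escape(chain) + r"\b", " ", text)
    # Drop optional words; using a set makes word order irrelevant.
    return frozenset(t for t in text.split() if t not in OPTIONAL_WORDS)

def jaccard(a, b):
    """Share of tokens two token sets have in common, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

With these lists, all four example names above reduce to the same token set (houston, bush, international, airport), so their pairwise Jaccard score is 1.0.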

All in all, the similarity score has to be high before two names count as a match, since even within a single city multiple hotels can have exactly the same name, so allowing deviations from the exact name may produce an even bigger set of candidate matches.
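
As a rough sketch of what a "high percentage" cutoff could look like, the snippet below uses `difflib.SequenceMatcher` from the Python standard library. The function name `is_probable_match` and the 0.9 threshold are assumptions on my part; the right cutoff would have to be tuned on real data.

```python
from difflib import SequenceMatcher

def is_probable_match(name_a, name_b, threshold=0.9):
    """Return True when two names are nearly identical, ignoring case.

    The threshold is an assumed value, not a recommended one.
    """
    a = name_a.strip().lower()
    b = name_b.strip().lower()
    # ratio() is 1.0 for identical strings and falls towards 0.0
    # as the strings share fewer matching characters.
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

A strict threshold like this tolerates a single misspelled character in a long name but rejects names that merely share a few words.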

Moreover, the cost of the algorithm is also important, since potentially a lot of comparisons must be made.
That said, pruning the dataset by country / city name (which needs similarity matching as well!) or, better, by latitude / longitude (which I don't always have) is possible.
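
The pruning idea could be sketched as follows: group records by a country/city key and only compare records within the same group. The record layout `(id, country, city)` and the function name `candidate_pairs` are hypothetical, and the key here is an exact match after trimming and lower-casing, so it does not yet handle differently spelled city names.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records):
    """Yield pairs of record ids that share the same country/city key.

    `records` is an iterable of (record_id, country, city) tuples;
    this layout is assumed for illustration.
    """
    blocks = defaultdict(list)
    for rec_id, country, city in records:
        key = (country.strip().lower(), city.strip().lower())
        blocks[key].append(rec_id)
    # Only records inside the same block are ever compared.
    for ids in blocks.values():
        yield from combinations(ids, 2)
```

The number of comparisons then grows with the size of each city's group instead of with the size of the whole dataset.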

Moreover, I need a similarity algorithm for matching international addresses (at street level). I imagine some sort of industry-accepted standard exists for this?

Thanks in advance,