From: Gaudenz S. <ga...@so...> - 2009-05-13 22:45:03
Attachments:
febrl_standardisation.patch
Name standardisation improvements (febrl_standardisation.patch)
***************************************************************

This adds several improvements to name standardisation that are useful
for standardising names gathered from internet sources (mostly email
headers):

* Keep the case of individual tokens. In email headers the case of the
  tokens often carries significant information.

* Change NameStandardiser to tag names written in all uppercase as SN if
  they were previously tagged as UN, SN, GM or GF. A common convention
  in email "From:" headers is to write the surname in all capitals. My
  test showed that this is a far better indicator than any name list.
  The convention is mainly used by French-speaking and Asian people.

* Add NameNicknameStandardiser as a class derived from NameStandardiser.
  This adds the ability to standardise names into an additional nickname
  component. Nicknames are frequently seen in internet communication and
  are a good indicator for deduplicating records later.

Plus there are some fixes for bugs I noticed during my work:

- Allow datasets with 'readwrite' access
- Correct the logic for removing leading and trailing brackets
- Add a check for empty token lists to __get_name_hmm__

-- 
Ever tried. Ever failed. No matter.
Try again. Fail again. Fail better.
~ Samuel Beckett ~
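
To illustrate the all-uppercase surname rule described above, here is a
minimal sketch in Python. It is not the code from the patch: the function
name retag_uppercase_surnames, the parallel token/tag lists and the
minimum-length check (so that single-letter initials are left alone) are
assumptions made for this example only. The tag names UN, SN, GM and GF
are the ones mentioned in the description.

```python
# Tags that the all-uppercase convention is allowed to override
# (taken from the patch description).
RETAGGABLE = {"UN", "SN", "GM", "GF"}


def retag_uppercase_surnames(tokens, tags):
    """Return a copy of tags in which all-uppercase tokens become 'SN'.

    Hypothetical helper: tokens and tags are parallel lists, and the
    original case of each token is assumed to have been preserved.
    """
    new_tags = []
    for token, tag in zip(tokens, tags):
        # str.isupper() is True only if the token has at least one cased
        # character and every cased character is uppercase.
        # Assumption: tokens of length 1 (initials) are not retagged.
        if len(token) > 1 and token.isupper() and tag in RETAGGABLE:
            new_tags.append("SN")
        else:
            new_tags.append(tag)
    return new_tags


print(retag_uppercase_surnames(["Jean", "DUPONT"], ["GM", "UN"]))
# prints ['GM', 'SN']
```

Note that this only works if token case is kept during cleaning, which is
why the first change in the list matters for the second.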