I joined Ask.com a few months ago, and some of the first things I inherited were various offline data mining and correlation tasks for some of our search verticals (businesses, events, movies, etc.). Like most players in the space, we aggregate datasets from various sources (e.g. web crawls and other proprietary sources) and are tasked with making sense of them. One of the first things we have to do is figure out when various records refer to the same entity (e.g. using business names and addresses). It's amazing how 'dirty' the data can be: addresses in dramatically different formats, matching businesses whose names have almost nothing in common, and the converse, independent entities whose names share a lot (e.g. businesses with a hotel, "mark adam's hotel" and "mark adams florist").
Anyway, the specialized search space is fairly new here, and we haven't yet ramped up the engineering or research resources to really attack this with the same vigor we've put into regular search. Given pressing deadlines and my own lack of background in this space (I was much more systems focused, at Berkeley and before), your papers and the SecondString source library were remarkably helpful in bringing me up to speed and allowing me to come up with a fairly reasonable first pass at solving some of these problems. I ended up porting a large chunk of the library to C# (if you have any interest in it, feel free to let me know), and I've been using some combination of the SoftTFIDF algorithm class directly and the SoftTFIDFDictionary wrapper for many of the problems. They've done a much better job than the more naive algorithms the team had been using: correlation rates range from roughly 60% to 90% of the smaller dataset in each pairing (rough estimates, depending on the dataset), with a fairly low false positive rate above a reasonable score threshold. I'm currently scoring name and address matches separately, but I'm planning on experimenting with the multistring wrapper to see if I can gain any benefit by dealing with the fields in combination (beyond just averaging the paired scores, which is essentially what I'm already doing).
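For anyone who wants to see the shape of the idea, here is a minimal Python sketch of soft TF-IDF scoring plus averaging of per-field scores. It is not SecondString's code or the C# port: the class and function names (`SoftTfIdfSketch`, `record_score`), the smoothed IDF formula, the 0.85 token threshold, and the sample records are all assumptions made for illustration, and `difflib`'s ratio stands in for the Jaro-Winkler token similarity that soft TF-IDF is usually paired with.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    # Lowercase and split on whitespace; real data would need more cleanup.
    return text.lower().split()


def token_sim(a, b):
    # Stand-in for Jaro-Winkler (an assumption): difflib's ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


class SoftTfIdfSketch:
    """Hypothetical, simplified soft TF-IDF scorer for one field."""

    def __init__(self, corpus, theta=0.85):
        self.theta = theta  # minimum token similarity to count as "close"
        self.n_docs = len(corpus)
        self.doc_freq = Counter()
        for text in corpus:
            self.doc_freq.update(set(tokenize(text)))

    def _weights(self, tokens):
        # Unit-length TF-IDF vector; the IDF is smoothed so it stays positive.
        raw = {}
        for tok, tf in Counter(tokens).items():
            df = self.doc_freq.get(tok, 0)
            idf = math.log((self.n_docs + 1) / (df + 1)) + 1.0
            raw[tok] = math.log(tf + 1) * idf
        norm = math.sqrt(sum(w * w for w in raw.values()))
        return {tok: w / norm for tok, w in raw.items()} if norm else {}

    def score(self, s, t):
        ws, wt = self._weights(tokenize(s)), self._weights(tokenize(t))
        total = 0.0
        for w, weight_s in ws.items():
            # Find the token in t that is closest to w.
            best_tok, best_sim = None, 0.0
            for v in wt:
                sim = token_sim(w, v)
                if sim > best_sim:
                    best_tok, best_sim = v, sim
            # Only sufficiently close token pairs contribute to the score.
            if best_tok is not None and best_sim >= self.theta:
                total += weight_s * wt[best_tok] * best_sim
        return total


def record_score(name_scorer, addr_scorer, a, b):
    # Plain average of the two per-field scores.
    name = name_scorer.score(a["name"], b["name"])
    addr = addr_scorer.score(a["address"], b["address"])
    return (name + addr) / 2.0


# Made-up records, for illustration only.
records = [
    {"name": "mark adam's hotel", "address": "12 main st"},
    {"name": "mark adams florist", "address": "12 main street"},
    {"name": "blue lagoon cafe", "address": "900 ocean ave"},
]
name_scorer = SoftTfIdfSketch([r["name"] for r in records])
addr_scorer = SoftTfIdfSketch([r["address"] for r in records])
print(record_score(name_scorer, addr_scorer, records[0], records[1]))
```

A pair would then be accepted as a match only when its score clears a threshold tuned on labeled examples. Note that this sketch is not symmetric in its two arguments and does not cap scores at 1 when several tokens map to the same closest token.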
But I'm rambling a bit. I noticed the forum here was fairly quiet, and I just thought I'd extend a personal thank you to those of you who have worked on the project. When I was investigating this problem, I found a wealth of research papers on the topic... but almost nothing 'tangible'. Your unified algorithm comparisons and library were fabulously helpful and, god knows, I was not looking forward to having to implement a handful of the algorithms from theory and start from scratch on the evaluation. Thank you. =)
Can anyone tell me whether SecondString supports Russian characters or not? Please help me out. Any help will be highly appreciated.