I have a task of a fuzzy search for a substring against the list of full names, e.g. "John B. Doe". I found this article: http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf, but its focus seems to be on matching name to name. Any recommendation or experience with search for names?
In which case it sounds like you need a fuzzy indexing technology (not a matching one), where an index is built that takes into account fuzzy indexing.
Q-Gram indexing is a technique i think will do the trick (explained on my website) - the tokenisers if wanted can be taken from the SImMetric library and indexing can be done using lucene. (FYI this technique is, I believe, used in the fuzzy name search of IMDB.com try it out to see if this is what you want, enter "Wilam Shatnerr" and see what it gives you).
Hope this helps,
Thanks, Sam, I'll take a look into q-Gram indexing!
Frankly. I could not find much info at http://www.k-now.co.uk, but Googling yields "A practical q-gram index for text retrieval allowing errors" article as the first hit. I'll read it and try to understand, and potentially - implement.
IMDB search does work in approximately the way I need.
sorry I intended to leave my uni weblink. the exact link to the exact data you need is here
but I'm sure you can find better external elsewhere online.
Thanks a lot, Sam! We'll try to use Q-Grams. Once we get closer to implementation, I'll post feedback here on how well it is working on our data.
Sorry to activate an old thread, Sam, but I don't understand what you're suggesting here. Do you know of any open source projects that use Lucene to index q-grams?
So you would use a custom Lucene Analyzer (tokenizer) to convert each document into q-grams. For example, (1,##s), (2,#sa), (3,sam), (4,am ), (5,m c), (6, ch), (7,cha), (8,hap), (9,apm), (10,pma), (11,man), (12,an%), (13,n%%). Then you would search these in Lucene with a "sloppy" PhraseQuery? Or with a Fuzzy (Levenshtein) query?
I don't see how Lucene could match "sam 7 chapman" to this index unless you used a Fuzzy query. But if you're eventually using Levenshtein anyway, why use q-grams at all?
Thanks for the terrific software, and thanks for any guidance you can give me. I am just starting with Lucene and may be misunderstanding it.
what i mean is - index the qGrams as you say but also use the same tokeniser on the input to break the input also check against the index (which is now finite in size also) and simply count matches/possible matches.
This technique is/was employed in IMDB.com try a query spelled wrong for an actors/ess name and see what it returns.