Matching a substring to people names

  • DKroot

    DKroot - 2009-01-28

    I have a task of a fuzzy search for a substring against the list of full names, e.g. "John B. Doe". I found this article:, but its focus seems to be on matching name to name. Any recommendation or experience with search for names?

    • ReverendSam

      ReverendSam - 2009-01-29

      In which case it sounds like you need a fuzzy indexing technology (not a matching one), where an index is built that takes into account fuzzy indexing.

      Q-Gram indexing is a technique i think will do the trick (explained on my website) - the tokenisers if wanted can be taken from the SImMetric library and indexing can be done using lucene. (FYI this technique is, I believe, used in the fuzzy name search of try it out to see if this is what you want, enter "Wilam Shatnerr" and see what it gives you).

      Hope this helps,

      Sam Chapman.

      • DKroot

        DKroot - 2009-01-30

        Thanks, Sam, I'll take a look into q-Gram indexing!
        Frankly. I could not find much info at, but Googling yields "A practical q-gram index for text retrieval allowing errors" article as the first hit. I'll read it and try to understand, and potentially - implement.
        IMDB search does work in approximately the way I need.

        • ReverendSam

          ReverendSam - 2009-01-30

          sorry I intended to leave my uni weblink. the exact link to the exact data you need is here

          but I'm sure you can find better external elsewhere online.

          Sam Chapman

          • DKroot

            DKroot - 2009-02-03

            Thanks a lot, Sam! We'll try to use Q-Grams. Once we get closer to implementation, I'll post feedback here on how well it is working on our data.

  • martin

    martin - 2010-08-07

    Sorry to activate an old thread, Sam, but I don't understand what you're suggesting here.  Do you know of any open source projects that use Lucene to index q-grams?

    So you would use a custom Lucene Analyzer (tokenizer) to convert each document into q-grams.  For example, (1,##s), (2,#sa), (3,sam), (4,am ), (5,m c), (6, ch), (7,cha), (8,hap), (9,apm), (10,pma), (11,man), (12,an%), (13,n%%).  Then you would search these in Lucene with a "sloppy" PhraseQuery?  Or with a Fuzzy (Levenshtein) query?

    I don't see how Lucene could match "sam 7 chapman" to this index unless you used a Fuzzy query.  But if you're eventually using Levenshtein anyway, why use q-grams at all?

    Thanks for the terrific software, and thanks for any guidance you can give me.  I am just starting with Lucene and may be misunderstanding it.


  • ReverendSam

    ReverendSam - 2010-08-10

    what i mean is - index the qGrams as you say but also use the same tokeniser on the input to break the input also check against the index (which is now finite in size also) and simply count matches/possible matches.

    This technique is/was employed in try a query spelled wrong for an actors/ess name and see what it returns.


Log in to post a comment.