Matching a substring to people names

DKroot
2009-01-28
2013-04-25
  • DKroot
    DKroot
    2009-01-28

    I have a task of a fuzzy search for a substring against the list of full names, e.g. "John B. Doe". I found this article: http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf, but its focus seems to be on matching name to name. Any recommendation or experience with search for names?

     
    • ReverendSam
      ReverendSam
      2009-01-29

      In which case it sounds like you need a fuzzy indexing technology (not a matching one), where an index is built that takes into account fuzzy indexing.

      Q-Gram indexing is a technique i think will do the trick (explained on my website) - the tokenisers if wanted can be taken from the SImMetric library and indexing can be done using lucene. (FYI this technique is, I believe, used in the fuzzy name search of IMDB.com try it out to see if this is what you want, enter "Wilam Shatnerr" and see what it gives you).

      Hope this helps,

      Sam Chapman.
      Ditector
      http://www.k-now.co.uk

       
      • DKroot
        DKroot
        2009-01-30

        Thanks, Sam, I'll take a look into q-Gram indexing!
        Frankly. I could not find much info at http://www.k-now.co.uk, but Googling yields "A practical q-gram index for text retrieval allowing errors" article as the first hit. I'll read it and try to understand, and potentially - implement.
        IMDB search does work in approximately the way I need.

         
        • ReverendSam
          ReverendSam
          2009-01-30

          sorry I intended to leave my uni weblink. the exact link to the exact data you need is here

          http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#qgram

          but I'm sure you can find better external elsewhere online.

          Sam Chapman
          Director
          K-Now http://k-now.co.uk/

           
          • DKroot
            DKroot
            2009-02-03

            Thanks a lot, Sam! We'll try to use Q-Grams. Once we get closer to implementation, I'll post feedback here on how well it is working on our data.

             
  • martin
    martin
    2010-08-07

    Sorry to activate an old thread, Sam, but I don't understand what you're suggesting here.  Do you know of any open source projects that use Lucene to index q-grams?

    So you would use a custom Lucene Analyzer (tokenizer) to convert each document into q-grams.  For example, (1,##s), (2,#sa), (3,sam), (4,am ), (5,m c), (6, ch), (7,cha), (8,hap), (9,apm), (10,pma), (11,man), (12,an%), (13,n%%).  Then you would search these in Lucene with a "sloppy" PhraseQuery?  Or with a Fuzzy (Levenshtein) query?

    I don't see how Lucene could match "sam 7 chapman" to this index unless you used a Fuzzy query.  But if you're eventually using Levenshtein anyway, why use q-grams at all?

    Thanks for the terrific software, and thanks for any guidance you can give me.  I am just starting with Lucene and may be misunderstanding it.

    Martin

     
  • ReverendSam
    ReverendSam
    2010-08-10

    what i mean is - index the qGrams as you say but also use the same tokeniser on the input to break the input also check against the index (which is now finite in size also) and simply count matches/possible matches.

    This technique is/was employed in IMDB.com try a query spelled wrong for an actors/ess name and see what it returns.