Menu

#28 Should be using Levenshtein Edit Distance

open
nobody
None
5
2009-10-01
2009-10-01
Anonymous
No

This is from bug 519214 for Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=519214

The spell checker suggestions are seriously lacking. There are many, many
occasions where I try to get a correction to a misspelled word, and get no or
very few results. So, I'm forced to look up the misspelling in Google. (Oddly
enough, it correctly identifies what I'm talking about 90% of the time.)

I think a major part of this is that Levenshtein Edit Distance (LED) is not
implemented as part of the search suggestions. LED determines the amount of
letter changes required to change one word to another. If the LED is small
(say, less than 5), then chances are that the word is a good suggestion to the
user. For example, here are some example misspellings that fail Firefox, but
have a low LED:

testimont -> testament
desimination -> dissemination
(There are others, but I would have to mark them as I find them.)

LED is described in Wikipedia here:
http://en.wikipedia.org/wiki/Levenshtein_distance. The algorithm is simple
enough to implement, if given a large enough sample to run through some words.
Perhaps a looser version of the suggestion engine which is tightened up with
LED?

Discussion

  • Németh László

    There is a better solution for this problem. Hunspell has got the phonetic suggestion feature borrowed from Aspell. Unfortunately, it hasn't been activated in the English dictionary yet (see PHONE feature in Hunspell manual).

     
  • Németh László

    I have made the extended en_CA, en_US, en_GB dictionaries. You can find them in the dict-en-fixed OpenOffice.org oxt package (a zip file): http://extensions.services.openoffice.org/hu/project/dict-en-fixed

    Test: desimination -> discrimination, recrimination, elimination, denomination, dissemination, insemination
    The testament suggestion is missing yet, but I will modify the suggestion limit for this case, too.

     
  • Ruud Baars

    Ruud Baars - 2010-05-11

    Levehnstein difference is a good addition, especially for suggestions suggested byj compounding, but since some REPs for Dutch are quite long (buro=> bureau), not just the difference, but also the total length of the word should be used in the applicable results filter.
    In a different issue report, I suggested to use different thresholds for different word length ranges, controlled by paramters in the affix file.

     
MongoDB Logo MongoDB