#14 Display Similar Entries

Released
closed
5
2005-08-29
2005-05-12
Sebastian M.
No

In large property files (>1000 entries) it is often the
case that there are duplicate or similar entries.

It would be nice if, when you enter a value for a specific
language key, the language file were searched for
similar entries and a hint were displayed somewhere that
"a similar entry with the key XY exists", or, if there
is more than one similar entry, a list of similar
entries were displayed.

That way duplicate entries could be avoided.

Discussion

    • milestone: 450941 --> 479430
     
  • Logged In: YES
    user_id=1012642

    Good news: something similar is already on my personal TODO
    list, as it was recommended by email from another user. It
    probably won't be in the next release (0.6.0), but I
    definitely intend to implement something like this.
    "Similar" entries might be difficult to implement, so you'll
    probably see "same" entries implemented first.

    The way I see this, I would also like to check for identical
    entries across multiple files (for the same key). This
    would help catch cut-and-paste text that we forgot to
    translate. The only problem is with those occasional
    entries that are meant to be the same in two or more
    languages. It is still unclear to me how I would handle
    those.

    Maybe the two should be considered separate features
    (duplicates within the same file vs. duplicates across multiple files).
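
    The two checks described above could be sketched roughly as follows. This is only an illustration in Python, not the editor's actual code: the function names `find_duplicate_values` and `same_across_files`, the use of plain dictionaries for the parsed property files, and the whitespace trimming are all assumptions.

```python
from collections import defaultdict


def find_duplicate_values(entries):
    """Group keys of ONE file that share the same value (hypothetical helper)."""
    keys_by_value = defaultdict(list)
    for key, value in entries.items():
        # Assumption: surrounding whitespace is not significant.
        keys_by_value[value.strip()].append(key)
    # Keep only values that are used by more than one key.
    return {value: keys for value, keys in keys_by_value.items() if len(keys) > 1}


def same_across_files(base, translated):
    """List keys whose value is identical in TWO files (hypothetical helper)."""
    return [key for key, value in base.items()
            if key in translated and translated[key] == value]
```

    Note that `same_across_files` cannot tell forgotten translations from entries that are meant to be identical in both languages, which is exactly the open question raised above.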

     
  • Sebastian M.
    2005-05-12

    Logged In: YES
    user_id=1277166

    A simple but effective way to calculate similarity is to
    compare the words of two Strings:

    Assume you have string A and string B.

    1) lowercase both strings and strip special characters.

    2) split both strings into word tokens, set
    possiblematches=max(numberWordsA,numberWordsB)

    3) iterate over the words of B and check for each word whether
    it's in the words of A. On a match, remove it from the words
    of A. Count each hit.

    4) similarity-score is <hits>/<possiblematches>, so it's
    always between 0 and 1.

    I've implemented this in a small tool for myself and it
    works quite well. If you're interested I can mail you the source.
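
    The four steps above can be sketched as follows. This is a minimal Python illustration, not the tool mentioned in the comment: the function name `word_similarity`, the regular expression used to strip special characters, and the score returned for two empty strings are assumptions.

```python
import re


def word_similarity(a, b):
    """Return hits / possible matches, a score between 0.0 and 1.0."""

    def tokens(text):
        # Step 1: lowercase, then replace everything that is not a word
        # character (letter, digit, underscore) or whitespace with a space.
        cleaned = re.sub(r"[^\w\s]", " ", text.lower())
        # Step 2 (first half): split into word tokens.
        return cleaned.split()

    words_a = tokens(a)
    words_b = tokens(b)

    # Step 2 (second half): the longer word list bounds the possible matches.
    possible_matches = max(len(words_a), len(words_b))
    if possible_matches == 0:
        return 1.0  # Assumption: two empty strings count as identical.

    # Step 3: each word of B may consume at most one matching word of A.
    remaining = list(words_a)
    hits = 0
    for word in words_b:
        if word in remaining:
            remaining.remove(word)
            hits += 1

    # Step 4: normalise the hit count.
    return hits / possible_matches
```

    For example, `word_similarity("Save the file", "save file!")` matches two of three possible words. Word order does not affect the score, so `"Hello, World"` and `"world hello"` score 1.0.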

     
  • Logged In: YES
    user_id=1012642

    Interesting... but I am afraid such a solution might
    significantly affect performance on large files. To prevent
    this, I may always put this as a configurable option or as a
    "run-on-demand" feature. I am not ruling it out (I can see
    value).

    Any submitted code is always appreciated, whether I use it
    as is or look into it for implementation ideas.

     
  • Sebastian M.
    2005-05-12

    Logged In: YES
    user_id=1277166

    Aye. If performance is an issue, you could make use of the
    Levenshtein distance algorithm. I used it in my projects and
    it is quite fast.

    More information can be found here:
    http://www.merriampark.com/ld.htm

    However, for the Levenshtein distance, the order of the
    words is important.
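
    For reference, the Levenshtein distance can be computed with the standard two-row dynamic-programming approach. The sketch below is in Python and is not taken from the linked page or from any project mentioned here. The function names are hypothetical, and turning the distance into a 0..1 score by dividing by the longer string's length is an assumption about how it might be compared against a threshold.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # Keep the inner row as short as possible.

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

    Because the comparison is character by character, `"save file"` and `"file save"` have a non-zero distance, even though they contain exactly the same words.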

     
  • Logged In: YES
    user_id=1012642

    Thanks, I will keep all this in mind for when I get there. I
    might even consider providing both techniques as
    configurable options.

     
    • milestone: 479430 --> 450941
    • assigned_to: nobody --> essiembre
     
  • Logged In: YES
    user_id=1012642

    A solution has been added to CVS. Will be part of next release.

     
  • Logged In: YES
    user_id=1012642

    Implemented in release 0.7.0. Configurable option turned
    off by default.

    Can select between "Levensthein Distance" and "Compare word
    count" algorithms.

     
    • milestone: 450941 --> Released
    • status: open --> closed