#14 Display Similar Entries

Sebastian M.

In large property files (>1000 entries) it is often the
case that there are duplicate/similar entries.

It would be nice if, when you enter values for a specific
language key, that language file were searched for
similar entries and a hint were displayed somewhere that
"a similar entry with the key XY exists", or, if there
is more than one similar entry, a list of similar
entries were displayed.

That way duplicate entries could be avoided.


    • milestone: 450941 --> 479430
  • Logged In: YES

    Good news: something similar is already on my personal TODO
    list, as it was recommended by email from another user. It
    probably won't be in the next release (0.6.0), but I
    definitely intend to implement something like this.
    "Similar" entries might be difficult to implement, so you'll
    probably see "same" entries implemented first.

    The way I saw this, I would also like to check for the same
    entries across multiple files (for the same key). This
    would help resolve cut-n-paste issues for text we forgot to
    translate. The only problem is with those occasional
    entries that are meant to be the same in two or more
    languages. It is still unclear to me how I would handle

    Maybe the two should be considered as separate features
    (duplicate within same file vs duplicate across multiple files).
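
    To make the two cases concrete, here is a minimal sketch. It assumes
    each property file has already been parsed into a plain dict of
    key -> value; the function names are hypothetical and this is not
    the editor's actual code.

```python
from collections import defaultdict


def duplicates_within(entries):
    """Group keys of ONE file that share exactly the same value."""
    keys_by_value = defaultdict(list)
    for key, value in entries.items():
        keys_by_value[value].append(key)
    # Keep only values that are used by more than one key.
    return {v: keys for v, keys in keys_by_value.items() if len(keys) > 1}


def same_across(entries_a, entries_b):
    """List keys whose value is identical in TWO files (e.g. two locales)."""
    shared_keys = entries_a.keys() & entries_b.keys()
    return sorted(k for k in shared_keys if entries_a[k] == entries_b[k])


# Hypothetical sample data standing in for two parsed language files.
en = {"ok": "OK", "confirm": "OK", "cancel": "Cancel"}
fr = {"ok": "OK", "cancel": "Annuler"}

print(duplicates_within(en))
print(same_across(en, fr))
```

    Note that the second check would also flag entries that are meant
    to be the same in several languages, which is the open question
    raised above.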

  • Sebastian M.

    Logged In: YES

    A simple but effective way to calculate similarity is to
    compare the words of two Strings:

    Assume you have string A and string B.

    1) lowercase both strings and strip special characters.

    2) split each string into a set of word tokens.

    3) iterate over the words of B and check for each word if
    it's in the words of A. If it matches, remove it from the words
    of A. Count each hit.

    4) similarity-score is <hits>/<possiblematches>, so it's
    always between 0 and 1.

    I've implemented this in a small tool for myself and it
    works quite well. If you're interested I can mail you the source.
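
    The four steps above can be sketched roughly as follows. This is an
    illustration only, not the tool mentioned in this comment. Two things
    are assumptions: <possiblematches> is taken to be the word count of
    the longer string (which keeps the score between 0 and 1), and the
    words are kept in a list so that repeated words can be removed one
    at a time. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Steps 1 and 2: lowercase, replace special characters, split into words."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return cleaned.split()


def word_similarity(a, b):
    """Steps 3 and 4: count words of b found in a, then normalise."""
    words_a = tokenize(a)
    words_b = tokenize(b)
    # Assumption: "possible matches" = word count of the longer string.
    possible = max(len(words_a), len(words_b))
    if possible == 0:
        return 1.0  # assumption: two empty strings count as identical
    remaining = list(words_a)
    hits = 0
    for word in words_b:
        if word in remaining:
            remaining.remove(word)  # each word of A can match only once
            hits += 1
    return hits / possible


print(word_similarity("Save the file", "save file!"))
print(word_similarity("File save", "Save file"))
```

    Because only the presence of words is compared, word order has no
    effect on the score.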

  • Logged In: YES

    Interesting... but I am afraid such a solution might
    significantly affect performance on large files. To prevent
    this, I may always put this as a configurable option or as a
    "run-on-demand" feature. I am not ruling it out (I can see

    Any submitted code is always appreciated. Whether I use it
    as is or look into it for implementation ideas.

  • Sebastian M.

    Logged In: YES

    Aye. If performance is an issue, you could make use of the
    Levenshtein Distance Algorithm. I used it in my projects and
    it is quite fast.

    More Information can be found here:

    However, for the Levenshtein Distance, the order of the
    words is important.
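
    For reference, a minimal sketch of the Levenshtein distance using the
    standard two-row dynamic programming table. Turning the distance into
    a 0..1 score by dividing by the longer string's length is an
    assumption made here for comparison with the word-based score; the
    function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalisation: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(levenshtein_similarity("save file", "file save"))
```

    Swapping the words in the last call changes the characters at most
    positions, so the score drops even though both strings contain the
    same words.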

  • Logged In: YES

    Thanks, I will keep all this in mind for when I get there. I
    might even consider providing both techniques as
    configurable options.

    • milestone: 479430 --> 450941
    • assigned_to: nobody --> essiembre
  • Logged In: YES

    A solution has been added to CVS. Will be part of next release.

  • Logged In: YES

    Implemented in release 0.7.0. Configurable option turned
    off by default.

    Can select between "Levensthein Distance" and "Compare word
    count" algorithms.

    • milestone: 450941 --> Released
    • status: open --> closed