#2 Wrong results due to padding with "X"

Status: closed-fixed
Owner: Michel Albert
Labels: None
Priority: 9
Updated: 2008-10-07
Created: 2007-11-21
Creator: Michel Albert
Private: No

Internally, the algorithm pads the supplied strings with "X" characters. This causes problems when the string itself begins or ends with an X: the similarity score is smaller than expected because one trigram "disappears".

A solution is in the works ;)
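
One way the "disappearing" trigram could arise is sketched below. This is an illustration only, not the project's code: the helper name `trigrams`, the padding width of n - 1 characters per side, and the idea that duplicate trigrams collapse when collected into a set are all assumptions made for the example.

```python
def trigrams(text, pad_char="X", n=3):
    """Hypothetical helper: pad both ends with n - 1 pad characters, then slice n-grams."""
    padding = pad_char * (n - 1)
    padded = padding + text + padding
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# A string that starts with the padding character itself.
with_x = trigrams("XXa", pad_char="X")
with_dollar = trigrams("XXa", pad_char="$")

# With "X" padding the padded string is "XXXXaXX", so "XXX" occurs twice
# and the set of distinct trigrams is one smaller than the list.
print(with_x, len(set(with_x)))
print(with_dollar, len(set(with_dollar)))
```

With a padding character that does not occur in the input, every trigram in this example stays distinct.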

Discussion

  • Michel Albert
    2007-11-21

    Fix

     
  • Michel Albert
    2007-11-21

    Attached a fix for this.
    I am now using the non-breaking space (u'\xa0') as the padding character. If any non-breaking spaces are encountered in the input string, they are replaced with normal spaces (u'\x20'). This is a non-destructive replacement.
    File Added: 1835788.patch
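
The approach described in this comment could look roughly like the following. It is a minimal sketch, not the contents of the attached patch: the function name `pad` and the padding width of n - 1 characters per side are assumptions, and the code uses Python 3 string literals rather than the `u'...'` form quoted above.

```python
PAD_CHAR = "\xa0"  # non-breaking space, as named in the comment above


def pad(text, n=3):
    """Hypothetical helper: normalise the input, then pad it for n-gram extraction."""
    # Replace any non-breaking spaces in the input with ordinary spaces so the
    # padding character can never collide with the string's own content.
    cleaned = text.replace(PAD_CHAR, " ")
    padding = PAD_CHAR * (n - 1)
    return padding + cleaned + padding


padded = pad("Xa\xa0b")
print(repr(padded))
```

After this step the only non-breaking spaces in the padded string are the ones added as padding.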

     
  • Graham Poulter
    2008-10-06

    In my changes I'm padding with "$" instead... it's not a full solution, but '$' is a lot rarer.

    Actually, I think I'll make the padding char an optional constructor parameter defaulting to '$'

     
  • Graham Poulter
    2008-10-06

    Didn't see Michel's comment below... have checked in Michel's patch to SVN instead.

     
  • Graham Poulter
    2008-10-07

    From SVN revision 5 padding character is configurable via the pad_char constructor parameter. Close?
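
A configurable padding character might be wired in along these lines. Only the parameter name `pad_char` comes from the thread; the class name `TrigramIndex`, the `n` parameter, the `ngrams` method, and the default of '$' are assumptions for illustration and may not match what was committed.

```python
class TrigramIndex:
    """Hypothetical class; only the pad_char parameter name is taken from the thread."""

    def __init__(self, pad_char="$", n=3):
        # The default value here is an assumption, not the project's confirmed default.
        if len(pad_char) != 1:
            raise ValueError("pad_char must be a single character")
        self.pad_char = pad_char
        self.n = n

    def ngrams(self, text):
        """Pad both ends with n - 1 pad characters and return the list of n-grams."""
        padding = self.pad_char * (self.n - 1)
        padded = padding + text + padding
        return [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]


print(TrigramIndex().ngrams("ab"))
print(TrigramIndex(pad_char="#").ngrams("a"))
```

Making the character a constructor argument lets a caller pick something known to be absent from their own data, rather than relying on a single hard-coded choice.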

     
  • Michel Albert
    2008-10-07

    Agreed.

    Thanks Graham.

     
  • Michel Albert
    2008-10-07

    • status: open --> open-fixed
     
  • Michel Albert
    2008-10-07

    Forgot to set it to "closed" ;)

     
  • Michel Albert
    2008-10-07

    • status: open-fixed --> closed-fixed