#2 Wrong results due to padding with "X"

closed-fixed
None
9
2008-10-07
2007-11-21
No

Internally, the algorithm pads the supplied strings with "X" characters. This causes problems when the string itself begins or ends with an "X": the string's own characters become indistinguishable from the padding, so one trigram "disappears" and the similarity score is smaller than expected.

A solution is in the works ;)
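
The effect described above can be reproduced with a small stand-alone sketch. This is not the library's code: the helper names `padded_trigrams` and `similarity` are hypothetical, and the score is assumed to be a plain set overlap (shared distinct trigrams divided by all distinct trigrams), which may differ from the library's actual formula.

```python
def padded_trigrams(text, pad_char="X", n=3):
    """Return all n-grams of `text` after padding both ends (hypothetical helper)."""
    pad = pad_char * (n - 1)
    padded = pad + text + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def similarity(a, b, pad_char="X"):
    """Assumed scoring: shared distinct trigrams / all distinct trigrams."""
    grams_a = set(padded_trigrams(a, pad_char))
    grams_b = set(padded_trigrams(b, pad_char))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "XXAB" padded with "X" becomes "XXXXABXX": the trigram "XXX" occurs twice,
# so only 5 distinct trigrams remain. Padded with "$" there are 6.
print(len(set(padded_trigrams("XXAB", "X"))))  # 5
print(len(set(padded_trigrams("XXAB", "$"))))  # 6

# The lost trigram lowers the score for an otherwise identical comparison.
print(similarity("XXAB", "XXAC", "X"))  # 0.25
print(similarity("XXAB", "XXAC", "$"))  # 0.333...
```

With a padding character that does not occur in the input, the leading characters of the string keep their own trigrams, so the comparison is no longer skewed.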

Discussion

  • Michel Albert

    Michel Albert - 2007-11-21

    Fix

     
  • Michel Albert

    Michel Albert - 2007-11-21

    Logged In: YES
    user_id=560690
    Originator: YES

    Attached a fix for this.
    I am now using the non-breaking space (u'\xa0') as the padding character. If any non-breaking spaces are encountered in the string, they are replaced with normal spaces (u'\x20'). This is a non-destructive replacement.
    File Added: 1835788.patch

     
  • Graham Poulter

    Graham Poulter - 2008-10-06

    In my changes I'm padding with "$" instead... it's not a full solution, but '$' is a lot rarer.

    Actually, I think I'll make the padding char an optional constructor parameter defaulting to '$'

     
  • Graham Poulter

    Graham Poulter - 2008-10-06

    Didn't see Michel's comment below... have checked in Michel's patch to SVN instead.

     
  • Graham Poulter

    Graham Poulter - 2008-10-07

    As of SVN revision 5, the padding character is configurable via the pad_char constructor parameter. Close?

     
  • Michel Albert

    Michel Albert - 2008-10-07

    Agreed.

    Thanks Graham.

     
  • Michel Albert

    Michel Albert - 2008-10-07
    • status: open --> open-fixed
     
  • Michel Albert

    Michel Albert - 2008-10-07

    Forgot to set it to "closed" ;)

     
  • Michel Albert

    Michel Albert - 2008-10-07
    • status: open-fixed --> closed-fixed
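
The approach discussed in this thread (pad with a rarely used character, and replace any occurrence of that character in the input with a normal space) might look roughly like the sketch below. It is an illustration only, not the attached patch or the checked-in code: the function name `pad` is hypothetical, and applying the same replacement to a custom `pad_char` is an assumption.

```python
NBSP = "\xa0"  # non-breaking space, the padding character mentioned in the thread


def pad(text, n=3, pad_char=NBSP):
    """Pad `text` with n-1 copies of `pad_char` on each side (hypothetical helper).

    Occurrences of `pad_char` inside `text` are first replaced with a normal
    space so that they cannot be mistaken for padding. Note that this does
    nothing useful if `pad_char` is itself a normal space.
    """
    if len(pad_char) != 1:
        raise ValueError("pad_char must be a single character")
    cleaned = text.replace(pad_char, " ")
    edge = pad_char * (n - 1)
    return edge + cleaned + edge


# A leading "X" is now ordinary content, distinct from the padding.
print(repr(pad("XAB")))               # '\xa0\xa0XAB\xa0\xa0'
# A non-breaking space in the input is turned into a normal space.
print(repr(pad("a\xa0b")))            # '\xa0\xa0a b\xa0\xa0'
# A caller-chosen padding character is handled the same way (assumption).
print(repr(pad("a$b", pad_char="$")))  # '$$a b$$'
```

Making the padding character a parameter lets callers pick one that cannot occur in their data, which the "$" default alone could not guarantee.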
     
