#2 Apostrophe support

Milestone: Unstable (example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-11-21
Created: 2014-11-13
Private: No

Currently presage splits on apostrophes and cannot represent words containing apostrophes in the database, because apostrophes are not escaped. As a result, presage cannot correctly predict words like "don't", which are instead stored in the database as the 2-gram "don" and "t". One of the highest ranked entries in the 3-gram database is currently "i don t".
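To illustrate the behaviour described above, here is a minimal sketch of a separator-based tokenizer. It is not presage's actual tokenizer: the function names, the separator sets, and the lowercasing step are all assumptions chosen so that the output matches the "i don t" entry mentioned in the report.

```python
def tokenize(text, separators):
    """Split text into lowercase tokens on any character in `separators`."""
    tokens, current = [], []
    for ch in text.lower():  # lowercasing is an assumption, not confirmed presage behaviour
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def ngrams(tokens, n):
    """Return all consecutive n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical separator sets; presage's real lists may differ.
WITH_APOSTROPHE = set(" \t\n.,;:!?'\"")
WITHOUT_APOSTROPHE = WITH_APOSTROPHE - {"'"}

text = "I don't know"
print(ngrams(tokenize(text, WITH_APOSTROPHE), 3))
# [('i', 'don', 't'), ('don', 't', 'know')]
print(ngrams(tokenize(text, WITHOUT_APOSTROPHE), 3))
# [('i', "don't", 'know')]
```

With the apostrophe in the separator set, "don't" never appears as a token, so no n-gram containing it can be learned or predicted.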

The attached patch removes the apostrophe character from both presage's internal separator characters and from the text2ngram utility's separators, and adds support for escaping apostrophe characters in the database connector.
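The escaping half of the change can be sketched as follows. This is not the code from the attached patch: the table name `unigrams`, its columns, and the helper `escape_sql_literal` are hypothetical, and Python's `sqlite3` module stands in for presage's database connector. The sketch shows the standard SQL convention of doubling an apostrophe inside a single-quoted literal, plus parameter binding as an alternative that avoids manual escaping.

```python
import sqlite3


def escape_sql_literal(value):
    """Escape a string for a single-quoted SQL literal by doubling apostrophes."""
    return value.replace("'", "''")


conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute("CREATE TABLE unigrams (word TEXT PRIMARY KEY, count INTEGER)")

word = "don't"

# Without escaping, the apostrophe terminates the literal early and the
# statement is rejected.
try:
    conn.execute("INSERT INTO unigrams (word, count) VALUES ('%s', 1)" % word)
    unescaped_failed = False
except sqlite3.Error:
    unescaped_failed = True

# With escaping, the statement becomes VALUES ('don''t', 1) and succeeds.
conn.execute(
    "INSERT INTO unigrams (word, count) VALUES ('%s', 1)" % escape_sql_literal(word)
)

# Parameter binding leaves quoting to the database library.
conn.execute("INSERT INTO unigrams (word, count) VALUES (?, ?)", ("can't", 1))

rows = conn.execute("SELECT word FROM unigrams ORDER BY word").fetchall()
print(rows)
```

Either approach lets a token such as "don't" be stored and queried intact once the tokenizer stops splitting it.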

Hope that helps,
Mike.

1 Attachments

Discussion

  • Matteo Vescovi - 2014-11-17

    Hi Michael,

    Thanks for taking the time to implement this change (to stop treating the apostrophe character as a separator character in the tokenizer) and for sending in a patch.

    While your change fixes the special case of handling the apostrophe character, I would like to address the more general problem of enabling the user to configure which characters should be treated as separators, blankspace, or valid characters.

    Essentially, I would like to fix the problem created by the fact that the separator characters are hard-coded in the presage library and the tools.

    This is an area of the code that I haven't looked into for quite some time. In fact, it suffers from the fact that it is code that I wrote in the initial design stages of presage and that hasn't been looked at since. It would certainly benefit from a bit of attention and refactoring.

    For instance, it might be a good idea to put a configuration mechanism in place so that the separator characters can be configured through the presage.xml config file, rather than being hard-coded. This would require modifying the profile handling code and observers to handle the new config variables (perhaps a variable in <Presage><ContextTracker><Tokenizer><Separators>...).
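As a rough sketch of what such a configuration lookup might look like, the snippet below reads a separator list from an XML profile and falls back to a built-in default when the variable is absent. No such variable exists in presage at the time of this discussion: the element path follows the one floated above, while the example values, the fallback list, and the `load_separators` helper are purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile fragment; the Separators variable is only a proposal.
CONFIG = """
<Presage>
  <ContextTracker>
    <Tokenizer>
      <Separators>.,;:!?"</Separators>
    </Tokenizer>
  </ContextTracker>
</Presage>
"""

# Hypothetical built-in fallback, standing in for today's hard-coded list.
DEFAULT_SEPARATORS = ".,;:!?'\""


def load_separators(xml_text, default=DEFAULT_SEPARATORS):
    """Return the configured separator characters, or `default` if unset."""
    root = ET.fromstring(xml_text.strip())
    value = root.findtext("ContextTracker/Tokenizer/Separators")
    return value if value else default


separators = load_separators(CONFIG)
print("'" in separators)                   # apostrophe is not a separator here
print(load_separators("<Presage/>"))       # falls back to the default list
```

Keeping the hard-coded list as the fallback would leave existing profiles working unchanged, while a keyboard such as the one discussed in this ticket could drop the apostrophe through configuration alone.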

    It might also be good to review the design of the tokenizer classes... in src/lib/core/tokenizer/tokenizer.h there is a description of the separator, blankspace, and valid character classes... I think it would make sense to revisit the rationale behind distinguishing between separators and blankspaces (it does not seem to have any practical effect at the moment) and even refactor the forward and backward tokenizer implementations.

    I'm guessing that the most useful thing to get done would be to be able to configure separators and blankspace characters though, right?

     
  • Michael Sheldon - 2014-11-21

    Yep, for now all we need is the ability to ensure that apostrophes are treated correctly, but the additional flexibility of making it user configurable would probably be handy in the future.

    We could probably just use my patch at our end for the immediate future, so if you feel a rewrite of that section makes sense before making separators configurable I'd suggest doing things in whatever order makes life easiest for you.

    For context, we're making use of presage to help with prediction in the keyboard for Ubuntu Touch and it's been a big help to us, so thanks for all your work on it!

     
