Indexing a UTF-8 term with other symbols

Ron Chris
  • Ron Chris

    Ron Chris - 2011-06-14

    Does anybody know how to index a UTF-8 term that contains a symbol as a
    single atomic term?

    e.g. tête+head

    It works when the term is ASCII (e.g. tete+head): for that I changed line 53
    of TextTokenizer.l to add the "+" symbol.
    But with UTF-8 it does not work.
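
    One possible reading of the behaviour described above, sketched in Python
    rather than taken from the Indri source: if a token rule only lists ASCII
    letters, digits and "+", the two UTF-8 bytes of "ê" fall outside it and
    split the term. The pattern below is a hypothetical stand-in, not the rule
    on line 53 of TextTokenizer.l.

```python
import re

# Hypothetical byte-oriented token rule: ASCII letters, digits and "+".
# This is NOT the actual pattern from TextTokenizer.l.
ASCII_PLUS_TOKEN = re.compile(rb"[A-Za-z0-9+]+")


def ascii_plus_tokens(text):
    """Return the byte runs of the UTF-8 form of `text` that the ASCII-only rule matches."""
    return ASCII_PLUS_TOKEN.findall(text.encode("utf-8"))


print(ascii_plus_tokens("tete+head"))  # one run covering the whole term
print(ascii_plus_tokens("tête+head"))  # split where the bytes of "ê" (0xC3 0xAA) occur
```

    Under that assumption, a rule that handles the ASCII form cannot by itself
    keep the UTF-8 form together.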

  • David Fisher

    David Fisher - 2011-06-14

    If you look two lines farther down (line 55) you see the pattern for a

    • { byte_position += tokleng; return UTF8_TOKEN; }

    You may need to do additional changes in the method
    indri::parse::TextTokenizer::processUTF8Token() to keep the token from being

  • Ron Chris

    Ron Chris - 2011-06-15

    Hi David,
    Thank you.
    I have successfully indexed the term "tête+head", but I have a problem when
    I want to run a query, because "tête+head" is unparsable (UTF-8).
    The #base64quote operator is not useful in this case (I suppose it is only
    for ASCII).
    Do you know how I can create a query for this type of term (without using a
    wildcard)?
    Thank you.

  • David Fisher

    David Fisher - 2011-06-15

    I don't understand what you mean by: "The #base64quote is not useful in this

    Using the Google search "base64 encode utf-8", I found an online utility for
    encoding UTF-8 data, which produces "dMOqdGUraGVhZA==" for your example
    above.
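
    The same encoding can be reproduced locally instead of with an online tool.
    This is a minimal sketch using Python's standard base64 module; the helper
    name indri_base64quote and the spacing inside the operator are assumptions
    made for illustration, not part of Indri.

```python
import base64


def indri_base64quote(term, encoding="utf-8"):
    """Base64-encode `term` in the given encoding and wrap it in #base64quote( ... )."""
    encoded = base64.b64encode(term.encode(encoding)).decode("ascii")
    # The operator name comes from the thread; the exact spacing is an assumption.
    return "#base64quote( " + encoded + " )"


print(indri_base64quote("tête+head"))
```

    The important detail is that the term is encoded as UTF-8 before the base64
    step, so the bytes in the query match the bytes that were indexed.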

  • Ron Chris

    Ron Chris - 2011-06-15

    I had used another base64 encoder (the site defaulted to US-ASCII and I
    overlooked that), which gave me dOp0ZStoZWFk (US-ASCII) instead of
    dMOqdGUraGVhZA== (UTF-8) for "tête+head". That is why I got no answers
    from Indri. (It was my mistake: I did not choose the right encoding.)
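
    Decoding the two strings shows where they differ. This is a small sketch
    with Python's standard base64 module; decoded_hex is a hypothetical helper
    name. Since "ê" does not exist in US-ASCII, the assumption here is that the
    site actually emitted a single-byte encoding such as Latin-1.

```python
import base64


def decoded_hex(b64_text):
    """Decode a base64 string and return its bytes as space-separated hex."""
    raw = base64.b64decode(b64_text)
    return " ".join(f"{byte:02x}" for byte in raw)


# UTF-8 form: "ê" is stored as the two bytes c3 aa.
print(decoded_hex("dMOqdGUraGVhZA=="))
# Single-byte form (assumed Latin-1): "ê" is stored as the one byte ea.
print(decoded_hex("dOp0ZStoZWFk"))
```

    The second byte sequence is not the one stored in a UTF-8 index, which
    would be consistent with the query returning nothing.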

    At the same time, I had read this on the Indri grammar page:

    #base64( ... ) -- converts from base64 -> ascii and then stems and
    normalizes. useful for including non-parsable terms in a query

    #base64quote( ... ) -- same as #base64 except the ascii term is unstemmed
    and unnormalized

    There was nothing about UTF-8 there, so in my confusion I thought the
    operator was not useful and that I had to find another way to run queries.

    Now I understand.

    Thanks a lot for your fast replies.

