Indexing a UTF-8 term with other symbols

Ron Chris
2011-06-14
2012-09-27
  • Ron Chris
    2011-06-14

    Hello,
    Does anybody know how to index a UTF-8 term that contains a symbol as a
    single atomic term?

    e.g. tête+head

    It works when the term is ASCII (e.g. tete+head); for that, I changed
    line 53 of TextTokenizer.l to add the "+" symbol.
    With UTF-8, however, it does not work.

     
  • David Fisher
    2011-06-14

    If you look two lines farther down (line 55) you see the pattern for a
    UTF8_TOKEN:

    • { byte_position += tokleng; return UTF8_TOKEN; }

    You may also need to make additional changes in the method
    indri::parse::TextTokenizer::processUTF8Token() to keep the token from being
    split.
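
A small illustration of why the ASCII change alone does not cover this term, written in Python rather than flex. It is not Indri's tokenizer code, and the helper names `byte_view` and `is_ascii_only` are made up for this sketch. It only shows that "ê" is stored as two bytes above 0x7F in UTF-8, so a pattern built from ASCII characters plus "+" cannot match the whole term in a byte-oriented scanner.

```python
# Illustration only: this is not Indri code, just a look at the raw bytes.
utf8_term = "t\u00eate+head"   # "tête+head"
ascii_term = "tete+head"


def byte_view(text):
    """Return the UTF-8 bytes of text as two-digit hex strings."""
    return [f"{b:02X}" for b in text.encode("utf-8")]


def is_ascii_only(text):
    """True when every UTF-8 byte of text is below 0x80."""
    return all(b < 0x80 for b in text.encode("utf-8"))


print(byte_view(ascii_term))   # 9 bytes, all ASCII
print(byte_view(utf8_term))    # 10 bytes, "ê" becomes C3 AA
print(is_ascii_only(ascii_term), is_ascii_only(utf8_term))
```

The two non-ASCII bytes are the part that has to go through the UTF-8 token path mentioned above.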

     
  • Ron Chris
    2011-06-15

    Hi David,
    Thank you.
    I have successfully indexed the term "tête+head", but I have a problem
    when I want to run a query,
    because "tête+head" is unparsable (UTF-8). #base64quote is not useful
    in this case (I suppose it is only for ASCII).
    Do you know how I can create a query for this type of term (without using
    a wildcard)?
    Thank you.

     
  • David Fisher
    2011-06-15

    I don't understand what you mean by: "The #base64quote is not useful in this
    case"

    Searching Google for "base64 encode utf-8", I found the following online
    utility for encoding UTF-8 data:
    http://coderstoolbox.net/string/ which
    produces "dMOqdGUraGVhZA==" for your example above.

     
  • Ron Chris
    2011-06-15

    I had used another base64 encoder. The site defaulted to US-ASCII and I
    overlooked that, so it gave me dOp0ZStoZWFk (US-ASCII) instead of
    dMOqdGUraGVhZA== (UTF-8) for "tête+head". That is why I got no results
    from Indri. It was my mistake: I did not choose the right encoding.

    At the same time, I had read this on the Indri grammar page:

    base64( ... ) -- converts from base64 -> ascii and then stems and normalizes.
    useful for including non-parsable terms in a query

    base64quote( ... ) -- same as #base64 except the ascii term is unstemmed
    and unnormalized

    It said nothing about UTF-8, so I mistakenly thought that it was not useful
    and that I had to find another way to run queries.

    Now I understand.

    Thanks a lot for your fast replies.
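
The encoding mix-up in this thread can be reproduced with a short Python sketch. The helper name `indri_base64quote` is hypothetical, and wrapping the encoded text as `#base64quote( ... )` is assumed from the grammar description quoted above. Encoding the term's bytes as UTF-8 yields the string David reported, while a single-byte encoding such as Latin-1 yields the string the other encoder produced, which presumably is what happened under its "US-ASCII" setting.

```python
import base64


def indri_base64quote(term, encoding="utf-8"):
    """Base64-encode the bytes of term and wrap them in #base64quote( ... ).

    Hypothetical helper: the operator syntax is assumed from the quoted
    grammar text, not checked against an Indri installation.
    """
    encoded = base64.b64encode(term.encode(encoding)).decode("ascii")
    return f"#base64quote({encoded})"


term = "t\u00eate+head"  # "tête+head"

utf8_query = indri_base64quote(term)                # bytes as stored in a UTF-8 index
latin1_query = indri_base64quote(term, "latin-1")   # single-byte encoding of "ê"

print(utf8_query)
print(latin1_query)
```

Only the first form carries the same bytes as a UTF-8 indexed term, which fits the observation that the second one returned no results.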