Does anybody know how to index a UTF-8 term containing a symbol as an atomic term?
It works when the term is ASCII (e.g. tete+head).
For that, I changed line 53 of TextTokenizer.l to add the "+" symbol,
but with UTF-8 it does not work.
If you look two lines farther down (line 55) you see the pattern for a
You may also need to make additional changes in the method
indri::parse::TextTokenizer::processUTF8Token() to keep the token from being
I have successfully indexed the term "tête+head", but I have a problem when I
want to run a query,
because "tête+head" is unparsable (UTF-8). The #base64quote operator is not useful
in this case (I suppose it's only for ASCII).
Do you know how I can create a query for this type of term? (without using
I don't understand what you mean by: "The #base64quote is not useful in this
Searching Google for "base64 encode utf-8", I found the following online
utility for encoding UTF-8 data:
It produces "dMOqdGUraGVhZA==" for your example above.
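The same encoding can be done locally instead of through an online utility. Below is a minimal Python sketch, assuming the query is written as `#base64quote(...)` around the base64 string; the helper name `base64quote_query` is made up for illustration and is not part of Indri.

```python
import base64


def base64quote_query(term: str) -> str:
    """Return a #base64quote query string for `term` (hypothetical helper).

    The term is encoded to UTF-8 bytes *before* base64 encoding, so the
    decoded bytes match what a UTF-8 index contains.
    """
    encoded = base64.b64encode(term.encode("utf-8")).decode("ascii")
    # Assumed query form: the operator name followed by the base64 text.
    return f"#base64quote({encoded})"


if __name__ == "__main__":
    # "t\u00eate+head" is "tête+head" written with an escape for portability.
    print(base64quote_query("t\u00eate+head"))
    # prints: #base64quote(dMOqdGUraGVhZA==)
```

The only step that matters here is `term.encode("utf-8")`: base64 works on bytes, so the character encoding has to be chosen explicitly before encoding.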
I had used another base64 encoder (the site defaulted to US-ASCII, and I
overlooked that), which gave me dOp0ZStoZWFk (US-ASCII) instead of
dMOqdGUraGVhZA== (UTF-8) for "tête+head". That is why I didn't get any results
from Indri. (It was my mistake: I didn't choose the right encoding.)
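The mismatch is easy to reproduce. The short Python sketch below encodes the same term under two character encodings before base64. One assumption to flag: since "ê" has no US-ASCII representation, the site probably fell back to a single-byte encoding such as ISO-8859-1 (Latin-1). That is an inference from the output string, not something stated in this thread.

```python
import base64


def b64(term: str, encoding: str) -> str:
    """Base64-encode `term` after converting it to bytes with `encoding`."""
    return base64.b64encode(term.encode(encoding)).decode("ascii")


term = "t\u00eate+head"  # "tête+head"

utf8_form = b64(term, "utf-8")      # "ê" becomes two bytes: C3 AA
latin1_form = b64(term, "latin-1")  # "ê" becomes one byte: EA

print(utf8_form)    # dMOqdGUraGVhZA==
print(latin1_form)  # dOp0ZStoZWFk

# Strict US-ASCII cannot represent "ê" at all.
try:
    term.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)     # False
```

The two base64 strings decode to different byte sequences, so only the UTF-8 form can match a term that was indexed as UTF-8.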
At the same time, I had read this on the Indri grammar page:
useful for including non-parsable terms in a query
There was nothing about UTF-8, so I mistakenly thought that it wasn't useful
and that I had to find another way to run queries.
Now I understand.
Thanks a lot for your fast replies.