[Exist-open] Default Tokenizer


Hi, as some of you may have seen from my earlier emails, I've been working
with the DBLP dataset and eXist. I've been having some problems with the
eXist tokenizer. Consider the example XML below:

<inproceedings mdate="2008-04-17" key="conf/ACMmsp/PengLWGR06">
<author>Jinzhan Peng</author>
<title>A comprehensive study of hardware/software knowledge-based</title>
<pages>102-111</pages>
<year>2006</year>
<ee>http://doi.acm.org/10.1145/1178597.1178614</ee>
<crossref>conf/ACMmsp/2006</crossref>
<url>db/conf/ACMmsp/msp2006.html#PengLWGR06</url>
</inproceedings>

The default tokenizer doesn't split the mdate and key attributes (at "-" and
"/"). The same behavior occurs in the <ee>, <crossref> and <url> elements.
But in the title element the behavior seems different:
"hardware/software" is not split, but "knowledge-based" is separated
into two tokens, unlike the mdate attribute.

Well, in my case, I need all elements and attributes to be split at every
punctuation character (because that's how my application behaves). What can
I do to achieve that? I took a look at the Default Tokenizer code, but
I didn't think it would be easy to change. Should I write a new
Tokenizer, or is there a simpler way?
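
To make the requested behavior concrete, here is a minimal sketch in plain
Java. It does not use or extend eXist's tokenizer classes; the class name
PunctuationSplitSketch and the method tokenize are hypothetical, and
"punctuation" is assumed to mean any character that is not a letter or digit.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationSplitSketch {

    // Hypothetical helper, not part of eXist: starts a new token at every
    // character that is neither a letter nor a digit (assumed definition
    // of "punctuation", which also covers whitespace).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Separator reached: close the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Values taken from the sample record above.
        System.out.println(tokenize("2008-04-17"));
        System.out.println(tokenize("conf/ACMmsp/PengLWGR06"));
        System.out.println(tokenize("hardware/software knowledge-based"));
    }
}
```

Under that assumption, "2008-04-17" yields the tokens 2008, 04 and 17, and
"hardware/software knowledge-based" yields hardware, software, knowledge and
based, so attributes and element text are treated the same way.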

Thanks!

Felipe Hummel
