From: Felipe H. <fel...@gm...> - 2009-01-21 22:06:31
Hi,

As some of you may have seen from my earlier emails, I've been working with the DBLP dataset and eXist. I've been having some problems with the eXist tokenizer. Consider the example XML below:

  <inproceedings mdate="2008-04-17" key="conf/ACMmsp/PengLWGR06">
    <author>Jinzhan Peng</author>
    <title>A comprehensive study of hardware/software knowledge-based</title>
    <pages>102-111</pages>
    <year>2006</year>
    <ee>http://doi.acm.org/10.1145/1178597.1178614</ee>
    <crossref>conf/ACMmsp/2006</crossref>
    <url>db/conf/ACMmsp/msp2006.html#PengLWGR06</url>
  </inproceedings>

The default tokenizer doesn't split the mdate and key attributes (at "-" and "/"). The same happens in the <ee>, <crossref> and <url> elements.

In the <title> element, however, the behavior seems different: "hardware/software" is not split, but "knowledge-based" is separated into two tokens, unlike the value of the mdate attribute.

In my case, I need all elements and attributes to be split at every punctuation character, because that is how my application behaves. What can I do to achieve that?

I took a look at the default tokenizer code, but it didn't look easy to change. Should I write a new tokenizer, or is there a simpler way?

Thanks!

Felipe Hummel
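
P.S. To make the behavior I'm after concrete, here is a minimal sketch in plain Java. It does not use eXist's tokenizer interface and does not plug into the index; the class name PunctuationSplit and the method tokenize are made up for illustration. It simply treats every character that is not a letter or a digit as a separator:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of eXist: it only illustrates the desired output.
class PunctuationSplit {

    // Split text into tokens, treating every character that is not a letter
    // or a digit (whitespace, "-", "/", ".", ":", "#", ...) as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // \p{L} matches any Unicode letter, \p{N} any Unicode digit.
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) { // split() can yield a leading empty string
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("2008-04-17"));
        System.out.println(tokenize("conf/ACMmsp/PengLWGR06"));
        System.out.println(tokenize("hardware/software knowledge-based"));
    }
}
```

With this rule, "2008-04-17" gives 2008, 04, 17 and "hardware/software knowledge-based" gives hardware, software, knowledge, based, which is the result I would like for every element and attribute value.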