From: Wolfgang Meier <sbhati@we...> - 2002-02-14 17:22:59
You're right: the indexer is currently not very intelligent and treats
punctuation characters as word boundaries. I planned to change this in the
last release, but in the end I forgot. I will try to write a better tokenizer
class if I have some spare time today or tomorrow. However, it's not always
easy to define the correct behaviour: some punctuation characters actually
mark word boundaries, while others, as in 'Q.4068', do not. I would welcome any
suggestions about a good lexical analyzer we could use here.
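
To make the boundary question concrete, here is a minimal sketch of one
possible rule, not eXist's actual tokenizer: a punctuation character stays
inside a token only when it sits directly between two letters or digits. The
class and method names (SketchTokenizer, tokenize, isJoiner) are hypothetical
and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SketchTokenizer {

    // Splits text into tokens. Letters and digits always belong to a token;
    // any other non-whitespace character is kept only if it "joins" two
    // alphanumeric neighbours (assumed rule, see isJoiner).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || isJoiner(text, i)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Boundary reached: flush the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // True if the character at position i is non-whitespace punctuation with
    // a letter or digit immediately before and after it.
    static boolean isJoiner(String text, int i) {
        if (Character.isWhitespace(text.charAt(i))) {
            return false;
        }
        return i > 0
                && i + 1 < text.length()
                && Character.isLetterOrDigit(text.charAt(i - 1))
                && Character.isLetterOrDigit(text.charAt(i + 1));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lives in My_city, file Q.4068."));
        System.out.println(tokenize("5/46 and 123\\456"));
    }
}
```

Under this rule 'My_city', '5/46', 'Q.4068' and '123\456' each stay a single
token, while a trailing comma or full stop still ends a word. The obvious
weakness is text like 'end.Next' with a missing space, which would also be
kept as one token, so this is only a starting point for discussion.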
>> Now if I run any of the following xpath queries I get an empty result:
>> collection('/test')/person[. &= 'My_city']
>> collection('/test')/person[. &= '5/46']
>> collection('/test')/person[. &= 'Q.4068']
>> collection('/test')/person[. &= '123\456']
>> But if I run one of the following, the document is returned:
>> collection('/test')/person[. &= 'My city']
>> collection('/test')/person[. &= '5 46']
>> collection('/test')/person[. &= 'Q 4068']
>> collection('/test')/person[. &= '123 456']
>My data doesn't normally contain such characters, but I've created some
>items that do, and I can reproduce the problem. I guess the indexer is
>treating "punctuation" characters as word boundaries, so My_city is getting
>indexed as two separate words, hence the ability to retrieve it only when
>the underscore is left out. I fear this looks like a bug....
From: Ionut Emil Iacob <ionut@ms...> - 2002-02-14 18:58:22
By the way, there is also a problem if one searches for words that are also
keywords (and, or, match, etc.) as an argument of any function (e.g. contains(.,'or')).
I noticed this some time ago and found a possible solution:
in XPathLexer.java, function nextToken(), line 242, change
_ttype = testLiteralsTable(_ttype);
to
if (_ttype != CONST) _ttype = testLiteralsTable(_ttype);
(or override testLiteralsTable()?!)
Changing the lexer may not be the best solution, but I didn't find one
by looking at the grammar.
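
The effect of that guard can be sketched in isolation. The code below is not
the generated XPathLexer and does not use ANTLR; the token type values, the
LITERALS map and the method names are all hypothetical stand-ins, assuming
only that the literals lookup maps a token's text to a keyword type when the
text matches a keyword.

```java
import java.util.HashMap;
import java.util.Map;

public class LiteralsGuardSketch {

    // Hypothetical token types; the real values come from the generated lexer.
    static final int CONST = 1;   // quoted string constant such as 'or'
    static final int NCNAME = 2;  // plain name
    static final int KW_OR = 3;
    static final int KW_AND = 4;

    // Stand-in for the lexer's keyword (literals) table.
    static final Map<String, Integer> LITERALS = new HashMap<>();
    static {
        LITERALS.put("or", KW_OR);
        LITERALS.put("and", KW_AND);
    }

    // Stand-in for the literals lookup: returns the keyword type if the
    // text is a keyword, otherwise the type passed in.
    static int testLiterals(String text, int ttype) {
        Integer keyword = LITERALS.get(text);
        return keyword != null ? keyword : ttype;
    }

    // Behaviour before the change: every token is checked against the table.
    static int classifyUnguarded(String text, int ttype) {
        return testLiterals(text, ttype);
    }

    // Behaviour after the change: string constants skip the keyword lookup.
    static int classifyGuarded(String text, int ttype) {
        return ttype != CONST ? testLiterals(text, ttype) : ttype;
    }

    public static void main(String[] args) {
        System.out.println(classifyUnguarded("or", CONST) == KW_OR);
        System.out.println(classifyGuarded("or", CONST) == CONST);
        System.out.println(classifyGuarded("or", NCNAME) == KW_OR);
    }
}
```

With the guard, the string 'or' in contains(.,'or') keeps its constant type,
while a bare or in the expression is still recognised as the keyword.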
Another problem: I'm using document-centric XML and I want to keep every
single character in the DB (all spaces, tabs, newlines, etc.).
Since this is not difficult to do, wouldn't it be nice to offer it as
an option for eXist's users?