#481 contains() with an accent-blind collation

v8.7
closed
Michael Kay
5
2012-10-08
2006-03-06
Michael Kay
No

When contains() and other similar functions are used
with an accent-blind collation, accents are not ignored
as they should be. For example,

contains("télé", "tele",
"http://saxon.sf.net/collation?lang=fr-FR;strength=primary")

returns false.

The reason for the problem is an undocumented behaviour
of the JDK RuleBasedCollator class: with this kind of
collation, the stream of collation elements returned by
the CollationElementIterator includes zero values where
the accents occur, and the application (i.e. Saxon) is
apparently expected to ignore these zero values. The
attached file is a new version of
net.sf.saxon.sort.RuleBasedSubstringMatcher modified to
behave this way.
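
The approach described above can be sketched roughly as follows. This is not the attached RuleBasedSubstringMatcher; it is a minimal illustration that assumes strength=primary, assumes Collator.getInstance() returns a RuleBasedCollator (true of the standard JDK collators), and uses hypothetical names (ZeroSkippingContains, significantElements, contains, primaryFrench). It collects the primary order of each collation element, drops the zero values, and then looks for the search key's sequence inside the source's sequence.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Saxon's actual implementation.
class ZeroSkippingContains {

    // Assumption: the JDK collator for fr-FR is a RuleBasedCollator.
    static RuleBasedCollator primaryFrench() {
        RuleBasedCollator c = (RuleBasedCollator) Collator.getInstance(Locale.FRANCE);
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    // Primary orders of the collation elements of s, with zero
    // (ignorable) values left out.
    static List<Integer> significantElements(RuleBasedCollator collator, String s) {
        List<Integer> out = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(s);
        int ce;
        while ((ce = it.next()) != CollationElementIterator.NULLORDER) {
            int primary = CollationElementIterator.primaryOrder(ce);
            if (primary != 0) {
                out.add(primary);
            }
        }
        return out;
    }

    // True if the significant elements of key occur contiguously
    // within the significant elements of source.
    static boolean contains(RuleBasedCollator collator, String source, String key) {
        List<Integer> haystack = significantElements(collator, source);
        List<Integer> needle = significantElements(collator, key);
        return Collections.indexOfSubList(haystack, needle) >= 0;
    }

    public static void main(String[] args) {
        RuleBasedCollator c = primaryFrench();
        // "t\u00e9l\u00e9" is "télé" written with escapes to avoid encoding issues.
        System.out.println(contains(c, "t\u00e9l\u00e9", "tele"));
    }
}
```

A real matcher also has to map matched elements back to character offsets in order to support substring-before and substring-after; that step is left out here.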

The functions affected are contains, starts-with,
ends-with, substring-before, and substring-after.

Discussion

  • Michael Kay
    2006-03-06

    Replacement code for net.sf.saxon.sort.RuleBasedSubstringMatcher

  • Michael Kay
    2006-03-06

    Note that this change has some unexpected consequences. For
    example in a collation with strength=primary, "-" is an
    ignorable character for collation purposes, and is therefore
    represented in the sequence of collation elements by a zero
    value. The effect is that substring-before("in-scope", "-")
    returns "", because the "-" matches an empty string. This
    behaviour, though strange, is correct according to the spec.
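
    One way to see why such a key behaves like the empty string is to
    check whether it has any non-zero collation elements at all. The
    sketch below is illustrative only: the class and method names are
    hypothetical, it assumes strength=primary, and it assumes the JDK
    collator is a RuleBasedCollator. A key for which the check returns
    true contributes nothing to the element sequence, so it matches at
    the very start of any string.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical sketch, not part of Saxon.
class IgnorableKeyCheck {

    // Assumption: the JDK collator for fr-FR is a RuleBasedCollator.
    static RuleBasedCollator primaryCollator() {
        RuleBasedCollator c = (RuleBasedCollator) Collator.getInstance(Locale.FRANCE);
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    // True if every collation element of s has a zero primary order,
    // i.e. s is indistinguishable from "" at primary strength.
    static boolean hasNoSignificantElements(RuleBasedCollator collator, String s) {
        CollationElementIterator it = collator.getCollationElementIterator(s);
        int ce;
        while ((ce = it.next()) != CollationElementIterator.NULLORDER) {
            if (CollationElementIterator.primaryOrder(ce) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RuleBasedCollator c = primaryCollator();
        // Print the result for the "-" key from the example above
        // and for an ordinary word, for comparison.
        System.out.println("\"-\": " + hasNoSignificantElements(c, "-"));
        System.out.println("\"scope\": " + hasNoSignificantElements(c, "scope"));
    }
}
```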