
#11 Quoted Literal Normalization Bug

Milestone: v0.7pre-alpha
Status: open
Labels: Schema (2)
Priority: 1
Created: 2003-01-23
Updated: 2003-01-23
Private: No

Problem Description:

Normalizers are currently run against attribute values without considering some of the pathological forms that can be present within the value. Schemas expose normalizers directly to external modules, enabling them to normalize values directly with the normalizer. This is common in backends that maintain their own indices, or in other parts of the system that need a canonical representation for an attribute value.

For example, an attribute with an IA5CaseIgnoreMatch matching rule cannot normalize regions where literal quotes are used: it cannot lowercase the literal sections or deep-trim spaces from them. A literal section must be matched as-is even though the matching rule states otherwise. Hence a user must compose filter attribute values with the same case and spacing, quotes included, to match against a literal section. In effect, literal value sections override matching rules.

Presently, canonical representations are normalized even in the literal sections of attribute values. The DnParser, on the other hand, correctly normalizes or defers normalization based on context, preserving literal sections. Indices on DN attributes will therefore be inconsistent when quotes are used, and some searches will return more results than they should.

This potential problem is rare: it depends on the data values used and requires the use of quotes. It is important, yet not a show stopper, but it should be solved eventually for the sake of correctness.
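To illustrate the bug, here is a minimal sketch in plain Java (a hypothetical stand-in, not the project's actual normalizer API): a case-ignore normalization applied naively to the whole value lowercases the quoted literal section along with everything else, producing an index key that no longer matches the literal as-is.

```java
// Minimal sketch of the bug (hypothetical, not the project's normalizer API):
// naively normalizing the whole value lowercases the quoted literal section,
// which should have been left untouched and matched as-is.
public class NaiveNormalize {
    public static void main(String[] args) {
        String value = "cdJFAsdf \"Jack in The box\" j AS 2345DS";

        // A naive IA5CaseIgnoreMatch-style normalization of the full value:
        String naive = value.toLowerCase();

        // The literal section "Jack in The box" is clobbered with the rest:
        System.out.println(naive); // cdjfasdf "jack in the box" j as 2345ds
    }
}
```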

Solution:

We need to parse the value, apply the normalization function associated with the matching rule to non-literal sections only, concatenate the normalized regions with the literal regions, and return the result as the overall normalized value.

So if we had an attribute 'abcxyz' whose matching rule is IA5CaseIgnoreMatch, with the following value and boundary ticks:

 1       23               45          6
 |       ||               ||          |
'cdJFAsdf "Jack in The box" j AS 2345DS'

it would be transformed into:

 1       23               45          6
 |       ||               ||          |
'cdjfasdf "Jack in The box" j as 2345ds'

Here segments 12 and 56 are normalized using the matching rule's assigned normalization function. Segment 34 is untouched as a literal. The transformed string is then used as an index key to look up entries with this value.
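A minimal sketch of this segment-wise normalization in Java (hypothetical names; the real fix would use the parser and matching-rule normalizers described under Implementation, and would also have to handle escaped quotes):

```java
// Hypothetical sketch: normalize an attribute value segment-wise,
// lowercasing (IA5CaseIgnoreMatch-style) only the regions outside
// double-quoted literal sections. Escaped quotes are not handled here.
public class SegmentNormalizer {
    static String normalize(String value) {
        StringBuilder out = new StringBuilder(value.length());
        boolean inLiteral = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '"') {
                inLiteral = !inLiteral;   // toggle literal state at each quote
                out.append(c);
            } else if (inLiteral) {
                out.append(c);            // literal section: copy verbatim
            } else {
                out.append(Character.toLowerCase(c)); // normalized section
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "cdJFAsdf \"Jack in The box\" j AS 2345DS";
        System.out.println(normalize(in));
        // prints: cdjfasdf "Jack in The box" j as 2345ds
    }
}
```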

Implementation:

Currently a value parser is used to correctly parse out the values of DN name components, optionally normalizing the values of RDN attributes based on their matching rules as specified by the schema manager. This parser is one part of a two-part parser system which together implement a parser with two lexical states using a token stream switcher. Complicating it further with code merely to parse a single attribute's value would present a maintenance nightmare. Since these parsers are ANTLR-generated, we can easily create a new, very simple parser that parses just attribute values while normalizing them according to their matching rules as defined in the schema.

Constructing a new parser adds extra overhead. First of all, a parser stores state and so must be synchronized upon, which creates contention. Secondly, new string objects are instantiated all over the place, presenting a potential performance problem. Indices manage value normalization to conduct searches, so they can cache the most recently transformed values to prevent redundant normalizations across subsystem boundaries. The same value may be normalized several times if it is not cached. Also, any such cache must be specific to an attribute to prevent value collisions in the cache. Indices are the best place for caching normalized values, namely because they see the resultant products of normalization. Normalizers may only see parts of a value: in the example above, a Normalizer never sees the complete input; it sees segments 12 and 56 without ever seeing segment 34, and so it would store two entries in total for this attribute-value pair. The cache in an index would store one entry whose key is the original value and whose value is the normalized value. This is more efficient and practical.

Indices must then manage a cache of normalized values and use them on cache hits, or delegate normalization requests to the schema. The schema invokes a value parser with the correct normalizer for the attribute's matching rule and returns the resultant normalized value. This normalized value is also known as the canonical value, or simply the index key.
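As a sketch of this caching scheme (hypothetical class and method names; the project's actual index code is not shown here), a bounded per-attribute map keyed by the original value can delegate to the normalizer only on a cache miss:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a per-attribute cache in an index: keys are the
// original attribute values, values are their canonical (normalized) forms.
// A bounded access-ordered LinkedHashMap gives simple LRU eviction; real
// code would also synchronize access, since a parser stores state.
public class NormalizationCache {
    private final Map<String, String> cache;

    NormalizationCache(final int maxEntries) {
        cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > maxEntries; // evict least recently used
            }
        };
    }

    // Returns the cached canonical value, delegating to the (assumed)
    // schema normalizer only on a cache miss.
    String canonical(String original, UnaryOperator<String> normalizer) {
        return cache.computeIfAbsent(original, normalizer);
    }

    public static void main(String[] args) {
        NormalizationCache cache = new NormalizationCache(128);
        System.out.println(cache.canonical("cdJFAsdf", String::toLowerCase));
        // prints: cdjfasdf
    }
}
```

On a hit, the stored canonical value is returned without re-running the normalizer, avoiding the redundant parse-and-normalize work described above.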
