Re: [htdig-dev] Numbered HTML Entities mangled in Result Blurbs

> > See below.  Basically any entity of the form &#XXX; gets translated to &amp;#XXX;
> >
> > &#153; --> &amp;#153;
> >
> > This only happens for numbered entities below 160.
> >
> > &#160; --> &nbsp;
> > &#169; --> &copy;
> > &#174; --> &reg;
[snip]
> > Is there a fix for this in 3.1.x?  Has anyone complained about this before?
>
> No and yes.  Though 3.1.x does SGML decoding and re-encoding a bit
> differently than 3.2, there's still a fundamental problem with both
> versions that leads to this problem, which has come up again and again.
>
> The problem is that until we have full Unicode support, we can't decode
> all SGML entities and numbered entities into 8-bit characters.  So,
> we convert the ones we're most likely to need within words, to allow
> searches for accented characters and such, but we must leave some entities
> still encoded in the database.  That leads us to the problem: we don't
> know whether an ampersand in the database was originally decoded from
> an entity (and thus should be reencoded), or if it was originally the
> lead-in to an entity we didn't decode (and thus should not be encoded).

  This error happens in the DISPLAY of the excerpts, so looking for
&#XXX; patterns and NOT encoding them before display seems like a
reasonable strategy.  The browser will then decide how to display them.
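
  A minimal sketch of that display-time strategy, in Python rather than
ht://Dig's own C++, purely for illustration.  The function name
encode_for_display is hypothetical and is not part of ht://Dig.  It
escapes a bare ampersand but skips one that already starts a numbered
entity.  Note that it does not resolve the ambiguity described above: text
that was originally written as &amp;#153; would also pass through and be
rendered as a character.

```python
import re

# Matches "&" only when it does NOT begin a numbered entity
# such as &#153; (decimal) or &#x99; (hexadecimal).
_BARE_AMP = re.compile(r"&(?!#(?:[0-9]+|[xX][0-9A-Fa-f]+);)")


def encode_for_display(text):
    """Escape markup characters for display, leaving numbered entities as-is.

    Hypothetical helper for illustration; named entities such as &copy;
    are still escaped here, since only the &#XXX; form is exempted.
    """
    text = _BARE_AMP.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")


if __name__ == "__main__":
    print(encode_for_display("Acme&#153; & Co <b>"))
```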

  The STORAGE of &#XXX; is properly done in the db.excerpts datafile.

  I think the issues you describe above concern encoding for accurate
INDEXING of special characters.

  It also seems that we really only need one SGML entity object that
can handle both syntaxes (named and numbered), rather than two that have
to play well together.

  As for the 8-bit problem: if our current charset for indexing has no
single-character representation of some HTML entity, we cannot make that
character searchable, and it has to wait for Unicode.
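
  To make the last two points concrete, here is a rough sketch of a single
decoder that handles both named and numbered entities and only decodes
what fits in one 8-bit character.  It is Python for illustration, not
ht://Dig code.  The name decode_entities_8bit and the 160..255 cutoff are
my assumptions (the cutoff mirrors the "below 160" behaviour reported
above), and it uses Python's standard html.entities table rather than
ht://Dig's own entity list.

```python
import re
from html.entities import name2codepoint

# One pattern for both syntaxes: &#153; / &#x99; (numbered) and &eacute; (named).
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities_8bit(text):
    """Decode entities that map to a code point in 160..255; leave the rest.

    Hypothetical helper: anything outside that range (or unknown) stays
    encoded, which is the case that has to wait for Unicode support.
    """
    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            digits = body[1:]
            if digits[0] in "xX":
                code = int(digits[1:], 16)
            else:
                code = int(digits)
        else:
            code = name2codepoint.get(body)  # None for unknown names
        if code is not None and 160 <= code <= 255:
            return chr(code)
        return match.group(0)

    return _ENTITY.sub(replace, text)


if __name__ == "__main__":
    print(decode_entities_8bit("caf&eacute; &#153; &euro;"))
```

  Whatever this leaves encoded is exactly what the display step then has
to pass through untouched.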

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485