From: Neal R. <ne...@ri...> - 2003-11-19 01:04:00
> > See below. Basically any entity of this form &#XXX; gets translated to &amp;#XXX;
> >
> > &#153; --> &amp;#153;
> >
> > This only happens for numbered entities below 160.
> >
> > &nbsp; --> &nbsp;
> > &copy; --> &copy;
> > &reg; --> &reg;
> [snip]
> > Is there a fix for this in 3.1.X?? Anyone complain about this before????
>
> No and yes. Though 3.1.x does SGML decoding and re-encoding a bit
> differently than 3.2, there's still a fundamental problem with both
> versions that leads to this problem, which has come up again and again.
>
> The problem is that until we have full Unicode support, we can't decode
> all SGML entities and numbered entities into 8-bit characters. So,
> we convert the ones we're most likely to need within words, to allow
> searches for accented characters and such, but we must leave some entities
> still encoded in the database. That leads us to the problem: we don't
> know whether an ampersand in the database was originally decoded from
> an entity (and thus should be reencoded), or if it was originally the
> lead-in to an entity we didn't decode (and thus should not be encoded).

This error is happening in the DISPLAY of the excerpts, so looking for
&#XXX; patterns and NOT encoding them before display seems like a
reasonable strategy. The browser will decide how to display them.

The STORAGE of &#XXX; is done properly in the db.excerpts datafile. I think
the issues you describe above concern encoding for accurate INDEXING of
special characters.

It also seems that we really only need one SGML entity object that can
handle both syntaxes, rather than two that have to play well together.

As for the 8-bit problem: if our current fundamental charset for indexing
has no single-character representation of some HTML entity, so that we
can't search on that character, then that has to wait for Unicode.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
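
P.S. To make the display-side idea concrete, here is a minimal sketch in Python. It is not the project's actual code: the function name encode_for_display and the regular expression are hypothetical, and it assumes only decimal references of the form &#NNN; need to pass through untouched. Any other ampersand is escaped, so a named entity stored in an excerpt would still be re-encoded by this sketch.

```python
import re

# Hypothetical pattern: an ampersand that does NOT begin a decimal
# numeric character reference such as &#153;
_BARE_AMP = re.compile(r"&(?!#[0-9]{1,5};)")


def encode_for_display(text: str) -> str:
    """Escape an excerpt for HTML output, leaving &#NNN; references as stored."""
    # Escape only the ampersands that are not the lead-in to a numeric reference.
    text = _BARE_AMP.sub("&amp;", text)
    # Angle brackets are escaped afterwards, so the "&" they introduce
    # is not escaped a second time.
    return text.replace("<", "&lt;").replace(">", "&gt;")


if __name__ == "__main__":
    print(encode_for_display("AT&T Widget&#153; <new>"))
```

With something like this, a stored &#153; reaches the browser unchanged, while a bare ampersand in ordinary text is still encoded.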