Re: [htdig-dev] HTML Translations

According to Ionut Nistor:
> I posted a (wish)bug a couple of days ago regarding HTML
> translations performed by htdig (#484345). I should have brought the
> issue to the list first (as Gilles Detillieux suggested), so I'll just
> bring up the issue now.
> 
> htdig supports (afaik) 3 translations:
> 
> 1. lt & gt (&lt; &gt;)
> 2. amp (&amp;)
> 3. quot (&quot;)

That's actually 4 - lt & gt are 2 separate entities.  htdig also handles
&trade; (153) in the 3.1.x code, which I think is non-standard, and the
full ISO Latin 1 set of entities from 160 to 255 (&nbsp; - &yuml;) in
both 3.1.x and 3.2 betas.
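
As a rough illustration of that 8-bit table, here is a minimal Python
sketch.  It is not htdig's actual C++ code: TABLE and decode_entity are
hypothetical names, the code points come from Python's standard HTML 4
entity list, and &trade; is left out because its mapping to 153 is
non-standard.

```python
from html.entities import name2codepoint

# The four basic entities mentioned above.
BASIC = {"lt": 60, "gt": 62, "amp": 38, "quot": 34}

# The ISO Latin 1 entities, code points 160 to 255 (&nbsp; to &yuml;).
LATIN1 = {name: cp for name, cp in name2codepoint.items() if 160 <= cp <= 255}

# Hypothetical combined table; every value fits in a single byte.
TABLE = {**BASIC, **LATIN1}


def decode_entity(name):
    """Return the single byte for a known entity name, or None if unknown."""
    cp = TABLE.get(name)
    return bytes([cp]) if cp is not None else None
```

With this table, decode_entity("nbsp") gives the byte 0xA0, while names
such as "euro" or "apos" are simply not found.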

> However, there are some more escapes that I think would be helpful to
> have.
> 
> For instance, &apos; (apostrophe '). 
> Gilles said &apos; is not supported in HTML - that is correct; however, 
> xhtml1.0 brings in XML well formed documents - in XML, you cannot use '
> - ' is escaped as &apos;
> 
> XHTML1.0 notes can be found at: http://www.w3.org/TR/xhtml1/ 
> look at A2. Entity sets - special characters
> http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent

Now, there are some references we can sink our teeth into!  Thanks.

> The problem is that there are many more escape sequences (in the
> &something; style). There are two ways to handle them:
> 1. Have a translation table, so that htdig translates everything in it
> and htsearch does not misescape entities when displaying results.
> (E.g. if an XHTML source file contains &euro;, the browser currently
> displays the literal text &euro; instead of the euro sign, because
> htsearch escapes &euro; into &amp;euro;.)
> 2. Eliminate translations from htdig; htsearch will have to stop
> escaping what is found in the DB in the &something; form.
> 
> I think the second way is better.
> 
> I'm not sure if I explained this clearly; I'll try to explain again if
> necessary.
> 
> Is it possible/desirable?

I'm inclined to agree that the 2nd approach is better.  htdig currently
uses the first approach, which is better for database size, but there
are a few problems.  First of all, htsearch can't distinguish between
text that was translated from entities and text that was originally
entered as a single character, so it sometimes displays them incorrectly
in results.  This problem is compounded by the fact that htdig only uses
an 8-bit encoding, so mixups occur when documents with different
encodings are mixed.
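
To make the mixup concrete, here is a small Python sketch (an
illustration only, not htdig code; the variable names are made up for
the example).  An entity translated to its Latin 1 byte and a raw
character from an ISO-8859-2 document end up as the same stored byte,
so nothing downstream can tell them apart.

```python
from html.entities import name2codepoint

# "&otilde;" translated to a single 8-bit value, as the translation
# table approach would do.
from_entity = bytes([name2codepoint["otilde"]])

# The raw byte for a different character in an ISO-8859-2 document.
from_latin2 = "ő".encode("iso8859-2")

# Both arrive in the database as the same byte.
same_byte = from_entity == from_latin2

# How that one byte reads back depends entirely on the assumed charset.
as_latin1 = from_entity.decode("latin-1")
as_latin2 = from_entity.decode("iso8859-2")
```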

The problem with not translating is that it would make word matches more
difficult when words have accented character entities embedded in them.
The entities would probably have to be translated to Unicode or UTF-8
for word matching, and search words would have to be similarly encoded.
All of this would entail major rewriting of htdig and htsearch!  So, yes,
it is desirable, and possible if we have the volunteers to do it (which
we don't right now), but not simple and straightforward.
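
For what it's worth, the matching step might look something like this
Python sketch.  It only shows the idea, assuming entities are decoded to
Unicode and both document words and search words pass through the same
function; normalize_word is a hypothetical helper, not anything in htdig
or htsearch.

```python
import html
import unicodedata


def normalize_word(text):
    """Decode entities to Unicode, then normalize form and case."""
    decoded = html.unescape(text)
    return unicodedata.normalize("NFC", decoded).lower()


# A word as it might appear in a document, with an embedded entity.
doc_word = normalize_word("caf&eacute;")

# The same word as typed into a search form.
query_word = normalize_word("café")

# After normalization the two compare equal, and either can be stored
# as UTF-8 bytes.
stored = doc_word.encode("utf-8")
```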

The current approach works for the most part, but is not ideal.  Support
for the &apos; entity would be easy to add, but all the other new entities
in XHTML define characters above 255, so they won't work in the current
8-bit only, locale-specific approach.
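
A quick Python check of that claim, using a hand-copied subset of the
code points from xhtml-special.ent (the dictionary below is my own
partial listing for illustration, not the full entity set):

```python
# Subset of entities from xhtml-special.ent with their code points.
XHTML_SPECIAL = {
    "apos": 39,
    "OElig": 338,
    "Scaron": 352,
    "Yuml": 376,
    "ndash": 8211,
    "mdash": 8212,
    "euro": 8364,
}

# Entities representable in a single 8-bit character.
fits_8bit = {name for name, cp in XHTML_SPECIAL.items() if cp <= 255}

# Entities that need a wider encoding such as Unicode/UTF-8.
needs_wider = {name for name, cp in XHTML_SPECIAL.items() if cp > 255}
```

Of this subset, only apos lands in fits_8bit; everything else falls
into needs_wider.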

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930