According to Neal Richter:
> On Wed, 19 Nov 2003, Gilles Detillieux wrote:
> > According to Neal Richter:
> > > > This error is happening in the DISPLAY of the excerpts... so it
> > > > seems like looking for &#XXX; patterns and NOT encoding them before
> > > > display is a reasonable strategy... the browser will decide how to display it.
> > That would be a reasonable compromise, but note that it is a compromise.
> > For example, if an HTML document has something like "use &amp;#153; in
> > your HTML to encode a &#153; character", this will end up in db.excerpts
> > as "use &#153; in your HTML to encode a &#153; character". At that point,
> > htsearch has no way of knowing that the first occurrence was originally
> > different than the second. It comes down to a decision between encoding
> > both or leaving both as-is.
> Eh... why not explicitly look for patterns like '&#153;' and leave
> them as-is?
Do you mean in htdig or in htsearch? The point I'm making is
that by the time htsearch reads the excerpt, it's already too late.
You could of course load up the SGML decoding in htdig with all sorts of
exceptions, so it would convert &amp; to &, but not if &amp; is followed
by #(some_number). It comes down to a question of how elaborate you
want to get with the exceptions. Not converting any SGML entities at
all isn't really a viable option, because then there will be problems
finding matches. We tried that in some limited capacity in 3.1.x with
the translate_amp attribute, et al., but ended up dropping that idea
because it caused more problems than it solved. If you want the whole
story about that, I'd recommend searching the archives for the numerous
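
To make the "exceptions" idea concrete, here is a minimal sketch in Python. It is not ht://Dig code: the entity table, the function names and the regular expressions are all assumptions chosen for illustration. It contrasts a plain decoder, which collapses an escaped ampersand and a real numeric reference into the same stored text, with a decoder that leaves &amp; alone when it is followed by #(some_number);.

```python
import re

# Hypothetical, deliberately tiny entity table (assumption for this sketch).
NAMED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "eacute": "\u00e9"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")   # named entities only
NUMERIC_TAIL = re.compile(r"#[0-9]+;")              # what follows '&' in '&#153;'


def decode_naive(text):
    """Decode known named entities; numeric references are left untouched."""
    return ENTITY.sub(lambda m: NAMED.get(m.group(1), m.group(0)), text)


def decode_with_exception(text):
    """Same as decode_naive, but keep '&amp;' when '#digits;' follows it."""
    def repl(m):
        name = m.group(1)
        if name == "amp" and NUMERIC_TAIL.match(text, m.end()):
            return m.group(0)  # leave '&amp;#153;' exactly as written
        return NAMED.get(name, m.group(0))
    return ENTITY.sub(repl, text)


doc = "use &amp;#153; in your HTML to encode a &#153; character"
naive = decode_naive(doc)             # both occurrences now look identical
careful = decode_with_exception(doc)  # the escaped one stays distinguishable
print(naive)
print(careful)
```

With the naive decoder the two occurrences can no longer be told apart downstream. With the exception they can, but every further special case (hex references, unknown entity names, and so on) would need its own rule, which is the "how elaborate" question.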
> For the purpose of excerpts.... I think we may not need to do encoding at
> all... so that there is no conflict between store and display.
> Encoding SGML entities is beneficial for searchability via the
> db.words.db, but I don't see how it is a benefit for db.excerpts.
Well, if you don't mind that the excerpt highlighting won't find the SGML
entities, then no, there isn't any other benefit. We've been through this
too with 3.1.x, and decided excerpt highlighting was important enough to
get it to work consistently. If you can find a better way than what we
worked out back then, go for it. Just be sure to test what you develop,
because it seems you're not grasping the pitfalls I tried to point out
in my last e-mail on Wednesday.
> I don't want to go tearing up code that is there for a reason.... please
I'm not sure how I can explain myself more clearly than I did on Wednesday.
I suggest you have a look at htsearch/Display.cc and htlib/StringMatch.cc
to see how the code finds the words to highlight. It's a separate matching
mechanism from the search of db.words.db!
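
As a rough illustration of why that matters, here is a small Python sketch. It is an assumption-laden stand-in, not the logic in Display.cc or StringMatch.cc: the highlight() function and the <strong> markup are hypothetical. It only shows that a highlighter which matches the search words against the excerpt text will miss a word whose entity was stored undecoded.

```python
import re


def highlight(excerpt, words):
    """Wrap each case-insensitive occurrence of a search word in <strong>."""
    if not words:
        return excerpt
    pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)
    return pattern.sub(lambda m: "<strong>" + m.group(0) + "</strong>", excerpt)


query = ["caf\u00e9"]  # the word as the user typed it

# Excerpt stored with the entity decoded: the word is found and marked.
decoded = highlight("a small caf\u00e9 downtown", query)

# Excerpt stored with the entity left as-is: nothing matches, no highlight.
undecoded = highlight("a small caf&eacute; downtown", query)

print(decoded)
print(undecoded)
```

The word index can still produce the hit in the second case, yet the excerpt shows no highlighted term, because the two lookups work on different text.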
> > The other option would be for htdig to replace the & lead-in character
> > for undecoded entities into some other, non-ambiguous lead-in character in
> > the database, so that htsearch could always distinguish between the two.
> > But what character could we use, that wouldn't conflict with anything
> > else?
> For that matter we could be storing excerpts marked up via XML and
> process this XML as appropriate during display.
> A bigger project would be to make the entire search-query process
> produce an XML document that we could render to HTML via XSLT. This would
> allow pretty magnificent user customization of the search results.
Sounds like a good idea, but we're talking major coding effort here.
Who's up for it? (I don't have the time!)
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)