#276 3.1.6 -excerpt truncates HTML entity, causing parsing errors

open
nobody
htsearch (60)
5
2006-05-14
2006-05-14
Anonymous
No

ericesposito/notat/gmail(notdot)com

The excerpt code in htsearch/Display.cc truncates the
excerpts at exactly X characters (where X is the config
value of "excerpt_length").

I just ran into a problem where the excerpt got cut
off in the middle of an HTML entity. The indexed HTML
contained &8217; (amp, 8217, semi), and the excerpt
happened to get cut off at exactly the semicolon, so
it ended with &8217 (amp, 8217).

This is technically bad HTML. Browsers ignore it for
the most part. However, we use the XML wrapper for the
search results, and the XML parser barfed on the bad
entity.

I was able to fix this by editing Display::excerpt in
htsearch/Display.cc. In the else clause after "if
(end > temp + headLength)", I added the following code:

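// Start scanning a few characters before the cut point.
char *lookForAmp = end - 10;
int sawAmp = 0;   // 1 while inside an unterminated entity

// Scan up to end; if still inside an entity there, keep
// going until the closing ';' or the end of the string.
for( ; (lookForAmp < end && *lookForAmp) ||
       (*lookForAmp && sawAmp == 1); lookForAmp++ ) {
    if( *lookForAmp == '&' ) {
        sawAmp = 1;
    } else if( *lookForAmp == ';' ) {
        sawAmp = 0;
    }
}

// If the scan ran past the cut point, move the cut there.
if( lookForAmp > end ) {
    end = lookForAmp;
}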

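Basically, this goes back 10 chars from end and scans
forward, tracking whether it is inside an entity (an
ampersand with no semicolon after it yet). If it is
still inside one when it reaches end, it keeps advancing
until it finds a semicolon or hits the end of the string.
It then moves the end pointer to just past the semicolon.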
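
For anyone who wants to try the idea outside of htsearch, here is a
standalone sketch of the same approach. It is not the ht://Dig code:
safeExcerptEnd, excerptUpTo, lookBack and maxExtend are hypothetical
names made up for this example. Compared with the snippet above, it
assumes a few extra guards might be wanted: it never looks before the
start of the buffer, it treats whitespace as ending a candidate
entity (so a bare "&" in text like "AT&T" is left alone), and it only
looks a bounded distance ahead for the semicolon.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ht://Dig). Given a NUL-terminated
// buffer starting at begin and a proposed cut position end (with
// begin <= end <= begin + strlen(begin)), return a cut position that
// does not fall inside an "&...;" entity.
const char *safeExcerptEnd(const char *begin, const char *end,
                           std::size_t lookBack = 10,
                           std::size_t maxExtend = 10)
{
    // Clamp the look-back so we never read before the buffer start.
    const char *p = (static_cast<std::size_t>(end - begin) > lookBack)
                        ? end - lookBack
                        : begin;

    // Remember the most recent '&' not yet closed by ';'.
    // Whitespace also cancels it, since entities contain none.
    const char *openAmp = nullptr;
    for (; p < end && *p; ++p) {
        if (*p == '&') {
            openAmp = p;
        } else if (*p == ';' ||
                   std::isspace(static_cast<unsigned char>(*p))) {
            openAmp = nullptr;
        }
    }
    if (p < end) return p;                 // hit NUL before the cut
    if (openAmp == nullptr) return end;    // cut is not inside an entity

    // Look a bounded distance ahead for the closing ';'.
    for (std::size_t i = 0; i < maxExtend && end[i]; ++i) {
        const unsigned char c = static_cast<unsigned char>(end[i]);
        if (c == ';') return end + i + 1;  // keep the ';' in the excerpt
        if (c == '&' || std::isspace(c)) break;  // not an entity after all
    }
    return end;  // no terminator nearby: leave the cut where it was
}

// Hypothetical convenience wrapper: truncate text at cut characters,
// extending the cut if it would split an entity.
std::string excerptUpTo(const std::string &text, std::size_t cut)
{
    if (cut > text.size()) cut = text.size();
    const char *begin = text.c_str();
    return std::string(begin, safeExcerptEnd(begin, begin + cut));
}
```

With this sketch, excerptUpTo("Fish &amp; chips", 9) gives
"Fish &amp;" instead of "Fish &amp", while a cut that does not land
inside an entity is returned unchanged. Whether the extra guards are
needed inside Display::excerpt depends on how start and end are set
up there, which this example does not try to reproduce.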
