Re: [Archive-access-discuss] Re: nutchwax

Lukas Matejka wrote:
> On Thu, 9 February 2006 18:51, stack wrote:
>
>> ....
>> I see the 0x07 Bell character in the original page.  Below is an 'od'
>> dump of the relevant section with the ascii line underwritten by its hex
>> representation.  The last line has the 0x07 character.
>>
>
> you're absolutely right about the bell character, but I think there is
> another, different problem. I'll try to explain.
>
> I search for the word 'kniha' (which means book) through
> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>
> The answer is valid XML (57472 hits for the word kniha), but in the result,
> the 'description' element contains HTML entities that represent Czech
> characters with diacritics, and that's the problem. The original site
> doesn't contain these HTML entities but regular Czech characters.
>
> The interesting thing is that the 'title' element shows Czech characters
> correctly, but the 'description' element shows them as HTML entities (for
> instance, the HTML entity &yacute; represents the character y with an
> acute accent).
>
> Do you have any idea where the problem could be?
>
Thanks for the extra info Lukas.

Digging in, I see that the generation of summaries runs the text through
org.apache.nutch.html.Entities.  Here is the code for the
Entities#encode method that all summary text is run through:

  static final public String encode(String s) {
    int length = s.length();
    StringBuffer buffer = new StringBuffer(length * 2);
    for (int i = 0; i < length; i++) {
      char c = s.charAt(i);
      int j = (int)c;
      if (j < 0x100 && encoder[j] != null) {
        buffer.append(encoder[j]);        // have a named encoding
        buffer.append(';');
      } else if (j < 0x80) {
        buffer.append(c);                 // use ASCII value
      } else {
        buffer.append("&#");              // use numeric encoding
        buffer.append((int)c);
        buffer.append(';');
      }
    }
    return buffer.toString();
  }

Any character above ASCII gets either a named entity (when it is below
0x100 and has an entry in the encoder table) or a numeric character
reference.  Assuming all is UTF-8 in nutch, we probably don't want HTML
entity encoding when we're outputting UTF-8 XML.  In fact, it looks like
we don't want any HTML entity encoding at all when outputting XML.
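
To make the point concrete, here is a rough sketch of what escaping for
UTF-8 XML output could look like instead.  This is not nutch code; the
class name XmlText and its escape method are made up for illustration.
It escapes only the five characters XML reserves, passes non-ASCII
characters such as Czech letters through untouched, and drops C0 control
characters that XML 1.0 does not allow (which would include the 0x07
Bell).

```java
/** Hypothetical helper, not part of nutch or nutchwax. */
public final class XmlText {
    private XmlText() {}

    /**
     * Escapes the five XML-reserved characters and drops control
     * characters below 0x20 other than tab, LF and CR. Everything
     * else, including non-ASCII text, is copied unchanged.
     */
    public static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (c >= 0x20 || c == '\t' || c == '\n' || c == '\r') {
                        out.append(c);           // legal in XML 1.0, keep as is
                    }
                    // otherwise: control character such as 0x07, silently dropped
            }
        }
        return out.toString();
    }
}
```

With something like this, XmlText.escape("nov\u00fd <b>") would keep the
y-acute as a real character and only rewrite the angle brackets.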

The call to Entities#encode is buried in nutch inside the Fragment inner
class of Summary.  It would take a good bit of work making up a
NutchBean that called an alternate Summary-maker when outputting XML.

Meantime, I have a quick fix that adds HTML entity decoding to the
Nutchwax OpenSearchServlet.  Let me do some more testing and hopefully I
can commit later today. I'll let you know.
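
For the curious, the decoding step could look roughly like the sketch
below.  This is an illustration only, not the code going into the
servlet: the class HtmlEntityDecoder is made up, and its table of named
entities holds just a handful of examples rather than the full HTML set.
It turns named and numeric references back into characters only when
they resolve to a non-ASCII code point, so &amp;, &lt; and friends stay
escaped and the XML remains well-formed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not the actual nutchwax fix. */
public final class HtmlEntityDecoder {
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Small illustrative subset of HTML named entities (all non-ASCII).
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("aacute", 0xE1);
        NAMED.put("eacute", 0xE9);
        NAMED.put("iacute", 0xED);
        NAMED.put("yacute", 0xFD);
    }

    private HtmlEntityDecoder() {}

    /** Replaces references to non-ASCII characters; leaves all others as is. */
    public static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            int cp = codePoint(m.group(1));
            if (cp >= 0) {
                out.appendCodePoint(cp);
            } else {
                out.append(m.group());    // unknown or ASCII: keep the reference
            }
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    /** Returns the referenced code point, or -1 if it should not be decoded. */
    private static int codePoint(String body) {
        int cp;
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                cp = Integer.parseInt(body.substring(2), 16);
            } else if (body.startsWith("#")) {
                cp = Integer.parseInt(body.substring(1));
            } else {
                Integer named = NAMED.get(body);
                cp = (named == null) ? -1 : named;
            }
        } catch (NumberFormatException e) {
            return -1;                    // digits overflowed an int
        }
        return (cp >= 0x80 && Character.isValidCodePoint(cp)) ? cp : -1;
    }
}
```

Leaving ASCII references alone is a deliberate choice here: decoding
&amp; or &lt; at this stage would put raw markup characters back into
the description text.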

St.Ack