From: stack <st...@ar...> - 2006-02-10 19:33:03
Lukas Matejka wrote:
> On Thu, 9 February 2006 18:51, stack wrote:
>> ....
>> I see the 0x07 Bell character in the original page. Below is an 'od'
>> dump of the relevant section with the ascii line underwritten by its hex
>> representation. The last line has the 0x07 character.
>
> you're absolutely right about the bell character, but I think there is
> another, different thing. I'll try to explain.
>
> I will search for the word 'kniha' (which means book) through
> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>
> The answer is valid XML (57472 hits for the word kniha), but in the result,
> in entity 'description', there are HTML entities that represent Czech
> characters with diacritics, and that's the problem. The original site doesn't
> contain these HTML entities but regular Czech characters.
>
> The interesting thing is that entity 'title' shows Czech characters well, but
> entity 'description' shows them as HTML entities (for instance, the HTML
> entity &yacute; represents the special character y with an acute accent).
>
> Do you have any idea where the problem could be?

Thanks for the extra info Lukas.

Digging in, I see that the generation of summaries runs the text through
org.apache.nutch.html.Entities. Here is the code for the
Entities#encode method that all summary text is run through:

    static final public String encode(String s) {
      int length = s.length();
      StringBuffer buffer = new StringBuffer(length * 2);
      for (int i = 0; i < length; i++) {
        char c = s.charAt(i);
        int j = (int)c;
        if (j < 0x100 && encoder[j] != null) {
          buffer.append(encoder[j]);    // have a named encoding
          buffer.append(';');
        } else if (j < 0x80) {
          buffer.append(c);             // use ASCII value
        } else {
          buffer.append("&#");          // use numeric encoding
          buffer.append((int)c);
          buffer.append(';');
        }
      }
      return buffer.toString();
    }

Any character beyond ASCII gets entity-encoded: as a named entity if the
encoder table has one (only possible below 0x100), otherwise as a numeric
character reference.
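
To make that concrete, here is a minimal, self-contained sketch of the same
branching logic. The class name EntityEncodeSketch and the five-entry table
are made up for illustration only; the real encoder table in Entities is much
larger, and I am assuming its entries omit the trailing ';' because the
method above appends it.

```java
class EntityEncodeSketch {
    // Hypothetical stand-in for the encoder table: index = char value,
    // value = entity name without the trailing ';'.
    private static final String[] ENCODER = new String[0x100];
    static {
        ENCODER['<'] = "&lt";
        ENCODER['>'] = "&gt";
        ENCODER['&'] = "&amp";
        ENCODER[0xE1] = "&aacute"; // a with acute
        ENCODER[0xFD] = "&yacute"; // y with acute
    }

    // Same three branches as the method quoted above.
    static String encode(String s) {
        StringBuilder buffer = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x100 && ENCODER[c] != null) {
                buffer.append(ENCODER[c]).append(';');         // named entity
            } else if (c < 0x80) {
                buffer.append(c);                              // plain ASCII
            } else {
                buffer.append("&#").append((int) c).append(';'); // numeric
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Pure ASCII passes through untouched.
        System.out.println(encode("kniha"));            // kniha
        // c with caron (U+010D) has no table entry, y with acute (U+00FD) does.
        System.out.println(encode("\u010Dtvrt\u00FD")); // &#269;tvrt&yacute;
    }
}
```

So a Czech word mixing both kinds of character comes out with a numeric
reference for one letter and a named entity for the other, which matches what
Lukas sees in 'description'.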

Assuming all is UTF-8 in nutch, then we probably don't want HTML entity
encoding when we're outputting UTF-8 XML. In fact, it looks like we don't
want any HTML entity encoding at all when outputting XML.

The call to Entities#encode is buried in nutch inside the Fragment inner
class of Summary. It would take a good bit of work to make up a
NutchBean that called an alternate Summary-maker when outputting XML.

In the meantime, I have a quick fix that adds HTML entity decoding to the
Nutchwax OpenSearchServlet. Let me do some more testing and hopefully I
can commit later today. I'll let you know.

St.Ack
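
For illustration, the decoding step could look roughly like the sketch below.
This is not the code that went into the OpenSearchServlet; the class name
EntityDecodeSketch, the three-entry name table, and the regular expression are
all assumptions made for the example. It deliberately leaves the XML markup
characters (<, >, &, ", ') encoded, since decoding those would break the XML.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDecodeSketch {
    // Hypothetical, tiny name table; a real one would cover every HTML name.
    // XML's own entities (lt, gt, amp, quot, apos) are left out on purpose.
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("aacute", '\u00E1');
        NAMED.put("iacute", '\u00ED');
        NAMED.put("yacute", '\u00FD');
    }

    // Matches decimal references like &#269; and named ones like &yacute;
    private static final Pattern ENTITY =
        Pattern.compile("&(#\\d+|[A-Za-z][A-Za-z0-9]*);");

    private static boolean isXmlSpecial(int cp) {
        return cp == '<' || cp == '>' || cp == '&' || cp == '"' || cp == '\'';
    }

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            String body = m.group(1);
            String replacement = null;
            if (body.charAt(0) == '#') {
                try {
                    int cp = Integer.parseInt(body.substring(1));
                    if (Character.isValidCodePoint(cp) && !isXmlSpecial(cp)) {
                        replacement = new String(Character.toChars(cp));
                    }
                } catch (NumberFormatException e) {
                    // Number too large to be a code point: leave text as is.
                }
            } else {
                Character ch = NAMED.get(body);
                if (ch != null) {
                    replacement = ch.toString();
                }
            }
            // Unknown or unsafe entities are copied through unchanged.
            out.append(replacement != null ? replacement : m.group());
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the word with its two Czech letters restored.
        System.out.println(decode("&#269;tvrt&yacute;"));
        // Markup entities stay encoded.
        System.out.println(decode("a &lt; b &amp; c")); // a &lt; b &amp; c
    }
}
```

A sketch like this does nothing about control characters such as the 0x07
bell, which would need separate handling before the text goes into XML.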