From: stack <st...@ar...> - 2006-02-11 19:00:22
Lukáš:

I committed code to undo any html entity encoding found in text to be
emitted by OpenSearchServlet. I committed on the nutchwax 'release-0_4'
branch, so be careful to get this branch from CVS rather than HEAD if
building from source. If you just want the WAR with the fix, it's
available here: http://archive.org/~stack/nutchwax.war. Let me know if
you want me to make up a complete nutchwax tarball. Let me know if the
fix works for you (Here's the bug:
https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137).

This is a band-aid fix until the core issue gets addressed in nutch.
I'll work on trying to get this done this week.

This is a pretty serious issue. Text snippets -- i.e. the 'description'
field in the XML -- that have anything but plain ASCII are mangled,
showing ugly numeric character representations, '&#343;', etc., in place
of legit UTF-8 characters. It was also possible to by-pass our
legit-xml character checking by encoding illegal characters: e.g. ''.

If the fix works for you Lukáš, I'll make a new release of nutchwax with
the band-aid incorporated later this week (hopefully by the release of
the 0.6.0 mapreduce version of NutchWAX, we will have the real fix
incorporated).

Good stuff,
St.Ack

stack wrote:
> Lukas Matejka wrote:
>> On Thu 9 February 2006 18:51, stack wrote:
>>
>>> ....
>>> I see the 0x07 Bell character in the original page. Below is an 'od'
>>> dump of the relevant section with the ascii line underwritten by its
>>> hex representation. The last line has the 0x07 character.
>>>
>>
>> you're absolutely right with bell character, but i think there is one
>> another different thing. I'll try to explain.
>>
>> i will search word 'kniha' (which means book) through
>> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>>
>> answer is valid XML (57472 hits of word kniha), but in result
>> in entity 'description' there are html entities that represent czech
>> characters with diacritics and that's the problem. Original site
>> doesn't contain these html entities but regular czech characters.
>>
>> Interesting thing is that in entity 'title' shows czech characters
>> well, but in entity 'description' like html entities (for instance html
>> entity &yacute; represents special character y with dash).
>>
>> Have you any idea where could be problem?
>>
> Thanks for the extra info Lukas.
>
> Digging in, I see that the generation of summaries runs the text
> through org.apache.nutch.html.Entities. Here is the code for the
> Entities#encode method that all summary text is run through:
>
> static final public String encode(String s) {
>   int length = s.length();
>   StringBuffer buffer = new StringBuffer(length * 2);
>   for (int i = 0; i < length; i++) {
>     char c = s.charAt(i);
>     int j = (int)c;
>     if (j < 0x100 && encoder[j] != null) {
>       buffer.append(encoder[j]); // have a named encoding
>       buffer.append(';');
>     } else if (j < 0x80) {
>       buffer.append(c); // use ASCII value
>     } else {
>       buffer.append("&#"); // use numeric encoding
>       buffer.append((int)c);
>       buffer.append(';');
>     }
>   }
>   return buffer.toString();
> }
>
> Any character that is super-ASCII gets a numeric character encoding.
> Assuming all is UTF-8 in nutch, then we probably don't want HTML
> entity encoding when we're outputting UTF-8 XML. In fact, looks like
> we don't want any html entity encoding at all when outputting XML.
>
> The call to Entities#encode is buried in nutch inside the Fragment
> inner class of Summary.
> It would take a good bit of work making up a
> NutchBean that called an alternate Summary-maker when outputting XML.
>
> Meantime, I have a quick fix that adds HTML entity decoding to the
> Nutchwax OpenSearchServlet. Let me do some more testing and hopefully
> I can commit later today. I'll let you know.
>
> St.Ack
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
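
For illustration only, the band-aid discussed in this thread (undoing numeric character references before the summary text is written out as XML, and refusing references that name characters XML 1.0 does not allow, such as the 0x07 Bell) might look roughly like the sketch below. This is not the code committed on the 'release-0_4' branch. The class EntityUndo and its methods are hypothetical names, and named entities would need a lookup table that is left out here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch or NutchWAX. */
final class EntityUndo {
    // Matches decimal (&#253;) and hexadecimal (&#xFD;) numeric character references.
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    private EntityUndo() {}

    /** True if the code point is permitted by the XML 1.0 Char production. */
    public static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Replaces each numeric character reference with the character it names.
     * References to characters that are illegal in XML 1.0 are dropped.
     * Named entities and all other text are passed through unchanged.
     */
    public static String decode(String s) {
        Matcher m = NUMERIC_REF.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());          // copy text before the reference
            int cp = (m.group(1) != null)
                ? Integer.parseInt(m.group(1), 16)   // hexadecimal form
                : Integer.parseInt(m.group(2), 10);  // decimal form
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            last = m.end();
        }
        out.append(s, last, s.length());             // copy the remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        // A snippet containing an encoded Czech letter and an encoded Bell character.
        System.out.println(decode("kniha &#253;&#7;"));
    }
}
```

Decoding a reference such as &#38; or &#60; yields a raw '&' or '<', so whatever writes the 'description' element would still have to apply ordinary XML escaping to the decoded text.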