From: Lukas M. <mat...@ce...> - 2006-02-14 17:22:33
> ---------- Forwarded message ----------
> From: stack <st...@ar...>
> To: stack <st...@ar...>
> Date: Sat, 11 Feb 2006 10:59:07 -0800
> Subject: Re: [Archive-access-discuss] Re: nutchwax
>
> Lukáš:
>
> I committed code to undo any HTML entity encoding found in text to be
> emitted by OpenSearchServlet. I committed on the nutchwax 'release-0_4'
> branch, so be careful you get this branch from CVS rather than HEAD if
> building from source. If you just want the WAR with the fix, it's
> available here: http://archive.org/~stack/nutchwax.war. Let me know if
> you want me to make up a complete nutchwax tarball. Let me know if the
> fix works for you (here's the bug:
> https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137 ).

It works very well! Good work. I've just downloaded nutchwax.war and it
seems to be OK :)

-lm

>
> This is a band-aid fix until the core issue gets addressed in nutch.
> I'll work on trying to get this done this week.
>
> This is a pretty serious issue. Text snippets -- i.e. the 'description'
> field in the XML -- that have anything but plain ASCII are mangled,
> showing ugly numeric character representations, '&#343;', etc., in place
> of legit UTF-8 characters. It was also possible to bypass our
> legit-XML character checking by encoding illegal characters: e.g. '&#7;'.
> If the fix works for you, Lukáš, I'll make a new release of nutchwax with
> the band-aid incorporated later this week (hopefully, by the release of
> the 0.6.0 mapreduce version of NutchWAX, we will have the real fix
> incorporated).
>
> Good stuff,
> St.Ack
>
> stack wrote:
> > Lukas Matejka wrote:
> >> On Thu, 9 February 2006 18:51, stack wrote:
> >>> ....
> >>> I see the 0x07 Bell character in the original page. Below is an 'od'
> >>> dump of the relevant section with the ASCII line underwritten by its
> >>> hex representation. The last line has the 0x07 character.
> >>
> >> You're absolutely right about the bell character, but I think there
> >> is a second, separate problem. I'll try to explain.
> >>
> >> I search for the word 'kniha' (which means "book") through
> >> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
> >>
> >> The answer is valid XML (57472 hits for the word 'kniha'), but in the
> >> 'description' element of the results there are HTML entities that
> >> represent Czech characters with diacritics, and that's the problem.
> >> The original site doesn't contain these HTML entities, only regular
> >> Czech characters.
> >>
> >> The interesting thing is that the 'title' element shows Czech
> >> characters correctly, while the 'description' element shows them as
> >> HTML entities (for instance, the HTML entity &yacute; represents the
> >> character y with an acute accent).
> >>
> >> Do you have any idea where the problem could be?
> >
> > Thanks for the extra info, Lukas.
> >
> > Digging in, I see that the generation of summaries runs the text
> > through org.apache.nutch.html.Entities. Here is the code for the
> > Entities#encode method that all summary text is run through:
> >
> >   static final public String encode(String s) {
> >     int length = s.length();
> >     StringBuffer buffer = new StringBuffer(length * 2);
> >     for (int i = 0; i < length; i++) {
> >       char c = s.charAt(i);
> >       int j = (int)c;
> >       if (j < 0x100 && encoder[j] != null) {
> >         buffer.append(encoder[j]);    // have a named encoding
> >         buffer.append(';');
> >       } else if (j < 0x80) {
> >         buffer.append(c);             // use ASCII value
> >       } else {
> >         buffer.append("&#");          // use numeric encoding
> >         buffer.append((int)c);
> >         buffer.append(';');
> >       }
> >     }
> >     return buffer.toString();
> >   }
> >
> > Any character that is super-ASCII gets a numeric character encoding
> > (or a named one, if the table has an entry for it). Assuming all is
> > UTF-8 in nutch, then we probably don't want HTML entity encoding when
> > we're outputting UTF-8 XML. In fact, it looks like we don't want any
> > HTML entity encoding at all when outputting XML.
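
To make the effect concrete, here is a minimal sketch of the ASCII and
numeric branches of the encode method quoted above, together with a
decoder of the general kind the band-aid fix calls for. Everything in it
is an assumption made for illustration: the names EntitySketch,
encodeNumeric, decodeNumeric and isLegalXmlChar are hypothetical, the
named-entity table is left out, and dropping references to characters
that are illegal in XML 1.0 is one possible policy, not necessarily what
was committed to NutchWAX.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Nutch or NutchWAX source.
final class EntitySketch {

    // Decimal numeric character references such as "&#345;".
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d{1,7});");

    // Mirrors only the ASCII and numeric branches of the quoted encode():
    // chars below 0x80 pass through, everything else becomes "&#NNN;".
    static String encodeNumeric(String s) {
        StringBuilder out = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    // Character ranges allowed by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Turns decimal references back into characters. References to
    // characters that are illegal in XML 1.0 (e.g. "&#7;") are dropped
    // (an assumed policy for this sketch).
    static String decodeNumeric(String s) {
        Matcher m = NUMERIC_REF.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());      // copy text before the reference
            int cp = Integer.parseInt(m.group(1));
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            last = m.end();
        }
        out.append(s, last, s.length());         // copy the remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = encodeNumeric("kniha \u0159");
        System.out.println(encoded);                                        // kniha &#345;
        System.out.println(decodeNumeric(encoded).equals("kniha \u0159"));  // true
        System.out.println(decodeNumeric("a&#7;b"));                        // ab
    }
}
```

The decoder checks each code point before emitting it because, as noted
earlier in the thread, an encoded reference to an illegal character
would otherwise slip past the XML character check.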
> >
> > The call to Entities#encode is buried in nutch inside the Fragment
> > inner class of Summary. It would take a good bit of work making up a
> > NutchBean that called an alternate Summary-maker when outputting XML.
> >
> > Meantime, I have a quick fix that adds HTML entity decoding to the
> > Nutchwax OpenSearchServlet. Let me do some more testing and hopefully
> > I can commit later today. I'll let you know.
> >
> > St.Ack
> >
> > -------------------------------------------------------
> > This SF.net email is sponsored by: Splunk Inc. Do you grep through log
> > files for problems? Stop! Download the new AJAX search engine that makes
> > searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
> > http://sel.as-us.falkag.net/sel?cmd=k&kid=103432&bid#0486&dat=121642
> > _______________________________________________
> > Archive-access-discuss mailing list
> > Arc...@li...
> > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss

--
------------------------------
Bc. Lukas Matejka
email: mat...@ce...
GSM: +420777093233