|
From: <mat...@ce...> - 2006-02-09 12:08:33
|
Hi,

I still can't resolve this issue. Does anybody know how to help?
Can NutchWAX produce output containing HTML entities? (Output from NutchWAX should be UTF, shouldn't it?)
In the cases quoted below, the invalid XML is caused by special characters in HTML entities.

Thanks for any help,
-lm

______________________________________________________________
> From: sve...@nb...
> To: stack <st...@ar...>
> CC: Lukáš Matějka <mat...@ce...>
> Date: 12.01.2006 10:38
> Subject: Re: nutchwax
>
> Hi Michael, Lukáš ..
>
> On Thursday 12 January 2006 01:33, stack wrote:
> > Lukáš Matějka wrote:
> ...
> > > what's the difference between these cases?
> > >
> > > 1)
> > > http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&start=0&hitsPerDup=0&hitsPerPage=10&dedupField=exacturl
> > > -> output is not valid xml (called from WERA)
> > >
> > > 2)
> > > http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
> > > output is valid xml (called from Nutchwax search.jsp)
>
> If I try the above URLs I find quite the opposite! Case 1 produces valid XML,
> case 2 produces invalid XML.
>
> Test results:
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=0&dedupField=exacturl
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&hitsPerDup=0&dedupField=exacturl
> -> valid XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=1&dedupField=exacturl
> -> INVALID XML
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&hitsPerDup=1&dedupField=exacturl
> -> INVALID XML
>
> Setting hitsPerDup=2 results in valid XML
>
> Conclusion:
> A specific record in the index contains invalid XML chars, and it is only part
> of the result list when hitsPerDup=1. Setting hitsPerDup=0 and start=10 will
> produce a result list including the invalid XML chars record.
>
> I don't know if the above was of any help to you, I just had to say something
> about it ;-)
>
> Sverre
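The valid/invalid verdicts in the test results above can be reproduced offline with a small well-formedness check. The following is only a sketch: the class name `WellFormedCheck` and the sample strings are made up for illustration, and it parses strings in memory rather than fetching the opensearch URLs from the thread. It relies on the JDK's built-in XML parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of NutchWAX or WERA.
final class WellFormedCheck {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A plain description element, and one carrying a Bell character reference.
        System.out.println(isWellFormed("<description>kniha</description>"));
        System.out.println(isWellFormed("<description>doklad.&#7;</description>"));
    }
}
```

Running the same check over a saved copy of each opensearch response would show which result page contains the offending record.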
|
From: Lukas M. <mat...@ce...> - 2006-02-14 17:22:33
|
> ---------- Forwarded message ----------
> From: stack <st...@ar...>
> To: stack <st...@ar...>
> Date: Sat, 11 Feb 2006 10:59:07 -0800
> Subject: Re: [Archive-access-discuss] Re: nutchwax
>
> Lukáš:
>
> I committed code to undo any html entity encoding found in text to be
> emitted by OpenSearchServlet. I committed on the nutchwax 'release-0_4'
> branch so be careful you get this branch from CVS rather than HEAD if
> building from source. If you just want the WAR with the fix, it's
> available here: http://archive.org/~stack/nutchwax.war. Let me know if
> you want me to make up a complete nutchwax tarball. Let me know if the
> fix works for you (Here's the bug:
> https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137).

It works very well! Good work. I've just downloaded nutchwax.war and it seems to be OK :)

-lm

> This is a band-aid fix until the core issue gets addressed in nutch.
> I'll work on trying to get this done this week.
>
> This is a pretty serious issue. Text snippets -- i.e. the 'description'
> field in the XML -- that have anything but plain ASCII are mangled,
> showing ugly numeric character representations, '&#343;', etc., in place
> of legit UTF-8 characters. It was also possible to bypass our
> legit-xml character checking by encoding illegal characters: e.g. '&#7;'.
> If the fix works for you Lukáš, I'll make a new release of nutchwax with
> the band-aid incorporated later this week (hopefully, by the release of
> the 0.6.0 mapreduce version of NutchWAX, we will have the real fix
> incorporated).
>
> Good stuff,
> St.Ack
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log
> files for problems? Stop! Download the new AJAX search engine that makes
> searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss

-- 
------------------------------
Bc.Lukas Matejka
email:mat...@ce...
GSM:+420777093233
|
From: stack <st...@ar...> - 2006-02-09 17:52:17
|
Lukáš Matějka wrote:
> Hi,
>
> i still can't handle this issue..

Pardon the late reply, Lukáš.

Here seems to be a page with problematic characters:
http://dig.vkol.cz/vz/vz01_12.htm. I get it by following Sverre's recipe
below, adding hitsPerDup=0, etc.

If I get the page via OpenSearchServlet, Firefox complains about the '&#7;'
character in the description field. The ASCII 'Bell' character is illegal
in XML, even when it is represented by a numeric character reference
(here's the grammar for XML Char:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).

I see the 0x07 Bell character in the original page. Below is an 'od'
dump of the relevant section, with each ASCII line underwritten by its hex
representation. The last line has the 0x07 character.

....
0008832   8   0   .   <   /   p   >   <   p   >  nl   <   b   >   M   K
        3830 2e3c 2f70 3e3c 703e 0a3c 623e 4d4b
0008848  sp  c8   R   <   /   b   >  sp   z   a  f8   a   d   i   l   o
        20c8 523c 2f62 3e20 7a61 f861 6469 6c6f
0008864  sp   n   a  sp   s   e   z   n   a   m  sp   n   e   j   c   e
        206e 6120 7365 7a6e 616d 206e 656a 6365
0008880   n   n  ec   j  b9  ed   c   h  sp   d   o   k   l   a   d  f9
        6e6e ec6a b9ed 6368 2064 6f6b 6c61 64f9
0008896   . bel  sp   /   B   i   b   l   e  sp   b   o   s   k   o   v
        2e07 202f 4269 626c 6520 626f 736b 6f76
...

Since the illegal character shows up in the description text as a character
reference, it has probably been encoded earlier in the processing of
the document.

Regardless, the OpenSearchServlet should probably look for such illegal
encodings and just strip them (it's doing this already for raw
characters). Let me try and fix this.

St.Ack

> does anybody know how to help?
> Can NutchWAX produce output with html entities? (Output from NutchWAX should be utf, shouldn't be?)
> Because (in cases written below) invalid xml is caused by special characters in html entities.
>
> thanks for any help
>
> -lm
>
> ______________________________________________________________
>> From: sve...@nb...
>> To: stack <st...@ar...>
>> CC: Lukáš Matějka <mat...@ce...>
>> Date: 12.01.2006 10:38
>> Subject: Re: nutchwax
>>
>> ...
>>
>> Conclusion:
>> A specific record in the index contains invalid XML chars, and it is only part
>> of the result list when hitsPerDup=1. Setting hitsPerDup=0 and start=10 will
>> produce a result list including the invalid XML chars record.
>>
>> Sverre
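The XML Char grammar linked in the message above is short enough to express directly as a predicate. What follows is a minimal sketch of such a filter, assuming the XML 1.0 Char production; the class and method names (`XmlChars`, `isLegal`, `strip`) are invented here and are not the OpenSearchServlet code.

```java
// Hypothetical helper illustrating the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
final class XmlChars {

    /** True if the code point is allowed in an XML 1.0 document. */
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every illegal code point dropped. */
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // unpaired surrogates come back as-is and are dropped
            if (isLegal(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The tail of the od dump: a full stop, a Bell (0x07), then " /Bible".
        System.out.println(strip("doklad\u016F.\u0007 /Bible"));
    }
}
```

Note that this only handles raw characters. A Bell that arrives already encoded as a character reference passes through untouched, which is the gap the message describes.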
|
From: Lukas M. <mat...@ce...> - 2006-02-10 14:52:20
|
On Thu, 9 February 2006 18:51, stack wrote:
> Lukáš Matějka wrote:
> > Hi,
> >
> > i still can't handle this issue..
>
> Pardon the late reply, Lukáš.
>
> Here seems to be a page with problematic characters:
> http://dig.vkol.cz/vz/vz01_12.htm. I get it by following Sverre's recipe
> below, adding hitsPerDup=0, etc.
>
> If I get the page via OpenSearchServlet, Firefox complains about the '&#7;'
> character in the description field. The ASCII 'Bell' character is illegal
> in XML, even when it is represented by a numeric character reference
> (here's the grammar for XML Char:
> http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).
>
> I see the 0x07 Bell character in the original page. Below is an 'od'
> dump of the relevant section, with each ASCII line underwritten by its hex
> representation. The last line has the 0x07 character.

You're absolutely right about the Bell character, but I think there is
another, different problem. I'll try to explain.

If I search for the word 'kniha' (which means 'book') through
http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl

the answer is valid XML (57472 hits for the word 'kniha'), but in the result
the 'description' element contains HTML entities that represent Czech characters
with diacritics, and that's the problem. The original site doesn't contain these
HTML entities, just regular Czech characters.

Interestingly, the 'title' element shows Czech characters correctly, but
the 'description' element shows them as HTML entities (for instance, the HTML
entity &yacute; represents the character y with an acute accent).

Do you have any idea where the problem could be?

l.

-- 
------------------------------
Bc.Lukas Matejka
email:mat...@ce...
GSM:+420777093233
|
From: stack <st...@ar...> - 2006-02-10 19:33:03
|
Lukas Matejka wrote:
> On Thu, 9 February 2006 18:51, stack wrote:
>> ....
>> I see the 0x07 Bell character in the original page. Below is an 'od'
>> dump of the relevant section, with each ASCII line underwritten by its hex
>> representation. The last line has the 0x07 character.
>
> You're absolutely right about the Bell character, but I think there is
> another, different problem. I'll try to explain.
>
> If I search for the word 'kniha' (which means 'book') through
> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>
> the answer is valid XML (57472 hits for the word 'kniha'), but in the result
> the 'description' element contains HTML entities that represent Czech characters
> with diacritics, and that's the problem. The original site doesn't contain these
> HTML entities, just regular Czech characters.
>
> Interestingly, the 'title' element shows Czech characters correctly, but
> the 'description' element shows them as HTML entities (for instance, the HTML
> entity &yacute; represents the character y with an acute accent).
>
> Do you have any idea where the problem could be?

Thanks for the extra info, Lukas.

Digging in, I see that the generation of summaries runs the text through
org.apache.nutch.html.Entities. Here is the code for the
Entities#encode method that all summary text is run through:

    static final public String encode(String s) {
      int length = s.length();
      StringBuffer buffer = new StringBuffer(length * 2);
      for (int i = 0; i < length; i++) {
        char c = s.charAt(i);
        int j = (int)c;
        if (j < 0x100 && encoder[j] != null) {
          buffer.append(encoder[j]);                // have a named encoding
          buffer.append(';');
        } else if (j < 0x80) {
          buffer.append(c);                         // use ASCII value
        } else {
          buffer.append("&#");                      // use numeric encoding
          buffer.append((int)c);
          buffer.append(';');
        }
      }
      return buffer.toString();
    }

Any character above the ASCII range that has no named entity gets a numeric
character encoding. Assuming all is UTF-8 in nutch, we probably don't want HTML
entity encoding when we're outputting UTF-8 XML. In fact, it looks like we don't
want any HTML entity encoding at all when outputting XML.

The call to Entities#encode is buried in nutch inside the Fragment inner
class of Summary. It would take a good bit of work to make up a
NutchBean that called an alternate Summary-maker when outputting XML.

Meantime, I have a quick fix that adds HTML entity decoding to the
NutchWAX OpenSearchServlet. Let me do some more testing and hopefully I
can commit later today. I'll let you know.

St.Ack
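To illustrate the point that UTF-8 XML output needs no HTML entity encoding at all, here is a minimal sketch of an XML-only escaper. It is an assumption about what an alternate summary path could do, not code from nutch or NutchWAX; the name `XmlEscape` is made up. Only the markup characters are escaped, and every other character, including Czech letters with diacritics, is passed through for the UTF-8 serializer.

```java
// Hypothetical alternative to Entities#encode for XML output.
final class XmlEscape {

    /** Escapes only the characters that are markup in XML element content. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);       // non-ASCII stays as a real character
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "být" contains U+00FD, which Entities#encode would turn into &yacute;
        System.out.println(escape("b\u00FDt & <b>kniha</b>"));
    }
}
```

Illegal control characters such as 0x07 would still have to be filtered separately; this sketch only addresses the entity question.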
|
From: stack <st...@ar...> - 2006-02-11 19:00:22
|
Lukáš:

I committed code to undo any HTML entity encoding found in text to be
emitted by OpenSearchServlet. I committed on the nutchwax 'release-0_4'
branch, so be careful you get this branch from CVS rather than HEAD if
building from source. If you just want the WAR with the fix, it's
available here: http://archive.org/~stack/nutchwax.war. Let me know if
you want me to make up a complete nutchwax tarball. Let me know if the
fix works for you (here's the bug:
https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137).

This is a band-aid fix until the core issue gets addressed in nutch.
I'll work on trying to get this done this week.

This is a pretty serious issue. Text snippets -- i.e. the 'description'
field in the XML -- that have anything but plain ASCII are mangled,
showing ugly numeric character representations, '&#343;', etc., in place
of legit UTF-8 characters. It was also possible to bypass our
legit-XML character checking by encoding illegal characters, e.g. '&#7;'.
If the fix works for you Lukáš, I'll make a new release of nutchwax with
the band-aid incorporated later this week (hopefully, by the release of
the 0.6.0 mapreduce version of NutchWAX, we will have the real fix
incorporated).

Good stuff,
St.Ack
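The committed fix itself is not shown in the thread. The sketch below only illustrates the general idea described in the message: turn character references back into real characters and drop the ones that decode to characters illegal in XML. Everything in it is an assumption for illustration: the class name `EntityDecoder`, the tiny table of named entities, and the choice to leave references to `&`, `<` and `>` untouched so the text stays safe to embed in XML.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical decoder; not the code committed to the release-0_4 branch.
final class EntityDecoder {
    private static final Pattern REF =
        Pattern.compile("&(#\\d{1,7}|[A-Za-z][A-Za-z0-9]*);");

    // A deliberately tiny sample of named entities (assumed subset).
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("aacute", 0xE1);
        NAMED.put("iacute", 0xED);
        NAMED.put("yacute", 0xFD);
    }

    /** XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Decodes references to real characters; illegal ones are dropped. */
    static String decode(String s) {
        Matcher m = REF.matcher(s);
        StringBuffer out = new StringBuffer(s.length());
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.charAt(0) == '#') {
                int cp = Integer.parseInt(body.substring(1));
                if (cp == '&' || cp == '<' || cp == '>') {
                    replacement = m.group();   // keep markup characters escaped
                } else if (isLegalXmlChar(cp)) {
                    replacement = new String(Character.toChars(cp));
                } else {
                    replacement = "";          // e.g. &#7; is removed
                }
            } else {
                Integer cp = NAMED.get(body);
                // Unknown names, including amp/lt/gt, are left as they are.
                replacement = (cp != null) ? new String(Character.toChars(cp)) : m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#345;e&#353;en&iacute; &amp; doklad&#367;.&#7; /Bible"));
    }
}
```

A decoder of this kind treats the symptom only, which matches the message's description of the change as a band-aid until summaries stop being entity-encoded upstream.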