From: stack <st...@ar...> - 2006-02-09 17:52:17
Lukáš Matějka wrote:
> Hi,
>
> i still can't handle this issue..

Pardon the late reply, Lukáš.

Here seems to be a page with problematic characters:
http://dig.vkol.cz/vz/vz01_12.htm. I get it by following Sverre's recipe
below, adding hitsPerDup=0, etc.

If I get the page via opensearchservlet, Firefox complains about an
illegal character in the description field. The ASCII 'Bell' character
is illegal in XML, even when it's represented by a numeric character
reference (here's the grammar for XML Char:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).

I see the 0x07 Bell character in the original page. Below is an 'od'
dump of the relevant section, with each ASCII line underwritten by its
hex representation. The last line has the 0x07 character.

....
0008832   8   0   .   <   /   p   >   <   p   >  nl   <   b   >   M   K
           3830    2e3c    2f70    3e3c    703e    0a3c    623e    4d4b
0008848  sp  c8   R   <   /   b   >  sp   z   a  f8   a   d   i   l   o
           20c8    523c    2f62    3e20    7a61    f861    6469    6c6f
0008864  sp   n   a  sp   s   e   z   n   a   m  sp   n   e   j   c   e
           206e    6120    7365    7a6e    616d    206e    656a    6365
0008880   n   n  ec   j  b9  ed   c   h  sp   d   o   k   l   a   d  f9
           6e6e    ec6a    b9ed    6368    2064    6f6b    6c61    64f9
0008896   . bel  sp   /   B   i   b   l   e  sp   b   o   s   k   o   v
           2e07    202f    4269    626c    6520    626f    736b    6f76
...

Since the illegal character shows up in the description text as a
character reference, it has probably been encoded earlier in the
processing of the document.

Regardless, the opensearchservlet should probably look for such illegal
encodings and just strip them (it's doing this already for raw
characters). Let me try and fix this.

St.Ack

> does anybody know how to help?
> Can NutchWAX produce output with HTML entities? (Output from NutchWAX
> should be UTF, shouldn't it?)
> Because (in the cases written below) the invalid XML is caused by
> special characters in HTML entities.
>
> thanks for any help
>
> -lm
>
> ______________________________________________________________
>> From: sve...@nb...
>> To: stack <st...@ar...>
>> CC: Lukáš Matějka <mat...@ce...>
>> Date: 12.01.2006 10:38
>> Subject: Re: nutchwax
>>
>> Hi Michael, Lukáš ..
>>
>> On Thursday 12 January 2006 01:33, stack wrote:
>>> Lukáš Matějka wrote:
>> ...
>>>> what's the difference between these cases?
>>>>
>>>> 1)
>>>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&start=0&hitsPerDup=0&hitsPerPage=10&dedupField=exacturl
>>>> -> output is not valid xml (called from WERA)
>>>>
>>>> 2)
>>>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>>>> -> output is valid xml (called from Nutchwax search.jsp)
>>
>> If i try the above urls i find quite the opposite! Case 1 produces
>> valid XML, case 2 produces invalid XML.
>>
>> Test results:
>>
>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD
>> -> valid XML
>>
>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD
>> -> valid XML
>>
>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=0&dedupField=exacturl
>> -> valid XML
>>
>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&hitsPerDup=0&dedupField=exacturl
>> -> valid XML
>>
>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l+louck%C3%BD&hitsPerDup=1&dedupField=exacturl
>> -> INVALID XML
>>
>> http://war.mzk.cz:8080/nutchwax/opensearch?query=gradu%C3%A1l%20louck%C3%BD&hitsPerDup=1&dedupField=exacturl
>> -> INVALID XML
>>
>> Setting hitsPerDup=2 results in valid XML
>>
>> Conclusion:
>> A specific record in the index contains invalid XML chars, and it is
>> only part of the result list when hitsPerDup=1.
>> Setting hitsPerDup=0 and start=10 will produce a result list
>> including the record with the invalid XML chars.
>>
>> I don't know if the above was of any help to you, i just had to say
>> something about it ;-)
>>
>> Sverre
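
For illustration only, the stripping proposed in the reply at the top (drop raw characters outside the XML 1.0 Char production, and also numeric character references that resolve to such characters) could be sketched roughly as below. This is a Python sketch, not the opensearchservlet code; the names is_xml_char and strip_illegal_xml are hypothetical, and the sample string is made up to resemble the Bell case from the dump.

```python
import re


def is_xml_char(cp: int) -> bool:
    """Return True if code point cp is allowed by the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Numeric character references: hexadecimal (&#x7;) or decimal (&#7;).
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def _drop_bad_ref(match):
    """Keep a reference only if it resolves to a legal XML character."""
    hex_digits, dec_digits = match.group(1), match.group(2)
    cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return match.group(0) if is_xml_char(cp) else ""


def strip_illegal_xml(text: str) -> str:
    """Hypothetical helper: remove illegal raw characters and illegal references."""
    # Pass 1: drop raw characters that XML 1.0 does not allow (e.g. 0x07 Bell).
    cleaned = "".join(ch for ch in text if is_xml_char(ord(ch)))
    # Pass 2: drop references to illegal characters. Repeat until stable,
    # because removing one reference can expose another one.
    while True:
        result = _CHAR_REF.sub(_drop_bad_ref, cleaned)
        if result == cleaned:
            return result
        cleaned = result


if __name__ == "__main__":
    # Made-up sample resembling the description text with an encoded Bell.
    print(strip_illegal_xml("doklad\u016f.&#7; /Bible"))
```

Legal references such as &#233; or &#10; pass through untouched; only references whose code point falls outside the Char production are dropped, which matches the point made above that encoding Bell as a reference does not make it legal.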