From: Kristinn S. <kr...@ar...> - 2005-11-02 15:27:26
Here is a part of the XML file generated by the opensearch servlet:

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;amp;id=5959</nutch:cache>

Notice the section &amp;amp;. Clearly something is (properly) escaping the string

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&id=5959</nutch:cache>

to

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;id=5959</nutch:cache>

That string is then re-escaped to

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;amp;id=5959</nutch:cache>

A little bit of simple testing with Tomcat didn't indicate that it was doing automatic escaping like this.

We need to identify where the extraneous escaping is being done.

I installed both Tomcat 5.0.28 and NutchWAX 0.4.0 without any changes to their default configuration. I'm also using Sun's Java version 1.5.0_05. To get XML on Tomcat working, I deleted the file %TOMCAT_HOME/common/endorsed/xml-apis.jar. How does this differ from the demo on nwa.nb.no/wera?

-Kris

> -----Original Message-----
> From: arc...@li...
> [mailto:arc...@li...]
> On Behalf Of Sverre Bang
> Sent: 2. nóvember 2005 13:27
> To: arc...@li...
> Subject: RE: [Archive-access-discuss] wera results
>
>
> Hi there,
> Definitely something wrong in NutchWax. If I execute
> http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> and click the timeline link of the first hit showing 0/0 hits I get
> 'Sorry, no documents with the given uri were found'. The URL displayed
> seems fine, but if you look in the source of the uppermost frame you
> will see that the URL sent to the script was
> http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&amp;start=V.
> The & separating the parameters irj and start has been replaced by its
> HTML character entity reference.
>
> If I press the go button now the URL submitted to the script
> will be ok.
>
> If I look in the NutchWax result set of the initial search (add &debug=1
> to the search URL to bring out the NutchWax search URLs), I see that the
> URL (link element) returned is wrong already here.
>
> Conclusion: NutchWax mangles the URL returned by introducing HTML
> entities instead of keeping the URL in its original form.
>
> What version of NutchWax are you using?
>
> Sverre
>
> On Wed, 2005-11-02 at 12:41 +0000, Kristinn Sigurdsson wrote:
> > This looks like the same (or very similar) problem as I've got. I've
> > been discussing it (offlist) with Stack and Sverre Bang, so I know it
> > is being looked into.
> >
> > I notice in your search results (as in mine) that URIs with & in them
> > are showing up as 0/0 versions. I believe that both problems are due
> > to the escaping (or unescaping) of HTML characters in the NutchWAX XML
> > that is used to pass the results to WERA.
> >
> > Possibly this is a misconfiguration of either Tomcat or Apache...?
> >
> > - Kris
> >
> > > -----Original Message-----
> > > From: arc...@li...
> > > [mailto:arc...@li...]
> > > On Behalf Of Lukáš Matějka
> > > Sent: 2. nóvember 2005 11:21
> > > To: arc...@li...
> > > Subject: [Archive-access-discuss] wera results
> > >
> > >
> > > Hi,
> > >
> > > for example
> > > http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> >
> > description of each record is not well-displayed
> >
> > 1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
> > (<b> ... </b>přístupu k internetu v knihovnách propagovat využití
> > internetu při zjišťování názorů obyvatel 2. Anketa Pomocí krátké ankety
> > bude zjišťována nejoblíbenější <b>kniha</b> obyvatel České republiky.
> > Pojem nejoblíbenější <b>kniha</b> je specifikován dalšími výklady, jako
> > "<b>kniha</b>, která mě nejvíce ovlivnila", "<b>kniha</b>, ke které se
> > často vracím", "<b>kniha</b>, kterou bych doporučil/a dobrým přátelům",
> > "<b>kniha</b>, která změnila můj život", "<b>kniha</b> na kterou nemohu
> > zapomenout", "<b>kniha</b>, která mne uvedla do jiného světa",
> > "<b>kniha</b>, kterou bych si s sebou vzal/a jako jedinou<b> ... </b>)
> > Versions (matching query/total) 3/3
> > Timeline | Overview
> >
> > "přístupu" should be "přístupu" (without diacritics "pristupu")
> >
> > does anybody have same problem?
> >
> > -lm
> >
> >
> >
> > -------------------------------------------------------
> > SF.Net email is sponsored by:
> > Tame your development challenges with Apache's Geronimo App Server. Download
> > it for free - -and be entered to win a 42" plasma tv or your very own
> > Sony(tm)PSP. Click here to play: http://sourceforge.net/geronimo.php
> > _______________________________________________
> > Archive-access-discuss mailing list
> > Arc...@li...
> > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
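
For anyone trying to reproduce the double escaping discussed in this thread, here is a minimal sketch in Python. It is only an illustration of the symptom, not the NutchWAX or Tomcat code path (which is Java and has not been identified yet). It assumes the raw cache URL contains a bare & and uses the standard library's xml.sax.saxutils helpers to stand in for whatever component is doing the escaping.

```python
from xml.sax.saxutils import escape, unescape

# Raw cache URL with a bare '&' between the parameters
# (taken from the <nutch:cache> element quoted in this thread).
raw_url = "http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&id=5959"

# First pass: correct XML escaping, '&' becomes '&amp;'.
escaped_once = escape(raw_url)

# Second pass: the extraneous escaping, '&amp;' becomes '&amp;amp;'.
escaped_twice = escape(escaped_once)

# An XML consumer undoes exactly one level of escaping, so it ends up
# with the once-escaped form rather than the raw URL.
seen_by_consumer = unescape(escaped_twice)

print(escaped_once)
print(escaped_twice)
print(seen_by_consumer == raw_url)
```

If the consumer then uses that value as a URL, the parameter separator is the literal text &amp; instead of &, which would be consistent with the 0/0 version counts reported for URIs containing &.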