From: Sverre B. <sve...@nb...> - 2005-11-02 13:28:31
Hi there,

Definitely something wrong in NutchWAX.

If I execute
http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
and click the timeline link of the first hit showing 0/0 hits, I get
'Sorry, no documents with the given uri were found'.

The URL displayed seems fine, but if you look in the source of the
uppermost frame you will see that the URL sent to the script was
http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&amp;start=V
The & separating the parameters irj and start has been replaced by its
HTML character entity reference. If I press the go button now, the URL
submitted to the script will be OK.

If I look in the NutchWAX result set of the initial search (add &debug=1
to the search URL to bring out the NutchWAX search URLs), I see that the
URL (link element) returned is already wrong here.

Conclusion: NutchWAX mangles the URL it returns by introducing HTML
entities instead of keeping the URL in its original form.

What version of NutchWAX are you using?

Sverre

On Wed, 2005-11-02 at 12:41 +0000, Kristinn Sigurdsson wrote:
> This looks like the same (or very similar) problem as I've got. I've
> been discussing it (offlist) with Stack and Sverre Bang, so I know it
> is being looked into.
>
> I notice in your search results (as in mine) that URIs with & in them
> are showing up as 0/0 versions. I believe that both problems are due
> to the escaping (or unescaping) of HTML characters in the NutchWAX XML
> that is used to pass the results to WERA.
>
> Possibly this is a misconfiguration of either Tomcat or Apache...?
>
> - Kris
>
> > -----Original Message-----
> > From: arc...@li...
> > [mailto:arc...@li...]
> > On Behalf Of Lukáš Matějka
> > Sent: 2 November 2005 11:21
> > To: arc...@li...
> > Subject: [Archive-access-discuss] wera results
> >
> > Hi,
> >
> > for example
> > http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> >
> > The description of each record is not displayed correctly:
> >
> > 1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
> > (&lt;b&gt; ... &lt;/b&gt;p&amp;#345;&amp;iacute;stupu k internetu v
> > knihovn&amp;aacute;ch propagovat vyu&amp;#382;it&amp;iacute; internetu
> > p&amp;#345;i zji&amp;scaron;&amp;#357;ov&amp;aacute;n&amp;iacute;
> > n&amp;aacute;zor&amp;#367; obyvatel 2. Anketa Pomoc&amp;iacute;
> > kr&amp;aacute;tk&amp;eacute; ankety bude
> > zji&amp;scaron;&amp;#357;ov&amp;aacute;na
> > nejobl&amp;iacute;ben&amp;#283;j&amp;scaron;&amp;iacute;
> > &lt;b&gt;kniha&lt;/b&gt; obyvatel &amp;#268;esk&amp;eacute; republiky.
> > Pojem nejobl&amp;iacute;ben&amp;#283;j&amp;scaron;&amp;iacute;
> > &lt;b&gt;kniha&lt;/b&gt; je specifikov&amp;aacute;n
> > dal&amp;scaron;&amp;iacute;mi v&amp;yacute;klady, jako
> > &amp;quot;&lt;b&gt;kniha&lt;/b&gt;, kter&amp;aacute; m&amp;#283;
> > nejv&amp;iacute;ce ovlivnila&amp;quot;,
> > &amp;quot;&lt;b&gt;kniha&lt;/b&gt;, ke kter&amp;eacute; se
> > &amp;#269;asto vrac&amp;iacute;m&amp;quot;,
> > &amp;quot;&lt;b&gt;kniha&lt;/b&gt;, kterou bych doporu&amp;#269;il/a
> > dobr&amp;yacute;m p&amp;#345;&amp;aacute;tel&amp;#367;m&amp;quot;,
> > &amp;quot;&lt;b&gt;kniha&lt;/b&gt;, kter&amp;aacute; zm&amp;#283;nila
> > m&amp;#367;j &amp;#382;ivot&amp;quot;,
> > &amp;quot;&lt;b&gt;kniha&lt;/b&gt; na kterou nemohu
> > zapomenout&amp;quot;, &amp;quot;&lt;b&gt;kniha&lt;/b&gt;,
> > kter&amp;aacute; mne uvedla do jin&amp;eacute;ho
> > sv&amp;#283;ta&amp;quot;, &amp;quot;&lt;b&gt;kniha&lt;/b&gt;, kterou
> > bych si s sebou vzal/a jako jedinou&lt;b&gt; ... &lt;/b&gt;)
> > Versions (matching query/total) 3/3
> > Timeline | Overview
> >
> > "p&#345;&iacute;stupu" should be "přístupu" (without diacritics
> > "pristupu").
> >
> > Does anybody have the same problem?
> >
> > -lm
> >
> > -------------------------------------------------------
> > SF.Net email is sponsored by:
> > Tame your development challenges with Apache's Geronimo App Server.
> > Download it for free - -and be entered to win a 42" plasma tv or your
> > very own Sony(tm)PSP. Click here to play:
> > http://sourceforge.net/geronimo.php
> > _______________________________________________
> > Archive-access-discuss mailing list
> > Arc...@li...
> > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
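
For anyone following the thread, the escaping problem described above can be reproduced in a few lines of Python. This is only a sketch and not NutchWAX or WERA code: `query_params` is a hypothetical helper standing in for the receiving script, and the failure mode shown (using the escaped text from the XML as a URL without unescaping it once) is an assumption based on the symptoms reported here. The second half shows how escaping a snippet twice leaves entity text visible after a single unescape, which would match the description display problem.

```python
from html import escape, unescape
from urllib.parse import urlsplit


def query_params(url):
    """Hypothetical helper: split the query string on '&' only."""
    query = urlsplit(url).query
    return dict(pair.split("=", 1) for pair in query.split("&") if pair)


# URL from the thread, in its original form.
original_url = "http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&start=V"

# Escaping '&' is correct while the URL sits inside XML or HTML markup.
escaped_url = escape(original_url, quote=False)

# Assumed failure mode: the escaped text is used as a URL as-is,
# so the 'start' parameter arrives under the name 'amp;start'.
broken = query_params(escaped_url)

# Unescaping exactly once restores the original parameters.
fixed = query_params(unescape(escaped_url))

# Escaping twice: one unescape still leaves visible entity text.
once = escape("<b>kniha</b>", quote=False)
twice = escape(once, quote=False)

print(escaped_url)
print(broken)
print(fixed)
print(unescape(twice))
print(unescape("p&#345;&iacute;stupu"))
```

If the assumption holds, the fix belongs on whichever side currently produces or consumes one escaping level too many; the sketch does not say which component that is.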