|
From:
<mat...@ce...> - 2005-11-02 11:37:28
|
Hi,

for example
http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=

the description of each record is not displayed correctly:

1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
(<b> ... </b>p&#345;&iacute;stupu k internetu v knihovn&aacute;ch propagovat vyu&#382;it&iacute; internetu p&#345;i zji&#353;&#357;ov&aacute;n&iacute; n&aacute;zor&#367; obyvatel 2. Anketa Pomoc&iacute; kr&aacute;tk&eacute; ankety bude zji&#353;&#357;ov&aacute;na nejobl&iacute;ben&#283;j&#353;&iacute; <b>kniha</b> obyvatel &#268;esk&eacute; republiky. Pojem nejobl&iacute;ben&#283;j&#353;&iacute; <b>kniha</b> je specifikov&aacute;n dal&#353;&iacute;mi v&yacute;klady, jako &quot;<b>kniha</b>, kter&aacute; m&#283; nejv&iacute;ce ovlivnila&quot;, &quot;<b>kniha</b>, ke kter&eacute; se &#269;asto vrac&iacute;m&quot;, &quot;<b>kniha</b>, kterou bych doporu&#269;il/a dobr&yacute;m p&#345;&aacute;tel&#367;m&quot;, &quot;<b>kniha</b>, kter&aacute; zm&#283;nila m&#367;j &#382;ivot&quot;, &quot;<b>kniha</b> na kterou nemohu zapomenout&quot;, &quot;<b>kniha</b>, kter&aacute; mne uvedla do jin&eacute;ho sv&#283;ta&quot;, &quot;<b>kniha</b>, kterou bych si s sebou vzal/a jako jedinou<b> ... </b>)
Versions (matching query/total) 3/3
Timeline | Overview

"p&#345;&iacute;stupu" should be "přístupu" (without diacritics "pristupu").

Does anybody have the same problem?

-lm |
|
From: Kristinn S. <kr...@ar...> - 2005-11-02 12:41:45
|
This looks like the same (or very similar) problem as I've got. I've been discussing it (offlist) with Stack and Sverre Bang, so I know it is being looked into.

I notice in your search results (as in mine) that URIs with & in them are showing up as 0/0 versions. I believe that both problems are due to the escaping (or unescaping) of HTML characters in the NutchWAX XML that is used to pass the results to WERA.

Possibly this is a misconfiguration of either Tomcat or Apache...?

- Kris |
|
From: Sverre B. <sve...@nb...> - 2005-11-02 13:28:31
|
Hi there,

Definitely something wrong in NutchWax. If I execute
http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
and click the timeline link of the first hit showing 0/0 hits, I get 'Sorry, no documents with the given uri were found'. The URL displayed seems fine, but if you look in the source of the uppermost frame you will see that the URL sent to the script was
http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&amp;start=V
The & separating the parameters irj and start has been replaced by its HTML character entity reference.

If I press the go button now, the URL submitted to the script will be OK.

If I look in the NutchWax result set of the initial search (add &debug=1 to the search URL to bring out the NutchWax search URLs), I see that the URL (link element) returned is already wrong here.

Conclusion: NutchWax mangles the returned URL by introducing HTML entities instead of keeping the URL in its original form.

What version of NutchWax are you using?

Sverre |
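To illustrate why a link containing `&amp;` ends up with no matching documents, here is a minimal Python sketch. It is not WERA or NutchWax code; the helper `query_params` is hypothetical and simply splits the query string on `&`, as a simple server-side script might. With the entity left in place, the second parameter is named `amp;start` instead of `start`.

```python
import html
from urllib.parse import urlsplit


def query_params(url: str) -> dict:
    """Hypothetical helper: split the query string on '&' only."""
    query = urlsplit(url).query
    return dict(pair.split("=", 1) for pair in query.split("&") if "=" in pair)


# URL as reported in the frame source: '&' has become the entity '&amp;'.
mangled = "http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&amp;start=V"

# Decoding the character reference once recovers the original URL.
restored = html.unescape(mangled)

print(query_params(mangled))   # second key is 'amp;start', not 'start'
print(query_params(restored))  # keys are 'irj' and 'start'
```

Unescaping on the receiving side only hides the symptom; the report above suggests the link is already wrong in the NutchWax result set.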
|
From: Kristinn S. <kr...@ar...> - 2005-11-02 15:27:26
|
Here is a part of the XML file generated by the opensearch servlet:

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;amp;id=5959</nutch:cache>

Notice the section &amp;amp;. Clearly something is (properly) escaping the string

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&id=5959</nutch:cache>

to

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;id=5959</nutch:cache>

That string is then re-escaped to

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;amp;id=5959</nutch:cache>

A little bit of simple testing with Tomcat didn't indicate that it was doing automatic escaping like this.

We need to identify where the extraneous escaping is being done.

I installed both Tomcat 5.0.28 and NutchWAX 0.4.0 without any changes to their default configuration. I'm also using Sun's Java version 1.5.0_05. To get XML on Tomcat working, I deleted the file %TOMCAT_HOME/common/endorsed/xml-apis.jar. How does this differ from the demo on nwa.nb.no/wera?

-Kris |
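The double escaping described above can be reproduced with a short Python sketch. This only demonstrates the effect using the standard library; it does not show where NutchWAX or Tomcat performs the second escape. The element name `cache` is a stand-in for `nutch:cache` so the snippet needs no namespace declaration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_url = "http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&id=5959"

once = escape(raw_url)  # correct for XML: '&' becomes '&amp;'
twice = escape(once)    # the bug: '&amp;' becomes '&amp;amp;'


def element_text(escaped: str) -> str:
    """Parse a one-element document and return its text as a client would see it."""
    return ET.fromstring(f"<cache>{escaped}</cache>").text


print(element_text(once))   # the original URL with a bare '&'
print(element_text(twice))  # still contains '&amp;' after parsing
```

An XML parser undoes exactly one level of escaping, so a client reading the doubly escaped feed still receives `&amp;` inside the URL.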
|
From: Sverre B. <sve...@nb...> - 2005-11-02 13:07:05
|
The output from nutchwax is partly mangled. See
http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
where the contents of the description element are garbage while the contents of the title element look fine (!?).

As an example, the text

časnosti Žďárských vrchů a Hornosvratecké hornatiny

(taken from the HTML source of the timeline view) has in the nutchwax description element become

&#269;asnosti &#381;&#271;&aacute;rsk&yacute;ch vrch&#367; a Hornosvrateck&eacute; hornatiny

An observation that may or may not have something to do with this: NutchWax makes a more or less educated guess of the encoding used in the page. For the example it guessed windows-1252, which I believe is closer to iso-8859-1 than to the actual encoding specified in the example source, iso-8859-2.

I'll keep looking.

Sverre |
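As a side illustration of the encoding observation, the Python sketch below shows what happens when bytes written as iso-8859-2 are decoded as windows-1252. It is only a demonstration of the mismatch between the two code pages, not a claim about the cause of the entity problem; the helper name `decode_as` is made up for this example.

```python
def decode_as(text: str, actual: str, guessed: str) -> str:
    """Encode text in its real encoding, then decode it with the guessed one."""
    return text.encode(actual).decode(guessed)


word = "přístupu"

# Correct guess: the text survives the round trip.
print(decode_as(word, "iso-8859-2", "iso-8859-2"))

# Wrong guess: byte 0xF8 is 'ř' in iso-8859-2 but 'ø' in windows-1252.
misread = decode_as(word, "iso-8859-2", "windows-1252")
print(misread)
```

Letters that the two code pages share, such as "í", come through unchanged, which is why a wrong guess can go unnoticed on pages with few Central European characters.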
|
From: stack <st...@ar...> - 2005-11-03 01:09:06
|
Lukáš Matějka wrote:

>Hi,
>
>for example
>http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
>
>description of each record is not well-displayed
>
>does anybody have same problem?

Did you change something Lukáš? When I browse to the link given above, the display looks correct: i.e. See "přístupu" in the below (Hopefully this mail makes it across preserving original characters).

St.Ack

*1. SKIP, Moje kniha* (http://skip.nkp.cz/akcMojekn.htm)
(* ... *přístupu k internetu v knihovnách propagovat využití internetu při zjišťování názorů obyvatel 2. Anketa Pomocí krátké ankety bude zjišťována nejoblíbenější *kniha* obyvatel České republiky. Pojem nejoblíbenější *kniha* je specifikován dalšími výklady, jako "*kniha*, která mě nejvíce ovlivnila", "*kniha*, ke které se často vracím", "*kniha*, kterou bych doporučil/a dobrým přátelům", "*kniha*, která změnila můj život", "*kniha* na kterou nemohu zapomenout", "*kniha*, která mne uvedla do jiného světa", "*kniha*, kterou bych si s sebou vzal/a jako jedinou* ... *)
Versions (matching query/total) 3/3
*Timeline <http://war.mzk.cz/%7Enwa/wera/wera/result.php?time=20041212180928&url=httpINDX3AINDX2FINDX2FskipINDXDOTnkpINDXDOTczINDX2FakcMojeknINDXDOThtm> | Overview <http://war.mzk.cz/%7Enwa/wera/wera/overview.php?url=httpINDX3AINDX2FINDX2FskipINDXDOTnkpINDXDOTczINDX2FakcMojeknINDXDOThtm>* |