You can subscribe to this list here.
|
From:
<mat...@ce...> - 2005-11-02 15:45:28
|
> From: sve...@nb...
> To: arc...@li...
> CC:
> Date: 02.11.2005 14:33
> Subject: RE: [Archive-access-discuss] wera results
>
> Hi there,
> Definitely something wrong in NutchWax. If i execute
> http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
> and click the timeline link of the first hit showing 0/0 hits i get

Where did you find a hit showing 0/0? It works fine for me (I've just explored 150 URLs and found no 0/0 hits). Do you remember the total number of hits? (If it's the same: I experimented with a previous version of NutchWAX, starting Tomcat on various instances.) For the word "kniha" I had:

Total number of versions found : 49087. Displaying URL's 1-10

-lm

> 'Sorry, no documents with the given uri were found'. The url displayed
> seems fine, but if you look in the source of the uppermost frame you
> will see that the url sent to the script was
> http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&amp;start=V.
> The & separating the parameters irj and start has been replaced by its
> html character entity reference.
>
> If i press the go button now the url submitted to the script will be ok.
>
> If i look in the NutchWax result set of the initial search (add &debug=1
> to the search url to bring out the NutchWax search urls) i see that the
> url (link element) returned is wrong already here.
>
> Conclusion : NutchWax mangles the url returned by introducing html
> entities instead of keeping the url in its original form.
>
> What version of NutchWax are you using?
>
> Sverre
>
> -------------------------------------------------------
> SF.Net email is sponsored by:
> Tame your development challenges with Apache's Geronimo App Server. Download
> it for free - -and be entered to win a 42" plasma tv or your very own
> Sony(tm)PSP. Click here to play: http://sourceforge.net/geronimo.php
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
|
From:
<mat...@ce...> - 2005-11-02 15:31:53
|
> From: sve...@nb...
> To: arc...@li...
> CC:
> Date: 02.11.2005 14:33
> Subject: RE: [Archive-access-discuss] wera results
>
> What version of NutchWax are you using?

The latest release. |
|
From: Kristinn S. <kr...@ar...> - 2005-11-02 15:27:26
|
Here is a part of the XML file generated by the opensearch servlet:

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;amp;id=5959</nutch:cache>

Notice the section &amp;amp;. Clearly something is (properly) escaping the string

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&id=5959</nutch:cache>

to

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;id=5959</nutch:cache>

That string is then re-escaped to

<nutch:cache>http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&amp;amp;id=5959</nutch:cache>

A little bit of simple testing with Tomcat didn't indicate that it was doing automatic escaping like this.

We need to identify where the extraneous escaping is being done.

I installed both Tomcat 5.0.28 and NutchWAX 0.4.0 without any changes to their default configuration. I'm also using Sun's Java version 1.5.0_05. To get XML on Tomcat working, I deleted the file %TOMCAT_HOME/common/endorsed/xml-apis.jar. How does this differ from the demo on nwa.nb.no/wera?

-Kris |
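The double escaping described in the message above can be reproduced in a few lines. This is only a sketch: it uses Python's standard `html` module as a stand-in, since the servlet is Java and its actual escaping code is not shown in this thread. The variable names are made up for illustration.

```python
import html

# The cache URL as it should appear before any escaping.
raw_url = "http://tildra.bok.hi.is:8080/nutchwax/cached.jsp?idx=0&id=5959"

# One pass is what well-formed XML needs: & becomes &amp;
once = html.escape(raw_url, quote=False)

# A second pass escapes the & of &amp; again: &amp; becomes &amp;amp;
twice = html.escape(once, quote=False)

print(once)
print(twice)

# A consumer that unescapes once recovers the URL only in the first case.
recovered_from_once = html.unescape(once)
recovered_from_twice = html.unescape(twice)
print(recovered_from_once == raw_url)   # the single-escaped form round-trips
print(recovered_from_twice == raw_url)  # the double-escaped form does not
```

If the result set really is escaped twice, a client that unescapes once (as an XML parser does) would end up with a literal `&amp;` inside the URL, which would fit the 0/0 hits reported for URLs containing `&`.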
|
From:
<mat...@ce...> - 2005-11-02 13:28:31
|
If I use the
http://war.mzk.cz:8080/nutchwax/search.jsp?query=kniha&hitsPerPage=10
interface to NutchWAX, the description looks fine, so I guess the problem is in the opensearch servlet.

l. |
|
From: Sverre B. <sve...@nb...> - 2005-11-02 13:28:31
|
Hi there,

Definitely something wrong in NutchWax. If I execute
http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=
and click the timeline link of the first hit showing 0/0 hits, I get 'Sorry, no documents with the given uri were found'. The URL displayed seems fine, but if you look in the source of the uppermost frame you will see that the URL sent to the script was
http://full.nkp.cz/nkdb/rejstriky/rejstrik.asp?irj=12&amp;start=V.
The & separating the parameters irj and start has been replaced by its HTML character entity reference.

If I press the go button now, the URL submitted to the script will be OK.

If I look in the NutchWax result set of the initial search (add &debug=1 to the search URL to bring out the NutchWax search URLs), I see that the URL (link element) returned is wrong already here.

Conclusion: NutchWax mangles the URL returned by introducing HTML entities instead of keeping the URL in its original form.

What version of NutchWax are you using?

Sverre |
|
From: Sverre B. <sve...@nb...> - 2005-11-02 13:07:05
|
The output from NutchWax is partly mangled. See
http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
where the contents of the description element are garbage while the contents of the title element look fine (!?).

As an example, the text

časnosti Žďárských vrchů a Hornosvratecké hornatiny

(taken from the HTML source of the timeline view) has in the NutchWax description element become

&#269;asnosti &#381;&#271;&aacute;rsk&yacute;ch vrch&#367; a Hornosvrateck&eacute; hornatiny

An observation that may or may not have something to do with this: NutchWax makes a more or less educated guess of the encoding used in the page. For the example it guessed windows-1252, which I believe is closer to iso-8859-1 than to the actual encoding specified in the example source, iso-8859-2.

I'll keep looking.

Sverre |
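To see what a wrong encoding guess of the kind mentioned above would do on its own, here is a small sketch. It assumes the page bytes really are iso-8859-2 and are decoded as windows-1252; whether this is what happens inside NutchWax is not established in this thread.

```python
# The sample text from the message, as the page author intended it.
text = "časnosti Žďárských vrchů a Hornosvratecké hornatiny"

# Bytes as they would be stored for a page declared as iso-8859-2.
raw = text.encode("iso-8859-2")

# Decoding with the declared encoding gives the text back unchanged.
right = raw.decode("iso-8859-2")

# Decoding the same bytes as windows-1252 silently substitutes
# Western European characters for the Czech ones.
wrong = raw.decode("windows-1252")

print(right)
print(wrong)
```

Letters shared by both code pages (á, ý, é) survive, while č, Ž, ď and ů turn into different characters. That is a different symptom from text that shows up as character entity references, so the two effects may need to be examined separately.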
|
From: Kristinn S. <kr...@ar...> - 2005-11-02 12:41:45
|
This looks like the same (or very similar) problem as I've got. I've been discussing it (offlist) with Stack and Sverre Bang, so I know it is being looked into.

I notice in your search results (as in mine) that URIs with & in them are showing up as 0/0 versions. I believe that both problems are due to the escaping (or unescaping) of HTML characters in the NutchWAX XML that is used to pass the results to WERA.

Possibly this is a misconfiguration of either Tomcat or Apache...?

- Kris |
|
From:
<mat...@ce...> - 2005-11-02 11:37:28
|
Hi,

for example
http://war.mzk.cz/~nwa/wera/wera/index.php?query=kniha&year_from=&year_to=

the description of each record is not displayed correctly:

1. SKIP, Moje kniha (http://skip.nkp.cz/akcMojekn.htm)
(<b> ... </b>p&#345;&iacute;stupu k internetu v knihovn&aacute;ch propagovat vyu&#382;it&iacute; internetu p&#345;i zji&#353;&#357;ov&aacute;n&iacute; n&aacute;zor&#367; obyvatel 2. Anketa Pomoc&iacute; kr&aacute;tk&eacute; ankety bude zji&#353;&#357;ov&aacute;na nejobl&iacute;ben&#283;j&#353;&iacute; <b>kniha</b> obyvatel &#268;esk&eacute; republiky. Pojem nejobl&iacute;ben&#283;j&#353;&iacute; <b>kniha</b> je specifikov&aacute;n dal&#353;&iacute;mi v&yacute;klady, jako &quot;<b>kniha</b>, kter&aacute; m&#283; nejv&iacute;ce ovlivnila&quot;, &quot;<b>kniha</b>, ke kter&eacute; se &#269;asto vrac&iacute;m&quot;, &quot;<b>kniha</b>, kterou bych doporu&#269;il/a dobr&yacute;m p&#345;&aacute;tel&#367;m&quot;, &quot;<b>kniha</b>, kter&aacute; zm&#283;nila m&#367;j &#382;ivot&quot;, &quot;<b>kniha</b> na kterou nemohu zapomenout&quot;, &quot;<b>kniha</b>, kter&aacute; mne uvedla do jin&eacute;ho sv&#283;ta&quot;, &quot;<b>kniha</b>, kterou bych si s sebou vzal/a jako jedinou<b> ... </b>)
Versions (matching query/total) 3/3
Timeline | Overview

"p&#345;&iacute;stupu" should be "přístupu" (without diacritics "pristupu").

Does anybody have the same problem?

-lm |
|
From: stack <st...@ar...> - 2005-11-02 00:19:18
|
kau...@cs... wrote:

>Is it possible to rank/sort search results by relevance? Show first
>results
>where search term is in html title, or appears several times in text.
>(versus those where search term appears once, late in text, or
>in a link name).

It should be doing this for you Kaisa. In general, are you not seeing the most significant links showing first in results? I just added a little FAQ on ranking with some notes on how nutch is doing it. I'll repeat the note here:

By default, at query time, the following fields are boosted as follows:

query.url.boost, 4.0f
query.anchor.boost, 2.0f
query.title.boost, 1.5f
query.host.boost, 2.0f
query.phrase.boost, 1.0f

From the above, terms found in a URL are scored high, with anchor text next, then title. You can change the above boosts by editing your nutch-site.xml but in general, the defaults seem to work well for most collections.

Anchor text can make a large contribution to a document ranking score. You can see the anchor text for a page by browsing to the 'explain' page, then editing the URL to put 'anchors.jsp' in place of 'explain.jsp'.

>Does nutchwax index link names within html files? If there's a link
>http://www.something.net/storm.gif withing html , could I search for
>'storm'
>and get this image into result list?

This is an interesting question Kaisa. I just took a look. It doesn't look like it (see below for how I figured this out). Do you need this feature?

Here's how I took a look at what was in a particular nutch segment:

% ./bin/nutch segread -fix -nocontent -dump nutch-data/segments/debord2005-11-01-155531/

This dumps out what nutch has per resource. It will list the text it parsed from the document, the list of outlinks found in the document, the page hash, etc. I compared what was in nutch to what was in the indexed ARC (I zcat'd the ARC).

Yours,
St.Ack

>*Kaisa
>
>
>-------------------------------------------------------
>This SF.Net email is sponsored by the JBoss Inc.
>Get Certified Today * Register for a JBoss Training Course
>Free Certification Exam for All Training Attendees Through End of 2005
>Visit http://www.jboss.com/services/certification for more information
>_______________________________________________
>Archive-access-discuss mailing list
>Arc...@li...
>https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
|
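[Editor's sketch] To make the effect of the boosts quoted in St.Ack's reply concrete, here is a toy illustration in Java. It is not Nutch's or Lucene's actual scoring formula: the class `BoostSketch`, the additive `score` method and the default weight of 1.0 for unlisted fields are all assumptions made for illustration. Only the four field boost values come from the message; the phrase boost is left out because it is not a field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration of per-field query boosts; NOT Nutch's or Lucene's real scoring. */
final class BoostSketch {

    /** Field boosts as quoted in the message (field name -> boost). */
    static Map<String, Float> defaultBoosts() {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("url", 4.0f);
        boosts.put("anchor", 2.0f);
        boosts.put("title", 1.5f);
        boosts.put("host", 2.0f);
        return boosts;
    }

    /**
     * Hypothetical simplification: add up the boost of every field in which
     * the search term matched. Fields without a listed boost count as 1.0.
     */
    static float score(Map<String, Float> boosts, String... matchedFields) {
        float total = 0f;
        for (String field : matchedFields) {
            total += boosts.getOrDefault(field, 1.0f);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = defaultBoosts();
        System.out.println("match in url only:         " + score(boosts, "url"));
        System.out.println("match in title and anchor: " + score(boosts, "title", "anchor"));
        System.out.println("match in body text only:   " + score(boosts, "content"));
    }
}
```

Under this simplification a hit in the URL alone outweighs a hit in title plus anchor text, which matches the ordering described in the reply.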
From: Kaisa K. <kau...@cs...> - 2005-11-01 13:17:06
|
Sorry, I'll correct myself:

If there is an html file http://www.something.net/story.html which contains an inline image with name ...storm.gif, could I search for storm and get http://www.something.net/story.html into search results :)

On Tue, 1 Nov 2005 kau...@cs... wrote:

> Does nutchwax index link names within html files? If there's a link
> http://www.something.net/storm.gif within html, could I search for
> 'storm'
> and get this image into result list?
>
> *Kaisa
 |
|
From: <kau...@cs...> - 2005-11-01 12:47:07
|
Is it possible to rank/sort search results by relevance? Show first results where search term is in html title, or appears several times in text (versus those where search term appears once, late in text, or in a link name).

Does nutchwax index link names within html files? If there's a link http://www.something.net/storm.gif within html, could I search for 'storm' and get this image into result list?

*Kaisa |
|
From: <kau...@cs...> - 2005-11-01 11:16:55
|
Hi,

we have nutchwax&wera 0.4.0, and the scandinavian characters are now working :)

There seem to be some problems with special characters in urls. Wera 0.4.0 doesn't find these urls and shows broken links although the files are in the archive.

(1) This one has a percent sign
http://www.helsinki2005.fi/files/pics/1079364264_mascot%20medium.gif

(2) This contains an ampersand
http://www.helsinki2005.fi/index.php?Name=newsitem&item=449

Looks like all urls with ampersands cause errors here. They are fairly common in addresses, does anyone else see the same problem?

Kaisa |
|
From: <st...@ar...> - 2005-10-28 15:20:58
|
arc...@li... wrote:

>Message: 1
>Date: Thu, 27 Oct 2005 08:52:10 +0300 (EEST)
>From: Kaisa Kaunonen <kau...@cs...>
>To: arc...@li...
>Subject: Re: [Archive-access-discuss] Path to parse-pdf.sh
>
>Yes, editing plugin.xml is of course the right thing to do..
>just suggesting that default value of this path should
>vary according to the local $NUTCHWAX variable, if people install
>indexer out of the box.

Yes Kaisa, it should just work. It shouldn't be necessary to tinker with this one path before you go about indexing (There is an FAQ on this -- http://archive-access.sourceforge.net/projects/nutch/faq.html#pdf -- but I should have put something on this into the setup instructions... I'll fix this).

Setting the path is a little tough. It's a java variable, so it's awkward to exploit environment settings such as NUTCHWAX. I should likely pass the NUTCHWAX setting into java as a system property and then make use of that when composing the path to parse-pdf.sh.

I was also thinking though of redoing the pdf parser since the one we have has a couple of issues:

1. If pdf > 10megs, not indexed; and
2. If http content-length header does not exactly match actual content length, we skip the document (This happens quite frequently).

I was thinking of doing a parse-xpdf plugin to use in place of parse-ext. It wouldn't do things like try to find an external script -- parse-pdf.sh -- to run but would just use the environment's xpdf (though you could override this of course) and it would try to do a better job with big pdf and perhaps incomplete pdf (To be explored).

St.Ack |
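[Editor's sketch] The idea floated in the reply above, passing the NUTCHWAX setting into Java as a system property and composing the script path from it, might look roughly like this. The property name `nutchwax.home` and the class `ParsePdfPath` are hypothetical; nothing in the thread says what NutchWAX actually uses, and only the `bin/parse-pdf.sh` location is taken from the messages.

```java
import java.io.File;

/** Sketch only: composes the path to parse-pdf.sh from a system property. */
final class ParsePdfPath {

    /** Hypothetical property name; the thread does not name one. */
    static final String HOME_PROPERTY = "nutchwax.home";

    /** Builds <home>/bin/parse-pdf.sh, failing loudly if no home directory is given. */
    static String resolve(String nutchwaxHome) {
        if (nutchwaxHome == null || nutchwaxHome.isEmpty()) {
            throw new IllegalStateException(
                "Set -D" + HOME_PROPERTY + "=<NutchWAX install dir>");
        }
        return new File(new File(nutchwaxHome, "bin"), "parse-pdf.sh").getPath();
    }

    public static void main(String[] args) {
        // Intended launch (assumption): java -Dnutchwax.home="$NUTCHWAX" ParsePdfPath
        String home = System.getProperty(HOME_PROPERTY);
        if (home == null) {
            System.err.println("usage: java -D" + HOME_PROPERTY + "=<dir> ParsePdfPath");
            return;
        }
        System.out.println(resolve(home));
    }
}
```

A launcher script that already knows `$NUTCHWAX` could forward it with `-D`, so the hard-coded path in plugin.xml would no longer need editing per installation.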
|
From: Kaisa K. <kau...@cs...> - 2005-10-27 05:52:25
|
Yes, editing plugin.xml is of course the right thing to do.. just suggesting that default value of this path should vary according to the local $NUTCHWAX variable, if people install indexer out of the box.

kaisa

On Wed, 26 Oct 2005, Charles Foetz wrote:

> Hi Kaisa,
>
> you could execute:
>
> find / -name parse-pdf.sh
>
> This will tell you where parse-pdf.sh is located (e.g. /bin/parse-pdf.sh)
>
> Then, in the file ../plugins/parse-ext/plugin.xml,
>
> replace the line
> /home/stack/workspace/archive-access/projects/nutch/bin/parse-pdf.sh
>
> with the real location of the parse-pdf.sh script (e.g. /bin/parse-pdf.sh)
>
> Charlie
>
> change
> ----- Original Message ----- From: <kau...@cs...>
> To: <arc...@li...>
> Sent: Wednesday, October 26, 2005 1:01 PM
> Subject: [Archive-access-discuss] Path to parse-pdf.sh
>
> Hi,
> thanks for the new nutchwax&wera releases!
>
> I'm indexing a small test archive and all pdf files create errors. The
> script parse-pdf.sh is in bin directory, but path to it is hard coded
> somewhere,
> in ../plugins/parse-ext/plugin.xml
>
> 051026 125244 adding 2381478 bytes of mimetype application/pdf
> http://www.helsinki2005.fi/files/pdf/pretraining_camps.pdf
> 051026 125244 Failed parse: java.io.IOException: java.io.IOException:
> /home/stack/workspace/archive-access/projects/nutch/bin/parse-pdf.sh:
> not found
>
> Kaisa
 |
|
From: Sverre B. <sve...@nb...> - 2005-10-26 15:54:25
|
Did you set up Tomcat according to http://archive-access.sourceforge.net/projects/nutch/faq.html#encoding i.e. edit $TOMCAT_HOME/conf/server.xml and add useBodyEncodingForURI=true ?

Sverre

On Wednesday 26 October 2005 16:00, Lukáš Matějka wrote:
> Hi,
>
> I have still same problems with encoding in new release. (I'm not able to
> search for czech characters.) Did you try this issue for other specific
> languages(with special characters)?
>
> In script is something like that:
>
> request.setCharacterEncoding("UTF-8");
>
> but i have to explicitly say encoding, that i want to convert from
>
> String parameter = request.getParameter("query");
> String queryString = new String(parameter.getBytes("ISO-8859-1"),
> "UTF-8");
>
> myabe it's problem on our server or tomcat...?works it for you?
>
> l.
 |
|
From: <mat...@ce...> - 2005-10-26 14:01:07
|
Hi,
I still have the same problems with encoding in the new release. (I'm not able to search for Czech characters.)
Did you test this for other languages with special characters?
The script contains something like this:
request.setCharacterEncoding("UTF-8");
but I have to state explicitly which encoding I want to convert from:
String parameter = request.getParameter("query");
String queryString = new String(parameter.getBytes("ISO-8859-1"), "UTF-8");
Maybe it's a problem with our server or Tomcat? Does it work for you?
l.
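[Editor's sketch] The workaround in the message above relies on ISO-8859-1 mapping every byte value to exactly one character, so a UTF-8 byte sequence that was wrongly decoded as ISO-8859-1 can be recovered without loss. The standalone class below simulates that outside a servlet container; the class and method names are illustrative and not taken from WERA or NutchWAX.

```java
import java.nio.charset.StandardCharsets;

/** Demonstrates the ISO-8859-1 -> UTF-8 repair used in the message above. */
final class EncodingRoundTrip {

    /** Simulates a container that decodes UTF-8 request bytes as ISO-8859-1. */
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** The workaround: get the raw bytes back, then decode them as UTF-8. */
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = "Luk\u00e1\u0161";          // "Lukáš", written with escapes
        String garbled = misdecode(query);         // one char per UTF-8 byte
        System.out.println("garbled length:  " + garbled.length());
        System.out.println("repaired equals: " + repair(garbled).equals(query));
    }
}
```

If the container is configured to decode request parameters as UTF-8 in the first place, the repair step becomes unnecessary and would in fact corrupt correctly decoded input, so it should be applied only when the mis-decoding is known to happen.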
|
|
From: Sverre B. <sve...@nb...> - 2005-10-26 13:30:40
|
I'm pleased to hear that, Lukáš.
If you issue java -version (on the jdk1.5.0_04) what does it return?
I have available here build 1.5.0_04-b05 and build 1.5.0_03-b07 and the installer works fine on both.

Sverre

On Wed, 2005-10-26 at 15:04 +0200, Lukáš Matějka wrote:
> well, issue is fixed.
> i used another version of java(j2sdk1.4.2_03).
> (before i had used jdk1.5.0_04)
>
> l.
>
> ______________________________________________________________
> > From: sve...@nb...
> > To: arc...@li...
> > CC:
> > Date: 25.10.2005 15:10
> > Subject: Re: [Archive-access-discuss] ANN: New release of WERA+NutchWAX, ARC access toolset
> >
> > Sorry to hear that, Lukas.
> > I just downloaded the installer and tried invoking it on a few of the
> > machines i have access to. I tried both local and remote install, it works
> > frustratingly fine. Could you give me some details of your environment?
> > Local or remote install, java version, DISPLAY environment variable, etc.
> > I really want to be able to reproduce this issue, so that we can fix the
> > installer and/or installer how-to.
> >
> > Regards
> > Sverre
> >
> > On Tuesday 25 October 2005 13:13, Lukas Matejka wrote:
> > > Hi,
> > >
> > > i've just downloaded file below and still not working...
> > >
> > > nwa@war:~/download$ java -jar wera-0.4.0-installer.jar text
> > >
> > > (.:29040): Gtk-WARNING **: cannot open display:
> > >
> > > did you update installer in release?
> > >
> > > lukas
> > >
> > > > The initial upload of the java-based WERA installer was corrupt. A
> > > > working installer package is now available from
> > > > http://sourceforge.net/project/showfiles.php?group_id=118427
> > > > (wera-0.4.0-installer.jar).
> > > >
> > > > Installation tips:
> > > >
> > > > * On local host with X available :
> > > > java -jar wera-0.4.0-installer.jar
> > > >
> > > > * Install on remote host (with X available at local and remote host):
> > > > copy the installer to the remote host, log in to the remote host with
> > > > ssh -X user@host (X port forwarding enabled) and start install with
> > > > java -jar wera-0.4.0-installer.jar
> > > >
> > > > * No X available, local or remote install :
> > > > java -jar wera-0.4.0-installer.jar text
> > > >
> > > > Regards
> > > > Sverre Bang
> > > >
> > > > On Saturday 22 October 2005 08:50, stack wrote:
> > > > > This note is to announce a new release 0.4.0 of WERA (WEb aRchive
> > > > > Access), the web archive collection search and navigation tool, and of
> > > > > NutchWAX (Nutch with Web Archive eXtensions), the web archive
> > > > > collection search engine that powers the WERA application (among other
> > > > > things). Use the tools together to access a repository of ARC files.
> > > > >
> > > > > Release 0.4.0 of WERA adds much improved error and encoding handling, a
> > > > > manual as well as an architectural overview document. Packaging has
> > > > > also been improved. See the Release Notes for more detail on changes
> > > > > (and current known limitations) at
> > > > > http://archive-access.sourceforge.net/projects/wera/articles/releasenotes.html.
> > > > >
> > > > > Also, with release 0.4.0, WERA has migrated from its old home in the
> > > > > NWAToolset at nwa.nb.no to a subproject of
> > > > > archive-access.sourceforge.net. The WERA home page is at:
> > > > > http://archive-access.sourceforge.net/projects/wera/.
> > > > >
> > > > > Release 0.4.0 of NutchWAX includes lots of bug fixes and has been built
> > > > > against release 0.7 of nutch. Again see the release notes for details:
> > > > > http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html.
> > > > >
> > > > > For the NutchWAX home page, see:
> > > > > http://archive-access.sourceforge.net/projects/nutch/.
> > > > >
> > > > > A demo of WERA+NutchWAX in operation can be found at
> > > > > http://nwa.nb.no/wera/.
> > > > >
> > > > > Yours,
> > > > > Sverre Bang and Michael Stack
 |
|
From: Charles F. <Cha...@bn...> - 2005-10-26 12:29:00
|
Hi Kaisa,

you could execute:

find / -name parse-pdf.sh

This will tell you where parse-pdf.sh is located (e.g. /bin/parse-pdf.sh)

Then, in the file ../plugins/parse-ext/plugin.xml,

replace the line
/home/stack/workspace/archive-access/projects/nutch/bin/parse-pdf.sh

with the real location of the parse-pdf.sh script (e.g. /bin/parse-pdf.sh)

Charlie

change
----- Original Message ----- From: <kau...@cs...>
To: <arc...@li...>
Sent: Wednesday, October 26, 2005 1:01 PM
Subject: [Archive-access-discuss] Path to parse-pdf.sh

Hi,
thanks for the new nutchwax&wera releases!

I'm indexing a small test archive and all pdf files create errors. The script parse-pdf.sh is in bin directory, but path to it is hard coded somewhere, in ../plugins/parse-ext/plugin.xml

051026 125244 adding 2381478 bytes of mimetype application/pdf http://www.helsinki2005.fi/files/pdf/pretraining_camps.pdf
051026 125244 Failed parse: java.io.IOException: java.io.IOException: /home/stack/workspace/archive-access/projects/nutch/bin/parse-pdf.sh: not found

Kaisa |
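[Editor's sketch] For readers who would rather locate the script from Java than with `find`, the first step of the advice above could be approximated as below. This is a sketch under assumptions: the class `FindScript` and its method are made up for illustration, and the search starts from a directory you pass in rather than from `/`, since walking the whole filesystem may hit unreadable directories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch: recursively look for a file by name, similar to `find <root> -name <file>`. */
final class FindScript {

    /** Returns every regular file under root whose file name equals fileName. */
    static List<Path> findByName(Path root, String fileName) throws IOException {
        try (Stream<Path> paths = Files.find(root, Integer.MAX_VALUE,
                (path, attrs) -> attrs.isRegularFile()
                        && path.getFileName().toString().equals(fileName))) {
            return paths.collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Start directory is an assumption: first argument, else the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path hit : findByName(root, "parse-pdf.sh")) {
            System.out.println(hit.toAbsolutePath());
        }
    }
}
```

Whichever path it prints is the value that would go into ../plugins/parse-ext/plugin.xml in place of the hard-coded /home/stack/... path.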
|
From: Kaisa K. <kau...@cs...> - 2005-10-26 11:33:10
|
Hi,
thanks for the new nutchwax&wera releases!

I'm indexing a small test archive and all pdf files cause errors. The script parse-pdf.sh is in bin directory, but path to it is hard coded somewhere, in ../plugins/parse-ext/plugin.xml

051026 125244 adding 2381478 bytes of mimetype application/pdf http://www.helsinki2005.fi/files/pdf/pretraining_camps.pdf
051026 125244 Failed parse: java.io.IOException: java.io.IOException: /home/stack/workspace/archive-access/projects/nutch/bin/parse-pdf.sh: not found

Kaisa

(Sorry if this email was sent twice to the list.) |
|
From: <kau...@cs...> - 2005-10-26 11:01:56
|
Hi,
thanks for the new nutchwax&wera releases!

I'm indexing a small test archive and all pdf files create errors. The script parse-pdf.sh is in bin directory, but path to it is hard coded somewhere, in ../plugins/parse-ext/plugin.xml

051026 125244 adding 2381478 bytes of mimetype application/pdf http://www.helsinki2005.fi/files/pdf/pretraining_camps.pdf
051026 125244 Failed parse: java.io.IOException: java.io.IOException: /home/stack/workspace/archive-access/projects/nutch/bin/parse-pdf.sh: not found

Kaisa |
|
From: stack <st...@ar...> - 2005-10-25 15:38:41
|
Lukáš Matějka wrote:

>>Do you want to move your nedlibtoarc here onto archive-access? Seems like good
>>place for it. I can make you a subproject and make you admin.
>
>after fixing these bugs, yes:)

Ok. Good luck.
St.Ack

>l.
>
>>St.Ack
 |
|
From: Sverre B. <sve...@nb...> - 2005-10-25 13:07:28
|
Sorry to hear that, Lukas.
I just downloaded the installer and tried invoking it on a few of the machines i have access to. I tried both local and remote install, it works frustratingly fine. Could you give me some details of your environment? Local or remote install, java version, DISPLAY environment variable, etc.
I really want to be able to reproduce this issue, so that we can fix the installer and/or installer how-to.

Regards
Sverre

On Tuesday 25 October 2005 13:13, Lukas Matejka wrote:
> Hi,
>
> i've just downloaded file below and still not working...
>
> nwa@war:~/download$ java -jar wera-0.4.0-installer.jar text
>
> (.:29040): Gtk-WARNING **: cannot open display:
>
> did you update installer in release?
>
> lukas
>
> > The initial upload of the java-based WERA installer was corrupt. A
> > working installer package is now available from
> > http://sourceforge.net/project/showfiles.php?group_id=118427
> > (wera-0.4.0-installer.jar).
> >
> > Installation tips:
> >
> > * On local host with X available :
> > java -jar wera-0.4.0-installer.jar
> >
> > * Install on remote host (with X available at local and remote host):
> > copy the installer to the remote host, log in to the remote host with ssh
> > -X user@host (X port forwarding enabled) and start install with java -jar
> > wera-0.4.0-installer.jar
> >
> > * No X available, local or remote install :
> > java -jar wera-0.4.0-installer.jar text
> >
> > Regards
> > Sverre Bang
> >
> > On Saturday 22 October 2005 08:50, stack wrote:
> > > This note is to announce a new release 0.4.0 of WERA (WEb aRchive
> > > Access), the web archive collection search and navigation tool, and of
> > > NutchWAX (Nutch with Web Archive eXtensions), the web archive
> > > collection search engine that powers the WERA application (among other
> > > things). Use the tools together to access a repository of ARC files.
> > >
> > > Release 0.4.0 of WERA adds much improved error and encoding handling, a
> > > manual as well as an architectural overview document. Packaging has
> > > also been improved. See the Release Notes for more detail on changes
> > > (and current known limitations) at
> > > http://archive-access.sourceforge.net/projects/wera/articles/releasenotes.html.
> > >
> > > Also, with release 0.4.0, WERA has migrated from its old home in the
> > > NWAToolset at nwa.nb.no to a subproject of
> > > archive-access.sourceforge.net. The WERA home page is at:
> > > http://archive-access.sourceforge.net/projects/wera/.
> > >
> > > Release 0.4.0 of NutchWAX includes lots of bug fixes and has been built
> > > against release 0.7 of nutch. Again see the release notes for details:
> > > http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html.
> > >
> > > For the NutchWAX home page, see:
> > > http://archive-access.sourceforge.net/projects/nutch/.
> > >
> > > A demo of WERA+NutchWAX in operation can be found at
> > > http://nwa.nb.no/wera/.
> > >
> > > Yours,
> > > Sverre Bang and Michael Stack
 |
|
From: Lukas M. <mat...@ce...> - 2005-10-25 11:08:08
|
Hi,

i've just downloaded file below and still not working...

nwa@war:~/download$ java -jar wera-0.4.0-installer.jar text

(.:29040): Gtk-WARNING **: cannot open display:

did you update installer in release?

lukas

> The initial upload of the java-based WERA installer was corrupt. A working
> installer package is now available from
> http://sourceforge.net/project/showfiles.php?group_id=118427
> (wera-0.4.0-installer.jar).
>
> Installation tips:
>
> * On local host with X available :
> java -jar wera-0.4.0-installer.jar
>
> * Install on remote host (with X available at local and remote host):
> copy the installer to the remote host, log in to the remote host with ssh -X
> user@host (X port forwarding enabled) and start install with java -jar
> wera-0.4.0-installer.jar
>
> * No X available, local or remote install :
> java -jar wera-0.4.0-installer.jar text
>
> Regards
> Sverre Bang
>
> On Saturday 22 October 2005 08:50, stack wrote:
> > This note is to announce a new release 0.4.0 of WERA (WEb aRchive
> > Access), the web archive collection search and navigation tool, and of
> > NutchWAX (Nutch with Web Archive eXtensions), the web archive collection
> > search engine that powers the WERA application (among other things).
> > Use the tools together to access a repository of ARC files.
> >
> > Release 0.4.0 of WERA adds much improved error and encoding handling, a
> > manual as well as an architectural overview document. Packaging has also
> > been improved. See the Release Notes for more detail on changes (and
> > current known limitations) at
> > http://archive-access.sourceforge.net/projects/wera/articles/releasenotes.html.
> >
> > Also, with release 0.4.0, WERA has migrated from its old home in the
> > NWAToolset at nwa.nb.no to a subproject of
> > archive-access.sourceforge.net. The WERA home page is at:
> > http://archive-access.sourceforge.net/projects/wera/.
> >
> > Release 0.4.0 of NutchWAX includes lots of bug fixes and has been built
> > against release 0.7 of nutch. Again see the release notes for details:
> > http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html.
> >
> > For the NutchWAX home page, see:
> > http://archive-access.sourceforge.net/projects/nutch/.
> >
> > A demo of WERA+NutchWAX in operation can be found at
> > http://nwa.nb.no/wera/.
> >
> > Yours,
> > Sverre Bang and Michael Stack

--
------------------------------
Bc.Lukas Matejka
email:mat...@ce...
GSM:+420777093233 |
|
From: <st...@ar...> - 2005-10-25 00:09:54
|
>From: Lukas Matejka <mat...@ce...>
>...
>
>I've just tested Nutchwax on single machine.
>there are some parameters..
>documents: 2 222 660
>dups:477 234
>begin:13:37:13 CEST 13.10.2005
>end:03:25:12 CEST 16.10.2005

What's that, Lukas -- almost 3 days to do 2 and a quarter million documents? That looks way slow. You're using default nutchwax config (With indexer.maxMergeDocs set to indexer.maxMergeDocs?). Are you doing NedlibToArc conversion at the same time? What kind of machine is it?

>i fixed new version of NedlibToArc2.0 based on arc-1.5.1-200508191341.jar with
>little changes.
>
>http://cvs.sourceforge.net/viewcvs.py/arcwayback/NedlibToArc2.0/

Good stuff. Did you take the original dk stuff and modify it? Do you want to move your nedlibtoarc here onto archive-access? Seems like a good place for it. I can make you a subproject and make you admin.

St.Ack |
|
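[Editor's sketch] The figures quoted in the message above can be turned into an elapsed time and a throughput with a few lines of Java. The begin and end timestamps and the document count are taken from the message; the class name `IndexingRate` is illustrative. Both timestamps are CEST, so plain local date-times are enough.

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Works out elapsed time and documents per second for the indexing run quoted above. */
final class IndexingRate {

    /** Whole seconds between two local timestamps in the same time zone. */
    static long elapsedSeconds(LocalDateTime begin, LocalDateTime end) {
        return Duration.between(begin, end).getSeconds();
    }

    /** Average number of documents processed per second. */
    static double docsPerSecond(long documents, long seconds) {
        return (double) documents / seconds;
    }

    public static void main(String[] args) {
        LocalDateTime begin = LocalDateTime.of(2005, 10, 13, 13, 37, 13);
        LocalDateTime end = LocalDateTime.of(2005, 10, 16, 3, 25, 12);
        long seconds = elapsedSeconds(begin, end);
        System.out.printf("elapsed: %d s (%.2f days)%n", seconds, seconds / 86400.0);
        System.out.printf("rate: %.1f documents/s%n", docsPerSecond(2_222_660L, seconds));
    }
}
```

The run lasted 222,479 seconds, about 61 hours 48 minutes, which works out to roughly 10 documents per second on the single machine.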
From: Sverre B. <sve...@nb...> - 2005-10-23 18:03:52
|
The initial upload of the java-based WERA installer was corrupt. A working installer package is now available from http://sourceforge.net/project/showfiles.php?group_id=118427 (wera-0.4.0-installer.jar).

Installation tips:

* On local host with X available :
java -jar wera-0.4.0-installer.jar

* Install on remote host (with X available at local and remote host):
copy the installer to the remote host, log in to the remote host with ssh -X user@host (X port forwarding enabled) and start install with java -jar wera-0.4.0-installer.jar

* No X available, local or remote install :
java -jar wera-0.4.0-installer.jar text

Regards
Sverre Bang

On Saturday 22 October 2005 08:50, stack wrote:
> This note is to announce a new release 0.4.0 of WERA (WEb aRchive
> Access), the web archive collection search and navigation tool, and of
> NutchWAX (Nutch with Web Archive eXtensions), the web archive collection
> search engine that powers the WERA application (among other things).
> Use the tools together to access a repository of ARC files.
>
> Release 0.4.0 of WERA adds much improved error and encoding handling, a
> manual as well as an architectural overview document. Packaging has also
> been improved. See the Release Notes for more detail on changes (and
> current known limitations) at
> http://archive-access.sourceforge.net/projects/wera/articles/releasenotes.html.
>
> Also, with release 0.4.0, WERA has migrated from its old home in the
> NWAToolset at nwa.nb.no to a subproject of
> archive-access.sourceforge.net. The WERA home page is at:
> http://archive-access.sourceforge.net/projects/wera/.
>
> Release 0.4.0 of NutchWAX includes lots of bug fixes and has been built
> against release 0.7 of nutch. Again see the release notes for details:
> http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html.
>
> For the NutchWAX home page, see:
> http://archive-access.sourceforge.net/projects/nutch/.
>
> A demo of WERA+NutchWAX in operation can be found at
> http://nwa.nb.no/wera/.
>
> Yours,
> Sverre Bang and Michael Stack
 |
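[Editor's sketch] The installation tips distinguish between hosts with and without an X display. As a rough illustration of how a Java program can make that distinction itself, the sketch below falls back to text mode when the JVM reports a headless environment or when `text` is passed as the first argument. This is not the WERA installer's actual logic, which the thread does not show; the class `InstallerMode` and its method are hypothetical.

```java
import java.awt.GraphicsEnvironment;

/** Sketch: choose between a GUI and a text-mode install; not the real WERA installer logic. */
final class InstallerMode {

    /** Text mode if no display is usable or if "text" was given as the first argument. */
    static String chooseMode(boolean headless, String[] args) {
        boolean textRequested = args.length > 0 && "text".equals(args[0]);
        return (headless || textRequested) ? "text" : "gui";
    }

    public static void main(String[] args) {
        // GraphicsEnvironment.isHeadless() reports whether a display, keyboard
        // and mouse can be supported in this environment.
        System.out.println(chooseMode(GraphicsEnvironment.isHeadless(), args));
    }
}
```

Keeping the decision in a pure function of two inputs makes it easy to check the three cases from the tips (local X, forwarded X, no X) without needing a display at test time.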