archive-access-discuss Mailing List for Web Archive Access Utilities (Page 43)

Brought to you by: binzino, bradtofel, gojomo, ia_igor, and 5 others

archive-access-discuss — General discussion about archive-access projects


[Archive-access-discuss] test nutchwax + nedlibToArc2.0

From: Lukas M. <mat...@ce...> - 2005-10-23 12:58:39

I've just tested NutchWAX on a single machine.
Here are some parameters:
documents: 2 222 660
dups: 477 234
begin: 13:37:13 CEST 13.10.2005
end: 03:25:12 CEST 16.10.2005
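
For a rough sense of scale, the figures above can be turned into an elapsed time and an average rate. This is only a back-of-the-envelope sketch: it assumes the "dups" count is included in the "documents" total, which the message does not state, and the variable names are illustrative.

```python
from datetime import datetime

# Figures copied from the message; both timestamps are CEST, so no
# time-zone conversion is needed for the difference.
begin = datetime(2005, 10, 13, 13, 37, 13)
end = datetime(2005, 10, 16, 3, 25, 12)
documents = 2_222_660
dups = 477_234

elapsed_seconds = (end - begin).total_seconds()
docs_per_second = documents / elapsed_seconds

# Assumption: duplicates are counted as part of the document total.
dup_share = dups / documents

print(f"elapsed: {elapsed_seconds / 3600:.1f} hours")
print(f"average rate: {docs_per_second:.2f} documents/second")
print(f"duplicate share: {dup_share:.1%}")
```

The run lasted a little under 62 hours, which works out to roughly 10 documents per second on the single machine.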


I've fixed up a new version of NedlibToArc2.0, based on
arc-1.5.1-200508191341.jar, with a few small changes:

http://cvs.sourceforge.net/viewcvs.py/arcwayback/NedlibToArc2.0/


lukas

[Archive-access-discuss] ANN: New release of WERA+NutchWAX, ARC access toolset

From: stack <st...@du...> - 2005-10-22 20:51:05

This note is to announce a new release 0.4.0 of WERA (WEb aRchive
Access), the web archive collection search and navigation tool, and of
NutchWAX (Nutch with Web Archive eXtensions), the web archive collection
search engine that powers the WERA application (among other things).  
Use the tools together to access a repository of ARC files.

Release 0.4.0 of WERA adds much improved error and encoding handling, a
manual as well as an architectural overview document. Packaging has also
been improved. See the Release Notes for more detail on changes (and
current known limitations) at
http://archive-access.sourceforge.net/projects/wera/articles/releasenotes.html. 

Also, with release 0.4.0, WERA has migrated from its old home in the
NWAToolset at nwa.nb.no to a subproject of
archive-access.sourceforge.net.  The WERA home page is at:
http://archive-access.sourceforge.net/projects/wera/.

Release 0.4.0 of NutchWAX includes lots of bug fixes and has been built
against release 0.7 of nutch. Again see the release notes for details:
http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html.   

For the NutchWAX home page, see:
http://archive-access.sourceforge.net/projects/nutch/.

A demo of WERA+NutchWAX in operation can be found at
http://nwa.nb.no/wera/.

Yours,
Sverre Bang and Michael Stack

[Archive-access-discuss] On nutchwax not indexing images

From: stack <st...@ar...> - 2005-10-04 02:08:47

Below is a snippet from the mail Charlie Foetz sent to this list last week.

Comments inline.


> ==================================================================
> PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
> ==================================================================
> 
> 8) No images indexed?
> =====================
> 

I just downloaded HEAD and it seems to be indexing images fine.


....

> 
> So I look in the indexarcs output file and notice I have plenty of entries
> like this:
> 
> (...)
> 050929 115748 adding 4223 bytes of mimetype image/jpeg
> http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
> 050929 115748 Failed parse: Content-Type not text/html: image/jpeg
> (...)
> 
When I read the above, it makes me think that the configuration is incorrect. It's tricky getting it right. The above seems to imply that the html parser is the last parser plugin to run, whereas HEAD goes out of its way to run the default-parser last (it looks like the config is the default nutch config rather than the nutchwax config).
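
To see how widespread the problem is, the "Failed parse" lines in an indexarcs log can be tallied per MIME type. The sketch below assumes the log lines look like the ones quoted above; the function name and the regular expression are illustrative and not part of NutchWAX.

```python
import re
from collections import Counter

# Matches the log message quoted above and captures the MIME type.
FAILED_PARSE = re.compile(r"Failed parse: Content-Type not text/html: (\S+)")


def failed_parse_counts(lines):
    """Count 'Failed parse' log entries per reported MIME type."""
    counts = Counter()
    for line in lines:
        match = FAILED_PARSE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_log = [
    "050929 115748 adding 4223 bytes of mimetype image/jpeg",
    "050929 115748 Failed parse: Content-Type not text/html: image/jpeg",
]
print(failed_parse_counts(sample_log))
```

If nearly every non-HTML type shows up in the tally, that would be consistent with the html parser being applied to everything, as suspected above.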

Check out this FAQ: http://archive-access.sourceforge.net/projects/nutch/faq.html#default_parser

Try using one of the bundles from our continuous build server.  It has the most recent builds of nutchwax on it.  Look under the 'build artifacts' link on this page: http://crawltools.archive.org:8080/cruisecontrol/buildresults/HEAD-archive-access.

(I'm adding a link to the continuous build server on the nutchwax site.)

St.Ack

[Fwd: Re: [Archive-access-cvs] Re: [Archive-access-discuss] WERA / Nutchwax - bugs, problems and questions from Luxembourg]

From: <st...@du...> - 2005-10-03 22:43:27

Attachments: Re: [Archive-access-cvs] Re: [Archive-access-discuss] WERA / Nutchwax - bugs,problems and questions from Luxembourg

Below message was meant for the archive-access-discuss list.
St.Ack

Re: [Archive-access-discuss] WERA / Nutchwax - bugs, problems and questions from Luxembourg

From: Sverre B. <sve...@nb...> - 2005-09-30 07:21:35

Hi Charlie, great feedback!
I'm on my way to San Francisco to work with Michael on further integration of
Wera and NutchWax. I'll get back to you next week.

Btw. not much point in getting latest wera from cvs, no new improvements
there, sorry.

Sverre

On Thursday 29 September 2005 18:16, Charles Foetz wrote:
> ==================================================================
> PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
> ==================================================================
>
> 8) No images indexed?
> =====================
>
> When I index my collection with NutchWax head CVS BUILD, no images appear
> at all.
>
> One method has been suggested here to see if a file is in the archive:
> >Setting $conf_debug to 1 in /lib/config.inc and changing index.php
> >
> >from $search->setFieldsInResult("teaser url description");
> >to   $search->setFieldsInResult("teaser url description archiveidentifier");
>
> When I do this and query for one of the many non-displayed images (e.g.
> "gouvernement.gif") I get
>
> [1] => Array
>         (
>             [teaser] => http://www.gouvernement.lu/pictures/layout/gouvernement.gif
>             [url] => http://www.gouvernement.lu/pictures/layout/gouvernement.gif
>             [archiveidentifier] => //arc/.arc.gz
>         )
>
> So I look in the indexarcs output file and notice I have plenty of entries
> like this:
>
> (...)
> 050929 115748 adding 4223 bytes of mimetype image/jpeg http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
> 050929 115748 Failed parse: Content-Type not text/html: image/jpeg
> (...)
>
> and towards the end of the file:
>
> (...)
> 050929 125148 No collection for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No arcname for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No arcoffset for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No collection for url http://www.adr.lu/Norden/koepp_port.jpg
> 050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
> 050929 125148 No arcoffset for url http://www.adr.lu/Norden/koepp_port.jpg
> (...)
>
> I didn't have these lines before (when I indexed with the released nutchwax
> as opposed to the cvs build)
>
> Any ideas on how this is possible or what it means? Why do my images not
> have an archiveidentifier? My indexing process must have been wrong I
> guess?
>
> bin/indexarcs.sh -c elections -s /arc/ -d /usr/share/archive-access/projects/nutchwax_head/nutch-data-29sep/ &>index_arc_elections_29sep.log
>
> What is a typical indexarcs.sh command line meant to look like instead?
>
> ==================================
>
> One more question: Is there a version of WERA newer than the 0.2.2 release
> going somewhere (via cvs, for instance) that's worth getting (ie with any
> substantial changes)? If so, what commands or steps need to be executed to
> use it?
>
> That's all for now :)
>
> Looking forward to reading your comments
>
> ==================================
> Charlie Foetz
> Bibliothèque nationale de Luxembourg
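
The "No collection / No arcname / No arcoffset" entries in the quoted log can be grouped by URL to see which records lack which fields. This is a minimal sketch that assumes the log wording shown in the quote; the helper name is hypothetical and nothing here is part of NutchWAX.

```python
import re
from collections import defaultdict

# \s+ also matches a line break, so entries wrapped by a mail client
# (field on one line, URL on the next) are still recognised.
MISSING_FIELD = re.compile(r"No (collection|arcname|arcoffset) for url\s+(\S+)")


def missing_fields_by_url(log_text):
    """Map each URL to the set of fields the indexer reported as missing."""
    missing = defaultdict(set)
    for field, url in MISSING_FIELD.findall(log_text):
        missing[url].add(field)
    return dict(missing)


sample = """
050929 125148 No collection for url http://www.adr.lu/Norden/koepp_port.jpg
050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
050929 125148 No arcoffset for url
http://www.adr.lu/Norden/koepp_port.jpg
"""
report = missing_fields_by_url(sample)
for url, fields in report.items():
    print(url, sorted(fields))
```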

[Archive-access-discuss] WERA / Nutchwax - bugs, problems and questions from Luxembourg

From: Charles F. <Cha...@bn...> - 2005-09-29 16:16:19

Hello!

We (Bibliothèque nationale de Luxembourg) are still newbies in the world of
web archiving, pretty much taking our first steps, and for a prototype/test
project we've chosen Luxembourg's regional elections, taking place 9th of
October.

The set-up:

We've got a small collection of .arc files, crawled and archived by
Heritrix 1.4. I am the only human resource for this project (and also work
on other projects), so we're quite limited resourcewise. I'm now at the
stage of trying to interface (partly to be able to see if everything has
been crawled) this .arc collection to the "users", which at this stage is
the library staff. I'm using WERA 0.2.2, running on Apache 2, and I've got
both the nutchwax release 0.2.1 and the CVS head nutchwax (September 25)
running on Tomcat 5. Java is 1.5.0.

Here are the problems I am experiencing after a first look at WERA/Nutchwax
(well, after a couple of weeks of messing about with the releases and cvs
builds, rather =)


===========================================================
PART 1 - INDEXED WITH THE RELEASE VERSION 0.2.1 OF NUTCHWAX
===========================================================


1) Inline redirected images
===========================

The URL www.csv.lu was part of a domain-scoped crawl. Many inline images
from this domain are not displayed. One example:

http://kayltetange.csv.lu/index.html has a few inline images:

<img height="101" alt="image334.jpg" src=
"http://kayltetange.csv.lu/fotoen/image334.jpg" width="136" align="baseline"/> (1)
<img style="WIDTH: 886px; HEIGHT: 686px" height="1050" alt="l__iffr__chen_036internet.JPG" src=
"http://kayltetange.csv.lu/fotoen/l__iffr__chen_036internet.JPG" width="1400" align="baseline"/> (2)
<img style="WIDTH: 74px; HEIGHT: 77px" height="52" alt="image3731.jpg" src=
"http://kayltetange.csv.lu/fotoen/image3731.jpg" width="49" align="baseline"/> (3)

A search for the filename shows that these images are in my collection, but
with URLs

http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image334.jpg (4)
http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/l__iffr__chen_036internet.JPG (5)
http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image3731.jpg (6)

Opening the URLs (1-3) in a browser on the "live" web redirects me
immediately to (4-6).

What I suppose happened is that Heritrix tried fetching (1-3), got a
redirect back, and therefore fetched and archived (4-6). Now when WERA
retrieves (1-3) it doesn't find them, since these URLs were never archived.

I don't know what could be a workaround for this, but I suppose it can be a
serious problem. Would it also happen with redirected html pages?
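
One conceivable workaround is to record the redirects seen at crawl time and follow them at lookup time. The sketch below only illustrates that idea: the `redirects` table, the `archived` set and the function name are hypothetical, and it does not describe how WERA or Heritrix actually behave.

```python
def resolve_archived_url(url, archived, redirects, max_hops=5):
    """Follow recorded redirects until a URL that was archived is reached.

    archived  -- set of URLs that have a stored response
    redirects -- dict mapping a requested URL to the URL it redirected to
    Returns the archived URL, or None if the chain ends or loops.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if url in archived:
            return url
        if url in seen or url not in redirects:
            return None
        seen.add(url)
        url = redirects[url]
    return None


requested = "http://kayltetange.csv.lu/fotoen/image334.jpg"
stored = "http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image334.jpg"

found = resolve_archived_url(requested, {stored}, {requested: stored})
print(found)
```

The same lookup would apply to redirected HTML pages, since nothing in it depends on the content type.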


2) Need for URL canonicalisation in WERA?
=========================================

On the "live" web:

The main (home) page of http://www.csv.lu has a "Newsletter" link to
http://www.csv.lu/newsletter.
The main page also has links to dozens of regional subsites of the party
(e.g. http://bettembourg.csv.lu/), which are all in pretty much the same
design as the main page, with some links including the "Newsletter" one.

BUT: Most of these regional subsites have their "Newsletter" link pointing
to http://csv.lu/newsletter.

Heritrix didn't archive this a second time.

Result: "Sorry, no documents with the given uri were found" when clicking
"Newsletter" on the archived regional sites.
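
A minimal sketch of the kind of canonicalisation that would make both "Newsletter" links resolve to the same key. The rules here (lowercase the scheme and host, strip a leading "www.", drop the fragment) are assumptions chosen for illustration, not the rules used by Heritrix, NutchWAX or WERA.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Reduce a URL to a lookup key so trivially different forms match."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep an explicit port if one was given.
    netloc = f"{host}:{parts.port}" if parts.port else host
    path = parts.path or "/"
    # The fragment is dropped; the query string is kept untouched.
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))


print(canonicalize("http://www.csv.lu/newsletter"))
print(canonicalize("http://csv.lu/newsletter"))
```

Both calls yield the same key, so an index keyed on the canonical form would find the page whichever variant a link used. Note that stripping "www." is only safe when both host names really serve the same content.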


3) Dynamic pages / question marks in the URL
============================================

I've read about this bug some time ago - is it supposed to have been fixed?

As soon as there is one question mark (or a '+' sign, or others?) in a URL
the page can't be retrieved. Say I search for "Juncker"... I get:

-----------------------------

1. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique (http://hesper.csv.lu/2004/juncker.html)
( ... CSV Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CSV lokal Juncker on Tour zu Hesper am Centre Civique Zeréck CSV. De séchere Wee. Rufft eis un um 22 57 31-1 ... )
Number of versions satisfying query / total number of versions : 1/1
Timeline | Overview

2. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique (http://hesper.csv.lu/index.php?print=1&a=2004/juncker.html)
( ... CSV Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CSV lokal Juncker on Tour zu Hesper am Centre Civique Zeréck CSV. De séchere Wee. Rufft eis un um 22 57 31-1 ... )
Number of versions satisfying query / total number of versions : 0/0
Timeline | Overview


3. CSV - Jean-Claude Juncker sur France Inter (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition)
( ... ministre Jean-Claude Juncker au sujet du raz-de-marée en Asie du Sud-est France Inter: Bonjour, Jean-Claude Juncker. Jean-Claude Juncker: Oui, bonjour. France Inter: En tant que président en exercice de l'Union européenne, vous étiez présent jeudi dernier aux ... importants puisqu'il s'agit d'une région du monde qui nous est très proche. France Inter: Merci, Jean-Claude Juncker. Merci, Monsieur le Président. Jean-Claude ... )
Number of versions satisfying query / total number of versions : 0/0
Timeline | Overview

4. CSV - Jean-Claude Juncker sur France Inter (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition?print=1) (...)
5. CSV - Interview mam Jean-Claude Juncker (http://www.csv.lu/text/2133.html/Frank+Engel?print=1)(...)
6. CSV - Interview mam Jean-Claude Juncker (http://www.csv.lu/text/2133.html/Frank+Engel)(...)
7. CSV - Edmée Juncker verabschiedet sich als Präsidentin (http://www.csv.lu/text/1978.html/Marco+Schank)(...)
8. CSV - Edmée Juncker verabschiedet sich als Präsidentin (http://www.csv.lu/text/1978.html/Marco+Schank?print=1)(...)
9. CSV - Drei Fragen an Jean-Claude Juncker (http://www.csv.lu/text/2212.html/Claude+Wiseler)(...)
10. CSV - Drei Fragen an Jean-Claude Juncker (http://www.csv.lu/text/2212.html/Luc+Frieden)(...)

-------------------------

Results 2-10 all show me "Sorry, no documents with the given uri were
found". They also have "total number of versions 0/0".

The only link that retrieves anything is the first one. But even here: the
page I get has a set of thumbnails which are only displayed for about 0.1
seconds and then disappear (I guess because of JavaScript replacing the
links with links pointing to within the collection..). A look at the source
code of the page shows that these pictures should be:

juncker/JoTt-(01).jpg
juncker/JoTt-(02).jpg
...

So I search for "JoTt-(01).jpg"...

and get 2 hits:

 Total number of versions found : 2. Displaying URL's 1-2
1. http://hesper.csv.lu/juncker/JoTt-(01).jpg (http://hesper.csv.lu/juncker/JoTt-(01).jpg)
(CSV CSV lokal Fehler: Dï¿½s Sï¿½t existï¿½ert net! CSV. De séchere Wee. Rufft eis un um 22 57 31-1 oder schéckt eng Email op csv@csv.)
Number of versions satisfying query / total number of versions : 0/0
Timeline | Overview

2. http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg (http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg)
( ... http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg)
Number of versions satisfying query / total number of versions : 0/0
Timeline | Overview

Again, both not retrievable. Same goes for any other pictures with
brackets (and possibly some other non-"a-z|A-Z|0-9" characters) in the
filename.
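
One possible explanation for this pattern, offered only as a guess, is that the archived URL is passed along as a request parameter without being percent-encoded, so '?', '&' and '+' are interpreted by the receiving side. The sketch below demonstrates that failure mode in general terms; the parameter name `url` is hypothetical and this is not WERA's actual code path.

```python
from urllib.parse import parse_qs, quote

target = "http://hesper.csv.lu/index.php?print=1&a=2004/juncker.html"

# Naive embedding: the '&' inside the target splits it into two parameters.
naive_query = "url=" + target
naive_value = parse_qs(naive_query)["url"][0]

# Percent-encoding every reserved character keeps the target intact.
encoded_query = "url=" + quote(target, safe="")
encoded_value = parse_qs(encoded_query)["url"][0]

# A '+' in an unencoded value is decoded as a space.
plus_value = parse_qs("url=http://www.csv.lu/text/2133.html/Frank+Engel")["url"][0]

print(naive_value)
print(encoded_value)
print(plus_value)
```

With the naive form, the lookup would be made for a truncated or altered URL, which would match nothing in the index and could produce a "0/0 versions" result like the ones above.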


4) Special characters
=====================

This has repeatedly been reported as fixed but there is still trouble:

Searching for "Edmée" (in case that doesn't display fine: e-d-m-eacute-e)
gives me hits but ONLY if I manually set the Encoding of my browser (Firefox)
to "Windows 1252" or "ISO 8859-1". If I do that, then enter "Edmée", and
then Search, I get a page with results,

BUT

the Search box now says "Edm?e" and Character encoding has been set back to
UTF-8. If I now do another search (say "français") I again get "no hits!".
I'd have to set the Character encoding back manually before each search.
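
The behaviour described is what one would expect if the search term is sent in one encoding and interpreted in another. The sketch below shows how the same word differs on the wire in ISO 8859-1 and UTF-8, and how ISO 8859-1 bytes read as UTF-8 lose the accented letter. It illustrates the general mechanism only and makes no claim about where in WERA or NutchWAX the mismatch occurs.

```python
from urllib.parse import quote

term = "Edmée"

# The same search term as it would appear in a query string.
as_latin1 = quote(term, encoding="latin-1")
as_utf8 = quote(term, encoding="utf-8")

# ISO 8859-1 bytes interpreted as UTF-8: the lone 0xE9 byte is not valid
# UTF-8, so it becomes the replacement character (often shown as '?').
misread = term.encode("latin-1").decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
print(misread)
```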


5) XML error: reference to invalid character number at line 34
==============================================================
For some searches (on collections indexed with nutchwax release 0.2.1) I
get only the above error message as result. The source code:

*****START*****

    <!-- ************************ Results: ****************************************************** -->
<table align="center" class="greyborder" border="0" cellspacing="0" cellpadding="1" width="90%">
  <tr>
    <td>

      <table align="center" class="resultsborder" border="0" cellspacing="0" cellpadding="10" width="100%">
        <tr>
          <td>

XML error: reference to invalid character number at line 34


*****END*****

That's the last line (HTML generation by php is cut off there)

A look into catalina.out :

*****START*****

050929 163012 12 query request from 192.168.6.21
050929 163012 12 query: Juncker
050929 163012 12 searching for 20 raw hits
050929 163012 12 re-searching for 40 raw hits, query: juncker -exacturl:"ZUKNZ3J2N7I5Z3A2MEYYU6PP7M" -exacturl:"HY5Q6TJQ7YL2VFYHAJXT7SYMPY" -exacturl:"LDA5RUE6G6T46A2SEBDHQQ4JAQ" -exacturl:"X6LW4F7OYOFF6NXMC3WKOJVHJY" -exacturl:"WGBI4JQ3RXDYOBBAXWO4ZHCSQY"
050929 163012 12 found 10476 raw hits
050929 163012 12 total hits: 10496

*****END*****
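
XML parsers typically report "reference to invalid character number" when a numeric character reference such as `&#11;` names a code point that XML 1.0 does not allow. Whether that is what the search response contained here is not known. Under that assumption, a minimal sketch of a filter that drops such references before parsing (function names are illustrative):

```python
import re
import xml.etree.ElementTree as ET

# Decimal (&#11;) and hexadecimal (&#xB;) character references.
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def is_xml10_char(cp):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def drop_invalid_char_refs(xml_text):
    """Remove numeric character references that XML 1.0 forbids."""
    def replace(match):
        ref = match.group(1)
        cp = int(ref[1:], 16) if ref.startswith("x") else int(ref)
        return match.group(0) if is_xml10_char(cp) else ""
    return CHAR_REF.sub(replace, xml_text)


bad = "<hit>Juncker&#11; on Tour &#233;</hit>"
cleaned = drop_invalid_char_refs(bad)
print(ET.fromstring(cleaned).text)
```

Filtering on the consumer side only hides the symptom; the cleaner fix would be for whatever produces the XML to avoid emitting control characters in the first place.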

6) Wrong re-setting of Character encoding
=========================================

On the "live" web, www.gouvernement.lu has character encoding UTF-8.
Every time you reload the page it sets it to this.

In my archived collection, every time I retrieve a page from this URL,
encoding is always set back to ISO 8859-1. The page, being in French, is
therefore pretty much unreadable and you have to set the encoding back
to UTF-8 manually after every click.
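
For readers who have not seen this kind of mismatch, here is a small Python sketch of what happens to accented characters when bytes written in one encoding are read in the other. It only illustrates the symptom; where in the chain (archived headers, WERA, the browser) the encoding gets switched is not something this sketch can tell. The function names are hypothetical:

```python
def misread_as_latin1(text):
    """Return text as it looks when its UTF-8 bytes are decoded as ISO 8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def misread_as_utf8(text):
    """Return text as it looks when its ISO 8859-1 bytes are decoded as UTF-8."""
    # A lone ISO 8859-1 accented byte is not valid UTF-8, so it is replaced
    # by U+FFFD (often displayed as "?").
    return text.encode("iso-8859-1").decode("utf-8", errors="replace")


# A UTF-8 page displayed as ISO 8859-1: every accented letter becomes two.
garbled_page_word = misread_as_latin1("Edmée")

# An ISO 8859-1 form value read as UTF-8: the accented letter is lost.
garbled_query = misread_as_utf8("français")

print(garbled_page_word)
print(garbled_query)
```

The first case corresponds to the unreadable French pages described here; the second resembles the "Edm?e" seen in the search box, although that similarity is only suggestive.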


7) Immediate re-direct to "live" web
====================================
URL http://www.lsap.lu (in my seeds list) is a redirect to
http://www.lsap.lu/index.php?idusergroup=42114236.

When I retrieve http://www.lsap.lu/ from my collection, WERA immediately
displays the live web page. Besides that, *every* link on www.lsap.lu
includes variables (question marks) and is hence unretrievable (see (3)).


==================================================================
PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
==================================================================

8) No images indexed?
=====================

When I index my collection with NutchWax head CVS BUILD, no images
appear at all.

One method has been suggested here to see if a file is in the archive:

>Setting $conf_debug to 1 in /lib/config.inc and changing index.php
>
>from $search->setFieldsInResult("teaser url description");
>to   $search->setFieldsInResult("teaser url description archiveidentifier");

When I do this and query for one of the many non-displayed images (e.g.
"gouvernement.gif") I get

[1] => Array
        (
            [teaser] => http://www.gouvernement.lu/pictures/layout/gouvernement.gif
            [url] => http://www.gouvernement.lu/pictures/layout/gouvernement.gif
            [archiveidentifier] => //arc/.arc.gz
        )

So I look in the indexarcs output file and notice I have plenty of
entries like this:

(...)
050929 115748 adding 4223 bytes of mimetype image/jpeg http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
050929 115748 Failed parse: Content-Type not text/html: image/jpeg
(...)

and towards the end of the file:

(...)
050929 125148 No collection for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
050929 125148 No arcname for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
050929 125148 No arcoffset for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
050929 125148 No collection for url http://www.adr.lu/Norden/koepp_port.jpg
050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
050929 125148 No arcoffset for url http://www.adr.lu/Norden/koepp_port.jpg
(...)

I didn't have these lines before (when I indexed with the released
nutchwax as opposed to the CVS build).

Any ideas on how this is possible or what it means? Why do my images not
have an archiveidentifier? My indexing process must have been wrong, I
guess?

bin/indexarcs.sh -c elections -s /arc/ -d /usr/share/archive-access/projects/nutchwax_head/nutch-data-29sep/ &>index_arc_elections_29sep.log

What is a typical indexarcs.sh command line meant to look like instead?
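
One way to read the "//arc/.arc.gz" value, offered as a guess rather than a documented format: the only complete identifier quoted on this list (aid=9826040//home/wera/arcs/IAH-...arc.gz) looks like an offset, a slash, and the path of the arc file. Under that assumption the value above has an empty offset and an empty arc name, which would agree with the "No arcname" and "No arcoffset" log lines. A small Python sketch that checks identifiers on that assumed layout; the function names are hypothetical:

```python
import posixpath

ARC_SUFFIX = ".arc.gz"


def split_archive_identifier(aid):
    """Split an identifier into (offset, arc_path), ASSUMING '<offset>/<path>'."""
    offset, _, arc_path = aid.partition("/")
    return offset, arc_path


def missing_parts(aid):
    """List which parts of the identifier look empty under that assumption."""
    offset, arc_path = split_archive_identifier(aid)
    name = posixpath.basename(arc_path)
    if name.endswith(ARC_SUFFIX):
        name = name[: -len(ARC_SUFFIX)]
    problems = []
    if not offset.isdigit():
        problems.append("arcoffset")
    if not name:
        problems.append("arcname")
    return problems


# Value reported for gouvernement.gif above.
broken = missing_parts("//arc/.arc.gz")

# Complete identifier as quoted elsewhere on this list.
complete = missing_parts(
    "9826040//home/wera/arcs/IAH-20041216115204-00024-utvikling1.nb.no.arc.gz"
)

print(broken, complete)
```

Running such a check over the debug printout would at least show how many hits lack an offset or arc name, independent of what the indexing problem turns out to be.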

==================================

One more question: Is there a version of WERA newer than the 0.2.2
release available somewhere (via cvs, for instance) that's worth getting
(i.e. with any substantial changes)? If so, what commands or steps need to
be executed to use it?

That's all for now :)

Looking forward to reading your comments

==================================
Charlie Foetz
Bibliothèque nationale de Luxembourg

RE: [Archive-access-discuss] result.php with certain urls causes problems

From: Kaisa K. <kau...@cs...> - 2005-09-14 06:22:02

Yes, that's it. Total number of versions is always 0/0 with
these urls.

kaisa

On Tue, 13 Sep 2005, Sverre Bang wrote:

> Hi, 
> I'm glad to hear that Wera is working for you.
> 
> Oh yes, we have seen this behaviour. It is most likely related to the bug 
> http://sourceforge.net/tracker/index.php?func=detail&aid=1244875&group_id=118427&atid=681137.
> 
> Question: Does the hit in the result list that brought you to this 
> particular url show 0/0 hits? If so then this is definitely related to 
> the bug above. If not and you want to determine whether or not a 
> particular url is archived, you can do the following:
> 
> 1. open the file index.php and change the line 
>    $search->setFieldsInResult("teaser url description"); to 
>    $search->setFieldsInResult("teaser url description archiveidentifier");
> 2. Open the file lib/config.inc and set $conf_debug to 1
> 3. Execute the query and select "view source" in your browser. Scroll down until you find the result list printout (in comment brackets). Find the hit you are looking for and copy the value of archiveidentifier.
> 4. Open a browser and enter the value of $conf_document_retriever from config.inc appended with the archiveidentifier value from 3 (something like: http://yourhost:8080/ArcRetriever/ArcRetriever?aid=9826040//home/wera/arcs/IAH-20041216115204-00024-utvikling1.nb.no.arc.gz&reqtype=getfile).
> 5. If the file is in the archive the retriever will return it to your browser.
> 
> After doing this, you should reset $conf_debug (or else the output from 
> wera won't be pretty)
> 
> I'm going to SF first week of October to work with Michael on Wera/NutchWax.
> The above bug will be one of the targets I guess.
> 
> Sverre
> 
> -----Original Message-----
> From: arc...@li...
> [mailto:arc...@li...]On Behalf Of
> kau...@cs...
> Sent: Tuesday, September 13, 2005 11:21 AM
> To: arc...@li...
> Subject: [Archive-access-discuss] result.php with certain urls causes
> problems
> 
> 
> 
> Hi,
> a wera archive has been up here for some time now and it looks very
> usable.
> (nutchwax-0.2.1 and wera-0.2.2)
> 
> I was wondering about some documents which I can't get to screen from
> the archive. Somehow it seems that urls with question marks result in
> error messages (document not found etc). Or is it the equals sign that
> causes it?
> 
> A request of this type results in error
> http://nwa5a.lib.helsinki.fi/wera/result.php?time=..something..&
> mode=..something..&url=http://www.helsinginsanomat.fi/helsinki2005/
> tulokset/?pvm=13.08.2005/
> 
> Or using result.php with urls like these
> url=http://www.varaslahto.net/phpBB2/viewtopic.php?p=2320
> url=http://www.hel.fi/tourism/en/tapahtumat_lisa.asp?kieli=en&id=3320&sivu=paa
> 
> Or maybe these docs really are not in my archive.. has anyone else seen
> similar behaviour?

RE: [Archive-access-discuss] result.php with certain urls causes problems

From: Sverre B. <Sve...@nb...> - 2005-09-13 10:47:37

Hi,
I'm glad to hear that Wera is working for you.

Oh yes, we have seen this behaviour. It is most likely related to the bug
http://sourceforge.net/tracker/index.php?func=detail&aid=1244875&group_id=118427&atid=681137.

Question: Does the hit in the result list that brought you to this
particular url show 0/0 hits? If so then this is definitely related to
the bug above. If not and you want to determine whether or not a
particular url is archived, you can do the following:

1. Open the file index.php and change the line
   $search->setFieldsInResult("teaser url description"); to
   $search->setFieldsInResult("teaser url description archiveidentifier");
2. Open the file lib/config.inc and set $conf_debug to 1
3. Execute the query and select "view source" in your browser. Scroll
   down until you find the result list printout (in comment brackets). Find
   the hit you are looking for and copy the value of archiveidentifier.
4. Open a browser and enter the value of $conf_document_retriever from
   config.inc appended with the archiveidentifier value from 3 (something
   like:
   http://yourhost:8080/ArcRetriever/ArcRetriever?aid=9826040//home/wera/arcs/IAH-20041216115204-00024-utvikling1.nb.no.arc.gz&reqtype=getfile).
5. If the file is in the archive the retriever will return it to your
   browser.
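
Step 4 can be sketched as a small helper. This is Python purely for illustration (WERA is PHP), and the value used for $conf_document_retriever is an assumption read off the example URL in step 4; check config.inc for the real one. The function name is hypothetical:

```python
# ASSUMED value of $conf_document_retriever, inferred from the example URL.
DOCUMENT_RETRIEVER = "http://yourhost:8080/ArcRetriever/ArcRetriever?aid="


def retriever_url(document_retriever, archive_identifier):
    """Append the archiveidentifier and the getfile request type (step 4)."""
    return f"{document_retriever}{archive_identifier}&reqtype=getfile"


# archiveidentifier copied from the debug printout in step 3.
aid = "9826040//home/wera/arcs/IAH-20041216115204-00024-utvikling1.nb.no.arc.gz"
url = retriever_url(DOCUMENT_RETRIEVER, aid)
print(url)
```

Pasting the printed URL into a browser is then the test described in step 5.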

After doing this, you should reset $conf_debug (or else the output from
wera won't be pretty)

I'm going to SF first week of October to work with Michael on
Wera/NutchWax. The above bug will be one of the targets I guess.

Sverre

-----Original Message-----
From: arc...@li...
[mailto:arc...@li...]On Behalf Of
kau...@cs...
Sent: Tuesday, September 13, 2005 11:21 AM
To: arc...@li...
Subject: [Archive-access-discuss] result.php with certain urls causes
problems



Hi,
a wera archive has been up here for some time now and it looks very
usable.
(nutchwax-0.2.1 and wera-0.2.2)

I was wondering about some documents which I can't get to screen from
the archive. Somehow it seems that urls with question marks result in
error messages (document not found etc). Or is it the equals sign that
causes it?

A request of this type results in error
http://nwa5a.lib.helsinki.fi/wera/result.php?time=..something..&
mode=..something..&url=http://www.helsinginsanomat.fi/helsinki2005/
tulokset/?pvm=13.08.2005/

Or using result.php with urls like these
url=http://www.varaslahto.net/phpBB2/viewtopic.php?p=2320
url=http://www.hel.fi/tourism/en/tapahtumat_lisa.asp?kieli=en&id=3320&sivu=paa

Or maybe these docs really are not in my archive.. has anyone else seen
similar behaviour?


-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss

[Archive-access-discuss] result.php with certain urls causes problems

From: <kau...@cs...> - 2005-09-13 09:20:50

Hi,
a wera archive has been up here for some time now and it looks very
usable.
(nutchwax-0.2.1 and wera-0.2.2)

I was wondering about some documents which I can't get to screen from
the archive. Somehow it seems that urls with question marks result in
error messages (document not found etc). Or is it the equals sign that
causes it?

A request of this type results in error
http://nwa5a.lib.helsinki.fi/wera/result.php?time=..something..&
mode=..something..&url=http://www.helsinginsanomat.fi/helsinki2005/
tulokset/?pvm=13.08.2005/

Or using result.php with urls like these
url=http://www.varaslahto.net/phpBB2/viewtopic.php?p=2320
url=http://www.hel.fi/tourism/en/tapahtumat_lisa.asp?kieli=en&id=3320&sivu=paa

Or maybe these docs really are not in my archive.. has anyone else seen
similar behaviour?
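
One general effect worth ruling out, independent of WERA: when a URL that has its own query string is passed unencoded as the value of url=, every "&" inside it starts a new parameter of the outer request. The Python sketch below shows this with the standard library. The time and mode values are placeholders, and nothing here claims this is how result.php parses its input:

```python
from urllib.parse import parse_qs, urlencode

# Archived URL taken from the examples above.
inner = "http://www.hel.fi/tourism/en/tapahtumat_lisa.asp?kieli=en&id=3320&sivu=paa"

# Unencoded: the inner "&id=..." and "&sivu=..." become separate parameters.
# TIME and MODE are placeholder values, not real WERA arguments.
raw_query = "time=TIME&mode=MODE&url=" + inner
raw_params = parse_qs(raw_query)

# Percent-encoded: the whole archived URL survives as a single value.
encoded_query = urlencode({"time": "TIME", "mode": "MODE", "url": inner})
encoded_params = parse_qs(encoded_query)

print(raw_params["url"])
print(encoded_params["url"])
```

This would only affect archived URLs that contain "&". It does not by itself explain failures for URLs that contain just "?" or "=", such as the viewtopic.php example.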

Re: [Archive-access-discuss] The first question: wera window blank!

From: Kaisa K. <kau...@cs...> - 2005-08-10 13:29:04

Yes, but that's what I've been doing all the time :)
Starting 'java -jar weraInstaller.jar' on the Linux machine
opens a window on my PC display and the window is blank,
except for title 'Wera Installer' & logo. 

Maybe it would be an idea to offer a text version of the
installation procedure somewhere on the Nutchwax pages?

Kaisa

---------- Forwarded message ----------
Date: Mon, 08 Aug 2005 14:30:10 +0200
From: Sverre Bang <sve...@nb...>
To: Kaisa Kaunonen <kau...@cs...>
Subject: Re: [Archive-access-discuss] The first question: wera window blank!

Aha, ....
Put weraInstaller.jar on the remote host where you want wera installed
(not your windows PC). Log in on the remote host using ssh and start the
installer with java -jar weraInstaller.jar. If you do this from a linux
client (ssh to the remote host and invoke installer) you get the
weraInstaller GUI on your client (assuming X11 port forwarding is set).
I don't know if or how this works on Windows; you will probably be
presented with the console (text-only) version of the wera installer.

Good luck,
Sverre

On Mon, 08.08.2005 at 14:41 +0300, Kaisa Kaunonen wrote:
> Hi,
> I'm installing Nutchwax on a Linux machine which has
> Sun jdk1.5.0_02
> 
> Linux in question seems to be 'Heka-3 Enterprise Linux Update 3'
> (Heka is a name for a local distribution here?)
> 
> But I'm working at a PC with MS Windows XP. It has WinaXe software
> which works as a Windows X server for that Linux machine, so the Wera
> window opens on the PC.
> 
> Kaisa
> 
> On Mon, 8 Aug 2005, Sverre Bang wrote:
> 
> > Hi Kaisa,
> > Sorry for not replying earlier, just back from vacation.
> > What jvm are you using, and what OS?
> > 
> > Sverre
> > 
> > On Wed, 03.08.2005 at 15:05 +0300, kau...@cs... wrote:
> > > Hi all (how many are you),
> > > congratulations to the new software and this discussion list!
> > > 
> > > Indexed about 150 *.arc.gz files. It went easily and the index now gives
> > > back search results. But I didn't look more closely if everything was
> > > indexed and completely without errors :)
> > > 
> > > Now I'm trying to install this wera. Using 'java -jar
> > > weraInstaller.jar' opens a window. It has a logo in the upper left
> > > corner and a page title but otherwise the page remains empty.
> > > 
> > > Is it something with the xml module in php installation, although running
> > > phpinfo.php on my machine gives 'XML Support active'.
> > > 
> > > Regards,
> > > Kaisa
> > > 
> > > 
> > > -------------------------------------------------------
> > > SF.Net email is sponsored by: Discover Easy Linux Migration Strategies
> > > from IBM. Find simple to follow Roadmaps, straightforward articles,
> > > informative Webcasts and more! Get everything you need to get up to
> > > speed, fast. http://ads.osdn.com/?ad_idt77&alloc_id492&op=click
> > > _______________________________________________
> > > Archive-access-discuss mailing list
> > > Arc...@li...
> > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss

Re: [Archive-access-discuss] The first question: wera window blank!

From: Kaisa K. <kau...@cs...> - 2005-08-08 11:42:35

Hi,
I'm installing Nutchwax on a Linux machine which has
Sun jdk1.5.0_02

Linux in question seems to be 'Heka-3 Enterprise Linux Update 3'
(Heka is a name for a local distribution here?)

But I'm working at a PC with MS Windows XP. It has WinaXe software
which works as a Windows X server for that Linux machine, so the Wera
window opens on the PC.

Kaisa

On Mon, 8 Aug 2005, Sverre Bang wrote:

> Hi Kaisa,
> Sorry for not replying earlier, just back from vacation.
> What jvm are you using, and what OS?
> 
> Sverre
> 
> On Wed, 03.08.2005 at 15:05 +0300, kau...@cs... wrote:
> > Hi all (how many are you),
> > congratulations to the new software and this discussion list!
> > 
> > Indexed about 150 *.arc.gz files. It went easily and the index now gives
> > back search results. But I didn't look more closely if everything was
> > indexed and completely without errors :)
> > 
> > Now I'm trying to install this wera. Using 'java -jar
> > weraInstaller.jar' opens a window. It has a logo in the upper left
> > corner and a page title but otherwise the page remains empty.
> > 
> > Is it something with the xml module in php installation, although running
> > phpinfo.php on my machine gives 'XML Support active'.
> > 
> > Regards,
> > Kaisa

Re: [Archive-access-discuss] The first question: wera window remains blank!

From: Sverre B. <sve...@nb...> - 2005-08-08 07:32:59

Hi Kaisa,
Sorry for not replying earlier, just back from vacation.
What jvm are you using, and what OS?

Sverre

On Wed, 03.08.2005 at 15:05 +0300, kau...@cs... wrote:
> Hi all (how many are you),
> congratulations to the new software and this discussion list!
> 
> Indexed about 150 *.arc.gz files. It went easily and the index now gives
> back search results. But I didn't look more closely if everything was
> indexed and completely without errors :)
> 
> Now I'm trying to install this wera. Using 'java -jar
> weraInstaller.jar' opens a window. It has a logo in the upper left
> corner and a page title but otherwise the page remains empty.
> 
> Is it something with the xml module in php installation, although running
> phpinfo.php on my machine gives 'XML Support active'.
> 
> Regards,
> Kaisa

[Archive-access-discuss] The first question: wera window remains blank!

From: <kau...@cs...> - 2005-08-03 12:06:15

Hi all (how many are you),
congratulations to the new software and this discussion list!

Indexed about 150 *.arc.gz files. It went easily and the index now gives
back search results. But I didn't look more closely if everything was
indexed and completely without errors :)

Now I'm trying to install this wera. Using 'java -jar
weraInstaller.jar' opens a window. It has a logo in the upper left
corner and a page title but otherwise the page remains empty.

Is it something with the xml module in php installation, although running
phpinfo.php on my machine gives 'XML Support active'.

Regards,
Kaisa

[Archive-access-discuss] [Announcement] First release of nutchwax + WERA access tool

From: <st...@du...> - 2005-07-29 00:40:13

We would like to announce the release of nutchwax -- the nutch search 
application + extensions for searching of web archive collections -- and 
WERA, a web collection viewer application from the NWA Toolset that has 
been adapted to nutchwax.  The two tools used in concert provide 
full-text search of small web archive collections and a means of 
browsing an archive collection over time.

Nutchwax is hosted on sourceforge at http://archive-access.sourceforge.net.

St.Ack

37 messages have been excluded from this view by a project administrator.
