From: Sverre B. <sve...@nb...> - 2005-09-30 07:21:35
Hi Charlie,

great feedback! I'm on my way to San Francisco to work with Michael on
further integration of Wera and NutchWax. I'll get back to you next week.

Btw. not much point in getting latest wera from cvs, no new improvements
there, sorry.

Sverre

On Thursday 29 September 2005 18:16, Charles Foetz wrote:
> Hello!
>
> We (Bibliothèque nationale de Luxembourg) are still newbies in the world of
> web archiving, pretty much taking our first steps, and for a prototype/test
> project we've chosen Luxembourg's regional elections, taking place 9th of
> October.
>
> The set-up:
>
> We've got a small collection of .arc files, crawled and archived by
> Heritrix 1.4. I am the only human resource for this project (and also work
> on other projects), so we're quite limited resourcewise. I'm now at the
> stage of trying to interface (partly to be able to see if everything has
> been crawled) this .arc collection to the "users", which at this stage is
> the library staff. I'm using WERA 0.2.2, running on Apache 2, and I've got
> both the nutchwax release 0.2.1 and the CVS head nutchwax (September 25)
> running on Tomcat 5. Java is 1.5.0.
>
> Here are the problems I am experiencing after a first look at WERA/Nutchwax
> (well, after a couple of weeks of messing about with the releases and cvs
> builds, rather =)
>
>
> ===========================================================
> PART 1 - INDEXED WITH THE RELEASE VERSION 0.2.1 OF NUTCHWAX
> ===========================================================
>
>
> 1) Inline redirected images
> ===========================
>
> The URL www.csv.lu was part of a domain-scoped crawl.
> Many inline images from this domain are not displayed. One example:
>
> http://kayltetange.csv.lu/index.html has a few inline images:
>
> <img height="101" alt="image334.jpg"
> src="http://kayltetange.csv.lu/fotoen/image334.jpg" width="136"
> align="baseline"/> (1)
> <img style="WIDTH: 886px; HEIGHT: 686px" height="1050"
> alt="l__iffr__chen_036internet.JPG"
> src="http://kayltetange.csv.lu/fotoen/l__iffr__chen_036internet.JPG"
> width="1400" align="baseline"/> (2)
> <img style="WIDTH: 74px; HEIGHT: 77px" height="52" alt="image3731.jpg"
> src="http://kayltetange.csv.lu/fotoen/image3731.jpg" width="49"
> align="baseline"/> (3)
>
> A search for the filename shows that these images are in my collection, but
> with URLs
>
> http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image334.jpg (4)
> http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/l__iffr__chen_036internet.JPG (5)
> http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image3731.jpg (6)
>
> Opening the URLs (1-3) in a browser on the "live" web redirects me
> immediately to (4-6).
>
> What I suppose happened is that Heritrix tried fetching (1-3), got a
> redirect back, and therefore fetched and archived (4-6). Now when WERA
> retrieves (1-3) it doesn't find them, since these URLs were never archived.
>
> I don't know what could be a workaround for this, but I suppose it can be a
> serious problem. Would it also happen with redirected html pages?
>
>
> 2) Need for URL canonicalisation in WERA?
> =========================================
>
> On the "live" web:
>
> The main (home) page of http://www.csv.lu has a "Newsletter" link to
> http://www.csv.lu/newsletter. The main page also has links to dozens of
> regional subsites of the party (e.g.
> http://bettembourg.csv.lu/), which are all in pretty much the same design
> as the main page, with some links including the "Newsletter" one.
>
> BUT: Most of these regional subsites have their "Newsletter" link pointing
> to http://csv.lu/newsletter.
>
> Heritrix didn't archive this a second time.
>
> Result: "Sorry, no documents with the given uri were found" when clicking
> "Newsletter" on the archived regional sites.
>
>
> 3) Dynamic pages / question marks in the URL
> ============================================
>
> I've read about this bug some time ago - is it supposed to have been fixed?
>
> As soon as there is one question mark (or a '+' sign, or others?) in a URL
> the page can't be retrieved. Say I search for "Juncker"... I get:
>
> -----------------------------
>
> 1. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique
> (http://hesper.csv.lu/2004/juncker.html) ( ... CSV Hesper - Juncker on Tour
> zu Hesper am Centre Civique CSV CSV lokal Juncker on Tour zu Hesper am
> Centre Civique Zeréck CSV. De séchere Wee. Rufft eis un um 22 57 31-1 ... )
> Number of versions satisfying query / total number of versions : 1/1
> Timeline | Overview
>
> 2. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique
> (http://hesper.csv.lu/index.php?print=1&a=2004/juncker.html) ( ... CSV
> Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CSV lokal
> Juncker on Tour zu Hesper am Centre Civique Zeréck CSV. De séchere Wee.
> Rufft eis un um 22 57 31-1 ... ) Number of versions satisfying query /
> total number of versions : 0/0 Timeline | Overview
>
> 3. CSV - Jean-Claude Juncker sur France Inter
> (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition) ( ...
> ministre Jean-Claude Juncker au sujet du raz-de-marée en Asie du Sud-est
> France Inter: Bonjour, Jean-Claude Juncker.
> Jean-Claude Juncker: Oui, bonjour. France Inter: En tant que président en
> exercice de l'Union européenne, vous étiez présent jeudi dernier aux ...
> importants puisqu'il s'agit d'une région du monde qui nous est très proche.
> France Inter: Merci, Jean-Claude Juncker. Merci, Monsieur le Président.
> Jean-Claude ... ) Number of versions satisfying query / total number of
> versions : 0/0 Timeline | Overview
>
> 4. CSV - Jean-Claude Juncker sur France Inter
> (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition?print=1) (...)
> 5. CSV - Interview mam Jean-Claude Juncker
> (http://www.csv.lu/text/2133.html/Frank+Engel?print=1) (...)
> 6. CSV - Interview mam Jean-Claude Juncker
> (http://www.csv.lu/text/2133.html/Frank+Engel) (...)
> 7. CSV - Edmée Juncker verabschiedet sich als Präsidentin
> (http://www.csv.lu/text/1978.html/Marco+Schank) (...)
> 8. CSV - Edmée Juncker verabschiedet sich als Präsidentin
> (http://www.csv.lu/text/1978.html/Marco+Schank?print=1) (...)
> 9. CSV - Drei Fragen an Jean-Claude Juncker
> (http://www.csv.lu/text/2212.html/Claude+Wiseler) (...)
> 10. CSV - Drei Fragen an Jean-Claude Juncker
> (http://www.csv.lu/text/2212.html/Luc+Frieden) (...)
>
> -------------------------
>
> Results 2-10 all show me "Sorry, no documents with the given uri were
> found". They also have "total number of versions 0/0".
>
> The only link that retrieves anything is the first one. But even here: The
> page I get has a set of thumbnails which are only displayed for about 0.1
> seconds and then disappear (I guess because of JavaScript replacing the
> links with links pointing to within the collection..). A look at the source
> code of the page shows that these pictures should be:
>
> juncker/JoTt-(01).jpg
> juncker/JoTt-(02).jpg
> ...
>
> So I search for "JoTt-(01).jpg"...
>
> and get 2 hits:
>
> Total number of versions found : 2. Displaying URL's 1-2
> 1.
> http://hesper.csv.lu/juncker/JoTt-(01).jpg
> (http://hesper.csv.lu/juncker/JoTt-(01).jpg) (CSV CSV lokal Fehler:
> Dï¿½s Sï¿½t existï¿½ert net! CSV. De séchere Wee. Rufft eis un um 22
> 57 31-1 oder schéckt eng Email op csv@csv.) Number of versions satisfying
> query / total number of versions : 0/0 Timeline | Overview
>
> 2. http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg
> (http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg) (
> ... http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg)
> Number of versions satisfying query / total number of versions : 0/0
> Timeline | Overview
>
> Again, both not retrievable. Same goes for any other pictures with brackets
> (and possibly some other non-"a-z|A-Z|0-9" characters) in the filename.
>
>
> 4) Special characters
> =====================
>
> This has repeatedly been reported as fixed but there is still trouble:
>
> Searching for "Edmée" (in case that doesn't display fine: e-d-m-eacute-e)
> gives me hits but ONLY if I manually set Encoding of my browser (Firefox)
> to "Windows 1252" or "ISO 8859-1". If I do that, then enter "Edmée",
> and then Search, I get a page with results,
>
> BUT
>
> the Search box now says "Edm?e" and Character encoding has been set back to
> UTF-8. If I now do another search (say "français") I again get "no hits!".
> I'd have to set Character encoding back manually before each search.
>
>
> 5) XML error: reference to invalid character number at line 34
> ==============================================================
> For some searches (on collections indexed with nutchwax release 0.2.1) I
> get only the above error message as result.
> The source code:
>
> *****START*****
>
> <!-- ************************ Results: ****************************************************** -->
> <table align="center" class="greyborder" border="0" cellspacing="0" cellpadding="1" width="90%">
> <tr>
> <td>
>
> <table align="center" class="resultsborder" border="0" cellspacing="0" cellpadding="10" width="100%">
> <tr>
> <td>
>
> XML error: reference to invalid character number at line 34
>
>
> *****END*****
>
> That's the last line (HTML generation by php is cut off there)
>
> A look into catalina.out:
>
> *****START*****
>
> 050929 163012 12 query request from 192.168.6.21
> 050929 163012 12 query: Juncker
> 050929 163012 12 searching for 20 raw hits
> 050929 163012 12 re-searching for 40 raw hits, query: juncker -exacturl:"ZUKNZ3J2N7I5Z3A2MEYYU6PP7M" -exacturl:"HY5Q6TJQ7YL2VFYHAJXT7SYMPY" -exacturl:"LDA5RUE6G6T46A2SEBDHQQ4JAQ" -exacturl:"X6LW4F7OYOFF6NXMC3WKOJVHJY" -exacturl:"WGBI4JQ3RXDYOBBAXWO4ZHCSQY"
> 050929 163012 12 found 10476 raw hits
> 050929 163012 12 total hits: 10496
>
> *****END*****
>
> 6) Wrong re-setting of Character encoding
> =========================================
>
> On the "live" web, www.gouvernement.lu has character encoding UTF-8. Every
> time you reload the page it sets it to this.
>
> In my archived collection, every time I retrieve a page from this URL,
> encoding is always set back to ISO 8859-1. The page, being in French, is
> therefore pretty much unreadable and you have to set the encoding back to
> UTF-8 manually after every click.
>
>
> 7) Immediate re-direct to "live" web
> ====================================
> URL http://www.lsap.lu (in my seeds list) is a redirect to
> http://www.lsap.lu/index.php?idusergroup=42114236.
>
> When I retrieve http://www.lsap.lu/ from my collection, WERA immediately
> displays the live web page. Besides that, <i>every</i> link on www.lsap.lu
> includes variables (question marks) and is hence unretrievable (see (3)).
>
>
> ==================================================================
> PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
> ==================================================================
>
> 8) No images indexed?
> =====================
>
> When I index my collection with NutchWax head CVS BUILD, no images appear
> at all.
>
> One method has been suggested here to see if a file is in the archive:
> > Setting $conf_debug to 1 in /lib/config.inc and changing index.php
> >
> > from $search->setFieldsInResult("teaser url description");
> > to $search->setFieldsInResult("teaser url description archiveidentifier");
>
> When I do this and query for one of the many non-displayed images (e.g.
> "gouvernement.gif") I get
>
> [1] => Array
> (
> [teaser] => http://www.gouvernement.lu/pictures/layout/gouvernement.gif
> [url] => http://www.gouvernement.lu/pictures/layout/gouvernement.gif
> [archiveidentifier] => //arc/.arc.gz
> )
>
> So I look in the indexarcs output file and notice I have plenty of entries
> like this:
>
> (...)
> 050929 115748 adding 4223 bytes of mimetype image/jpeg http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
> 050929 115748 Failed parse: Content-Type not text/html: image/jpeg
> (...)
>
> and towards the end of the file:
>
> (...)
> 050929 125148 No collection for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No arcname for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No arcoffset for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No collection for url http://www.adr.lu/Norden/koepp_port.jpg
> 050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
> 050929 125148 No arcoffset for url http://www.adr.lu/Norden/koepp_port.jpg
> (...)
>
> I didn't have these lines before (when I indexed with the released nutchwax
> as opposed to the cvs build).
>
> Any ideas on how this is possible or what it means? Why do my images not
> have an archiveidentifier? My indexing process must have been wrong, I
> guess?
>
> bin/indexarcs.sh -c elections -s /arc/ -d
> /usr/share/archive-access/projects/nutchwax_head/nutch-data-29sep/
> &>index_arc_elections_29sep.log
>
> What is a typical indexarcs.sh command line meant to look like instead?
>
> ==================================
>
> One more question: Is there a version of WERA newer than the 0.2.2 release
> available somewhere (via cvs, for instance) that's worth getting (i.e. with
> any substantial changes)? If so, what commands or steps need to be executed
> to use it?
>
> That's all for now :)
>
> Looking forward to reading your comments
>
> ==================================
> Charlie Foetz
> Bibliothèque nationale de Luxembourg
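
On problem 2 (www.csv.lu/newsletter vs. csv.lu/newsletter), one possible approach is to reduce both the archived URL and the requested URL to a common lookup key before comparing them. The sketch below is Python, purely for illustration. It is not WERA or NutchWAX code, and the names canonicalise, lookup and archived, as well as the particular rules (lower-case host, drop a leading "www.", drop default ports and fragments), are assumptions rather than anything either tool is known to do.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalise(url: str) -> str:
    """Return a lookup key for url (illustrative rules only, see lead-in)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Assumption: www.example.org and example.org serve the same content.
    if host.startswith("www."):
        host = host[len("www."):]
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Keep the query string untouched; drop the fragment.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


# Hypothetical index: canonical URL -> archive record identifier.
archived = {canonicalise("http://www.csv.lu/newsletter"): "record-1"}


def lookup(url: str):
    """Look up a requested URL via its canonical form; None if not archived."""
    return archived.get(canonicalise(url))


if __name__ == "__main__":
    # The regional subsites link to the host without "www."
    print(lookup("http://csv.lu/newsletter"))
```

With this, a request for http://csv.lu/newsletter resolves to the record stored for http://www.csv.lu/newsletter. Whether stripping "www." is safe depends on the sites in the collection, so it would probably have to be configurable.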