From: Charles F. <Cha...@bn...> - 2005-09-29 16:16:19
Hello!

We (Bibliothèque nationale de Luxembourg) are still newbies in the world of web archiving, pretty much taking our first steps, and for a prototype/test project we've chosen Luxembourg's regional elections, taking place 9th of October.

The set-up:
We've got a small collection of .arc files, crawled and archived by Heritrix 1.4. I am the only human resource for this project (and also work on other projects), so we're quite limited resourcewise. I'm now at the stage of trying to interface (partly to be able to see if everything has been crawled) this .arc collection to the "users", which at this stage is the library staff. I'm using WERA 0.2.2, running on Apache 2, and I've got both the nutchwax release 0.2.1 and the CVS head nutchwax (September 25) running on Tomcat 5. Java is 1.5.0.

Here are the problems I am experiencing after a first look at WERA/Nutchwax (well, after a couple of weeks of messing about with the releases and cvs builds, rather =)

===========================================================
PART 1 - INDEXED WITH THE RELEASE VERSION 0.2.1 OF NUTCHWAX
===========================================================

1) Inline redirected images
===========================
The URL www.csv.lu was part of a domain-scoped crawl. Many inline images from this domain are not displayed.
One example: http://kayltetange.csv.lu/index.html has a few inline images:

<img height="101" alt="image334.jpg" src="http://kayltetange.csv.lu/fotoen/image334.jpg" width="136" align="baseline"/> (1)
<img style="WIDTH: 886px; HEIGHT: 686px" height="1050" alt="l__iffr__chen_036internet.JPG" src="http://kayltetange.csv.lu/fotoen/l__iffr__chen_036internet.JPG" width="1400" align="baseline"/> (2)
<img style="WIDTH: 74px; HEIGHT: 77px" height="52" alt="image3731.jpg" src="http://kayltetange.csv.lu/fotoen/image3731.jpg" width="49" align="baseline"/> (3)

A search for the filename shows that these images are in my collection, but with URLs

http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image334.jpg (4)
http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/l__iffr__chen_036internet.JPG (5)
http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image3731.jpg (6)

Opening the URLs (1-3) in a browser on the "live" web redirects me immediately to (4-6).

What I suppose happened is that Heritrix tried fetching (1-3), got a redirect back, and therefore fetched and archived (4-6). Now when WERA retrieves (1-3) it doesn't find them, since these URLs were never archived.

I don't know what could be a workaround for this, but I suppose it can be a serious problem. Would it also happen with redirected html pages?

2) Need for URL canonicalisation in WERA?
=========================================
On the "live" web:
The main (home) page of http://www.csv.lu has a "Newsletter" link to http://www.csv.lu/newsletter.
The main page also has links to dozens of regional subsites of the party (e.g. http://bettembourg.csv.lu/), which are all in pretty much the same design as the main page, with some links including the "Newsletter" one.
= BUT: Most of these regional subsites have their "Newsletter" link = pointing to http://csv.lu/newsletter.=20 Heritrix didn't archive this a second time. Result: "Sorry, no documents with the given uri were found" when = clicking "Newsletter" on the archived regional sites.=20 3) Dynamic pages / question marks in the URL =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D I've read about this bug some time ago - is it supposed to have been = fixed? As soon as there is one question mark (or a '+' sign, or others?) in a = URL the page can't be retrieved. Say I search for "Juncker"... I get: ----------------------------- 1. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique = (http://hesper.csv.lu/2004/juncker.html) ( ... CSV Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CSV = lokal Juncker on Tour zu Hesper am Centre Civique = = Zer=E9ck CSV. De s=E9chere Wee. Rufft eis un um 22 = 57 31-1 ... ) Number of versions satisfying query / total number of versions : 1/1 Timeline | Overview 2. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique = (http://hesper.csv.lu/index.php?print=3D1&a=3D2004/juncker.html) ( ... CSV Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CSV = lokal Juncker on Tour zu Hesper am Centre Civique = = Zer=E9ck CSV. De s=E9chere Wee. Rufft eis un um 22 = 57 31-1 ... ) Number of versions satisfying query / total number of versions : 0/0 Timeline | Overview 3. CSV - Jean-Claude Juncker sur France Inter = (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition) ( ... ministre Jean-Claude Juncker au sujet du raz-de-mar=E9e en Asie du = Sud-est France Inter: Bonjour, Jean-Claude Juncker. Jean-Claude Juncker: = Oui, bonjour. France Inter: En tant que pr=E9sident en exercice de = l'Union europ=E9enne, vous =E9tiez pr=E9sent jeudi dernier aux ... = importants puisqu'il s'agit d'une r=E9gion du monde qui nous est tr=E8s = proche. 
France Inter: Merci, Jean-Claude Juncker. Merci, Monsieur le = Pr=E9sident. Jean-Claude ... ) Number of versions satisfying query / total number of versions : 0/0 Timeline | Overview 4. CSV - Jean-Claude Juncker sur France Inter = (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition?print=3D= 1) (...) 5. CSV - Interview mam Jean-Claude Juncker = (http://www.csv.lu/text/2133.html/Frank+Engel?print=3D1)(...) 6. CSV - Interview mam Jean-Claude Juncker = (http://www.csv.lu/text/2133.html/Frank+Engel)(...) 7. CSV - Edm=E9e Juncker verabschiedet sich als Pr=E4sidentin = (http://www.csv.lu/text/1978.html/Marco+Schank)(...) 8. CSV - Edm=E9e Juncker verabschiedet sich als Pr=E4sidentin = (http://www.csv.lu/text/1978.html/Marco+Schank?print=3D1)(...) 9. CSV - Drei Fragen an Jean-Claude Juncker = (http://www.csv.lu/text/2212.html/Claude+Wiseler)(...) 10. CSV - Drei Fragen an Jean-Claude Juncker = (http://www.csv.lu/text/2212.html/Luc+Frieden)(...) ------------------------- Results 2-10 all show me "Sorry, no documents with the given uri were = found". They also have "total number of versions 0/0".=20 The only link who retrieves anything is the first one. But even here: = The page I get has a set of thumbnails which are only displayed for = about 0.1 seconds and then disappear (I guess because of JavaScript = replacing the links with links pointing to within the collection..). A = look at the source code of the page shows that these pictures should be: = juncker/JoTt-(01).jpg juncker/JoTt-(02).jpg ... So I search for "JoTt-(01).jpg"... and get 2 hits: Total number of versions found : 2. Displaying URL's 1-2 1. http://hesper.csv.lu/juncker/JoTt-(01).jpg = (http://hesper.csv.lu/juncker/JoTt-(01).jpg) (CSV CSV lokal Fehler: D=EF=BF=BDs S=EF=BF=BDt exist=EF=BF=BDert = net! CSV. De s=E9chere Wee. Rufft eis un um 22 57 31-1 oder = sch=E9ckt eng Email op csv@csv.) Number of versions satisfying query / total number of versions : 0/0 Timeline | Overview =20 2. 
http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg = (http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg) ( ... = http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg) Number of versions satisfying query / total number of versions : 0/0 Timeline | Overview =20 Again, both not retrievable. Same goes for any other pictures with = brackets (and possibly some other non-"a-z|A-Z|0-9" characters) in the = filename. 4) Special characters =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D This has repeatedly been reported as fixed but there is still trouble:=20 Searching for "Edm=E9e" (in case that doesn't display fine: = e-d-m-eacute-e) gives me hits but ONLY if I manually set Encoding of my = browser (Firefox) to "Windows 1252" or "ISO 8859-1". If I do that, then = enter the "Edm=E9e", and then Search I get a page with results,=20 BUT=20 the Search box now says "Edm?e" and Character encoding has been set back = to UTF-8. If I no do another search (say "fran=E7ais") I get again "no = hits!". I'd have to set back Character encoding manually before each = search. 5) XML error: reference to invalid character number at line 34 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D For some searches (on collections indexed with nutchwax release 0.2.1) I = get only the above error message as result. 
The source code:

*****START*****
<!-- ************************ Results: ****************************************************** -->
<table align="center" class="greyborder" border="0" cellspacing="0" cellpadding="1" width="90%">
<tr>
<td>
<table align="center" class="resultsborder" border="0" cellspacing="0" cellpadding="10" width="100%">
<tr>
<td>

XML error: reference to invalid character number at line 34
*****END*****

That's the last line (HTML generation by php is cut off there).

A look into catalina.out:

*****START*****
050929 163012 12 query request from 192.168.6.21
050929 163012 12 query: Juncker
050929 163012 12 searching for 20 raw hits
050929 163012 12 re-searching for 40 raw hits, query: juncker -exacturl:"ZUKNZ3J2N7I5Z3A2MEYYU6PP7M" -exacturl:"HY5Q6TJQ7YL2VFYHAJXT7SYMPY" -exacturl:"LDA5RUE6G6T46A2SEBDHQQ4JAQ" -exacturl:"X6LW4F7OYOFF6NXMC3WKOJVHJY" -exacturl:"WGBI4JQ3RXDYOBBAXWO4ZHCSQY"
050929 163012 12 found 10476 raw hits
050929 163012 12 total hits: 10496
*****END*****

6) Wrong re-setting of Character encoding
=========================================
On the "live" web, www.gouvernement.lu has character encoding UTF-8. Every time you reload the page it sets it to this.
In my archived collection, every time I retrieve a page from this URL, the encoding is set back to ISO 8859-1. The page, being in French, is therefore pretty much unreadable and you have to set the encoding manually back to UTF-8 after every click.

7) Immediate re-direct to "live" web
====================================
URL http://www.lsap.lu (in my seeds list) is a redirect to http://www.lsap.lu/index.php?idusergroup=42114236.
When I retrieve http://www.lsap.lu/ from my collection, WERA immediately displays the live web page.
Besides that, *every* link on www.lsap.lu includes variables (question marks) and is hence unretrievable (see (3)).

==================================================================
PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
==================================================================

8) No images indexed?
=====================
When I index my collection with NutchWax head CVS BUILD, no images appear at all.
One method has been suggested here to see if a file is in the archive:

>Setting $conf_debug to 1 in /lib/config.inc and changing index.php
>
>from $search->setFieldsInResult("teaser url description");
>to $search->setFieldsInResult("teaser url description archiveidentifier");

When I do this and query for one of the many non-displayed images (e.g. "gouvernement.gif") I get

[1] => Array
(
[teaser] => http://www.gouvernement.lu/pictures/layout/gouvernement.gif
[url] => http://www.gouvernement.lu/pictures/layout/gouvernement.gif
[archiveidentifier] => //arc/.arc.gz
)

So I look in the indexarcs output file and notice I have plenty of entries like this:

(...)
050929 115748 adding 4223 bytes of mimetype image/jpeg http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
050929 115748 Failed parse: Content-Type not text/html: image/jpeg
(...)

and towards the end of the file:

(...)
050929 125148 No collection for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
050929 125148 No arcname for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
050929 125148 No arcoffset for url http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
050929 125148 No collection for url http://www.adr.lu/Norden/koepp_port.jpg
050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
050929 125148 No arcoffset for url http://www.adr.lu/Norden/koepp_port.jpg
(...)

I didn't have these lines before (when I indexed with the released nutchwax as opposed to the cvs build).
Any ideas on how this is possible or what it means? Why do my images not have an archiveidentifier? My indexing process must have been wrong, I guess? This is the command I used:

bin/indexarcs.sh -c elections -s /arc/ -d /usr/share/archive-access/projects/nutchwax_head/nutch-data-29sep/ &>index_arc_elections_29sep.log

What is a typical indexarcs.sh command line meant to look like instead?

==================================
One more question: Is there a version of WERA newer than the 0.2.2 release available somewhere (via cvs, for instance) that's worth getting (i.e. with any substantial changes)? If so, what commands or steps need to be executed to use it?

That's all for now :) Looking forward to reading your comments.
==================================
Charlie Foetz
Bibliothèque nationale de Luxembourg
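
P.S. To make point (2) a bit more concrete, here is a minimal sketch in Python of the kind of canonicalisation I have in mind. It is only an illustration under my own assumptions, not how WERA or NutchWAX actually look up URLs: the names canonical_key, archived and lookup are made up for this example, and treating "www.csv.lu" and "csv.lu" as the same host is a heuristic that will not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url: str) -> str:
    """Build a lookup key that ignores scheme/host case, a leading 'www.'
    and a trailing slash. Any port number is dropped (a simplification)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    # Keep the query string untouched; discard any fragment.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


# Hypothetical index: canonical key -> URL as it was actually archived.
archived = ["http://www.csv.lu/newsletter"]
index = {canonical_key(u): u for u in archived}


def lookup(url: str):
    """Return the archived URL matching the requested one, or None."""
    return index.get(canonical_key(url))


print(lookup("http://csv.lu/newsletter"))
```

With something like this applied on both the indexing side and the retrieval side, the "Newsletter" link on the regional subsites would resolve to the copy archived under www.csv.lu instead of giving "no documents with the given uri were found".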