From:
<mat...@ce...> - 2005-09-30 09:11:26
Hi,

I will try to help you, but only with the encoding problems :)

______________________________________________________________
> From: Cha...@bn...
> To: arc...@li...
> CC: Carlo Blum <Car...@ci...>
> Date: 29.09.2005 18:20
> Subject: [Archive-access-discuss] WERA / Nutchwax - bugs, problems and questions from Luxembourg
>
> Hello!
>
> We (Bibliothèque nationale de Luxembourg) are still newbies in the world
> of web archiving, pretty much taking our first steps, and for a
> prototype/test project we've chosen Luxembourg's regional elections,
> taking place 9th of October.
>
> The set-up:
>
> We've got a small collection of .arc files, crawled and archived by
> Heritrix 1.4. I am the only human resource for this project, (and also
> work on other projects), so we're quite limited resourcewise. I'm now at
> the stage of trying to interface (partly to be able to see if everything
> has been crawled) this .arc collection to the "users", which at this stage
> is the library staff. I'm using WERA 0.2.2, running on Apache 2, and I've
> got both the nutchwax release 0.2.1 and the CVS head nutchwax (September
> 25) running on Tomcat 5. Java is 1.5.0.
>
> Here are the problems I am experiencing after a first look at
> WERA/Nutchwax (well, after a couple of weeks of messing about with the
> releases and cvs builds, rather =)
>
>
> ===========================================================
> PART 1 - INDEXED WITH THE RELEASE VERSION 0.2.1 OF NUTCHWAX
> ===========================================================
>
>
> 1) Inline redirected images
> ===========================
>
> The URL www.csv.lu was part of a domain-scoped crawl. Many inline images
> from this domain are not displayed. One example:
>
> http://kayltetange.csv.lu/index.html has a few inline images:
>
> <img alt="image334.jpg"
> src="http://kayltetange.csv.lu/fotoen/image334.jpg"
> align="bottom" height="101" width="136"> (1)
> <img style="width: 886px; height: 686px;" alt="l__iffr__chen_036internet.JPG"
> src="http://kayltetange.csv.lu/fotoen/l__iffr__chen_036internet.jpg"
> align="bottom" height="1050" width="1400"> (2)
> <img style="width: 74px; height: 77px;" alt="image3731.jpg"
> src="http://kayltetange.csv.lu/fotoen/image3731.jpg"
> align="bottom" height="52" width="49"> (3)
>
> A search for the filename shows that these images are in my collection,
> but with URLs
>
> http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image334.jpg
> (4)
> http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/l__iffr__chen_036internet.JPG
> (5)
> http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image3731.jpg
> (6)
>
> Opening the URLs (1-3) in a browser on the "live" web redirects me
> immediately to (4-6)
>
> What I suppose happened is that Heritrix tried fetching (1-3), got a
> redirect back, therefore fetched and archived (4-6). Now when WERA
> retrieves (1-3) it doesn't find them, since these URLs were never
> archived.
>
> I don't know what could be a workaround for this, but I suppose it can be a
> serious problem. Would it also happen with redirected html pages?
>
>
> 2) Need for URL canonicalisation in WERA?
> =========================================
>
> On the "live" web:
>
> The main (home) page of http://www.csv.lu has a "Newsletter" link to
> http://www.csv.lu/newsletter.
> The main page also has links to dozens of regional subsites of the party
> (e.g. http://bettembourg.csv.lu/), which are all in pretty much the same
> design as the main page, with some links including the "Newsletter" one.
> BUT: Most of these regional subsites have their "Newsletter" link pointing
> to http://csv.lu/newsletter.
>
> Heritrix didn't archive this a second time.
>
> Result: "Sorry, no documents with the given uri were found" when clicking
> "Newsletter" on the archived regional sites.
>
>
> 3) Dynamic pages / question marks in the URL
> ============================================
>
> I've read about this bug some time ago - is it supposed to have been
> fixed?
>
> As soon as there is one question mark (or a '+' sign, or others?) in a URL
> the page can't be retrieved. Say I search for "Juncker"... I get:
>
> -----------------------------
>
> 1. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique
> (http://hesper.csv.lu/2004/juncker.html)
> ( ... CSV Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CSV
> lokal Juncker on Tour zu Hesper am Centre Civique
>
> Zeréck CSV. De séchere Wee. Rufft eis un um 22 57 31-1 ...
> )
> Number of versions satisfying query / total number of versions : 1/1
> Timeline | Overview
>
> 2. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique
> (http://hesper.csv.lu/index.php?print=1&a=2004/juncker.html)
> ( ... CSV Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CSV
> lokal Juncker on Tour zu Hesper am Centre Civique
>
> Zeréck CSV. De séchere Wee. Rufft eis un um 22 57 31-1 ...
> )
> Number of versions satisfying query / total number of versions : 0/0
> Timeline | Overview
>
>
> 3. CSV - Jean-Claude Juncker sur France Inter
> (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition)
> ( ... ministre Jean-Claude Juncker au sujet du raz-de-marée en Asie du
> Sud-est France Inter: Bonjour, Jean-Claude Juncker. Jean-Claude Juncker:
> Oui, bonjour. France Inter: En tant que président en exercice de l'Union
> européenne, vous étiez présent jeudi dernier aux ... importants
> puisqu'il s'agit d'une région du monde qui nous est très proche. France
> Inter: Merci, Jean-Claude Juncker. Merci, Monsieur le Président.
> Jean-Claude ... )
> Number of versions satisfying query / total number of versions : 0/0
> Timeline | Overview
>
> 4. CSV - Jean-Claude Juncker sur France Inter
> (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition?print=1)
> (...)
> 5. CSV - Interview mam Jean-Claude Juncker
> (http://www.csv.lu/text/2133.html/Frank+Engel?print=1)(...)
> 6. CSV - Interview mam Jean-Claude Juncker
> (http://www.csv.lu/text/2133.html/Frank+Engel)(...)
> 7. CSV - Edmée Juncker verabschiedet sich als Präsidentin
> (http://www.csv.lu/text/1978.html/Marco+Schank)(...)
> 8. CSV - Edmée Juncker verabschiedet sich als Präsidentin
> (http://www.csv.lu/text/1978.html/Marco+Schank?print=1)(...)
> 9. CSV - Drei Fragen an Jean-Claude Juncker
> (http://www.csv.lu/text/2212.html/Claude+Wiseler)(...)
> 10. CSV - Drei Fragen an Jean-Claude Juncker
> (http://www.csv.lu/text/2212.html/Luc+Frieden)(...)
>
> -------------------------
>
> Results 2-10 all show me "Sorry, no documents with the given uri were
> found". They also have "total number of versions 0/0".
>
> The only link that retrieves anything is the first one. But even here: The
> page I get has a set of thumbnails which are only displayed for about 0.1
> seconds and then disappear (I guess because of JavaScript replacing the
> links with links pointing to within the collection..). A look at the
> source code of the page shows that these pictures should be:
> juncker/JoTt-(01).jpg
> juncker/JoTt-(02).jpg
> ...
>
> So I search for "JoTt-(01).jpg"...
>
> and get 2 hits:
>
> Total number of versions found : 2. Displaying URL's 1-2
> 1. http://hesper.csv.lu/juncker/JoTt-(01).jpg
> (http://hesper.csv.lu/juncker/JoTt-(01).jpg)
> (CSV CSV lokal Fehler: Dï¿½s Sï¿½t existï¿½ert net!
> CSV. De séchere Wee. Rufft eis un um 22 57 31-1 oder schéckt eng Email
> op csv@csv.)
> Number of versions satisfying query / total number of versions : 0/0
> Timeline | Overview
>
> 2. http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg
> (http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg)
> ( ...
> http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg)
> Number of versions satisfying query / total number of versions : 0/0
> Timeline | Overview
>
> Again, both not retrievable.
> Same goes for any other pictures with
> brackets (and possibly some other non-"a-z|A-Z|0-9" characters) in the
> filename.

I think this bug is fixed, but I'm not really sure whether the fix is in CVS.
There was a discussion about this, and Svere described (I think) how to fix it.
I can forward that email to you if you want.

>
>
> 4) Special characters
> =====================
>
> This has repeatedly been reported as fixed but there is still trouble:
>
> Searching for "Edmée" (in case that doesn't display fine: e-d-m-eacute-e)
> gives me hits but ONLY if I manually set Encoding of my browser (Firefox)
> to "Windows 1252" or "ISO 8859-1". If I do that, then enter the "Edmée",
> and then Search I get a page with results,
>
> BUT
>
> the Search box now says "Edm?e" and Character encoding has been set back
> to UTF-8. If I now do another search (say "français") I get again "no
> hits!". I'd have to set back Character encoding manually before each
> search.
>

I reported this bug and sent a patch to Michael. The problem is in how the
request query is read from the Tomcat server: the parameter arrives decoded
as ISO-8859-1, so you have to state the "from" encoding explicitly when
converting the query, like this:

    String parameter = request.getParameter("query");
    if (parameter == null) parameter = "";
    String queryString = new String(parameter.getBytes("ISO-8859-1"), "UTF-8");

I made this change in the JSP search.jsp and it works for me. You can try it.

>
> 5) XML error: reference to invalid character number at line 34
> ==============================================================
> For some searches (on collections indexed with nutchwax release 0.2.1) I
> get only the above error message as result.
> The source code :
>
> *****START*****
>
> <!-- ************************ Results:
> ****************************************************** -->
>
>
> <table class="greyborder"
> align="center" border="0" cellpadding="1" cellspacing="0" width="90%"><tbody><tr><td>
>
>
>
> <table class="resultsborder"
> align="center" border="0" cellpadding="10" cellspacing="0" width="100%"><tbody><tr><td>
>
> XML error: reference to invalid character number at line 34
>
>
> *****END*****
>
> That's the last line (HTML generation by php is cut off there)
>
> A look into catalina.out :
>
> *****START*****
>
> 050929 163012 12 query request from 192.168.6.21
> 050929 163012 12 query: Juncker
> 050929 163012 12 searching for 20 raw hits
> 050929 163012 12 re-searching for 40 raw hits, query: juncker
> -exacturl:"ZUKNZ3J2N7I5Z3A2MEYYU6PP7M" -exacturl:"HY5Q6TJQ7YL
> 2VFYHAJXT7SYMPY" -exacturl:"LDA5RUE6G6T46A2SEBDHQQ4JAQ"
> -exacturl:"X6LW4F7OYOFF6NXMC3WKOJVHJY" -exacturl:"WGBI4JQ3RXDYOBBAX
> WO4ZHCSQY"
> 050929 163012 12 found 10476 raw hits
> 050929 163012 12 total hits: 10496
>
> *****END*****

Is this output from WERA? I remember a bug like this from a previous version
of NWA. There was a problem (I'm not really sure, it was a long time ago) with
characters like /amp and &.
I can try to look at it...

>
> 6) Wrong re-setting of Character encoding
> =========================================
>
> On the "live" web, www.gouvernement.lu has character encoding UTF-8. Every
> time you reload the page it sets it to this.
>
> In my archived collection, every time I retrieve a page from this URL,
> encoding is always set back to ISO 8859-1. The page, being in French, is
> therefore pretty much unreadable and you have to set back Encoding
> manually back to UTF-8 after every click.
>
>
> 7) Immediate re-direct to "live" web
> ====================================
> URL http://www.lsap.lu (in my seeds list) is a redirect to
> http://www.lsap.lu/index.php?idusergroup=42114236.
>
> When I retrieve http://www.lsap.lu/ from my collection, WERA immediately
> displays the live web page. Besides that, <i>every</i> link on www.lsap.lu
> includes variables (question marks) and is hence unretrievable (see (3)).
>
>
> ==================================================================
> PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
> ==================================================================
>
> 8) No images indexed?
> =====================
>
> When I index my collection with NutchWax head CVS BUILD, no images appear
> at all.
>
> One method has been suggested here to see if a file is in the archive:
>
> >Setting $conf_debug to 1 in /lib/config.inc and changing index.php
> >
> >from $search->setFieldsInResult("teaser url description");
> >to $search->setFieldsInResult("teaser url description
> archiveidentifier");
>
> When I do this and query for one of the many non-displayed images (e.g.
> "gouvernement.gif") I get
>
> [1] => Array
> (
> [teaser] =>
> http://www.gouvernement.lu/pictures/layout/gouvernement.gif
> [url] =>
> http://www.gouvernement.lu/pictures/layout/gouvernement.gif
> [archiveidentifier] => //arc/.arc.gz
> )
>
> So I look in the indexarcs output file and notice I have plenty of entries
> like this:
>
> (...)
> 050929 115748 adding 4223 bytes of mimetype image/jpeg
> http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
> 050929 115748 Failed parse: Content-Type not text/html: image/jpeg
> (...)
>
> and towards the end of the file:
>
> (...)
> 050929 125148 No collection for url
> http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No arcname for url
> http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No arcoffset for url
> http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No collection for url
> http://www.adr.lu/Norden/koepp_port.jpg
> 050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
> 050929 125148 No arcoffset for url http://www.adr.lu/Norden/koepp_port.jpg
> (...)
>
> I didn't have these lines before (when I indexed with the released
> nutchwax as opposed to the cvs built)
>
> Any ideas on how this is possible or what it means? Why do my images not
> have an archiveidentifier? My indexing process must have been wrong I
> guess?
>
> bin/indexarcs.sh -c elections -s /arc/ -d
> /usr/share/archive-access/projects/nutchwax_head/nutch-data-29sep/
> &>index_arc_elections_29sep.log
>
> What is a typical indexarcs.sh command line meant to look like instead?
>
> ==================================
>
> One more question: Is there a version of WERA newer than the 0.2.2 release
> going somewhere (via cvs, for instance) that's worth getting (ie with any
> substantial changes)?
> If so, what commands or steps need to be executed to
> use it?
>
> That's all for now :)
>
> Looking forward to reading your comments
>
> ==================================
> Charlie Foetz
> Bibliothèque nationale de Luxembourg

lukas
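
P.S. In case it helps to try the query re-decoding from point 4 outside of Tomcat, here is a minimal standalone sketch. The class and method names (QueryDecoding, fixEncoding) are only illustrative and are not part of WERA or Nutchwax, and main() merely simulates a container that has decoded UTF-8 bytes as ISO-8859-1. StandardCharsets needs Java 7 or later; on Java 1.5 the charset names would be passed as strings, as in the search.jsp change.

```java
import java.nio.charset.StandardCharsets;

public class QueryDecoding {

    // Re-interpret a parameter that was decoded as ISO-8859-1 although the
    // browser actually sent UTF-8 bytes. ISO-8859-1 maps every byte to one
    // character, so getBytes() recovers the original bytes unchanged.
    public static String fixEncoding(String parameter) {
        if (parameter == null) {
            return "";
        }
        byte[] rawBytes = parameter.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "Edm\u00e9e"; // "Edmée" as typed in the search box

        // Simulate the mis-decoding: UTF-8 bytes read as ISO-8859-1.
        String misdecoded = new String(
                typed.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        String fixed = fixEncoding(misdecoded);
        System.out.println(fixed.equals(typed)); // prints "true"
    }
}
```

Note that this only helps when the parameter really was mis-decoded. Applying it to a value that was already decoded correctly would damage any non-ASCII characters in it.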