From: stack <st...@ar...> - 2005-10-01 06:33:20
Sorry. I just noticed I wasn't subscribed to the discussion list so have missed past postings.

First, thanks Charles for the report on problems using WERA+NutchWAX (and thanks Lukáš for CC'ing archive-access-cvs so I got to see the below).

Sverre and I will take a closer look at your mail next Monday, but for now...

...

>>===========================================================
>>PART 1 - INDEXED WITH THE RELEASE VERSION 0.2.1 OF NUTCHWAX
>>===========================================================
>>
>>
>>1) Inline redirected images
>>===========================
>>
>>....
>>What I suppose happened is that Heritrix tried fetching (1-3), got a
>>redirect back, therefore fetched and archived (4-6). Now when WERA
>>retrieves (1-3) it doesn't find them, since these URLs were never
>>archived.
>>

Correct.

>>I don't know what could be a workaround for this, but I suppose it can be a
>>serious problem. Would it also happen with redirected html pages?
>>

Yes.

Need to look at this. Heritrix usually records redirects, so when indexing we should probably figure out how to make an exact search on a redirected URL return the redirect result instead.

>>2) Need for URL canonicalisation in WERA?
>>=========================================
>>

The wayback does URL canonicalization. Nutchwax currently does none. When indexing, we should probably write a normalized URL into the exacturl field (perhaps we need to rename it) and then apply the same normalization to queries on exacturl before we go into the index.
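
A minimal sketch of the idea in the paragraph above: run one normalization function over the URL at index time and over the query at search time, so both sides agree on the key. This is not the wayback's canonicalization and not NutchWAX code. The class name ExactUrlKey, the method normalize and the particular rules (lowercase scheme and host, drop default ports and fragments, treat an empty path as "/") are assumptions chosen for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of NutchWAX or the wayback.
final class ExactUrlKey {
    private ExactUrlKey() {}

    // Returns a normalized form of an absolute URL. The same method is
    // meant to be called when writing the index field and when building
    // an exact-URL query. Throws IllegalArgumentException for input that
    // java.net.URI cannot parse or that has no host.
    static String normalize(String url) {
        URI u = URI.create(url.trim()).normalize(); // also removes "." and ".." segments
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("not an absolute URL: " + url);
        }
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);

        StringBuilder key = new StringBuilder(scheme).append("://").append(host);

        // Keep the port only when it is not the default for the scheme.
        int port = u.getPort();
        boolean defaultPort = (port == 80 && scheme.equals("http"))
                || (port == 443 && scheme.equals("https"));
        if (port != -1 && !defaultPort) {
            key.append(':').append(port);
        }

        // An empty path and "/" name the same resource.
        String path = u.getRawPath();
        key.append(path == null || path.isEmpty() ? "/" : path);

        // Keep the query string (dynamic pages), drop the fragment.
        if (u.getRawQuery() != null) {
            key.append('?').append(u.getRawQuery());
        }
        return key.toString();
    }
}
```

With rules like these, http://www.lsap.lu and http://www.lsap.lu/ map to the same key. Which rules are actually safe for an archive (for example whether to strip "www." or session ids) is a separate decision that this sketch does not make.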
>>3) Dynamic pages / question marks in the URL
>>============================================
>>
>>...
>>Again, both not retrievable. Same goes for any other pictures with
>>brackets (and possibly some other non-"a-z|A-Z|0-9" characters) in the
>>filename.
>>
>
>I think this bug is fixed but I'm not really sure if it is in cvs. There was a discussion
>about this and Sverre described (I think) how to fix it. I can forward this email to you if you want.
>

As Lukáš says, I think this has been fixed. Will confirm Monday.

>>4) Special characters
>>=====================
>>
>>This has repeatedly been reported as fixed but there is still trouble:
>>
>>Searching for "Edmée" (in case that doesn't display fine: e-d-m-eacute-e)
>>gives me hits but ONLY if I manually set Encoding of my browser (Firefox)
>>to "Windows 1252" or "ISO 8859-1". If I do that, then enter the "Edmée",
>>and then Search I get a page with results,
>>
>>BUT
>>
>>the Search box now says "Edm?e" and Character encoding has been set back
>>to UTF-8. If I now do another search (say "français") I get again "no
>>hits!". I'd have to set back Character encoding manually before each
>>search.
>>
>
>I posted this bug and sent a patch to Michael. The problem is in getting the request query from the tomcat server.
>You have to explicitly specify the "from" encoding when converting the query
>
>like
>
>String parameter = request.getParameter("query");
>if (parameter == null) parameter = "";
>String queryString = new String(parameter.getBytes("ISO-8859-1"), "UTF-8");
>
>I made this change in the JSP search.jsp and it works for me. You can try it.
>

I committed Lukáš's fix a while back. But the fix is in HEAD, not in a release yet.
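
For readers who want to try the re-decoding step outside a servlet container, here is a small standalone sketch of the same conversion as in Lukáš's snippet. It assumes the container decoded UTF-8 request bytes as ISO-8859-1, which is the situation described above. The class name QueryDecoder and the method redecode are made up for this example, and StandardCharsets needs Java 7 or later (the quoted snippet uses charset names instead).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper mirroring the fix quoted above; not the search.jsp code itself.
final class QueryDecoder {
    private QueryDecoder() {}

    // Takes a parameter value that was decoded as ISO-8859-1 although the
    // browser sent UTF-8 bytes, and returns the intended text.
    // A missing parameter (null) becomes the empty string, as in the snippet.
    static String redecode(String parameter) {
        if (parameter == null) {
            return "";
        }
        // ISO-8859-1 maps each char 0x00-0xFF back to the original byte,
        // so this recovers the raw bytes the browser sent.
        byte[] raw = parameter.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

In a JSP this would be called as QueryDecoder.redecode(request.getParameter("query")). The conversion is only correct when the value really was mis-decoded. If the container is already configured to decode parameters as UTF-8, applying it again would corrupt non-ASCII queries.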
>>5) XML error: reference to invalid character number at line 34
>>==============================================================
>>...
>
>Is this output from wera? I remember a bug like this from a previous version of nwa.
>There was a problem (I'm not really sure, it's a long time ago) with characters like /amp &
>I can try to look at it...
>

Might be a bug in nutch itself. Few are using the OpenSearchServlet. It's probably not checking the snippet content for the disallowed XML characters. We need to fix it.

>>6) Wrong re-setting of Character encoding
>>=========================================
>>
>>On the "live" web, www.gouvernement.lu has character encoding UTF-8. Every
>>time you reload the page it sets it to this.
>>
>>In my archived collection, every time I retrieve a page from this URL,
>>encoding is always set back to ISO 8859-1. The page, being in French, is
>>therefore pretty much unreadable and you have to set Encoding
>>manually back to UTF-8 after every click.
>>
>>
>>7) Immediate re-direct to "live" web
>>====================================
>>URL http://www.lsap.lu (in my seeds list) is a redirect to
>>http://www.lsap.lu/index.php?idusergroup=42114236.
>>
>>When I retrieve http://www.lsap.lu/ from my collection, WERA immediately
>>displays the live web page. Besides that, *every* link on www.lsap.lu
>>includes variables (question marks) and is hence unretrievable (see (3)).
>>

Ok. Thanks. Will look into these.
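
On item 5, a sketch of the kind of check that seems to be missing: dropping characters that XML 1.0 does not allow before snippet text is written into the XML response. This is not the Nutch OpenSearchServlet code. The class name XmlText and its methods are hypothetical, and the ranges are those of the Char production in the XML 1.0 specification.

```java
// Hypothetical helper; not taken from Nutch or NutchWAX.
final class XmlText {
    private XmlText() {}

    // True if the code point is allowed in an XML 1.0 document:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the text with every disallowed code point removed.
    // Unpaired surrogates fall in the excluded range and are dropped too.
    static String stripInvalid(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Stripping only removes characters that can never appear in XML 1.0, even as numeric character references. Markup characters such as & and < still need the usual escaping when the text is serialized.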
>>==================================================================
>>PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
>>==================================================================
>>
>>8) No images indexed?
>>=====================
>>
>>When I index my collection with NutchWax head CVS BUILD, no images appear
>>at all.
>>
>>One method has been suggested here to see if a file is in the archive:
>>
>>>Setting $conf_debug to 1 in /lib/config.inc and changing index.php
>>>
>>>from $search->setFieldsInResult("teaser url description");
>>>to $search->setFieldsInResult("teaser url description archiveidentifier");
>>
>>When I do this and query for one of the many non-displayed images (e.g.
>>"gouvernement.gif") I get
>>
>>[1] => Array
>>(
>>[teaser] =>
>>http://www.gouvernement.lu/pictures/layout/gouvernement.gif
>>[url] =>
>>http://www.gouvernement.lu/pictures/layout/gouvernement.gif
>>[archiveidentifier] => //arc/.arc.gz
>>)
>>
>>So I look in the indexarcs output file and notice I have plenty of entries
>>like this:
>>
>>(...)
>>050929 115748 adding 4223 bytes of mimetype image/jpeg
>>http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
>>050929 115748 Failed parse: Content-Type not text/html: image/jpeg
>>(...)
>>
>>and towards the end of the file:
>>
>>(...)
>>050929 125148 No collection for url
>>http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
>>050929 125148 No arcname for url
>>http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
>>050929 125148 No arcoffset for url
>>http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
>>050929 125148 No collection for url
>>http://www.adr.lu/Norden/koepp_port.jpg
>>050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
>>050929 125148 No arcoffset for url http://www.adr.lu/Norden/koepp_port.jpg
>>(...)
>>
>>I didn't have these lines before (when I indexed with the released
>>nutchwax as opposed to the cvs build)
>>
>>Any ideas on how this is possible or what it means? Why do my images not
>>have an archiveidentifier? My indexing process must have been wrong I
>>guess
>>

The above lines are complaining that a resource was incompletely added to the index: it's missing core metadata fields. I've been slowly addressing all the reasons why this might happen but there is still some work to do.

>>bin/indexarcs.sh -c elections -s /arc/ -d
>>/usr/share/archive-access/projects/nutchwax_head/nutch-data-29sep/
>>&>index_arc_elections_29sep.log
>>
>>What is a typical indexarcs.sh command line meant to look like instead?
>>

That looks fine. Usually you also have to have a collection name '-c NAME'.

>>==================================
>>
>>One more question: Is there a version of WERA newer than the 0.2.2 release
>>going somewhere (via cvs, for instance) that's worth getting (ie with any
>>substantial changes)? If so, what commands or steps need to be executed to
>>use it?
>>
>>That's all for now :)
>>
>>Looking forward to reading your comments
>>

Sverre is coming to SF next week. We're going to work on a new release. New release will be bug fixes.
Hopefully we get all above. Expect release first week or two in October.

Thanks for the report Charles and good seeing you in Vienna.

St.Ack

>>==================================
>>Charlie Foetz
>>Bibliothèque nationale de Luxembourg
>>
>
>lukas
>
>_______________________________________________
>Archive-access-cvs mailing list
>Arc...@li...
>https://lists.sourceforge.net/lists/listinfo/archive-access-cvs