From: Michael S. <sta...@us...> - 2005-10-04 22:43:25
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/css In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20709/css Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/css added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:43:25
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/test In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20709/test Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/test added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:43:25
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/lib In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20709/lib Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/lib added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:43:25
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/handlers In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20709/handlers Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera/handlers added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:42:10
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20443/wera Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/webapps/wera added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:37:48
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/articles/images In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv19513/images Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/articles/images added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:31:35
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/installer In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18341/installer Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/installer added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:31:35
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/webapps In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18341/webapps Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/webapps added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:31:35
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/articles In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18341/articles Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/articles added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:31:35
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src/images In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18341/images Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src/images added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:30:17
|
Update of /cvsroot/archive-access/archive-access/projects/wera/wera In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18013/wera Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/wera added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:30:17
|
Update of /cvsroot/archive-access/archive-access/projects/wera/xdocs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18013/xdocs Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/xdocs added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 22:30:17
|
Update of /cvsroot/archive-access/archive-access/projects/wera/src In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv18013/src Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera/src added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 20:01:23
|
Update of /cvsroot/archive-access/archive-access/projects/wera In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv17351/wera Log Message: Directory /cvsroot/archive-access/archive-access/projects/wera added to the repository |
From: Michael S. <sta...@us...> - 2005-10-04 00:13:38
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/web
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv31835

Modified Files:
search.jsp
Log Message:
Last bit of '[ 1244875 ] exacturl encoding not working'
* search.jsp
Encode the query we put under the RSS logo. Do encoding on original query putting link under the RSS logo, not on one that has already had entities encoded.

Index: search.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/web/search.jsp,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** search.jsp 7 Sep 2005 23:05:29 -0000 1.20
--- search.jsp 4 Oct 2005 00:13:22 -0000 1.21
***************
*** 92,96 ****
  String rss = request.getContextPath() + "/opensearch?query=" +
! htmlQueryString + "&hitsPerDup=" + hitsPerDup
  + ((start != 0)? "&start=" + start: "") + params;
--- 92,96 ----
  String rss = request.getContextPath() + "/opensearch?query=" +
! response.encodeURL(queryString) + "&hitsPerDup=" + hitsPerDup
  + ((start != 0)? "&start=" + start: "") + params;
|
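For readers following the change above: `HttpServletResponse.encodeURL` is documented as a session-ID URL-rewriting helper, while percent-encoding a single query-parameter value is commonly done with `java.net.URLEncoder`. The following is a minimal sketch of that alternative, not the committed code. The class name `RssLinkBuilder`, the method `buildRssLink`, and the `/nutchwax` context path are hypothetical; only the `/opensearch?query=`, `hitsPerDup`, and `start` parts come from the diff.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper, not part of NutchWAX: builds the link placed under the RSS logo. */
final class RssLinkBuilder {

    /** Percent-encodes the raw (not HTML-entity-encoded) query as UTF-8 and appends the paging parameters. */
    static String buildRssLink(String contextPath, String queryString, int hitsPerDup, int start) {
        String encodedQuery;
        try {
            encodedQuery = URLEncoder.encode(queryString, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support, so this should not happen.
            throw new IllegalStateException(e);
        }
        String link = contextPath + "/opensearch?query=" + encodedQuery + "&hitsPerDup=" + hitsPerDup;
        if (start != 0) {
            link += "&start=" + start;
        }
        return link;
    }

    public static void main(String[] args) {
        // "/nutchwax" is an assumed context path, used only for illustration.
        System.out.println(buildRssLink("/nutchwax", "Edm\u00e9e Juncker", 2, 0));
    }
}
```

`URLEncoder` produces form-style encoding, so a space becomes `+` rather than `%20`; servlet containers decode both forms in query strings.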
From: Michael S. <sta...@us...> - 2005-10-03 23:47:19
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25779/xdocs

Modified Files:
2005-oswir-wacsearch.sxi navigation.xml
Added Files:
downloads.xml
Log Message:
* xdocs/navigation.xml
* xdocs/downloads.xml
Add new downloads page with talk of the continuous build server.

Index: 2005-oswir-wacsearch.sxi
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/2005-oswir-wacsearch.sxi,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
Binary files /tmp/cvsy2KoHl and /tmp/cvsp7LUQH differ

Index: navigation.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/navigation.xml,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** navigation.xml 29 Jul 2005 22:12:23 -0000 1.5
--- navigation.xml 3 Oct 2005 23:47:10 -0000 1.6
***************
*** 14,18 ****
  <item name="Requirements" href="/requirements.html"/>
  <item name="Downloads"
! href="http://sourceforge.net/project/showfiles.php?group_id=118427"/>
  <item name="Documentation" >
--- 14,18 ----
  <item name="Requirements" href="/requirements.html"/>
  <item name="Downloads"
! href="downloads.html"/>
  <item name="Documentation" >

--- NEW FILE: downloads.xml ---
<?xml version="1.0" encoding="ISO-8859-1"?>
<document>
<properties>
<title>Downloads</title>
<author email="stack at archive dot org">St.Ack</author>
<revision>$Id: downloads.xml,v 1.1 2005/10/03 23:47:10 stack-sf Exp $</revision>
</properties>
<body>
<section name="Downloads">
<subsection name="Releases">
<p>All releases are available off the
<a href="http://sourceforge.net/project/showfiles.php?group_id=118427">Sourceforge Downloads</a>
page. Release notes can be found here,
<a href="articles/releasenotes.html">Heritrix Release Notes</a>.
</p>
</subsection>
<subsection name="Continuous build">
<p>Here is a
<a href="http://crawltools.archive.org:8080/cruisecontrol/">pointer</a>
to our continuous build box. The latest build can be found by clicking on
<i>HEAD-archive-access</i>. Bundled up version of the latest build can be
found under the 'Build Artifacts' link. Be aware that this distribution has
been made from CVS HEAD and CVS HEAD builds are not guaranteed stable.
</p>
</subsection>
</section>
</body>
</document>
|
From: Michael S. <sta...@us...> - 2005-10-02 20:57:41
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11862/conf

Modified Files:
nutch-site.xml
Log Message:
* conf/nutch-site.xml
Same value as that on the mapreduce branch.

Index: nutch-site.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** nutch-site.xml 27 Sep 2005 23:39:38 -0000 1.26
--- nutch-site.xml 30 Sep 2005 21:07:07 -0000 1.27
***************
*** 69,73 ****
  <property>
  <name>indexer.maxMergeDocs</name>
! <value>1000000000</value>
  <description>This number determines the maximum number of Lucene
  Documents to be merged into a new Lucene segment. Larger values
--- 69,73 ----
  <property>
  <name>indexer.maxMergeDocs</name>
! <value>2147483647</value>
  <description>This number determines the maximum number of Lucene
  Documents to be merged into a new Lucene segment. Larger values
|
From: stack <st...@ar...> - 2005-10-01 06:33:20
|
Sorry. I just noticed I wasn't subscribed to the discussion list so
have missed past postings.

First, thanks Charles for the report on problems using WERA+NutchWAX
(And thanks Lukáš for CC'ing archive-access-cvs so I got to see the below).

Sverre and I will take a closer look at your mail next Monday, but for
now...

...

>>===========================================================
>>PART 1 - INDEXED WITH THE RELEASE VERSION 0.2.1 OF NUTCHWAX
>>===========================================================
>>
>>
>>1) Inline redirected images
>>===========================
>>
>>....
>>What I suppose happened is that Heritrix tried fetching (1-3), got a
>>redirect back, therefore fetched and archived (4-6). Now when WERA
>>retrieves (1-3) it doesn't find them, since these URLs were never
>>archived.
>>

Correct.

>>I don't know what could be a workaround for this, but I suppose it can be a
>>serious problem. Would it also happen with redirected html pages?
>>

Yes.

Need to look at this. Heritrix usually records redirects, so when indexing,
we should probably figure out how to make it so an exact search on a
redirected URL gets you the redirect result instead.

>>2) Need for URL canonicalisation in WERA?
>>=========================================
>>

In the wayback, URL canonicalization is done. None in
nutchwax currently. When indexing, we should probably write a normalized URL
into the exacturl field -- perhaps need to rename it -- and then queries
on exacturl get the same normalization done before we go into the index.
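The idea in the last paragraph (apply one normalization both when writing the exacturl field and when querying it) can be sketched roughly as below. This is an illustration only, not the wayback's or NutchWAX's canonicalizer: the class `UrlCanonicalizer` and its rules (lower-casing scheme and host, stripping a leading `www.`, dropping default ports and fragments) are assumptions chosen to cover the `www.csv.lu` versus `csv.lu` case from the report.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical canonicalizer; the rules below are illustrative assumptions only. */
final class UrlCanonicalizer {

    /** Returns one normalized form, to be used both at index time and at query time. */
    static String canonicalize(String url) {
        URI uri = URI.create(url.trim());
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring(4); // assumption: www.example and example serve the same content
        }
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) {
            out.append(':').append(port);
        }
        out.append(path);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        return out.toString(); // the fragment, if any, is intentionally dropped
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("http://www.csv.lu/newsletter"));
        System.out.println(canonicalize("http://csv.lu/newsletter"));
    }
}
```

Because the same function would run on both sides, the exact rules matter less than applying them consistently; stripping `www.` is the riskiest of them and is not safe for every site.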
>>3) Dynamic pages / question marks in the URL
>>============================================
>>
>>...
>>Again, both not retrievable. Same goes for any other pictures with
>>brackets (and possibly some other non-"a-z|A-Z|0-9" characters) in the
>>filename.
>>
>
>i think this bug is fixed but i'm not really sure if it is at cvs. There was a discussion
>about this and Sverre described (i think) how to fix it. I can forward this email to you if you want.
>

As Lukáš says, I think this has been fixed. Will confirm Monday.

>>4) Special characters
>>=====================
>>
>>This has repeatedly been reported as fixed but there is still trouble:
>>
>>Searching for "Edmée" (in case that doesn't display fine: e-d-m-eacute-e)
>>gives me hits but ONLY if I manually set Encoding of my browser (Firefox)
>>to "Windows 1252" or "ISO 8859-1". If I do that, then enter the "Edmée",
>>and then Search I get a page with results,
>>
>>BUT
>>
>>the Search box now says "Edm?e" and Character encoding has been set back
>>to UTF-8. If I now do another search (say "français") I get again "no
>>hits!". I'd have to set back Character encoding manually before each
>>search.
>>
>
>i posted this bug and sent patch to Michael. The problem is in getting request query from tomcat server.
>You have to explicitly specify "from" encoding in converting query
>
>like
>
>String parameter = request.getParameter("query");
>if (parameter == null) parameter = "";
>String queryString = new String(parameter.getBytes("ISO-8859-1"), "UTF-8");
>
>i made this change in JSP search.jsp and it works for me. you can try it.
>

I committed Lukáš's fix a while back. But fix is in HEAD, not in a
release yet.
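To see why the re-decoding step quoted above works, here is a small standalone sketch. It assumes, as the patch does, that the container decoded the UTF-8 request bytes as ISO-8859-1; the class `QueryDecoding` and the method `redecode` are hypothetical names, and the servlet request is simulated with plain strings.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of the ISO-8859-1 to UTF-8 re-decoding used in the quoted patch. */
final class QueryDecoding {

    /** Recovers the original bytes via ISO-8859-1 and decodes them as UTF-8. */
    static String redecode(String parameter) {
        if (parameter == null) {
            return "";
        }
        return new String(parameter.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "Edm\u00e9e"; // what the user typed: Edmée
        // Simulate a container that decodes the UTF-8 bytes of the query as ISO-8859-1.
        String asSeenByServlet = new String(typed.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(asSeenByServlet.length()); // one char longer than the typed text
        System.out.println(redecode(asSeenByServlet).equals(typed));
    }
}
```

The round trip is lossless because ISO-8859-1 maps every byte value to exactly one character. It is only correct when the container really used ISO-8859-1; if the container is configured to decode requests as UTF-8, applying it would corrupt non-ASCII input.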
>>5) XML error: reference to invalid character number at line 34
>>==============================================================
>>...
>
>is this output from wera? i remember bug like this from previous version of nwa.
>There was problem (i'm not really sure it's a long time ago) with characters like /amp &
>i can try to look at it...
>

Might be a bug in nutch itself. Few are using the OpenSearchServlet.
It's probably not checking the snippet content for the disallowed XML
characters. We need to fix it.

>>6) Wrong re-setting of Character encoding
>>=========================================
>>
>>On the "live" web, www.gouvernement.lu has character encoding UTF-8. Every
>>time you reload the page it sets it to this.
>>
>>In my archived collection, every time I retrieve a page from this URL,
>>encoding is always set back to ISO 8859-1. The page, being in French, is
>>therefore pretty much unreadable and you have to set back Encoding
>>manually back to UTF-8 after every click.
>>
>>
>>7) Immediate re-direct to "live" web
>>====================================
>>URL http://www.lsap.lu (in my seeds list) is a redirect to
>>http://www.lsap.lu/index.php?idusergroup=42114236.
>>
>>When I retrieve http://www.lsap.lu/ from my collection, WERA immediately
>>displays the live web page. Besides that, every link on www.lsap.lu
>>includes variables (question marks) and is hence unretrievable (see (3)).
>>

Ok. Thanks. Will look into these.
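On item 5 above: a check of the kind suggested (filtering snippet text for characters XML does not allow) might look like the sketch below. It follows the `Char` production of the XML 1.0 specification; the class `XmlSanitizer` is hypothetical and is not code from Nutch or its OpenSearchServlet.

```java
/** Hypothetical filter; drops code points that XML 1.0 does not allow in documents. */
final class XmlSanitizer {

    /** XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isAllowedInXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the text with every disallowed code point (including lone surrogates) removed. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isAllowedInXml10(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalidXmlChars("Juncker\u0000 on\u000B Tour"));
    }
}
```

Stripping happens before escaping `&`, `<` and `>`: a control character written as a numeric character reference such as `&#11;` is still rejected by XML 1.0 parsers, which is consistent with the "reference to invalid character number" message in the report.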
>>==================================================================
>>PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
>>==================================================================
>>
>>8) No images indexed?
>>=====================
>>
>>When I index my collection with NutchWax head CVS BUILD, no images appear
>>at all.
>>
>>One method has been suggested here to see if a file is in the archive:
>>
>>>Setting $conf_debug to 1 in /lib/config.inc and changing index.php
>>>
>>>from $search->setFieldsInResult("teaser url description");
>>>to $search->setFieldsInResult("teaser url description
>>archiveidentifier");
>>
>>When I do this and query for one of the many non-displayed images (e.g.
>>"gouvernement.gif") I get
>>
>>[1] => Array
>>(
>>[teaser] =>
>>http://www.gouvernement.lu/pictures/layout/gouvernement.gif
>>[url] =>
>>http://www.gouvernement.lu/pictures/layout/gouvernement.gif
>>[archiveidentifier] => //arc/.arc.gz
>>)
>>
>>So I look in the indexarcs output file and notice I have plenty of entries
>>like this:
>>
>>(...)
>>050929 115748 adding 4223 bytes of mimetype image/jpeg
>>http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
>>050929 115748 Failed parse: Content-Type not text/html: image/jpeg
>>(...)
>>
>>and towards the end of the file:
>>
>>(...)
>>050929 125148 No collection for url
>>http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
>>050929 125148 No arcname for url
>>http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
>>050929 125148 No arcoffset for url
>>http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
>>050929 125148 No collection for url
>>http://www.adr.lu/Norden/koepp_port.jpg
>>050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
>>050929 125148 No arcoffset for url http://www.adr.lu/Norden/koepp_port.jpg
>>(...)
>>
>>I didn't have these lines before (when I indexed with the released
>>nutchwax as opposed to the cvs built)
>>
>>Any ideas on how this is possible or what it means? Why do my images not
>>have an archiveidentifier? My indexing process must have been wrong I
>>guess
>>

The above lines are complaining that a resource was incompletely added to
the index -- it's missing core metadata fields. I've been slowly addressing
all the reasons why this might happen but there is still some work to do.

>>bin/indexarcs.sh -c elections -s /arc/ -d
>>/usr/share/archive-access/projects/nutchwax_head/nutch-data-29sep/
>>&>index_arc_elections_29sep.log
>>
>>What is a typical indexarcs.sh command line meant to look like instead?
>>

That looks fine. Usually you also have to have a collection name '-c NAME'.

>>==================================
>>
>>One more question: Is there a version of WERA newer than the 0.2.2 release
>>going somewhere (via cvs, for instance) that's worth getting (ie with any
>>substantial changes)? If so, what commands or steps need to be executed to
>>use it?
>>
>>That's all for now :)
>>
>>Looking forward to reading your comments
>>

Sverre is coming to SF next week. We're going to work on a new
release. New release will be bug fixes.
Hopefully we get all above.
Expect release first week or two in october.

Thanks for the report Charles and good seeing you in Vienna.
St.Ack

>>==================================
>>Charlie Foetz
>>Bibliothèque nationale de Luxembourg
>>
>
>lukas
>
>_______________________________________________
>Archive-access-cvs mailing list
>Arc...@li...
>https://lists.sourceforge.net/lists/listinfo/archive-access-cvs
|
From: Michael S. <sta...@us...> - 2005-09-30 17:06:53
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/iwaw
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5475/xdocs/iwaw

Modified Files:
wera.pdf wera.sxi
Log Message:
* xdocs/iwaw/wera.pdf
* xdocs/iwaw/wera.sxi
Actual versions presented.

Index: wera.sxi
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/iwaw/wera.sxi,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
Binary files /tmp/cvsy4CPAt and /tmp/cvs5icQjj differ

Index: wera.pdf
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/iwaw/wera.pdf,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
Binary files /tmp/cvsYkeOPu and /tmp/cvs0zFk2k differ
|
From: <mat...@ce...> - 2005-09-30 09:11:26
|
Hi,
i will try to help you just with encoding problems:)

______________________________________________________________
> From: Cha...@bn...
> To: arc...@li...
> CC: Carlo Blum <Car...@ci...>
> Date: 29.09.2005 18:20
> Subject: [Archive-access-discuss] WERA / Nutchwax - bugs, problems and questions from Luxembourg
>
> Hello!
>
> We (Bibliothèque nationale de Luxembourg) are still newbies in the world
> of web archiving, pretty much taking our first steps, and for a
> prototype/test project we've chosen Luxembourg's regional elections,
> taking place 9th of October.
>
> The set-up:
>
> We've got a small collection of .arc files, crawled and archived by
> Heritrix 1.4. I am the only human resource for this project, (and also
> work on other projects), so we're quite limited resourcewise. I'm now at
> the stage of trying to interface (partly to be able to see if everything
> has been crawled) this .arc collection to the "users", which at this stage
> is the library staff. I'm using WERA 0.2.2, running on Apache 2, and I've
> got both the nutchwax release 0.2.1 and the CVS head nutchwax (September
> 25) running on Tomcat 5. Java is 1.5.0.
>=20 > Here are the problems I am experiencing after a first look at > WERA/Nutchwax (well, after a couple of weeks of messing about with th= e > releases and cvs builds, rather =3D)=20 >=20 >=20 > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > PART 1 - INDEXED WITH THE RELEASE VERSION 0.2.1 OF NUTCHWAX > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >=20 >=20 > 1) Inline redirected images=20 > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D >=20 > The URL www.csv.lu was part of a domain-scoped crawl. Many inline ima= ges > from this domain are not displayed. One example: >=20 > http://kayltetange.csv.lu/index.html has 3 few inline images:=20 >=20 > <img alt=3D"image334.jpg" src=3D"%7B%7B#$NXL@$%7D%7D%3E" =3D"" http:/= /kayltetange.csv.lu/fotoen/image334.jpg=3D"" {{#$nxl@$}}>=3D"" align=3D= "bottom" height=3D"101" width=3D"136"> (1) > <img style=3D"width: 886px; height: 686px;" alt=3D"l__iffr__chen_036i= nternet.JPG" src=3D"%7B%7B#$NXL@$%7D%7D%3E" =3D"" http://kayltetange.cs= v.lu/fotoen/l__iffr__chen_036internet.jpg=3D"" {{#$nxl@$}}>=3D"" alig= n=3D"bottom" height=3D"1050" width=3D"1400"> (2) > <img style=3D"width: 74px; height: 77px;" alt=3D"image3731.jpg" src=3D= "%7B%7B#$NXL@$%7D%7D%3E" =3D"" http://kayltetange.csv.lu/fotoen/image37= 31.jpg=3D"" {{#$nxl@$}}>=3D"" align=3D"bottom" height=3D"52" width=3D= "49"> (3) >=20 > A search for the filename shows that these images are in my collectio= n, > but with URLs >=20 > http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image334.= jpg > (4)=20 > http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/l__iffr__= chen_036internet.JPG > (5) > 
http://kayltetange.csv.lu/subsites/sites/kayltetange/fotoen/image3731= =2Ejpg > (6) >=20 > Opening the URLs (1-3) in a browser on the "live" web redirects me > immediately to (4-6)=20 >=20 > What I suppose that happened is that Heritrix tried fetching (1-3), g= ot a > redirect back, therefore fetched and archived (4-6). Now when WERA > retrieves (1-3) it doesn't find them, since these URLs were never > archived.=20 >=20 > I don't know what could be a workaround for this, but I suppose it ca= n a > serious problem. Would it also happen with redirected html pages? >=20 >=20 > 2) Need for URL canonicalisation in WERA? > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >=20 > On the "live" web: >=20 > The main (home) page of http://www.csv.lu has a "Newsletter" link to > http://www.csv.lu/newsletter.=20 > The main page also has links to dozens of regional subsites of the pa= rty > (e.g. http://bettembourg.csv.lu/, which are all in pretty much the sa= me > design as the main page, with some links including the "Newsletter" o= ne.=20 > BUT: Most of these regional subsites have their "Newsletter" link poi= nting > to http://csv.lu/newsletter.=20 >=20 > Heritrix didn't archive this a second time. >=20 > Result: "Sorry, no documents with the given uri were found" when clic= king > "Newsletter" on the archived regional sites.=20 >=20 >=20 > 3) Dynamic pages / question marks in the URL > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >=20 > I've read about this bug some time ago - is it supposed to have been > fixed? >=20 > As soon as there is one question mark (or a '+' sign, or others?) in = a URL > the page can't be retrieved. Say I search for "Juncker"... I get: >=20 > ----------------------------- >=20 > 1. 
CSV Hesper - Juncker on Tour zu Hesper am Centre Civique > (http://hesper.csv.lu/2004/juncker.html) > ( ... CSV Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CS= V > lokal Juncker on Tour zu Hesper am Centre Civique=20 >=20 > Zer=C3=8Ack CSV. De s=C3=8Achere Wee. Rufft eis un um 22 57 31-1 ... > ) > Number of versions satisfying query / total number of versions : 1/1 > Timeline | Overview >=20 > 2. CSV Hesper - Juncker on Tour zu Hesper am Centre Civique > (http://hesper.csv.lu/index.php?print=3D1&a=3D2004/juncker.html) > ( ... CSV Hesper - Juncker on Tour zu Hesper am Centre Civique CSV CS= V > lokal Juncker on Tour zu Hesper am Centre Civique=20 >=20 > Zer=C3=8Ack CSV. De s=C3=8Achere Wee. Rufft eis un um 22 57 31-1 ... > ) > Number of versions satisfying query / total number of versions : 0/0 > Timeline | Overview >=20 >=20 > 3. CSV - Jean-Claude Juncker sur France Inter > (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition) > ( ... ministre Jean-Claude Juncker au sujet du raz-de-mar=C3=8Ae en A= sie du > Sud-est France Inter: Bonjour, Jean-Claude Juncker. Jean-Claude Junck= er: > Oui, bonjour. France Inter: En tant que pr=C3=8Asident en exercice de= l'Union > europ=C3=8Aenne, vous =C3=8Atiez pr=C3=8Asent jeudi dernier aux ... i= mportants > puisqu'il s'agit d'une r=C3=8Agion du monde qui nous est tr=C3=A8s pr= oche. France > Inter: Merci, Jean-Claude Juncker. Merci, Monsieur le Pr=C3=8Asident. > Jean-Claude ... ) > Number of versions satisfying query / total number of versions : 0/0 > Timeline | Overview >=20 > 4. CSV - Jean-Claude Juncker sur France Inter > (http://www.csv.lu/text/2065.html/Koalitioun+Koalition+coalition?prin= t=3D1) > (...) > 5. CSV - Interview mam Jean-Claude Juncker > (http://www.csv.lu/text/2133.html/Frank+Engel?print=3D1)(...) > 6. CSV - Interview mam Jean-Claude Juncker > (http://www.csv.lu/text/2133.html/Frank+Engel)(...) > 7. 
CSV - Edm=C3=8Ae Juncker verabschiedet sich als Pr=C3=A4sidentin > (http://www.csv.lu/text/1978.html/Marco+Schank)(...) > 8. CSV - Edm=C3=8Ae Juncker verabschiedet sich als Pr=C3=A4sidentin > (http://www.csv.lu/text/1978.html/Marco+Schank?print=3D1)(...) > 9. CSV - Drei Fragen an Jean-Claude Juncker > (http://www.csv.lu/text/2212.html/Claude+Wiseler)(...) > 10. CSV - Drei Fragen an Jean-Claude Juncker > (http://www.csv.lu/text/2212.html/Luc+Frieden)(...) >=20 > ------------------------- >=20 > Results 2-10 all show me "Sorry, no documents with the given uri were > found". They also have "total number of versions 0/0".=20 >=20 > The only link who retrieves anything is the first one. But even here:= The > page I get has a set of thumbnails which are only displayed for about= 0.1 > seconds and then disappear (I guess because of JavaScript replacing t= he > links with links pointing to within the collection..). A look at the > source code of the page shows that these pictures should be:=20 > juncker/JoTt-(01).jpg > juncker/JoTt-(02).jpg > ... >=20 > So I search for "JoTt-(01).jpg"... >=20 > and get 2 hits: >=20 > Total number of versions found : 2. Displaying URL's 1-2 > 1. http://hesper.csv.lu/juncker/JoTt-(01).jpg > (http://hesper.csv.lu/juncker/JoTt-(01).jpg) > (CSV CSV lokal Fehler: D=C3=AF=C2=BF=C2=BDs S=C3=AF=C2=BF=C2=BDt exis= t=C3=AF=C2=BF=C2=BDert net!=20 > CSV. De s=C3=8Achere Wee. Rufft eis un um 22 57 31-1 oder sch=C3=8Ack= t eng Email > op csv@csv.) > Number of versions satisfying query / total number of versions : 0/0 > Timeline | Overview >=20 > 2. http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).= jpg > (http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jp= g) > ( ... > http://hesper.csv.lu/subsites/sites/hesper/2004/juncker/JoTt-(01).jpg= ) > Number of versions satisfying query / total number of versions : 0/0 > Timeline | Overview >=20 > Again, both not retrievable. 
Same goes for any other pictures with
> brackets (and possibly some other non-"a-z|A-Z|0-9" characters) in the
> filename.

I think this bug is fixed, but I'm not really sure whether the fix is in CVS. There was a discussion
about this and Svere described (I think) how to fix it. I can forward this email to you if you want.

>
>
> 4) Special characters
> =====================
>
> This has repeatedly been reported as fixed but there is still trouble:
>
> Searching for "Edmée" (in case that doesn't display fine: e-d-m-eacute-e)
> gives me hits but ONLY if I manually set Encoding of my browser (Firefox)
> to "Windows 1252" or "ISO 8859-1". If I do that, then enter the "Edmée",
> and then Search I get a page with results,
>
> BUT
>
> the Search box now says "Edm?e" and Character encoding has been set back
> to UTF-8. If I now do another search (say "français") I get again "no
> hits!". I'd have to set back Character encoding manually before each
> search.
>

I posted this bug and sent a patch to Michael. The problem is in getting the request query from the Tomcat server.
You have to explicitly specify the "from" encoding when converting the query, like this:

    String parameter = request.getParameter("query");
    if (parameter == null) parameter = "";
    String queryString = new String(parameter.getBytes("ISO-8859-1"), "UTF-8");

I made this change in the JSP search.jsp and it works for me. You can try it.

>
> 5) XML error: reference to invalid character number at line 34
> ==============================================================
> For some searches (on collections indexed with nutchwax release 0.2.1) I
> get only the above error message as result. 
The source code :
>
> *****START*****
>
><!-- ************************ Results:
> ****************************************************** -->
>
>
> <table class="greyborder" align="center" border="0" cellpadding="1" cellspacing="0" width="90%"><tbody><tr><td>
>
>
>
> <table class="resultsborder" align="center" border="0" cellpadding="10" cellspacing="0" width="100%"><tbody><tr><td>
>
> XML error: reference to invalid character number at line 34
>
>
> *****END*****
>
> That's the last line (HTML generation by php is cut off there)
>
> A look into catalina.out :
>
> *****START*****
>
> 050929 163012 12 query request from 192.168.6.21
> 050929 163012 12 query: Juncker
> 050929 163012 12 searching for 20 raw hits
> 050929 163012 12 re-searching for 40 raw hits, query: juncker
> -exacturl:"ZUKNZ3J2N7I5Z3A2MEYYU6PP7M" -exacturl:"HY5Q6TJQ7YL
> 2VFYHAJXT7SYMPY" -exacturl:"LDA5RUE6G6T46A2SEBDHQQ4JAQ"
> -exacturl:"X6LW4F7OYOFF6NXMC3WKOJVHJY" -exacturl:"WGBI4JQ3RXDYOBBAX
> WO4ZHCSQY"
> 050929 163012 12 found 10476 raw hits
> 050929 163012 12 total hits: 10496
>
> *****END*****

Is this output from WERA? I remember a bug like this from a previous version of nwa.
There was a problem (I'm not really sure, it was a long time ago) with characters like /amp &
I can try to look at it...

>
> 6) Wrong re-setting of Character encoding
> =========================================
>
> On the "live" web, www.gouvernement.lu has character encoding UTF-8. Every
> time you reload the page it sets it to this.
>
> In my archived collection, every time I retrieve a page from this URL,
> encoding is always set back to ISO 8859-1. The page, being in French, is
> therefore pretty much unreadable and you have to set back Encoding
> manually back to UTF-8 after every click.
>
>
> 7) Immediate re-direct to "live" web
> ====================================
> URL http://www.lsap.lu (in my seeds list) is a redirect to
> http://www.lsap.lu/index.php?idusergroup=42114236.
>
> When I retrieve http://www.lsap.lu/ from my collection, WERA immediately
> displays the live web page. Besides that, <i>every</i> link on www.lsap.lu
> includes variables (question marks) and is hence unretrievable (see (3)).
>
>
> ==================================================================
> PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
> ==================================================================
>
> 8) No images indexed?
> =====================
>
> When I index my collection with NutchWax head CVS BUILD, no images appear
> at all.
>
> One method has been suggested here to see if a file is in the archive:
>
> >Setting $conf_debug to 1 in /lib/config.inc and changing index.php
> >
> >from $search->setFieldsInResult("teaser url description");
> >to $search->setFieldsInResult("teaser url description
> archiveidentifier");
>
> When I do this and query for one of the many non-displayed images (e.g.
> "gouvernement.gif") I get
>
> [1] => Array
> (
> [teaser] =>
> http://www.gouvernement.lu/pictures/layout/gouvernement.gif
> [url] =>
> http://www.gouvernement.lu/pictures/layout/gouvernement.gif
> [archiveidentifier] => //arc/.arc.gz
> )
>
> So I look in the indexarcs output file and notice I have plenty of entries
> like this:
>
> (...)
> 050929 115748 adding 4223 bytes of mimetype image/jpeg
> http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
> 050929 115748 Failed parse: Content-Type not text/html: image/jpeg
> (...)
>
> and towards the end of the file:
>
> (...)
> 050929 125148 No collection for url
> http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No arcname for url
> http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No arcoffset for url
> http://www.greng.lu/files/documentcenter/20050114-247-DirectiveCE-CommunesEauQR.pdf
> 050929 125148 No collection for url
> http://www.adr.lu/Norden/koepp_port.jpg
> 050929 125148 No arcname for url http://www.adr.lu/Norden/koepp_port.jpg
> 050929 125148 No arcoffset for url http://www.adr.lu/Norden/koepp_port.jpg
> (...)
>
> I didn't have these lines before (when I indexed with the released
> nutchwax as opposed to the cvs built)
>
> Any ideas on how this is possible or what it means? Why do my images not
> have an archiveidentifier? My indexing process must have been wrong I
> guess?
>
> bin/indexarcs.sh -c elections -s /arc/ -d
> /usr/share/archive-access/projects/nutchwax_head/nutch-data-29sep/
> &>index_arc_elections_29sep.log
>
> What is a typical indexarcs.sh command line meant to look like instead?
>
> ==================================
>
> One more question: Is there a version of WERA newer than the 0.2.2 release
> going somewhere (via cvs, for instance) that's worth getting (ie with any
> substantial changes)? 
If so, what commands or steps need to be executed to
> use it?
>
> That's all for now :)
>
> Looking forward to reading your comments
>
> ==================================
> Charlie Foetz
> Bibliothèque nationale de Luxembourg
>
>
</td></tr></tbody></table></td></tr></tbody></table> lukas |
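The query re-decoding workaround described in the reply above can be tried outside the JSP. What follows is only a minimal standalone sketch, assuming the servlet container decoded the browser's UTF-8 bytes as ISO-8859-1. The class name QueryRedecodeSketch and the redecode helper are hypothetical and are not part of WERA or search.jsp; the mis-decoded string built in main stands in for what request.getParameter("query") is assumed to return in that situation.

```java
import java.nio.charset.StandardCharsets;

public class QueryRedecodeSketch {

    /**
     * Hypothetical helper (not WERA code): repairs a parameter whose UTF-8
     * bytes were decoded as ISO-8859-1 by the container.
     */
    public static String redecode(String parameter) {
        if (parameter == null) {
            return "";
        }
        // ISO-8859-1 maps each char 0-255 to exactly one byte, so this
        // recovers the raw bytes the browser sent; then decode them as UTF-8.
        byte[] raw = parameter.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "Edm\u00e9e"; // the query as typed, with e-acute

        // Assumption: simulate the container reading UTF-8 bytes as ISO-8859-1.
        String misdecoded = new String(
                typed.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        String repaired = redecode(misdecoded);
        System.out.println("repaired equals typed: " + repaired.equals(typed));
    }
}
```

This step only helps when the mismatch is really present. If the container already decodes parameters as UTF-8, re-decoding would damage a correct query instead of repairing it.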
From: Michael S. <sta...@us...> - 2005-09-30 01:56:48
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/iwaw In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15636/xdocs/iwaw Added Files: wacsearch-iwaw-2005.pdf wacsearch-iwaw-2005.sxi wera.pdf wera.sxi Log Message: Add slides presented at iwaw. --- NEW FILE: wacsearch-iwaw-2005.sxi --- (This appears to be a binary file; contents omitted.) --- NEW FILE: wacsearch-iwaw-2005.pdf --- (This appears to be a binary file; contents omitted.) --- NEW FILE: wera.sxi --- (This appears to be a binary file; contents omitted.) --- NEW FILE: wera.pdf --- (This appears to be a binary file; contents omitted.) |
From: Michael S. <sta...@us...> - 2005-09-27 23:39:46
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28370/conf

Modified Files:
    nutch-site.xml
Log Message:
* conf/nutch-site.xml
  Up the maxMergeDocs from 50 to a billion on Doug's recommendation
  (apparently responsible for slow indexing).

Index: nutch-site.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** nutch-site.xml 17 Aug 2005 21:47:24 -0000 1.25
--- nutch-site.xml 27 Sep 2005 23:39:38 -0000 1.26
***************
*** 52,55 ****
--- 52,57 ----
  </property>
+
+
  <!-- For lucene indexes, normally. The default is 128. Write every 1024 entries rather than every 128, the default.
***************
*** 65,68 ****
--- 67,86 ----
  </property>
+ <property>
+ <name>indexer.maxMergeDocs</name>
+ <value>1000000000</value>
+ <description>This number determines the maximum number of Lucene
+ Documents to be merged into a new Lucene segment. Larger values
+ increase indexing speed and reduce the number of Lucene segments,
+ which reduces the number of open file handles; however, this also
+ increases RAM usage during indexing.
+
+ Doug says: "There was a bogus value for indexer.maxMergeDocs in
+ nutch-default.xml which made indexing really slow. The correct
+ value is something really big (like Integer.MAX_VALUE)."
+ </description>
+ </property>
+
+
+
  <!-- make summaries a little longer than the default -->
  <property> |
From: Michael S. <sta...@us...> - 2005-09-18 20:30:50
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28762 Modified Files: 2005-oswir-wacsearch.sxi Log Message: * 2005-oswir-wacsearch.sxi Latest. Index: 2005-oswir-wacsearch.sxi =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir/2005-oswir-wacsearch.sxi,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 Binary files /tmp/cvsm9ygUH and /tmp/cvsslmdKb differ |
From: Michael S. <sta...@us...> - 2005-09-17 13:38:20
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv1947 Modified Files: 2005-oswir-wacsearch.ppt Added Files: 2005-oswir-wacsearch.sxi Log Message: * 2005-oswir-wacsearch.ppt First cut. Save ppt version too. * 2005-oswir-wacsearch.sxi Save ooo version too. --- NEW FILE: 2005-oswir-wacsearch.sxi --- (This appears to be a binary file; contents omitted.) Index: 2005-oswir-wacsearch.ppt =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir/2005-oswir-wacsearch.ppt,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 Binary files /tmp/cvsgGS0Ng and /tmp/cvsmDHcty differ |
From: Michael S. <sta...@us...> - 2005-09-15 22:25:39
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv1808 Modified Files: 2005-oswir-wacsearch.ppt Log Message: * 2005-oswir-wacsearch.ppt Latest. Index: 2005-oswir-wacsearch.ppt =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir/2005-oswir-wacsearch.ppt,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 Binary files /tmp/cvszLUDwQ and /tmp/cvswdaPXp differ |