From: <Pra...@on...> - 2011-05-18 15:09:29
|
Hi Everybody,

I have a few collections that are indexed, but when I open http://localhost:8080/wayback I do not see my collection list. When I use a URL such as http://localhost:8080/wayback/collectionA/*/http://abc.com, I can see the calendar page (CalendarResults.jsp) with the dates on which the site was crawled.

How can I see the list of my collections on the main page? I am using wayback-1.6.1 and followed the access point instructions at http://archive-access.sourceforge.net/projects/wayback/access_point_naming.html

My access point is defined as follows in wayback.xml. Does anyone have any ideas?

    <bean name="8080:collectionA" class="org.archive.wayback.webapp.AccessPoint">
      <property name="serveStatic" value="true" />
      <property name="bounceToReplayPrefix" value="false" />
      <property name="bounceToQueryPrefix" value="false" />
      <property name="replayPrefix" value="${wayback.urlprefix}collectionA/" />
      <property name="queryPrefix" value="${wayback.urlprefix}collectionA/" />
      <property name="staticPrefix" value="${wayback.urlprefix}collectionA/" />
      <property name="collection" ref="localbdbcollection" />
      <property name="replay" ref="archivalurlreplay" />
      <property name="query">
        <bean class="org.archive.wayback.query.Renderer">
          <property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
        </bean>
      </property>
      <property name="uriConverter">
        <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
          <property name="replayURIPrefix" value="${wayback.urlprefix}collectionA/"/>
        </bean>
      </property>
      <property name="parser">
        <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
          <property name="maxRecords" value="10000" />
          <property name="earliestTimestamp" value="2010" />
        </bean>
      </property>
    </bean>

Thanks,
--Pramila Thakur |
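While waiting for an answer, one way to check which access points a given wayback.xml actually declares is to read the bean definitions directly. The following is only a sketch: `list_access_points` and `SAMPLE` are hypothetical names, the sample is a cut-down copy of the bean above, and it assumes the XML is parsed without a namespace on the `<beans>` root (a real Spring file usually declares one, so the tag lookup would need adjusting).

```python
import xml.etree.ElementTree as ET

# Class name taken from the configuration quoted in the message above.
ACCESS_POINT = "org.archive.wayback.webapp.AccessPoint"


def list_access_points(xml_text):
    """Return (bean name, queryPrefix) pairs for every AccessPoint bean.

    Assumption: the document has no XML namespace, so tags are plain "bean".
    """
    root = ET.fromstring(xml_text)
    points = []
    for bean in root.iter("bean"):
        if bean.get("class") != ACCESS_POINT:
            continue
        prefix = None
        for prop in bean.findall("property"):
            if prop.get("name") == "queryPrefix":
                prefix = prop.get("value")
        points.append((bean.get("name"), prefix))
    return points


# Hypothetical, shortened sample based on the bean shown above.
SAMPLE = """<beans>
  <bean name="8080:collectionA" class="org.archive.wayback.webapp.AccessPoint">
    <property name="queryPrefix" value="${wayback.urlprefix}collectionA/" />
  </bean>
</beans>"""

for name, prefix in list_access_points(SAMPLE):
    print(name, prefix)
```

This only reports what the configuration declares; it says nothing about whether the main page is supposed to list collections.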
From: Mahmoud M. <mah...@bi...> - 2011-05-18 08:17:21
|
Thanks Brad for your reply.

With the default AccessPoint name, wayback, I tried "http://localhost.archive.org:8080/wayback/wayback/" but got an "HTTP Status 404 - /wayback/wayback" response. In contrast, "http://localhost.archive.org:8080/wayback/" showed the UI with no problem. But when I clicked the "Take Me Back" button, I got the error I mentioned before:

    HTTP Status 404 - /query
    type Status report
    message /query
    description The requested resource (/query) is not available.

The URL became "http://localhost.archive.org:8080/query?type=urlquery&url=http%3A%2F%2Farchive.org&date&Submit=Take+Me+Back".

I tried adding wayback/ to the URL, as in "http://localhost.archive.org:8080/wayback/query?type=urlquery&url=http%3A%2F%2Farchive.org&date&Submit=Take+Me+Back" or even "http://localhost.archive.org:8080/wayback/wayback/query?type=urlquery&url=http%3A%2F%2Farchive.org&date&Submit=Take+Me+Back", with no change.

Following the reference, I also tried "http://localhost.archive.org:8080/wayback/wayback/*/http://archive.org/", but again got an error: "HTTP Status 404 - /wayback/wayback/*/http://archive.org/"

My wayback.xml:

    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
      <property name="properties">
        <value>
          wayback.basedir=/var/lib/tomcat6/webapps/wayback
          wayback.urlprefix=http://localhost.archive.org:8080/wayback/
        </value>
      </property>
    </bean>

    <bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint">
      <property name="serveStatic" value="true" />
      <property name="bounceToReplayPrefix" value="false" />
      <property name="bounceToQueryPrefix" value="false" />
      <property name="replayPrefix" value="${wayback.urlprefix}" />
      <property name="queryPrefix" value="${wayback.urlprefix}" />
      <property name="staticPrefix" value="${wayback.urlprefix}" />
      <!-- <property name="exactHostMatch" value="true" /> -->
      <property name="collection" ref="localbdbcollection" />
      <!-- <property name="collection" ref="localcdxcollection" /> -->
      <property name="replay" ref="archivalurlreplay" />
      <property name="query">
        <bean class="org.archive.wayback.query.Renderer">
          <property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
          <!-- This .jsp provides a "search engine" style listing of results vertically
          <property name="captureJsp" value="/WEB-INF/query/HTMLCaptureResults.jsp" />
          -->
        </bean>
      </property>
      <property name="uriConverter">
        <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
          <property name="replayURIPrefix" value="${wayback.urlprefix}"/>
        </bean>
      </property>
      <property name="parser">
        <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
          <property name="maxRecords" value="10000" />
          <!--
          <property name="earliestTimestamp" value="1999" />
          <property name="latestTimestamp" value="2004" />
          -->
        </bean>
      </property>
      <!-- See the LiveWeb.xml import above.
      <property name="exclusionFactory" ref="excluder-factory-static" />
      -->
    </bean>

--
Mahmoud A. Mubarak
Bibliotheca Alexandrina

On 05/17/2011 07:19 PM, Bradley Tofel wrote:
> Hi Mahmoud,
>
> Do you intend to run Wayback as the ROOT context? This is the
> recommended deployment mode, enabling Wayback to successfully handle
> some server-relative URLs that are not replayed correctly. If so,
> you'll need to name the .war file "ROOT.war" in the webapps directory.
>
> If you are unable to run Wayback in the ROOT webapp context, then
> you'll need to specify the correct AccessPoint URL, which will depend
> on the bean name in your wayback.xml.
>
> By default, the AccessPoint name is "wayback", meaning the URL for
> your AccessPoint would be
> "http://localhost.archive.org:8080/wayback/wayback/".
>
> Please refer to:
>
> http://archive-access.sourceforge.net/projects/wayback/access_point_naming.html
>
> for more specific examples and help with Access Point names.
>
> Please let me know if this doesn't help you resolve the problem!
>
> Brad
>
> On 5/17/11 6:36 AM, Mahmoud Mubarak wrote:
>> I have done the following steps after installing tomcat:
>>
>> - Stopped tomcat
>> - Placed wayback.war in /var/lib/tomcat6/webapps
>> - Changed wayback.basedir=wayback to be
>>   wayback.basedir=/var/lib/tomcat6/webapps/wayback
>> - Made a soft link from files1 to my warc directory as:
>>   root@mymachine:/var/lib/tomcat6/webapps/wayback# ln -s <absolute path
>>   of warc directory> files1
>> - Started tomcat
>>
>> When I wrote a URL in the text box in
>> http://localhost.archive.org:8080/wayback/ and clicked "Take Me Back",
>> this error appeared:
>>
>> HTTP Status 404 - /query
>> type Status report
>> message /query
>> description The requested resource (/query) is not available.
>>
>> How can I solve this problem? How can I change the year range in the
>> drop-down list in http://localhost.archive.org:8080/wayback/?
>>
>> Thanks in advance for your help.
>> --
>> Mahmoud A. Mubarak
>> Software Engineer, ICT Sector
>> Bibliotheca Alexandrina
>> P.O. Box 138
>> 21526 El Shatby, Alexandria
>> Tel: +20-3-4839999
>> Fax: +20-3-4820405
>> E-mail: mah...@bi...
>>
>> _______________________________________________
>> Archive-access-discuss mailing list
>> Arc...@li...
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: Bradley T. <br...@ar...> - 2011-05-17 16:19:44
|
Hi Mahmoud,

Do you intend to run Wayback as the ROOT context? This is the recommended deployment mode, enabling Wayback to successfully handle some server-relative URLs that are otherwise not replayed correctly. If so, you'll need to name the .war file "ROOT.war" in the webapps directory.

If you are unable to run Wayback in the ROOT webapp context, then you'll need to specify the correct AccessPoint URL, which will depend on the bean name in your wayback.xml.

By default, the AccessPoint name is "wayback", meaning the URL for your AccessPoint would be "http://localhost.archive.org:8080/wayback/wayback/".

Please refer to:

http://archive-access.sourceforge.net/projects/wayback/access_point_naming.html

for more specific examples and help with Access Point names.

Please let me know if this doesn't help you resolve the problem!

Brad

On 5/17/11 6:36 AM, Mahmoud Mubarak wrote:
> I have done the following steps after installing tomcat:
>
> - Stopped tomcat
> - Placed wayback.war in /var/lib/tomcat6/webapps
> - Changed wayback.basedir=wayback to be
>   wayback.basedir=/var/lib/tomcat6/webapps/wayback
> - Made a soft link from files1 to my warc directory as:
>   root@mymachine:/var/lib/tomcat6/webapps/wayback# ln -s <absolute path
>   of warc directory> files1
> - Started tomcat
>
> When I wrote a URL in the text box in
> http://localhost.archive.org:8080/wayback/ and clicked "Take Me Back",
> this error appeared:
>
> HTTP Status 404 - /query
> type Status report
> message /query
> description The requested resource (/query) is not available.
>
> How can I solve this problem? How can I change the year range in the
> drop-down list in http://localhost.archive.org:8080/wayback/?
>
> Thanks in advance for your help.
> --
> Mahmoud A. Mubarak
> Software Engineer, ICT Sector
> Bibliotheca Alexandrina
> P.O. Box 138
> 21526 El Shatby, Alexandria
> Tel: +20-3-4839999
> Fax: +20-3-4820405
> E-mail: mah...@bi... |
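The naming rule Brad describes (webapp context, then the AccessPoint bean name) can be sketched in a few lines. This is only an illustration of the rule as stated in the message, not Wayback's own code: `access_point_url` is a hypothetical helper, and it assumes bean names of the form "<port>:<name>" as in the wayback.xml examples in this thread.

```python
def access_point_url(host, bean_name, context=""):
    """Compose the URL for an AccessPoint bean named "<port>:<name>".

    `context` is the webapp context ("" when deployed as ROOT.war).
    Assumption: the bean name always contains a port and a name.
    """
    port, _, name = bean_name.partition(":")
    # Keep only non-empty path segments, each followed by a slash.
    segments = [s for s in (context.strip("/"), name.strip("/")) if s]
    return f"http://{host}:{port}/" + "".join(s + "/" for s in segments)


# Deployed as wayback.war, bean "8080:wayback" (the default described above):
print(access_point_url("localhost.archive.org", "8080:wayback", "wayback"))
# Deployed as ROOT.war:
print(access_point_url("localhost.archive.org", "8080:wayback"))
```

Whether a given Wayback release actually serves the non-ROOT URL is a separate question; the replies elsewhere in this thread report problems with non-ROOT deployments.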
From: Mahmoud M. <mah...@bi...> - 2011-05-17 14:45:47
|
I have done the following steps after installing tomcat:

- Stopped tomcat
- Placed wayback.war in /var/lib/tomcat6/webapps
- Changed wayback.basedir=wayback to be wayback.basedir=/var/lib/tomcat6/webapps/wayback
- Made a soft link from files1 to my warc directory as:
  root@mymachine:/var/lib/tomcat6/webapps/wayback# ln -s <absolute path of warc directory> files1
- Started tomcat

When I wrote a URL in the text box in http://localhost.archive.org:8080/wayback/ and clicked "Take Me Back", this error appeared:

    HTTP Status 404 - /query
    type Status report
    message /query
    description The requested resource (/query) is not available.

How can I solve this problem? How can I change the year range in the drop-down list in http://localhost.archive.org:8080/wayback/?

Thanks in advance for your help.
--
Mahmoud A. Mubarak
Software Engineer, ICT Sector
Bibliotheca Alexandrina
P.O. Box 138
21526 El Shatby, Alexandria
Tel: +20-3-4839999
Fax: +20-3-4820405
E-mail: mah...@bi... |
From: <Pra...@on...> - 2011-05-17 13:57:17
|
Hi Everybody,

I am using the latest Wayback, wayback-1.6.1, but I still get the errors shown by Gérard Dupont.

Thanks,
--Pramila Thakur

________________________________
From: Bradley Tofel [mailto:br...@ar...]
Sent: Monday, May 16, 2011 5:11 PM
To: Gérard Dupont
Cc: arc...@li...; Saval, Arnaud
Subject: Re: [Archive-access-discuss] Error while reading WARC

Hi Gérard,

This sounds like it's likely related to a recent Java update, which changed the behavior of GZIPInputStream. There is a Wayback 1.6.1 release candidate which should address this at:

http://home.us.archive.org/~brad/wayback-1.6.1.tar.gz

Please let me know if this addresses the problem you're facing.

Brad

On 5/13/11 6:55 AM, Gérard Dupont wrote:

Hi all,

I'm facing a new problem while setting up wayback on a set of WARCs created by heritrix. Basically, everything goes fine: the webapp is well deployed and it correctly finds the new warcs as they are closed. However, something happens during the "copy/merge", and the files I can see in the wayback temp folder are only ~222 bytes whereas the original ones were 1 MB. I can only see the following warning in the logs:

INFO: Renamed merged file /tmp/wayback/index-data/incoming/WEB-20110420084401772-00000-3811~weblab9~8888.warc.gz to /tmp/wayback/index-data/merged/WEB-20110420084401772-00000-3811~weblab9~8888.warc.gz
May 13, 2011 2:52:15 PM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20110502133909048-00000-5501~weblab9~8888.warc.gz:WEB-20110502133909048-00000-5501~weblab9~8888.warc.gz)

If anyone has already faced this problem, help is welcome.

cheers,

--
Gérard Dupont
Information Processing Control and Cognition (IPCC)
CASSIDIAN - an EADS company
Document & Learning team - LITIS Laboratory |
From: Chow, D. <Dor...@la...> - 2011-05-17 13:35:32
|
Hi, I am having trouble viewing an arc file using wayback 0.8.0. Could someone tell me if my .cdx file looks ok? Thanks, - D |
From: Bradley T. <br...@ar...> - 2011-05-16 21:11:04
|
Hi Gérard,

This sounds like it's likely related to a recent Java update, which changed the behavior of GZIPInputStream. There is a Wayback 1.6.1 release candidate which should address this at:

http://home.us.archive.org/~brad/wayback-1.6.1.tar.gz

Please let me know if this addresses the problem you're facing.

Brad

On 5/13/11 6:55 AM, Gérard Dupont wrote:
> Hi all,
>
> I'm facing a new problem while setting up wayback on a set of WARCs
> created by heritrix. Basically, everything goes fine: the webapp is well
> deployed and it correctly finds the new warcs as they are closed.
> However, something happens during the "copy/merge", and the files I can
> see in the wayback temp folder are only ~222 bytes whereas the original
> ones were 1 MB. I can only see the following warning in the logs:
>
> INFO: Renamed merged file
> /tmp/wayback/index-data/incoming/WEB-20110420084401772-00000-3811~weblab9~8888.warc.gz
> to
> /tmp/wayback/index-data/merged/WEB-20110420084401772-00000-3811~weblab9~8888.warc.gz
> May 13, 2011 2:52:15 PM
> org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
> WARNING: FAILED
> canonicalize(http://filedesc:WEB-20110502133909048-00000-5501~weblab9~8888.warc.gz:WEB-20110502133909048-00000-5501~weblab9~8888.warc.gz)
>
> If anyone has already faced this problem, help is welcome.
>
> cheers,
>
> --
> Gérard Dupont
> Information Processing Control and Cognition (IPCC)
> CASSIDIAN - an EADS company
>
> Document & Learning team - LITIS Laboratory |
From: Gérard D. <ger...@gm...> - 2011-05-13 13:55:48
|
Hi all,

I'm facing a new problem while setting up wayback on a set of WARCs created by heritrix. Basically, everything goes fine: the webapp is well deployed and it correctly finds the new warcs as they are closed. However, something happens during the "copy/merge", and the files I can see in the wayback temp folder are only ~222 bytes whereas the original ones were 1 MB. I can only see the following warning in the logs:

INFO: Renamed merged file /tmp/wayback/index-data/incoming/WEB-20110420084401772-00000-3811~weblab9~8888.warc.gz to /tmp/wayback/index-data/merged/WEB-20110420084401772-00000-3811~weblab9~8888.warc.gz
May 13, 2011 2:52:15 PM org.archive.wayback.resourceindex.bdb.SearchResultToBDBRecordAdapter adapt
WARNING: FAILED canonicalize(http://filedesc:WEB-20110502133909048-00000-5501~weblab9~8888.warc.gz:WEB-20110502133909048-00000-5501~weblab9~8888.warc.gz)

If anyone has already faced this problem, help is welcome.

cheers,

--
Gérard Dupont
Information Processing Control and Cognition (IPCC)
CASSIDIAN - an EADS company
Document & Learning team - LITIS Laboratory |
From: Aaron B. <aa...@ar...> - 2011-05-02 23:28:50
|
Gerard Suades i Méndez <gs...@ce...> writes:

>>> We tried both approaches for the entire ARC collection:
>>>
>>> a) IndexMerger Lucene API (inside TNH).
>>>    index size: 813GB
>>>
>>> b) Re-built the entire index giving as input both old and new NutchWAX
>>>    segments of the ARC files.
>>>    index size: 563GB
>>>
>>> is it normal that there is this difference of sizes in the indexes?
>
> If I don't get the wrong idea, NutchWAX 0.13 (official release), which
> is the version we've used in method b), doesn't de-duplicate. So if
> neither of the two methods de-duplicates, could it be any other reason
> for such a difference in indexes sizes?

De-duplication while index building:

  NO   NutchWAX 0.13
  YES  NutchWAX 0.13-JIRA-WAX-75
  YES  JBs

I double-checked the source code. Sorry if I said something different before.

So, it sounds like you re-built the entire index in one job using NutchWAX 0.13 (NO de-duplication) and it yielded a much smaller index. That is strange.

Can you confirm the number of documents in the indexes? You can use a utility in TNH to dump out the counts of documents by mime-type and compare the totals:

  $ mkdir tnh
  $ cd tnh
  $ jar xvf ../tnh-*.war
  $ java -cp WEB-INF/classes:WEB-INF/lib/lucene-core-3.0.1.jar TermDumper -c type

This will print out the number of documents for each mime-type. They should be the same (or at least the same total) for both indexes.

>>> 3.- We have only one collection for all the ARC files. We have our
>>> collection on open access and the service is load balanced through
>>> several nodes. That's the scenario in where several tomcats are
>>> accessing the same indexes.
>>>
>
> Yes, the index is on an NFS shared storage system accessed by several nodes.

Hmmm, you might consider running some performance tests with the index on local disk. Maybe I'm just an old Unix guy, but I would expect a big performance hit from searching a Lucene index over NFS.

> We will try both JBs and NutchWAX-with-deduplication. By the way, the
> SVN branch you pointed out (
>
> http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive
>
> ) is it possible that it is only suitable for nutch 1.1 and not for
> nutch 1.0 as it is said in INSTALL.txt?

Yes, you are correct. Since this hasn't been made into an official release, I have not updated all the documentation yet.

> * TNH
> - Is it possible to define collections?

Yes, and there are two ways to do this:

1. Index each collection separately. If you have your (W)ARC files grouped by collection, process each one separately. For example, if you had an "election 2011" collection and a "world cup 2010" collection, with separate groups of (W)ARCs for each, you would import and index each collection totally separately. Then, in TNH, you would put them as sibling directories in the root 'index' directory, such as:

     /var/lib/tnh/index
       /2011-election
       /2010-world-cup

   And set your index directory in TNH's web.xml to be

     /var/lib/tnh/index

   TNH will automatically traverse the directory tree, finding the indexes and mapping them internally according to their directory name. Then, on the URL, you can use the 'i' parameter to specify which index to search, such as:

     search?q=winner&i=2011-election
     search?q=winner&i=2010-world-cup

   Multiple 'i' parameters can be used to search multiple collections, and if no 'i' parameter is given, then all collections are searched by default.

2. When importing, put the name of the collection next to the (W)ARC URL, e.g.

     /mnt/data/warcs/foo.warc.gz 2011-election
     /mnt/data/warcs/bar.warc.gz 2010-world-cup

   During the import process, the Importer will decorate each record with a metadata field "collection" with the appropriate value. Then, during indexing, the value will be added to the Lucene index in the field titled "collection". This field can be searched by adding the 'c' parameter to the OpenSearch URL, such as:

     search?q=winner&c=2011-election
     search?q=winner&c=2010-world-cup

   In this case, however, we only build one Lucene index, rather than one for each collection. The collection name is just another field, like mime-type.

I prefer method #1. For us, it's much easier to manage. In fact, in our Archive-It.org hosted service, we have close to 2000 collections by over 150 partners. Keeping each collection in a separate index allows us to manage them much better/easier than if they were in one giant index with 1.2 billion documents and 4.3 TB in size. With separate indexes for each collection, you have manageable sized indexes on disk, with the ability to arbitrarily combine them at search time via the 'i' parameter.

> - How are the results sorted?

The results are ordered by rank, from best to worst. The ranking is determined by Lucene. Also, 'site collapsing' is performed, so that if you get multiple hits from one website, we only show the top 1, or 2, or however many you want, specified by the 'h' parameter on the OpenSearch URL:

  search?q=winner&h=3

would show up to 3 of the top hits from any one website. This is pretty much what web search companies have been doing for a long time. That way if you search for "Facebook" on Yahoo!, you don't get the first 500000 hits from facebook.com, but get a mix of things from their site, news coverage about Facebook, etc.

> * JB/NutchWAX
> - de-duplication is not possible in either tool (JB or NutchWAX) if
> we want to add new crawls to an existing index. ¿?

The most straightforward way to do this is to re-build the entire index, with all the data -- new and old -- in a single indexing job. Both the JBs and NutchWAX 0.13-JIRA-WAX-75 support this.

There is another way to do this, but it is more complicated. You have to analyze the CDX files and extract all the lines for duplicate captures, then feed that to the 'import' command to use as an exclusion filter, telling it which captures to ignore when importing. In this case you are de-duplicating at the front of the processing chain: during the import stage.

> - We've tried JB with hadoop 20.2, but it didn't end up
> well. org.apache.hadoop.util.DiskChecker$DiskErrorException was thrown
> so I guess there wasn't enough space in /tmp. If segments have
> somewhere around 650 GB (having removed crawl_*/ and content/), how
> much free space should be left on disk in order to carry out the index
> process? any estimate size? based on our first try 2TB doesn't seem to
> be enough.

Yes, lots of disk space is needed. There is close to a 1:1 mapping of the size of the segment to the size of the index. Plus, with Hadoop you keep a copy of the map output and reduce input in /tmp during the Map/Reduce job.

For an index 500MB in size, here at IA we'd use our Hadoop cluster of at least 10 machines. How many machines are you using?

> * #crawls (WAYBACK 1.6.0/CDX)

I'll have to let Brad tackle this one.

Aaron |
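The search parameters described in this message ('q' for the query, 'i' for one or more indexes, 'c' for the collection field, 'h' for hits per site) can be combined programmatically. The following is a minimal sketch, assuming a relative endpoint named "search" as in the examples above; `search_url` and its argument names are hypothetical and not part of TNH.

```python
from urllib.parse import urlencode


def search_url(query, indexes=(), collection=None, hits_per_site=None, base="search"):
    """Build an OpenSearch-style query string using the q, i, c and h parameters.

    Assumption: parameter names are exactly those mentioned in the message;
    everything else about the endpoint is illustrative.
    """
    params = [("q", query)]
    # One 'i' parameter per index directory to search (method #1).
    params += [("i", name) for name in indexes]
    if collection is not None:
        # 'c' filters on the "collection" field of a single index (method #2).
        params.append(("c", collection))
    if hits_per_site is not None:
        # 'h' caps the number of hits shown per website (site collapsing).
        params.append(("h", str(hits_per_site)))
    return base + "?" + urlencode(params)


print(search_url("winner", indexes=["2011-election", "2010-world-cup"]))
print(search_url("winner", collection="2011-election"))
print(search_url("winner", hits_per_site=3))
```

Passing no indexes leaves the 'i' parameter out entirely, which according to the message means all collections are searched.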
From: Gerard S. i M. <gs...@ce...> - 2011-04-29 09:57:26
|
Aaron Binns wrote:

[...
> For the JBs and more recent versions of NutchWAX, only
>
>   parse_text
>   parse_data
>
> are used. In my indexing projects, after the (w)arcs are all imported,
> I just do a
>
>   rm -rf segments/*/c*
>
> to remove the sub-dirs starting with 'c'.
>

done.

>> We tried both approaches for the entire ARC collection:
>>
>> a) IndexMerger Lucene API (inside TNH).
>>    index size: 813GB
>>
>> b) Re-built the entire index giving as input both old and new NutchWAX
>>    segments of the ARC files.
>>    index size: 563GB
>>
>> is it normal that there is this difference of sizes in the indexes?
>>
>
> It's quite possible. If there was a lot of duplicate captures, then you
> could see such a large reduction in size. Method A would preserve the
> duplicates whereas method B de-duplicates.
>
> In later versions of NutchWAX and the JBs, de-duplication happens
> automatically, but only *within a single indexing job*. If you
> index all the segments in one job, then they will be de-duplicated.
>
> If you index subsets of segments, creating multiple indexes, then there
> could be duplicates across the indexes.
>
...]

There should be a lot of duplicate captures in those Top Level Domain/agreement institutions crawls.

If I don't get the wrong idea, NutchWAX 0.13 (official release), which is the version we've used in method b), doesn't de-duplicate. So if neither of the two methods de-duplicates, could there be any other reason for such a difference in index sizes?

>> 3.- We have only one collection for all the ARC files. We have our
>> collection on open access and the service is load balanced through
>> several nodes. That's the scenario in where several tomcats are
>> accessing the same indexes.
>>
>
> Does that mean that each node has a local copy of the index? Or perhaps
> the index is on an NFS share or SAN mounted on each node?
>

Yes, the index is on an NFS shared storage system accessed by several nodes.

> Lastly, the indexing process for the JBs is pretty much the same as for
> NutchWAX. The command-lines are similar, but for the JBs, you have to
> use the Hadoop command-line driver, whereas NutchWAX comes with its own.
>
> E.g.
>
>   $ nutchwax index indexes segments/*
>
> vs.
>
>   $ hadoop jar jbs-*.jar Indexer indexes segments/*
>
> The version of Hadoop that we use is the Cloudera distribution, which is
> based on Hadoop 0.20.2 with some Cloudera patches to fix bugs. I
> believe you can use Hadoop 0.20.1 or 0.20.2 w/o any problems.
>
> The JBs also does a better job of filtering out obvious crap, such as
> "words" which do not contain any letters, such as "34983545$%23432"
> is filtered out when indexing with JBs.
>
> It also canonicalizes the mime-types so that all the dozens of different
> known varieties of MS Office mime-types are all mapped to one standard
> set. It also omits 'robots.txt' files and ignores mime-types that
> probably don't have text in them, such as "application/octet-stream".
>
> I'd recommend giving JBs a try, at least to test and compare to the
> index built with NutchWAX; especially since JBs does the accented
> character collapsing.
>

We will try both JBs and NutchWAX-with-deduplication. By the way, the SVN branch you pointed out (

http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive

) is it possible that it is only suitable for nutch 1.1 and not for nutch 1.0 as it is said in INSTALL.txt?

We would like to ask you a few more questions regarding:

* TNH
  - Is it possible to define collections?
  - How are the results sorted?

* JB/NutchWAX
  - Is de-duplication not possible in either tool (JB or NutchWAX) if we want to add new crawls to an existing index?
  - We've tried JB with hadoop 20.2, but it didn't end well. org.apache.hadoop.util.DiskChecker$DiskErrorException was thrown, so I guess there wasn't enough space in /tmp. If the segments take somewhere around 650 GB (having removed crawl_*/ and content/), how much free space should be left on disk in order to carry out the index process? Any estimated size? Based on our first try, 2TB doesn't seem to be enough.

* #crawls (WAYBACK 1.6.0/CDX)
  Wayback returns the number of crawls per URL through a URL query search. The number of crawls per URL is displayed in ToolBar.jsp (data.getResultCount()). I can't see anything that allows me to bind "crawl"<->"URL" in the CDX index file, so I guess Wayback calculates it based on the ARC files... Is that possible? Is it possible to know (in an easy way) how many crawls there are using the Wayback java classes? I'm thinking of something similar to what is done in ToolBar.jsp, (somehow) taking advantage of data.getResultCount() but without specifying the URL in the query. It would be very helpful if you could point out which classes I should look at.

Thank you very much and best regards,

--
Gerard
......................................................................
Gerard Suades Méndez
Departament d'Aplicacions i Projectes
Centre de Supercomputació de Catalunya
Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
T. 93 551 62 20 · F. 93 205 6979 · gs...@ce...
...................................................................... |
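On the "#crawls" question, one rough way to count captures per URL outside of Wayback is to tally CDX lines by their URL key. This is a sketch under stated assumptions, not how Wayback itself computes getResultCount(): it assumes a space-delimited CDX file whose optional header line (for example " CDX N b a m s k r M V g") uses the letter N for the canonicalized URL key, and `captures_per_url` is a hypothetical helper name.

```python
from collections import Counter


def captures_per_url(cdx_lines):
    """Count capture records per canonicalized URL key in CDX lines.

    Assumptions: fields are separated by single spaces, and the header
    legend (if present) marks the URL key column with the letter 'N'.
    Without a header, the first column is taken as the URL key.
    """
    counts = Counter()
    url_col = 0
    for line in cdx_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("CDX "):
            legend = line.split()[1:]
            if "N" in legend:
                url_col = legend.index("N")
            continue
        fields = line.split(" ")
        counts[fields[url_col]] += 1
    return dict(counts)


# Hypothetical sample records, made up for illustration only.
sample = [
    " CDX N b a m s k r M V g",
    "archive.org/ 20110101000000 http://archive.org/ text/html 200 AAAA - - 100 a.warc.gz",
    "archive.org/ 20110201000000 http://archive.org/ text/html 200 AAAA - - 200 b.warc.gz",
    "example.com/ 20110301000000 http://example.com/ text/html 200 BBBB - - 300 b.warc.gz",
]
for url, n in captures_per_url(sample).items():
    print(url, n)
```

Summing the values gives the total number of capture records in the file, which may or may not match what the poster means by "crawls".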
From: Ko, L. <Lau...@un...> - 2011-04-28 18:45:39
|
Hello Pramila,

I think there is still a problem in wayback-1.6.1 when deploying in a non-ROOT context. I noticed that, when deploying under the name "wb" with an access point called "coll1", http://localhost:8080/wb/coll1/ works as expected, but if you attempt to access http://localhost:8080/wb/, you no longer get the "You seem to be accessing this Wayback via an incorrect URL." message. Instead you get the default search page with the image and CSS broken, and all of the links (Help, Home, Take Me Back) drop back to the prefix http://localhost:8080/ without the "wb/". I think this is the same thing you are describing below.

Here is my example configuration that does work at http://localhost:8080/wb/coll1/ when deploying under the name "wb", setting an access point at "coll1", and using a CDX index for my WARCs.

From wayback.xml:

<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
  <property name="properties">
    <value>
      wayback.basedir=/home/me/Desktop/wayback
      wayback.urlprefix=http://localhost:8080/wb/
    </value>
  </property>
</bean>
<bean id="waybackCanonicalizer" class="org.archive.wayback.util.url.AggressiveUrlCanonicalizer" />
<import resource="CDXCollection.xml"/>
<import resource="ArchivalUrlReplay.xml"/>
<bean name="+" class="org.archive.wayback.webapp.ServerRelativeArchivalRedirect">
  <property name="matchPort" value="8080" />
  <property name="useCollection" value="true" />
</bean>
<bean name="8080:coll1" class="org.archive.wayback.webapp.AccessPoint">
  <property name="serveStatic" value="true" />
  <property name="bounceToReplayPrefix" value="false" />
  <property name="bounceToQueryPrefix" value="false" />
  <property name="replayPrefix" value="${wayback.urlprefix}coll1/" />
  <property name="queryPrefix" value="${wayback.urlprefix}coll1/" />
  <property name="staticPrefix" value="${wayback.urlprefix}coll1/" />
  <property name="collection" ref="localcdxcollection" />
  <property name="replay" ref="archivalurlreplay" />
  <property name="query">
    <bean class="org.archive.wayback.query.Renderer">
      <property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
    </bean>
  </property>
  <property name="uriConverter">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
      <property name="replayURIPrefix" value="${wayback.urlprefix}coll1/"/>
    </bean>
  </property>
  <property name="parser">
    <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
      <property name="maxRecords" value="10000" />
      <property name="earliestTimestamp" value="2007" />
    </bean>
  </property>
</bean>

And from CDXCollection.xml:

<bean id="localcdxcollection" class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="org.archive.wayback.resourcestore.LocationDBResourceStore">
      <property name="db">
        <bean class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
          <property name="path" value="/home/me/Desktop/path-index.txt" />
        </bean>
      </property>
    </bean>
  </property>
  <property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
      <property name="canonicalizer" ref="waybackCanonicalizer" />
      <property name="source">
        <bean class="org.archive.wayback.resourceindex.cdx.CDXIndex">
          <property name="path" value="/home/me/Desktop/index.cdx" />
        </bean>
      </property>
      <property name="maxRecords" value="10000" />
    </bean>
  </property>
</bean>

Hope this helps,
Lauren Ko
Web Archiving Programmer
UNT Libraries

________________________________________
From: Pra...@on... [Pra...@on...]
Sent: Wednesday, April 27, 2011 9:51 AM
To: arc...@li...
Subject: [Archive-access-discuss] Replay help

Hi Everyone, I used the latest wayback-1.6.1.tar.gz The Indexing seem to work OK. But now I have some configuration issues. Can someone please help me?
In wayback.xml I have the following property set <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer"> <property name="properties"> <value> wayback.basedir=/tmp/wayback wayback.urlprefix=http://localhost:8080/wayback-1.6.1/politicalCollection/ </value> </property> </bean> But when I go to http://localhost:8080/wayback-1.6.1 , I see the page but no collection list. When I type in the url of the archived site, the url changes to http://localhost:8080/query?type=urlquery&url=http%3A%2F%2Fwww.gpo.ca&date=&Submit=Take+Me+Back which is an error. How do I replay the site. I think I am missing something in the config. My arcs are present at ${wayback.basedir}/politicalCollection/ <bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean"> <property name="sourceList"> <list> <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource"> <property name="name" value="politicalCollection" /> <property name="prefix" value="${wayback.basedir}/politicalCollection/" /> <property name="recurse" value="false" /> </bean> </list> </property> </bean> Am I missing something? Please help. Thanks, --Pramila Thakur ________________________________ From: Bradley Tofel [mailto:br...@ar...] Sent: Tuesday, April 19, 2011 7:12 PM To: arc...@li... Subject: Re: [Archive-access-discuss] Auto Indexr NOT working Eric, thanks again for the accurate and timely list response! Pramila, Eric is correct - wayback-1.6.0 uses a core Java GZIP library which was changed recently by a Java update. This change broke Wayback indexing code. Heritrix SVN now includes a work-around for the issue, and I've just built a 1.6.1 release candidate for Wayback with the latest Heritrix snapshot. The release address the GZIP issue, which fixes the W/ARC indexing code, and also corrects an issue which prevents Wayback from being deployed in a non-ROOT context. 
You can test out the release candidate at: http://home.us.archive.org/~brad/wayback-1.6.1.tar.gz which should address the indexing problem you reported. Thanks! Brad On 4/18/11 10:36 AM, Erik Hetzner wrote: At Mon, 18 Apr 2011 17:03:57 +0000, <Pra...@on...><mailto:Pra...@on...> wrote: Hi Everyone, I am getting started with wayback machine. But I am having problem indexing the arc files. I get an error on Tomcat as this java.io.IOException: Resetting to invalid mark […] Hi Pramila, Please see: https://webarchive.jira.com/browse/HER-1865 In summary, you will probably need to use a version of Java prior to 6u23. best, Erik Sent from my free software system <http://fsf.org/><http://fsf.org/>. ------------------------------------------------------------------------------ Benefiting from Server Virtualization: Beyond Initial Workload Consolidation -- Increasing the use of server virtualization is a top priority.Virtualization can reduce costs, simplify management, and improve application availability and disaster protection. Learn more about boosting the value of server virtualization. http://p.sf.net/sfu/vmware-sfdev2dev _______________________________________________ Archive-access-discuss mailing list Arc...@li...<mailto:Arc...@li...> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
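A note on the URL layout discussed in the message above: every access-point prefix in that configuration is built as ${wayback.urlprefix} plus the access point name, so the webapp context ("wb") has to be part of wayback.urlprefix. The following Python sketch only illustrates that string composition. The helper names are hypothetical and are not part of Wayback, and the "*" timestamp form is assumed from the usual archival-URL query style.

```python
def access_point_prefix(url_prefix: str, access_point: str) -> str:
    """Join a wayback.urlprefix value and an access point name.

    Hypothetical helper: mirrors the "${wayback.urlprefix}coll1/" pattern
    used for replayPrefix, queryPrefix and staticPrefix in the config.
    """
    return url_prefix.rstrip("/") + "/" + access_point.strip("/") + "/"


def archival_url(url_prefix: str, access_point: str, timestamp: str, target: str) -> str:
    """Build an archival-style URL: <prefix><access point>/<timestamp>/<target URL>."""
    return access_point_prefix(url_prefix, access_point) + timestamp + "/" + target


if __name__ == "__main__":
    prefix = "http://localhost:8080/wb/"  # webapp deployed under the name "wb"
    print(access_point_prefix(prefix, "coll1"))
    print(archival_url(prefix, "coll1", "*", "http://abc.com"))
```

If wayback.urlprefix omitted the "wb/" context, every generated link would point at http://localhost:8080/ directly, which matches the broken links described in the message.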
From: <Pra...@on...> - 2011-04-27 21:15:58
|
Hi Everybody, I need some help with the Wayback configuration. My CSS and images are not found, and replay is not happening. One example would help. Thanks, --Pramila Thakur ________________________________ From: Bradley Tofel [mailto:br...@ar...] Sent: Tuesday, April 19, 2011 7:12 PM To: arc...@li... Subject: Re: [Archive-access-discuss] Auto Indexr NOT working Eric, thanks again for the accurate and timely list response! Pramila, Eric is correct - wayback-1.6.0 uses a core Java GZIP library which was changed recently by a Java update. This change broke Wayback indexing code. Heritrix SVN now includes a work-around for the issue, and I've just built a 1.6.1 release candidate for Wayback with the latest Heritrix snapshot. The release addresses the GZIP issue, which fixes the W/ARC indexing code, and also corrects an issue which prevents Wayback from being deployed in a non-ROOT context. You can test out the release candidate at: http://home.us.archive.org/~brad/wayback-1.6.1.tar.gz which should address the indexing problem you reported. Thanks! Brad On 4/18/11 10:36 AM, Erik Hetzner wrote: At Mon, 18 Apr 2011 17:03:57 +0000, <Pra...@on...> wrote: Hi Everyone, I am getting started with wayback machine. But I am having problem indexing the arc files. I get an error on Tomcat as this java.io.IOException: Resetting to invalid mark [...] Hi Pramila, Please see: https://webarchive.jira.com/browse/HER-1865 In summary, you will probably need to use a version of Java prior to 6u23. best, Erik Sent from my free software system <http://fsf.org/>. |
From: Graham, L. <lg...@lo...> - 2011-04-27 17:36:47
|
Thanks, we did find the Heritrix template and will test naming without tildes. It seems reasonably easy to do. I did share here the benefits that Brad outlined in having the tildes. While we weren't having issues receiving bags with files named with tildes in our system, the advice was: if you can avoid them, do so. Maybe have an option in 3.1 to turn on tildes and have hyphens remain the default? (Also, sorry, I realize I'm probably on the wrong list with this Heritrix issue.) Thanks much. Laura Graham Library of Congress ---------------------------------------------------------------------- Message: 1 Date: Mon, 25 Apr 2011 09:04:41 -0700 From: Gordon Mohr <go...@ar...> Subject: Re: [Archive-access-discuss] heritrix-3.1.0-beta: tildas in warc filename To: arc...@li... Message-ID: <4DB...@ar...> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Thanks for the note; we've heard some concern about the use of tildes, but so far no reports of anyplace where it actually breaks things. We could consider changing the defaults before the official release. However, by using something other than '-', the new 'pid~host~port' construct can be thought of as filling the exact same dash-delimited position as 'host' previously did. (Any processing based on the old naming that assumed dashes and the prior fields should continue to work, getting the same number of tokens.) One recommendation: if for any reason projects do need a different naming formula than the default, it's best to change it in the crawler configuration (specifically the 'template' property on the W/ARCWriterProcessor) so that the file is initially written with the desired name, rather than by a separate renaming step after it is written. Both the ARC and WARC formats include an internal reference to their filename-as-originally-written, and thus any later renamings create a mismatch with their internally-declared name and thus some risk of confusion. 
- Gordon @ IA On 4/25/11 6:41 AM, Graham, Laura wrote: > This is just a comment, as it may not be an issue for other institutions. But we have noticed in heritrix-3.1.0-beta that warcs are written with tildas in the filenames. From the documentation, these are to indicate host and port and pid values in the filename. > > After consulting with our respository tool development team here at the Library of Congress, we will likely rename these files going forward, replacing the tildas with hyphens. > > While our current bit preservation inventory tool accepts the tildas, and while there are lots of issues in filenames in general, which any system or set of tools will need to deal with, we've decided to take this extra renaming step. It only takes a moment, and it's one less possible issue to track in our work going forward. > > Again, just a comment. > > Thanks, > Laura Graham > Library of Congress > > ------------------------------------------------------------------------------ > Fulfilling the Lean Software Promise > Lean software platforms are now widely adopted and the benefits have been > demonstrated beyond question. Learn why your peers are replacing JEE > containers with lightweight application servers - and what you can gain > from the move. http://p.sf.net/sfu/vmware-sfemails > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: <Mac...@nb...> - 2011-04-27 15:27:38
|
Hi there, does anybody know about issues with the Wayback Machine 1.4.2 and Internet Explorer (especially version 7)? I didn't find any bugs/reports about that so far. Kind Regards Mac Mac Kobus Digitale Archivierung ¦ e-Helvetica Eidgenössisches Departement des Innern EDI Bundesamt für Kultur BAK Schweizerische Nationalbibliothek NB Hallwylstrasse 15, 3003 Bern tel +41 31 322 89 93 fax +41 31 322 84 63 mac...@nb... www.nb.admin.ch ¦ http://www.nb.admin.ch/e-helvetica |
From: <Pra...@on...> - 2011-04-27 14:51:33
|
Hi Everyone,

I used the latest wayback-1.6.1.tar.gz. The indexing seems to work OK, but now I have some configuration issues. Can someone please help me?

In wayback.xml I have the following property set:

<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
  <property name="properties">
    <value>
      wayback.basedir=/tmp/wayback
      wayback.urlprefix=http://localhost:8080/wayback-1.6.1/politicalCollection/
    </value>
  </property>
</bean>

But when I go to http://localhost:8080/wayback-1.6.1 , I see the page but no collection list. When I type in the URL of the archived site, the URL changes to http://localhost:8080/query?type=urlquery&url=http%3A%2F%2Fwww.gpo.ca&date=&Submit=Take+Me+Back which is an error. How do I replay the site? I think I am missing something in the config.

My ARCs are present at ${wayback.basedir}/politicalCollection/

<bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
  <property name="sourceList">
    <list>
      <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
        <property name="name" value="politicalCollection" />
        <property name="prefix" value="${wayback.basedir}/politicalCollection/" />
        <property name="recurse" value="false" />
      </bean>
    </list>
  </property>
</bean>

Am I missing something? Please help.

Thanks,
--Pramila Thakur

________________________________
From: Bradley Tofel [mailto:br...@ar...]
Sent: Tuesday, April 19, 2011 7:12 PM
To: arc...@li...
Subject: Re: [Archive-access-discuss] Auto Indexr NOT working

Eric, thanks again for the accurate and timely list response! Pramila, Eric is correct - wayback-1.6.0 uses a core Java GZIP library which was changed recently by a Java update. This change broke Wayback indexing code. Heritrix SVN now includes a work-around for the issue, and I've just built a 1.6.1 release candidate for Wayback with the latest Heritrix snapshot. The release addresses the GZIP issue, which fixes the W/ARC indexing code, and also corrects an issue which prevents Wayback from being deployed in a non-ROOT context. You can test out the release candidate at: http://home.us.archive.org/~brad/wayback-1.6.1.tar.gz which should address the indexing problem you reported. Thanks! Brad

On 4/18/11 10:36 AM, Erik Hetzner wrote: At Mon, 18 Apr 2011 17:03:57 +0000, <Pra...@on...> wrote: Hi Everyone, I am getting started with wayback machine. But I am having problem indexing the arc files. I get an error on Tomcat as this java.io.IOException: Resetting to invalid mark [...] Hi Pramila, Please see: https://webarchive.jira.com/browse/HER-1865 In summary, you will probably need to use a version of Java prior to 6u23. best, Erik Sent from my free software system <http://fsf.org/>. |
From: Finn, B. L <bra...@ed...> - 2011-04-27 01:46:41
|
I have set 0 for both because the default 5000 documents aren't enough for my captures. I have performed single captures with over 300,000 documents. Although, looking at your screenshots, it doesn't look like you're hitting this limit. Are you sure your scope rules are correct? Bradley Finn Systems Development Officer | State Library of Tasmania | Linc Tasmania 91 Murray Street | Hobart Tasmania 7000 Ph: (03) 6233 7503 www.adulteducation.tas.gov.au <http://www.adulteducation.tas.gov.au/> | www.statelibrary.tas.gov.au <http://www.statelibrary.tas.gov.au/> | www.tco.asn.au <http://www.tco.asn.au/> | www.archives.tas.gov.au <http://www.archives.tas.gov.au/> ________________________________ From: Allen Sim [mailto:all...@gm...] Sent: Wednesday, 27 April 2011 11:39 AM To: Finn, Bradley L Cc: web...@li...; arc...@li...; br...@ar... Subject: Re: WCT Stopping Dear Bradley, Glad to hear from you once again. Following are some of my queries: 1. Where do I select enable override? 2. At the Base screen, where do I change the maximum documents and maximum data? What amounts do you usually suggest allocating? Attached herewith is the print-screen for the Base screen. Thanks for your reply. You are so helpful! Really appreciate your guidance. Warmest regards, Allen On Wed, Apr 27, 2011 at 6:04 AM, Finn, Bradley L <bra...@ed...> wrote: You haven't selected enable override. You are far better off going to Harvester Configuration -> Profile -> edit your profile -> Base. From this screen you can change your maximum documents and maximum data. If you have multiple profiles, make sure that you have the correct profile selected within the target properties. Also you will need to set your proxy settings for all of your profiles. 
Bradley Finn Systems Development Officer | State Library of Tasmania | Linc Tasmania 91 Murray Street | Hobart Tasmania 7000 Ph: (03) 6233 7503 www.adulteducation.tas.gov.au <http://www.adulteducation.tas.gov.au/> | www.statelibrary.tas.gov.au <http://www.statelibrary.tas.gov.au/> | www.tco.asn.au <http://www.tco.asn.au/> | www.archives.tas.gov.au <http://www.archives.tas.gov.au/> ________________________________ Thanks in advance. Warmest regards, Allen ----------------------------------------------------------------------------- CONFIDENTIALITY NOTICE AND DISCLAIMER Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added. |
From: Finn, B. L <bra...@ed...> - 2011-04-26 22:18:13
|
You haven't selected enable override. You are far better off going to Harvester Configuration -> Profile -> edit your profile -> Base. From this screen you can change your maximum documents and maximum data. If you have multiple profiles, make sure that you have the correct profile selected within the target properties. Also you will need to set your proxy settings for all of your profiles. Bradley Finn Systems Development Officer | State Library of Tasmania | Linc Tasmania 91 Murray Street | Hobart Tasmania 7000 Ph: (03) 6233 7503 www.adulteducation.tas.gov.au <http://www.adulteducation.tas.gov.au/> | www.statelibrary.tas.gov.au <http://www.statelibrary.tas.gov.au/> | www.tco.asn.au <http://www.tco.asn.au/> | www.archives.tas.gov.au <http://www.archives.tas.gov.au/> ________________________________ From: Allen Sim [mailto:all...@gm...] Sent: Tuesday, 26 April 2011 11:07 AM To: web...@li...; arc...@li...; br...@ar...; Finn, Bradley L Subject: WCT Stopping Hi, I am facing some problems while archiving websites. Initially it worked okay, but recently it keeps stopping by itself. May I know what's wrong with it? Could it be my proxy or a network problem? I have already set the http-proxy-host and http-proxy-port, but the archiving still keeps stopping by itself. Please advise. Looking forward to hearing from you. Thanks in advance. Warmest regards, Allen |
From: Bradley T. <br...@ar...> - 2011-04-25 19:15:04
|
Hi Laura,

Note that the initial WARCINFO record contains the name of the file in the WARC-Filename field. This has some possible benefits:

1) Later processing steps can deduce the filename from the data stream itself, without relying on the actual filename, or on wrapping tools to forward the actual filename. Any software which assumes this field will reflect the actual original filename (perhaps for other processes to later access the data) may break if the file is renamed.

1b) Similar to #1, the format allows multiple WARC file contents to be concatenated into a single stream and fed to a processing tool, which can use the warcinfo record, specifically the WARC-Filename field, to detect original file input boundaries, and correctly report the original source of the records.

2) With some filesystems, a corrupt directory block may cause file name information to be lost (for example, EXT-3|4 will place files into ./lost+found/ named with their inode). In the past at IA, we've simplified the reconstruction of the original filenames by inspecting the first ARC/WARC record, rather than having to deduce the original filename based on a content digest hash + a lookup of that hash in an external hash-to-filename database.

So, none of these things are insurmountable obstacles to a simple rename, but I wanted to point out some possible downstream issues.

Brad

On 4/25/11 6:41 AM, Graham, Laura wrote:
> This is just a comment, as it may not be an issue for other institutions. But we have noticed in heritrix-3.1.0-beta that warcs are written with tildes in the filenames. From the documentation, these are to indicate host and port and pid values in the filename.
>
> After consulting with our repository tool development team here at the Library of Congress, we will likely rename these files going forward, replacing the tildes with hyphens.
>
> While our current bit preservation inventory tool accepts the tildes, and while there are lots of issues in filenames in general, which any system or set of tools will need to deal with, we've decided to take this extra renaming step. It only takes a moment, and it's one less possible issue to track in our work going forward.
>
> Again, just a comment.
>
> Thanks,
> Laura Graham
> Library of Congress |
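The point about recovering the original name from the data stream can be sketched roughly as follows. This is a minimal Python illustration, not Heritrix or Wayback code: it assumes a gzip-compressed WARC whose first record is a warcinfo record carrying a WARC-Filename header, and the sample filename is made up for the demonstration.

```python
import gzip
import io


def first_warc_filename(stream):
    """Return the WARC-Filename header of the first record in `stream`, or None.

    `stream` must be a binary file-like object positioned at the start of
    an uncompressed WARC record (for example, a gzip.open() handle).
    """
    version = stream.readline()
    if not version.startswith(b"WARC/"):
        return None
    for raw in iter(stream.readline, b""):
        line = raw.rstrip(b"\r\n")
        if not line:  # a blank line ends the record's header block
            break
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"warc-filename":
            return value.strip().decode("utf-8")
    return None


def sample_warc_bytes(filename):
    """Build a tiny gzip-compressed warcinfo record (demo data only)."""
    body = b"software: example\r\n"
    record = b"".join([
        b"WARC/1.0\r\n",
        b"WARC-Type: warcinfo\r\n",
        b"WARC-Filename: " + filename.encode("utf-8") + b"\r\n",
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n",
        b"\r\n",
        body,
        b"\r\n\r\n",
    ])
    return gzip.compress(record)


if __name__ == "__main__":
    data = sample_warc_bytes("EXAMPLE-20110425000000-00000-host.warc.gz")
    with gzip.open(io.BytesIO(data), "rb") as fh:
        print(first_warc_filename(fh))
```

A renamed file would still report its as-written name here, which is the mismatch the message warns about.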
From: Gordon M. <go...@ar...> - 2011-04-25 16:04:50
|
Thanks for the note; we've heard some concern about the use of tildes, but so far no reports of anyplace where it actually breaks things. We could consider changing the defaults before the official release.

However, by using something other than '-', the new 'pid~host~port' construct can be thought of as filling the exact same dash-delimited position as 'host' previously did. (Any processing based on the old naming that assumed dashes and the prior fields should continue to work, getting the same number of tokens.)

One recommendation: if for any reason projects do need a different naming formula than the default, it's best to change it in the crawler configuration (specifically the 'template' property on the W/ARCWriterProcessor) so that the file is initially written with the desired name, rather than by a separate renaming step after it is written. Both the ARC and WARC formats include an internal reference to their filename-as-originally-written, and thus any later renamings create a mismatch with their internally-declared name and thus some risk of confusion.

- Gordon @ IA

On 4/25/11 6:41 AM, Graham, Laura wrote:
> This is just a comment, as it may not be an issue for other institutions. But we have noticed in heritrix-3.1.0-beta that warcs are written with tildes in the filenames. From the documentation, these are to indicate host and port and pid values in the filename.
>
> After consulting with our repository tool development team here at the Library of Congress, we will likely rename these files going forward, replacing the tildes with hyphens.
>
> While our current bit preservation inventory tool accepts the tildes, and while there are lots of issues in filenames in general, which any system or set of tools will need to deal with, we've decided to take this extra renaming step. It only takes a moment, and it's one less possible issue to track in our work going forward.
>
> Again, just a comment.
>
> Thanks,
> Laura Graham
> Library of Congress |
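The "same number of tokens" argument in the message above can be checked with a short Python sketch. Both filenames below are hypothetical examples (the prefix, timestamps, pid, host and port are invented), and the layout of four dash-delimited fields is an assumption made for illustration.

```python
def dash_tokens(filename):
    """Strip a known W/ARC suffix and split the remaining name on '-'."""
    for suffix in (".warc.gz", ".arc.gz", ".warc", ".arc"):
        if filename.endswith(suffix):
            filename = filename[: -len(suffix)]
            break
    return filename.split("-")


# Hypothetical names: the old style ends in a host field, the new style
# puts pid~host~port in that same dash-delimited position.
old_name = "EXAMPLE-20110425160441-00000-crawler01.warc.gz"
new_name = "EXAMPLE-20110425160441123-00000-1234~crawler01~8443.warc.gz"

if __name__ == "__main__":
    old_tokens = dash_tokens(old_name)
    new_tokens = dash_tokens(new_name)
    print(len(old_tokens), len(new_tokens))  # both names split into four fields
    pid, host, port = new_tokens[-1].split("~")
    print(pid, host, port)
```

Replacing the tildes with hyphens would turn the last field into three, so a tool that counts dash-delimited fields would see six tokens instead of four. A hostname that itself contains a hyphen changes the count under either scheme.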
From: Graham, L. <lg...@lo...> - 2011-04-25 14:01:58
|
This is just a comment, as it may not be an issue for other institutions. But we have noticed in heritrix-3.1.0-beta that warcs are written with tildes in the filenames. From the documentation, these are to indicate host and port and pid values in the filename. After consulting with our repository tool development team here at the Library of Congress, we will likely rename these files going forward, replacing the tildes with hyphens. While our current bit preservation inventory tool accepts the tildes, and while there are lots of issues in filenames in general, which any system or set of tools will need to deal with, we've decided to take this extra renaming step. It only takes a moment, and it's one less possible issue to track in our work going forward. Again, just a comment. Thanks, Laura Graham Library of Congress |
From: Graham, L. <lg...@lo...> - 2011-04-20 17:40:10
|
Hi,

Here at the Library of Congress, we use the Wayback authentication property to limit access based on IP range for our QR Wayback, and for our public Wayback, in combination with exclusionFactory, to distinguish between sites that can only be viewed onsite (IP range) and those that can be viewed both on- and off-site.

We have been testing authentication with 1.6, and it's not working as expected. When it is set up as stated in the documentation, the user gets a Secure Login popup, regardless of IP range. If we combine the IP filter with the exclusionFactory, the excluded URLs are always blocked, no matter what the client's IP range.

Here is an example of our authentication property.

Authentication Property without exclusionFactory:

<property name="authentication">
  <bean name="ndiippAccess" class="org.archive.wayback.authenticationcontrol.AccessControlSettingOperation">
    <property name="operator">
      <bean class="org.archive.wayback.util.operator.NotBooleanOperator">
        <property name="operand">
          <bean class="org.archive.wayback.authenticationcontrol.IPMatchesBooleanOperator">
            <property name="allowedRanges">
              <list>
                <value>140.147.236.194/16</value>
              </list>
            </property>
          </bean>
        </property>
      </bean>
    </property>
    <property name="factory" ref="ndiipp-exclusion-list" />
  </bean>
</property>

Thanks,
Laura Graham
Library of Congress |
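When debugging a range such as the one in the message above, it can help to confirm independently which client addresses the CIDR value covers. The Python sketch below uses only the standard ipaddress module. It is not how Wayback's IPMatchesBooleanOperator is implemented, and it covers only the range test, not the NotBooleanOperator wrapper or the exclusion factory. The function name and the test addresses are illustrative assumptions.

```python
import ipaddress


def ip_allowed(client_ip, allowed_ranges):
    """Return True if client_ip falls inside any of the CIDR ranges.

    strict=False lets a value such as "140.147.236.194/16", which has host
    bits set, be read as the network 140.147.0.0/16 instead of raising.
    """
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr in ipaddress.ip_network(cidr, strict=False)
        for cidr in allowed_ranges
    )


if __name__ == "__main__":
    ranges = ["140.147.236.194/16"]  # value taken from the configuration above
    print(ipaddress.ip_network(ranges[0], strict=False))  # the network it denotes
    print(ip_allowed("140.147.1.1", ranges))   # inside the /16
    print(ip_allowed("192.0.2.10", ranges))    # outside the /16
```

With a /16 mask the last two octets are ignored, so the value covers every address from 140.147.0.0 through 140.147.255.255.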
From: Bradley T. <br...@ar...> - 2011-04-19 23:03:06
|
Eric, thanks again for the accurate and timely list response!

Pramila, Eric is correct - wayback-1.6.0 uses a core Java GZIP library which was changed recently by a Java update. This change broke Wayback indexing code. Heritrix SVN now includes a work-around for the issue, and I've just built a 1.6.1 release candidate for Wayback with the latest Heritrix snapshot. The release addresses the GZIP issue, which fixes the W/ARC indexing code, and also corrects an issue which prevents Wayback from being deployed in a non-ROOT context.

You can test out the release candidate at: http://home.us.archive.org/~brad/wayback-1.6.1.tar.gz which should address the indexing problem you reported.

Thanks!
Brad

On 4/18/11 10:36 AM, Erik Hetzner wrote:
> At Mon, 18 Apr 2011 17:03:57 +0000,
> <Pra...@on...> wrote:
>> Hi Everyone,
>>
>> I am getting started with wayback machine.
>>
>> But I am having problem indexing the arc files.
>>
>> I get an error on Tomcat as this
>>
>> java.io.IOException: Resetting to invalid mark
>>
>> […]
> Hi Pramila,
>
> Please see:
>
> https://webarchive.jira.com/browse/HER-1865
>
> In summary, you will probably need to use a version of Java prior to
> 6u23.
>
> best, Erik
>
> Sent from my free software system<http://fsf.org/>. |
From: Colin R. <cs...@st...> - 2011-04-19 13:13:14
|
This is really more of a Tomcat question than a Wayback question, but perhaps someone here knows the answer. I'm deploying Wayback in proxy mode with a custom context file ROOT.xml that looks like:

<?xml version="1.0" encoding="UTF-8"?>
<Context docBase="/home/test/tmp/wayback/wayback-1.6.0.war">
  <Parameter name="config-path" value="../../../wayback_conf/wayback.xml" override="false"/>
</Context>

The docBase attribute works as expected - I can redeploy Wayback just by touching the war file. However, the context parameter seems to be ignored and the default value from web.xml is used instead. This is with Tomcat 7.0.12. Any ideas why?

--
Colin Rosenthal
State and University Library, Aarhus |
From: Gérard D. <ger...@gm...> - 2011-04-19 13:11:20
|
Hi all,

I'm still testing on the latest trunk version. This afternoon I encountered a new problem:

Exception in thread "virtuosoV0-news launchthread" java.lang.NoSuchMethodError: org.apache.commons.httpclient.HttpState.setCookiesMap(Ljava/util/SortedMap;)V
    at org.archive.modules.fetcher.FetchHTTP.start(FetchHTTP.java:1332)
    at org.archive.spring.PathSharingContext.doStart(PathSharingContext.java:111)
    at org.archive.spring.PathSharingContext.doStart(PathSharingContext.java:108)
    at org.archive.spring.PathSharingContext.doStart(PathSharingContext.java:108)
    at org.archive.spring.PathSharingContext.doStart(PathSharingContext.java:108)
    at org.archive.spring.PathSharingContext.start(PathSharingContext.java:97)
    at org.archive.crawler.framework.CrawlJob.startContext(CrawlJob.java:454)
    at org.archive.crawler.framework.CrawlJob$1.run(CrawlJob.java:431)

This is apparently due to a problem in the classpath: the Apache HttpClient is loaded before the Heritrix commons HttpClient. I can correct that in the heritrix script, but I would like to be sure I'm not missing something.

BTW, I'm on trunk commit 7136 and I rebuilt the distribution from source with assembly.

cheers,
--
Gérard Dupont
Information Processing Control and Cognition (IPCC)
CASSIDIAN - an EADS company
Document & Learning team - LITIS Laboratory |
From: Erik H. <eri...@uc...> - 2011-04-18 17:36:57
|
At Mon, 18 Apr 2011 17:03:57 +0000, <Pra...@on...> wrote: > > Hi Everyone, > > I am getting started with wayback machine. > > But I am having problem indexing the arc files. > > I get an error on Tomcat as this > > java.io.IOException: Resetting to invalid mark > > […] Hi Pramila, Please see: https://webarchive.jira.com/browse/HER-1865 In summary, you will probably need to use a version of Java prior to 6u23. best, Erik |