From: <Mac...@nb...> - 2011-08-02 11:57:04
Hi Robin,

The workflow for reindexing the ARC files did work for us, although our Wayback instance wasn't located in a /tmp/ folder either. We just replaced that path with the actual location of our Wayback Machine.

Another thing that gave us a hard time when indexing the ARCs was the file permissions. We work on different machines for harvesting and reviewing, and after the file transfer all the files had mode 644 (which cannot be indexed), so we set them to 664.

Did you get it done?

Best regards,
Mac

Mac Kobus
Digitale Archivierung ¦ e-Helvetica
Eidgenössisches Departement des Innern EDI
Bundesamt für Kultur BAK
Schweizerische Nationalbibliothek NB
Hallwylstrasse 15, 3003 Bern
tel +41 31 322 89 93
fax +41 31 322 84 63
mac...@nb...
www.nb.admin.ch ¦ http://www.nb.admin.ch/e-helvetica

-----Original Message-----
From: Robin Davis [mailto:rob...@gm...]
Sent: Tuesday, 19 July 2011 17:38
To: arc...@li...
Subject: [Archive-access-discuss] Reindexing in Wayback

Hello,

I'm the web preservation intern at the Smithsonian Institution Archives. We've been using Heritrix and Wayback to crawl and view websites affiliated with the Institution. In general, it's been a success.

We did run into a problem, however. We'd had Heritrix configured to write both ARC and WARC files, although we were only interested in WARCs. To make space, I deleted the ARC files. All the affected resources are now "temporarily unavailable" and won't display. The WARCs are all still there (uncompressed), as are their associated pointer files in /wayback/index-data/merged. "Searching all pages under" the domains displays all the documents with the correct dates, but they are all unavailable.

I have tried to reindex three test files by
--shutting down Tomcat
--removing the three crawl job folders from /smithsonian-archive (where all of our crawl job folders live)
--removing the associated files from /wayback/index-data/merged
--restarting Tomcat
--copying over the three crawl job folders again into /smithsonian-archive.

New files corresponding to these are added automatically to index-data/merged... but the resource remains "temporarily unavailable." Clearing our browser's history and cache had no effect.

As an experiment, I have also
--shut down Tomcat
--removed two crawl job folders from /smithsonian-archive and their pointer files in /wayback/index-data/merged
--restarted Tomcat

I thought the files would be gone completely - but the URLs with the correct dates still show up in searches. When these "ghost links" are clicked, the same error page appears: "temporarily unavailable."

I've looked at the only related mailing list thread I could rustle up:
http://sourceforge.net/mailarchive/message.php?msg_id=25800307

The problems Jerome Kowalczyk and Mac Kobus were having seem similar to ours. Brad suggested stopping Tomcat, removing the WARC files from the /tmp/wayback/ folder, typing the command

    find /tmp/wayback/ -type f -print0 | xargs -0 -r rm -fv

moving the files back, and restarting Tomcat... But the /tmp/wayback/ directory doesn't seem to exist on our machine (probably /smithsonian-archive for us?), and I'm also not sure what the command is supposed to do and so haven't tried tweaking it.

What's in the way of getting our WARC files to display? How can we reindex and/or completely delete crawled sites from Wayback? Any insights are appreciated.

For reference: we have around 700 WARCs total in our collection. We're using Wayback 1.6.0 on a Linux machine, set up by a contractor (support expired). All recent crawls were written as WARCs only, and they display without issue in Wayback.

Best,
Robin

--
Robin Camille Davis
Smithsonian Institution Archives
Intern, Digital Services Division
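
A note on what Brad's command does, since the thread leaves that open: `find /tmp/wayback/ -type f -print0` selects every regular file under /tmp/wayback/ (directories are left in place), and `xargs -0 -r rm -fv` deletes those files and prints each name. The Python sketch below is a non-destructive stand-in and is not part of Wayback. `regular_files` lists what `find <root> -type f` would select, while `not_group_writable` and `add_group_write` cover the 644-to-664 change Mac describes. All three function names are made up for this illustration. Whether group write is really what the indexer needs depends on which user Tomcat runs as, so treat this as a sketch under those assumptions.

```python
import os
import stat
import tempfile


def regular_files(root):
    """Return every regular file under root, like `find root -type f`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat does not follow symlinks, matching find's default behaviour.
            if stat.S_ISREG(os.lstat(path).st_mode):
                found.append(path)
    return sorted(found)


def not_group_writable(paths):
    """Return the paths whose mode lacks the group-write bit (e.g. 644)."""
    return [p for p in paths if not os.lstat(p).st_mode & stat.S_IWGRP]


def add_group_write(path):
    """Add the group-write bit to a file, turning 644 into 664."""
    mode = stat.S_IMODE(os.lstat(path).st_mode)
    os.chmod(path, mode | stat.S_IWGRP)


if __name__ == "__main__":
    # Demo on a throwaway directory, so no real archive data is touched.
    with tempfile.TemporaryDirectory() as root:
        sample = os.path.join(root, "example.warc")  # hypothetical file name
        open(sample, "w").close()
        os.chmod(sample, 0o644)  # simulate the mode seen after a transfer
        for path in not_group_writable(regular_files(root)):
            print("would fix:", path)
            add_group_write(path)
```

Pointing `regular_files` at whatever directory stands in for /tmp/wayback/ on your machine shows which files the original command would remove, without deleting anything.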