From: Doug C. <cu...@us...> - 2005-09-01 18:45:38
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24577/conf Modified Files: Tag: mapred nutch-site.xml Log Message: Add indexArcs command. Index: nutch-site.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml,v retrieving revision 1.24.2.2 retrieving revision 1.24.2.3 diff -C2 -d -r1.24.2.2 -r1.24.2.3 *** nutch-site.xml 24 Aug 2005 04:15:48 -0000 1.24.2.2 --- nutch-site.xml 1 Sep 2005 18:45:29 -0000 1.24.2.3 *************** *** 7,14 **** <!-- NDFS --> ! <property> ! <name>fs.default.name</name> ! <value>ia109102:8009</value> ! </property> <property> --- 7,14 ---- <!-- NDFS --> ! <!-- <property> --> ! <!-- <name>fs.default.name</name> --> ! <!-- <value>ia109102:8009</value> --> ! <!-- </property> --> <property> *************** *** 29,56 **** <!-- MapReduce --> ! <property> ! <name>mapred.job.tracker</name> ! <value>ia109102:8010</value> ! </property> ! <property> ! <name>mapred.job.tracker.info.port</name> ! <value>7846</value> ! </property> ! <property> ! <name>mapred.local.dir</name> ! <value>/0/nutch/mapred/local</value> ! </property> ! <property> ! <name>mapred.system.dir</name> ! <value>/mapred/system</value> ! </property> ! <property> ! <name>mapred.task.timeout</name> ! <value>3600000</value> ! </property> <!-- Override a few Nutch defaults --> --- 29,56 ---- <!-- MapReduce --> ! <!-- <property> --> ! <!-- <name>mapred.job.tracker</name> --> ! <!-- <value>ia109102:8010</value> --> ! <!-- </property> --> ! <!-- <property> --> ! <!-- <name>mapred.job.tracker.info.port</name> --> ! <!-- <value>7846</value> --> ! <!-- </property> --> ! <!-- <property> --> ! <!-- <name>mapred.local.dir</name> --> ! <!-- <value>/0/nutch/mapred/local</value> --> ! <!-- </property> --> ! <!-- <property> --> ! <!-- <name>mapred.system.dir</name> --> ! <!-- <value>/mapred/system</value> --> ! 
<!-- </property> --> ! <!-- <property> --> ! <!-- <name>mapred.task.timeout</name> --> ! <!-- <value>3600000</value> --> ! <!-- </property> --> <!-- Override a few Nutch defaults --> |
From: Doug C. <cu...@us...> - 2005-09-01 18:45:38
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24577/src/java/org/archive/access/nutch Added Files: Tag: mapred ImportArcs.java IndexArcs.java Removed Files: Tag: mapred Arc2Segment.java Log Message: Add indexArcs command. --- NEW FILE: ImportArcs.java --- /* * $Id: ImportArcs.java,v 1.1.2.1 2005/09/01 18:45:29 cutting Exp $ * * Copyright (C) 2003 Internet Archive. * * This file is part of the archive-access tools project * (http://sourceforge.net/projects/archive-access). * * The archive-access tools are free software; you can redistribute them and/or * modify them under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or any * later version. * * The archive-access tools are distributed in the hope that they will be * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser * Public License for more details. 
* * You should have received a copy of the GNU Lesser Public License along with * the archive-access tools; if not, write to the Free Software Foundation, * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package org.archive.access.nutch; import java.io.ByteArrayOutputStream; import java.io.File; import java.io.IOException; import java.util.Iterator; import java.util.Properties; import java.util.logging.Level; import java.util.logging.Logger; import java.net.URI; import org.apache.commons.httpclient.Header; import org.apache.nutch.io.Writable; import org.apache.nutch.io.WritableComparable; import org.apache.nutch.io.UTF8; import org.apache.nutch.io.MD5Hash; import org.apache.nutch.protocol.Content; import org.apache.nutch.util.NutchConf; import org.apache.nutch.util.NutchConfigured; import org.apache.nutch.util.mime.MimeType; import org.apache.nutch.util.mime.MimeTypes; import org.apache.nutch.mapred.JobConf; import org.apache.nutch.mapred.JobClient; import org.apache.nutch.mapred.Mapper; import org.apache.nutch.mapred.OutputCollector; import org.apache.nutch.mapred.Reporter; import org.apache.nutch.crawl.CrawlDatum; import org.apache.nutch.crawl.Fetcher; import org.apache.nutch.crawl.FetcherOutput; import org.apache.nutch.crawl.FetcherOutputFormat; import org.apache.nutch.parse.Parse; import org.apache.nutch.parse.ParseStatus; import org.apache.nutch.parse.Parser; import org.apache.nutch.parse.ParserFactory; import org.apache.nutch.parse.ParseImpl; import org.archive.io.arc.ARCReader; import org.archive.io.arc.ARCReaderFactory; import org.archive.io.arc.ARCRecord; import org.archive.io.arc.ARCRecordMetaData; import org.archive.util.ArchiveUtils; import org.archive.util.TextUtils; public class ImportArcs extends NutchConfigured implements Mapper { private static final Logger LOG = Logger.getLogger(ImportArcs.class.getName()); private static final String WHITESPACE = "\\s+"; public static final String ARCFILENAME_KEY = "arcname"; public static final 
String ARCFILEOFFSET_KEY = "arcoffset"; public static final String ARCCOLLECTION_KEY = "collection"; private static final String CONTENT_TYPE_KEY = "content-type"; private static final String TEXT_TYPE = "text/"; private static final String APPLICATION_TYPE = "application/"; private boolean indexAll; private int contentLimit; private MimeTypes mimeTypes; private String collectionName; private String segmentName; public ImportArcs() { super(null); } public ImportArcs(NutchConf conf) { super(conf); } public void configure(JobConf job) { setConf(job); this.indexAll = job.getBoolean("archive.index.all", false); this.contentLimit = job.getInt("http.content.limit", 100000); this.mimeTypes = MimeTypes.get(job.get("mime.types.file")); this.collectionName = job.get("archive.collection", "web"); this.segmentName = job.get(Fetcher.SEGMENT_NAME_KEY); if (job.getBoolean("arc2segment.verbose", false)) { LOG.setLevel(Level.FINE); } System.setProperty("java.protocol.handler.pkgs", "org.archive.net"); } public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String arcLocation = ((UTF8)value).toString(); LOG.info("opening "+arcLocation); ARCReader arc = null; String arcName = null; try { arc = ARCReaderFactory.get(arcLocation); } catch (Throwable e) { LOG.log(Level.WARNING, "Error opening: " + arcLocation, e); return; } // Don't run the digester. Digest is unused and it costs CPU. 
arc.setDigest(false); try { for (Iterator i = arc.iterator(); i.hasNext();) { ARCRecord rec = (ARCRecord) i.next(); if (arcName == null) { // first entry has arc name String arcPath = new URI(rec.getMetaData().getUrl()).getPath(); arcName = new File(arcPath).getName(); if (arcName.endsWith(".arc")) { arcName = arcName.substring(0, arcName.indexOf(".arc")); } reporter.setStatus(arcName); } if (rec.getStatusCode() != 200) continue; try { processRecord(arcName, rec, output); } catch (Throwable e) { LOG.log(Level.WARNING, "Error processing: " + arcLocation, e); } } } catch (Throwable e) { // problem parsing arc file LOG.log(Level.WARNING, "Error parsing: " + arcLocation, e); } } private void processRecord(final String arcName, final ARCRecord rec, OutputCollector output) throws IOException { ARCRecordMetaData arcData = rec.getMetaData(); String url = arcData.getUrl(); String mimetype = arcData.getMimetype(); if (mimetype != null && mimetype.length() > 0) { mimetype = mimetype.toLowerCase(); } else { MimeType mt = mimeTypes.getMimeType(url); if (mt != null) { mimetype = mt.getName(); } } if (!indexAll) { if ((mimetype == null) || (!mimetype.startsWith(TEXT_TYPE) && !mimetype.startsWith(APPLICATION_TYPE))) { // Skip any but basic types. return; } } String noSpacesMimetype = TextUtils.replaceAll(WHITESPACE, mimetype, "-"); // LOG.info("adding " + Long.toString(arcData.getLength()) // + " bytes of mimetype " + noSpacesMimetype + " " + url); // copy http headers to nutch metadata Properties metaData = new Properties(); Header[] headers = rec.getHttpHeaders(); for (int j = 0; j < headers.length; j++) { Header header = headers[j]; metaData.put(header.getName(), header.getValue()); } // Add the collection name, the arcfile name, and the offset. // Also add mimetype. Needed by the ia indexers. 
metaData.put(ARCCOLLECTION_KEY, this.collectionName); metaData.put(ARCFILENAME_KEY, arcName); metaData.put(ARCFILEOFFSET_KEY, Long.toString(arcData.getOffset())); metaData.put(CONTENT_TYPE_KEY, mimetype); // Collect content bytes // TODO: Skip if unindexable type. rec.skipHttpHeader(); ByteArrayOutputStream contentBuffer = new ByteArrayOutputStream(); byte[] buf = new byte[1024 * 4]; int total = 0; int len = rec.read(buf, 0, buf.length); while (len != -1 && total < this.contentLimit) { total += len; contentBuffer.write(buf, 0, len); len = rec.read(buf, 0, buf.length); } // System.out.println("--------------"); // System.out.write(contentBuffer.toByteArray()); // System.out.println("--------------"); byte[] contentBytes = contentBuffer.toByteArray(); Content content = new Content(url, url, contentBytes, mimetype, metaData); metaData.put(Fetcher.DIGEST_KEY, MD5Hash.digest(contentBytes).toString()); metaData.put(Fetcher.SEGMENT_NAME_KEY, segmentName); CrawlDatum datum = new CrawlDatum(); datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS); long date = 0; try { date = ArchiveUtils.parse14DigitDate(arcData.getDate()).getTime(); } catch (java.text.ParseException e) { LOG.severe("Failed parse of date: " + arcData.getDate()); } datum.setFetchTime(date); Parse parse = null; ParseStatus parseStatus; try { Parser parser = ParserFactory.getParser(content.getContentType(), content.getBaseUrl()); parse = parser.getParse(content); parseStatus = parse.getData().getStatus(); } catch (Exception e) { parseStatus = new ParseStatus(e); } if (!parseStatus.isSuccess()) { LOG.warning("Error parsing: "+url+": "+parseStatus); parse = null; } output.collect(new UTF8(url), new FetcherOutput(datum, null, parse!=null ? 
new ParseImpl(parse):null)); } public void importArcs(File arcUrlsDir, File segment) throws IOException { LOG.info("ImportArcs: starting"); LOG.info("ImportArcs: arcUrlsDir: " + arcUrlsDir); LOG.info("ImportArcs: segment: " + segment); JobConf job = new JobConf(getConf()); job.setJar("build/nutchwax.job.jar"); job.set(Fetcher.SEGMENT_NAME_KEY, segment.getName()); job.setInputDir(arcUrlsDir); job.setMapperClass(ImportArcs.class); job.setOutputDir(segment); job.setOutputFormat(FetcherOutputFormat.class); job.setOutputKeyClass(UTF8.class); job.setOutputValueClass(FetcherOutput.class); JobClient.runJob(job); LOG.info("ImportArcs: done"); } public static void main(String[] args) throws Exception { // parse command line options String usage = "Usage: ImportArcs arcUrlsDir segmentDir"; if (args.length != 2) { System.err.println(usage); System.exit(-1); } File arcUrlsDir = new File(args[0]); File segmentDir = new File(args[1]); new ImportArcs(NutchConf.get()).importArcs(arcUrlsDir, segmentDir); } } --- NEW FILE: IndexArcs.java --- /** * Copyright 2005 The Apache Software Foundation * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
*/ package org.archive.access.nutch; import java.io.*; import java.net.*; import java.util.*; import java.text.*; import java.util.logging.*; import org.apache.nutch.io.*; import org.apache.nutch.fs.*; import org.apache.nutch.util.*; import org.apache.nutch.mapred.*; import org.apache.nutch.crawl.*; public class IndexArcs { public static final Logger LOG = LogFormatter.getLogger("org.archive.acces.nutch.IndexArcs"); private static String getDate() { return new SimpleDateFormat("yyyyMMddHHmmss").format (new Date(System.currentTimeMillis())); } /* Import and index a set of arc files. */ public static void main(String args[]) throws Exception { if (args.length < 1) { System.out.println("Usage: IndexArcs <arcsDir> [-dir d]"); return; } JobConf conf = new JobConf(NutchConf.get()); File arcsDir = null; File dir = new File("crawl-" + getDate()); for (int i = 0; i < args.length; i++) { if ("-dir".equals(args[i])) { dir = new File(args[i+1]); i++; } else if (args[i] != null) { arcsDir = new File(args[i]); } } NutchFileSystem fs = NutchFileSystem.get(conf); if (fs.exists(dir)) { throw new RuntimeException(dir + " already exists."); } LOG.info("IndexArcs started in: " + dir); LOG.info("arcsDir = " + arcsDir); File linkDb = new File(dir + "/linkdb"); File index = new File(dir + "/indexes"); File segments = new File(dir + "/segments"); File segment = new File(segments, getDate()); // import arcs new ImportArcs(conf).importArcs(arcsDir, segment); // invert links new LinkDb(conf).invert(linkDb, segments); // index everything new Indexer(conf).index(index, linkDb, fs.listFiles(segments)); LOG.info("IndexArcs finished: " + dir); } } --- Arc2Segment.java DELETED --- |
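Note on the content-collection loop in ImportArcs.processRecord above: the loop tests total < contentLimit only after a full read has been appended, so the buffered content can exceed http.content.limit by just under one buffer (4 KB here). The following is a minimal sketch of a strictly capped read, not the committed code. The class name BoundedReadSketch and the helper readUpTo are hypothetical, and a plain java.io.InputStream stands in for ARCRecord.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not part of the archive-access sources.
final class BoundedReadSketch {

    /** Reads from 'in' until end of stream or until exactly 'limit' bytes are buffered. */
    static byte[] readUpTo(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4 * 1024];
        int total = 0;
        while (total < limit) {
            // Never ask for more than the remaining budget, so the cap is exact.
            int len = in.read(buf, 0, Math.min(buf.length, limit - total));
            if (len == -1) {
                break; // end of stream before the limit was reached
            }
            out.write(buf, 0, len);
            total += len;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] capped = readUpTo(new ByteArrayInputStream(new byte[10000]), 5000);
        System.out.println(capped.length); // prints 5000
    }
}
```

Whether a slight overshoot matters depends on how downstream parsers treat truncated content; the committed loop is simpler and may be entirely adequate for indexing.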
From: Doug C. <cu...@us...> - 2005-09-01 18:45:38
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/bin In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24577/bin Added Files: Tag: mapred indexArcs.sh Removed Files: Tag: mapred arc2seg.sh arcs2segs.sh indexarcs.sh Log Message: Add indexArcs command. --- arcs2segs.sh DELETED --- --- indexarcs.sh DELETED --- --- NEW FILE: indexArcs.sh --- #!/bin/sh # resolve links - $0 may be a softlink THIS="$0" while [ -h "$THIS" ]; do ls=`ls -ld "$THIS"` link=`expr "$ls" : '.*-> \(.*\)$'` if expr "$link" : '.*/.*' > /dev/null; then THIS="$link" else THIS=`dirname "$THIS"`/"$link" fi done # some directories THIS_DIR=`dirname "$THIS"` PROJECT_HOME=`cd "$THIS_DIR/.." ; pwd` # If no 'nutch' directory, assume the binaries-only layout (All scripts are # in a single 'bin' directory and NUTCH_HOME=PROJECT_HOME). NUTCH_HOME="${PROJECT_HOME}/nutch" if [ ! -d "${NUTCH_HOME}" ] then NUTCH_HOME="${PROJECT_HOME}" fi if [ "$JAVA_HOME" = "" ]; then echo "Error: JAVA_HOME is not set." exit 1 fi JAVA=$JAVA_HOME/bin/java if [ -z "$JAVA_OPTS" ] then JAVA_OPTS=(-Xmx400m -server) fi # CLASSPATH initially contains conf dirs CLASSPATH=${PROJECT_HOME}/conf:${NUTCH_HOME}/conf # for developers, add classes to CLASSPATH if [ -d "$PROJECT_HOME/build/classes" ]; then CLASSPATH=${CLASSPATH}:$PROJECT_HOME/build/classes fi # for developers, add Nutch classes to CLASSPATH if [ -d "$NUTCH_HOME/build/classes" ]; then CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes fi if [ -d "$NUTCH_HOME/build/plugins" ]; then CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build fi if [ -d "$NUTCH_HOME/build/test/classes" ]; then CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes fi # so that filenames w/ spaces are handled correctly in loops below IFS= # for releases, add Nutch jar to CLASSPATH for f in $NUTCH_HOME/nutch-*.jar; do CLASSPATH=${CLASSPATH}:$f; done # add plugins to classpath if [ -d "$NUTCH_HOME/plugins" ]; then CLASSPATH=${CLASSPATH}:$NUTCH_HOME fi # Add our libs to CLASSPATH but take care to make 
heritrix jar come # before the httpclient jar (heritrix overlays a couple of httpclient # classes). httpclient_jar= for f in ${PROJECT_HOME}/lib/*.jar; do case `basename $f` in commons-httpclient*.jar) httpclient_jar=$f ;; *) CLASSPATH=${CLASSPATH}:$f ;; esac done CLASSPATH=${CLASSPATH}:${httpclient_jar} # Add Nutch libs to CLASSPATH for f in $NUTCH_HOME/lib/*.jar; do CLASSPATH=${CLASSPATH}:$f; done # restore ordinary behaviour unset IFS CLASS=org.archive.access.nutch.IndexArcs # cygwin path translation if expr match `uname` 'CYGWIN*' &> /dev/null; then CLASSPATH=`cygpath -p -w "$CLASSPATH"` fi # Run it. Add in to java.net.URL the heritrix rsync handler. exec $JAVA ${JAVA_OPTS[@]} \ -Djava.protocol.handler.pkgs=org.archive.net \ -classpath "$CLASSPATH" $CLASS "$@" --- arc2seg.sh DELETED --- |
From: Doug C. <cu...@us...> - 2005-09-01 17:37:02
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv6304 Modified Files: Tag: mapred Arc2Segment.java Log Message: Use reporter to set status. Index: Arc2Segment.java =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/Arc2Segment.java,v retrieving revision 1.28.2.7 retrieving revision 1.28.2.8 diff -C2 -d -r1.28.2.7 -r1.28.2.8 *** Arc2Segment.java 24 Aug 2005 04:15:48 -0000 1.28.2.7 --- Arc2Segment.java 1 Sep 2005 17:36:53 -0000 1.28.2.8 *************** *** 132,136 **** arcName = arcName.substring(0, arcName.indexOf(".arc")); } ! LOG.info("arcName="+arcName); } --- 132,136 ---- arcName = arcName.substring(0, arcName.indexOf(".arc")); } ! reporter.setStatus(arcName); } *************** *** 138,142 **** continue; try { ! processRecord(arcName, rec, output, reporter); } catch (Throwable e) { LOG.log(Level.WARNING, "Error processing: " + arcLocation, e); --- 138,142 ---- continue; try { ! processRecord(arcName, rec, output); } catch (Throwable e) { LOG.log(Level.WARNING, "Error processing: " + arcLocation, e); *************** *** 149,153 **** private void processRecord(final String arcName, final ARCRecord rec, ! OutputCollector output, Reporter reporter) throws IOException { --- 149,153 ---- private void processRecord(final String arcName, final ARCRecord rec, ! OutputCollector output) throws IOException { *************** *** 155,160 **** String url = arcData.getUrl(); - reporter.setStatus(url); - String mimetype = arcData.getMimetype(); if (mimetype != null && mimetype.length() > 0) { --- 155,158 ---- |
From: John A. K. <joh...@us...> - 2005-08-28 18:55:38
|
Update of /cvsroot/archive-access/archive-access/src/docs/warc In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv17998 Modified Files: warc_file_format.html warc_file_format.txt warc_file_format.xml Log Message: added complete in-line description of ANVL Index: warc_file_format.html =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** warc_file_format.html 26 Aug 2005 23:19:18 -0000 1.6 --- warc_file_format.html 28 Aug 2005 18:55:30 -0000 1.7 *************** *** 363,371 **** warc-file = 1*warc-record warc-record = header block CRLF CRLF ! header = header-line CRLF anvl-fields block = *OCTET </pre> <p>Elements of this grammar are further specified and explained in ! sections that follow (and in the case of <span class="emph">anvl-fields</span>, also a separate document). </p> <p>The record <span class="emph">header-line</span> is a --- 363,371 ---- warc-file = 1*warc-record warc-record = header block CRLF CRLF ! header = header-line CRLF *anvl-field CRLF block = *OCTET </pre> <p>Elements of this grammar are further specified and explained in ! sections that follow. </p> <p>The record <span class="emph">header-line</span> is a *************** *** 385,401 **** been written. </p> ! <p>After the <span class="emph">header-line</span> come any number of ! named fields in a line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a> [ANVL] that is very similar to that of email ! headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.</span><span>)</span></a>. Its format can be roughly summarized ! as the following: </p><pre> ! anvl-fields = *line CRLF ! 
line = (field / other-anvl) CRLF ! field = <field per RFC0822> ! other-anvl = <see ANVL> </pre> ! <p>This document defines a number of named fields which may appear in ! the <span class="emph">anvl-fields</span> area of the header. Note that ! the smallest possible <span class="emph">anvl-fields</span> is a single CRLF, indicating no named fields. </p> --- 385,411 ---- been written. </p> ! <p>After the <span class="emph">header-line</span> come zero or more ! named <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a> [ANVL] fields in a line-oriented syntax ! very similar to that of email headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.</span><span>)</span></a> but with ! unrestricted "text" values (none of its 13 reserved special characters). ! The precise format is as follows: </p><pre> ! anvl-field = field-name ":" [ field-body ] CRLF ! field-name = 1*<any CHAR, excluding control-chars and ":"> ! field-body = text [CRLF LWSP-char field-body] ! text = 1*<any UTF-8 character, including bare ! CR and bare LF, but NOT including CRLF> ! ; (Octal, Decimal.) ! CHAR = <any ASCII/UTF-8 character> ; (0-177, 0.-127.) ! CR = <ASCII CR, carriage return> ; ( 15, 13.) ! LF = <ASCII LF, linefeed> ; ( 12, 10.) ! SPACE = <ASCII SP, space> ; ( 40, 32.) ! HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.) ! CRLF = CR LF ! LWSP-char = SPACE / HTAB ; semantics = SPACE </pre> ! <p>This document defines a number of named fields that may appear as ! an <span class="emph">anvl-field</span>. Note that the smallest ! possible <span class="emph">anvl-fields</span> is a single CRLF, indicating no named fields. </p> *************** *** 632,636 **** </p> <p>Named parameters after the header-line, if any, follow the ! 
line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a> [ANVL]. Normally, named parameters are optional and their order is insignificant, however, specific record types require that certain named parameters --- 642,647 ---- </p> <p>Named parameters after the header-line, if any, follow the ! line-oriented syntax defined previously (also know as ! <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a> [ANVL]). Normally, named parameters are optional and their order is insignificant, however, specific record types require that certain named parameters Index: warc_file_format.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -d -r1.10 -r1.11 *** warc_file_format.xml 26 Aug 2005 23:19:18 -0000 1.10 --- warc_file_format.xml 28 Aug 2005 18:55:30 -0000 1.11 *************** *** 203,207 **** warc-file = 1*warc-record warc-record = header block CRLF CRLF ! header = header-line CRLF anvl-fields block = *OCTET </artwork> --- 203,207 ---- warc-file = 1*warc-record warc-record = header block CRLF CRLF ! header = header-line CRLF *anvl-field CRLF block = *OCTET </artwork> *************** *** 209,214 **** <t>Elements of this grammar are further specified and explained in ! sections that follow (and in the case of <spanx ! style="emph">anvl-fields</spanx>, also a separate document).</t> <t>The record <spanx style="emph">header-line</spanx> is a --- 209,213 ---- <t>Elements of this grammar are further specified and explained in ! sections that follow.</t> <t>The record <spanx style="emph">header-line</spanx> is a *************** *** 233,254 **** been written.</t> ! 
<t>After the <spanx style="emph">header-line</spanx> come any number of ! named fields in a line-oriented syntax called <xref ! target="ANVL">ANVL</xref> that is very similar to that of email ! headers <xref target="RFC0822" />. Its format can be roughly summarized ! as the following:</t> <figure> <artwork> ! anvl-fields = *line CRLF ! line = (field / other-anvl) CRLF ! field = <field per RFC0822> ! other-anvl = <see ANVL> </artwork> </figure> ! <t>This document defines a number of named fields which may appear in ! the <spanx style="emph">anvl-fields</spanx> area of the header. Note that ! the smallest possible <spanx style="emph">anvl-fields</spanx> is a single CRLF, indicating no named fields.</t> --- 232,262 ---- been written.</t> ! <t>After the <spanx style="emph">header-line</spanx> come zero or more ! named <xref target="ANVL">ANVL</xref> fields in a line-oriented syntax ! very similar to that of email headers <xref target="RFC0822" /> but with ! unrestricted "text" values (none of its 13 reserved special characters). ! The precise format is as follows:</t> <figure> <artwork> ! anvl-field = field-name ":" [ field-body ] CRLF ! field-name = 1*<any CHAR, excluding control-chars and ":"> ! field-body = text [CRLF LWSP-char field-body] ! text = 1*<any UTF-8 character, including bare ! CR and bare LF, but NOT including CRLF> ! ; (Octal, Decimal.) ! CHAR = <any ASCII/UTF-8 character> ; (0-177, 0.-127.) ! CR = <ASCII CR, carriage return> ; ( 15, 13.) ! LF = <ASCII LF, linefeed> ; ( 12, 10.) ! SPACE = <ASCII SP, space> ; ( 40, 32.) ! HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.) ! CRLF = CR LF ! LWSP-char = SPACE / HTAB ; semantics = SPACE </artwork> </figure> ! <t>This document defines a number of named fields that may appear as ! an <spanx style="emph">anvl-field</spanx>. Note that the smallest ! 
possible <spanx style="emph">anvl-fields</spanx> is a single CRLF, indicating no named fields.</t> *************** *** 488,492 **** <t>Named parameters after the header-line, if any, follow the ! line-oriented syntax called <xref target="ANVL">ANVL</xref>. Normally, named parameters are optional and their order is insignificant, however, specific record types require that certain named parameters --- 496,501 ---- <t>Named parameters after the header-line, if any, follow the ! line-oriented syntax defined previously (also know as ! <xref target="ANVL">ANVL</xref>). Normally, named parameters are optional and their order is insignificant, however, specific record types require that certain named parameters Index: warc_file_format.txt =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** warc_file_format.txt 26 Aug 2005 23:19:18 -0000 1.5 --- warc_file_format.txt 28 Aug 2005 18:55:30 -0000 1.6 *************** *** 293,302 **** warc-file = 1*warc-record warc-record = header block CRLF CRLF ! header = header-line CRLF anvl-fields block = *OCTET Elements of this grammar are further specified and explained in ! sections that follow (and in the case of _anvl-fields_, also a ! separate document). The record _header-line_ is a newline-terminated sequence of --- 293,301 ---- warc-file = 1*warc-record warc-record = header block CRLF CRLF ! header = header-line CRLF *anvl-field CRLF block = *OCTET Elements of this grammar are further specified and explained in ! sections that follow. The record _header-line_ is a newline-terminated sequence of *************** *** 314,333 **** completely known after the record content _block_ has been written. ! After the _header-line_ come any number of named fields in a line- ! oriented syntax called ANVL [ANVL] that is very similar to that of ! email headers [RFC0822]. 
Its format can be roughly summarized as the ! following: ! ! anvl-fields = *line CRLF ! line = (field / other-anvl) CRLF ! field = <field per RFC0822> ! other-anvl = <see ANVL> ! ! This document defines a number of named fields which may appear in ! the _anvl-fields_ area of the header. Note that the smallest ! possible _anvl-fields_ is a single CRLF, indicating no named fields. ! Following the headers comes the content _block_, if any, which may ! contain arbitrary binary data, up through the remaining number of --- 313,333 ---- completely known after the record content _block_ has been written. ! After the _header-line_ come zero or more named ANVL [ANVL] fields in ! a line-oriented syntax very similar to that of email headers ! [RFC0822] but with unrestricted "text" values (none of its 13 ! reserved special characters). The precise format is as follows: ! anvl-field = field-name ":" [ field-body ] CRLF ! field-name = 1*<any CHAR, excluding control-chars and ":"> ! field-body = text [CRLF LWSP-char field-body] ! text = 1*<any UTF-8 character, including bare ! CR and bare LF, but NOT including CRLF> ! ; (Octal, Decimal.) ! CHAR = <any ASCII/UTF-8 character> ; (0-177, 0.-127.) ! CR = <ASCII CR, carriage return> ; ( 15, 13.) ! LF = <ASCII LF, linefeed> ; ( 12, 10.) ! SPACE = <ASCII SP, space> ; ( 40, 32.) ! HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.) ! CRLF = CR LF *************** *** 338,341 **** --- 338,349 ---- + LWSP-char = SPACE / HTAB ; semantics = SPACE + + This document defines a number of named fields that may appear as an + _anvl-field_. Note that the smallest possible _anvl-fields_ is a + single CRLF, indicating no named fields. + + Following the headers comes the content _block_, if any, which may + contain arbitrary binary data, up through the remaining number of octets as specified in the previously-given _data-length_ parameter. 
Finally come two CRLF newlines, not counted in the declared record *************** *** 381,392 **** - - - - - - - - Kunze, et al. Expires January 2, 2006 [Page 7] --- 389,392 ---- *************** *** 658,668 **** Named parameters after the header-line, if any, follow the line- ! oriented syntax called ANVL [ANVL]. Normally, named parameters are ! optional and their order is insignificant, however, specific record ! types require that certain named parameters be present (and future ! extensions may have ordering requirements). If there are no named ! parameters present, the entire WARC record header is the line of ! positional parameters followed by one blank line (two consecutive ! newlines). --- 658,668 ---- Named parameters after the header-line, if any, follow the line- ! oriented syntax defined previously (also know as ANVL [ANVL]). ! Normally, named parameters are optional and their order is ! insignificant, however, specific record types require that certain ! named parameters be present (and future extensions may have ordering ! requirements). If there are no named parameters present, the entire ! WARC record header is the line of positional parameters followed by ! one blank line (two consecutive newlines). |
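Note on the anvl-field grammar added in this commit: the rules (field-name ":" [ field-body ] CRLF, with continuation lines introduced by CRLF followed by a SPACE or HTAB) can be read as a small line-oriented parser. The following is a rough sketch under stated assumptions, not code from the archive-access project. The class AnvlFieldSketch and method parseFields are hypothetical; leading blanks after the colon are dropped and a folded line is joined with a single space, both of which are interpretations of "LWSP-char ; semantics = SPACE" rather than something the draft spells out. Field-name validation (no control characters) is omitted, and a repeated field name simply overwrites the earlier value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the anvl-field grammar; not part of the WARC tools.
final class AnvlFieldSketch {

    /** Parses the named-field area that follows the header-line, up to the blank line. */
    static Map<String, String> parseFields(String headerFields) {
        Map<String, String> fields = new LinkedHashMap<>();
        String currentName = null;
        // Split only on CRLF: bare CR and bare LF are legal inside "text".
        for (String line : headerFields.split("\r\n", -1)) {
            if (line.isEmpty()) {
                break; // lone CRLF: end of the named fields
            }
            char first = line.charAt(0);
            if ((first == ' ' || first == '\t') && currentName != null) {
                // Continuation line: fold it into the previous field-body.
                String soFar = fields.get(currentName);
                String rest = line.substring(1);
                fields.put(currentName, soFar.isEmpty() ? rest : soFar + " " + rest);
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 1) {
                throw new IllegalArgumentException("not an anvl-field: " + line);
            }
            currentName = line.substring(0, colon);
            fields.put(currentName, stripLeadingBlanks(line.substring(colon + 1)));
        }
        return fields;
    }

    // Assumption: blanks between ":" and the value are not significant.
    private static String stripLeadingBlanks(String s) {
        int i = 0;
        while (i < s.length() && (s.charAt(i) == ' ' || s.charAt(i) == '\t')) {
            i++;
        }
        return s.substring(i);
    }

    public static void main(String[] args) {
        // "urn:example:1" is a made-up record-id used only for this demonstration.
        String header = "Warcinfo-ID: urn:example:1\r\n"
                + "Description: first line\r\n"
                + "\tsecond line\r\n"
                + "\r\n";
        System.out.println(parseFields(header));
    }
}
```

Because only the first colon on a line separates name from body, values that themselves contain colons (such as URIs) pass through unchanged. An input consisting of a single CRLF yields an empty map, matching the draft's remark that the smallest possible field area is a single CRLF.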
From: John A. K. <joh...@us...> - 2005-08-26 23:19:27
|
Update of /cvsroot/archive-access/archive-access/src/docs/warc In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20105 Modified Files: warc_file_format.html warc_file_format.txt warc_file_format.xml Log Message: added proposed text for a Warcinfo-ID named parameter Index: warc_file_format.html =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** warc_file_format.html 24 Aug 2005 01:39:51 -0000 1.5 --- warc_file_format.html 26 Aug 2005 23:19:18 -0000 1.6 *************** *** 234,238 **** GZIP extra field: skip-lengths ('sl')<br /> <a href="#anchor26">9.3.</a> ! GZIP WARC File Extension<br /> <a href="#anchor27">10.</a> WARC File Name and Size Recommendations<br /> --- 234,238 ---- GZIP extra field: skip-lengths ('sl')<br /> <a href="#anchor26">9.3.</a> ! GZIP WARC File Name Suffix<br /> <a href="#anchor27">10.</a> WARC File Name and Size Recommendations<br /> *************** *** 406,412 **** record <span class="emph">data-length</span>. </p> ! <p>It is customary, and recommended, that the first record of a WARC ! describe the file itself, using the 'warcinfo' record-type, and a ! descriptive content block format. </p> <p>Subsequent records contain content blocks that are either the --- 406,415 ---- record <span class="emph">data-length</span>. </p> ! <p>It is often the case that the first record of a WARC to has the ! record-type 'warcinfo' and is used to describe the records that follow it. ! It is always the case that the concatenation of any two WARC files is a ! syntactically correct WARC file; care should be taken, however, when ! concatenation would inadvertently cause 'warcinfo' records to appear ! at points in the result that would create confusion. 
</p> <p>Subsequent records contain content blocks that are either the *************** *** 851,854 **** --- 854,873 ---- </dd> + <dt>Warcinfo-ID: record-id</dt> + <dd> + When present, indicates the record-id of the associated 'warcinfo' + record for this record. Typically, the Warcinfo-ID parameter is used + when the context of the applicable 'warcinfo' record is unavailable, + such as after distributing single records into separate WARC files. + WARC writing applications (such web crawlers) may choose to record + this parameter routinely (e.g., before computing checksums). + + The Warcinfo-ID parameter overrides any association with a previously + occurring (in the WARC) 'warcinfo' record, thus providing a way to protect + the true association when records are combined from different WARCs. + Use of this parameter in a record of type 'warcinfo' is undefined and + reserved for possible future extension. + + </dd> </dl></blockquote> <a name="anchor15"></a><br /><hr /> *************** *** 1113,1124 **** <a name="anchor26"></a><br /><hr /> <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.3"></a><h3>9.3. GZIP WARC File Extension</h3> ! <p>WARC files compressed with the above conventions remain legal GZIP ! files. Thus, to ensure they are properly recognized by GZIP tools, they ! should only get the customary additional ".gz" file extension suffix, ! making their suffix ".warc.gz". Software which works with WARC files ! compressed using these conventions will detect and exploit them; other ! GZIP software will harmlessly ignore the extensions. </p> <a name="anchor27"></a><br /><hr /> --- 1132,1143 ---- <a name="anchor26"></a><br /><hr /> <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.3"></a><h3>9.3. 
GZIP WARC File Name Suffix</h3> ! <p>A WARC file compressed with the extra GZIP field conventions described ! in this document is a legal GZIP file. To ensure that it is properly ! recognized by GZIP tools, its name should have the customary ".gz" ! appended to it, making the complete suffix, ".warc.gz". ! GZIP software that does not recognize the extra GZIP fields will ! simply pass over them without benefit or harm. </p> <a name="anchor27"></a><br /><hr /> Index: warc_file_format.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** warc_file_format.xml 26 Aug 2005 22:29:40 -0000 1.9 --- warc_file_format.xml 26 Aug 2005 23:19:18 -0000 1.10 *************** *** 260,266 **** record <spanx style="emph">data-length</spanx>.</t> ! <t>It is customary, and recommended, that the first record of a WARC ! describe the file itself, using the 'warcinfo' record-type, and a ! descriptive content block format.</t> <t>Subsequent records contain content blocks that are either the --- 260,269 ---- record <spanx style="emph">data-length</spanx>.</t> ! <t>It is often the case that the first record of a WARC to has the ! record-type 'warcinfo' and is used to describe the records that follow it. ! It is always the case that the concatenation of any two WARC files is a ! syntactically correct WARC file; care should be taken, however, when ! concatenation would inadvertently cause 'warcinfo' records to appear ! at points in the result that would create confusion.</t> <t>Subsequent records contain content blocks that are either the *************** *** 680,683 **** --- 683,701 ---- </t> + <t hangText="Warcinfo-ID: record-id"> + When present, indicates the record-id of the associated 'warcinfo' + record for this record. 
Typically, the Warcinfo-ID parameter is used + when the context of the applicable 'warcinfo' record is unavailable, + such as after distributing single records into separate WARC files. + WARC writing applications (such web crawlers) may choose to record + this parameter routinely (e.g., before computing checksums). + + The Warcinfo-ID parameter overrides any association with a previously + occurring (in the WARC) 'warcinfo' record, thus providing a way to protect + the true association when records are combined from different WARCs. + Use of this parameter in a record of type 'warcinfo' is undefined and + reserved for possible future extension. + </t> + </list> Index: warc_file_format.txt =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** warc_file_format.txt 23 Aug 2005 17:35:41 -0000 1.4 --- warc_file_format.txt 26 Aug 2005 23:19:18 -0000 1.5 *************** *** 142,146 **** 9.1. Record-at-a-time Compression . . . . . . . . . . . . . . . 22 9.2. GZIP extra field: skip-lengths ('sl') . . . . . . . . . . 22 ! 9.3. GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23 10. WARC File Name and Size Recommendations . . . . . . . . . . . 24 11. Registration of MIME Media Type application/warc . . . . . . . 25 --- 142,146 ---- 9.1. Record-at-a-time Compression . . . . . . . . . . . . . . . 22 9.2. GZIP extra field: skip-lengths ('sl') . . . . . . . . . . 22 ! 9.3. GZIP WARC File Name Suffix . . . . . . . . . . . . . . . . 23 10. WARC File Name and Size Recommendations . . . . . . . . . . . 24 11. Registration of MIME Media Type application/warc . . . . . . . 25 *************** *** 342,348 **** _data-length_. ! It is customary, and recommended, that the first record of a WARC ! describe the file itself, using the 'warcinfo' record-type, and a ! descriptive content block format. 
Subsequent records contain content blocks that are either the direct --- 342,352 ---- _data-length_. ! It is often the case that the first record of a WARC to has the ! record-type 'warcinfo' and is used to describe the records that ! follow it. It is always the case that the concatenation of any two ! WARC files is a syntactically correct WARC file; care should be ! taken, however, when concatenation would inadvertently cause ! 'warcinfo' records to appear at points in the result that would ! create confusion. Subsequent records contain content blocks that are either the direct *************** *** 385,392 **** - - - - Kunze, et al. Expires January 2, 2006 [Page 7] --- 389,392 ---- *************** *** 474,480 **** describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost ! always refer to another record of another type, with hat other record ! holding original harvested or transformed content. (However, it is ! allowable for a 'metadata' record to refer to any record type, including other 'metadata' records, or to refer to no other individual record at all.) Any number of metadata records may be --- 474,480 ---- describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost ! always refer to another record of another type, with that other ! record holding original harvested or transformed content. (However, ! it is allowable for a 'metadata' record to refer to any record type, including other 'metadata' records, or to refer to no other individual record at all.) Any number of metadata records may be *************** *** 506,510 **** ! preferred if the current record's is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.) --- 506,510 ---- ! preferred if the current record is understandable standing alone. 
(It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.) *************** *** 532,544 **** A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival ! process. Typically, this is used to hold content ransformations that ! maintain viability of content after widely available rendering ools ! for the originally stored format disappear. As needed, the original ! content may be migrated (transformed) to a more viable format in ! order to keep the information usable with current tools while ! minimizing loss of information (intellectual content, look and feel, ! etc). Any number of transformation records may be created that reference a specific source record, which may itself contain ! ransformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record. Metadata records may be used to further describe --- 532,544 ---- A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival ! process. Typically, this is used to hold content transformations ! that maintain viability of content after widely available rendering ! tools for the originally stored format disappear. As needed, the ! original content may be migrated (transformed) to a more viable ! format in order to keep the information usable with current tools ! while minimizing loss of information (intellectual content, look and ! feel, etc). Any number of transformation records may be created that reference a specific source record, which may itself contain ! transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record. 
Metadata records may be used to further describe *************** *** 711,715 **** subject-uri The original URI whose collection gave rise to the ! information content in this record. In he context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', --- 711,715 ---- subject-uri The original URI whose collection gave rise to the ! information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', *************** *** 717,725 **** uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a ! synthesized value for the creation name of he WARC file, as a URI. Care should be taken to ensure that the URI in this value is - properly escaped (per [RFC2396] and that it is written with no --- 717,725 ---- uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a ! synthesized value for the creation name of the WARC file, as a ! URI. Care should be taken to ensure that the URI in this value is *************** *** 730,733 **** --- 730,734 ---- + properly escaped (per [RFC2396] and that it is written with no internal whitespace. *************** *** 780,784 **** - Kunze, et al. Expires January 2, 2006 [Page 14] --- 781,784 ---- *************** *** 825,829 **** A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about ! record-id considerations. This creates satellite record- ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some --- 825,829 ---- A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about ! record-id considerations. 
This creates satellite record-ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some *************** *** 850,871 **** Truncated: reason-token When present, indicates that the current record ends before the apparent end of the source material, but no ! continuation records are forthcoming. Possible values indicate he ! reason for the truncation: 'length' for exceeding a desired length ! limit; 'time' for exceeding a desired time limit during collection. ! ! ! ! ! ! ! ! ! ! ! ! ! --- 850,871 ---- Truncated: reason-token When present, indicates that the current record ends before the apparent end of the source material, but no ! continuation records are forthcoming. Possible values indicate ! the reason for the truncation: 'length' for exceeding a desired ! length limit; 'time' for exceeding a desired time limit during collection. ! Warcinfo-ID: record-id When present, indicates the record-id of the ! associated 'warcinfo' record for this record. Typically, the ! Warcinfo-ID parameter is used when the context of the applicable ! 'warcinfo' record is unavailable, such as after distributing ! single records into separate WARC files. WARC writing ! applications (such web crawlers) may choose to record this ! parameter routinely (e.g., before computing checksums). The ! Warcinfo-ID parameter overrides any association with a previously ! occurring (in the WARC) 'warcinfo' record, thus providing a way to ! protect the true association when records are combined from ! different WARCs. Use of this parameter in a record of type ! 'warcinfo' is undefined and reserved for possible future ! extension. *************** *** 974,978 **** records to be written without know their ultimate length, with only a small fixed-size edit to the header when the length is eventually ! know to complete the record. 
This named-field-based mechanism does not allow a later discovery that a record needs truncation or segmentation to be reflected via a small header edit; it requires --- 974,978 ---- records to be written without know their ultimate length, with only a small fixed-size edit to the header when the length is eventually ! known to complete the record. This named-field-based mechanism does not allow a later discovery that a record needs truncation or segmentation to be reflected via a small header edit; it requires *************** *** 1011,1015 **** with an incremented 'Segment-Number' field. They must also include a ! 'Segment-Origin-ID' field with a value of he Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters. --- 1011,1015 ---- with an incremented 'Segment-Number' field. They must also include a ! 'Segment-Origin-ID' field with a value of the Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters. *************** *** 1140,1144 **** Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file ! under a 'resource' type record. This includes files hat have meaningful URIs retrieved from a locally-accessible filesystem or other repository. --- 1140,1144 ---- Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file ! under a 'resource' type record. This includes files that have meaningful URIs retrieved from a locally-accessible filesystem or other repository. *************** *** 1184,1190 **** However, experience with the precursor ARC format at the Internet ! Archive has demonstrated hat applying simple standard compression can ! result in significant storage savings, while preserving random access ! to individual records. 
For this purpose, the GZIP format with customary "deflate" --- 1184,1190 ---- However, experience with the precursor ARC format at the Internet ! Archive has demonstrated that applying simple standard compression ! can result in significant storage savings, while preserving random ! access to individual records. For this purpose, the GZIP format with customary "deflate" *************** *** 1221,1229 **** Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after reading a ! small portion of a record, would like to skip to he next full record. ! In the absence of an external, precalculated index, using only the ! WARC record's uncompressed length would require the complete current ! record to be decompressed o find the start of the next record. ! --- 1221,1229 ---- Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after reading a ! small portion of a record, would like to skip to the next full ! record. In the absence of an external, precalculated index, using ! only the WARC record's uncompressed length would require the complete ! current record to be decompressed to find the start of the next ! record. *************** *** 1264,1275 **** appropriate. ! 9.3. GZIP WARC File Extension ! WARC files compressed with the above conventions remain legal GZIP ! files. Thus, to ensure hey are properly recognized by GZIP tools, ! they should only get the customary additional ".gz" file extension ! suffix, making their suffix ".warc.gz". Software which works with ! WARC files compressed using these conventions will detect and exploit ! them; other GZIP software will harmlessly ignore the extensions. --- 1264,1275 ---- appropriate. ! 9.3. GZIP WARC File Name Suffix ! A WARC file compressed with the extra GZIP field conventions ! described in this document is a legal GZIP file. To ensure that it ! 
is properly recognized by GZIP tools, its name should have the ! customary ".gz" appended to it, making the complete suffix, ! ".warc.gz". GZIP software that does not recognize the extra GZIP ! fields will simply pass over them without benefit or harm. *************** *** 1300,1304 **** Prefix is an abbreviation usually reflective of the project or crawl ! that created this file. imestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often --- 1300,1304 ---- Prefix is an abbreviation usually reflective of the project or crawl ! that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often *************** *** 1314,1319 **** This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted ! within WARC-creating institutions. he file name prefix "iipc" should ! be avoided unless participating in the IIPC naming registry. [REVIEW ISSUE: Discover sense of the group for what naming and --- 1314,1319 ---- This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted ! within WARC-creating institutions. The file name prefix "iipc" ! should be avoided unless participating in the IIPC naming registry. [REVIEW ISSUE: Discover sense of the group for what naming and *************** *** 1405,1409 **** After IESG approval, IANA is expected to register the WARC type ! "application/warc" using he application provided in this document. --- 1405,1409 ---- After IESG approval, IANA is expected to register the WARC type ! "application/warc" using the application provided in this document. *************** *** 1461,1465 **** This document could not have been written without major contributions ! 
from participants of he International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes. --- 1461,1465 ---- This document could not have been written without major contributions ! from participants of the International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes. *************** *** 1534,1538 **** blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to ! convey relatedness in he context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a --- 1534,1538 ---- blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to ! convey relatedness in the context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a *************** *** 1595,1602 **** <?xml version="1.0" encoding="UTF-8" standalone="yes"?> ! <warcmetadata> ! xmlns:dc="http://purl.org/dc/elements/1.1/" ! xmlns:dcterms="http://purl.org/dc/terms/" ! xmlns:warc="http://archive.org/warc/0.8/"> <warc:software> Heritrix 1.4.0 http://crawler.archive.org --- 1595,1602 ---- <?xml version="1.0" encoding="UTF-8" standalone="yes"?> ! <warcmetadata ! xmlns:dc="http://purl.org/dc/elements/1.1/" ! xmlns:dcterms="http://purl.org/dc/terms/" ! xmlns:warc="http://archive.org/warc/0.8/"> <warc:software> Heritrix 1.4.0 http://crawler.archive.org *************** *** 1611,1615 **** </warc:http-header-user-agent> <dc:format>WARC file version 0.8</dc:format> ! <dcterms:conformsTo nxsi:type="dcterms:URI"> http://www.archive.org/documents/WarcFileFormat.php </dcterms:conformsTo> --- 1611,1615 ---- </warc:http-header-user-agent> <dc:format>WARC file version 0.8</dc:format> ! 
<dcterms:conformsTo xsi:type="dcterms:URI"> http://www.archive.org/documents/WarcFileFormat.php </dcterms:conformsTo> *************** *** 1754,1763 **** Again, reference is made back to the original 'response' record. A ! new creation-date reflects he time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) ! The actual formats for describing he result of a revisit remain to be ! defined. Appendix B.7. Example of 'conversion' Record --- 1754,1763 ---- Again, reference is made back to the original 'response' record. A ! new creation-date reflects the time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) ! The actual formats for describing the result of a revisit remain to ! be defined. Appendix B.7. Example of 'conversion' Record |
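The override rule proposed for Warcinfo-ID in the message above can be sketched as a lookup over records in file order. This is a hedged Python illustration only: the dictionary keys (`record_id`, `record_type`, `warcinfo_id`) and the function name are hypothetical, and treating "previously occurring" as "nearest preceding 'warcinfo' record" is an assumption, since the proposed text does not say which earlier record applies.

```python
def resolve_warcinfo(records):
    """Map each non-warcinfo record-id to its associated warcinfo record-id."""
    current = None  # record-id of the most recent 'warcinfo' record seen
    associations = {}
    for rec in records:
        if rec["record_type"] == "warcinfo":
            # Warcinfo-ID inside a 'warcinfo' record is undefined in the
            # proposed text, so this sketch ignores it there.
            current = rec["record_id"]
            continue
        # An explicit Warcinfo-ID overrides the positional association.
        associations[rec["record_id"]] = rec.get("warcinfo_id") or current
    return associations


# Two WARCs concatenated; 'r2' was moved from the first and kept its link.
records = [
    {"record_id": "info-a", "record_type": "warcinfo"},
    {"record_id": "r1", "record_type": "response"},
    {"record_id": "info-b", "record_type": "warcinfo"},
    {"record_id": "r2", "record_type": "response", "warcinfo_id": "info-a"},
    {"record_id": "r3", "record_type": "resource"},
]
associations = resolve_warcinfo(records)
```

Here `r2` stays tied to `info-a` even though `info-b` precedes it in the combined file, which is the case the proposed parameter is meant to protect.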
From: John A. K. <joh...@us...> - 2005-08-26 22:29:48
|
Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv6493

Modified Files:
	warc_file_format.xml
Log Message:
tinkered with section 9.3 (GZIP WARC File Extension) for clarity

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** warc_file_format.xml 24 Aug 2005 01:39:50 -0000 1.8
--- warc_file_format.xml 26 Aug 2005 22:29:40 -0000 1.9
***************
*** 945,956 ****
  </section>

! <section title="GZIP WARC File Extension">

! <t>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure they are properly recognized by GZIP tools, they
! should only get the customary additional ".gz" file extension suffix,
! making their suffix ".warc.gz". Software which works with WARC files
! compressed using these conventions will detect and exploit them; other
! GZIP software will harmlessly ignore the extensions.</t>

  </section>
--- 945,956 ----
  </section>

! <section title="GZIP WARC File Name Suffix">

! <t>A WARC file compressed with the extra GZIP field conventions described
! in this document is a legal GZIP file. To ensure that it is properly
! recognized by GZIP tools, its name should have the customary ".gz"
! appended to it, making the complete suffix, ".warc.gz".
! GZIP software that does not recognize the extra GZIP fields will
! simply pass over them without benefit or harm.</t>

  </section>
|
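The claim in the revised section 9.3, that GZIP software unaware of the extra fields simply passes over them, can be checked with a short Python sketch. It builds GZIP members by hand following the RFC 1952 header layout, attaches an extra subfield with the id 'sl', and decompresses the result with the standard `gzip` module. The helper name is hypothetical, and the 8 zero bytes used as subfield data are a placeholder, not the actual skip-lengths layout, which is defined in section 9.2 of the draft and is not shown in this diff.

```python
import gzip
import struct
import zlib


def gzip_member_with_extra(payload: bytes, subfield_id: bytes,
                           subfield_data: bytes) -> bytes:
    """Build one GZIP member (RFC 1952) carrying a single extra subfield."""
    if len(subfield_id) != 2:
        raise ValueError("subfield id must be exactly two bytes")
    # Subfield layout: SI1 SI2, 2-byte little-endian length, then the data.
    extra = subfield_id + struct.pack("<H", len(subfield_data)) + subfield_data
    header = (
        b"\x1f\x8b"                      # GZIP magic number
        + b"\x08"                        # CM: deflate
        + b"\x04"                        # FLG: only FEXTRA set
        + struct.pack("<I", 0)           # MTIME: not recorded
        + b"\x00"                        # XFL
        + b"\xff"                        # OS: unknown
        + struct.pack("<H", len(extra))  # XLEN
        + extra
    )
    # Negative wbits produces a raw deflate stream without a zlib wrapper.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + body + trailer


PLACEHOLDER_SL = bytes(8)  # placeholder data, NOT the real 'sl' layout
first = gzip_member_with_extra(b"record one\r\n\r\n", b"sl", PLACEHOLDER_SL)
second = gzip_member_with_extra(b"record two\r\n\r\n", b"sl", PLACEHOLDER_SL)

# A generic GZIP reader skips the extra field and reads both members.
combined = gzip.decompress(first + second)
```

Concatenating the two members also mirrors the record-at-a-time compression named in section 9.1: each record is its own GZIP member, and the concatenation is still a legal GZIP file.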
From: Doug C. <cu...@us...> - 2005-08-24 04:15:56
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv22467/src/java/org/archive/access/nutch

Modified Files:
      Tag: mapred
	Arc2Segment.java
Log Message:
Put task timeout in nutch-site.xml so that it is seen when tasktracker is started.

Index: Arc2Segment.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/Arc2Segment.java,v
retrieving revision 1.28.2.6
retrieving revision 1.28.2.7
diff -C2 -d -r1.28.2.6 -r1.28.2.7
*** Arc2Segment.java 22 Aug 2005 18:18:34 -0000 1.28.2.6
--- Arc2Segment.java 24 Aug 2005 04:15:48 -0000 1.28.2.7
***************
*** 258,263 ****
  job.set(Fetcher.SEGMENT_NAME_KEY, segment.getName());

- job.set("mapred.task.timeout", 60 * 60 * 1000); // 1 hour
-
  job.setInputDir(arcUrlsDir);
  job.setMapperClass(Arc2Segment.class);
--- 258,261 ----
|
From: Doug C. <cu...@us...> - 2005-08-24 04:15:56
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv22467/conf

Modified Files:
      Tag: mapred
	nutch-site.xml
Log Message:
Put task timeout in nutch-site.xml so that it is seen when tasktracker is started.

Index: nutch-site.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml,v
retrieving revision 1.24.2.1
retrieving revision 1.24.2.2
diff -C2 -d -r1.24.2.1 -r1.24.2.2
*** nutch-site.xml 22 Aug 2005 18:18:34 -0000 1.24.2.1
--- nutch-site.xml 24 Aug 2005 04:15:48 -0000 1.24.2.2
***************
*** 49,52 ****
--- 49,57 ----
  </property>

+ <property>
+ <name>mapred.task.timeout</name>
+ <value>3600000</value>
+ </property>
+
  <!-- Override a few Nutch defaults -->

|
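The value added to nutch-site.xml above is the same one hour that Arc2Segment.java previously set in code as 60 * 60 * 1000 milliseconds. A small Python sketch of reading such a property and checking the arithmetic follows; the helper name `read_property` is hypothetical, and the `<nutch-conf>` root element is an assumption, since the diff shows only the `<property>` block and not the enclosing element.

```python
import xml.etree.ElementTree as ET

# Assumed wrapper element; only the <property> block appears in the diff.
SITE_XML = """<nutch-conf>
<property>
  <name>mapred.task.timeout</name>
  <value>3600000</value>
</property>
</nutch-conf>"""


def read_property(xml_text, name):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


timeout_ms = int(read_property(SITE_XML, "mapred.task.timeout"))
timeout_hours = timeout_ms / (60 * 60 * 1000)  # milliseconds to hours
```

Keeping the setting in the site configuration rather than in the job code matches the log message: the tasktracker reads its configuration at startup, so a value set only on the job would not be seen by it.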
From: Michael S. <sta...@us...> - 2005-08-24 01:40:01
|
Update of /cvsroot/archive-access/archive-access/src/docs/warc In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25656 Modified Files: warc_file_format.xml warc_file_format.html Log Message: * warc_file_format.xml Added entity definition for mdash. Typos. Fixed warcinfo example xml. Index: warc_file_format.html =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** warc_file_format.html 23 Aug 2005 17:35:41 -0000 1.4 --- warc_file_format.html 24 Aug 2005 01:39:51 -0000 1.5 *************** *** 411,417 **** </p> <p>Subsequent records contain content blocks that are either the ! direct result of a retrieval attempt — web pages, inline images, URL redirection information, DNS hostname lookup results, standalone ! files, etc. — or they are synthesized content blocks (e.g., metadata, transformed content) that provide additional information about archived content. Any content block may contain arbitrary text --- 411,417 ---- </p> <p>Subsequent records contain content blocks that are either the ! direct result of a retrieval attempt — web pages, inline images, URL redirection information, DNS hostname lookup results, standalone ! files, etc. — or they are synthesized content blocks (e.g., metadata, transformed content) that provide additional information about archived content. Any content block may contain arbitrary text *************** *** 501,505 **** explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost always refer to ! another record of another type, with hat other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other --- 501,505 ---- explain, or accompany a harvested resource, in ways not covered by other record types. 
A 'metadata' record will almost always refer to ! another record of another type, with that other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other *************** *** 527,531 **** <p>A 'revisit' record should only be used when interpreting the record requires consulting a previous record; other record types should be ! preferred if the current record's is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.) --- 527,531 ---- <p>A 'revisit' record should only be used when interpreting the record requires consulting a previous record; other record types should be ! preferred if the current record is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.) *************** *** 555,560 **** <p>A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival ! process. Typically, this is used to hold content ransformations that ! maintain viability of content after widely available rendering ools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order --- 555,560 ---- <p>A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival ! process. Typically, this is used to hold content transformations that ! maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order *************** *** 562,566 **** loss of information (intellectual content, look and feel, etc). 
Any number of transformation records may be created that reference a ! specific source record, which may itself contain ransformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original --- 562,566 ---- loss of information (intellectual content, look and feel, etc). Any number of transformation records may be created that reference a ! specific source record, which may itself contain transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original *************** *** 662,666 **** The number of octets in the record, starting with the first letter ("w") of the first token, through to the end of the content block ! — not including the 2 record-ending newlines. After proceeding this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the --- 662,666 ---- The number of octets in the record, starting with the first letter ("w") of the first token, through to the end of the content block ! — not including the 2 record-ending newlines. After proceeding this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the *************** *** 688,697 **** <dd> The original URI whose collection gave rise to the information content ! in this record. In he context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a ! synthesized value for the creation name of he WARC file, as a URI. <br /> --- 688,697 ---- <dd> The original URI whose collection gave rise to the information content ! in this record. 
In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a ! synthesized value for the creation name of the WARC file, as a URI. <br /> *************** *** 820,824 **** A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about record-id ! considerations. This creates satellite record- ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some cases derivation) of --- 820,824 ---- A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about record-id ! considerations. This creates satellite record-ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some cases derivation) of *************** *** 846,850 **** When present, indicates that the current record ends before the apparent end of the source material, but no continuation records are ! forthcoming. Possible values indicate he reason for the truncation: 'length' for exceeding a desired length limit; 'time' for exceeding a desired time limit during collection. --- 846,850 ---- When present, indicates that the current record ends before the apparent end of the source material, but no continuation records are ! forthcoming. Possible values indicate the reason for the truncation: 'length' for exceeding a desired length limit; 'time' for exceeding a desired time limit during collection. *************** *** 883,887 **** allow records to be written without know their ultimate length, with only a small fixed-size edit to the header when the length is ! 
eventually know to complete the record. This named-field-based mechanism does not allow a later discovery that a record needs truncation or segmentation to be reflected via a small header edit; it --- 883,887 ---- allow records to be written without know their ultimate length, with only a small fixed-size edit to the header when the length is ! eventually known to complete the record. This named-field-based mechanism does not allow a later discovery that a record needs truncation or segmentation to be reflected via a small header edit; it *************** *** 917,921 **** <p>All subsequent segments must have a record type of 'continuation', with an incremented 'Segment-Number' field. They must also include a ! 'Segment-Origin-ID' field with a value of he Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters. --- 917,921 ---- <p>All subsequent segments must have a record type of 'continuation', with an incremented 'Segment-Number' field. They must also include a ! 'Segment-Origin-ID' field with a value of the Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters. *************** *** 1008,1012 **** <p>Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file ! under a 'resource' type record. This includes files hat have meaningful URIs retrieved from a locally-accessible filesystem or other repository. --- 1008,1012 ---- <p>Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file ! under a 'resource' type record. This includes files that have meaningful URIs retrieved from a locally-accessible filesystem or other repository. *************** *** 1033,1037 **** </p> <p>However, experience with the precursor ARC format at the Internet ! 
Archive has demonstrated hat applying simple standard compression can result in significant storage savings, while preserving random access to individual records. --- 1033,1037 ---- </p> <p>However, experience with the precursor ARC format at the Internet ! Archive has demonstrated that applying simple standard compression can result in significant storage savings, while preserving random access to individual records. *************** *** 1075,1082 **** <p>Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after ! reading a small portion of a record, would like to skip to he next full record. In the absence of an external, precalculated index, using only the WARC record's uncompressed length would require the complete ! current record to be decompressed o find the start of the next record. </p> --- 1075,1082 ---- <p>Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after ! reading a small portion of a record, would like to skip to the next full record. In the absence of an external, precalculated index, using only the WARC record's uncompressed length would require the complete ! current record to be decompressed to find the start of the next record. </p> *************** *** 1116,1120 **** <p>WARC files compressed with the above conventions remain legal GZIP ! files. Thus, to ensure hey are properly recognized by GZIP tools, they should only get the customary additional ".gz" file extension suffix, making their suffix ".warc.gz". Software which works with WARC files --- 1116,1120 ---- <p>WARC files compressed with the above conventions remain legal GZIP ! files. Thus, to ensure they are properly recognized by GZIP tools, they should only get the customary additional ".gz" file extension suffix, making their suffix ".warc.gz". 
Software which works with WARC files *************** *** 1134,1138 **** </p> <p>Prefix is an abbreviation usually reflective of the project or ! crawl that created this file. imestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often --- 1134,1138 ---- </p> <p>Prefix is an abbreviation usually reflective of the project or ! crawl that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often *************** *** 1148,1152 **** <p>This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted ! within WARC-creating institutions. he file name prefix "iipc" should be avoided unless participating in the IIPC naming registry. </p> --- 1148,1152 ---- <p>This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted ! within WARC-creating institutions. The file name prefix "iipc" should be avoided unless participating in the IIPC naming registry. </p> *************** *** 1212,1216 **** <p>After IESG approval, IANA is expected to register the WARC type ! "application/warc" using he application provided in this document. </p> <a name="anchor30"></a><br /><hr /> --- 1212,1216 ---- <p>After IESG approval, IANA is expected to register the WARC type ! "application/warc" using the application provided in this document. </p> <a name="anchor30"></a><br /><hr /> *************** *** 1219,1223 **** <p>This document could not have been written without major ! contributions from participants of he International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes. --- 1219,1223 ---- <p>This document could not have been written without major ! 
contributions from participants of the International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes. *************** *** 1246,1250 **** blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to ! convey relatedness in he context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a --- 1246,1250 ---- blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to ! convey relatedness in the context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a *************** *** 1305,1312 **** <?xml version="1.0" encoding="UTF-8" standalone="yes"?> ! <warcmetadata> ! xmlns:dc="http://purl.org/dc/elements/1.1/" ! xmlns:dcterms="http://purl.org/dc/terms/" ! xmlns:warc="http://archive.org/warc/0.8/"> <warc:software> Heritrix 1.4.0 http://crawler.archive.org --- 1305,1312 ---- <?xml version="1.0" encoding="UTF-8" standalone="yes"?> ! <warcmetadata ! xmlns:dc="http://purl.org/dc/elements/1.1/" ! xmlns:dcterms="http://purl.org/dc/terms/" ! xmlns:warc="http://archive.org/warc/0.8/"> <warc:software> Heritrix 1.4.0 http://crawler.archive.org *************** *** 1321,1325 **** </warc:http-header-user-agent> <dc:format>WARC file version 0.8</dc:format> ! <dcterms:conformsTo nxsi:type="dcterms:URI"> http://www.archive.org/documents/WarcFileFormat.php </dcterms:conformsTo> --- 1321,1325 ---- </warc:http-header-user-agent> <dc:format>WARC file version 0.8</dc:format> ! <dcterms:conformsTo xsi:type="dcterms:URI"> http://www.archive.org/documents/WarcFileFormat.php </dcterms:conformsTo> *************** *** 1446,1454 **** </pre> <p>Again, reference is made back to the original 'response' record. A ! 
new creation-date reflects he time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) ! The actual formats for describing he result of a revisit remain to be defined. </p> --- 1446,1454 ---- </pre> <p>Again, reference is made back to the original 'response' record. A ! new creation-date reflects the time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) ! The actual formats for describing the result of a revisit remain to be defined. </p> Index: warc_file_format.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** warc_file_format.xml 23 Aug 2005 17:35:41 -0000 1.7 --- warc_file_format.xml 24 Aug 2005 01:39:50 -0000 1.8 *************** *** 2,5 **** --- 2,7 ---- <!DOCTYPE rfc SYSTEM 'rfcXXXX.dtd' [ + <!ENTITY mdash '—' > + <!ENTITY rfc0822 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0822.xml'> <!ENTITY rfc1034 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1034.xml'> *************** *** 349,353 **** explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost always refer to ! another record of another type, with hat other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other --- 351,355 ---- explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost always refer to ! 
another record of another type, with that other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other *************** *** 375,379 **** <t>A 'revisit' record should only be used when interpreting the record requires consulting a previous record; other record types should be ! preferred if the current record's is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.)</t> --- 377,381 ---- <t>A 'revisit' record should only be used when interpreting the record requires consulting a previous record; other record types should be ! preferred if the current record is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.)</t> *************** *** 403,408 **** <t>A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival ! process. Typically, this is used to hold content ransformations that ! maintain viability of content after widely available rendering ools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order --- 405,410 ---- <t>A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival ! process. Typically, this is used to hold content transformations that ! maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order *************** *** 410,414 **** loss of information (intellectual content, look and feel, etc). Any number of transformation records may be created that reference a ! 
specific source record, which may itself contain ransformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original --- 412,416 ---- loss of information (intellectual content, look and feel, etc). Any number of transformation records may be created that reference a ! specific source record, which may itself contain transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original *************** *** 535,544 **** <t hangText="subject-uri"> The original URI whose collection gave rise to the information content ! in this record. In he context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a ! synthesized value for the creation name of he WARC file, as a URI. <vspace blankLines="2" /> --- 537,546 ---- <t hangText="subject-uri"> The original URI whose collection gave rise to the information content ! in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a ! synthesized value for the creation name of the WARC file, as a URI. <vspace blankLines="2" /> *************** *** 650,654 **** A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about record-id ! considerations. 
This creates satellite record- ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some cases derivation) of --- 652,656 ---- A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about record-id ! considerations. This creates satellite record-ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some cases derivation) of *************** *** 673,677 **** When present, indicates that the current record ends before the apparent end of the source material, but no continuation records are ! forthcoming. Possible values indicate he reason for the truncation: 'length' for exceeding a desired length limit; 'time' for exceeding a desired time limit during collection. --- 675,679 ---- When present, indicates that the current record ends before the apparent end of the source material, but no continuation records are ! forthcoming. Possible values indicate the reason for the truncation: 'length' for exceeding a desired length limit; 'time' for exceeding a desired time limit during collection. *************** *** 713,717 **** allow records to be written without know their ultimate length, with only a small fixed-size edit to the header when the length is ! eventually know to complete the record. This named-field-based mechanism does not allow a later discovery that a record needs truncation or segmentation to be reflected via a small header edit; it --- 715,719 ---- allow records to be written without know their ultimate length, with only a small fixed-size edit to the header when the length is ! eventually known to complete the record. 
This named-field-based mechanism does not allow a later discovery that a record needs truncation or segmentation to be reflected via a small header edit; it *************** *** 745,749 **** <t>All subsequent segments must have a record type of 'continuation', with an incremented 'Segment-Number' field. They must also include a ! 'Segment-Origin-ID' field with a value of he Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters.</t> --- 747,751 ---- <t>All subsequent segments must have a record type of 'continuation', with an incremented 'Segment-Number' field. They must also include a ! 'Segment-Origin-ID' field with a value of the Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters.</t> *************** *** 838,842 **** <t>Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file ! under a 'resource' type record. This includes files hat have meaningful URIs retrieved from a locally-accessible filesystem or other repository.</t> --- 840,844 ---- <t>Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file ! under a 'resource' type record. This includes files that have meaningful URIs retrieved from a locally-accessible filesystem or other repository.</t> *************** *** 865,869 **** <t>However, experience with the precursor ARC format at the Internet ! Archive has demonstrated hat applying simple standard compression can result in significant storage savings, while preserving random access to individual records.</t> --- 867,871 ---- <t>However, experience with the precursor ARC format at the Internet ! 
Archive has demonstrated that applying simple standard compression can result in significant storage savings, while preserving random access to individual records.</t> *************** *** 905,912 **** <t>Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after ! reading a small portion of a record, would like to skip to he next full record. In the absence of an external, precalculated index, using only the WARC record's uncompressed length would require the complete ! current record to be decompressed o find the start of the next record.</t> --- 907,914 ---- <t>Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after ! reading a small portion of a record, would like to skip to the next full record. In the absence of an external, precalculated index, using only the WARC record's uncompressed length would require the complete ! current record to be decompressed to find the start of the next record.</t> *************** *** 946,950 **** <t>WARC files compressed with the above conventions remain legal GZIP ! files. Thus, to ensure hey are properly recognized by GZIP tools, they should only get the customary additional ".gz" file extension suffix, making their suffix ".warc.gz". Software which works with WARC files --- 948,952 ---- <t>WARC files compressed with the above conventions remain legal GZIP ! files. Thus, to ensure they are properly recognized by GZIP tools, they should only get the customary additional ".gz" file extension suffix, making their suffix ".warc.gz". Software which works with WARC files *************** *** 966,970 **** <t>Prefix is an abbreviation usually reflective of the project or ! crawl that created this file. imestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. 
Serial is an increasing serial-number within the process creating the files, often --- 968,972 ---- <t>Prefix is an abbreviation usually reflective of the project or ! crawl that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often *************** *** 980,984 **** <t>This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted ! within WARC-creating institutions. he file name prefix "iipc" should be avoided unless participating in the IIPC naming registry.</t> --- 982,986 ---- <t>This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted ! within WARC-creating institutions. The file name prefix "iipc" should be avoided unless participating in the IIPC naming registry.</t> *************** *** 1044,1048 **** <t>After IESG approval, IANA is expected to register the WARC type ! "application/warc" using he application provided in this document.</t> </section> --- 1046,1050 ---- <t>After IESG approval, IANA is expected to register the WARC type ! "application/warc" using the application provided in this document.</t> </section> *************** *** 1051,1055 **** <t>This document could not have been written without major ! contributions from participants of he International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes.</t> --- 1053,1057 ---- <t>This document could not have been written without major ! contributions from participants of the International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes.</t> *************** *** 1078,1082 **** blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to ! 
convey relatedness in he context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a --- 1080,1084 ---- blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to ! convey relatedness in the context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a *************** *** 1141,1148 **** <?xml version="1.0" encoding="UTF-8" standalone="yes"?> ! <warcmetadata> ! xmlns:dc="http://purl.org/dc/elements/1.1/" ! xmlns:dcterms="http://purl.org/dc/terms/" ! xmlns:warc="http://archive.org/warc/0.8/"> <warc:software> Heritrix 1.4.0 http://crawler.archive.org --- 1143,1150 ---- <?xml version="1.0" encoding="UTF-8" standalone="yes"?> ! <warcmetadata ! xmlns:dc="http://purl.org/dc/elements/1.1/" ! xmlns:dcterms="http://purl.org/dc/terms/" ! xmlns:warc="http://archive.org/warc/0.8/"> <warc:software> Heritrix 1.4.0 http://crawler.archive.org *************** *** 1157,1161 **** </warc:http-header-user-agent> <dc:format>WARC file version 0.8</dc:format> ! <dcterms:conformsTo nxsi:type="dcterms:URI"> http://www.archive.org/documents/WarcFileFormat.php </dcterms:conformsTo> --- 1159,1163 ---- </warc:http-header-user-agent> <dc:format>WARC file version 0.8</dc:format> ! <dcterms:conformsTo xsi:type="dcterms:URI"> http://www.archive.org/documents/WarcFileFormat.php </dcterms:conformsTo> *************** *** 1304,1312 **** <t>Again, reference is made back to the original 'response' record. A ! new creation-date reflects he time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) ! 
The actual formats for describing he result of a revisit remain to be defined.</t> --- 1306,1314 ---- <t>Again, reference is made back to the original 'response' record. A ! new creation-date reflects the time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) ! The actual formats for describing the result of a revisit remain to be defined.</t> |
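The warcinfo example fix in the commit above (moving the stray `>` so that the `xmlns:` declarations sit inside the `<warcmetadata` start tag) can be checked mechanically. The following is a minimal sketch in Python, assuming abbreviated copies of the two versions of the example. The `is_well_formed` helper and the `BEFORE`/`AFTER` names are illustrative only and are not part of the archive-access code base.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example before the fix (assumption: trimmed to two
# child elements). The start tag is closed too early, so the xmlns:
# declarations become plain character data and the prefixes stay undeclared.
BEFORE = """<warcmetadata>
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:warc="http://archive.org/warc/0.8/">
  <warc:software>Heritrix 1.4.0 http://crawler.archive.org</warc:software>
  <dc:format>WARC file version 0.8</dc:format>
</warcmetadata>"""

# Abbreviated copy after the fix: the declarations are attributes of the
# start tag, so the dc: and warc: prefixes are bound.
AFTER = """<warcmetadata
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:warc="http://archive.org/warc/0.8/">
  <warc:software>Heritrix 1.4.0 http://crawler.archive.org</warc:software>
  <dc:format>WARC file version 0.8</dc:format>
</warcmetadata>"""


def is_well_formed(text: str) -> bool:
    """Return True if a namespace-aware XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("before fix:", is_well_formed(BEFORE))
    print("after fix: ", is_well_formed(AFTER))
```

The standard-library parser is namespace-aware, so it should reject the earlier version at the first prefixed element and accept the corrected one. The `xsi:type` attribute touched by the same commit is left out of the sketch because the diff context does not show whether an `xsi` namespace declaration exists elsewhere in the example.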
From: John A. K. <joh...@us...> - 2005-08-23 17:36:04
Update of /cvsroot/archive-access/archive-access/src/docs/warc In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30443 Modified Files: warc_file_format.html warc_file_format.txt warc_file_format.xml Log Message: trivial changes (typos) plus test of xml2rfc-1.30 outputs Index: warc_file_format.html =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** warc_file_format.html 18 Aug 2005 01:57:10 -0000 1.3 --- warc_file_format.html 23 Aug 2005 17:35:41 -0000 1.4 *************** *** 3,7 **** <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="description" content="The WARC File Format (Version 0.8 rev B)"> ! <meta name="generator" content="xml2rfc v1.29 (http://xml.resource.org/)"> <style type='text/css'> <!-- --- 3,7 ---- <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="description" content="The WARC File Format (Version 0.8 rev B)"> ! <meta name="generator" content="xml2rfc v1.30 (http://xml.resource.org/)"> <style type='text/css'> <!-- *************** *** 28,32 **** font-family: charcoal, monaco, geneva, "MS Sans Serif", helvetica, verdana, sans-serif; font-size: x-small ; background-color: #000000; } ! /* info code from SantaKlauss at http://www.madaboutstyle.com/tooltip2.html */ div#counter{margin-top: 100px} --- 28,32 ---- font-family: charcoal, monaco, geneva, "MS Sans Serif", helvetica, verdana, sans-serif; font-size: x-small ; background-color: #000000; } ! 
/* info code from SantaKlauss at http://www.madaboutstyle.com/tooltip2.html */ div#counter{margin-top: 100px} *************** *** 58,61 **** --- 58,63 ---- p.copyright { font-size: x-small ; } p.toc { font-size: small ; font-weight: bold ; margin-left: 3em ;} + table.toc { margin: 0 0 0 3em; padding: 0; border: 0; vertical-align: text-top; } + td.toc { font-size: small; font-weight: bold; vertical-align: text-top; } span.emph { font-style: italic; } *************** *** 95,108 **** td.author { font-weight: bold; margin-left: 4em; font-size: x-small ; } td.author-text { font-size: x-small; } ! table.data { vertical-align: top ; border-collapse: collapse ; border-style: solid solid solid solid ; border-color: black black black black ; font-size: small ; text-align: center ; } ! table.data th { font-weight: bold ; ! border-style: solid solid solid solid ; border-color: black black black black ; } ! table.data td { border-style: solid solid solid solid ; border-color: #333333 #333333 #333333 #333333 ; } hr { height: 1px } --- 97,119 ---- td.author { font-weight: bold; margin-left: 4em; font-size: x-small ; } td.author-text { font-size: x-small; } ! table.full { vertical-align: top ; border-collapse: collapse ; border-style: solid solid solid solid ; border-color: black black black black ; font-size: small ; text-align: center ; } ! table.headers, table.none { vertical-align: top ; border-collapse: collapse ; ! border-style: none; ! font-size: small ; text-align: center ; } ! table.full th { font-weight: bold ; ! border-style: solid ; border-color: black black black black ; } ! table.headers th { font-weight: bold ; ! border-style: none none solid none; ! border-color: black black black black ; } ! table.none th { font-weight: bold ; ! border-style: none; } ! 
table.full td { border-style: solid solid solid solid ; border-color: #333333 #333333 #333333 #333333 ; } + table.headers td, table.none td { border-style: none; } hr { height: 1px } *************** *** 178,202 **** <a href="#record_types">4.</a> Record Types<br /> ! <a href="#anchor4">4.1</a> 'warcinfo'<br /> ! <a href="#anchor5">4.2</a> 'response'<br /> ! <a href="#anchor6">4.3</a> 'resource'<br /> ! <a href="#anchor7">4.4</a> 'request'<br /> ! <a href="#anchor8">4.5</a> 'metadata'<br /> ! <a href="#anchor9">4.6</a> 'revisit'<br /> ! <a href="#anchor10">4.7</a> 'conversion'<br /> ! <a href="#anchor11">4.8</a> 'continuation'<br /> <a href="#anchor12">5.</a> Record Header<br /> ! <a href="#anchor13">5.1</a> Positional Parameters<br /> ! <a href="#anchor14">5.2</a> Named Parameters<br /> <a href="#anchor15">6.</a> --- 189,213 ---- <a href="#record_types">4.</a> Record Types<br /> ! <a href="#anchor4">4.1.</a> 'warcinfo'<br /> ! <a href="#anchor5">4.2.</a> 'response'<br /> ! <a href="#anchor6">4.3.</a> 'resource'<br /> ! <a href="#anchor7">4.4.</a> 'request'<br /> ! <a href="#anchor8">4.5.</a> 'metadata'<br /> ! <a href="#anchor9">4.6.</a> 'revisit'<br /> ! <a href="#anchor10">4.7.</a> 'conversion'<br /> ! <a href="#anchor11">4.8.</a> 'continuation'<br /> <a href="#anchor12">5.</a> Record Header<br /> ! <a href="#anchor13">5.1.</a> Positional Parameters<br /> ! <a href="#anchor14">5.2.</a> Named Parameters<br /> <a href="#anchor15">6.</a> *************** *** 204,226 **** <a href="#anchor16">7.</a> Truncated and Segmented Records<br /> ! <a href="#anchor17">7.1</a> Record Truncation<br /> ! <a href="#anchor18">7.2</a> Record Segmentation<br /> <a href="#anchor19">8.</a> WARC Application to Specific Protocols<br /> ! <a href="#anchor20">8.1</a> HTTP and HTTPS<br /> ! <a href="#anchor21">8.2</a> DNS<br /> ! <a href="#anchor22">8.3</a> Other Resources with URIs, and Other Protocols<br /> <a href="#anchor23">9.</a> Compression Recommendations<br /> ! 
<a href="#anchor24">9.1</a> Record-at-a-time Compression<br /> ! <a href="#anchor25">9.2</a> GZIP extra field: skip-lengths ('sl')<br /> ! <a href="#anchor26">9.3</a> GZIP WARC File Extension<br /> <a href="#anchor27">10.</a> --- 215,237 ---- <a href="#anchor16">7.</a> Truncated and Segmented Records<br /> ! <a href="#anchor17">7.1.</a> Record Truncation<br /> ! <a href="#anchor18">7.2.</a> Record Segmentation<br /> <a href="#anchor19">8.</a> WARC Application to Specific Protocols<br /> ! <a href="#anchor20">8.1.</a> HTTP and HTTPS<br /> ! <a href="#anchor21">8.2.</a> DNS<br /> ! <a href="#anchor22">8.3.</a> Other Resources with URIs, and Other Protocols<br /> <a href="#anchor23">9.</a> Compression Recommendations<br /> ! <a href="#anchor24">9.1.</a> Record-at-a-time Compression<br /> ! <a href="#anchor25">9.2.</a> GZIP extra field: skip-lengths ('sl')<br /> ! <a href="#anchor26">9.3.</a> GZIP WARC File Extension<br /> <a href="#anchor27">10.</a> *************** *** 232,254 **** <a href="#anchor30">13.</a> Acknowledgements<br /> ! <a href="#anchor31">A.</a> Consideratons in Choice of record-id<br /> ! <a href="#anchor32">B.</a> Examples of WARC Records<br /> ! <a href="#anchor33">B.1</a> Example of 'warcinfo' Record<br /> ! <a href="#anchor34">B.2</a> Example of 'request' Record<br /> ! <a href="#anchor35">B.3</a> Example of 'response' Record<br /> ! <a href="#anchor36">B.4</a> Example of 'resource' Record<br /> ! <a href="#anchor37">B.5</a> Example of 'metadata' Record<br /> ! <a href="#anchor38">B.6</a> Example of 'revisit' Record<br /> ! <a href="#anchor39">B.7</a> Example of 'conversion' Record<br /> ! <a href="#anchor40">B.8</a> Example of 'continuation' Record<br /> <a href="#rfc.references1">14.</a> --- 243,265 ---- <a href="#anchor30">13.</a> Acknowledgements<br /> ! <a href="#anchor31">Appendix A.</a> Consideratons in Choice of record-id<br /> ! <a href="#anchor32">Appendix B.</a> Examples of WARC Records<br /> ! 
<a href="#anchor33">Appendix B.1.</a> Example of 'warcinfo' Record<br /> ! <a href="#anchor34">Appendix B.2.</a> Example of 'request' Record<br /> ! <a href="#anchor35">Appendix B.3.</a> Example of 'response' Record<br /> ! <a href="#anchor36">Appendix B.4.</a> Example of 'resource' Record<br /> ! <a href="#anchor37">Appendix B.5.</a> Example of 'metadata' Record<br /> ! <a href="#anchor38">Appendix B.6.</a> Example of 'revisit' Record<br /> ! <a href="#anchor39">Appendix B.7.</a> Example of 'conversion' Record<br /> ! <a href="#anchor40">Appendix B.8.</a> Example of 'continuation' Record<br /> <a href="#rfc.references1">14.</a> *************** *** 269,273 **** simple text headers and an arbitary data block into one long file. The WARC format is a revision of the <a class="info" href="#ARC">ARC File ! Format<span> (</span><span class="info">Burner, M. and B. Kahle, “The ARC File Format,” September 1996.</span><span>)</span></a>[ARC] format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. --- 280,284 ---- simple text headers and an arbitary data block into one long file. The WARC format is a revision of the <a class="info" href="#ARC">ARC File ! Format<span> (</span><span class="info">Burner, M. and B. Kahle, “The ARC File Format,” September 1996.</span><span>)</span></a> [ARC] format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. *************** *** 276,294 **** Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briey describes the harvested content and its length. This is directly followed by the the retrieval protocol response messages and content. The motivation to revise the format arose from the discussion and experiences of the <a class="info" href="#IIPC">International ! 
Internet Preservation Consortium (IIPC)<span> (</span><span class="info">, “International Internet Preservation Consortium (IIPC),” .</span><span>)</span></a>[IIPC], whose members include the IA and the national libraries of a dozen countries. The revised ! format is expected to become the primary output format of the ! open-source <a class="info" href="#HERITRIX">Heritrix<span> (</span><span class="info">, “Heritrix Open Source Archival Web Crawler,” .</span><span>)</span></a>[HERITRIX] web crawler, and ! the input format for a wide array of cataloguing and access tools. </p> <p>The WARC format generalizes the older format to better support the ! harvesting, display, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned ! metadata, abbrieviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that --- 287,307 ---- Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briefly describes the harvested content and its length. This is directly followed by the the retrieval protocol response messages and content. The motivation to revise the format arose from the discussion and experiences of the <a class="info" href="#IIPC">International ! Internet Preservation Consortium (IIPC)<span> (</span><span class="info">, “International Internet Preservation Consortium (IIPC),” .</span><span>)</span></a> [IIPC], whose members include the IA and the national libraries of a dozen countries. The revised ! format is expected to be a standard way to structure, manage and ! store billions of collected web resources. For example, WARC will be ! an output format of harvesting software, such as the open-source ! 
<a class="info" href="#HERITRIX">Heritrix<span> (</span><span class="info">, “Heritrix Open Source Archival Web Crawler,” .</span><span>)</span></a> [HERITRIX] web crawler, and an input ! format for a wide array of cataloguing and access tools. </p> <p>The WARC format generalizes the older format to better support the ! harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned ! metadata, abbreviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that *************** *** 353,357 **** block = *OCTET </pre> - <p>Elements of this grammar are further specified and explained in sections that follow (and in the case of <span class="emph">anvl-fields</span>, also a separate document). --- 366,369 ---- *************** *** 367,371 **** tsp = 1*WSP </pre> - <p>The amount of whitespace between <span class="emph">header-line</span> tokens is variable. This gives archive builders the flexibility to add padding and later adjust --- 379,382 ---- *************** *** 375,379 **** </p> <p>After the <span class="emph">header-line</span> come any number of ! named fields in a line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a>[ANVL] that is very similar to that of email headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.</span><span>)</span></a>. Its format can be roughly summarized as the following: --- 386,390 ---- </p> <p>After the <span class="emph">header-line</span> come any number of ! 
named fields in a line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a> [ANVL] that is very similar to that of email headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.</span><span>)</span></a>. Its format can be roughly summarized as the following: *************** *** 384,388 **** other-anvl = <see ANVL> </pre> - <p>This document defines a number of named fields which may appear in the <span class="emph">anvl-fields</span> area of the header. Note that --- 395,398 ---- *************** *** 424,428 **** appropriate and how they can be standardized is warranted.] </p> ! <a name="rfc.section.4.1"></a><h4><a name="anchor4">4.1</a> 'warcinfo'</h4> <p>A 'warcinfo' record describes the records that follow it, up through end of --- 434,440 ---- appropriate and how they can be standardized is warranted.] </p> ! <a name="anchor4"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.1"></a><h3>4.1. 'warcinfo'</h3> <p>A 'warcinfo' record describes the records that follow it, up through end of *************** *** 451,455 **** content block must be formally defined somewhere.] </p> ! <a name="rfc.section.4.2"></a><h4><a name="anchor5">4.2</a> 'response'</h4> <p>A 'response' record contains an entire protocol response, such as a full --- 463,469 ---- content block must be formally defined somewhere.] </p> ! <a name="anchor5"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.2"></a><h3>4.2. 
'response'</h3> <p>A 'response' record contains an entire protocol response, such as a full *************** *** 461,465 **** 'IP-Address' and 'Related-Record-ID'. </p> ! <a name="rfc.section.4.3"></a><h4><a name="anchor6">4.3</a> 'resource'</h4> <p>A 'resource' record contains a resource, without full protocol response --- 475,481 ---- 'IP-Address' and 'Related-Record-ID'. </p> ! <a name="anchor6"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.3"></a><h3>4.3. 'resource'</h3> <p>A 'resource' record contains a resource, without full protocol response *************** *** 469,473 **** includes the named parameter 'Related-Record-ID'. </p> ! <a name="rfc.section.4.4"></a><h4><a name="anchor7">4.4</a> 'request'</h4> <p>A 'request' record holds the manner in which a primary record's content was --- 485,491 ---- includes the named parameter 'Related-Record-ID'. </p> ! <a name="anchor7"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.4"></a><h3>4.4. 'request'</h3> <p>A 'request' record holds the manner in which a primary record's content was *************** *** 476,480 **** 'Related-Record-ID'. </p> ! <a name="rfc.section.4.5"></a><h4><a name="anchor8">4.5</a> 'metadata'</h4> <p>A 'metadata' record contains content created in order to further describe, --- 494,500 ---- 'Related-Record-ID'. </p> ! <a name="anchor8"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.5"></a><h3>4.5. 
'metadata'</h3> <p>A 'metadata' record contains content created in order to further describe, *************** *** 494,501 **** formally specified somewhere.] </p> ! <a name="rfc.section.4.6"></a><h4><a name="anchor9">4.6</a> 'revisit'</h4> <p>A 'revisit' record describes the revisitation of content already archived, ! and includes only an abbrieviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' record to --- 514,523 ---- formally specified somewhere.] </p> ! <a name="anchor9"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.6"></a><h3>4.6. 'revisit'</h3> <p>A 'revisit' record describes the revisitation of content already archived, ! and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' record to *************** *** 527,531 **** somewhere.] </p> ! <a name="rfc.section.4.7"></a><h4><a name="anchor10">4.7</a> 'conversion'</h4> <p>A 'conversion' record contains an alternative version of another record's --- 549,555 ---- somewhere.] </p> ! <a name="anchor10"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.7"></a><h3>4.7. 'conversion'</h3> <p>A 'conversion' record contains an alternative version of another record's *************** *** 549,553 **** specified somewhere.] </p> ! <a name="rfc.section.4.8"></a><h4><a name="anchor11">4.8</a> 'continuation'</h4> <p>A 'continuation' record needs to be logically appended to a prior record --- 573,579 ---- specified somewhere.] </p> ! <a name="anchor11"></a><br /><hr /> ! 
<table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.8"></a><h3>4.8. 'continuation'</h3> <p>A 'continuation' record needs to be logically appended to a prior record *************** *** 599,608 **** record-id = uri </pre> - <p>The warc-id string may change in future versions, but will always begin "warc/", and will always be 8 octets long. </p> <p>Named parameters after the header-line, if any, follow the ! line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a>[ANVL]. Normally, named parameters are optional and their order is insignificant, however, specific record types require that certain named parameters --- 625,633 ---- record-id = uri </pre> <p>The warc-id string may change in future versions, but will always begin "warc/", and will always be 8 octets long. </p> <p>Named parameters after the header-line, if any, follow the ! line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a> [ANVL]. Normally, named parameters are optional and their order is insignificant, however, specific record types require that certain named parameters *************** *** 612,616 **** consecutive newlines). </p> ! <a name="rfc.section.5.1"></a><h4><a name="anchor13">5.1</a> Positional Parameters</h4> <p>This section describes each of the individual positional parameters --- 637,643 ---- consecutive newlines). </p> ! <a name="anchor13"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.5.1"></a><h3>5.1. 
Positional Parameters</h3> <p>This section describes each of the individual positional parameters *************** *** 638,642 **** this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the ! end of the file. <br /> --- 665,670 ---- this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the ! end of the file. (WARC reading implementations may choose to tolerate ! more or fewer newlines at the end of a record.) <br /> *************** *** 644,649 **** ! Defensive programming suggests the practice of tolerating fewer or ! more than two newlines at record's end. If the first next token does not match the first token of a WARC record, then the previous data-length should be considered in error; corrective action might --- 672,676 ---- ! If the first next token does not match the first token of a WARC record, then the previous data-length should be considered in error; corrective action might *************** *** 725,729 **** </dd> </dl></blockquote> ! <a name="rfc.section.5.2"></a><h4><a name="anchor14">5.2</a> Named Parameters</h4> <p>Named parameters, also referred to as named fields, are optional --- 752,758 ---- </dd> </dl></blockquote> ! <a name="anchor14"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.5.2"></a><h3>5.2. Named Parameters</h3> <p>Named parameters, also referred to as named fields, are optional *************** *** 757,761 **** </pre> - [REVIEW ISSUE: Should we recommend an algorithm? SHA1's continued viability as a secure hash is in doubt given recent crypto research --- 786,789 ---- *************** *** 863,867 **** header-line.] </p> ! 
<a name="rfc.section.7.1"></a><h4><a name="anchor17">7.1</a> Record Truncation</h4> <p>Any record may indicate that truncation has occurred and give the --- 891,897 ---- header-line.] </p> ! <a name="anchor17"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.7.1"></a><h3>7.1. Record Truncation</h3> <p>Any record may indicate that truncation has occurred and give the *************** *** 871,875 **** exceeding a length limit. </p> ! <a name="rfc.section.7.2"></a><h4><a name="anchor18">7.2</a> Record Segmentation</h4> <p>A record that will not fit into a single WARC file of desired --- 901,907 ---- exceeding a length limit. </p> ! <a name="anchor18"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.7.2"></a><h3>7.2. Record Segmentation</h3> <p>A record that will not fit into a single WARC file of desired *************** *** 906,910 **** <a name="rfc.section.8"></a><h3>8. WARC Application to Specific Protocols</h3> ! <a name="rfc.section.8.1"></a><h4><a name="anchor20">8.1</a> HTTP and HTTPS</h4> <p>A full HTTP or HTTPS response, with protocol information and --- 938,944 ---- <a name="rfc.section.8"></a><h3>8. WARC Application to Specific Protocols</h3> ! <a name="anchor20"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.8.1"></a><h3>8.1. HTTP and HTTPS</h3> <p>A full HTTP or HTTPS response, with protocol information and *************** *** 956,960 **** "message/http" type. </p> ! 
<a name="rfc.section.8.2"></a><h4><a name="anchor21">8.2</a> DNS</h4> <p>A request for DNS information can be summarized in a URI in --- 990,996 ---- "message/http" type. </p> ! <a name="anchor21"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.8.2"></a><h3>8.2. DNS</h3> <p>A request for DNS information can be summarized in a URI in *************** *** 966,970 **** type. </p> ! <a name="rfc.section.8.3"></a><h4><a name="anchor22">8.3</a> Other Resources with URIs, and Other Protocols</h4> <p>Any resource that can be identified with a URI, even if it is not --- 1002,1008 ---- type. </p> ! <a name="anchor22"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.8.3"></a><h3>8.3. Other Resources with URIs, and Other Protocols</h3> <p>Any resource that can be identified with a URI, even if it is not *************** *** 1009,1013 **** compressing WARC files with GZIP. </p> ! <a name="rfc.section.9.1"></a><h4><a name="anchor24">9.1</a> Record-at-a-time Compression</h4> <p>Per section 2.2 of the GZIP specification, a valid GZIP file --- 1047,1053 ---- compressing WARC files with GZIP. </p> ! <a name="anchor24"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.1"></a><h3>9.1. Record-at-a-time Compression</h3> <p>Per section 2.2 of the GZIP specification, a valid GZIP file *************** *** 1029,1033 **** record. </p> ! <a name="rfc.section.9.2"></a><h4><a name="anchor25">9.2</a> GZIP extra field: skip-lengths ('sl')</h4> <p>Customarily, GZIP members do not declare their compressed --- 1069,1075 ---- record. </p> ! 
<a name="anchor25"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.2"></a><h3>9.2. GZIP extra field: skip-lengths ('sl')</h3> <p>Customarily, GZIP members do not declare their compressed *************** *** 1069,1073 **** appropriate. </p> ! <a name="rfc.section.9.3"></a><h4><a name="anchor26">9.3</a> GZIP WARC File Extension</h4> <p>WARC files compressed with the above conventions remain legal GZIP --- 1111,1117 ---- appropriate. </p> ! <a name="anchor26"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.3"></a><h3>9.3. GZIP WARC File Extension</h3> <p>WARC files compressed with the above conventions remain legal GZIP *************** *** 1195,1199 **** there are providers to service them. This specification does not dictate what identifier scheme to use; suitable schemes include ! <a class="info" href="#RFC2141">URN<span> (</span><span class="info">Moats, R., “URN Syntax,” May 1997.</span><span>)</span></a>[RFC2141], <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rogers, “The ARK Persistent Identifier Scheme,” February 2005.</span><span>)</span></a>, <a class="info" href="#GUID">[GUID]<span> (</span><span class="info">, “Wikipedia: Globally Unique Identifiers,” .</span><span>)</span></a>, etc. </p> --- 1239,1243 ---- there are providers to service them. This specification does not dictate what identifier scheme to use; suitable schemes include ! <a class="info" href="#RFC2141">URN<span> (</span><span class="info">Moats, R., “URN Syntax,” May 1997.</span><span>)</span></a> [RFC2141], <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. 
Rodgers, “The ARK Persistent Identifier Scheme,” August 2005.</span><span>)</span></a>, <a class="info" href="#GUID">[GUID]<span> (</span><span class="info">, “Wikipedia: Globally Unique Identifiers,” .</span><span>)</span></a>, etc. </p> *************** *** 1208,1212 **** </p> <p>These conventions are suggested by <a class="info" href="#RFC2396">[RFC2396]<span> (</span><span class="info">Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” August 1998.</span><span>)</span></a>, ! formalized by the <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rogers, “The ARK Persistent Identifier Scheme,” February 2005.</span><span>)</span></a> scheme, and are applicable to such things as the summarizing of large search results from Internet-wide indexing engines. As an example of a convention that --- 1252,1256 ---- </p> <p>These conventions are suggested by <a class="info" href="#RFC2396">[RFC2396]<span> (</span><span class="info">Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” August 1998.</span><span>)</span></a>, ! formalized by the <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rodgers, “The ARK Persistent Identifier Scheme,” August 2005.</span><span>)</span></a> scheme, and are applicable to such things as the summarizing of large search results from Internet-wide indexing engines. As an example of a convention that *************** *** 1218,1222 **** http://abc.org/12026/987654321 </pre> - <p>The convention could also reserve the extension strings "_s", "_d", and "_t" to indicate record- ids for secondary, duplicate, and --- 1262,1265 ---- *************** *** 1230,1234 **** http://abc.org/12026/987654321/_t </pre> - <p>...in which an integer count may further extend the identifier when more there is more than one relationship of the given type. 
--- 1273,1276 ---- *************** *** 1246,1255 **** and checksums shown are plausible random filler. </p> ! <a name="rfc.section.B.1"></a><h4><a name="anchor33">Appendix B.1</a> Example of 'warcinfo' Record</h4> <p>The following 'warcinfo' example includes an XML description of the enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an ! abbrieviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.8" has not been formally defined anywhere, and may not reflect eventual practice with --- 1288,1299 ---- and checksums shown are plausible random filler. </p> ! <a name="anchor33"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.1"></a><h3>Appendix B.1. Example of 'warcinfo' Record</h3> <p>The following 'warcinfo' example includes an XML description of the enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an ! abbreviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.8" has not been formally defined anywhere, and may not reflect eventual practice with *************** *** 1283,1287 **** </pre> - <p>The first line (spread over three lines for readability) shows the required line of positional parameters. This record has no named --- 1327,1330 ---- *************** *** 1290,1294 **** header-line. Two newlines follow the content block. </p> ! <a name="rfc.section.B.2"></a><h4><a name="anchor34">Appendix B.2</a> Example of 'request' Record</h4> <p>A 'request' record captures the protocol request used to collect a --- 1333,1339 ---- header-line. Two newlines follow the content block. </p> ! <a name="anchor34"></a><br /><hr /> ! 
<table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.2"></a><h3>Appendix B.2. Example of 'request' Record</h3> <p>A 'request' record captures the protocol request used to collect a *************** *** 1307,1312 **** </pre> ! ! <a name="rfc.section.B.3"></a><h4><a name="anchor35">Appendix B.3</a> Example of 'response' Record</h4> <p>The archived response to the above request might look like the --- 1352,1358 ---- </pre> ! <a name="anchor35"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.3"></a><h3>Appendix B.3. Example of 'response' Record</h3> <p>The archived response to the above request might look like the *************** *** 1333,1342 **** [6958 bytes of binary data here] </pre> - <p>Note the 'Related-Record-ID' named field referring back to the generating 'request' record, and the creation-date identical to the previous record. </p> ! <a name="rfc.section.B.4"></a><h4><a name="anchor36">Appendix B.4</a> Example of 'resource' Record</h4> <p>This same file, "logo.jpg", might be archived internally to an --- 1379,1389 ---- [6958 bytes of binary data here] </pre> <p>Note the 'Related-Record-ID' named field referring back to the generating 'request' record, and the creation-date identical to the previous record. </p> ! <a name="anchor36"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.4"></a><h3>Appendix B.4. Example of 'resource' Record</h3> <p>This same file, "logo.jpg", might be archived internally to an *************** *** 1351,1356 **** [6958 bytes of binary data here] </pre> ! ! 
<a name="rfc.section.B.5"></a><h4><a name="anchor37">Appendix B.5</a> Example of 'metadata' Record</h4> <p>If some crawl-time metadata should be archived near the above --- 1398,1404 ---- [6958 bytes of binary data here] </pre> ! <a name="anchor37"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.5"></a><h3>Appendix B.5. Example of 'metadata' Record</h3> <p>If some crawl-time metadata should be archived near the above *************** *** 1370,1379 **** </harvestmetadata> </pre> - <p>Note again the same creation-date as the preceding related records. A relationship is declared o the preceding 'response' record, but declaring a relationship to the 'request' would also be legal. </p> ! <a name="rfc.section.B.6"></a><h4><a name="anchor38">Appendix B.6</a> Example of 'revisit' Record</h4> <p>If the same URI is later revisited and the content is unchanged, a --- 1418,1428 ---- </harvestmetadata> </pre> <p>Note again the same creation-date as the preceding related records. A relationship is declared o the preceding 'response' record, but declaring a relationship to the 'request' would also be legal. </p> ! <a name="anchor38"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.6"></a><h3>Appendix B.6. Example of 'revisit' Record</h3> <p>If the same URI is later revisited and the content is unchanged, a *************** *** 1396,1400 **** </revisit> </pre> - <p>Again, reference is made back to the original 'response' record. A new creation-date reflects he time of revisit. This content block --- 1445,1448 ---- *************** *** 1405,1409 **** defined. </p> ! 
<a name="rfc.section.B.7"></a><h4><a name="anchor39">Appendix B.7</a> Example of 'conversion' Record</h4> <p>At some future date, the "image/jpeg" format may no longer be --- 1453,1459 ---- defined. </p> ! <a name="anchor39"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.7"></a><h3>Appendix B.7. Example of 'conversion' Record</h3> <p>At some future date, the "image/jpeg" format may no longer be *************** *** 1421,1425 **** [3098 bytes of binary data here] </pre> - <p>An accompanying 'metadata' record, referring to this 'conversion' record, could contain additional details about the --- 1471,1474 ---- *************** *** 1427,1431 **** serve this role.) </p> ! <a name="rfc.section.B.8"></a><h4><a name="anchor40">Appendix B.8</a> Example of 'continuation' Record</h4> <p>If the 'response' above had been so large that it would not fit --- 1476,1482 ---- serve this role.) </p> ! <a name="anchor40"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.8"></a><h3>Appendix B.8. Example of 'continuation' Record</h3> <p>If the 'response' above had been so large that it would not fit *************** *** 1447,1451 **** [39514114 bytes of binary data here] </pre> - <p>Note that the 'Segment-Origin-ID' refers to the first segment of the set, the one with the "Segment-Number: 1" named field. --- 1498,1501 ---- *************** *** 1460,1464 **** <td class="author-text">Burner, M. and B. Kahle, “<a href="http://www.archive.org/web/researcher/ArcFileFormat.php">The ARC File Format</a>,” September 1996.</td></tr> <tr><td class="author-text" valign="top"><a name="ARK">[ARK]</a></td> ! <td class="author-text">Kunze, J. and R. 
Rogers, “<a href="http://www.cdlib.org/inside/diglib/ark/arkspec.pdf">The ARK Persistent Identifier Scheme</a>,” February 2005.</td></tr> <tr><td class="author-text" valign="top"><a name="GUID">[GUID]</a></td> <td class="author-text">“<a href="http://en.wikipedia.org/wiki/GUID">Wikipedia: Globally Unique Identifiers</a>.”</td></tr> --- 1510,1514 ---- <td class="author-text">Burner, M. and B. Kahle, “<a href="http://www.archive.org/web/researcher/ArcFileFormat.php">The ARC File Format</a>,” September 1996.</td></tr> <tr><td class="author-text" valign="top"><a name="ARK">[ARK]</a></td> ! <td class="author-text">Kunze, J. and R. Rodgers, “<a href="http://www.cdlib.org/inside/diglib/ark/arkspec.pdf">The ARK Persistent Identifier Scheme</a>,” August 2005.</td></tr> <tr><td class="author-text" valign="top"><a name="GUID">[GUID]</a></td> <td class="author-text">“<a href="http://en.wikipedia.org/wiki/GUID">Wikipedia: Globally Unique Identifiers</a>.”</td></tr> Index: warc_file_format.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** warc_file_format.xml 22 Aug 2005 17:28:24 -0000 1.6 --- warc_file_format.xml 23 Aug 2005 17:35:41 -0000 1.7 *************** *** 121,125 **** Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briey describes the harvested content and its length. This is directly followed by the the retrieval protocol response messages and content. The motivation to revise the format arose from the --- 121,125 ---- Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briefly describes the harvested content and its length. 
This is directly followed by the the retrieval protocol response messages and content. The motivation to revise the format arose from the *************** *** 137,141 **** organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned ! metadata, abbrieviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that --- 137,141 ---- organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned ! metadata, abbreviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that *************** *** 367,371 **** <t>A 'revisit' record describes the revisitation of content already archived, ! and includes only an abbrieviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' record to --- 367,371 ---- <t>A 'revisit' record describes the revisitation of content already archived, ! and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' record to *************** *** 1129,1133 **** enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an ! abbrieviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.8" has not been formally defined anywhere, and may not reflect eventual practice with --- 1129,1133 ---- enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an ! 
abbreviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.8" has not been formally defined anywhere, and may not reflect eventual practice with Index: warc_file_format.txt =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** warc_file_format.txt 18 Aug 2005 01:57:10 -0000 1.3 --- warc_file_format.txt 23 Aug 2005 17:35:41 -0000 1.4 *************** *** 120,163 **** 3. The WARC Record Model . . . . . . . . . . . . . . . . . . . . 6 4. Record Types . . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.1 'warcinfo' . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.2 'response' . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.3 'resource' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.4 'request' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.5 'metadata' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.6 'revisit' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.7 'conversion' . . . . . . . . . . . . . . . . . . . . . . . 10 ! 4.8 'continuation' . . . . . . . . . . . . . . . . . . . . . . 10 5. Record Header . . . . . . . . . . . . . . . . . . . . . . . . 12 ! 5.1 Positional Parameters . . . . . . . . . . . . . . . . . . 13 ! 5.2 Named Parameters . . . . . . . . . . . . . . . . . . . . . 14 6. Record Content Block . . . . . . . . . . . . . . . . . . . . . 17 7. Truncated and Segmented Records . . . . . . . . . . . . . . . 18 ! 7.1 Record Truncation . . . . . . . . . . . . . . . . . . . . 18 ! 7.2 Record Segmentation . . . . . . . . . . . . . . . . . . . 18 8. WARC Application to Specific Protocols . . . . . . . . . . . . 20 ! 8.1 HTTP and HTTPS . . . . . . . . . . . . . . . . . . . . . . 20 ! 8.2 DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 ! 8.3 Other Resources with URIs, and Other Protocols . . 
. . . . 21 9. Compression Recommendations . . . . . . . . . . . . . . . . . 22 ! 9.1 Record-at-a-time Compression . . . . . . . . . . . . . . . 22 ! 9.2 GZIP extra field: skip-lengths ('sl') . . . . . . . . . . 22 ! 9.3 GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23 ! 10. WARC File Name and Size Recommendations . . . . . . . . . . 24 ! 11. Registration of MIME Media Type application/warc . . . . . . 25 ! 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . 26 ! 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 27 ! A. Consideratons in Choice of record-id . . . . . . . . . . . . . 28 ! B. Examples of WARC Records . . . . . . . . . . . . . . . . . . . 29 ! B.1 Example of 'warcinfo' Record . . . . . . . . . . . . . . . 29 ! B.2 Example of 'request' Record . . . . . . . . . . . . . . . 30 ! B.3 Example of 'response' Record . . . . . . . . . . . . . . . 30 ! B.4 Example of 'resource' Record . . . . . . . . . . . . . . . 31 ! B.5 Example of 'metadata' Record . . . . . . . . . . . . . . . 31 ! B.6 Example of 'revisit' Record . . . . . . . . . . . . . . . 31 ! B.7 Example of 'conversion' Record . . . . . . . . . . . . . . 32 ! B.8 Example of 'continuation' Record . . . . . . . . . . . . . 32 ! 14. References . . . . . . . . . . . . . . . . . . . . . . . . . 33 ! Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 34 ! Intellectual Property and Copyright Statements . . . . . . . . 36 --- 120,163 ---- 3. The WARC Record Model . . . . . . . . . . . . . . . . . . . . 6 4. Record Types . . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.1. 'warcinfo' . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.2. 'response' . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.3. 'resource' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.4. 'request' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.5. 'metadata' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.6. 'revisit' . . . . . . . . . . . . . . . . . . . . . 
. . . 9 ! 4.7. 'conversion' . . . . . . . . . . . . . . . . . . . . . . . 10 ! 4.8. 'continuation' . . . . . . . . . . . . . . . . . . . . . . 10 5. Record Header . . . . . . . . . . . . . . . . . . . . . . . . 12 ! 5.1. Positional Parameters . . . . . . . . . . . . . . . . . . 13 ! 5.2. Named Parameters . . . . . . . . . . . . . . . . . . . . . 14 6. Record Content Block . . . . . . . . . . . . . . . . . . . . . 17 7. Truncated and Segmented Records . . . . . . . . . . . . . . . 18 ! 7.1. Record Truncation . . . . . . . . . . . . . . . . . . . . 18 ! 7.2. Record Segmentation . . . . . . . . . . . . . . . . . . . 18 8. WARC Application to Specific Protocols . . . . . . . . . . . . 20 ! 8.1. HTTP and HTTPS . . . . . . . . . . . . . . . . . . . . . . 20 ! 8.2. DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 ! 8.3. Other Resources with URIs, and Other Protocols . . . . . . 21 9. Compression Recommendations . . . . . . . . . . . . . . . . . 22 ! 9.1. Record-at-a-time Compression . . . . . . . . . . . . . . . 22 ! 9.2. GZIP extra field: skip-lengths ('sl') . . . . . . . . . . 22 ! 9.3. GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23 ! 10. WARC File Name and Size Recommendations . . . . . . . . . . . 24 ! 11. Registration of MIME Media Type application/warc . . . . . . . 25 ! 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 26 ! 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 27 ! Appendix A. Consideratons in Choice of record-id . . . . . . . . 28 ! Appendix B. Examples of WARC Records . . . . . . . . . . . . . . 29 ! Appendix B.1. Example of 'warcinfo' Record . . . . . . . . . . . . 29 ! Appendix B.2. Example of 'request' Record . . . . . . . . . . . . 30 ! Appendix B.3. Example of 'response' Record . . . . . . . . . . . . 30 ! Appendix B.4. Example of 'resource' Record . . . . . . . . . . . . 31 ! Appendix B.5. Example of 'metadata' Record . . . . . . . . . . . . 31 ! Appendix B.6. 
Example of 'revisit' Record . . . . . . . . . . . . 31 ! Appendix B.7. Example of 'conversion' Record . . . . . . . . . . . 32 ! Appendix B.8. Example of 'continuation' Record . . . . . . . . . . 32 ! 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 33 ! Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 35 ! Intellectual Property and Copyright Statements . . . . . . . . . . 36 *************** *** 182,200 **** Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briey describes the harvested content and its length. This ! is directly followed by the the retrieval protocol response messages ! and content. The motivation to revise the format arose from the ! discussion and experiences of the International Internet Preservation ! Consortium (IIPC) [IIPC], whose members include the IA and the ! national libraries of a dozen countries. The revised format is ! expected to become the primary output format of the open-source ! Heritrix [HERITRIX] web crawler, and the input format for a wide ! array of cataloguing and access tools. The WARC format generalizes the older format to better support the ! harvesting, display, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, ! abbrieviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools --- 182,202 ---- Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briefly describes the harvested content and its length. ! This is directly followed by the the retrieval protocol response ! messages and content. The motivation to revise the format arose from ! 
the discussion and experiences of the International Internet ! Preservation Consortium (IIPC) [IIPC], whose members include the IA ! and the national libraries of a dozen countries. The revised format ! is expected to be a standard way to structure, manage and store ! billions of collected web resources. For example, WARC will be an ! output format of harvesting software, such as the open-source ! Heritrix [HERITRIX] web crawler, and an input format for a wide array ! of cataloguing and access tools. The WARC format generalizes the older format to better support the ! harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, ! abbreviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools *************** *** 219,224 **** - - Kunze, et al. Expires January 2, 2006 [Page 4] --- 221,224 ---- *************** *** 409,413 **** appropriate and how they can be standardized is warranted.] ! 4.1 'warcinfo' A 'warcinfo' record describes the records that follow it, up through --- 409,413 ---- appropriate and how they can be standardized is warranted.] ! 4.1. 'warcinfo' A 'warcinfo' record describes the records that follow it, up through *************** *** 436,440 **** content block must be formally defined somewhere.] ! 4.2 'response' A 'response' record contains an entire protocol response, such as a --- 436,440 ---- content block must be formally defined somewhere.] ! 4.2. 'response' A 'response' record contains an entire protocol response, such as a *************** *** 454,458 **** named parameters 'IP-Address' and 'Related-Record-ID'. ! 4.3 'resource' A 'resource' record contains a resource, without full protocol --- 454,458 ---- named parameters 'IP-Address' and 'Related-Record-ID'. ! 4.3. 
'resource' A 'resource' record contains a resource, without full protocol *************** *** 462,466 **** often includes the named parameter 'Related-Record-ID'. ! 4.4 'request' A 'request' record holds the manner in which a primary record's --- 462,466 ---- often includes the named parameter 'Related-Record-ID'. ! 4.4. 'request' A 'request' record holds the manner in which a primary record's *************** *** 469,473 **** parameter 'Related-Record-ID'. ! 4.5 'metadata' A 'metadata' record contains content created in order to further --- 469,473 ---- parameter 'Related-Record-ID'. ! 4.5. 'metadata' A 'metadata' record contains content created in order to further *************** *** 487,494 **** formally specified somewhere.] ! 4.6 'revisit' A 'revisit' record describes the revisitation of content already ! archived, and includes only an abbrieviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' --- 487,494 ---- formally specified somewhere.] ! 4.6. 'revisit' A 'revisit' record describes the revisitation of content already ! archived, and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' *************** *** 528,532 **** somewhere.] ! 4.7 'conversion' A 'conversion' record contains an alternative version of another --- 528,532 ---- somewhere.] ! 4.7. 'conversion' A 'conversion' record contains an alternative version of another *************** *** 550,554 **** specified somewhere.] ! 4.8 'continuation' A 'continuation' record needs to be logically appended to a prior --- 550,554 ---- specified somewhere.] ! 4.8. 'continuation' A 'continuation' record needs to be logically appended to a prior *************** *** 674,678 **** ! 
5.1 Positional Parameters This section describes each of the individual positional parameters --- 674,678 ---- ! 5.1. Positional Parameters This section describes each of the individual positional parameters *************** *** 695,707 **** After proceeding this many octets from that first character of the record header, there should be two newlines and either the ! beginning of a new record or the end of the file. ! Defensive programming suggests the practice of tolerating fewer or ! more than two newlines at record's end. If the first next token ! does not match the first token of a WARC record, then the previous ! data-length should be considered in error; corrective action might ! include searching for a nearby occurrence of "warc/0.8" and other ! character patterns indicative of a legal record beginning. record-type The kind of WARC record. All record types are optional, --- 695,708 ---- After proceeding this many octets from that first character of the record header, there should be two newlines and either the ! beginning of a new record or the end of the file. (WARC reading ! implementations may choose to tolerate more or fewer newlines at ! the end of a record.) ! If the first next token does not match the first token of a WARC ! record, then the previous data-length should be considered in ! error; corrective action might include searching for a nearby ! occurrence of "warc/0.8" and other character patterns indicative ! of a legal record beginning. record-type The kind of WARC... [truncated message content] |
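The data-length rule in the diff above (skip that many octets from the first character of the record header, expect newlines, then expect either a new record or end of file, and otherwise search nearby for "warc/0.8") can be sketched roughly as follows. This is only an illustrative sketch under assumptions, not code from the archive-access project: the function name `next_record_offset`, the in-memory `bytes` buffer, and the `window` search size are all hypothetical, and the header itself is not parsed.

```python
from typing import Optional

MAGIC = b"warc/0.8"  # first token of a record header, as named in the draft


def next_record_offset(buf: bytes, start: int, data_length: int,
                       window: int = 4096) -> Optional[int]:
    """Return the offset of the record following the one at `start`.

    Returns None at end of file, or when no plausible record start is found.
    `window` is an assumed limit on how far to search during recovery.
    """
    pos = start + data_length
    # Tolerate more or fewer than two newlines at the end of the record.
    while pos < len(buf) and buf[pos:pos + 1] in (b"\r", b"\n"):
        pos += 1
    if pos >= len(buf):
        return None  # end of file
    if buf.startswith(MAGIC, pos):
        return pos  # data-length was consistent
    # The declared data-length looks wrong: search nearby for the magic token.
    # This is a heuristic; the token could also occur inside record content.
    lo = max(start + 1, pos - window)
    hi = min(len(buf), pos + window)
    found = buf.find(MAGIC, lo, hi)
    return found if found != -1 else None
```

A stricter reader might also check the tokens that follow "warc/0.8" before accepting a recovered offset, since the draft mentions "other character patterns indicative of a legal record beginning".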
From: Michael S. <sta...@us...> - 2005-08-23 00:26:27
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7484/xdocs Modified Files: wacs-oswir.doc wacs-oswir.pdf Log Message: * xdocs/wacs-oswir.doc xdocs/wacs-oswir.pdf Final submissions. Index: wacs-oswir.pdf =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/wacs-oswir.pdf,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 Binary files /tmp/cvsR4BYtk and /tmp/cvsHIeiab differ Index: wacs-oswir.doc =================================================================== RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/wacs-oswir.doc,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 Binary files /tmp/cvsMHqZYm and /tmp/cvshLVeId differ |
From: stack <st...@ar...> - 2005-08-18 17:59:50
|
Lukáš Matějka wrote: > Hi, >does anybody have an idea? > > What is your complete indexarcs.sh command line? It looks like we're passing in a '*' character (i.e. ./nutch-data/segments/*/fetcher/data) and the glob character is not being expanded internally. Try something simple without '*' characters for your '-d' value. St.Ack > xmatejk2@war:~/nutchwax-0.2.1$ ./bin/indexarcs.sh -s /home... >Tue Aug 9 13:52:36 CEST 2005 Checking environment variables. > > >>Tue Aug 9 13:52:36 CEST 2005 Cleaning up all ./nutch-data/ content. >>Tue Aug 9 13:52:36 CEST 2005 Creating new queue, and segments. >>Tue Aug 9 13:52:36 CEST 2005 Started segmenting. >>ERROR: ./nutch-data//queue/ directory does not exist. >>/home/xmatejk2/nutchwax-0.2.1/bin/arcs2segs.sh DIR_OF_ARCS DIR_FOR_SEGMENTS [#ARCS] >>Tue Aug 9 13:52:36 CEST 2005 Started build of link database. >>050809 135236 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml >>050809 135236 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml >>050809 135236 No FS indicated, using default:local >>050809 135236 Created webdb at LocalFS,./nutch-data/db >>050809 135237 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml >>050809 135237 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml >>050809 135237 No FS indicated, using default:local >>050809 135237 Updating ./nutch-data/db >>050809 135237 Updating for ./nutch-data//segments/* >>Exception in thread "main" java.io.FileNotFoundException: ./nutch-data/segments/*/fetcher/data >>at org.apache.nutch.fs.LocalFileSystem.open(LocalFileSystem.java:93) >>at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:194) >> at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:187) >> at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:190) >> at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:179) >> at org.apache.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:50) >>at
org.apache.nutch.tools.UpdateDatabaseTool.updateForSegment(UpdateDatabaseTool.java:92) >>at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:366) >>050809 135238 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml >>050809 135238 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml >>050809 135238 Updating ./nutch-data//segments from ./nutch-data//db >>Exception in thread "main" java.lang.NullPointerException >>at org.apache.nutch.tools.UpdateSegmentsFromDb.run(UpdateSegmentsFromDb.java:181) >>at org.apache.nutch.tools.UpdateSegmentsFromDb.main(UpdateSegmentsFromDb.java:345) >>Tue Aug 9 13:52:38 CEST 2005 Started indexing. >>050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml >>050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml >>050809 135239 No FS indicated, using default:local >>050809 135239 indexing segment: ./nutch-data/segments/* >>050809 135239 * Opening segment * >>Exception in thread "main" java.lang.NullPointerException >>at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:165) >>at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:263) >>Tue Aug 9 13:52:39 CEST 2005 Started dedup. >>050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml >>050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml >>050809 135239 No FS indicated, using default:local >>050809 135240 Reading url hashes... >>050809 135240 Sorting url hashes... >>050809 135240 Deleting url duplicates... >>050809 135240 Deleted 0 url duplicates. >>050809 135240 Reading content hashes... >>050809 135240 Sorting content hashes... >>050809 135240 Deleting content duplicates... >>050809 135240 Deleted 0 content duplicates. >>050809 135240 Duplicate deletion complete locally. Now returning to NFS... >>050809 135240 DeleteDuplicates complete >>Tue Aug 9 13:52:40 CEST 2005 Merging indices. 
>>050809 135240 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml >>050809 135240 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml >>050809 135240 No FS indicated, using default:local >>050809 135240 merging segment indexes to: ./nutch-data/index >>050809 135240 done merging >> >>-lm |
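The failure discussed in the reply above is a literal '*' reaching code that opens the path as-is. One way to rule that out, sketched here under assumptions, is to expand the pattern into concrete segment directories before handing them to any tool. The helper name `check_segments` is hypothetical and not part of nutchwax; the `fetcher/data` layout is taken only from the path shown in the stack trace.

```python
import glob
import os
from typing import List, Tuple


def check_segments(pattern: str) -> Tuple[List[str], List[str]]:
    """Expand a shell-style pattern into segment directories.

    Returns (ready, missing): directories that contain fetcher/data and
    directories that do not. A pattern that matches nothing yields two
    empty lists instead of a path with a literal '*' in it.
    """
    ready: List[str] = []
    missing: List[str] = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isdir(path):
            continue  # only directories can be segments
        if os.path.exists(os.path.join(path, "fetcher", "data")):
            ready.append(path)
        else:
            missing.append(path)
    return ready, missing
```

An empty `ready` list for `./nutch-data/segments/*` would suggest that the earlier segmenting step produced nothing, which is consistent with the "queue/ directory does not exist" error at the top of the quoted log.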
From: <mat...@ce...> - 2005-08-10 07:27:00
|
Hi, does anybody have an idea? xmatejk2@war:~/nutchwax-0.2.1$ ./bin/indexarcs.sh -s /home... Tue Aug 9 13:52:36 CEST 2005 Checking environment variables. > Tue Aug 9 13:52:36 CEST 2005 Cleaning up all ./nutch-data/ content. > Tue Aug 9 13:52:36 CEST 2005 Creating new queue, and segments. > Tue Aug 9 13:52:36 CEST 2005 Started segmenting. > ERROR: ./nutch-data//queue/ directory does not exist. > /home/xmatejk2/nutchwax-0.2.1/bin/arcs2segs.sh DIR_OF_ARCS DIR_FOR_SEGMENTS [#ARCS] > Tue Aug 9 13:52:36 CEST 2005 Started build of link database. > 050809 135236 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml > 050809 135236 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml > 050809 135236 No FS indicated, using default:local > 050809 135236 Created webdb at LocalFS,./nutch-data/db > 050809 135237 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml > 050809 135237 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml > 050809 135237 No FS indicated, using default:local > 050809 135237 Updating ./nutch-data/db > 050809 135237 Updating for ./nutch-data//segments/* > Exception in thread "main" java.io.FileNotFoundException: ./nutch-data/segments/*/fetcher/data > at org.apache.nutch.fs.LocalFileSystem.open(LocalFileSystem.java:93) > at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:194) > at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:187) > at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:190) > at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:179) > at org.apache.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:50) > at org.apache.nutch.tools.UpdateDatabaseTool.updateForSegment(UpdateDatabaseTool.java:92) > at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:366) > 050809 135238 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml > 050809 135238 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml > 050809 135238 Updating 
./nutch-data//segments from ./nutch-data//db > Exception in thread "main" java.lang.NullPointerException > at org.apache.nutch.tools.UpdateSegmentsFromDb.run(UpdateSegmentsFromDb.java:181) > at org.apache.nutch.tools.UpdateSegmentsFromDb.main(UpdateSegmentsFromDb.java:345) > Tue Aug 9 13:52:38 CEST 2005 Started indexing. > 050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml > 050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml > 050809 135239 No FS indicated, using default:local > 050809 135239 indexing segment: ./nutch-data/segments/* > 050809 135239 * Opening segment * > Exception in thread "main" java.lang.NullPointerException > at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:165) > at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:263) > Tue Aug 9 13:52:39 CEST 2005 Started dedup. > 050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml > 050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml > 050809 135239 No FS indicated, using default:local > 050809 135240 Reading url hashes... > 050809 135240 Sorting url hashes... > 050809 135240 Deleting url duplicates... > 050809 135240 Deleted 0 url duplicates. > 050809 135240 Reading content hashes... > 050809 135240 Sorting content hashes... > 050809 135240 Deleting content duplicates... > 050809 135240 Deleted 0 content duplicates. > 050809 135240 Duplicate deletion complete locally. Now returning to NFS... > 050809 135240 DeleteDuplicates complete > Tue Aug 9 13:52:40 CEST 2005 Merging indices. > 050809 135240 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml > 050809 135240 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml > 050809 135240 No FS indicated, using default:local > 050809 135240 merging segment indexes to: ./nutch-data/index > 050809 135240 done merging > > -lm > > > |
From: <st...@du...> - 2005-07-29 00:40:13
|
We would like to announce the release of nutchwax -- the nutch search application + extensions for searching of web archive collections -- and WERA, a web collection viewer application from the NWA Toolset that has been adapted to nutchwax. The two tools used in concert provide full-text search of small web archive collections and a means of browsing an archive collection over time. Nutchwax is hosted on sourceforge at http://archive-access.sourceforge.net. St.Ack |