archive-access-cvs Mailing List for Web Archive Access Utilities (Page 171)

Brought to you by: binzino, bradtofel, gojomo, ia_igor, and 5 others

archive-access-cvs — CVS commits

[Archive-access-cvs] archive-access/projects/nutch/conf nutch-site.xml,1.24.2.2,1.24.2.3

From: Doug C. <cu...@us...> - 2005-09-01 18:45:38

Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24577/conf

Modified Files:
      Tag: mapred
	nutch-site.xml 
Log Message:
Add indexArcs command.

Index: nutch-site.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml,v
retrieving revision 1.24.2.2
retrieving revision 1.24.2.3
diff -C2 -d -r1.24.2.2 -r1.24.2.3
*** nutch-site.xml	24 Aug 2005 04:15:48 -0000	1.24.2.2
--- nutch-site.xml	1 Sep 2005 18:45:29 -0000	1.24.2.3
***************
*** 7,14 ****
  <!-- NDFS -->
  
! <property>
!   <name>fs.default.name</name>
!   <value>ia109102:8009</value>
! </property>
  
  <property>
--- 7,14 ----
  <!-- NDFS -->
  
! <!-- <property> -->
! <!--   <name>fs.default.name</name> -->
! <!--   <value>ia109102:8009</value> -->
! <!-- </property> -->
  
  <property>
***************
*** 29,56 ****
  <!-- MapReduce -->
  
! <property>
!   <name>mapred.job.tracker</name>
!   <value>ia109102:8010</value>
! </property>
  
! <property>
!   <name>mapred.job.tracker.info.port</name>
!   <value>7846</value>
! </property>
  
! <property>
!   <name>mapred.local.dir</name>
!   <value>/0/nutch/mapred/local</value>
! </property>
  
! <property>
!   <name>mapred.system.dir</name>
!   <value>/mapred/system</value>
! </property>
  
! <property>
!   <name>mapred.task.timeout</name>
!   <value>3600000</value>
! </property>
  
  <!-- Override a few Nutch defaults -->
--- 29,56 ----
  <!-- MapReduce -->
  
! <!-- <property> -->
! <!--   <name>mapred.job.tracker</name> -->
! <!--   <value>ia109102:8010</value> -->
! <!-- </property> -->
  
! <!-- <property> -->
! <!--   <name>mapred.job.tracker.info.port</name> -->
! <!--   <value>7846</value> -->
! <!-- </property> -->
  
! <!-- <property> -->
! <!--   <name>mapred.local.dir</name> -->
! <!--   <value>/0/nutch/mapred/local</value> -->
! <!-- </property> -->
  
! <!-- <property> -->
! <!--   <name>mapred.system.dir</name> -->
! <!--   <value>/mapred/system</value> -->
! <!-- </property> -->
  
! <!-- <property> -->
! <!--   <name>mapred.task.timeout</name> -->
! <!--   <value>3600000</value> -->
! <!-- </property> -->
  
  <!-- Override a few Nutch defaults -->

[Archive-access-cvs] archive-access/projects/nutch/src/java/org/archive/access/nutch ImportArcs.java,NONE,1.1.2.1 IndexArcs.java,NONE,1.1.2.1 Arc2Segment.java,1.28.2.8,NONE

From: Doug C. <cu...@us...> - 2005-09-01 18:45:38

Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24577/src/java/org/archive/access/nutch

Added Files:
      Tag: mapred
	ImportArcs.java IndexArcs.java 
Removed Files:
      Tag: mapred
	Arc2Segment.java 
Log Message:
Add indexArcs command.

--- NEW FILE: ImportArcs.java ---
/*
 * $Id: ImportArcs.java,v 1.1.2.1 2005/09/01 18:45:29 cutting Exp $
 * 
 * Copyright (C) 2003 Internet Archive.
 * 
 * This file is part of the archive-access tools project
 * (http://sourceforge.net/projects/archive-access).
 * 
 * The archive-access tools are free software; you can redistribute them and/or
 * modify them under the terms of the GNU Lesser Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or any
 * later version.
 * 
 * The archive-access tools are distributed in the hope that they will be
 * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
 * Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser Public License along with
 * the archive-access tools; if not, write to the Free Software Foundation,
 * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */

package org.archive.access.nutch;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.net.URI;

import org.apache.commons.httpclient.Header;

import org.apache.nutch.io.Writable;
import org.apache.nutch.io.WritableComparable;
import org.apache.nutch.io.UTF8;
import org.apache.nutch.io.MD5Hash;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConf;
import org.apache.nutch.util.NutchConfigured;
import org.apache.nutch.util.mime.MimeType;
import org.apache.nutch.util.mime.MimeTypes;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.mapred.JobClient;
import org.apache.nutch.mapred.Mapper;
import org.apache.nutch.mapred.OutputCollector;
import org.apache.nutch.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Fetcher;
import org.apache.nutch.crawl.FetcherOutput;
import org.apache.nutch.crawl.FetcherOutputFormat;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.parse.ParserFactory;
import org.apache.nutch.parse.ParseImpl;

import org.archive.io.arc.ARCReader;
import org.archive.io.arc.ARCReaderFactory;
import org.archive.io.arc.ARCRecord;
import org.archive.io.arc.ARCRecordMetaData;
import org.archive.util.ArchiveUtils;
import org.archive.util.TextUtils;

public class ImportArcs extends NutchConfigured implements Mapper {
  private static final Logger LOG =
    Logger.getLogger(ImportArcs.class.getName());

  private static final String  WHITESPACE = "\\s+";
  
  public static final String ARCFILENAME_KEY = "arcname";
  public static final String ARCFILEOFFSET_KEY = "arcoffset";
  public static final String ARCCOLLECTION_KEY = "collection";
  private static final String CONTENT_TYPE_KEY = "content-type";
  private static final String TEXT_TYPE = "text/";
  private static final String APPLICATION_TYPE = "application/";

  private boolean indexAll;
  private int contentLimit;
  private MimeTypes mimeTypes;
  private String collectionName;
  private String segmentName;

  public ImportArcs() { super(null); }

  public ImportArcs(NutchConf conf) { super(conf); }

  public void configure(JobConf job) {
    setConf(job);
    this.indexAll = job.getBoolean("archive.index.all", false);
    this.contentLimit = job.getInt("http.content.limit", 100000);
    this.mimeTypes = MimeTypes.get(job.get("mime.types.file"));
    this.collectionName = job.get("archive.collection", "web");
    this.segmentName = job.get(Fetcher.SEGMENT_NAME_KEY);

    if (job.getBoolean("arc2segment.verbose", false)) {
      LOG.setLevel(Level.FINE);
    }

    System.setProperty("java.protocol.handler.pkgs", "org.archive.net");
  }

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
    throws IOException {
    String arcLocation = ((UTF8)value).toString();
    LOG.info("opening "+arcLocation);
    
    ARCReader arc = null;
    String arcName = null;
    try {
      arc = ARCReaderFactory.get(arcLocation);
    } catch (Throwable e) {
      LOG.log(Level.WARNING, "Error opening: " + arcLocation, e);
      return;
    }

    // Don't run the digester. Digest is unused and it costs CPU.
    arc.setDigest(false);

    try {
      for (Iterator i = arc.iterator(); i.hasNext();) {
        ARCRecord rec = (ARCRecord) i.next();

        if (arcName == null) {                    // first entry has arc name
          String arcPath = new URI(rec.getMetaData().getUrl()).getPath();
          arcName = new File(arcPath).getName();
          if (arcName.endsWith(".arc")) {
            arcName = arcName.substring(0, arcName.indexOf(".arc"));
          }
          reporter.setStatus(arcName);
        }

        if (rec.getStatusCode() != 200)
          continue;
        try {
          processRecord(arcName, rec, output);
        } catch (Throwable e) {
          LOG.log(Level.WARNING, "Error processing: " + arcLocation, e);
        }
      }
    } catch (Throwable e) {                     // problem parsing arc file
      LOG.log(Level.WARNING, "Error parsing: " + arcLocation, e);
    }
  }

  private void processRecord(final String arcName, final ARCRecord rec,
                             OutputCollector output)
    throws IOException {

    ARCRecordMetaData arcData = rec.getMetaData();
    String url = arcData.getUrl();

    String mimetype = arcData.getMimetype();
    if (mimetype != null && mimetype.length() > 0) {
      mimetype = mimetype.toLowerCase();
    } else {
      MimeType mt = mimeTypes.getMimeType(url);
      if (mt != null) {
        mimetype = mt.getName();
      }
    }
    if (!indexAll) {
      if ((mimetype == null) || 
          (!mimetype.startsWith(TEXT_TYPE) &&
           !mimetype.startsWith(APPLICATION_TYPE))) {
        // Skip any but basic types.
        return;
      }
    }
    String noSpacesMimetype =
      TextUtils.replaceAll(WHITESPACE, mimetype, "-");
//     LOG.info("adding " + Long.toString(arcData.getLength())
//              + " bytes of mimetype " + noSpacesMimetype + " " + url);

    // copy http headers to nutch metadata
    Properties metaData = new Properties();
    Header[] headers = rec.getHttpHeaders();
    for (int j = 0; j < headers.length; j++) {
      Header header = headers[j];
      metaData.put(header.getName(), header.getValue());
    }
    // Add the collection name, the arcfile name, and the offset.
    // Also add mimetype.  Needed by the ia indexers.
    metaData.put(ARCCOLLECTION_KEY, this.collectionName);
    metaData.put(ARCFILENAME_KEY, arcName);
    metaData.put(ARCFILEOFFSET_KEY, Long.toString(arcData.getOffset()));
    metaData.put(CONTENT_TYPE_KEY, mimetype);

    // Collect content bytes
    // TODO: Skip if unindexable type.
    rec.skipHttpHeader();
    ByteArrayOutputStream contentBuffer = new ByteArrayOutputStream();
    byte[] buf = new byte[1024 * 4];
    int total = 0;
    int len = rec.read(buf, 0, buf.length);
    while (len != -1 && total < this.contentLimit) {
      total += len;
      contentBuffer.write(buf, 0, len);
      len = rec.read(buf, 0, buf.length);
    }

    // System.out.println("--------------");
    // System.out.write(contentBuffer.toByteArray());
    // System.out.println("--------------");

    byte[] contentBytes = contentBuffer.toByteArray();
    Content content = new Content(url, url, contentBytes, mimetype, metaData);

    metaData.put(Fetcher.DIGEST_KEY, MD5Hash.digest(contentBytes).toString());
    metaData.put(Fetcher.SEGMENT_NAME_KEY, segmentName);
        
    CrawlDatum datum = new CrawlDatum();
    datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);

    long date = 0;
    try {
      date = ArchiveUtils.parse14DigitDate(arcData.getDate()).getTime();
    } catch (java.text.ParseException e) {
      LOG.severe("Failed parse of date: " + arcData.getDate());
    }
    datum.setFetchTime(date);

    Parse parse = null;
    ParseStatus parseStatus;
    try {
      Parser parser = ParserFactory.getParser(content.getContentType(),
                                              content.getBaseUrl());
      parse = parser.getParse(content);
      parseStatus = parse.getData().getStatus();
    } catch (Exception e) {
      parseStatus = new ParseStatus(e);
    }
    if (!parseStatus.isSuccess()) {
      LOG.warning("Error parsing: "+url+": "+parseStatus);
      parse = null;
    }

    output.collect(new UTF8(url),
                   new FetcherOutput(datum, null,
                                     parse!=null ? new ParseImpl(parse):null));
  }

  public void importArcs(File arcUrlsDir, File segment) throws IOException {

    LOG.info("ImportArcs: starting");
    LOG.info("ImportArcs: arcUrlsDir: " + arcUrlsDir);
    LOG.info("ImportArcs: segment: " + segment);

    JobConf job = new JobConf(getConf());
    job.setJar("build/nutchwax.job.jar");

    job.set(Fetcher.SEGMENT_NAME_KEY, segment.getName());

    job.setInputDir(arcUrlsDir);
    job.setMapperClass(ImportArcs.class);

    job.setOutputDir(segment);
    job.setOutputFormat(FetcherOutputFormat.class);
    job.setOutputKeyClass(UTF8.class);
    job.setOutputValueClass(FetcherOutput.class);

    JobClient.runJob(job);
    LOG.info("ImportArcs: done");
  }

  public static void main(String[] args) throws Exception {
    // parse command line options
    String usage = "Usage: ImportArcs arcUrlsDir segmentDir";

    if (args.length != 2) {
      System.err.println(usage);
      System.exit(-1);
    }

    File arcUrlsDir = new File(args[0]);
    File segmentDir = new File(args[1]);

    new ImportArcs(NutchConf.get()).importArcs(arcUrlsDir, segmentDir);
  }
}
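
Note on the read loop in processRecord above: it tests total < contentLimit before each write, so the collected content can exceed http.content.limit by up to one buffer. The following is a minimal sketch of that behaviour, assuming an ordinary InputStream stands in for ARCRecord. ContentLimitSketch, readUpTo and readAtMost are hypothetical names and are not part of this commit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ContentLimitSketch {

  // Same loop shape as processRecord: the limit is checked before each
  // write, so the last buffer written may carry the total past the limit.
  static byte[] readUpTo(InputStream in, int contentLimit, int bufSize) {
    try {
      ByteArrayOutputStream contentBuffer = new ByteArrayOutputStream();
      byte[] buf = new byte[bufSize];
      int total = 0;
      int len = in.read(buf, 0, buf.length);
      while (len != -1 && total < contentLimit) {
        total += len;
        contentBuffer.write(buf, 0, len);
        len = in.read(buf, 0, buf.length);
      }
      return contentBuffer.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Hypothetical stricter variant: never asks for more bytes than remain
  // under the limit, so the result is at most contentLimit bytes long.
  static byte[] readAtMost(InputStream in, int contentLimit, int bufSize) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[bufSize];
      int remaining = contentLimit;
      while (remaining > 0) {
        int len = in.read(buf, 0, Math.min(buf.length, remaining));
        if (len == -1) {
          break;
        }
        out.write(buf, 0, len);
        remaining -= len;
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] data = new byte[10000];
    // Two full 4096-byte buffers are written before the check stops the loop.
    System.out.println(readUpTo(new ByteArrayInputStream(data), 5000, 4096).length);   // prints 8192
    System.out.println(readAtMost(new ByteArrayInputStream(data), 5000, 4096).length); // prints 5000
  }
}
```

Whether the overshoot matters depends on how strictly downstream code treats the limit. The committed loop is simpler, and the stricter variant is shown only for comparison.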

--- NEW FILE: IndexArcs.java ---
/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.archive.access.nutch;

import java.io.*;
import java.net.*;
import java.util.*;
import java.text.*;
import java.util.logging.*;

import org.apache.nutch.io.*;
import org.apache.nutch.fs.*;
import org.apache.nutch.util.*;
import org.apache.nutch.mapred.*;
import org.apache.nutch.crawl.*;

public class IndexArcs {
  public static final Logger LOG =
    LogFormatter.getLogger("org.archive.access.nutch.IndexArcs");

  private static String getDate() {
    return new SimpleDateFormat("yyyyMMddHHmmss").format
      (new Date(System.currentTimeMillis()));
  }

  /* Import and index a set of arc files. */
  public static void main(String args[]) throws Exception {
    if (args.length < 1) {
      System.out.println("Usage: IndexArcs <arcsDir> [-dir d]");
      return;
    }

    JobConf conf = new JobConf(NutchConf.get());

    File arcsDir = null;
    File dir = new File("crawl-" + getDate());

    for (int i = 0; i < args.length; i++) {
      if ("-dir".equals(args[i])) {
        dir = new File(args[i+1]);
        i++;
      } else if (args[i] != null) {
        arcsDir = new File(args[i]);
      }
    }

    NutchFileSystem fs = NutchFileSystem.get(conf);
    if (fs.exists(dir)) {
      throw new RuntimeException(dir + " already exists.");
    }

    LOG.info("IndexArcs started in: " + dir);
    LOG.info("arcsDir = " + arcsDir);

    File linkDb = new File(dir + "/linkdb");
    File index = new File(dir + "/indexes");
    File segments = new File(dir + "/segments");
    File segment = new File(segments, getDate());
      
    // import arcs
    new ImportArcs(conf).importArcs(arcsDir, segment);

    // invert links
    new LinkDb(conf).invert(linkDb, segments);

    // index everything
    new Indexer(conf).index(index, linkDb, fs.listFiles(segments));

    LOG.info("IndexArcs finished: " + dir);
  }
}
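
Note on the argument loop in main above: it reads args[i+1] after -dir without a bounds check, and arcsDir stays null when only -dir is given. The following is a minimal sketch of a stricter parse, assuming the same usage string "IndexArcs <arcsDir> [-dir d]". IndexArcsArgsSketch, Options and parse are hypothetical names and are not part of this commit.

```java
import java.io.File;

class IndexArcsArgsSketch {

  // Holds the two values that main derives from the command line.
  static final class Options {
    final File arcsDir;
    final File dir;

    Options(File arcsDir, File dir) {
      this.arcsDir = arcsDir;
      this.dir = dir;
    }
  }

  // Parses "<arcsDir> [-dir d]" and rejects the two inputs that the
  // original loop lets through: a trailing -dir and a missing arcsDir.
  static Options parse(String[] args, File defaultDir) {
    File arcsDir = null;
    File dir = defaultDir;
    for (int i = 0; i < args.length; i++) {
      if ("-dir".equals(args[i])) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("-dir requires a value");
        }
        dir = new File(args[++i]);
      } else {
        arcsDir = new File(args[i]);
      }
    }
    if (arcsDir == null) {
      throw new IllegalArgumentException("missing <arcsDir>");
    }
    return new Options(arcsDir, dir);
  }

  public static void main(String[] args) {
    Options o = parse(new String[] {"arcs", "-dir", "out"}, new File("crawl-default"));
    System.out.println(o.arcsDir + " -> " + o.dir);
  }
}
```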

--- Arc2Segment.java DELETED ---

[Archive-access-cvs] archive-access/projects/nutch/bin indexArcs.sh,NONE,1.8.2.2 arc2seg.sh,1.9,NONE arcs2segs.sh,1.4,NONE indexarcs.sh,1.9,NONE

From: Doug C. <cu...@us...> - 2005-09-01 18:45:38

Update of /cvsroot/archive-access/archive-access/projects/nutch/bin
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24577/bin

Added Files:
      Tag: mapred
	indexArcs.sh 
Removed Files:
      Tag: mapred
	arc2seg.sh arcs2segs.sh indexarcs.sh 
Log Message:
Add indexArcs command.

--- arcs2segs.sh DELETED ---

--- indexarcs.sh DELETED ---

--- NEW FILE: indexArcs.sh ---
#!/bin/bash
# bash rather than plain sh: JAVA_OPTS below is an array and the cygwin check uses '&>'.

# resolve links - $0 may be a softlink
THIS="$0"
while [ -h "$THIS" ]; do
  ls=`ls -ld "$THIS"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    THIS="$link"
  else
    THIS=`dirname "$THIS"`/"$link"
  fi
done

# some directories
THIS_DIR=`dirname "$THIS"`
PROJECT_HOME=`cd "$THIS_DIR/.." ; pwd`

# If no 'nutch' directory, assume the binaries-only layout (All scripts are
# in a single 'bin' directory and NUTCH_HOME=PROJECT_HOME).
NUTCH_HOME="${PROJECT_HOME}/nutch"
if [ ! -d "${NUTCH_HOME}" ]
then
    NUTCH_HOME="${PROJECT_HOME}"
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

JAVA=$JAVA_HOME/bin/java
if [ -z "$JAVA_OPTS" ]
then
  JAVA_OPTS=(-Xmx400m -server)
fi

# CLASSPATH initially contains conf dirs
CLASSPATH=${PROJECT_HOME}/conf:${NUTCH_HOME}/conf

# for developers, add classes to CLASSPATH
if [ -d "$PROJECT_HOME/build/classes" ]; then
  CLASSPATH=${CLASSPATH}:$PROJECT_HOME/build/classes
fi

# for developers, add Nutch classes to CLASSPATH
if [ -d "$NUTCH_HOME/build/classes" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
fi
if [ -d "$NUTCH_HOME/build/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
fi
if [ -d "$NUTCH_HOME/build/test/classes" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
fi

# so that filenames w/ spaces are handled correctly in loops below
IFS=

# for releases, add Nutch jar to CLASSPATH
for f in $NUTCH_HOME/nutch-*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME
fi

# Add our libs to CLASSPATH but take care to make heritrix jar come
# before the httpclient jar (heritrix overlays a couple of httpclient
# classes).
httpclient_jar=
for f in ${PROJECT_HOME}/lib/*.jar; do
  case `basename $f` in
    commons-httpclient*.jar) httpclient_jar=$f ;;
    *) CLASSPATH=${CLASSPATH}:$f ;;
  esac
done
CLASSPATH=${CLASSPATH}:${httpclient_jar}

# Add Nutch libs to CLASSPATH
for f in $NUTCH_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# restore ordinary behaviour
unset IFS

CLASS=org.archive.access.nutch.IndexArcs

# cygwin path translation
if expr match `uname` 'CYGWIN*' &> /dev/null; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi

# Run it. Add in to java.net.URL the heritrix rsync handler.
exec $JAVA ${JAVA_OPTS[@]} \
        -Djava.protocol.handler.pkgs=org.archive.net \
        -classpath "$CLASSPATH" $CLASS "$@"

--- arc2seg.sh DELETED ---

[Archive-access-cvs] archive-access/projects/nutch/src/java/org/archive/access/nutch Arc2Segment.java,1.28.2.7,1.28.2.8

From: Doug C. <cu...@us...> - 2005-09-01 17:37:02

Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv6304

Modified Files:
      Tag: mapred
	Arc2Segment.java 
Log Message:
Use reporter to set status.


Index: Arc2Segment.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/Arc2Segment.java,v
retrieving revision 1.28.2.7
retrieving revision 1.28.2.8
diff -C2 -d -r1.28.2.7 -r1.28.2.8
*** Arc2Segment.java	24 Aug 2005 04:15:48 -0000	1.28.2.7
--- Arc2Segment.java	1 Sep 2005 17:36:53 -0000	1.28.2.8
***************
*** 132,136 ****
              arcName = arcName.substring(0, arcName.indexOf(".arc"));
            }
!           LOG.info("arcName="+arcName);
          }
  
--- 132,136 ----
              arcName = arcName.substring(0, arcName.indexOf(".arc"));
            }
!           reporter.setStatus(arcName);
          }
  
***************
*** 138,142 ****
            continue;
          try {
!           processRecord(arcName, rec, output, reporter);
          } catch (Throwable e) {
            LOG.log(Level.WARNING, "Error processing: " + arcLocation, e);
--- 138,142 ----
            continue;
          try {
!           processRecord(arcName, rec, output);
          } catch (Throwable e) {
            LOG.log(Level.WARNING, "Error processing: " + arcLocation, e);
***************
*** 149,153 ****
  
    private void processRecord(final String arcName, final ARCRecord rec,
!                              OutputCollector output, Reporter reporter)
      throws IOException {
  
--- 149,153 ----
  
    private void processRecord(final String arcName, final ARCRecord rec,
!                              OutputCollector output)
      throws IOException {
  
***************
*** 155,160 ****
      String url = arcData.getUrl();
  
-     reporter.setStatus(url);
- 
      String mimetype = arcData.getMimetype();
      if (mimetype != null && mimetype.length() > 0) {
--- 155,158 ----

[Archive-access-cvs] archive-access/src/docs/warc warc_file_format.html,1.6,1.7 warc_file_format.txt,1.5,1.6 warc_file_format.xml,1.10,1.11

From: John A. K. <joh...@us...> - 2005-08-28 18:55:38

Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv17998

Modified Files:
	warc_file_format.html warc_file_format.txt 
	warc_file_format.xml 
Log Message:
added complete in-line description of ANVL

Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** warc_file_format.html	26 Aug 2005 23:19:18 -0000	1.6
--- warc_file_format.html	28 Aug 2005 18:55:30 -0000	1.7
***************
*** 363,371 ****
    warc-file   = 1*warc-record
    warc-record = header block CRLF CRLF
!   header      = header-line CRLF anvl-fields
    block       = *OCTET
  </pre>
  <p>Elements of this grammar are further specified and explained in
! sections that follow (and in the case of <span class="emph">anvl-fields</span>, also a separate document).
  </p>
  <p>The record <span class="emph">header-line</span> is a
--- 363,371 ----
    warc-file   = 1*warc-record
    warc-record = header block CRLF CRLF
!   header      = header-line CRLF *anvl-field CRLF
    block       = *OCTET
  </pre>
  <p>Elements of this grammar are further specified and explained in
! sections that follow.
  </p>
  <p>The record <span class="emph">header-line</span> is a
***************
*** 385,401 ****
  been written.
  </p>
! <p>After the <span class="emph">header-line</span> come any number of
! named fields in a line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, &ldquo;A Name-Value Language,&rdquo; .</span><span>)</span></a> [ANVL] that is very similar to that of email
! headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., &ldquo;Standard for the format of ARPA Internet text messages,&rdquo; August&nbsp;1982.</span><span>)</span></a>. Its format can be roughly summarized
! as the following:
  </p><pre>
!   anvl-fields = *line CRLF
!   line        = (field / other-anvl) CRLF
!   field       = &lt;field per RFC0822>
!   other-anvl  = &lt;see ANVL>
  </pre>
! <p>This document defines a number of named fields which may appear in
! the <span class="emph">anvl-fields</span> area of the header. Note that
! the smallest possible <span class="emph">anvl-fields</span> is a
  single CRLF, indicating no named fields.
  </p>
--- 385,411 ----
  been written.
  </p>
! <p>After the <span class="emph">header-line</span> come zero or more
! named <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, &ldquo;A Name-Value Language,&rdquo; .</span><span>)</span></a> [ANVL] fields in a line-oriented syntax
! very similar to that of email headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., &ldquo;Standard for the format of ARPA Internet text messages,&rdquo; August&nbsp;1982.</span><span>)</span></a> but with
! unrestricted "text" values (none of its 13 reserved special characters).
! The precise format is as follows:
  </p><pre>
!   anvl-field  =  field-name ":" [ field-body ] CRLF
!   field-name  =  1*&lt;any CHAR, excluding control-chars and ":">
!   field-body  =  text [CRLF LWSP-char field-body]
!   text        =  1*&lt;any UTF-8 character, including bare
!                     CR and bare LF, but NOT including CRLF>
!                                              ; (Octal, Decimal.)
!   CHAR        =  &lt;any ASCII/UTF-8 character> ; (0-177,  0.-127.)
!   CR          =  &lt;ASCII CR, carriage return> ; (   15,      13.)
!   LF          =  &lt;ASCII LF, linefeed>        ; (   12,      10.)
!   SPACE       =  &lt;ASCII SP, space>           ; (   40,      32.)
!   HTAB        =  &lt;ASCII HT, horizontal-tab>  ; (   11,       9.)
!   CRLF        =  CR LF
!   LWSP-char   =  SPACE / HTAB                ; semantics = SPACE
  </pre>
! <p>This document defines a number of named fields that may appear as
! an <span class="emph">anvl-field</span>.  Note that the smallest
! possible <span class="emph">anvl-fields</span> is a
  single CRLF, indicating no named fields.
  </p>
***************
*** 632,636 ****
  </p>
  <p>Named parameters after the header-line, if any, follow the
! line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, &ldquo;A Name-Value Language,&rdquo; .</span><span>)</span></a> [ANVL]. Normally,
  named parameters are optional and their order is insignificant,
  however, specific record types require that certain named parameters
--- 642,647 ----
  </p>
  <p>Named parameters after the header-line, if any, follow the
! line-oriented syntax defined previously (also known as
! <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, &ldquo;A Name-Value Language,&rdquo; .</span><span>)</span></a> [ANVL]).  Normally,
  named parameters are optional and their order is insignificant,
  however, specific record types require that certain named parameters
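
Note on the anvl-field grammar introduced in the hunks above: it can be read as a small line-oriented parser. The following is a minimal sketch under stated assumptions, not the project's implementation. Lines are split on CRLF only, so bare CR and bare LF stay inside values. A continuation line's leading SPACE or HTAB is folded to one space. Leading spaces and tabs after the colon are stripped, which the grammar does not specify. A repeated field name overwrites the earlier value, and the control-character check on field names is omitted. AnvlFieldsSketch and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AnvlFieldsSketch {

  // Parses "*anvl-field CRLF": zero or more fields, ended by an empty line.
  static Map<String, String> parse(String block) {
    Map<String, String> fields = new LinkedHashMap<>();
    String lastName = null;
    int pos = 0;
    while (true) {
      int eol = block.indexOf("\r\n", pos);
      if (eol < 0) {
        throw new IllegalArgumentException("missing terminating CRLF");
      }
      String line = block.substring(pos, eol);
      pos = eol + 2;
      if (line.isEmpty()) {
        return fields; // the bare CRLF that closes the header
      }
      char first = line.charAt(0);
      if (first == ' ' || first == '\t') {
        // field-body continuation: CRLF LWSP-char field-body
        if (lastName == null) {
          throw new IllegalArgumentException("continuation before any field");
        }
        fields.put(lastName, fields.get(lastName) + " " + line.substring(1));
        continue;
      }
      int colon = line.indexOf(':');
      if (colon < 1) {
        throw new IllegalArgumentException("expected field-name \":\" in: " + line);
      }
      lastName = line.substring(0, colon);
      fields.put(lastName, stripLeadingLwsp(line.substring(colon + 1)));
    }
  }

  // Assumption: whitespace right after the colon is not part of the value.
  private static String stripLeadingLwsp(String s) {
    int i = 0;
    while (i < s.length() && (s.charAt(i) == ' ' || s.charAt(i) == '\t')) {
      i++;
    }
    return s.substring(i);
  }

  public static void main(String[] args) {
    String header = "Content-Type: text/html\r\nNote: first\r\n\tsecond\r\n\r\n";
    System.out.println(parse(header)); // prints {Content-Type=text/html, Note=first second}
  }
}
```

An input of a single CRLF yields an empty map, matching the statement above that the smallest possible field area is a single CRLF.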

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** warc_file_format.xml	26 Aug 2005 23:19:18 -0000	1.10
--- warc_file_format.xml	28 Aug 2005 18:55:30 -0000	1.11
***************
*** 203,207 ****
    warc-file   = 1*warc-record
    warc-record = header block CRLF CRLF
!   header      = header-line CRLF anvl-fields
    block       = *OCTET
   </artwork>
--- 203,207 ----
    warc-file   = 1*warc-record
    warc-record = header block CRLF CRLF
!   header      = header-line CRLF *anvl-field CRLF
    block       = *OCTET
   </artwork>
***************
*** 209,214 ****
  
  <t>Elements of this grammar are further specified and explained in
! sections that follow (and in the case of <spanx
! style="emph">anvl-fields</spanx>, also a separate document).</t>
  
  <t>The record <spanx style="emph">header-line</spanx> is a
--- 209,213 ----
  
  <t>Elements of this grammar are further specified and explained in
! sections that follow.</t>
  
  <t>The record <spanx style="emph">header-line</spanx> is a
***************
*** 233,254 ****
  been written.</t>
  
! <t>After the <spanx style="emph">header-line</spanx> come any number of
! named fields in a line-oriented syntax called <xref
! target="ANVL">ANVL</xref> that is very similar to that of email
! headers <xref target="RFC0822" />. Its format can be roughly summarized
! as the following:</t>
  
  <figure>
   <artwork>
!   anvl-fields = *line CRLF
!   line        = (field / other-anvl) CRLF
!   field       = &lt;field per RFC0822>
!   other-anvl  = &lt;see ANVL>
   </artwork>
  </figure>
  
! <t>This document defines a number of named fields which may appear in
! the <spanx style="emph">anvl-fields</spanx> area of the header. Note that
! the smallest possible <spanx style="emph">anvl-fields</spanx> is a
  single CRLF, indicating no named fields.</t>
  
--- 232,262 ----
  been written.</t>
  
! <t>After the <spanx style="emph">header-line</spanx> come zero or more
! named <xref target="ANVL">ANVL</xref> fields in a line-oriented syntax
! very similar to that of email headers <xref target="RFC0822" /> but with
! unrestricted "text" values (none of its 13 reserved special characters).
! The precise format is as follows:</t>
  
  <figure>
   <artwork>
!   anvl-field  =  field-name ":" [ field-body ] CRLF
!   field-name  =  1*&lt;any CHAR, excluding control-chars and ":">
!   field-body  =  text [CRLF LWSP-char field-body]
!   text        =  1*&lt;any UTF-8 character, including bare
!                     CR and bare LF, but NOT including CRLF>
!                                              ; (Octal, Decimal.)
!   CHAR        =  &lt;any ASCII/UTF-8 character> ; (0-177,  0.-127.)
!   CR          =  &lt;ASCII CR, carriage return> ; (   15,      13.)
!   LF          =  &lt;ASCII LF, linefeed>        ; (   12,      10.)
!   SPACE       =  &lt;ASCII SP, space>           ; (   40,      32.)
!   HTAB        =  &lt;ASCII HT, horizontal-tab>  ; (   11,       9.)
!   CRLF        =  CR LF
!   LWSP-char   =  SPACE / HTAB                ; semantics = SPACE
   </artwork>
  </figure>
  
! <t>This document defines a number of named fields that may appear as
! an <spanx style="emph">anvl-field</spanx>.  Note that the smallest
! possible <spanx style="emph">anvl-fields</spanx> is a
  single CRLF, indicating no named fields.</t>
  
***************
*** 488,492 ****
  
  <t>Named parameters after the header-line, if any, follow the
! line-oriented syntax called <xref target="ANVL">ANVL</xref>. Normally,
  named parameters are optional and their order is insignificant,
  however, specific record types require that certain named parameters
--- 496,501 ----
  
  <t>Named parameters after the header-line, if any, follow the
! line-oriented syntax defined previously (also known as
! <xref target="ANVL">ANVL</xref>).  Normally,
  named parameters are optional and their order is insignificant,
  however, specific record types require that certain named parameters

Index: warc_file_format.txt
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** warc_file_format.txt	26 Aug 2005 23:19:18 -0000	1.5
--- warc_file_format.txt	28 Aug 2005 18:55:30 -0000	1.6
***************
*** 293,302 ****
       warc-file   = 1*warc-record
       warc-record = header block CRLF CRLF
!      header      = header-line CRLF anvl-fields
       block       = *OCTET
  
     Elements of this grammar are further specified and explained in
!    sections that follow (and in the case of _anvl-fields_, also a
!    separate document).
  
     The record _header-line_ is a newline-terminated sequence of
--- 293,301 ----
       warc-file   = 1*warc-record
       warc-record = header block CRLF CRLF
!      header      = header-line CRLF *anvl-field CRLF
       block       = *OCTET
  
     Elements of this grammar are further specified and explained in
!    sections that follow.
  
     The record _header-line_ is a newline-terminated sequence of
***************
*** 314,333 ****
     completely known after the record content _block_ has been written.
  
!    After the _header-line_ come any number of named fields in a line-
!    oriented syntax called ANVL [ANVL] that is very similar to that of
!    email headers [RFC0822].  Its format can be roughly summarized as the
!    following:
! 
!      anvl-fields = *line CRLF
!      line        = (field / other-anvl) CRLF
!      field       = <field per RFC0822>
!      other-anvl  = <see ANVL>
! 
!    This document defines a number of named fields which may appear in
!    the _anvl-fields_ area of the header.  Note that the smallest
!    possible _anvl-fields_ is a single CRLF, indicating no named fields.
  
!    Following the headers comes the content _block_, if any, which may
!    contain arbitrary binary data, up through the remaining number of
  
  
--- 313,333 ----
     completely known after the record content _block_ has been written.
  
!    After the _header-line_ come zero or more named ANVL [ANVL] fields in
!    a line-oriented syntax very similar to that of email headers
!    [RFC0822] but with unrestricted "text" values (none of its 13
!    reserved special characters).  The precise format is as follows:
  
!      anvl-field  =  field-name ":" [ field-body ] CRLF
!      field-name  =  1*<any CHAR, excluding control-chars and ":">
!      field-body  =  text [CRLF LWSP-char field-body]
!      text        =  1*<any UTF-8 character, including bare
!                        CR and bare LF, but NOT including CRLF>
!                                                 ; (Octal, Decimal.)
!      CHAR        =  <any ASCII/UTF-8 character> ; (0-177,  0.-127.)
!      CR          =  <ASCII CR, carriage return> ; (   15,      13.)
!      LF          =  <ASCII LF, linefeed>        ; (   12,      10.)
!      SPACE       =  <ASCII SP, space>           ; (   40,      32.)
!      HTAB        =  <ASCII HT, horizontal-tab>  ; (   11,       9.)
!      CRLF        =  CR LF
  
  
***************
*** 338,341 ****
--- 338,349 ----
  
  
+      LWSP-char   =  SPACE / HTAB                ; semantics = SPACE
+ 
+    This document defines a number of named fields that may appear as an
+    _anvl-field_.  Note that the smallest possible _anvl-fields_ is a
+    single CRLF, indicating no named fields.
+ 
+    Following the headers comes the content _block_, if any, which may
+    contain arbitrary binary data, up through the remaining number of
     octets as specified in the previously-given _data-length_ parameter.
     Finally come two CRLF newlines, not counted in the declared record
***************
*** 381,392 ****
  
  
- 
- 
- 
- 
- 
- 
- 
- 
  Kunze, et al.            Expires January 2, 2006                [Page 7]
  
--- 389,392 ----
***************
*** 658,668 ****
  
     Named parameters after the header-line, if any, follow the line-
!    oriented syntax called ANVL [ANVL].  Normally, named parameters are
!    optional and their order is insignificant, however, specific record
!    types require that certain named parameters be present (and future
!    extensions may have ordering requirements).  If there are no named
!    parameters present, the entire WARC record header is the line of
!    positional parameters followed by one blank line (two consecutive
!    newlines).
  
  
--- 658,668 ----
  
     Named parameters after the header-line, if any, follow the line-
!    oriented syntax defined previously (also know as ANVL [ANVL]).
!    Normally, named parameters are optional and their order is
!    insignificant, however, specific record types require that certain
!    named parameters be present (and future extensions may have ordering
!    requirements).  If there are no named parameters present, the entire
!    WARC record header is the line of positional parameters followed by
!    one blank line (two consecutive newlines).
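
Note on the anvl-field grammar introduced in revision 1.6 above: the
rules can be read as "a name, a colon, an optional value that may be
folded onto continuation lines beginning with SPACE or HTAB, and an
empty line to end the header".  The following is a minimal Python
sketch of that reading, not code from this repository.  The function
name parse_anvl_fields is hypothetical, and two choices are assumptions
rather than anything the draft states: surrounding whitespace is
trimmed from each value, and a folded CRLF plus its single leading
LWSP-char is replaced by one space (following "semantics = SPACE").

```python
def parse_anvl_fields(data: bytes):
    """Parse `*anvl-field CRLF` from the start of `data`.

    Returns (fields, rest): fields is a list of (name, value) pairs and
    rest is whatever follows the terminating empty line (the block).
    """
    fields = []
    pos = 0
    while True:
        end = data.find(b"\r\n", pos)
        if end == -1:
            raise ValueError("header is not terminated by CRLF")
        if end == pos:
            # A bare CRLF ends the named fields (possibly zero of them).
            return fields, data[end + 2:]
        line = data[pos:end]
        pos = end + 2
        # field-body = text [CRLF LWSP-char field-body]: a following line
        # that starts with SPACE or HTAB continues the current value.
        while data[pos:pos + 1] in (b" ", b"\t"):
            end = data.find(b"\r\n", pos)
            if end == -1:
                raise ValueError("continuation line is not terminated by CRLF")
            line += b" " + data[pos + 1:end]  # assumption: fold to one space
            pos = end + 2
        name, sep, body = line.partition(b":")
        if not sep or not name:
            raise ValueError("expected field-name ':' [field-body]")
        # Only CRLF terminates a line, so bare CR or bare LF stay in the value.
        fields.append((name.decode("utf-8"),
                       body.strip(b" \t").decode("utf-8")))
```

Calling it on a header area that contains only a single CRLF yields an
empty field list, which corresponds to the "no named fields" case the
draft describes.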

[Archive-access-cvs] archive-access/src/docs/warc warc_file_format.html,1.5,1.6 warc_file_format.txt,1.4,1.5 warc_file_format.xml,1.9,1.10

From: John A. K. <joh...@us...> - 2005-08-26 23:19:27

Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20105

Modified Files:
	warc_file_format.html warc_file_format.txt 
	warc_file_format.xml 
Log Message:
added proposed text for a Warcinfo-ID named parameter

Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** warc_file_format.html	24 Aug 2005 01:39:51 -0000	1.5
--- warc_file_format.html	26 Aug 2005 23:19:18 -0000	1.6
***************
*** 234,238 ****
  GZIP extra field: skip-lengths ('sl')<br />
  &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor26">9.3.</a>&nbsp;
! GZIP WARC File Extension<br />
  <a href="#anchor27">10.</a>&nbsp;
  WARC File Name and Size Recommendations<br />
--- 234,238 ----
  GZIP extra field: skip-lengths ('sl')<br />
  &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor26">9.3.</a>&nbsp;
! GZIP WARC File Name Suffix<br />
  <a href="#anchor27">10.</a>&nbsp;
  WARC File Name and Size Recommendations<br />
***************
*** 406,412 ****
  record <span class="emph">data-length</span>.
  </p>
! <p>It is customary, and recommended, that the first record of a WARC
! describe the file itself, using the 'warcinfo' record-type, and a
! descriptive content block format.
  </p>
  <p>Subsequent records contain content blocks that are either the
--- 406,415 ----
  record <span class="emph">data-length</span>.
  </p>
! <p>It is often the case that the first record of a WARC to has the
! record-type 'warcinfo' and is used to describe the records that follow it.
! It is always the case that the concatenation of any two WARC files is a
! syntactically correct WARC file; care should be taken, however, when
! concatenation would inadvertently cause 'warcinfo' records to appear
! at points in the result that would create confusion.
  </p>
  <p>Subsequent records contain content blocks that are either the
***************
*** 851,854 ****
--- 854,873 ----
   
  </dd>
+ <dt>Warcinfo-ID: record-id</dt>
+ <dd>
+ When present, indicates the record-id of the associated 'warcinfo'
+ record for this record.  Typically, the Warcinfo-ID parameter is used
+ when the context of the applicable 'warcinfo' record is unavailable,
+ such as after distributing single records into separate WARC files.
+ WARC writing applications (such web crawlers) may choose to record
+ this parameter routinely (e.g., before computing checksums).
+ 
+ The Warcinfo-ID parameter overrides any association with a previously
+ occurring (in the WARC) 'warcinfo' record, thus providing a way to protect
+ the true association when records are combined from different WARCs.
+ Use of this parameter in a record of type 'warcinfo' is undefined and
+ reserved for possible future extension.
+  
+ </dd>
  </dl></blockquote>
  <a name="anchor15"></a><br /><hr />
***************
*** 1113,1124 ****
  <a name="anchor26"></a><br /><hr />
  <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.9.3"></a><h3>9.3.&nbsp;GZIP WARC File Extension</h3>
  
! <p>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure they are properly recognized by GZIP tools, they
! should only get the customary additional ".gz" file extension suffix,
! making their suffix ".warc.gz". Software which works with WARC files
! compressed using these conventions will detect and exploit them; other
! GZIP software will harmlessly ignore the extensions.
  </p>
  <a name="anchor27"></a><br /><hr />
--- 1132,1143 ----
  <a name="anchor26"></a><br /><hr />
  <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.9.3"></a><h3>9.3.&nbsp;GZIP WARC File Name Suffix</h3>
  
! <p>A WARC file compressed with the extra GZIP field conventions described
! in this document is a legal GZIP file.  To ensure that it is properly
! recognized by GZIP tools, its name should have the customary ".gz"
! appended to it, making the complete suffix, ".warc.gz".
! GZIP software that does not recognize the extra GZIP fields will
! simply pass over them without benefit or harm.
  </p>
  <a name="anchor27"></a><br /><hr />

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** warc_file_format.xml	26 Aug 2005 22:29:40 -0000	1.9
--- warc_file_format.xml	26 Aug 2005 23:19:18 -0000	1.10
***************
*** 260,266 ****
  record <spanx style="emph">data-length</spanx>.</t>
  
! <t>It is customary, and recommended, that the first record of a WARC
! describe the file itself, using the 'warcinfo' record-type, and a
! descriptive content block format.</t>
  
  <t>Subsequent records contain content blocks that are either the
--- 260,269 ----
  record <spanx style="emph">data-length</spanx>.</t>
  
! <t>It is often the case that the first record of a WARC to has the
! record-type 'warcinfo' and is used to describe the records that follow it.
! It is always the case that the concatenation of any two WARC files is a
! syntactically correct WARC file; care should be taken, however, when
! concatenation would inadvertently cause 'warcinfo' records to appear
! at points in the result that would create confusion.</t>
  
  <t>Subsequent records contain content blocks that are either the
***************
*** 680,683 ****
--- 683,701 ----
   </t>
  
+  <t hangText="Warcinfo-ID: record-id">
+ When present, indicates the record-id of the associated 'warcinfo'
+ record for this record.  Typically, the Warcinfo-ID parameter is used
+ when the context of the applicable 'warcinfo' record is unavailable,
+ such as after distributing single records into separate WARC files.
+ WARC writing applications (such web crawlers) may choose to record
+ this parameter routinely (e.g., before computing checksums).
+ 
+ The Warcinfo-ID parameter overrides any association with a previously
+ occurring (in the WARC) 'warcinfo' record, thus providing a way to protect
+ the true association when records are combined from different WARCs.
+ Use of this parameter in a record of type 'warcinfo' is undefined and
+ reserved for possible future extension.
+  </t>
+ 
  </list>
  

Index: warc_file_format.txt
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** warc_file_format.txt	23 Aug 2005 17:35:41 -0000	1.4
--- warc_file_format.txt	26 Aug 2005 23:19:18 -0000	1.5
***************
*** 142,146 ****
       9.1.  Record-at-a-time Compression . . . . . . . . . . . . . . . 22
       9.2.  GZIP extra field: skip-lengths ('sl')  . . . . . . . . . . 22
!      9.3.  GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23
     10. WARC File Name and Size Recommendations  . . . . . . . . . . . 24
     11. Registration of MIME Media Type application/warc . . . . . . . 25
--- 142,146 ----
       9.1.  Record-at-a-time Compression . . . . . . . . . . . . . . . 22
       9.2.  GZIP extra field: skip-lengths ('sl')  . . . . . . . . . . 22
!      9.3.  GZIP WARC File Name Suffix . . . . . . . . . . . . . . . . 23
     10. WARC File Name and Size Recommendations  . . . . . . . . . . . 24
     11. Registration of MIME Media Type application/warc . . . . . . . 25
***************
*** 342,348 ****
     _data-length_.
  
!    It is customary, and recommended, that the first record of a WARC
!    describe the file itself, using the 'warcinfo' record-type, and a
!    descriptive content block format.
  
     Subsequent records contain content blocks that are either the direct
--- 342,352 ----
     _data-length_.
  
!    It is often the case that the first record of a WARC to has the
!    record-type 'warcinfo' and is used to describe the records that
!    follow it.  It is always the case that the concatenation of any two
!    WARC files is a syntactically correct WARC file; care should be
!    taken, however, when concatenation would inadvertently cause
!    'warcinfo' records to appear at points in the result that would
!    create confusion.
  
     Subsequent records contain content blocks that are either the direct
***************
*** 385,392 ****
  
  
- 
- 
- 
- 
  Kunze, et al.            Expires January 2, 2006                [Page 7]
  
--- 389,392 ----
***************
*** 474,480 ****
     describe, explain, or accompany a harvested resource, in ways not
     covered by other record types.  A 'metadata' record will almost
!    always refer to another record of another type, with hat other record
!    holding original harvested or transformed content.  (However, it is
!    allowable for a 'metadata' record to refer to any record type,
     including other 'metadata' records, or to refer to no other
     individual record at all.)  Any number of metadata records may be
--- 474,480 ----
     describe, explain, or accompany a harvested resource, in ways not
     covered by other record types.  A 'metadata' record will almost
!    always refer to another record of another type, with that other
!    record holding original harvested or transformed content.  (However,
!    it is allowable for a 'metadata' record to refer to any record type,
     including other 'metadata' records, or to refer to no other
     individual record at all.)  Any number of metadata records may be
***************
*** 506,510 ****
  
  
!    preferred if the current record's is understandable standing alone.
     (It is not required that any revisit of a previously-visited URI use
     'revisit', only those which refer back to other records.)
--- 506,510 ----
  
  
!    preferred if the current record is understandable standing alone.
     (It is not required that any revisit of a previously-visited URI use
     'revisit', only those which refer back to other records.)
***************
*** 532,544 ****
     A 'conversion' record contains an alternative version of another
     record's content that was created as the result of an archival
!    process.  Typically, this is used to hold content ransformations that
!    maintain viability of content after widely available rendering ools
!    for the originally stored format disappear.  As needed, the original
!    content may be migrated (transformed) to a more viable format in
!    order to keep the information usable with current tools while
!    minimizing loss of information (intellectual content, look and feel,
!    etc).  Any number of transformation records may be created that
     reference a specific source record, which may itself contain
!    ransformed content.  Each transformation should result in a
     freestanding, complete record, with no dependency on survival of the
     original record.  Metadata records may be used to further describe
--- 532,544 ----
     A 'conversion' record contains an alternative version of another
     record's content that was created as the result of an archival
!    process.  Typically, this is used to hold content transformations
!    that maintain viability of content after widely available rendering
!    tools for the originally stored format disappear.  As needed, the
!    original content may be migrated (transformed) to a more viable
!    format in order to keep the information usable with current tools
!    while minimizing loss of information (intellectual content, look and
!    feel, etc).  Any number of transformation records may be created that
     reference a specific source record, which may itself contain
!    transformed content.  Each transformation should result in a
     freestanding, complete record, with no dependency on survival of the
     original record.  Metadata records may be used to further describe
***************
*** 711,715 ****
  
     subject-uri The original URI whose collection gave rise to the
!       information content in this record.  In he context of web
        harvesting, this is the URI that was the target of a crawler's
        retrieval request.  Indirectly, such as for a 'revisit',
--- 711,715 ----
  
     subject-uri The original URI whose collection gave rise to the
!       information content in this record.  In the context of web
        harvesting, this is the URI that was the target of a crawler's
        retrieval request.  Indirectly, such as for a 'revisit',
***************
*** 717,725 ****
        uri appearing in the original record to which the newer record
        pertains.  For a 'warcinfo' record, this parameter is given a
!       synthesized value for the creation name of he WARC file, as a URI.
  
  
        Care should be taken to ensure that the URI in this value is
-       properly escaped (per [RFC2396] and that it is written with no
  
  
--- 717,725 ----
        uri appearing in the original record to which the newer record
        pertains.  For a 'warcinfo' record, this parameter is given a
!       synthesized value for the creation name of the WARC file, as a
!       URI.
  
  
        Care should be taken to ensure that the URI in this value is
  
  
***************
*** 730,733 ****
--- 730,734 ----
  
  
+       properly escaped (per [RFC2396] and that it is written with no
        internal whitespace.
  
***************
*** 780,784 ****
  
  
- 
  Kunze, et al.            Expires January 2, 2006               [Page 14]
  
--- 781,784 ----
***************
*** 825,829 ****
        A potential strategy, after choosing one record to be primary, is
        to extend its record-id as described in the Appendix about
!       record-id considerations.  This creates satellite record- ids for
        related records that contain the primary record-id as an initial
        substring, which greatly optimizes the detection (and in some
--- 825,829 ----
        A potential strategy, after choosing one record to be primary, is
        to extend its record-id as described in the Appendix about
!       record-id considerations.  This creates satellite record-ids for
        related records that contain the primary record-id as an initial
        substring, which greatly optimizes the detection (and in some
***************
*** 850,871 ****
     Truncated: reason-token When present, indicates that the current
        record ends before the apparent end of the source material, but no
!       continuation records are forthcoming.  Possible values indicate he
!       reason for the truncation: 'length' for exceeding a desired length
!       limit; 'time' for exceeding a desired time limit during
        collection.
  
! 
! 
! 
! 
! 
! 
! 
! 
! 
! 
! 
! 
! 
  
  
--- 850,871 ----
     Truncated: reason-token When present, indicates that the current
        record ends before the apparent end of the source material, but no
!       continuation records are forthcoming.  Possible values indicate
!       the reason for the truncation: 'length' for exceeding a desired
!       length limit; 'time' for exceeding a desired time limit during
        collection.
  
!    Warcinfo-ID: record-id When present, indicates the record-id of the
!       associated 'warcinfo' record for this record.  Typically, the
!       Warcinfo-ID parameter is used when the context of the applicable
!       'warcinfo' record is unavailable, such as after distributing
!       single records into separate WARC files.  WARC writing
!       applications (such web crawlers) may choose to record this
!       parameter routinely (e.g., before computing checksums).  The
!       Warcinfo-ID parameter overrides any association with a previously
!       occurring (in the WARC) 'warcinfo' record, thus providing a way to
!       protect the true association when records are combined from
!       different WARCs.  Use of this parameter in a record of type
!       'warcinfo' is undefined and reserved for possible future
!       extension.
  
  
***************
*** 974,978 ****
     records to be written without know their ultimate length, with only a
     small fixed-size edit to the header when the length is eventually
!    know to complete the record.  This named-field-based mechanism does
     not allow a later discovery that a record needs truncation or
     segmentation to be reflected via a small header edit; it requires
--- 974,978 ----
     records to be written without know their ultimate length, with only a
     small fixed-size edit to the header when the length is eventually
!    known to complete the record.  This named-field-based mechanism does
     not allow a later discovery that a record needs truncation or
     segmentation to be reflected via a small header edit; it requires
***************
*** 1011,1015 ****
  
     with an incremented 'Segment-Number' field.  They must also include a
!    'Segment-Origin-ID' field with a value of he Record-ID of the record
     containing the first segment of the set.  All segments of a set must
     have identical subject-uri parameters.
--- 1011,1015 ----
  
     with an incremented 'Segment-Number' field.  They must also include a
!    'Segment-Origin-ID' field with a value of the Record-ID of the record
     containing the first segment of the set.  All segments of a set must
     have identical subject-uri parameters.
***************
*** 1140,1144 ****
     Any resource that can be identified with a URI, even if it is not
     retrieved via an Internet operation, may be archived in a WARC file
!    under a 'resource' type record.  This includes files hat have
     meaningful URIs retrieved from a locally-accessible filesystem or
     other repository.
--- 1140,1144 ----
     Any resource that can be identified with a URI, even if it is not
     retrieved via an Internet operation, may be archived in a WARC file
!    under a 'resource' type record.  This includes files that have
     meaningful URIs retrieved from a locally-accessible filesystem or
     other repository.
***************
*** 1184,1190 ****
  
     However, experience with the precursor ARC format at the Internet
!    Archive has demonstrated hat applying simple standard compression can
!    result in significant storage savings, while preserving random access
!    to individual records.
  
     For this purpose, the GZIP format with customary "deflate"
--- 1184,1190 ----
  
     However, experience with the precursor ARC format at the Internet
!    Archive has demonstrated that applying simple standard compression
!    can result in significant storage savings, while preserving random
!    access to individual records.
  
     For this purpose, the GZIP format with customary "deflate"
***************
*** 1221,1229 ****
     Customarily, GZIP members do not declare their compressed length.
     This presents a problem for WARC processing which, after reading a
!    small portion of a record, would like to skip to he next full record.
!    In the absence of an external, precalculated index, using only the
!    WARC record's uncompressed length would require the complete current
!    record to be decompressed o find the start of the next record.
! 
  
  
--- 1221,1229 ----
     Customarily, GZIP members do not declare their compressed length.
     This presents a problem for WARC processing which, after reading a
!    small portion of a record, would like to skip to the next full
!    record.  In the absence of an external, precalculated index, using
!    only the WARC record's uncompressed length would require the complete
!    current record to be decompressed to find the start of the next
!    record.
  
  
***************
*** 1264,1275 ****
     appropriate.
  
! 9.3.  GZIP WARC File Extension
  
!    WARC files compressed with the above conventions remain legal GZIP
!    files.  Thus, to ensure hey are properly recognized by GZIP tools,
!    they should only get the customary additional ".gz" file extension
!    suffix, making their suffix ".warc.gz".  Software which works with
!    WARC files compressed using these conventions will detect and exploit
!    them; other GZIP software will harmlessly ignore the extensions.
  
  
--- 1264,1275 ----
     appropriate.
  
! 9.3.  GZIP WARC File Name Suffix
  
!    A WARC file compressed with the extra GZIP field conventions
!    described in this document is a legal GZIP file.  To ensure that it
!    is properly recognized by GZIP tools, its name should have the
!    customary ".gz" appended to it, making the complete suffix,
!    ".warc.gz".  GZIP software that does not recognize the extra GZIP
!    fields will simply pass over them without benefit or harm.
  
  
***************
*** 1300,1304 ****
  
     Prefix is an abbreviation usually reflective of the project or crawl
!    that created this file. imestamp is a 14-digit GMT timestamp
     indicating the time the file was initially begun.  Serial is an
     increasing serial-number within the process creating the files, often
--- 1300,1304 ----
  
     Prefix is an abbreviation usually reflective of the project or crawl
!    that created this file.  Timestamp is a 14-digit GMT timestamp
     indicating the time the file was initially begun.  Serial is an
     increasing serial-number within the process creating the files, often
***************
*** 1314,1319 ****
     This specification does not require any particular WARC file naming
     practice, but recommends conventions similar to the above be adopted
!    within WARC-creating institutions. he file name prefix "iipc" should
!    be avoided unless participating in the IIPC naming registry.
  
     [REVIEW ISSUE: Discover sense of the group for what naming and
--- 1314,1319 ----
     This specification does not require any particular WARC file naming
     practice, but recommends conventions similar to the above be adopted
!    within WARC-creating institutions.  The file name prefix "iipc"
!    should be avoided unless participating in the IIPC naming registry.
  
     [REVIEW ISSUE: Discover sense of the group for what naming and
***************
*** 1405,1409 ****
  
     After IESG approval, IANA is expected to register the WARC type
!    "application/warc" using he application provided in this document.
  
  
--- 1405,1409 ----
  
     After IESG approval, IANA is expected to register the WARC type
!    "application/warc" using the application provided in this document.
  
  
***************
*** 1461,1465 ****
  
     This document could not have been written without major contributions
!    from participants of he International Internet Preservation
     Consortium, especially Steen Christensen, and Julien Masanes.
  
--- 1461,1465 ----
  
     This document could not have been written without major contributions
!    from participants of the International Internet Preservation
     Consortium, especially Steen Christensen, and Julien Masanes.
  
***************
*** 1534,1538 ****
     blocks.  Although the 'Related-Record-ID' parameter required of
     'metadata', 'revisit', and 'conversion' records is sufficient to
!    convey relatedness in he context of a single WARC file, great
     optimization can be had when relatedness can be inferred by third
     parties through identifier comparison rather than by lookup in a
--- 1534,1538 ----
     blocks.  Although the 'Related-Record-ID' parameter required of
     'metadata', 'revisit', and 'conversion' records is sufficient to
!    convey relatedness in the context of a single WARC file, great
     optimization can be had when relatedness can be inferred by third
     parties through identifier comparison rather than by lookup in a
***************
*** 1595,1602 ****
  
     <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
!    <warcmetadata>
!    xmlns:dc="http://purl.org/dc/elements/1.1/"
!    xmlns:dcterms="http://purl.org/dc/terms/"
!    xmlns:warc="http://archive.org/warc/0.8/">
     <warc:software>
     Heritrix 1.4.0 http://crawler.archive.org
--- 1595,1602 ----
  
     <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
!    <warcmetadata
!        xmlns:dc="http://purl.org/dc/elements/1.1/"
!        xmlns:dcterms="http://purl.org/dc/terms/"
!        xmlns:warc="http://archive.org/warc/0.8/">
     <warc:software>
     Heritrix 1.4.0 http://crawler.archive.org
***************
*** 1611,1615 ****
     </warc:http-header-user-agent>
     <dc:format>WARC file version 0.8</dc:format>
!    <dcterms:conformsTo nxsi:type="dcterms:URI">
     http://www.archive.org/documents/WarcFileFormat.php
     </dcterms:conformsTo>
--- 1611,1615 ----
     </warc:http-header-user-agent>
     <dc:format>WARC file version 0.8</dc:format>
!    <dcterms:conformsTo xsi:type="dcterms:URI">
     http://www.archive.org/documents/WarcFileFormat.php
     </dcterms:conformsTo>
***************
*** 1754,1763 ****
  
     Again, reference is made back to the original 'response' record.  A
!    new creation-date reflects he time of revisit.  This content block
     hypothesizes including header excerpts from a server response to
     explain the results of the revisit.  (In this case, the remote server
     indicated the resource was unchanged from the previous 'Etag' value.)
!    The actual formats for describing he result of a revisit remain to be
!    defined.
  
  Appendix B.7.  Example of 'conversion' Record
--- 1754,1763 ----
  
     Again, reference is made back to the original 'response' record.  A
!    new creation-date reflects the time of revisit.  This content block
     hypothesizes including header excerpts from a server response to
     explain the results of the revisit.  (In this case, the remote server
     indicated the resource was unchanged from the previous 'Etag' value.)
!    The actual formats for describing the result of a revisit remain to
!    be defined.
  
  Appendix B.7.  Example of 'conversion' Record
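
Note on the proposed Warcinfo-ID parameter in the commit above: the
text implies a simple resolution rule when reading a WARC.  A record is
associated with the most recent preceding 'warcinfo' record, unless it
carries its own Warcinfo-ID, which takes precedence.  The sketch below
illustrates that rule in Python under stated assumptions.  The record
representation (dicts with "id", "type" and an optional "fields"
mapping) and the function name associate_warcinfo are hypothetical and
do not come from the draft.  Because the draft leaves Warcinfo-ID on a
'warcinfo' record undefined, the sketch reports no association for such
records.

```python
def associate_warcinfo(records):
    """Return (record-id, associated warcinfo record-id or None) pairs.

    `records` is an iterable of dicts in file order, a hypothetical
    in-memory form of already-parsed WARC records.
    """
    current = None  # record-id of the latest 'warcinfo' seen so far
    result = []
    for rec in records:
        if rec["type"] == "warcinfo":
            # Describes the records that follow it; its own association
            # is left unset because the draft reserves that case.
            current = rec["id"]
            result.append((rec["id"], None))
            continue
        explicit = rec.get("fields", {}).get("Warcinfo-ID")
        # An explicit Warcinfo-ID overrides the positional association.
        result.append((rec["id"], explicit if explicit is not None else current))
    return result
```

This also shows why the parameter matters after concatenation: a record
moved behind a different 'warcinfo' record keeps its true association
only if the explicit parameter was written beforehand.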

[Archive-access-cvs] archive-access/src/docs/warc warc_file_format.xml,1.8,1.9

From: John A. K. <joh...@us...> - 2005-08-26 22:29:48

Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv6493

Modified Files:
	warc_file_format.xml 
Log Message:
tinkered with section 9.3 (GZIP WARC File Extension) for clarity

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** warc_file_format.xml	24 Aug 2005 01:39:50 -0000	1.8
--- warc_file_format.xml	26 Aug 2005 22:29:40 -0000	1.9
***************
*** 945,956 ****
     </section>
  
!    <section title="GZIP WARC File Extension">
  
! <t>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure they are properly recognized by GZIP tools, they
! should only get the customary additional ".gz" file extension suffix,
! making their suffix ".warc.gz". Software which works with WARC files
! compressed using these conventions will detect and exploit them; other
! GZIP software will harmlessly ignore the extensions.</t>
  
     </section>
--- 945,956 ----
     </section>
  
!    <section title="GZIP WARC File Name Suffix">
  
! <t>A WARC file compressed with the extra GZIP field conventions described
! in this document is a legal GZIP file.  To ensure that it is properly
! recognized by GZIP tools, its name should have the customary ".gz"
! appended to it, making the complete suffix, ".warc.gz".
! GZIP software that does not recognize the extra GZIP fields will
! simply pass over them without benefit or harm.</t>
  
     </section>
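
Editor's note: the revised paragraph above says that GZIP software which does not recognize the extra GZIP fields will simply pass over them. The following is a minimal sketch of that behavior, not the WARC draft's actual field layout: the subfield ID "XX", its 4-byte payload, and the placeholder record bytes are all assumptions made for illustration. It builds one GZIP member by hand with the FEXTRA flag set (per RFC 1952) and reads it back with Python's standard gzip module.

```python
import gzip
import struct
import zlib


def gzip_member_with_extra(payload: bytes, extra: bytes) -> bytes:
    """Build a single GZIP member whose header carries an FEXTRA field."""
    header = struct.pack(
        "<BBBBIBB",
        0x1F, 0x8B,  # GZIP magic bytes
        8,           # CM: deflate
        0x04,        # FLG: only FEXTRA set
        0,           # MTIME: 0 means "no timestamp available"
        0,           # XFL
        255,         # OS: unknown
    )
    # XLEN (little-endian, 2 bytes) followed by the extra field itself.
    header += struct.pack("<H", len(extra)) + extra
    # Raw deflate stream (negative wbits suppresses the zlib wrapper).
    deflater = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = deflater.compress(payload) + deflater.flush()
    # Trailer: CRC-32 and length of the uncompressed data.
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + body + trailer


# Hypothetical subfield: ID "XX", then a 2-byte length, then 4 bytes of data.
subfield_data = struct.pack("<I", 12345)
extra = b"XX" + struct.pack("<H", len(subfield_data)) + subfield_data

first = b"first placeholder record\n"
second = b"second placeholder record\n"
member_one = gzip_member_with_extra(first, extra)
member_two = gzip_member_with_extra(second, extra)

# A generic GZIP reader skips the extra field and still recovers the data,
# including when several members are concatenated in one file.
recovered = gzip.decompress(member_one + member_two)
print(recovered == first + second)
```

Writing the two members to a file named with the ".warc.gz" suffix would let ordinary gzip tools treat it as any other multi-member GZIP file.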

[Archive-access-cvs] archive-access/projects/nutch/src/java/org/archive/access/nutch Arc2Segment.java,1.28.2.6,1.28.2.7

From: Doug C. <cu...@us...> - 2005-08-24 04:15:56

Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv22467/src/java/org/archive/access/nutch

Modified Files:
      Tag: mapred
	Arc2Segment.java 
Log Message:
Put task timeout in nutch-site.xml so that it is seen when tasktracker is started.

Index: Arc2Segment.java
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch/Arc2Segment.java,v
retrieving revision 1.28.2.6
retrieving revision 1.28.2.7
diff -C2 -d -r1.28.2.6 -r1.28.2.7
*** Arc2Segment.java	22 Aug 2005 18:18:34 -0000	1.28.2.6
--- Arc2Segment.java	24 Aug 2005 04:15:48 -0000	1.28.2.7
***************
*** 258,263 ****
      job.set(Fetcher.SEGMENT_NAME_KEY, segment.getName());
  
-     job.set("mapred.task.timeout", 60 * 60 * 1000); // 1 hour
- 
      job.setInputDir(arcUrlsDir);
      job.setMapperClass(Arc2Segment.class);
--- 258,261 ----

[Archive-access-cvs] archive-access/projects/nutch/conf nutch-site.xml,1.24.2.1,1.24.2.2

From: Doug C. <cu...@us...> - 2005-08-24 04:15:56

Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv22467/conf

Modified Files:
      Tag: mapred
	nutch-site.xml 
Log Message:
Put task timeout in nutch-site.xml so that it is seen when tasktracker is started.

Index: nutch-site.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml,v
retrieving revision 1.24.2.1
retrieving revision 1.24.2.2
diff -C2 -d -r1.24.2.1 -r1.24.2.2
*** nutch-site.xml	22 Aug 2005 18:18:34 -0000	1.24.2.1
--- nutch-site.xml	24 Aug 2005 04:15:48 -0000	1.24.2.2
***************
*** 49,52 ****
--- 49,57 ----
  </property>
  
+ <property>
+   <name>mapred.task.timeout</name>
+   <value>3600000</value>
+ </property>
+ 
  <!-- Override a few Nutch defaults -->
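
Editor's note: this commit and the Arc2Segment.java commit above move the same one-hour task timeout out of the job code (60 * 60 * 1000) and into the site configuration (3600000). As a quick check that the two values agree, here is a small sketch that parses the property fragment added in this diff. The helper name read_property is hypothetical and this is not how Nutch itself loads its configuration.

```python
import xml.etree.ElementTree as ET

# The <property> element exactly as added to nutch-site.xml in this diff.
PROPERTY_XML = """
<property>
  <name>mapred.task.timeout</name>
  <value>3600000</value>
</property>
"""


def read_property(xml_text):
    """Return (name, integer value) from a single <property> element."""
    prop = ET.fromstring(xml_text)
    return prop.findtext("name"), int(prop.findtext("value"))


name, timeout_ms = read_property(PROPERTY_XML)

# The value removed from Arc2Segment.java was written as 60 * 60 * 1000.
print(name, timeout_ms, timeout_ms == 60 * 60 * 1000)
```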

[Archive-access-cvs] archive-access/src/docs/warc warc_file_format.xml,1.7,1.8 warc_file_format.html,1.4,1.5

From: Michael S. <sta...@us...> - 2005-08-24 01:40:01

Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25656

Modified Files:
	warc_file_format.xml warc_file_format.html 
Log Message:
* warc_file_format.xml
    Added entity definition for mdash. Typos.  Fixed warcinfo example xml.


Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** warc_file_format.html	23 Aug 2005 17:35:41 -0000	1.4
--- warc_file_format.html	24 Aug 2005 01:39:51 -0000	1.5
***************
*** 411,417 ****
  </p>
  <p>Subsequent records contain content blocks that are either the
! direct result of a retrieval attempt &mdash; web pages, inline images,
  URL redirection information, DNS hostname lookup results, standalone
! files, etc. &mdash; or they are synthesized content blocks (e.g.,
  metadata, transformed content) that provide additional information
  about archived content. Any content block may contain arbitrary text
--- 411,417 ----
  </p>
  <p>Subsequent records contain content blocks that are either the
! direct result of a retrieval attempt &#8212; web pages, inline images,
  URL redirection information, DNS hostname lookup results, standalone
! files, etc. &#8212; or they are synthesized content blocks (e.g.,
  metadata, transformed content) that provide additional information
  about archived content. Any content block may contain arbitrary text
***************
*** 501,505 ****
  explain, or accompany a harvested resource, in ways not covered by
  other record types. A 'metadata' record will almost always refer to
! another record of another type, with hat other record holding original
  harvested or transformed content. (However, it is allowable for a
  'metadata' record to refer to any record type, including other
--- 501,505 ----
  explain, or accompany a harvested resource, in ways not covered by
  other record types. A 'metadata' record will almost always refer to
! another record of another type, with that other record holding original
  harvested or transformed content. (However, it is allowable for a
  'metadata' record to refer to any record type, including other
***************
*** 527,531 ****
  <p>A 'revisit' record should only be used when interpreting the record
  requires consulting a previous record; other record types should be
! preferred if the current record's is understandable standing
  alone. (It is not required that any revisit of a previously-visited
  URI use 'revisit', only those which refer back to other records.)
--- 527,531 ----
  <p>A 'revisit' record should only be used when interpreting the record
  requires consulting a previous record; other record types should be
! preferred if the current record is understandable standing
  alone. (It is not required that any revisit of a previously-visited
  URI use 'revisit', only those which refer back to other records.)
***************
*** 555,560 ****
  <p>A 'conversion' record contains an alternative version of another record's
  content that was created as the result of an archival
! process. Typically, this is used to hold content ransformations that
! maintain viability of content after widely available rendering ools
  for the originally stored format disappear. As needed, the original
  content may be migrated (transformed) to a more viable format in order
--- 555,560 ----
  <p>A 'conversion' record contains an alternative version of another record's
  content that was created as the result of an archival
! process. Typically, this is used to hold content transformations that
! maintain viability of content after widely available rendering tools
  for the originally stored format disappear. As needed, the original
  content may be migrated (transformed) to a more viable format in order
***************
*** 562,566 ****
  loss of information (intellectual content, look and feel, etc). Any
  number of transformation records may be created that reference a
! specific source record, which may itself contain ransformed
  content. Each transformation should result in a freestanding, complete
  record, with no dependency on survival of the original
--- 562,566 ----
  loss of information (intellectual content, look and feel, etc). Any
  number of transformation records may be created that reference a
! specific source record, which may itself contain transformed
  content. Each transformation should result in a freestanding, complete
  record, with no dependency on survival of the original
***************
*** 662,666 ****
  The number of octets in the record, starting with the first letter
  ("w") of the first token, through to the end of the content block 
! &mdash; not including the 2 record-ending newlines.  After proceeding 
  this many octets from that first character of the record header, there
  should be two newlines and either the beginning of a new record or the
--- 662,666 ----
  The number of octets in the record, starting with the first letter
  ("w") of the first token, through to the end of the content block 
! &#8212; not including the 2 record-ending newlines.  After proceeding 
  this many octets from that first character of the record header, there
  should be two newlines and either the beginning of a new record or the
***************
*** 688,697 ****
  <dd>
  The original URI whose collection gave rise to the information content
! in this record. In he context of web harvesting, this is the URI that
  was the target of a crawler's retrieval request. Indirectly, such as
  for a 'revisit', 'metadata', or 'conversion' record, it is a copy of
  the subject-uri appearing in the original record to which the newer
  record pertains. For a 'warcinfo' record, this parameter is given a
! synthesized value for the creation name of he WARC file, as a URI.
  
  <br />
--- 688,697 ----
  <dd>
  The original URI whose collection gave rise to the information content
! in this record. In the context of web harvesting, this is the URI that
  was the target of a crawler's retrieval request. Indirectly, such as
  for a 'revisit', 'metadata', or 'conversion' record, it is a copy of
  the subject-uri appearing in the original record to which the newer
  record pertains. For a 'warcinfo' record, this parameter is given a
! synthesized value for the creation name of the WARC file, as a URI.
  
  <br />
***************
*** 820,824 ****
  A potential strategy, after choosing one record to be primary, is to
  extend its record-id as described in the Appendix about record-id
! considerations. This creates satellite record- ids for related records
  that contain the primary record-id as an initial substring, which
  greatly optimizes the detection (and in some cases derivation) of
--- 820,824 ----
  A potential strategy, after choosing one record to be primary, is to
  extend its record-id as described in the Appendix about record-id
! considerations. This creates satellite record-ids for related records
  that contain the primary record-id as an initial substring, which
  greatly optimizes the detection (and in some cases derivation) of
***************
*** 846,850 ****
  When present, indicates that the current record ends before the
  apparent end of the source material, but no continuation records are
! forthcoming. Possible values indicate he reason for the truncation:
  'length' for exceeding a desired length limit; 'time' for exceeding a
  desired time limit during collection.
--- 846,850 ----
  When present, indicates that the current record ends before the
  apparent end of the source material, but no continuation records are
! forthcoming. Possible values indicate the reason for the truncation:
  'length' for exceeding a desired length limit; 'time' for exceeding a
  desired time limit during collection.
***************
*** 883,887 ****
  allow records to be written without know their ultimate length, with
  only a small fixed-size edit to the header when the length is
! eventually know to complete the record. This named-field-based
  mechanism does not allow a later discovery that a record needs
  truncation or segmentation to be reflected via a small header edit; it
--- 883,887 ----
  allow records to be written without know their ultimate length, with
  only a small fixed-size edit to the header when the length is
! eventually known to complete the record. This named-field-based
  mechanism does not allow a later discovery that a record needs
  truncation or segmentation to be reflected via a small header edit; it
***************
*** 917,921 ****
  <p>All subsequent segments must have a record type of 'continuation',
  with an incremented 'Segment-Number' field. They must also include a
! 'Segment-Origin-ID' field with a value of he Record-ID of the record
  containing the first segment of the set. All segments of a set must
  have identical subject-uri parameters.
--- 917,921 ----
  <p>All subsequent segments must have a record type of 'continuation',
  with an incremented 'Segment-Number' field. They must also include a
! 'Segment-Origin-ID' field with a value of the Record-ID of the record
  containing the first segment of the set. All segments of a set must
  have identical subject-uri parameters.
***************
*** 1008,1012 ****
  <p>Any resource that can be identified with a URI, even if it is not
  retrieved via an Internet operation, may be archived in a WARC file
! under a 'resource' type record. This includes files hat have
  meaningful URIs retrieved from a locally-accessible filesystem or
  other repository.
--- 1008,1012 ----
  <p>Any resource that can be identified with a URI, even if it is not
  retrieved via an Internet operation, may be archived in a WARC file
! under a 'resource' type record. This includes files that have
  meaningful URIs retrieved from a locally-accessible filesystem or
  other repository.
***************
*** 1033,1037 ****
  </p>
  <p>However, experience with the precursor ARC format at the Internet
! Archive has demonstrated hat applying simple standard compression can
  result in significant storage savings, while preserving random access
  to individual records.
--- 1033,1037 ----
  </p>
  <p>However, experience with the precursor ARC format at the Internet
! Archive has demonstrated that applying simple standard compression can
  result in significant storage savings, while preserving random access
  to individual records.
***************
*** 1075,1082 ****
  <p>Customarily, GZIP members do not declare their compressed
  length. This presents a problem for WARC processing which, after
! reading a small portion of a record, would like to skip to he next
  full record. In the absence of an external, precalculated index, using
  only the WARC record's uncompressed length would require the complete
! current record to be decompressed o find the start of the next
  record.
  </p>
--- 1075,1082 ----
  <p>Customarily, GZIP members do not declare their compressed
  length. This presents a problem for WARC processing which, after
! reading a small portion of a record, would like to skip to the next
  full record. In the absence of an external, precalculated index, using
  only the WARC record's uncompressed length would require the complete
! current record to be decompressed to find the start of the next
  record.
  </p>
***************
*** 1116,1120 ****
  
  <p>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure hey are properly recognized by GZIP tools, they
  should only get the customary additional ".gz" file extension suffix,
  making their suffix ".warc.gz". Software which works with WARC files
--- 1116,1120 ----
  
  <p>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure they are properly recognized by GZIP tools, they
  should only get the customary additional ".gz" file extension suffix,
  making their suffix ".warc.gz". Software which works with WARC files
***************
*** 1134,1138 ****
  </p>
  <p>Prefix is an abbreviation usually reflective of the project or
! crawl that created this file.  imestamp is a 14-digit GMT timestamp
  indicating the time the file was initially begun. Serial is an
  increasing serial-number within the process creating the files, often
--- 1134,1138 ----
  </p>
  <p>Prefix is an abbreviation usually reflective of the project or
! crawl that created this file.  Timestamp is a 14-digit GMT timestamp
  indicating the time the file was initially begun. Serial is an
  increasing serial-number within the process creating the files, often
***************
*** 1148,1152 ****
  <p>This specification does not require any particular WARC file naming
  practice, but recommends conventions similar to the above be adopted
! within WARC-creating institutions.  he file name prefix "iipc" should
  be avoided unless participating in the IIPC naming registry.
  </p>
--- 1148,1152 ----
  <p>This specification does not require any particular WARC file naming
  practice, but recommends conventions similar to the above be adopted
! within WARC-creating institutions.  The file name prefix "iipc" should
  be avoided unless participating in the IIPC naming registry.
  </p>
***************
*** 1212,1216 ****
  
  <p>After IESG approval, IANA is expected to register the WARC type
! "application/warc" using he application provided in this document.
  </p>
  <a name="anchor30"></a><br /><hr />
--- 1212,1216 ----
  
  <p>After IESG approval, IANA is expected to register the WARC type
! "application/warc" using the application provided in this document.
  </p>
  <a name="anchor30"></a><br /><hr />
***************
*** 1219,1223 ****
  
  <p>This document could not have been written without major
! contributions from participants of he International Internet
  Preservation Consortium, especially Steen Christensen, and Julien
  Masanes.
--- 1219,1223 ----
  
  <p>This document could not have been written without major
! contributions from participants of the International Internet
  Preservation Consortium, especially Steen Christensen, and Julien
  Masanes.
***************
*** 1246,1250 ****
  blocks. Although the 'Related-Record-ID' parameter required of
  'metadata', 'revisit', and 'conversion' records is sufficient to
! convey relatedness in he context of a single WARC file, great
  optimization can be had when relatedness can be inferred by third
  parties through identifier comparison rather than by lookup in a
--- 1246,1250 ----
  blocks. Although the 'Related-Record-ID' parameter required of
  'metadata', 'revisit', and 'conversion' records is sufficient to
! convey relatedness in the context of a single WARC file, great
  optimization can be had when relatedness can be inferred by third
  parties through identifier comparison rather than by lookup in a
***************
*** 1305,1312 ****
  
  &lt;?xml version="1.0" encoding="UTF-8" standalone="yes"?&gt;
! &lt;warcmetadata&gt;
! xmlns:dc="http://purl.org/dc/elements/1.1/"
! xmlns:dcterms="http://purl.org/dc/terms/"
! xmlns:warc="http://archive.org/warc/0.8/"&gt;
  &lt;warc:software&gt;
  Heritrix 1.4.0 http://crawler.archive.org
--- 1305,1312 ----
  
  &lt;?xml version="1.0" encoding="UTF-8" standalone="yes"?&gt;
! &lt;warcmetadata
!     xmlns:dc="http://purl.org/dc/elements/1.1/"
!     xmlns:dcterms="http://purl.org/dc/terms/"
!     xmlns:warc="http://archive.org/warc/0.8/"&gt;
  &lt;warc:software&gt;
  Heritrix 1.4.0 http://crawler.archive.org
***************
*** 1321,1325 ****
  &lt;/warc:http-header-user-agent&gt;
  &lt;dc:format&gt;WARC file version 0.8&lt;/dc:format&gt;
! &lt;dcterms:conformsTo nxsi:type="dcterms:URI"&gt;
  http://www.archive.org/documents/WarcFileFormat.php
  &lt;/dcterms:conformsTo&gt;
--- 1321,1325 ----
  &lt;/warc:http-header-user-agent&gt;
  &lt;dc:format&gt;WARC file version 0.8&lt;/dc:format&gt;
! &lt;dcterms:conformsTo xsi:type="dcterms:URI"&gt;
  http://www.archive.org/documents/WarcFileFormat.php
  &lt;/dcterms:conformsTo&gt;
***************
*** 1446,1454 ****
  </pre>
  <p>Again, reference is made back to the original 'response' record. A
! new creation-date reflects he time of revisit. This content block
  hypothesizes including header excerpts from a server response to
  explain the results of the revisit. (In this case, the remote server
  indicated the resource was unchanged from the previous 'Etag' value.)
! The actual formats for describing he result of a revisit remain to be
  defined.
  </p>
--- 1446,1454 ----
  </pre>
  <p>Again, reference is made back to the original 'response' record. A
! new creation-date reflects the time of revisit. This content block
  hypothesizes including header excerpts from a server response to
  explain the results of the revisit. (In this case, the remote server
  indicated the resource was unchanged from the previous 'Etag' value.)
! The actual formats for describing the result of a revisit remain to be
  defined.
  </p>

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** warc_file_format.xml	23 Aug 2005 17:35:41 -0000	1.7
--- warc_file_format.xml	24 Aug 2005 01:39:50 -0000	1.8
***************
*** 2,5 ****
--- 2,7 ----
  <!DOCTYPE rfc SYSTEM 'rfcXXXX.dtd' [
  
+   <!ENTITY mdash '&#8212;' >
+ 
    <!ENTITY rfc0822 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0822.xml'>
    <!ENTITY rfc1034 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1034.xml'>
***************
*** 349,353 ****
  explain, or accompany a harvested resource, in ways not covered by
  other record types. A 'metadata' record will almost always refer to
! another record of another type, with hat other record holding original
  harvested or transformed content. (However, it is allowable for a
  'metadata' record to refer to any record type, including other
--- 351,355 ----
  explain, or accompany a harvested resource, in ways not covered by
  other record types. A 'metadata' record will almost always refer to
! another record of another type, with that other record holding original
  harvested or transformed content. (However, it is allowable for a
  'metadata' record to refer to any record type, including other
***************
*** 375,379 ****
  <t>A 'revisit' record should only be used when interpreting the record
  requires consulting a previous record; other record types should be
! preferred if the current record's is understandable standing
  alone. (It is not required that any revisit of a previously-visited
  URI use 'revisit', only those which refer back to other records.)</t>
--- 377,381 ----
  <t>A 'revisit' record should only be used when interpreting the record
  requires consulting a previous record; other record types should be
! preferred if the current record is understandable standing
  alone. (It is not required that any revisit of a previously-visited
  URI use 'revisit', only those which refer back to other records.)</t>
***************
*** 403,408 ****
  <t>A 'conversion' record contains an alternative version of another record's
  content that was created as the result of an archival
! process. Typically, this is used to hold content ransformations that
! maintain viability of content after widely available rendering ools
  for the originally stored format disappear. As needed, the original
  content may be migrated (transformed) to a more viable format in order
--- 405,410 ----
  <t>A 'conversion' record contains an alternative version of another record's
  content that was created as the result of an archival
! process. Typically, this is used to hold content transformations that
! maintain viability of content after widely available rendering tools
  for the originally stored format disappear. As needed, the original
  content may be migrated (transformed) to a more viable format in order
***************
*** 410,414 ****
  loss of information (intellectual content, look and feel, etc). Any
  number of transformation records may be created that reference a
! specific source record, which may itself contain ransformed
  content. Each transformation should result in a freestanding, complete
  record, with no dependency on survival of the original
--- 412,416 ----
  loss of information (intellectual content, look and feel, etc). Any
  number of transformation records may be created that reference a
! specific source record, which may itself contain transformed
  content. Each transformation should result in a freestanding, complete
  record, with no dependency on survival of the original
***************
*** 535,544 ****
   <t hangText="subject-uri">
  The original URI whose collection gave rise to the information content
! in this record. In he context of web harvesting, this is the URI that
  was the target of a crawler's retrieval request. Indirectly, such as
  for a 'revisit', 'metadata', or 'conversion' record, it is a copy of
  the subject-uri appearing in the original record to which the newer
  record pertains. For a 'warcinfo' record, this parameter is given a
! synthesized value for the creation name of he WARC file, as a URI.
  
  <vspace blankLines="2" />
--- 537,546 ----
   <t hangText="subject-uri">
  The original URI whose collection gave rise to the information content
! in this record. In the context of web harvesting, this is the URI that
  was the target of a crawler's retrieval request. Indirectly, such as
  for a 'revisit', 'metadata', or 'conversion' record, it is a copy of
  the subject-uri appearing in the original record to which the newer
  record pertains. For a 'warcinfo' record, this parameter is given a
! synthesized value for the creation name of the WARC file, as a URI.
  
  <vspace blankLines="2" />
***************
*** 650,654 ****
  A potential strategy, after choosing one record to be primary, is to
  extend its record-id as described in the Appendix about record-id
! considerations. This creates satellite record- ids for related records
  that contain the primary record-id as an initial substring, which
  greatly optimizes the detection (and in some cases derivation) of
--- 652,656 ----
  A potential strategy, after choosing one record to be primary, is to
  extend its record-id as described in the Appendix about record-id
! considerations. This creates satellite record-ids for related records
  that contain the primary record-id as an initial substring, which
  greatly optimizes the detection (and in some cases derivation) of
***************
*** 673,677 ****
  When present, indicates that the current record ends before the
  apparent end of the source material, but no continuation records are
! forthcoming. Possible values indicate he reason for the truncation:
  'length' for exceeding a desired length limit; 'time' for exceeding a
  desired time limit during collection.
--- 675,679 ----
  When present, indicates that the current record ends before the
  apparent end of the source material, but no continuation records are
! forthcoming. Possible values indicate the reason for the truncation:
  'length' for exceeding a desired length limit; 'time' for exceeding a
  desired time limit during collection.
***************
*** 713,717 ****
  allow records to be written without know their ultimate length, with
  only a small fixed-size edit to the header when the length is
! eventually know to complete the record. This named-field-based
  mechanism does not allow a later discovery that a record needs
  truncation or segmentation to be reflected via a small header edit; it
--- 715,719 ----
  allow records to be written without know their ultimate length, with
  only a small fixed-size edit to the header when the length is
! eventually known to complete the record. This named-field-based
  mechanism does not allow a later discovery that a record needs
  truncation or segmentation to be reflected via a small header edit; it
***************
*** 745,749 ****
  <t>All subsequent segments must have a record type of 'continuation',
  with an incremented 'Segment-Number' field. They must also include a
! 'Segment-Origin-ID' field with a value of he Record-ID of the record
  containing the first segment of the set. All segments of a set must
  have identical subject-uri parameters.</t>
--- 747,751 ----
  <t>All subsequent segments must have a record type of 'continuation',
  with an incremented 'Segment-Number' field. They must also include a
! 'Segment-Origin-ID' field with a value of the Record-ID of the record
  containing the first segment of the set. All segments of a set must
  have identical subject-uri parameters.</t>
***************
*** 838,842 ****
  <t>Any resource that can be identified with a URI, even if it is not
  retrieved via an Internet operation, may be archived in a WARC file
! under a 'resource' type record. This includes files hat have
  meaningful URIs retrieved from a locally-accessible filesystem or
  other repository.</t>
--- 840,844 ----
  <t>Any resource that can be identified with a URI, even if it is not
  retrieved via an Internet operation, may be archived in a WARC file
! under a 'resource' type record. This includes files that have
  meaningful URIs retrieved from a locally-accessible filesystem or
  other repository.</t>
***************
*** 865,869 ****
  
  <t>However, experience with the precursor ARC format at the Internet
! Archive has demonstrated hat applying simple standard compression can
  result in significant storage savings, while preserving random access
  to individual records.</t>
--- 867,871 ----
  
  <t>However, experience with the precursor ARC format at the Internet
! Archive has demonstrated that applying simple standard compression can
  result in significant storage savings, while preserving random access
  to individual records.</t>
***************
*** 905,912 ****
  <t>Customarily, GZIP members do not declare their compressed
  length. This presents a problem for WARC processing which, after
! reading a small portion of a record, would like to skip to he next
  full record. In the absence of an external, precalculated index, using
  only the WARC record's uncompressed length would require the complete
! current record to be decompressed o find the start of the next
  record.</t>
  
--- 907,914 ----
  <t>Customarily, GZIP members do not declare their compressed
  length. This presents a problem for WARC processing which, after
! reading a small portion of a record, would like to skip to the next
  full record. In the absence of an external, precalculated index, using
  only the WARC record's uncompressed length would require the complete
! current record to be decompressed to find the start of the next
  record.</t>
  
***************
*** 946,950 ****
  
  <t>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure hey are properly recognized by GZIP tools, they
  should only get the customary additional ".gz" file extension suffix,
  making their suffix ".warc.gz". Software which works with WARC files
--- 948,952 ----
  
  <t>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure they are properly recognized by GZIP tools, they
  should only get the customary additional ".gz" file extension suffix,
  making their suffix ".warc.gz". Software which works with WARC files
***************
*** 966,970 ****
  
  <t>Prefix is an abbreviation usually reflective of the project or
! crawl that created this file.  imestamp is a 14-digit GMT timestamp
  indicating the time the file was initially begun. Serial is an
  increasing serial-number within the process creating the files, often
--- 968,972 ----
  
  <t>Prefix is an abbreviation usually reflective of the project or
! crawl that created this file.  Timestamp is a 14-digit GMT timestamp
  indicating the time the file was initially begun. Serial is an
  increasing serial-number within the process creating the files, often
***************
*** 980,984 ****
  <t>This specification does not require any particular WARC file naming
  practice, but recommends conventions similar to the above be adopted
! within WARC-creating institutions.  he file name prefix "iipc" should
  be avoided unless participating in the IIPC naming registry.</t>
  
--- 982,986 ----
  <t>This specification does not require any particular WARC file naming
  practice, but recommends conventions similar to the above be adopted
! within WARC-creating institutions.  The file name prefix "iipc" should
  be avoided unless participating in the IIPC naming registry.</t>
  
***************
*** 1044,1048 ****
  
  <t>After IESG approval, IANA is expected to register the WARC type
! "application/warc" using he application provided in this document.</t>
  
    </section>
--- 1046,1050 ----
  
  <t>After IESG approval, IANA is expected to register the WARC type
! "application/warc" using the application provided in this document.</t>
  
    </section>
***************
*** 1051,1055 ****
  
  <t>This document could not have been written without major
! contributions from participants of he International Internet
  Preservation Consortium, especially Steen Christensen, and Julien
  Masanes.</t>
--- 1053,1057 ----
  
  <t>This document could not have been written without major
! contributions from participants of the International Internet
  Preservation Consortium, especially Steen Christensen, and Julien
  Masanes.</t>
***************
*** 1078,1082 ****
  blocks. Although the 'Related-Record-ID' parameter required of
  'metadata', 'revisit', and 'conversion' records is sufficient to
! convey relatedness in he context of a single WARC file, great
  optimization can be had when relatedness can be inferred by third
  parties through identifier comparison rather than by lookup in a
--- 1080,1084 ----
  blocks. Although the 'Related-Record-ID' parameter required of
  'metadata', 'revisit', and 'conversion' records is sufficient to
! convey relatedness in the context of a single WARC file, great
  optimization can be had when relatedness can be inferred by third
  parties through identifier comparison rather than by lookup in a
***************
*** 1141,1148 ****
  
  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
! <warcmetadata>
! xmlns:dc="http://purl.org/dc/elements/1.1/"
! xmlns:dcterms="http://purl.org/dc/terms/"
! xmlns:warc="http://archive.org/warc/0.8/">
  <warc:software>
  Heritrix 1.4.0 http://crawler.archive.org
--- 1143,1150 ----
  
  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
! <warcmetadata
!     xmlns:dc="http://purl.org/dc/elements/1.1/"
!     xmlns:dcterms="http://purl.org/dc/terms/"
!     xmlns:warc="http://archive.org/warc/0.8/">
  <warc:software>
  Heritrix 1.4.0 http://crawler.archive.org
***************
*** 1157,1161 ****
  </warc:http-header-user-agent>
  <dc:format>WARC file version 0.8</dc:format>
! <dcterms:conformsTo nxsi:type="dcterms:URI">
  http://www.archive.org/documents/WarcFileFormat.php
  </dcterms:conformsTo>
--- 1159,1163 ----
  </warc:http-header-user-agent>
  <dc:format>WARC file version 0.8</dc:format>
! <dcterms:conformsTo xsi:type="dcterms:URI">
  http://www.archive.org/documents/WarcFileFormat.php
  </dcterms:conformsTo>
***************
*** 1304,1312 ****
  
  <t>Again, reference is made back to the original 'response' record. A
! new creation-date reflects he time of revisit. This content block
  hypothesizes including header excerpts from a server response to
  explain the results of the revisit. (In this case, the remote server
  indicated the resource was unchanged from the previous 'Etag' value.)
! The actual formats for describing he result of a revisit remain to be
  defined.</t>
  
--- 1306,1314 ----
  
  <t>Again, reference is made back to the original 'response' record. A
! new creation-date reflects the time of revisit. This content block
  hypothesizes including header excerpts from a server response to
  explain the results of the revisit. (In this case, the remote server
  indicated the resource was unchanged from the previous 'Etag' value.)
! The actual formats for describing the result of a revisit remain to be
  defined.</t>
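
Note on the 'warcmetadata' hunk above: in the old revision the start tag was closed before the xmlns declarations, so those lines became character data and the "warc:" prefix was never bound. A namespace-aware parser should reject that form and accept the corrected one. The following is only a minimal sketch, assuming Python's standard xml.etree.ElementTree; the two sample documents are trimmed from the example in the diff, and the closing </warcmetadata> tag and the helper names (parses, software_of) are assumptions added for illustration, not part of the commit.

```python
import xml.etree.ElementTree as ET

WARC_NS = "http://archive.org/warc/0.8/"

# Old revision: the declarations follow the ">" of the start tag, so they are
# plain text and the warc: prefix used below is never declared.
OLD_FORM = """<warcmetadata>
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:warc="http://archive.org/warc/0.8/">
<warc:software>
Heritrix 1.4.0 http://crawler.archive.org
</warc:software>
</warcmetadata>"""

# New revision: the declarations are attributes of the start tag.
NEW_FORM = """<warcmetadata
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:warc="http://archive.org/warc/0.8/">
<warc:software>
Heritrix 1.4.0 http://crawler.archive.org
</warc:software>
</warcmetadata>"""


def parses(document: str) -> bool:
    """Return True if the document is accepted by the namespace-aware parser."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


def software_of(document: str) -> str:
    """Return the trimmed text of the warc:software child element."""
    root = ET.fromstring(document)
    node = root.find(f"{{{WARC_NS}}}software")
    return node.text.strip()


if __name__ == "__main__":
    print("old form parses:", parses(OLD_FORM))
    print("new form parses:", parses(NEW_FORM))
    print("software:", software_of(NEW_FORM))
```

The sample deliberately leaves out the dcterms:conformsTo element, since the excerpt does not show where the "xsi" prefix is declared.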

[Archive-access-cvs] archive-access/src/docs/warc warc_file_format.html,1.3,1.4 warc_file_format.txt,1.3,1.4 warc_file_format.xml,1.6,1.7

From: John A. K. <joh...@us...> - 2005-08-23 17:36:04

Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30443

Modified Files:
	warc_file_format.html warc_file_format.txt 
	warc_file_format.xml 
Log Message:
trivial changes (typos) plus test of xml2rfc-1.30 outputs

Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** warc_file_format.html	18 Aug 2005 01:57:10 -0000	1.3
--- warc_file_format.html	23 Aug 2005 17:35:41 -0000	1.4
***************
*** 3,7 ****
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="description" content="The WARC File Format (Version 0.8 rev B)">
! <meta name="generator" content="xml2rfc v1.29 (http://xml.resource.org/)">
  <style type='text/css'>
  <!--
--- 3,7 ----
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="description" content="The WARC File Format (Version 0.8 rev B)">
! <meta name="generator" content="xml2rfc v1.30 (http://xml.resource.org/)">
  <style type='text/css'>
  <!--
***************
*** 28,32 ****
          font-family: charcoal, monaco, geneva, "MS Sans Serif", helvetica, verdana, sans-serif;
          font-size: x-small ; background-color: #000000; }
! /* info code from SantaKlauss at http://www.madaboutstyle.com/tooltip2.html */
      div#counter{margin-top: 100px}
  
--- 28,32 ----
          font-family: charcoal, monaco, geneva, "MS Sans Serif", helvetica, verdana, sans-serif;
          font-size: x-small ; background-color: #000000; }
!     /* info code from SantaKlauss at http://www.madaboutstyle.com/tooltip2.html */
      div#counter{margin-top: 100px}
  
***************
*** 58,61 ****
--- 58,63 ----
      p.copyright { font-size: x-small ; }
      p.toc { font-size: small ; font-weight: bold ; margin-left: 3em ;}
+     table.toc { margin: 0 0 0 3em; padding: 0; border: 0; vertical-align: text-top; }
+     td.toc { font-size: small; font-weight: bold; vertical-align: text-top; }
  
      span.emph { font-style: italic; }
***************
*** 95,108 ****
      td.author { font-weight: bold; margin-left: 4em; font-size: x-small ; }
      td.author-text { font-size: x-small; }
!     table.data { vertical-align: top ; border-collapse: collapse ;
          border-style: solid solid solid solid ;
          border-color: black black black black ;
          font-size: small ; text-align: center ; }
!     table.data th { font-weight: bold ;
!         border-style: solid solid solid solid ;
          border-color: black black black black ; }
!     table.data td {
          border-style: solid solid solid solid ;
          border-color: #333333 #333333 #333333 #333333 ; }
  
      hr { height: 1px }
--- 97,119 ----
      td.author { font-weight: bold; margin-left: 4em; font-size: x-small ; }
      td.author-text { font-size: x-small; }
!     table.full { vertical-align: top ; border-collapse: collapse ;
          border-style: solid solid solid solid ;
          border-color: black black black black ;
          font-size: small ; text-align: center ; }
!     table.headers, table.none { vertical-align: top ; border-collapse: collapse ;
!         border-style: none;
!         font-size: small ; text-align: center ; }
!     table.full th { font-weight: bold ;
!         border-style: solid ;
          border-color: black black black black ; }
!     table.headers th { font-weight: bold ;
!         border-style: none none solid none;
!         border-color: black black black black ; }
!     table.none th { font-weight: bold ;
!         border-style: none; }
!     table.full td {
          border-style: solid solid solid solid ;
          border-color: #333333 #333333 #333333 #333333 ; }
+     table.headers td, table.none td { border-style: none; }
  
      hr { height: 1px }
***************
*** 178,202 ****
  <a href="#record_types">4.</a>&nbsp;
  Record Types<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor4">4.1</a>&nbsp;
  'warcinfo'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor5">4.2</a>&nbsp;
  'response'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor6">4.3</a>&nbsp;
  'resource'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor7">4.4</a>&nbsp;
  'request'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor8">4.5</a>&nbsp;
  'metadata'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor9">4.6</a>&nbsp;
  'revisit'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor10">4.7</a>&nbsp;
  'conversion'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor11">4.8</a>&nbsp;
  'continuation'<br />
  <a href="#anchor12">5.</a>&nbsp;
  Record Header<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor13">5.1</a>&nbsp;
  Positional Parameters<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor14">5.2</a>&nbsp;
  Named Parameters<br />
  <a href="#anchor15">6.</a>&nbsp;
--- 189,213 ----
  <a href="#record_types">4.</a>&nbsp;
  Record Types<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor4">4.1.</a>&nbsp;
  'warcinfo'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor5">4.2.</a>&nbsp;
  'response'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor6">4.3.</a>&nbsp;
  'resource'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor7">4.4.</a>&nbsp;
  'request'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor8">4.5.</a>&nbsp;
  'metadata'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor9">4.6.</a>&nbsp;
  'revisit'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor10">4.7.</a>&nbsp;
  'conversion'<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor11">4.8.</a>&nbsp;
  'continuation'<br />
  <a href="#anchor12">5.</a>&nbsp;
  Record Header<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor13">5.1.</a>&nbsp;
  Positional Parameters<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor14">5.2.</a>&nbsp;
  Named Parameters<br />
  <a href="#anchor15">6.</a>&nbsp;
***************
*** 204,226 ****
  <a href="#anchor16">7.</a>&nbsp;
  Truncated and Segmented Records<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor17">7.1</a>&nbsp;
  Record Truncation<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor18">7.2</a>&nbsp;
  Record Segmentation<br />
  <a href="#anchor19">8.</a>&nbsp;
  WARC Application to Specific Protocols<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor20">8.1</a>&nbsp;
  HTTP and HTTPS<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor21">8.2</a>&nbsp;
  DNS<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor22">8.3</a>&nbsp;
  Other Resources with URIs, and Other Protocols<br />
  <a href="#anchor23">9.</a>&nbsp;
  Compression Recommendations<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor24">9.1</a>&nbsp;
  Record-at-a-time Compression<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor25">9.2</a>&nbsp;
  GZIP extra field: skip-lengths ('sl')<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor26">9.3</a>&nbsp;
  GZIP WARC File Extension<br />
  <a href="#anchor27">10.</a>&nbsp;
--- 215,237 ----
  <a href="#anchor16">7.</a>&nbsp;
  Truncated and Segmented Records<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor17">7.1.</a>&nbsp;
  Record Truncation<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor18">7.2.</a>&nbsp;
  Record Segmentation<br />
  <a href="#anchor19">8.</a>&nbsp;
  WARC Application to Specific Protocols<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor20">8.1.</a>&nbsp;
  HTTP and HTTPS<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor21">8.2.</a>&nbsp;
  DNS<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor22">8.3.</a>&nbsp;
  Other Resources with URIs, and Other Protocols<br />
  <a href="#anchor23">9.</a>&nbsp;
  Compression Recommendations<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor24">9.1.</a>&nbsp;
  Record-at-a-time Compression<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor25">9.2.</a>&nbsp;
  GZIP extra field: skip-lengths ('sl')<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor26">9.3.</a>&nbsp;
  GZIP WARC File Extension<br />
  <a href="#anchor27">10.</a>&nbsp;
***************
*** 232,254 ****
  <a href="#anchor30">13.</a>&nbsp;
  Acknowledgements<br />
! <a href="#anchor31">A.</a>&nbsp;
  Consideratons in Choice of record-id<br />
! <a href="#anchor32">B.</a>&nbsp;
  Examples of WARC Records<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor33">B.1</a>&nbsp;
  Example of 'warcinfo' Record<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor34">B.2</a>&nbsp;
  Example of 'request' Record<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor35">B.3</a>&nbsp;
  Example of 'response' Record<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor36">B.4</a>&nbsp;
  Example of 'resource' Record<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor37">B.5</a>&nbsp;
  Example of 'metadata' Record<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor38">B.6</a>&nbsp;
  Example of 'revisit' Record<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor39">B.7</a>&nbsp;
  Example of 'conversion' Record<br />
! &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor40">B.8</a>&nbsp;
  Example of 'continuation' Record<br />
  <a href="#rfc.references1">14.</a>&nbsp;
--- 243,265 ----
  <a href="#anchor30">13.</a>&nbsp;
  Acknowledgements<br />
! <a href="#anchor31">Appendix&nbsp;A.</a>&nbsp;
  Consideratons in Choice of record-id<br />
! <a href="#anchor32">Appendix&nbsp;B.</a>&nbsp;
  Examples of WARC Records<br />
! <a href="#anchor33">Appendix&nbsp;B.1.</a>&nbsp;
  Example of 'warcinfo' Record<br />
! <a href="#anchor34">Appendix&nbsp;B.2.</a>&nbsp;
  Example of 'request' Record<br />
! <a href="#anchor35">Appendix&nbsp;B.3.</a>&nbsp;
  Example of 'response' Record<br />
! <a href="#anchor36">Appendix&nbsp;B.4.</a>&nbsp;
  Example of 'resource' Record<br />
! <a href="#anchor37">Appendix&nbsp;B.5.</a>&nbsp;
  Example of 'metadata' Record<br />
! <a href="#anchor38">Appendix&nbsp;B.6.</a>&nbsp;
  Example of 'revisit' Record<br />
! <a href="#anchor39">Appendix&nbsp;B.7.</a>&nbsp;
  Example of 'conversion' Record<br />
! <a href="#anchor40">Appendix&nbsp;B.8.</a>&nbsp;
  Example of 'continuation' Record<br />
  <a href="#rfc.references1">14.</a>&nbsp;
***************
*** 269,273 ****
  simple text headers and an arbitary data block into one long file. The
  WARC format is a revision of the <a class="info" href="#ARC">ARC File
! Format<span> (</span><span class="info">Burner, M. and B. Kahle, &ldquo;The ARC File Format,&rdquo; September&nbsp;1996.</span><span>)</span></a>[ARC] format that has traditionally been used to store "web
  crawls" as sequences of content blocks harvested from the World Wide
  Web.
--- 280,284 ----
  simple text headers and an arbitary data block into one long file. The
  WARC format is a revision of the <a class="info" href="#ARC">ARC File
! Format<span> (</span><span class="info">Burner, M. and B. Kahle, &ldquo;The ARC File Format,&rdquo; September&nbsp;1996.</span><span>)</span></a> [ARC] format that has traditionally been used to store "web
  crawls" as sequences of content blocks harvested from the World Wide
  Web.
***************
*** 276,294 ****
  Archive (IA) to record a sequence of materials captured from the web
  (e.g., web "pages"). Each capture is preceded by a one-line header
! that very briey describes the harvested content and its length. This
  is directly followed by the the retrieval protocol response messages
  and content. The motivation to revise the format arose from the
  discussion and experiences of the <a class="info" href="#IIPC">International
! Internet Preservation Consortium (IIPC)<span> (</span><span class="info">, &ldquo;International Internet Preservation Consortium (IIPC),&rdquo; .</span><span>)</span></a>[IIPC], whose members include
  the IA and the national libraries of a dozen countries. The revised
! format is expected to become the primary output format of the
! open-source <a class="info" href="#HERITRIX">Heritrix<span> (</span><span class="info">, &ldquo;Heritrix Open Source Archival Web Crawler,&rdquo; .</span><span>)</span></a>[HERITRIX] web crawler, and
! the input format for a wide array of cataloguing and access tools.
  </p>
  <p>The WARC format generalizes the older format to better support the
! harvesting, display, and exchange needs of archiving
  organizations. Besides the primary content currently recorded, the
  revision accommodates related secondary content, such as assigned
! metadata, abbrieviated duplicate detection events, and later-date
  transformations. The revision may also be useful for more general
  applications than web archiving. To aid the development of tools that
--- 287,307 ----
  Archive (IA) to record a sequence of materials captured from the web
  (e.g., web "pages"). Each capture is preceded by a one-line header
! that very briefly describes the harvested content and its length. This
  is directly followed by the the retrieval protocol response messages
  and content. The motivation to revise the format arose from the
  discussion and experiences of the <a class="info" href="#IIPC">International
! Internet Preservation Consortium (IIPC)<span> (</span><span class="info">, &ldquo;International Internet Preservation Consortium (IIPC),&rdquo; .</span><span>)</span></a> [IIPC], whose members include
  the IA and the national libraries of a dozen countries. The revised
! format is expected to be a standard way to structure, manage and 
! store billions of collected web resources. For example, WARC will be 
! an output format of harvesting software, such as the open-source 
! <a class="info" href="#HERITRIX">Heritrix<span> (</span><span class="info">, &ldquo;Heritrix Open Source Archival Web Crawler,&rdquo; .</span><span>)</span></a> [HERITRIX] web crawler, and an input 
! format for a wide array of cataloguing and access tools.
  </p>
  <p>The WARC format generalizes the older format to better support the
! harvesting, access, and exchange needs of archiving
  organizations. Besides the primary content currently recorded, the
  revision accommodates related secondary content, such as assigned
! metadata, abbreviated duplicate detection events, and later-date
  transformations. The revision may also be useful for more general
  applications than web archiving. To aid the development of tools that
***************
*** 353,357 ****
    block       = *OCTET
  </pre>
- 
  <p>Elements of this grammar are further specified and explained in
  sections that follow (and in the case of <span class="emph">anvl-fields</span>, also a separate document).
--- 366,369 ----
***************
*** 367,371 ****
    tsp         = 1*WSP
  </pre>
- 
  <p>The amount of whitespace between <span class="emph">header-line</span> tokens is variable. This gives
  archive builders the flexibility to add padding and later adjust
--- 379,382 ----
***************
*** 375,379 ****
  </p>
  <p>After the <span class="emph">header-line</span> come any number of
! named fields in a line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, &ldquo;A Name-Value Language,&rdquo; .</span><span>)</span></a>[ANVL] that is very similar to that of email
  headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., &ldquo;Standard for the format of ARPA Internet text messages,&rdquo; August&nbsp;1982.</span><span>)</span></a>. Its format can be roughly summarized
  as the following:
--- 386,390 ----
  </p>
  <p>After the <span class="emph">header-line</span> come any number of
! named fields in a line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, &ldquo;A Name-Value Language,&rdquo; .</span><span>)</span></a> [ANVL] that is very similar to that of email
  headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., &ldquo;Standard for the format of ARPA Internet text messages,&rdquo; August&nbsp;1982.</span><span>)</span></a>. Its format can be roughly summarized
  as the following:
***************
*** 384,388 ****
    other-anvl  = &lt;see ANVL>
  </pre>
- 
  <p>This document defines a number of named fields which may appear in
  the <span class="emph">anvl-fields</span> area of the header. Note that
--- 395,398 ----
***************
*** 424,428 ****
  appropriate and how they can be standardized is warranted.]
  </p>
! <a name="rfc.section.4.1"></a><h4><a name="anchor4">4.1</a>&nbsp;'warcinfo'</h4>
  
  <p>A 'warcinfo' record describes the records that follow it, up through end of
--- 434,440 ----
  appropriate and how they can be standardized is warranted.]
  </p>
! <a name="anchor4"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.4.1"></a><h3>4.1.&nbsp;'warcinfo'</h3>
  
  <p>A 'warcinfo' record describes the records that follow it, up through end of
***************
*** 451,455 ****
  content block must be formally defined somewhere.]
  </p>
! <a name="rfc.section.4.2"></a><h4><a name="anchor5">4.2</a>&nbsp;'response'</h4>
  
  <p>A 'response' record contains an entire protocol response, such as a full
--- 463,469 ----
  content block must be formally defined somewhere.]
  </p>
! <a name="anchor5"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.4.2"></a><h3>4.2.&nbsp;'response'</h3>
  
  <p>A 'response' record contains an entire protocol response, such as a full
***************
*** 461,465 ****
  'IP-Address' and 'Related-Record-ID'.
  </p>
! <a name="rfc.section.4.3"></a><h4><a name="anchor6">4.3</a>&nbsp;'resource'</h4>
  
  <p>A 'resource' record contains a resource, without full protocol response
--- 475,481 ----
  'IP-Address' and 'Related-Record-ID'.
  </p>
! <a name="anchor6"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.4.3"></a><h3>4.3.&nbsp;'resource'</h3>
  
  <p>A 'resource' record contains a resource, without full protocol response
***************
*** 469,473 ****
  includes the named parameter 'Related-Record-ID'.
  </p>
! <a name="rfc.section.4.4"></a><h4><a name="anchor7">4.4</a>&nbsp;'request'</h4>
  
  <p>A 'request' record holds the manner in which a primary record's content was
--- 485,491 ----
  includes the named parameter 'Related-Record-ID'.
  </p>
! <a name="anchor7"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.4.4"></a><h3>4.4.&nbsp;'request'</h3>
  
  <p>A 'request' record holds the manner in which a primary record's content was
***************
*** 476,480 ****
  'Related-Record-ID'.
  </p>
! <a name="rfc.section.4.5"></a><h4><a name="anchor8">4.5</a>&nbsp;'metadata'</h4>
  
  <p>A 'metadata' record contains content created in order to further describe,
--- 494,500 ----
  'Related-Record-ID'.
  </p>
! <a name="anchor8"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.4.5"></a><h3>4.5.&nbsp;'metadata'</h3>
  
  <p>A 'metadata' record contains content created in order to further describe,
***************
*** 494,501 ****
  formally specified somewhere.]
  </p>
! <a name="rfc.section.4.6"></a><h4><a name="anchor9">4.6</a>&nbsp;'revisit'</h4>
  
  <p>A 'revisit' record describes the revisitation of content already archived,
! and includes only an abbrieviated content block which must be
  interpreted relative to a previous record. Most typically, a 'revisit'
  record is be used instead of 'response' or 'resource' record to
--- 514,523 ----
  formally specified somewhere.]
  </p>
! <a name="anchor9"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.4.6"></a><h3>4.6.&nbsp;'revisit'</h3>
  
  <p>A 'revisit' record describes the revisitation of content already archived,
! and includes only an abbreviated content block which must be
  interpreted relative to a previous record. Most typically, a 'revisit'
  record is be used instead of 'response' or 'resource' record to
***************
*** 527,531 ****
  somewhere.]
  </p>
! <a name="rfc.section.4.7"></a><h4><a name="anchor10">4.7</a>&nbsp;'conversion'</h4>
  
  <p>A 'conversion' record contains an alternative version of another record's
--- 549,555 ----
  somewhere.]
  </p>
! <a name="anchor10"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.4.7"></a><h3>4.7.&nbsp;'conversion'</h3>
  
  <p>A 'conversion' record contains an alternative version of another record's
***************
*** 549,553 ****
  specified somewhere.]
  </p>
! <a name="rfc.section.4.8"></a><h4><a name="anchor11">4.8</a>&nbsp;'continuation'</h4>
  
  <p>A 'continuation' record needs to be logically appended to a prior record 
--- 573,579 ----
  specified somewhere.]
  </p>
! <a name="anchor11"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.4.8"></a><h3>4.8.&nbsp;'continuation'</h3>
  
  <p>A 'continuation' record needs to be logically appended to a prior record 
***************
*** 599,608 ****
    record-id     = uri
  </pre>
- 
  <p>The warc-id string may change in future versions, but will always
  begin "warc/", and will always be 8 octets long.
  </p>
  <p>Named parameters after the header-line, if any, follow the
! line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, &ldquo;A Name-Value Language,&rdquo; .</span><span>)</span></a>[ANVL]. Normally,
  named parameters are optional and their order is insignificant,
  however, specific record types require that certain named parameters
--- 625,633 ----
    record-id     = uri
  </pre>
  <p>The warc-id string may change in future versions, but will always
  begin "warc/", and will always be 8 octets long.
  </p>
  <p>Named parameters after the header-line, if any, follow the
! line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, &ldquo;A Name-Value Language,&rdquo; .</span><span>)</span></a> [ANVL]. Normally,
  named parameters are optional and their order is insignificant,
  however, specific record types require that certain named parameters
***************
*** 612,616 ****
  consecutive newlines).
  </p>
! <a name="rfc.section.5.1"></a><h4><a name="anchor13">5.1</a>&nbsp;Positional Parameters</h4>
  
  <p>This section describes each of the individual positional parameters
--- 637,643 ----
  consecutive newlines).
  </p>
! <a name="anchor13"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.5.1"></a><h3>5.1.&nbsp;Positional Parameters</h3>
  
  <p>This section describes each of the individual positional parameters
***************
*** 638,642 ****
  this many octets from that first character of the record header, there
  should be two newlines and either the beginning of a new record or the
! end of the file.
  
  <br />
--- 665,670 ----
  this many octets from that first character of the record header, there
  should be two newlines and either the beginning of a new record or the
! end of the file. (WARC reading implementations may choose to tolerate
! more or fewer newlines at the end of a record.)
  
  <br />
***************
*** 644,649 ****
  
  
! Defensive programming suggests the practice of tolerating fewer or
! more than two newlines at record's end. If the first next token does
  not match the first token of a WARC record, then the previous
  data-length should be considered in error; corrective action might
--- 672,676 ----
  
  
! If the first next token does
  not match the first token of a WARC record, then the previous
  data-length should be considered in error; corrective action might
***************
*** 725,729 ****
  </dd>
  </dl></blockquote>
! <a name="rfc.section.5.2"></a><h4><a name="anchor14">5.2</a>&nbsp;Named Parameters</h4>
  
  <p>Named parameters, also referred to as named fields, are optional
--- 752,758 ----
  </dd>
  </dl></blockquote>
! <a name="anchor14"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.5.2"></a><h3>5.2.&nbsp;Named Parameters</h3>
  
  <p>Named parameters, also referred to as named fields, are optional
***************
*** 757,761 ****
  </pre>
  
- 
  [REVIEW ISSUE: Should we recommend an algorithm? SHA1's continued
  viability as a secure hash is in doubt given recent crypto research
--- 786,789 ----
***************
*** 863,867 ****
  header-line.]
  </p>
! <a name="rfc.section.7.1"></a><h4><a name="anchor17">7.1</a>&nbsp;Record Truncation</h4>
  
  <p>Any record may indicate that truncation has occurred and give the
--- 891,897 ----
  header-line.]
  </p>
! <a name="anchor17"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.7.1"></a><h3>7.1.&nbsp;Record Truncation</h3>
  
  <p>Any record may indicate that truncation has occurred and give the
***************
*** 871,875 ****
  exceeding a length limit.
  </p>
! <a name="rfc.section.7.2"></a><h4><a name="anchor18">7.2</a>&nbsp;Record Segmentation</h4>
  
  <p>A record that will not fit into a single WARC file of desired
--- 901,907 ----
  exceeding a length limit.
  </p>
! <a name="anchor18"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.7.2"></a><h3>7.2.&nbsp;Record Segmentation</h3>
  
  <p>A record that will not fit into a single WARC file of desired
***************
*** 906,910 ****
  <a name="rfc.section.8"></a><h3>8.&nbsp;WARC Application to Specific Protocols</h3>
  
! <a name="rfc.section.8.1"></a><h4><a name="anchor20">8.1</a>&nbsp;HTTP and HTTPS</h4>
  
  <p>A full HTTP or HTTPS response, with protocol information and
--- 938,944 ----
  <a name="rfc.section.8"></a><h3>8.&nbsp;WARC Application to Specific Protocols</h3>
  
! <a name="anchor20"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.8.1"></a><h3>8.1.&nbsp;HTTP and HTTPS</h3>
  
  <p>A full HTTP or HTTPS response, with protocol information and
***************
*** 956,960 ****
  "message/http" type.
  </p>
! <a name="rfc.section.8.2"></a><h4><a name="anchor21">8.2</a>&nbsp;DNS</h4>
  
  <p>A request for DNS information can be summarized in a URI in
--- 990,996 ----
  "message/http" type.
  </p>
! <a name="anchor21"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.8.2"></a><h3>8.2.&nbsp;DNS</h3>
  
  <p>A request for DNS information can be summarized in a URI in
***************
*** 966,970 ****
  type.
  </p>
! <a name="rfc.section.8.3"></a><h4><a name="anchor22">8.3</a>&nbsp;Other Resources with URIs, and Other Protocols</h4>
  
  <p>Any resource that can be identified with a URI, even if it is not
--- 1002,1008 ----
  type.
  </p>
! <a name="anchor22"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.8.3"></a><h3>8.3.&nbsp;Other Resources with URIs, and Other Protocols</h3>
  
  <p>Any resource that can be identified with a URI, even if it is not
***************
*** 1009,1013 ****
  compressing WARC files with GZIP.
  </p>
! <a name="rfc.section.9.1"></a><h4><a name="anchor24">9.1</a>&nbsp;Record-at-a-time Compression</h4>
  
  <p>Per section 2.2 of the GZIP specification, a valid GZIP file
--- 1047,1053 ----
  compressing WARC files with GZIP.
  </p>
! <a name="anchor24"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.9.1"></a><h3>9.1.&nbsp;Record-at-a-time Compression</h3>
  
  <p>Per section 2.2 of the GZIP specification, a valid GZIP file
***************
*** 1029,1033 ****
  record.
  </p>
! <a name="rfc.section.9.2"></a><h4><a name="anchor25">9.2</a>&nbsp;GZIP extra field: skip-lengths ('sl')</h4>
  
  <p>Customarily, GZIP members do not declare their compressed
--- 1069,1075 ----
  record.
  </p>
! <a name="anchor25"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.9.2"></a><h3>9.2.&nbsp;GZIP extra field: skip-lengths ('sl')</h3>
  
  <p>Customarily, GZIP members do not declare their compressed
***************
*** 1069,1073 ****
  appropriate.
  </p>
! <a name="rfc.section.9.3"></a><h4><a name="anchor26">9.3</a>&nbsp;GZIP WARC File Extension</h4>
  
  <p>WARC files compressed with the above conventions remain legal GZIP
--- 1111,1117 ----
  appropriate.
  </p>
! <a name="anchor26"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.9.3"></a><h3>9.3.&nbsp;GZIP WARC File Extension</h3>
  
  <p>WARC files compressed with the above conventions remain legal GZIP
***************
*** 1195,1199 ****
  there are providers to service them. This specification does not
  dictate what identifier scheme to use; suitable schemes include 
! <a class="info" href="#RFC2141">URN<span> (</span><span class="info">Moats, R., &ldquo;URN Syntax,&rdquo; May&nbsp;1997.</span><span>)</span></a>[RFC2141], <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rogers, &ldquo;The ARK Persistent Identifier Scheme,&rdquo; February&nbsp;2005.</span><span>)</span></a>, 
  <a class="info" href="#GUID">[GUID]<span> (</span><span class="info">, &ldquo;Wikipedia: Globally Unique Identifiers,&rdquo; .</span><span>)</span></a>, etc.
  </p>
--- 1239,1243 ----
  there are providers to service them. This specification does not
  dictate what identifier scheme to use; suitable schemes include 
! <a class="info" href="#RFC2141">URN<span> (</span><span class="info">Moats, R., &ldquo;URN Syntax,&rdquo; May&nbsp;1997.</span><span>)</span></a> [RFC2141], <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rodgers, &ldquo;The ARK Persistent Identifier Scheme,&rdquo; August&nbsp;2005.</span><span>)</span></a>, 
  <a class="info" href="#GUID">[GUID]<span> (</span><span class="info">, &ldquo;Wikipedia: Globally Unique Identifiers,&rdquo; .</span><span>)</span></a>, etc.
  </p>
***************
*** 1208,1212 ****
  </p>
  <p>These conventions are suggested by <a class="info" href="#RFC2396">[RFC2396]<span> (</span><span class="info">Berners-Lee, T., Fielding, R., and L. Masinter, &ldquo;Uniform Resource Identifiers (URI): Generic Syntax,&rdquo; August&nbsp;1998.</span><span>)</span></a>,
! formalized by the <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rogers, &ldquo;The ARK Persistent Identifier Scheme,&rdquo; February&nbsp;2005.</span><span>)</span></a> scheme, and are applicable to
  such things as the summarizing of large search results from
  Internet-wide indexing engines. As an example of a convention that
--- 1252,1256 ----
  </p>
  <p>These conventions are suggested by <a class="info" href="#RFC2396">[RFC2396]<span> (</span><span class="info">Berners-Lee, T., Fielding, R., and L. Masinter, &ldquo;Uniform Resource Identifiers (URI): Generic Syntax,&rdquo; August&nbsp;1998.</span><span>)</span></a>,
! formalized by the <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rodgers, &ldquo;The ARK Persistent Identifier Scheme,&rdquo; August&nbsp;2005.</span><span>)</span></a> scheme, and are applicable to
  such things as the summarizing of large search results from
  Internet-wide indexing engines. As an example of a convention that
***************
*** 1218,1222 ****
  http://abc.org/12026/987654321
  </pre>
- 
  <p>The convention could also reserve the extension strings "_s", "_d",
  and "_t" to indicate record- ids for secondary, duplicate, and
--- 1262,1265 ----
***************
*** 1230,1234 ****
  http://abc.org/12026/987654321/_t
  </pre>
- 
  <p>...in which an integer count may further extend the identifier 
  when more there is more than one relationship of the given type.
--- 1273,1276 ----
***************
*** 1246,1255 ****
  and checksums shown are plausible random filler.
  </p>
! <a name="rfc.section.B.1"></a><h4><a name="anchor33">Appendix B.1</a>&nbsp;Example of 'warcinfo' Record</h4>
  
  <p>The following 'warcinfo' example includes an XML description of the
  enclosing WARC file that is loosely modelled after the descriptions
  currently used in Internet Archive ARC files.  However, this is an
! abbrieviated and speculative illustration; the referenced
  WARC-specific namespace "http://archive.org/warc/0.8" has not been
  formally defined anywhere, and may not reflect eventual practice with
--- 1288,1299 ----
  and checksums shown are plausible random filler.
  </p>
! <a name="anchor33"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.B.1"></a><h3>Appendix B.1.&nbsp;Example of 'warcinfo' Record</h3>
  
  <p>The following 'warcinfo' example includes an XML description of the
  enclosing WARC file that is loosely modelled after the descriptions
  currently used in Internet Archive ARC files.  However, this is an
! abbreviated and speculative illustration; the referenced
  WARC-specific namespace "http://archive.org/warc/0.8" has not been
  formally defined anywhere, and may not reflect eventual practice with
***************
*** 1283,1287 ****
  
  </pre>
- 
  <p>The first line (spread over three lines for readability) shows the
  required line of positional parameters. This record has no named
--- 1327,1330 ----
***************
*** 1290,1294 ****
  header-line. Two newlines follow the content block.
  </p>
! <a name="rfc.section.B.2"></a><h4><a name="anchor34">Appendix B.2</a>&nbsp;Example of 'request' Record</h4>
  
  <p>A 'request' record captures the protocol request used to collect a
--- 1333,1339 ----
  header-line. Two newlines follow the content block.
  </p>
! <a name="anchor34"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.B.2"></a><h3>Appendix B.2.&nbsp;Example of 'request' Record</h3>
  
  <p>A 'request' record captures the protocol request used to collect a
***************
*** 1307,1312 ****
  
  </pre>
! 
! <a name="rfc.section.B.3"></a><h4><a name="anchor35">Appendix B.3</a>&nbsp;Example of 'response' Record</h4>
  
  <p>The archived response to the above request might look like the
--- 1352,1358 ----
  
  </pre>
! <a name="anchor35"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.B.3"></a><h3>Appendix B.3.&nbsp;Example of 'response' Record</h3>
  
  <p>The archived response to the above request might look like the
***************
*** 1333,1342 ****
  [6958 bytes of binary data here]
  </pre>
- 
  <p>Note the 'Related-Record-ID' named field referring back to the
  generating 'request' record, and the creation-date identical to the
  previous record.
  </p>
! <a name="rfc.section.B.4"></a><h4><a name="anchor36">Appendix B.4</a>&nbsp;Example of 'resource' Record</h4>
  
  <p>This same file, "logo.jpg", might be archived internally to an
--- 1379,1389 ----
  [6958 bytes of binary data here]
  </pre>
  <p>Note the 'Related-Record-ID' named field referring back to the
  generating 'request' record, and the creation-date identical to the
  previous record.
  </p>
! <a name="anchor36"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.B.4"></a><h3>Appendix B.4.&nbsp;Example of 'resource' Record</h3>
  
  <p>This same file, "logo.jpg", might be archived internally to an
***************
*** 1351,1356 ****
  [6958 bytes of binary data here]
  </pre>
! 
! <a name="rfc.section.B.5"></a><h4><a name="anchor37">Appendix B.5</a>&nbsp;Example of 'metadata' Record</h4>
  
  <p>If some crawl-time metadata should be archived near the above
--- 1398,1404 ----
  [6958 bytes of binary data here]
  </pre>
! <a name="anchor37"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.B.5"></a><h3>Appendix B.5.&nbsp;Example of 'metadata' Record</h3>
  
  <p>If some crawl-time metadata should be archived near the above
***************
*** 1370,1379 ****
  &lt;/harvestmetadata&gt;
  </pre>
- 
  <p>Note again the same creation-date as the preceding related
  records. A relationship is declared o the preceding 'response' record,
  but declaring a relationship to the 'request' would also be legal.
  </p>
! <a name="rfc.section.B.6"></a><h4><a name="anchor38">Appendix B.6</a>&nbsp;Example of 'revisit' Record</h4>
  
  <p>If the same URI is later revisited and the content is unchanged, a
--- 1418,1428 ----
  &lt;/harvestmetadata&gt;
  </pre>
  <p>Note again the same creation-date as the preceding related
  records. A relationship is declared o the preceding 'response' record,
  but declaring a relationship to the 'request' would also be legal.
  </p>
! <a name="anchor38"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.B.6"></a><h3>Appendix B.6.&nbsp;Example of 'revisit' Record</h3>
  
  <p>If the same URI is later revisited and the content is unchanged, a
***************
*** 1396,1400 ****
  &lt;/revisit&gt;
  </pre>
- 
  <p>Again, reference is made back to the original 'response' record. A
  new creation-date reflects he time of revisit. This content block
--- 1445,1448 ----
***************
*** 1405,1409 ****
  defined.
  </p>
! <a name="rfc.section.B.7"></a><h4><a name="anchor39">Appendix B.7</a>&nbsp;Example of 'conversion' Record</h4>
  
  <p>At some future date, the "image/jpeg" format may no longer be
--- 1453,1459 ----
  defined.
  </p>
! <a name="anchor39"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.B.7"></a><h3>Appendix B.7.&nbsp;Example of 'conversion' Record</h3>
  
  <p>At some future date, the "image/jpeg" format may no longer be
***************
*** 1421,1425 ****
  [3098 bytes of binary data here]
  </pre>
- 
  <p>An accompanying 'metadata' record, referring to this 'conversion'
  record, could contain additional details about the
--- 1471,1474 ----
***************
*** 1427,1431 ****
  serve this role.)
  </p>
! <a name="rfc.section.B.8"></a><h4><a name="anchor40">Appendix B.8</a>&nbsp;Example of 'continuation' Record</h4>
  
  <p>If the 'response' above had been so large that it would not fit
--- 1476,1482 ----
  serve this role.)
  </p>
! <a name="anchor40"></a><br /><hr />
! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.B.8"></a><h3>Appendix B.8.&nbsp;Example of 'continuation' Record</h3>
  
  <p>If the 'response' above had been so large that it would not fit
***************
*** 1447,1451 ****
  [39514114 bytes of binary data here]
  </pre>
- 
  <p>Note that the 'Segment-Origin-ID' refers to the first segment of
  the set, the one with the "Segment-Number: 1" named field.
--- 1498,1501 ----
***************
*** 1460,1464 ****
  <td class="author-text">Burner, M. and B. Kahle, &ldquo;<a href="http://www.archive.org/web/researcher/ArcFileFormat.php">The ARC File Format</a>,&rdquo; September&nbsp;1996.</td></tr>
  <tr><td class="author-text" valign="top"><a name="ARK">[ARK]</a></td>
! <td class="author-text">Kunze, J. and R. Rogers, &ldquo;<a href="http://www.cdlib.org/inside/diglib/ark/arkspec.pdf">The ARK Persistent Identifier Scheme</a>,&rdquo; February&nbsp;2005.</td></tr>
  <tr><td class="author-text" valign="top"><a name="GUID">[GUID]</a></td>
  <td class="author-text">&ldquo;<a href="http://en.wikipedia.org/wiki/GUID">Wikipedia: Globally Unique Identifiers</a>.&rdquo;</td></tr>
--- 1510,1514 ----
  <td class="author-text">Burner, M. and B. Kahle, &ldquo;<a href="http://www.archive.org/web/researcher/ArcFileFormat.php">The ARC File Format</a>,&rdquo; September&nbsp;1996.</td></tr>
  <tr><td class="author-text" valign="top"><a name="ARK">[ARK]</a></td>
! <td class="author-text">Kunze, J. and R. Rodgers, &ldquo;<a href="http://www.cdlib.org/inside/diglib/ark/arkspec.pdf">The ARK Persistent Identifier Scheme</a>,&rdquo; August&nbsp;2005.</td></tr>
  <tr><td class="author-text" valign="top"><a name="GUID">[GUID]</a></td>
  <td class="author-text">&ldquo;<a href="http://en.wikipedia.org/wiki/GUID">Wikipedia: Globally Unique Identifiers</a>.&rdquo;</td></tr>

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** warc_file_format.xml	22 Aug 2005 17:28:24 -0000	1.6
--- warc_file_format.xml	23 Aug 2005 17:35:41 -0000	1.7
***************
*** 121,125 ****
  Archive (IA) to record a sequence of materials captured from the web
  (e.g., web "pages"). Each capture is preceded by a one-line header
! that very briey describes the harvested content and its length. This
  is directly followed by the the retrieval protocol response messages
  and content. The motivation to revise the format arose from the
--- 121,125 ----
  Archive (IA) to record a sequence of materials captured from the web
  (e.g., web "pages"). Each capture is preceded by a one-line header
! that very briefly describes the harvested content and its length. This
  is directly followed by the the retrieval protocol response messages
  and content. The motivation to revise the format arose from the
***************
*** 137,141 ****
  organizations. Besides the primary content currently recorded, the
  revision accommodates related secondary content, such as assigned
! metadata, abbrieviated duplicate detection events, and later-date
  transformations. The revision may also be useful for more general
  applications than web archiving. To aid the development of tools that
--- 137,141 ----
  organizations. Besides the primary content currently recorded, the
  revision accommodates related secondary content, such as assigned
! metadata, abbreviated duplicate detection events, and later-date
  transformations. The revision may also be useful for more general
  applications than web archiving. To aid the development of tools that
***************
*** 367,371 ****
  
  <t>A 'revisit' record describes the revisitation of content already archived,
! and includes only an abbrieviated content block which must be
  interpreted relative to a previous record. Most typically, a 'revisit'
  record is be used instead of 'response' or 'resource' record to
--- 367,371 ----
  
  <t>A 'revisit' record describes the revisitation of content already archived,
! and includes only an abbreviated content block which must be
  interpreted relative to a previous record. Most typically, a 'revisit'
  record is be used instead of 'response' or 'resource' record to
***************
*** 1129,1133 ****
  enclosing WARC file that is loosely modelled after the descriptions
  currently used in Internet Archive ARC files.  However, this is an
! abbrieviated and speculative illustration; the referenced
  WARC-specific namespace "http://archive.org/warc/0.8" has not been
  formally defined anywhere, and may not reflect eventual practice with
--- 1129,1133 ----
  enclosing WARC file that is loosely modelled after the descriptions
  currently used in Internet Archive ARC files.  However, this is an
! abbreviated and speculative illustration; the referenced
  WARC-specific namespace "http://archive.org/warc/0.8" has not been
  formally defined anywhere, and may not reflect eventual practice with

Index: warc_file_format.txt
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** warc_file_format.txt	18 Aug 2005 01:57:10 -0000	1.3
--- warc_file_format.txt	23 Aug 2005 17:35:41 -0000	1.4
***************
*** 120,163 ****
     3.  The WARC Record Model  . . . . . . . . . . . . . . . . . . . .  6
     4.  Record Types . . . . . . . . . . . . . . . . . . . . . . . . .  8
!      4.1   'warcinfo' . . . . . . . . . . . . . . . . . . . . . . . .  8
!      4.2   'response' . . . . . . . . . . . . . . . . . . . . . . . .  8
!      4.3   'resource' . . . . . . . . . . . . . . . . . . . . . . . .  9
!      4.4   'request'  . . . . . . . . . . . . . . . . . . . . . . . .  9
!      4.5   'metadata' . . . . . . . . . . . . . . . . . . . . . . . .  9
!      4.6   'revisit'  . . . . . . . . . . . . . . . . . . . . . . . .  9
!      4.7   'conversion' . . . . . . . . . . . . . . . . . . . . . . . 10
!      4.8   'continuation' . . . . . . . . . . . . . . . . . . . . . . 10
     5.  Record Header  . . . . . . . . . . . . . . . . . . . . . . . . 12
!      5.1   Positional Parameters  . . . . . . . . . . . . . . . . . . 13
!      5.2   Named Parameters . . . . . . . . . . . . . . . . . . . . . 14
     6.  Record Content Block . . . . . . . . . . . . . . . . . . . . . 17
     7.  Truncated and Segmented Records  . . . . . . . . . . . . . . . 18
!      7.1   Record Truncation  . . . . . . . . . . . . . . . . . . . . 18
!      7.2   Record Segmentation  . . . . . . . . . . . . . . . . . . . 18
     8.  WARC Application to Specific Protocols . . . . . . . . . . . . 20
!      8.1   HTTP and HTTPS . . . . . . . . . . . . . . . . . . . . . . 20
!      8.2   DNS  . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
!      8.3   Other Resources with URIs, and Other Protocols . . . . . . 21
     9.  Compression Recommendations  . . . . . . . . . . . . . . . . . 22
!      9.1   Record-at-a-time Compression . . . . . . . . . . . . . . . 22
!      9.2   GZIP extra field: skip-lengths ('sl')  . . . . . . . . . . 22
!      9.3   GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23
!    10.   WARC File Name and Size Recommendations  . . . . . . . . . . 24
!    11.   Registration of MIME Media Type application/warc . . . . . . 25
!    12.   IANA Considerations  . . . . . . . . . . . . . . . . . . . . 26
!    13.   Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 27
!    A.  Consideratons in Choice of record-id . . . . . . . . . . . . . 28
!    B.  Examples of WARC Records . . . . . . . . . . . . . . . . . . . 29
!      B.1   Example of 'warcinfo' Record . . . . . . . . . . . . . . . 29
!      B.2   Example of 'request' Record  . . . . . . . . . . . . . . . 30
!      B.3   Example of 'response' Record . . . . . . . . . . . . . . . 30
!      B.4   Example of 'resource' Record . . . . . . . . . . . . . . . 31
!      B.5   Example of 'metadata' Record . . . . . . . . . . . . . . . 31
!      B.6   Example of 'revisit' Record  . . . . . . . . . . . . . . . 31
!      B.7   Example of 'conversion' Record . . . . . . . . . . . . . . 32
!      B.8   Example of 'continuation' Record . . . . . . . . . . . . . 32
!    14.   References . . . . . . . . . . . . . . . . . . . . . . . . . 33
!        Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 34
!        Intellectual Property and Copyright Statements . . . . . . . . 36
  
  
--- 120,163 ----
     3.  The WARC Record Model  . . . . . . . . . . . . . . . . . . . .  6
     4.  Record Types . . . . . . . . . . . . . . . . . . . . . . . . .  8
!      4.1.  'warcinfo' . . . . . . . . . . . . . . . . . . . . . . . .  8
!      4.2.  'response' . . . . . . . . . . . . . . . . . . . . . . . .  8
!      4.3.  'resource' . . . . . . . . . . . . . . . . . . . . . . . .  9
!      4.4.  'request'  . . . . . . . . . . . . . . . . . . . . . . . .  9
!      4.5.  'metadata' . . . . . . . . . . . . . . . . . . . . . . . .  9
!      4.6.  'revisit'  . . . . . . . . . . . . . . . . . . . . . . . .  9
!      4.7.  'conversion' . . . . . . . . . . . . . . . . . . . . . . . 10
!      4.8.  'continuation' . . . . . . . . . . . . . . . . . . . . . . 10
     5.  Record Header  . . . . . . . . . . . . . . . . . . . . . . . . 12
!      5.1.  Positional Parameters  . . . . . . . . . . . . . . . . . . 13
!      5.2.  Named Parameters . . . . . . . . . . . . . . . . . . . . . 14
     6.  Record Content Block . . . . . . . . . . . . . . . . . . . . . 17
     7.  Truncated and Segmented Records  . . . . . . . . . . . . . . . 18
!      7.1.  Record Truncation  . . . . . . . . . . . . . . . . . . . . 18
!      7.2.  Record Segmentation  . . . . . . . . . . . . . . . . . . . 18
     8.  WARC Application to Specific Protocols . . . . . . . . . . . . 20
!      8.1.  HTTP and HTTPS . . . . . . . . . . . . . . . . . . . . . . 20
!      8.2.  DNS  . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
!      8.3.  Other Resources with URIs, and Other Protocols . . . . . . 21
     9.  Compression Recommendations  . . . . . . . . . . . . . . . . . 22
!      9.1.  Record-at-a-time Compression . . . . . . . . . . . . . . . 22
!      9.2.  GZIP extra field: skip-lengths ('sl')  . . . . . . . . . . 22
!      9.3.  GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23
!    10. WARC File Name and Size Recommendations  . . . . . . . . . . . 24
!    11. Registration of MIME Media Type application/warc . . . . . . . 25
!    12. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 26
!    13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 27
!    Appendix A.   Consideratons in Choice of record-id . . . . . . . . 28
!    Appendix B.   Examples of WARC Records . . . . . . . . . . . . . . 29
!    Appendix B.1. Example of 'warcinfo' Record . . . . . . . . . . . . 29
!    Appendix B.2. Example of 'request' Record  . . . . . . . . . . . . 30
!    Appendix B.3. Example of 'response' Record . . . . . . . . . . . . 30
!    Appendix B.4. Example of 'resource' Record . . . . . . . . . . . . 31
!    Appendix B.5. Example of 'metadata' Record . . . . . . . . . . . . 31
!    Appendix B.6. Example of 'revisit' Record  . . . . . . . . . . . . 31
!    Appendix B.7. Example of 'conversion' Record . . . . . . . . . . . 32
!    Appendix B.8. Example of 'continuation' Record . . . . . . . . . . 32
!    14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 33
!    Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 35
!    Intellectual Property and Copyright Statements . . . . . . . . . . 36
  
  
***************
*** 182,200 ****
     Archive (IA) to record a sequence of materials captured from the web
     (e.g., web "pages").  Each capture is preceded by a one-line header
!    that very briey describes the harvested content and its length.  This
!    is directly followed by the the retrieval protocol response messages
!    and content.  The motivation to revise the format arose from the
!    discussion and experiences of the International Internet Preservation
!    Consortium (IIPC) [IIPC], whose members include the IA and the
!    national libraries of a dozen countries.  The revised format is
!    expected to become the primary output format of the open-source
!    Heritrix [HERITRIX] web crawler, and the input format for a wide
!    array of cataloguing and access tools.
  
     The WARC format generalizes the older format to better support the
!    harvesting, display, and exchange needs of archiving organizations.
     Besides the primary content currently recorded, the revision
     accommodates related secondary content, such as assigned metadata,
!    abbrieviated duplicate detection events, and later-date
     transformations.  The revision may also be useful for more general
     applications than web archiving.  To aid the development of tools
--- 182,202 ----
     Archive (IA) to record a sequence of materials captured from the web
     (e.g., web "pages").  Each capture is preceded by a one-line header
!    that very briefly describes the harvested content and its length.
!    This is directly followed by the the retrieval protocol response
!    messages and content.  The motivation to revise the format arose from
!    the discussion and experiences of the International Internet
!    Preservation Consortium (IIPC) [IIPC], whose members include the IA
!    and the national libraries of a dozen countries.  The revised format
!    is expected to be a standard way to structure, manage and store
!    billions of collected web resources.  For example, WARC will be an
!    output format of harvesting software, such as the open-source
!    Heritrix [HERITRIX] web crawler, and an input format for a wide array
!    of cataloguing and access tools.
  
     The WARC format generalizes the older format to better support the
!    harvesting, access, and exchange needs of archiving organizations.
     Besides the primary content currently recorded, the revision
     accommodates related secondary content, such as assigned metadata,
!    abbreviated duplicate detection events, and later-date
     transformations.  The revision may also be useful for more general
     applications than web archiving.  To aid the development of tools
***************
*** 219,224 ****
  
  
- 
- 
  Kunze, et al.            Expires January 2, 2006                [Page 4]
  
--- 221,224 ----
***************
*** 409,413 ****
     appropriate and how they can be standardized is warranted.]
  
! 4.1  'warcinfo'
  
     A 'warcinfo' record describes the records that follow it, up through
--- 409,413 ----
     appropriate and how they can be standardized is warranted.]
  
! 4.1.  'warcinfo'
  
     A 'warcinfo' record describes the records that follow it, up through
***************
*** 436,440 ****
     content block must be formally defined somewhere.]
  
! 4.2  'response'
  
     A 'response' record contains an entire protocol response, such as a
--- 436,440 ----
     content block must be formally defined somewhere.]
  
! 4.2.  'response'
  
     A 'response' record contains an entire protocol response, such as a
***************
*** 454,458 ****
     named parameters 'IP-Address' and 'Related-Record-ID'.
  
! 4.3  'resource'
  
     A 'resource' record contains a resource, without full protocol
--- 454,458 ----
     named parameters 'IP-Address' and 'Related-Record-ID'.
  
! 4.3.  'resource'
  
     A 'resource' record contains a resource, without full protocol
***************
*** 462,466 ****
     often includes the named parameter 'Related-Record-ID'.
  
! 4.4  'request'
  
     A 'request' record holds the manner in which a primary record's
--- 462,466 ----
     often includes the named parameter 'Related-Record-ID'.
  
! 4.4.  'request'
  
     A 'request' record holds the manner in which a primary record's
***************
*** 469,473 ****
     parameter 'Related-Record-ID'.
  
! 4.5  'metadata'
  
     A 'metadata' record contains content created in order to further
--- 469,473 ----
     parameter 'Related-Record-ID'.
  
! 4.5.  'metadata'
  
     A 'metadata' record contains content created in order to further
***************
*** 487,494 ****
     formally specified somewhere.]
  
! 4.6  'revisit'
  
     A 'revisit' record describes the revisitation of content already
!    archived, and includes only an abbrieviated content block which must
     be interpreted relative to a previous record.  Most typically, a
     'revisit' record is be used instead of 'response' or 'resource'
--- 487,494 ----
     formally specified somewhere.]
  
! 4.6.  'revisit'
  
     A 'revisit' record describes the revisitation of content already
!    archived, and includes only an abbreviated content block which must
     be interpreted relative to a previous record.  Most typically, a
     'revisit' record is be used instead of 'response' or 'resource'
***************
*** 528,532 ****
     somewhere.]
  
! 4.7  'conversion'
  
     A 'conversion' record contains an alternative version of another
--- 528,532 ----
     somewhere.]
  
! 4.7.  'conversion'
  
     A 'conversion' record contains an alternative version of another
***************
*** 550,554 ****
     specified somewhere.]
  
! 4.8  'continuation'
  
     A 'continuation' record needs to be logically appended to a prior
--- 550,554 ----
     specified somewhere.]
  
! 4.8.  'continuation'
  
     A 'continuation' record needs to be logically appended to a prior
***************
*** 674,678 ****
  
  
! 5.1  Positional Parameters
  
     This section describes each of the individual positional parameters
--- 674,678 ----
  
  
! 5.1.  Positional Parameters
  
     This section describes each of the individual positional parameters
***************
*** 695,707 ****
        After proceeding this many octets from that first character of the
        record header, there should be two newlines and either the
!       beginning of a new record or the end of the file.
  
  
!       Defensive programming suggests the practice of tolerating fewer or
!       more than two newlines at record's end.  If the first next token
!       does not match the first token of a WARC record, then the previous
!       data-length should be considered in error; corrective action might
!       include searching for a nearby occurrence of "warc/0.8" and other
!       character patterns indicative of a legal record beginning.
  
     record-type The kind of WARC record.  All record types are optional,
--- 695,708 ----
        After proceeding this many octets from that first character of the
        record header, there should be two newlines and either the
!       beginning of a new record or the end of the file.  (WARC reading
!       implementations may choose to tolerate more or fewer newlines at
!       the end of a record.)
  
  
!       If the first next token does not match the first token of a WARC
!       record, then the previous data-length should be considered in
!       error; corrective action might include searching for a nearby
!       occurrence of "warc/0.8" and other character patterns indicative
!       of a legal record beginning.
  
     record-type The kind of WARC...
 
[truncated message content]
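
The recovery behaviour described in the diff above (tolerate fewer or more than two newlines after a record, and search nearby for "warc/0.8" when the data-length turns out to be wrong) can be sketched roughly as follows. This is only an illustration in Python, not code from the archive-access tree: the function name, the `window` parameter, and the assumption that every record begins with the literal token "warc/0.8" are made up for the example.

```python
import re
from typing import Optional

# Assumption: each record header starts with this literal token.
MAGIC = b"warc/0.8"


def next_record_offset(buf: bytes, record_start: int, data_length: int,
                       window: int = 4096) -> Optional[int]:
    """Return the offset of the record following the one at record_start.

    Returns None at a clean end of file, or when no plausible record
    start is found within `window` octets of the expected position.
    """
    # data-length is counted from the first character of the record header.
    expected = record_start + data_length
    pos = expected
    # Tolerate fewer or more than two newlines at the record's end.
    while buf[pos:pos + 1] in (b"\r", b"\n"):
        pos += 1
    if expected <= len(buf) and pos == len(buf):
        return None  # end of file reached where a record could have begun
    if buf.startswith(MAGIC, pos):
        return pos
    # The data-length is considered in error: look for the occurrence of
    # the token nearest to the expected position, excluding the current
    # record's own first token.
    lo = max(record_start + 1, expected - window)
    hi = min(len(buf), expected + window)
    hits = [lo + m.start() for m in re.finditer(re.escape(MAGIC), buf[lo:hi])]
    if not hits:
        return None
    return min(hits, key=lambda h: abs(h - expected))
```

A real reader would probably also check the other "character patterns indicative of a legal record beginning" that the text mentions before trusting a hit; the sketch accepts the bare token.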

[Archive-access-cvs] archive-access/projects/nutch/xdocs wacs-oswir.doc,1.2,1.3 wacs-oswir.pdf,1.1,1.2

From: Michael S. <sta...@us...> - 2005-08-23 00:26:27

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7484/xdocs

Modified Files:
	wacs-oswir.doc wacs-oswir.pdf 
Log Message:
* xdocs/wacs-oswir.doc xdocs/wacs-oswir.pdf 
    Final submissions.


Index: wacs-oswir.pdf
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/wacs-oswir.pdf,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
Binary files /tmp/cvsR4BYtk and /tmp/cvsHIeiab differ

Index: wacs-oswir.doc
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/wacs-oswir.doc,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
Binary files /tmp/cvsMHqZYm and /tmp/cvshLVeId differ

Re: [Archive-access-cvs] error in indexing

From: stack <st...@ar...> - 2005-08-18 17:59:50

 Lukáš Matějka wrote:

> Hi,
>does anybody have an idea?
>  
>
What is your complete indexarcs.sh command line?  It looks like we're 
passing in a '*' character -- i.e. ./nutch-data/segments/*/fetcher/data -- 
and the glob character is not being expanded internally.  Try something 
simple without '*' characters for your '-d' value.

St.Ack
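
To illustrate the point about the unexpanded glob: a program that is handed a literal pattern such as ./nutch-data/segments/* has to expand it itself before opening anything, otherwise it looks for a path that really contains a '*'. One way a literal '*' can reach a program is that a shell typically leaves a pattern untouched when nothing matches it. The sketch below is in Python, not the Java or shell of nutchwax, and the function name is hypothetical.

```python
import glob
import os
from typing import List


def expand_segments(pattern: str) -> List[str]:
    """Expand a shell-style pattern into the directories it matches.

    Returns an empty list when nothing matches, so the caller can stop
    early instead of trying to open a path containing a literal '*'.
    """
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
```

A caller would loop over `expand_segments("./nutch-data/segments/*")` and build each segment's fetcher/data path from the returned directory, reporting an error if the list is empty.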

> xmatejk2@war:~/nutchwax-0.2.1$ ./bin/indexarcs.sh -s /home...
>Tue Aug 9 13:52:36 CEST 2005 Checking environment variables.
>  
>
>>Tue Aug 9 13:52:36 CEST 2005 Cleaning up all ./nutch-data/ content.
>>Tue Aug 9 13:52:36 CEST 2005 Creating new queue, and segments.
>>Tue Aug 9 13:52:36 CEST 2005 Started segmenting.
>>ERROR: ./nutch-data//queue/ directory does not exist.
>>/home/xmatejk2/nutchwax-0.2.1/bin/arcs2segs.sh DIR_OF_ARCS DIR_FOR_SEGMENTS [#ARCS]
>>Tue Aug 9 13:52:36 CEST 2005 Started build of link database.
>>050809 135236 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
>>050809 135236 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
>>050809 135236 No FS indicated, using default:local
>>050809 135236 Created webdb at LocalFS,./nutch-data/db
>>050809 135237 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
>>050809 135237 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
>>050809 135237 No FS indicated, using default:local
>>050809 135237 Updating ./nutch-data/db
>>050809 135237 Updating for ./nutch-data//segments/*
>>Exception in thread "main" java.io.FileNotFoundException: ./nutch-data/segments/*/fetcher/data
>>at org.apache.nutch.fs.LocalFileSystem.open(LocalFileSystem.java:93)
>>at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:194)
>>        at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:187)
>>        at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:190)
>>        at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:179)
>>        at org.apache.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:50)
>>at org.apache.nutch.tools.UpdateDatabaseTool.updateForSegment(UpdateDatabaseTool.java:92)
>>at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:366)
>>050809 135238 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
>>050809 135238 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
>>050809 135238 Updating ./nutch-data//segments from ./nutch-data//db
>>Exception in thread "main" java.lang.NullPointerException
>>at org.apache.nutch.tools.UpdateSegmentsFromDb.run(UpdateSegmentsFromDb.java:181)
>>at org.apache.nutch.tools.UpdateSegmentsFromDb.main(UpdateSegmentsFromDb.java:345)
>>Tue Aug 9 13:52:38 CEST 2005 Started indexing.
>>050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
>>050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
>>050809 135239 No FS indicated, using default:local
>>050809 135239 indexing segment: ./nutch-data/segments/*
>>050809 135239 * Opening segment *
>>Exception in thread "main" java.lang.NullPointerException
>>at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:165)
>>at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:263)
>>Tue Aug 9 13:52:39 CEST 2005 Started dedup.
>>050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
>>050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
>>050809 135239 No FS indicated, using default:local
>>050809 135240 Reading url hashes...
>>050809 135240 Sorting url hashes...
>>050809 135240 Deleting url duplicates...
>>050809 135240 Deleted 0 url duplicates.
>>050809 135240 Reading content hashes...
>>050809 135240 Sorting content hashes...
>>050809 135240 Deleting content duplicates...
>>050809 135240 Deleted 0 content duplicates.
>>050809 135240 Duplicate deletion complete locally. Now returning to NFS...
>>050809 135240 DeleteDuplicates complete
>>Tue Aug 9 13:52:40 CEST 2005 Merging indices.
>>050809 135240 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
>>050809 135240 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
>>050809 135240 No FS indicated, using default:local
>>050809 135240 merging segment indexes to: ./nutch-data/index
>>050809 135240 done merging
>>
>>-lm

[Archive-access-cvs] error in indexing

From: <mat...@ce...> - 2005-08-10 07:27:00

 Hi,
does anybody have an idea?

 xmatejk2@war:~/nutchwax-0.2.1$ ./bin/indexarcs.sh -s /home...
Tue Aug 9 13:52:36 CEST 2005 Checking environment variables.
> Tue Aug 9 13:52:36 CEST 2005 Cleaning up all ./nutch-data/ content.
> Tue Aug 9 13:52:36 CEST 2005 Creating new queue, and segments.
> Tue Aug 9 13:52:36 CEST 2005 Started segmenting.
> ERROR: ./nutch-data//queue/ directory does not exist.
> /home/xmatejk2/nutchwax-0.2.1/bin/arcs2segs.sh DIR_OF_ARCS DIR_FOR_SEGMENTS [#ARCS]
> Tue Aug 9 13:52:36 CEST 2005 Started build of link database.
> 050809 135236 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
> 050809 135236 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
> 050809 135236 No FS indicated, using default:local
> 050809 135236 Created webdb at LocalFS,./nutch-data/db
> 050809 135237 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
> 050809 135237 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
> 050809 135237 No FS indicated, using default:local
> 050809 135237 Updating ./nutch-data/db
> 050809 135237 Updating for ./nutch-data//segments/*
> Exception in thread "main" java.io.FileNotFoundException: ./nutch-data/segments/*/fetcher/data
> at org.apache.nutch.fs.LocalFileSystem.open(LocalFileSystem.java:93)
> at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:194)
>         at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:187)
>         at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:190)
>         at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:179)
>         at org.apache.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:50)
> at org.apache.nutch.tools.UpdateDatabaseTool.updateForSegment(UpdateDatabaseTool.java:92)
> at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:366)
> 050809 135238 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
> 050809 135238 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
> 050809 135238 Updating ./nutch-data//segments from ./nutch-data//db
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.tools.UpdateSegmentsFromDb.run(UpdateSegmentsFromDb.java:181)
> at org.apache.nutch.tools.UpdateSegmentsFromDb.main(UpdateSegmentsFromDb.java:345)
> Tue Aug 9 13:52:38 CEST 2005 Started indexing.
> 050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
> 050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
> 050809 135239 No FS indicated, using default:local
> 050809 135239 indexing segment: ./nutch-data/segments/*
> 050809 135239 * Opening segment *
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:165)
> at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:263)
> Tue Aug 9 13:52:39 CEST 2005 Started dedup.
> 050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
> 050809 135239 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
> 050809 135239 No FS indicated, using default:local
> 050809 135240 Reading url hashes...
> 050809 135240 Sorting url hashes...
> 050809 135240 Deleting url duplicates...
> 050809 135240 Deleted 0 url duplicates.
> 050809 135240 Reading content hashes...
> 050809 135240 Sorting content hashes...
> 050809 135240 Deleting content duplicates...
> 050809 135240 Deleted 0 content duplicates.
> 050809 135240 Duplicate deletion complete locally. Now returning to NFS...
> 050809 135240 DeleteDuplicates complete
> Tue Aug 9 13:52:40 CEST 2005 Merging indices.
> 050809 135240 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-default.xml
> 050809 135240 parsing file:/home/xmatejk2/nutchwax-0.2.1/conf/nutch-site.xml
> 050809 135240 No FS indicated, using default:local
> 050809 135240 merging segment indexes to: ./nutch-data/index
> 050809 135240 done merging
> 
> -lm
> 
> 
>

[Archive-access-cvs] [Announcement] First release of nutchwax + WERA access tool

From: <st...@du...> - 2005-07-29 00:40:13

We would like to announce the release of nutchwax -- the nutch search 
application + extensions for searching of web archive collections -- and 
WERA, a web collection viewer application from the NWA Toolset that has 
been adapted to nutchwax.  The two tools used in concert provide 
full-text search of small web archive collections and a means of 
browsing an archive collection over time.

Nutchwax is hosted on sourceforge at http://archive-access.sourceforge.net.

St.Ack
