deduplicator-cvs Mailing List for DeDuplicator (Heritrix add-on)
Brought to you by: kristinn_sig
Archive of messages by month (message counts in parentheses):

Year | Jan | Feb | Mar | Apr | May | Jun | Jul  | Aug | Sep | Oct | Nov  | Dec
-----|-----|-----|-----|-----|-----|-----|------|-----|-----|-----|------|-----
2006 |     |     |     |     |     |     |      |     |     |     | (14) | (4)
2007 |     |     |     |     | (2) | (3) |      |     | (1) |     |      |
2008 | (2) |     |     |     | (8) |     | (14) |     |     |     |      |
2009 |     |     |     |     | (3) | (1) |      |     |     |     |      |
2010 |     |     |     |     |     |     | (60) |     |     |     |      |
From: Kristinn S. <kri...@us...> - 2010-07-29 16:49:39
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator
In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv30877/src/main/java/is/landsbokasafn/deduplicator

Modified Files:
	CrawlLogIterator.java

Log Message:
Missing size in log is now handled correctly by omitting the relevant URL. Missing size is always an indication that the visit failed.

Index: CrawlLogIterator.java
===================================================================
RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/CrawlLogIterator.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** CrawlLogIterator.java	27 Jul 2010 09:09:46 -0000	1.2
--- CrawlLogIterator.java	29 Jul 2010 16:49:31 -0000	1.3
***************
*** 165,172 ****
          // Index 2: File size
          long size = -1;
          try {
              size = Long.parseLong(lineParts[2]);
          } catch (NumberFormatException e) {
!             System.err.println("Error parsing size for: " + line);
          }
--- 165,177 ----
          // Index 2: File size
          long size = -1;
+         if (lineParts[2].equals("-")) {
+             // If size is missing then this URL was not successfully visited. Skip in index
+             return null;
+         }
          try {
              size = Long.parseLong(lineParts[2]);
          } catch (NumberFormatException e) {
!             System.err.println("Error parsing size for: " + line +
!                 " Item: " + lineParts[2] + " Message: " + e.getMessage());
          }
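The revision 1.3 logic above is small enough to restate on its own. Below is a minimal, self-contained sketch of the same size-field handling, intended only as an illustration: the class name, the whitespace splitting, and the sample log lines are assumptions made for the example, while the index-2 size field, the "-" convention, and the error message wording come from the diff.

    import java.util.regex.Pattern;

    /**
     * Illustrative sketch (not project code) of the size-field handling added in
     * CrawlLogIterator revision 1.3: a "-" in the size column of a Heritrix
     * crawl.log marks a failed visit, so the entry is skipped rather than indexed.
     */
    public class SizeFieldSketch {
        private static final Pattern WHITESPACE = Pattern.compile("\\s+");

        /** Returns the parsed size, or null if the entry should be skipped. */
        static Long parseSize(String crawlLogLine) {
            String[] lineParts = WHITESPACE.split(crawlLogLine.trim());
            if (lineParts.length < 3) {
                return null; // malformed line, nothing usable to index
            }
            String sizeField = lineParts[2]; // index 2: file size
            if (sizeField.equals("-")) {
                // Missing size: the URL was not successfully visited. Skip it.
                return null;
            }
            try {
                return Long.parseLong(sizeField);
            } catch (NumberFormatException e) {
                System.err.println("Error parsing size for: " + crawlLogLine
                        + " Item: " + sizeField + " Message: " + e.getMessage());
                return null;
            }
        }

        public static void main(String[] args) {
            // Invented example lines: timestamp, status code, size, URL, ...
            System.out.println(parseSize("2010-07-29T16:49:31.000Z 200 5120 http://example.org/"));
            System.out.println(parseSize("2010-07-29T16:49:31.000Z -404 - http://example.org/missing"));
        }
    }

Run as-is, the first call prints 5120 and the second prints null, mirroring how the patched iterator drops failed visits instead of reporting a parse error for "-".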
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:55
Update of /cvsroot/deduplicator/deduplicator3/src/site/apt
In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site/apt

Added Files:
	release.apt release3.apt format.apt started.apt license.apt index.apt

Log Message:
Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer.

--- NEW FILE: started.apt ---

Getting Started
~~~~~~~~~~~~~~~

* Building an index
~~~~~~~~~~~~~~~~~~~

 [[1]] A functional installation of Heritrix is required for this software to work. While Heritrix can be deployed on non-Linux operating systems, doing so requires some degree of work, as the bundled scripts are written for Linux. The same applies to this software, and the following instructions assume that Heritrix is installed on a Linux machine under <<<$HERITRIX_HOME>>>.

 [[2]] Install the DeDuplicator software. The jar files should be placed in <<<$HERITRIX_HOME/lib/>>> and the dedupdigest script should be added to <<<$HERITRIX_HOME/bin/>>>. If you've downloaded a .tar.gz (.zip) bundle, explode it into <<<$HERITRIX_HOME>>> and all the files will be correctly deployed.

 [[3]] Make the dedupdigest script executable with <<<chmod u+x $HERITRIX_HOME/bin/dedupdigest>>>.

 [[4]] Run <<<$HERITRIX_HOME/bin/dedupdigest --help>>>. This displays the usage information for the indexing. The program takes two arguments: the source data (usually a crawl.log) and the target directory where the index will be written (created if not present). Several options are provided to tailor the type of index.

 [[5]] Create an index. A typical index can be built with <<<$HERITRIX_HOME/bin/dedupdigest -o URL -s -t <location of crawl.log> <index output directory>>>>. This creates an index that is indexed by URL only (not by the content digest) and includes equivalent URLs and timestamps.

* Using the index
~~~~~~~~~~~~~~~~~

 [[1]] Having built an appropriate index, launch Heritrix. If it is not the same installation used for creating the index, make sure the installation you launch has the two JARs that come with the DeDuplicator (deduplicator-[version].jar and lucene-[version].jar).

 [[2]] Configure a crawl job as normal, except add the DeDuplicator processor to the processing chain at some point <<after>> the HTTPFetcher processor and prior to any processor that should be skipped when a duplicate is detected. When the DeDuplicator finds a duplicate, processing moves straight to the PostProcessing chain. So if you insert it at the top of the Extractor chain, you can skip both link extraction and writing to disk. If you do not wish to skip link extraction, you can insert the processor at the end of the link extraction chain, and so on.

 [[3]] The DeDuplicator processor has several configurable parameters:

  * <<enabled>> Standard Heritrix property for processors. Should be true; setting it to false disables the processor.

  * <<index-location>> The most important setting. A full path to the directory that contains the index (the output directory of the indexing).

  * <<matching-method>> Whether to look up URLs or content digests first when looking for matches. This setting depends on how the index was built (indexing mode). If that was set to BOTH, then either setting will work; otherwise it must be set according to the indexing mode.

  * <<try-equivalent>> Should equivalent URLs be tried if an exact URL and content digest match is not found.
    Using equivalent matches means that duplicate documents whose URLs differ only in the parameter list, or because of www[0-9]* prefixes, are detected.

  * <<mime-filter>> Which documents to process.

  * <<filter-mode>>

  * <<analysis-mode>> Enables analysis of the usefulness and accuracy of header information in predicting change and non-change in documents. For statistical gathering purposes only.

  * <<log-level>> Enables more logging.

  * <<stats-per-host>> Maintains statistics per host in addition to the crawl-wide stats.

 [[4]] Once the processor has been configured, the crawl can be started and run normally. Information about the processor is available via the Processor report in the Heritrix GUI (this is saved to processors-report.txt at the end of a crawl). Duplicate URLs will still show up in the crawl log, but with the note 'duplicate' in the annotation field at the end of the log line.

--- NEW FILE: index.apt ---

The DeDuplicator (Heritrix add-on module)

* Release information
~~~~~~~~~~~~~~~~~~~~~

 Current stable release is {{{release.html#0.4.0}0.4.0}}.

 All releases, including interim (potentially unstable) releases, can be found here: {{{release.html}Release History of DeDuplicator for Heritrix 1}} and here: {{{release3.html}Release History of DeDuplicator for Heritrix 3}}.

* News
~~~~~~

** DeDuplicator for Heritrix 3 - 23/07/2010
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Version 3.0.0-SNAPSHOT-20100727 is now available {{{release3.html}here}}. This version is compiled against Heritrix 3.0.0. It also updates to Lucene 3.0.2 (from 2.0.0). Please note that changes in the Lucene library mean that memory usage will be approximately 40% greater than before; memory usage appears to be approximately 5 bytes per URL in the index, compared to 3.6 bytes per URL previously. Query times have, however, improved significantly and are now effectively constant, regardless of index size. For large indexes this can mean 10-30 times shorter query times. Building indexes is also much faster (approximately 3-4 times as fast). Currently the DeDupFetchHTTP processor has not been converted.

 This release heralds the end of the existing DeDuplicator, built against Heritrix 1.14. One final release (1.0.0) will be made soon with some accumulated bugfixes. A release candidate is available {{{release.html}here}}.

** Version 0.4.0 released / Future plans - 15/07/2008
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Version 0.4.0 includes numerous tweaks and patches introduced since 0.2.0. Notable changes:

  * Support for the changed crawl.log format that Heritrix introduced in 1.12.0.

  * Improved memory usage for large indexes.

  * Can now exclude duplicate URIs from a new index.

  * Various bug fixes.

 This will be the last version of the DeDuplicator that is built against Heritrix 1.10.0. Building against that version of Heritrix has made the DeDuplicator compatible with almost all 1.x versions of Heritrix. Note though that 0.4.0 is built with Java 1.5, unlike 0.2.0, which was built with Java 1.4.2.

 In version 1.12.0 Heritrix added some useful features that the DeDuplicator should make use of, most notably marking content as 'not novel' (i.e. duplicate). Also, 1.14.0 has rudimentary WARC support, and the aim is to have the DeDuplicator support writing to WARC files. Therefore, any future versions will be built against Heritrix 1.14.0. Support for Heritrix 2.0 is planned, but there is no set timeframe for it.
 This requires considerable changes to the DeDuplicator and will likely not be implemented until Heritrix 2.x is sufficiently mature that it is used routinely instead of 1.x for large-scale production crawls.

** Support for Heritrix 1.12.0 - 1/06/2007
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 A new interim release has been uploaded to deal with the changed crawl.log format in Heritrix 1.12.0. 0.4.0 will be the final release for Heritrix up to version 1.12.0 and should be released soon. Heritrix version 2.0.0, currently in development, will greatly change Heritrix's API and so will require significant changes to the DeDuplicator. Look for the first interim release built against the new Heritrix API as soon as the changes are moved into the trunk of the Heritrix project, probably sometime this month.

** Moved to Sourceforge.net - 7/11/2006
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The project has now been moved to Sourceforge.net. The code has been moved to SF's CVS and anonymous access is now possible. The initial commit was of version 0.2.0. Change history prior to 0.2.0 will be discarded, except that we will keep the packaged PreRelease versions that were made. Along with the public CVS, SourceForge also provides {{{http://sourceforge.net/tracker/?group_id=181565}bug and RFE trackers}}. The project website has also been moved to {{http://deduplicator.sourceforge.net/}}. Stable releases will now be distributed via SourceForge, while interim builds will continue to be made available on the {{{release.html}Release History}} page (until continuous integration is set up).

** Managing duplicates across sequential crawls - 31/10/2006
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 On September 21, Kristinn Sigurðsson presented a paper on the DeDuplicator titled 'Managing duplicates across sequential crawls' at the {{{http://iwaw.net/06}6th International Web Archiving Workshop}}, held in conjunction with the {{{http://ecdl2006.org}10th ECDL}} in Alicante, Spain. The paper is available in the {{{http://www.iwaw.net/06/PDF/iwaw06-proceedings.pdf}Workshop Proceedings}} and can also be downloaded by itself directly from here: {{{http://vefsofnun.bok.hi.is/upload/3/ManagingDuplicatesAcrossSequentialCrawls.pdf}Managing duplicates across sequential crawls}}.

--- NEW FILE: release.apt ---

Release History - DeDuplicator for Heritrix 1
~~~~~~~~~~~~~~~

 The following is a list of releases of the DeDuplicator for Heritrix 1. For Heritrix 3 see {{{release3.html}here}}. Stable releases are clearly labeled as such and can be downloaded via our {{{https://sourceforge.net/project/showfiles.php?group_id=181565}SourceForge download page}}. Any other release may contain unstable/untested elements. They are provided since a continuous build process is not currently available.

 The most recent stable release is {{{release.html#0.4.0}0.4.0}}.

* {1.0.0-RC1} (Release candidate for 1.0.0)
~~~~~~~~~~~~~~~~~~

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-1.0.0-RC1-bin.tar.gz?download}deduplicator-1.0.0-RC1-bin.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-1.0.0-RC1-bin.zip?download}deduplicator-1.0.0-RC1-bin.zip}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-1.0.0-RC1-src.tar.gz?download}deduplicator-1.0.0-RC1-src.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-1.0.0-RC1-src.zip?download}deduplicator-1.0.0-RC1-src.zip}}

 Incorporates a few minor bugfixes from version 0.4.0.
 Namely, it fixes a bug when doing matches by digest and a bug in how it hooked into the Heritrix 1.12 (and up) way of marking content as duplicate.

* {0.4.0} (Stable)
~~~~~~~~~~~~~~~~~~

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.4.0-bin.tar.gz?download}deduplicator-0.4.0-bin.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.4.0-bin.zip?download}deduplicator-0.4.0-bin.zip}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.4.0-src.tar.gz?download}deduplicator-0.4.0-src.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.4.0-src.zip?download}deduplicator-0.4.0-src.zip}}

 Incorporates the numerous tweaks and patches introduced since 0.2.0 (see the comments for the interim builds below for details). Tested in production crawls. No known issues. This is the last version to be built against Heritrix 1.10.0 (fully compatible with any Heritrix version from 1.6-1.14, but not 2.x). Unlike 0.2.0, it is built with Java 1.5, not 1.4.2.

* 0.3.0-20080527
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080527-bin.tar.gz}deduplicator-0.3.0-20080527-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080527-bin.zip}deduplicator-0.3.0-20080527-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080527-src.tar.gz}deduplicator-0.3.0-20080527-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080527-src.zip}deduplicator-0.3.0-20080527-src.zip}}

 Applied patches from Kåre Fiedler Christiansen. It includes a new 'SparseRangeFilter' that now optionally replaces Lucene's RangeFilter when making queries. This reduces memory usage at a cost to performance. A minor NPE bugfix patch was also included. The module is now compiled against Java 1.5. Some 1.5-specific changes have been made, mostly using generics, along with some cleanup of warnings. This version is largely untested!

* 0.3.0-20080129
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080129-bin.tar.gz}deduplicator-0.3.0-20080129-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080129-bin.zip}deduplicator-0.3.0-20080129-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080129-src.tar.gz}deduplicator-0.3.0-20080129-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080129-src.zip}deduplicator-0.3.0-20080129-src.zip}}

 Applied a patch from Lars Clausen. The issue, as explained by Lars: "We've run across a scaling issue in the use of TermQuery for Lucene indexes of 400+ million entries. TermQuery uses norms, which spends one byte of memory per entry in the index. Even turning off norms on the index doesn't help, since TermQuery in the most friendly way creates a fake array of norms. This patch changes the deduplicator to use a ConstantScoreQuery with a RangeFilter, which avoids most of the memory usage and doesn't seem to affect the speed." The patch is untested at this time. (A short sketch of the ConstantScoreQuery/RangeFilter approach follows this entry.)
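Lars's description translates almost directly into Lucene calls, so a brief illustration may help. The sketch below targets the Lucene 2.x API that the Heritrix 1 DeDuplicator used at the time; the field names "url" and "digest" and the searcher setup are assumptions made for this example and are not taken from the project's DigestIndexer or DeDuplicator sources.

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    /**
     * Illustrative sketch of the two lookup strategies discussed above,
     * written against the Lucene 2.x API. Field names are assumptions.
     */
    public class LookupSketch {

        // TermQuery: straightforward, but scoring touches norms (roughly one
        // byte per indexed document), which is what hurt on 400+ million
        // entry indexes.
        static Query termLookup(String url) {
            return new TermQuery(new Term("url", url));
        }

        // ConstantScoreQuery over a RangeFilter whose bounds are both the
        // sought term: matches the same single entry without allocating
        // norms, trading a little speed for a large memory saving.
        static Query rangeLookup(String url) {
            return new ConstantScoreQuery(new RangeFilter("url", url, url, true, true));
        }

        static String digestFor(IndexSearcher searcher, String url) throws IOException {
            TopDocs hits = searcher.search(rangeLookup(url), null, 1);
            if (hits.totalHits == 0) {
                return null; // URL not present in the index of the previous crawl
            }
            Document doc = searcher.doc(hits.scoreDocs[0].doc);
            return doc.get("digest"); // content digest recorded for this URL
        }
    }

The design trade-off is the one Lars names: the constant-score form gives up TermQuery's scoring (which the DeDuplicator does not need for exact lookups) in exchange for not materializing a norms array sized to the whole index.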
* 0.3.0-20070601
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20070601-bin.tar.gz}deduplicator-0.3.0-20070601-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20070601-bin.zip}deduplicator-0.3.0-20070601-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20070601-src.tar.gz}deduplicator-0.3.0-20070601-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20070601-src.zip}deduplicator-0.3.0-20070601-src.zip}}

 While still compiled against version 1.10.0 of Heritrix, this release now handles the changed crawl.log format of Heritrix 1.12.0, which prefixes the content digest with the name of the scheme (a short sketch of handling this prefix follows the 0.2.0 entry below).

* 0.3.0-20061218
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061218-bin.tar.gz}deduplicator-0.3.0-20061218-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061218-bin.zip}deduplicator-0.3.0-20061218-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061218-src.tar.gz}deduplicator-0.3.0-20061218-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061218-src.zip}deduplicator-0.3.0-20061218-src.zip}}

 Added (patch by Maximilian Schoefmann) the ability to exclude URLs marked as duplicates in the crawl.log from the index.

* 0.3.0-20061031
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061031-bin.tar.gz}deduplicator-0.3.0-20061031-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061031-bin.zip}deduplicator-0.3.0-20061031-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061031-src.tar.gz}deduplicator-0.3.0-20061031-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061031-src.zip}deduplicator-0.3.0-20061031-src.zip}}

 Fixed a bug (reported by Lars Clausen) in CrawlLogIterator where malformed lines in the crawl.log would cause an exception. This is now handled gracefully. Also added unit tests for CrawlLogIterator.parseLine(); to facilitate that, parseLine() is now a static method.

* {0.2.0} (Stable)
~~~~~~~~~~~~~~~~~~

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.2.0-bin.tar.gz?download}deduplicator-0.2.0-bin.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.2.0-bin.zip?download}deduplicator-0.2.0-bin.zip}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.2.0-src.tar.gz?download}deduplicator-0.2.0-src.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.2.0-src.zip?download}deduplicator-0.2.0-src.zip}}

 First official release. September 13, 2006.
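The Heritrix 1.12.0 format change mentioned under 0.3.0-20070601 is easy to picture: the digest column changed from a bare value to one prefixed with the scheme name (for example sha1:). The following is a minimal, hedged sketch of coping with both forms; the helper name and the sample digest are invented for the example and are not the CrawlLogIterator code.

    /**
     * Heritrix 1.12.0 started writing the content digest as "sha1:ABC..."
     * instead of a bare "ABC...". A reader that wants to accept both formats
     * can simply strip an optional scheme prefix. Illustrative helper only.
     */
    public class DigestFieldSketch {
        static String stripScheme(String digestField) {
            int colon = digestField.indexOf(':');
            // Keep only the value part when a scheme prefix such as "sha1:" is present.
            return colon >= 0 ? digestField.substring(colon + 1) : digestField;
        }

        public static void main(String[] args) {
            System.out.println(stripScheme("sha1:GL3GLDBLFG7L42TFAOXEXNBTLBMLFZBL"));
            System.out.println(stripScheme("GL3GLDBLFG7L42TFAOXEXNBTLBMLFZBL"));
        }
    }

Both calls print the same bare digest, which is what lets one index serve crawl.log files written before and after the 1.12.0 change.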
* PreRelease20060808
~~~~~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060808-bin.tar.gz}deduplicator-PreRelease20060808-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060808-bin.zip}deduplicator-PreRelease20060808-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060808-src.tar.gz}deduplicator-PreRelease20060808-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060808-src.zip}deduplicator-PreRelease20060808-src.zip}}

 Fixed a bug (reported by Lars Clausen) in CrawlLogIterator where entries for files exceeding 10GB would not be parsed correctly: the crawl.log format assumes that the byte size string can never be longer than 10 characters, but 10 GB requires 11 characters, causing the URL to be shifted down the line.

* PreRelease20060717
~~~~~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060718-bin.tar.gz}deduplicator-PreRelease20060718-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060718-bin.zip}deduplicator-PreRelease20060718-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060718-src.tar.gz}deduplicator-PreRelease20060718-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060718-src.zip}deduplicator-PreRelease20060718-src.zip}}

 CrawlLogIterator refactored some more. Project now built with Maven 2.0. First separate source release.

* Older
~~~~~~~

 The following releases were made prior to the implementation of the Maven automatic build/release process. Consequently, only .tar.gz archives of the binaries are available. Note that the jar files also contain source files.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060717.tar.gz}deduplicator-prerelease20060717.tar.gz}}

    * CrawlLogIterator refactored to make subclassing easier (patch from Lars Clausen). Added setters to CrawlDataItem. Improved Javadoc.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060623.tar.gz}deduplicator-prerelease20060623.tar.gz}}

    * DigestIndexer can now be used by other classes. Moved to Lucene 2.0. Improved bash script.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060606.tar.gz}deduplicator-prerelease20060606.tar.gz}}

    * Adds 'origin' and makes overriding of content size configurable.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060601.tar.gz}deduplicator-prerelease20060601.tar.gz}}

    * Adds DeDupFetchHTTP.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060516.tar.gz}deduplicator-prerelease20060516.tar.gz}}

    * Initial preview release.

--- NEW FILE: format.apt ---

The APT format
~~~~~~~~~~~~~~

 In the following section, boxes containing text in typewriter-like font are examples of APT source.

* Document structure
~~~~~~~~~~~~~~~~~~~~

 A short APT document is contained in a single text file. A longer document may be contained in an ordered list of text files. For instance, the first text file contains section 1, the second text file contains section 2, and so on.

 [Note:] Splitting the APT document into several text files on a section boundary is not mandatory. The split may occur anywhere. However, doing so is recommended because a text file containing a section is by itself a valid APT document.
A file contains a sequence of paragraphs and ``displays'' (non paragraphs such as tables) separated by open lines. A paragraph is simply a sequence of consecutive text lines. +------------------------------------------------------------------------+ First line of first paragraph. Second line of first paragraph. Third line of first paragraph. Line 1 of paragraph 2 (separated from first paragraph by an open line). Line 2 of paragraph 2. +------------------------------------------------------------------------+ The indentation of the first line of a paragraph is the main method used by an APT processor to recognize the type of the paragraph. For example, a section title must not be indented at all. A ``plain'' paragraph must be indented by a certain amount of space. For example, a plain paragraph which is not contained in a list may be indented by two spaces. +-------------------------------------------------+ My section title (not indented). My paragraph first line (indented by 2 spaces). +-------------------------------------------------+ Indentation is not rigid. Any amount of space will do. You don't even need to use a consistent indentation all over your document. What really matters for an APT processor is whether the paragraph is not indented at all or, when inside a list, whether a paragraph is more or less indented than the first item of the list (more about this later). +-------------------------------------------------------+ First paragraph has its first line indented by four spaces. Then the author did even bother to indent the other lines of the paragraph. Second paragraph contains several lines which are all indented by two spaces. This style is much nicer than the one used for the previous paragraph. +-------------------------------------------------------+ Note that tabs are expanded with a tab width set to 8. * Document elements ~~~~~~~~~~~~~~~~~~~ ** Block level elements ~~~~~~~~~~~~~~~~~~~~~~~ *** Title ~~~~~~~~~~ A title is optional. If used, it must appear as the first block of the document. +----------------------------------------------------------------------------+ ------ Title ------ Author ------ Date +----------------------------------------------------------------------------+ A title block is indented (centering it is nicer). It begins with a line containing at least 3 dashes (<<<--->>>). After the first <<<--->>> line, one or several consecutive lines of text (implicit line break after each line) specify the title of the document. This text may immediately be followed by another <<<--->>> line and one or several consecutive lines of text which specifies the author of the document. The author sub-block may optionaly be followed by a date sub-block using the same syntax. The following example is used for a document with an title and a date but with no declared author. +----------------------------------------------------------------------------+ ------ Title ------ ------ Date ------ +----------------------------------------------------------------------------+ The last line is ignored. It is just there to make the block nicer. *** Paragraph ~~~~~~~~~~~~~ Paragraphs other than the title block may appear before the first section. +----------------------+ Paragraph 1, line 1. Paragraph 1, line 2. Paragraph 2, line 1. Paragraph 2, line 2. +----------------------+ Paragraphs are indented. They have already been described in the {{document structure}} section. *** Section ~~~~~~~~~~~ Sections are created by inserting section titles into the document. 
Simple documents need not contain sections. +-----------------------------------+ Section title * Sub-section title ** Sub-sub-section title *** Sub-sub-sub-section title **** Sub-sub-sub-sub-section title +-----------------------------------+ Section titles are not indented. A sub-section title begins with one asterisk (<<<*>>>), a sub-sub-section title begins with two asterisks (<<<**>>>), and so forth up to four sub-section levels. *** List ~~~~~~~~ +---------------------------------------+ * List item 1. * List item 2. Paragraph contained in list item 2. * Sub-list item 1. * Sub-list item 2. * List item 3. +---------------------------------------+ List items are indented and begin with a asterisk (<<<*>>>). Plain paragraphs more indented than the first list item are nested in that list. Displays such as tables (not indented) are always nested in the current list. To nest a list inside a list, indent its first item more than its parent list. To end a list, add a paragraph or list item less indented than the current list. Section titles always end a list. Displays cannot end a list but the <<<[]>>> pseudo-element may be used to force the end of a list. +------------------------------------+ * List item 3. Force end of list: [] -------------------------------------------- Verbatim text not contained in list item 3 -------------------------------------------- +------------------------------------+ In the previous example, without the <<<[]>>>, the verbatim text (not indented as all displays) would have been contained in list item 3. A single <<<[]>>> may be used to end several nested lists at the same time. The indentation of <<<[]>>> may be used to specify exactly which lists should be ended. Example: +------------------------------------+ * List item 1. * List item 2. * Sub-list item 1. * Sub-list item 2. [] ------------------------------------------------------------------- Verbatim text contained in list item 2, but not in sub-list item 2 ------------------------------------------------------------------- +------------------------------------+ There are three kind of lists, the bulleted lists we have already described, the numbered lists and the definition lists. +-----------------------------------------+ [[1]] Numbered item 1. [[A]] Numbered item A. [[B]] Numbered item B. [[2]] Numbered item 2. +-----------------------------------------+ A numbered list item begins with a label beetween two square brackets. The label of the first item establishes the numbering scheme for the whole list: [<<<[[1\]\]>>>] Decimal numbering: 1, 2, 3, 4, etc. [<<<[[a\]\]>>>] Lower-alpha numbering: a, b, c, d, etc. [<<<[[A\]\]>>>] Upper-alpha numbering: A, B, C, D, etc. [<<<[[i\]\]>>>] Lower-roman numbering: i, ii, iii, iv, etc. [<<<[[I\]\]>>>] Upper-roman numbering: I, II, III, IV, etc. The labels of the items other than the first one are ignored. It is recommended to take the time to type the correct label for each item in order to keep the APT source document readable. +-------------------------------------------+ [Defined term 1] of definition list 2. [Defined term 2] of definition list 2. +-------------------------------------------+ A definition list item begins with a defined term: text between square brackets. *** Verbatim text ~~~~~~~~~~~~~~~~~ +----------------------------------------+ ---------------------------------------- Verbatim text, preformatted, escaped. ---------------------------------------- +----------------------------------------+ A verbatim block is not indented. 
It begins with a non indented line containing at least 3 dashes (<<<--->>>). It ends with a similar line. <<<+-->>> instead of <<<--->>> draws a box around verbatim text. Like in HTML, verbatim text is preformatted. Unlike HTML, verbatim text is escaped: inside a verbatim display, markup is not interpreted by the APT processor. *** Figure ~~~~~~~~~~ +---------------------------+ [Figure name] Figure caption +---------------------------+ A figure block is not indented. It begins with the figure name between square brackets. The figure name is optionally followed by some text: the figure caption. The figure name is the pathname of the file containing the figure but without an extension. Example: if your figure is contained in <<</home/joe/docs/mylogo.jpeg>>>, the figure name is <<</home/joe/docs/mylogo>>>. If the figure name comes from a relative pathname (recommended practice) rather than from an absolute pathname, this relative pathname is taken to be relative to the directory of the current APT document (a la HTML) rather than relative to the current working directory. Why not leave the file extension in the figure name? This is better explained by an example. You need to convert an APT document to PostScript and your figure name is <<</home/joe/docs/mylogo>>>. A APT processor will first try to load <<</home/joe/docs/mylogo.eps>>>. When the desired format is not found, a APT processor tries to convert one of the existing formats. In our example, the APT processor tries to convert <<</home/joe/docs/mylogo.jpeg>>> to encapsulated PostScript. *** Table ~~~~~~~~~ A table block is not indented. It begins with a non indented line containing an asterisk and at least 2 dashes (<<<*-->>>). It ends with a similar line. The first line is not only used to recognize a table but also to specify column justification. In the following example, * the second asterisk (<<<*>>>) is used to specify that column 1 is centered, * the plus sign (<<<+>>>) specifies that column 2 is left aligned, * the colon (<<<:>>>) specifies that column 3 is right aligned. [] +---------------------------------------------+ *----------*--------------+----------------: | Centered | Left-aligned | Right-aligned | | cell 1,1 | cell 1,2 | cell 1,3 | *----------*--------------+----------------: | cell 2,1 | cell 2,2 | cell 2,3 | *----------*--------------+----------------: Table caption +---------------------------------------------+ Rows are separated by a non indented line beginning with <<<*-->>>. An optional table caption (non indented text) may immediately follow the table. Rows may contain single line or multiple line cells. Each line of cell text is separated from the adjacent cell by the pipe character (<<<|>>>). (<<<|>>> may be used in the cell text if quoted: <<<\\|>>>.) The last <<<|>>> is only used to make the table nicer. The first <<<|>>> is not only used to make the table nicer, but also to specify that a grid is to be drawn around table cells. The following example shows a simple table with no grid and no caption. +---------------+ *-----*------* cell | cell *-----*------* cell | cell *-----*------* +---------------+ *** Horizontal rule ~~~~~~~~~~~~~~~~~~~ +---------------------+ ===================== +---------------------+ A non indented line containing at least 3 equal signs (<<<===>>>). *** Page break ~~~~~~~~~~~~~~ +---+ ^L +---+ A non indented line containing a single form feed character (Control-L). 
** Text level elements ~~~~~~~~~~~~~~~~~~~~~~ *** Font ~~~~~~~~ +-----------------------------------------------------+ <Italic> font. <<Bold>> font. <<<Monospaced>>> font. +-----------------------------------------------------+ Text between \< and > must be rendered in italic. Text between \<\< and >> must be rendered in bold. Text between \<\<\< and >>> must be rendered using a monospaced, typewriter-like font. Font elements may appear anywhere except inside other font elements. It is not recommended to use font elements inside titles, section titles, links and defined terms because a APT processor automatically applies appropriate font styles to these elements. *** Anchor and link ~~~~~~~~~~~~~~~~~~~ +-----------------------------------------------------------------+ {Anchor}. Link to {{anchor}}. Link to {{http://www.pixware.fr}}. Link to {{{anchor}showing alternate text}}. Link to {{{http://www.pixware.fr}Pixware home page}}. +-----------------------------------------------------------------+ Text between curly braces (<<<\{}>>>) specifies an anchor. Text between double curly braces (<<<\{\{}}>>>) specifies a link. It is an error to create a link element that does not refer to an anchor of the same name. The name of an anchor/link is its text with all non alphanumeric characters stripped. This rule does not apply to links to <external> anchors. Text beginning with <<<http:/>>>, <<<https:/>>>, <<<ftp:/>>>, <<<file:/>>>, <<<mailto:>>>, <<<../>>>, <<<./>>> (<<<..\\>>> and <<<.\\>>> on Windows) is recognized as an external anchor name. When the construct <<\{\{\{>><name><<}>><text><<}}>> is used, the link text <text> may differ from the link name <name>. Anchor/link elements may appear anywhere except inside other anchor/link elements. Section titles are implicitly defined anchors. *** Line break ~~~~~~~~~~~~~~ +-------------+ Force line\ break. +-------------+ A backslash character (<<<\\>>>) followed by a newline character. Line breaks must not be used inside titles and tables (which are line oriented blocks with implicit line breaks). *** Non breaking space ~~~~~~~~~~~~~~~~~~~~~~ +----------------------+ Non\ breaking\ space. +----------------------+ A backslash character (<<<\\>>>) followed by a space character. *** Special character ~~~~~~~~~~~~~~~~~~~~~ +---------------------------------------------------------------------------+ Escaped special characters: \~, \=, \-, \+, \*, \[, \], \<, \>, \{, \}, \\. +---------------------------------------------------------------------------+ In certain contexts, these characters have a special meaning and therefore must be escaped if needed as is. They are escaped by adding a backslash in front of them. The backslash may itself be escaped by adding another backslash in front of it. Note that an asterisk, for example, needs to be escaped only if its begins a paragraph. (<<<*>>> has no special meaning in the middle of a paragraph.) +--------------------------------------+ Copyright symbol: \251, \xA9, \u00a9. +--------------------------------------+ Latin-1 characters (whatever is the encoding of the APT document) may be specified by their codes using a backslash followed by one to three octal digits or by using the <<<\x>>><NN> notation, where <NN> are two hexadecimal digits. Unicode characters may be specified by their codes using the <<<\u>>><NNNN> notation, where <NNNN> are four hexadecimal digits. *** Comment ~~~~~~~~~~~ +---------------+ ~~Commented out. 
+---------------+ Text found after two tildes (<<<\~~>>>) is ignored up to the end of line. A line of <<<~>>> is often used to ``underline'' section titles in order to make them stand out of other paragraphs. * The APT format at a glance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ------------------------------------------------------------------------------ ------ Title ------ Author ------ Date Paragraph 1, line 1. Paragraph 1, line 2. Paragraph 2, line 1. Paragraph 2, line 2. Section title * Sub-section title ** Sub-sub-section title *** Sub-sub-sub-section title **** Sub-sub-sub-sub-section title * List item 1. * List item 2. Paragraph contained in list item 2. * Sub-list item 1. * Sub-list item 2. * List item 3. Force end of list: [] +------------------------------------------+ Verbatim text not contained in list item 3 +------------------------------------------+ [[1]] Numbered item 1. [[A]] Numbered item A. [[B]] Numbered item B. [[2]] Numbered item 2. List numbering schemes: [[1]], [[a]], [[A]], [[i]], [[I]]. [Defined term 1] of definition list. [Defined term 2] of definition list. +-------------------------------+ Verbatim text in a box +-------------------------------+ --- instead of +-- suppresses the box around verbatim text. [Figure name] Figure caption *----------*--------------+----------------: | Centered | Left-aligned | Right-aligned | | cell 1,1 | cell 1,2 | cell 1,3 | *----------*--------------+----------------: | cell 2,1 | cell 2,2 | cell 2,3 | *----------*--------------+----------------: Table caption No grid, no caption: *-----*------* cell | cell *-----*------* cell | cell *-----*------* Horizontal line: ======================================================================= ^L New page. <Italic> font. <<Bold>> font. <<<Monospaced>>> font. {Anchor}. Link to {{anchor}}. Link to {{http://www.pixware.fr}}. Link to {{{anchor}showing alternate text}}. Link to {{{http://www.pixware.fr}Pixware home page}}. Force line\ break. Non\ breaking\ space. Escaped special characters: \~, \=, \-, \+, \*, \[, \], \<, \>, \{, \}, \\. Copyright symbol: \251, \xA9, \u00a9. ~~Commented out. ------------------------------------------------------------------------------ --- NEW FILE: license.apt --- License +------------------------------------------------------------------------+ DeDuplicator is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser Public license (LGPL) reproduced below. DeDuplicator includes the libraries it depends upon. The libraries used can be found under the 'lib' directory. GNU LESSER GENERAL PUBLIC LICENSE Version 2.1, February 1999 Copyright (C) 1991, 1999 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. [This is the first released version of the Lesser GPL. It also counts as the successor of the GNU Library Public License, version 2, hence the version number 2.1.] Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This license, the Lesser General Public License, applies to some specially designated software packages--typically libraries--of the Free Software Foundation and other authors who decide to use it. 
You can use it too, but we suggest you first think carefully about whether this license or the ordinary General Public License is the better strategy to use in any particular case, based on the explanations below. When we speak of free software, we are referring to freedom of use, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish); that you receive source code or can get it if you want it; that you can change the software and use pieces of it in new free programs; and that you are informed that you can do these things. To protect your rights, we need to make restrictions that forbid distributors to deny you these rights or to ask you to surrender these rights. These restrictions translate to certain responsibilities for you if you distribute copies of the library or if you modify it. For example, if you distribute copies of the library, whether gratis or for a fee, you must give the recipients all the rights that we gave you. You must make sure that they, too, receive or can get the source code. If you link other code with the library, you must provide complete object files to the recipients, so that they can relink them with the library after making changes to the library and recompiling it. And you must show them these terms so they know their rights. We protect your rights with a two-step method: (1) we copyright the library, and (2) we offer you this license, which gives you legal permission to copy, distribute and/or modify the library. To protect each distributor, we want to make it very clear that there is no warranty for the free library. Also, if the library is modified by someone else and passed on, the recipients should know that what they have is not the original version, so that the original author's reputation will not be affected by problems that might be introduced by others. Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license. Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries, and is quite different from the ordinary General Public License. We use this license for certain libraries in order to permit linking those libraries into non-free programs. When a program is linked with a library, whether statically or using a shared library, the combination of the two is legally speaking a combined work, a derivative of the original library. The ordinary General Public License therefore permits such linking only if the entire combination fits its criteria of freedom. The Lesser General Public License permits more lax criteria for linking other code with the library. We call this license the "Lesser" General Public License because it does Less to protect the user's freedom than the ordinary General Public License. It also provides other free software developers Less of an advantage over competing non-free programs. These disadvantages are the reason we use the ordinary General Public License for many libraries. 
However, the Lesser license provides advantages in certain special circumstances. For example, on rare occasions, there may be a special need to encourage the widest possible use of a certain library, so that it becomes a de-facto standard. To achieve this, non-free programs must be allowed to use the library. A more frequent case is that a free library does the same job as widely used non-free libraries. In this case, there is little to gain by limiting the free library to free software only, so we use the Lesser General Public License. In other cases, permission to use a particular library in non-free programs enables a greater number of people to use a large body of free software. For example, permission to use the GNU C Library in non-free programs enables many more people to use the whole GNU operating system, as well as its variant, the GNU/Linux operating system. Although the Lesser General Public License is Less protective of the users' freedom, it does ensure that the user of a program that is linked with the Library has the freedom and the wherewithal to run that program using a modified version of the Library. The precise terms and conditions for copying, distribution and modification follow. Pay close attention to the difference between a "work based on the library" and a "work that uses the library". The former contains code derived from the library, whereas the latter must be combined with the library in order to run. GNU LESSER GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any software library or other program which contains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Lesser General Public License (also called "this License"). Each licensee is addressed as "you". A "library" means a collection of software functions and/or data prepared so as to be conveniently linked with application programs (which use some of those functions and data) to form executables. The "Library", below, refers to any such software library or work which has been distributed under these terms. A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".) "Source code" for a work means the preferred form of the work for making modifications to it. For a library, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the library. Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Library is not restricted, and output from such a program is covered only if its contents constitute a work based on the Library (independent of the use of the Library in a tool for writing it). Whether that is true depends on what the Library does and what the program that uses the Library does. 1. 
You may copy and distribute verbatim copies of the Library's complete source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Library. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Library or any portion of it, thus forming a work based on the Library, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a software library. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License. d) If a facility in the modified Library refers to a function or a table of data to be supplied by an application program that uses the facility, other than as an argument passed when the facility is invoked, then you must make a good faith effort to ensure that, in the event an application does not supply such function or table, the facility still operates, and performs whatever part of its purpose remains meaningful. (For example, a function in a library to compute square roots has a purpose that is entirely well-defined independent of the application. Therefore, Subsection 2d requires that any application-supplied function or table used by this function must be optional: if the application does not supply it, the square root function must still compute square roots.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Library, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Library, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Library. In addition, mere aggregation of another work not based on the Library with the Library (or with a work based on the Library) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may opt to apply the terms of the ordinary GNU General Public License instead of this License to a given copy of the Library. To do this, you must alter all the notices that refer to this License, so that they refer to the ordinary GNU General Public License, version 2, instead of to this License. (If a newer version than version 2 of the ordinary GNU General Public License has appeared, then you can specify that version instead if you wish.) Do not make any other change in these notices. 
Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU General Public License applies to all subsequent copies and derivative works made from that copy. This option is useful when you wish to copy part of the code of the Library into a program that is not a library. 4. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. If distribution of object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place satisfies the requirement to distribute the source code, even though third parties are not compelled to copy the source along with the object code. 5. A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License. However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables. When a "work that uses the Library" uses material from a header file that is part of the Library, the object code for the work may be a derivative work of the Library even though the source code is not. Whether this is true is especially significant if the work can be linked without the Library, or if the work is itself a library. The threshold for this to be true is not precisely defined by law. If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.) Otherwise, if the work is a derivative of the Library, you may distribute the object code for the work under the terms of Section 6. Any executables containing that work also fall under Section 6, whether or not they are linked directly with the Library itself. 6. As an exception to the Sections above, you may also combine or link a "work that uses the Library" with the Library to produce a work containing portions of the Library, and distribute that work under terms of your choice, provided that the terms permit modification of the work for the customer's own use and reverse engineering for debugging such modifications. You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License. You must supply a copy of this License. If the work during execution displays copyright notices, you must include the copyright notice for the Library among them, as well as a reference directing the user to the copy of this License. 
Also, you must do one of these things: a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.) b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with. c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution. d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place. e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy. For an executable, the required form of the "work that uses the Library" must include any data and utility programs needed for reproducing the executable from it. However, as a special exception, the materials to be distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. It may happen that this requirement contradicts the license restrictions of other proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Library together in an executable that you distribute. 7. You may place library facilities that are a work based on the Library side-by-side in a single library together with other library facilities not covered by this License, and distribute such a combined library, provided that the separate distribution of the work based on the Library and of the other library facilities is otherwise permitted, and provided that you do these two things: a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities. This must be distributed under the terms of the Sections above. b) Give prominent notice with the combined library of the fact that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. 8. You may not copy, modify, sublicense, link with, or distribute the Library except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or distribute the Library is void, and will automatically terminate your rights under this License. 
However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 9. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Library or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Library (or any work based on the Library), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Library or works based on it. 10. Each time you redistribute the Library (or any work based on the Library), the recipient automatically receives a license from the original licensor to copy, distribute, link with or modify the Library subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties with this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Library at all. For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 12. If the distribution and/or use of the Library is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Library under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 13. The Free Software Foundation may publish revised and/or new versions of the Lesser General Public License from time to time. 
Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Library specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Library does not s... [truncated message content] |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:55
|
Update of /cvsroot/deduplicator/deduplicator3/src/site In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site Added Files: site.xml Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: site.xml --- <?xml version="1.0" encoding="ISO-8859-1"?> <project name="DeDuplicator"> <skin> <groupId>org.apache.maven.skins</groupId> <artifactId>maven-default-skin</artifactId> <version>1.0</version> </skin> <bannerLeft> <name>DeDuplicator</name> <src>images/dedup.png</src> <href>http://vefsofnun.bok.hi.is/deduplicator</href> </bannerLeft> <bannerRight> <name>National and University Library of Iceland</name> <src>images/lbs.gif</src> <href>http://landsbokasafn.is</href> </bannerRight> <poweredBy> <logo name="SourceForge.net Logo" href="http://sourceforge.net" img="http://sflogo.sourceforge.net/sflogo.php?group_id=181565&type=1"/> <logo name="Lucene Logo" href="http://lucene.apache.org" img="images/lucene.jpg"/> <logo name="Build with Maven 2" href="http://maven.apache.org/" img="images/logos/maven-feather.png"/> </poweredBy> <publishDate format="MMMM d, yyyy"/> <body> <links> <item name="Heritrix" href="http://crawler.archive.org/" /> <item name="Lucene" href="http://lucene.apache.org/" /> <item name="SourceForge" href="http://sourceforge.net/projects/deduplicator/" /> </links> <menu name="DeDuplicator"> <item name="Welcome" href="index.html"/> <item name="FAQ" href="faq.html"/> <item name="Releases" href="release.html"/> <item name="License" href="license.html"/> <item name="Getting started" href="started.html"/> <item name="Javadoc" href="apidocs/index.html"/> </menu> <menu ref="reports" /> </body> </project> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/main/java/is/landsbokasafn/deduplicator Modified Files: CommandLineParser.java DeDuplicator.java CrawlLogIterator.java DigestIndexer.java DeDupFetchHTTP.java CrawlDataItem.java Log Message: Added site. Improvments on how the Lucene index is accessed Added size filter to DigestIndexer. Index: DeDupFetchHTTP.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/DeDupFetchHTTP.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** DeDupFetchHTTP.java 14 Jul 2010 16:19:11 -0000 1.1 --- DeDupFetchHTTP.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 23,36 **** package is.landsbokasafn.deduplicator; - import java.io.IOException; - import java.text.SimpleDateFormat; - import java.util.logging.Level; - import java.util.logging.Logger; - import org.archive.modules.fetcher.FetchHTTP; - - import dk.netarkivet.common.utils.SparseRangeFilter; - /** * An extentsion of Heritrix's {@link org.archive.crawler.fetcher.FetchHTTP} --- 23,28 ---- Index: CrawlDataItem.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/CrawlDataItem.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** CrawlDataItem.java 14 Jul 2010 16:19:11 -0000 1.1 --- CrawlDataItem.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 44,47 **** --- 44,48 ---- protected String origin; protected boolean duplicate; + protected long size; /** *************** *** 57,60 **** --- 58,62 ---- origin = null; duplicate = false; + size = -1; } *************** *** 75,79 **** */ public CrawlDataItem(String URL, String contentDigest, String timestamp, ! String etag, String mimetype, String origin, boolean duplicate){ this.URL = URL; this.contentDigest = contentDigest; --- 77,81 ---- */ public CrawlDataItem(String URL, String contentDigest, String timestamp, ! String etag, String mimetype, String origin, boolean duplicate, long size){ this.URL = URL; this.contentDigest = contentDigest; *************** *** 83,86 **** --- 85,89 ---- this.origin = origin; this.duplicate = duplicate; + this.size = size; } *************** *** 201,203 **** --- 204,222 ---- } + /** + * Get the size of the CrawlDataItem. + * @return The size or -1 if the size could not be determined. 
+ */ + public long getSize() { + return size; + } + + /** + * Set the size of the CrawlDataItem + * @param size The size or -1 if the size is indeterminate + */ + public void setSize(long size) { + this.size = size; + } + } Index: DeDuplicator.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/DeDuplicator.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** DeDuplicator.java 21 Jul 2010 14:02:58 -0000 1.2 --- DeDuplicator.java 27 Jul 2010 09:09:46 -0000 1.3 *************** *** 32,41 **** import java.text.ParseException; import java.text.SimpleDateFormat; - import java.util.ArrayList; - import java.util.Arrays; import java.util.Date; import java.util.HashMap; import java.util.Iterator; - import java.util.List; import java.util.Locale; import java.util.Map; --- 32,38 ---- *************** *** 45,48 **** --- 42,46 ---- import org.apache.commons.httpclient.HttpMethod; import org.apache.lucene.document.Document; + import org.apache.lucene.index.Term; import org.apache.lucene.search.ConstantScoreQuery; import org.apache.lucene.search.IndexSearcher; *************** *** 61,66 **** import org.springframework.beans.factory.annotation.Autowired; - import dk.netarkivet.common.utils.SparseRangeFilter; - /** * Heritrix compatible processor. --- 59,62 ---- *************** *** 79,84 **** Logger.getLogger(DeDuplicator.class.getName()); - private static final int MAX_HITS = 1000; - // Spring configurable parameters --- 75,78 ---- *************** *** 197,201 **** } public void setChangeContentSize(boolean changeContentSize){ ! kp.put(ATTR_CHANGE_CONTENT_SIZE,changeContentSize); } --- 191,195 ---- } public void setChangeContentSize(boolean changeContentSize){ ! kp.put(ATTR_CHANGE_CONTENT_SIZE, changeContentSize); } *************** *** 483,489 **** // Look the CrawlURI's URL up in the index. try { ! Query query = queryField(DigestIndexer.FIELD_URL, ! curi.toString()); ! TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; --- 477,483 ---- // Look the CrawlURI's URL up in the index. try { ! Query query = queryField(DigestIndexer.FIELD_URL, curi.toString()); ! TopScoreDocCollector collector = TopScoreDocCollector.create( ! searcher.docFreq(new Term(DigestIndexer.FIELD_URL, curi.toString())), false); searcher.search(query, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; *************** *** 516,522 **** // No exact hits. Let's try lenient matching. String normalizedURL = DigestIndexer.stripURL(curi.toString()); ! query = queryField(DigestIndexer.FIELD_URL_NORMALIZED, ! normalizedURL); ! collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); hits = collector.topDocs().scoreDocs; --- 510,516 ---- // No exact hits. Let's try lenient matching. String normalizedURL = DigestIndexer.stripURL(curi.toString()); ! query = queryField(DigestIndexer.FIELD_URL_NORMALIZED, normalizedURL); ! collector = TopScoreDocCollector.create( ! searcher.docFreq(new Term(DigestIndexer.FIELD_URL_NORMALIZED, normalizedURL)), false); searcher.search(query,collector); hits = collector.topDocs().scoreDocs; *************** *** 569,573 **** Query query = queryField(DigestIndexer.FIELD_DIGEST, currentDigest); try { ! 
TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; --- 563,568 ---- Query query = queryField(DigestIndexer.FIELD_DIGEST, currentDigest); try { ! int hitsOnDigest = searcher.docFreq(new Term(DigestIndexer.FIELD_DIGEST,currentDigest)); ! TopScoreDocCollector collector = TopScoreDocCollector.create(hitsOnDigest, false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; *************** *** 592,597 **** currHostStats.exactURLDuplicates++; } ! logger.finest("Found exact match for " + ! curi.toString()); } --- 587,591 ---- currHostStats.exactURLDuplicates++; } ! logger.finest("Found exact match for " + curi.toString()); } *************** *** 754,760 **** boolean isDuplicate) { try{ ! Query query = queryField(DigestIndexer.FIELD_URL, ! curi.toString()); ! TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; --- 748,754 ---- boolean isDuplicate) { try{ ! Query query = queryField(DigestIndexer.FIELD_URL, curi.toString()); ! TopScoreDocCollector collector = TopScoreDocCollector.create( ! searcher.docFreq(new Term(DigestIndexer.FIELD_URL, curi.toString())), false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; *************** *** 874,884 **** protected Query queryField(String fieldName, String value) { Query query = null; - if(getUseSparseRengeFilter()){ - query = new ConstantScoreQuery( - new SparseRangeFilter(fieldName, value, value, true, true)); - } else { query = new ConstantScoreQuery( new TermRangeFilter(fieldName, value, value, true, true)); - } return query; --- 868,873 ---- Index: CrawlLogIterator.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/CrawlLogIterator.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** CrawlLogIterator.java 14 Jul 2010 16:19:11 -0000 1.1 --- CrawlLogIterator.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 163,167 **** // Index 1: status return code (ignore) ! // Index 2: File size (ignore) // Index 3: URL --- 163,173 ---- // Index 1: status return code (ignore) ! // Index 2: File size ! long size = -1; ! try { ! size = Long.parseLong(lineParts[2]); ! } catch (NumberFormatException e) { ! System.err.println("Error parsing size for: " + line); ! } // Index 3: URL *************** *** 215,220 **** } // Got a valid item. ! return new CrawlDataItem( ! url,digest,timestamp,null,mime,origin,duplicate); } return null; --- 221,225 ---- } // Got a valid item. ! return new CrawlDataItem(url, digest, timestamp, null, mime, origin, duplicate, size); } return null; Index: CommandLineParser.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/CommandLineParser.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** CommandLineParser.java 14 Jul 2010 16:19:11 -0000 1.1 --- CommandLineParser.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 118,121 **** --- 118,127 ---- "index.")); + opt = new Option("l","minsize", true, + "If set (with a value greather than zero), documents with a known size smaller than the " + + "value given here will be omitted from the index. 
Minimum size should be specified in bytes."); + opt.setArgName("minsize"); + this.options.addOption(opt); + PosixParser parser = new PosixParser(); try { Index: DigestIndexer.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/DigestIndexer.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** DigestIndexer.java 14 Jul 2010 16:19:11 -0000 1.1 --- DigestIndexer.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 163,168 **** boolean verbose) throws IOException { ! return writeToIndex(dataIt, mimefilter, blacklist, defaultOrigin, ! verbose,false); } --- 163,167 ---- boolean verbose) throws IOException { ! return writeToIndex(dataIt, mimefilter, blacklist, defaultOrigin, verbose, false, -1); } *************** *** 184,189 **** * @param verbose If true then progress information will be sent to * System.out. ! * @param skipDuplicates Do not add URLs that are marked as duplicates to ! * the index * @return The number of items added to the index. * @throws IOException If an error occurs writing the index. --- 183,190 ---- * @param verbose If true then progress information will be sent to * System.out. ! * @param skipDuplicates Do not add URLs that are marked as duplicates to the index ! * @param minSize The minimum size of documents added to the index. Documents ! * smaller than this are ignored. Documents with unknown size (CrawlDataItem size set to -1) ! * are not subject to this limit. A value of lesser than or equal to zero disables this feature. * @return The number of items added to the index. * @throws IOException If an error occurs writing the index. *************** *** 195,199 **** String defaultOrigin, boolean verbose, ! boolean skipDuplicates) throws IOException { --- 196,201 ---- String defaultOrigin, boolean verbose, ! boolean skipDuplicates, ! long minSize) throws IOException { *************** *** 202,207 **** while (dataIt.hasNext()) { CrawlDataItem item = dataIt.next(); ! if(!(skipDuplicates && item.duplicate) && ! item.mimetype.matches(mimefilter) != blacklist){ // Ok, we wish to index this URL/Digest count++; --- 204,210 ---- while (dataIt.hasNext()) { CrawlDataItem item = dataIt.next(); ! if ( !(skipDuplicates && item.duplicate) && // Check for duplicates ! item.mimetype.matches(mimefilter) != blacklist && // Apply mime-filter ! (item.size==-1 || item.size > minSize)) { // Apply size filter // Ok, we wish to index this URL/Digest count++; *************** *** 212,216 **** Document doc = new Document(); ! // Add URL to index. doc.add(new Field( FIELD_URL, --- 215,219 ---- Document doc = new Document(); ! // Add URL to document. doc.add(new Field( FIELD_URL, *************** *** 229,233 **** } ! // Add digest to index doc.add(new Field( FIELD_DIGEST, --- 232,236 ---- } ! // Add digest to document doc.add(new Field( FIELD_DIGEST, *************** *** 237,241 **** Field.Index.NOT_ANALYZED : Field.Index.NO) )); ! if(timestamp){ doc.add(new Field( --- 240,245 ---- Field.Index.NOT_ANALYZED : Field.Index.NO) )); ! ! // Include timestamp? if(timestamp){ doc.add(new Field( *************** *** 246,249 **** --- 250,254 ---- )); } + // Include etag? if(etag && item.getEtag()!=null){ doc.add(new Field( *************** *** 254,257 **** --- 259,263 ---- )); } + // Set origin if(defaultOrigin!=null){ String tmp = item.getOrigin(); *************** *** 272,277 **** } if(verbose){ ! System.out.println("Indexed " + count + " items (skipped " + ! 
skipped + ")"); } return count; --- 278,282 ---- } if(verbose){ ! System.out.println("Indexed " + count + " items (skipped " + skipped + ")"); } return count; *************** *** 327,330 **** --- 332,336 ---- String origin = null; boolean skipDuplicates = false; + long size = -1; // Process the options *************** *** 344,347 **** --- 350,354 ---- case 'r' : origin = opt.getValue(); break; case 'd' : skipDuplicates = true; break; + case 'l' : size = Long.parseLong(opt.getValue()); break; } } *************** *** 387,391 **** // Create the index ! di.writeToIndex(iterator,mimefilter,blacklist,origin,true,skipDuplicates); // Clean-up --- 394,398 ---- // Create the index ! di.writeToIndex(iterator, mimefilter, blacklist, origin, true, skipDuplicates, size); // Clean-up |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/fml In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site/fml Added Files: faq.fml Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: faq.fml --- <?xml version="1.0"?> <faqs id="General FAQ"> <part id="General"> <faq id="what"> <question>What is the DeDuplicator?</question> <answer> <p> The DeDuplicator is an add-on module for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls. </p> </answer> </faq> </part> </faqs> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/resources/images In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site/resources/images Added Files: dedup.png lbs.gif lucene.jpg Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: lbs.gif --- (This appears to be a binary file; contents omitted.) --- NEW FILE: lucene.jpg --- (This appears to be a binary file; contents omitted.) --- NEW FILE: dedup.png --- (This appears to be a binary file; contents omitted.) |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/xdoc In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site/xdoc Added Files: xdoc.xml Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: xdoc.xml --- <?xml version="1.0"?> <document> <properties> <title>Welcome</title> <author email="de...@ma...">The Maven Team</author> </properties> <body> <section name="Welcome to an XDOC file!"> <p> This is some text for the xdoc file. </p> </section> </body> </document> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/dk/netarkivet/common/utils In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/main/java/dk/netarkivet/common/utils Removed Files: SparseRangeFilter.java SparseBitSet.java Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- SparseBitSet.java DELETED --- --- SparseRangeFilter.java DELETED --- |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3 In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048 Modified Files: .project Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. Index: .project =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/.project,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** .project 14 Jul 2010 16:19:11 -0000 1.1 --- .project 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 7,10 **** --- 7,15 ---- <buildSpec> <buildCommand> + <name>org.eclipse.wst.jsdt.core.javascriptValidator</name> + <arguments> + </arguments> + </buildCommand> + <buildCommand> <name>org.eclipse.jdt.core.javabuilder</name> <arguments> *************** *** 20,23 **** --- 25,29 ---- <nature>org.eclipse.jdt.core.javanature</nature> <nature>org.maven.ide.eclipse.maven2Nature</nature> + <nature>org.eclipse.wst.jsdt.core.jsNature</nature> </natures> </projectDescription> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/.settings In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/.settings Added Files: .jsdtscope org.eclipse.wst.jsdt.ui.superType.container org.eclipse.wst.jsdt.ui.superType.name Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: org.eclipse.wst.jsdt.ui.superType.name --- Window --- NEW FILE: org.eclipse.wst.jsdt.ui.superType.container --- org.eclipse.wst.jsdt.launching.baseBrowserLibrary --- NEW FILE: .jsdtscope --- <?xml version="1.0" encoding="UTF-8"?> <classpath> <classpathentry kind="con" path="org.eclipse.wst.jsdt.launching.JRE_CONTAINER"/> <classpathentry kind="con" path="org.eclipse.wst.jsdt.launching.WebProject"> <attributes> <attribute name="hide" value="true"/> </attributes> </classpathentry> <classpathentry kind="con" path="org.eclipse.wst.jsdt.launching.baseBrowserLibrary"/> <classpathentry kind="output" path=""/> </classpath> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/apt In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/apt Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/apt added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/fml In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/fml Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/fml added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/xdoc In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/xdoc Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/xdoc added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/resources In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/resources Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/resources added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/resources/images In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/resources/images Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/resources/images added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 08:53:09
|
Update of /cvsroot/deduplicator/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv20868 Modified Files: pom.xml Log Message: 1.0.0-RC1 Index: pom.xml =================================================================== RCS file: /cvsroot/deduplicator/deduplicator/pom.xml,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** pom.xml 28 May 2009 14:53:35 -0000 1.13 --- pom.xml 27 Jul 2010 08:53:00 -0000 1.14 *************** *** 7,11 **** <artifactId>deduplicator</artifactId> <name>DeDuplicator (Heritrix add-on module)</name> ! <version>0.5.0</version> <description> An add-on module for the web crawler Heritrix that offers a --- 7,11 ---- <artifactId>deduplicator</artifactId> <name>DeDuplicator (Heritrix add-on module)</name> ! <version>1.0.0-RC1</version> <description> An add-on module for the web crawler Heritrix that offers a *************** *** 13,17 **** series of snapshot crawls. </description> ! <url>http://vefsofnun.bok.hi.is/deduplicator</url> <issueManagement> <system>SourceForge Trackers</system> --- 13,17 ---- series of snapshot crawls. </description> ! <url>http://deduplicator.sourceforge.net/</url> <issueManagement> <system>SourceForge Trackers</system> *************** *** 50,54 **** <developer> <id>Kristinn</id> ! <name>Kristinn Sigurðsson</name> <email>kristsi at bok.hi.is</email> <organization> --- 50,54 ---- <developer> <id>Kristinn</id> ! <name>Kristinn Sigurðsson</name> <email>kristsi at bok.hi.is</email> <organization> *************** *** 73,77 **** </contributor> <contributor> ! <name>Kåre Fiedler Christiansen</name> <email>kfc at statsbiblioteket.dk</email> <organization>Netarkivet.dk</organization> --- 73,77 ---- </contributor> <contributor> ! <name>Kare Fiedler Christiansen</name> <email>kfc at statsbiblioteket.dk</email> <organization>Netarkivet.dk</organization> *************** *** 97,100 **** --- 97,110 ---- <finalName>${project.artifactId}-${project.version}-${buildNumber}</finalName> <plugins> + <!-- this is a java 1.5 project --> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-compiler-plugin</artifactId> + <configuration> + <source>1.5</source> + <target>1.5</target> + <encoding>UTF-8</encoding> + </configuration> + </plugin> <plugin> <groupId>org.codehaus.mojo</groupId> *************** *** 129,133 **** </descriptor> </descriptors> ! <finalName>${project.artifactId}-${project.version}-${buildNumber}</finalName> </configuration> <executions> --- 139,143 ---- </descriptor> </descriptors> ! <finalName>${project.artifactId}-${project.version}</finalName> </configuration> <executions> *************** *** 185,189 **** <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> ! <version>2.0.0</version> <scope>compile</scope> </dependency> --- 195,199 ---- <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> ! <version>2.2.0</version> <scope>compile</scope> </dependency> |
From: Kristinn S. <kri...@us...> - 2010-07-26 10:05:12
|
Update of /cvsroot/deduplicator/deduplicator/src/main/java/is/hi/bok/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv27795/src/main/java/is/hi/bok/deduplicator Modified Files: DeDuplicator.java Log Message: Bugfix in how we make Heritrix accept that a CrawlURI is in effect a duplicate. Index: DeDuplicator.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator/src/main/java/is/hi/bok/deduplicator/DeDuplicator.java,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** DeDuplicator.java 3 Jun 2009 10:00:43 -0000 1.8 --- DeDuplicator.java 26 Jul 2010 10:05:00 -0000 1.9 *************** *** 602,606 **** } AList oldVisit = new HashtableAList(); ! oldVisit.putString(CoreAttributeConstants.A_CONTENT_DIGEST, curi.getContentDigestSchemeString()); history[1]=oldVisit; --- 602,606 ---- } AList oldVisit = new HashtableAList(); ! oldVisit.putString(CoreAttributeConstants.A_CONTENT_DIGEST, curi.getContentDigestString()); history[1]=oldVisit; |
From: Kristinn S. <kri...@us...> - 2010-07-21 14:04:36
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/conf/jobs/profile-deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv16558/src/main/conf/jobs/profile-deduplicator Modified Files: profile-crawler-beans.cxml Log Message: Made settings that have a limited set of options into enums. Index: profile-crawler-beans.cxml =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/conf/jobs/profile-deduplicator/profile-crawler-beans.cxml,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** profile-crawler-beans.cxml 14 Jul 2010 16:19:11 -0000 1.1 --- profile-crawler-beans.cxml 21 Jul 2010 14:04:28 -0000 1.2 *************** *** 268,272 **** <!-- <property name="statsPerHost" value="false" /> --> <!-- <property name="useSparseRengeFilter" value="false" /> --> ! <!-- <property name="originHandling" value="No origin information" /> --> </bean> --- 268,272 ---- <!-- <property name="statsPerHost" value="false" /> --> <!-- <property name="useSparseRengeFilter" value="false" /> --> ! <!-- <property name="originHandling" value="NONE" /> --> </bean> |
From: Kristinn S. <kri...@us...> - 2010-07-21 14:03:09
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv16168/src/main/java/is/landsbokasafn/deduplicator Modified Files: DeDuplicator.java Log Message: Made settings that have a limited set of options into enums. Fixed a bug with how the 'last' entry is faked to have Heritrix realize that the curi was deemed a duplicate. Fixed a bug that prevented origin info from being correctly added to crawl.log annotations. Index: DeDuplicator.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/DeDuplicator.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** DeDuplicator.java 14 Jul 2010 16:19:11 -0000 1.1 --- DeDuplicator.java 21 Jul 2010 14:02:58 -0000 1.2 *************** *** 95,115 **** /* The matching method in use (by url or content digest) */ private final static String ATTR_MATCHING_METHOD = "matching-method"; ! private final static List<String> AVAILABLE_MATCHING_METHODS = new ArrayList<String>(Arrays.asList(new String[]{ ! "URL", ! "Content digest" ! })); ! private final static String DEFAULT_MATCHING_METHOD = AVAILABLE_MATCHING_METHODS.get(0); { setMatchingMethod(DEFAULT_MATCHING_METHOD); } ! public String getMatchingMethod() { ! return (String) kp.get(ATTR_MATCHING_METHOD); } ! public void setMatchingMethod(String matchinMethod) { ! if (AVAILABLE_MATCHING_METHODS.contains(matchinMethod)) { ! kp.put(ATTR_MATCHING_METHOD,matchinMethod); ! } else { ! throw new IllegalArgumentException("Invalid matching method: " + matchinMethod); ! } } --- 95,111 ---- /* The matching method in use (by url or content digest) */ private final static String ATTR_MATCHING_METHOD = "matching-method"; ! enum MatchingMethod { ! URL, ! DIGEST ! } ! private final static MatchingMethod DEFAULT_MATCHING_METHOD = MatchingMethod.URL; { setMatchingMethod(DEFAULT_MATCHING_METHOD); } ! public MatchingMethod getMatchingMethod() { ! return (MatchingMethod) kp.get(ATTR_MATCHING_METHOD); } ! public void setMatchingMethod(MatchingMethod matchinMethod) { ! kp.put(ATTR_MATCHING_METHOD, matchinMethod); } *************** *** 230,254 **** /* How should 'origin' be handled */ public final static String ATTR_ORIGIN_HANDLING = "origin-handling"; ! public final static String ORIGIN_HANDLING_NONE = "No origin information"; ! public final static String ORIGIN_HANDLING_PROCESSOR = "Use processor setting"; ! public final static String ORIGIN_HANDLING_INDEX = "Use index information"; ! public final static List<String> AVAILABLE_ORIGIN_HANDLING = new ArrayList<String>(Arrays.asList(new String[]{ ! ORIGIN_HANDLING_NONE, ! ORIGIN_HANDLING_PROCESSOR, ! ORIGIN_HANDLING_INDEX ! })); ! public final static String DEFAULT_ORIGIN_HANDLING = ORIGIN_HANDLING_NONE; { setOriginHandling(DEFAULT_ORIGIN_HANDLING); } ! public String getOriginHandling() { ! return (String) kp.get(ATTR_ORIGIN); } ! public void setOriginHandling(String originHandling) { ! if (AVAILABLE_ORIGIN_HANDLING.contains(originHandling)) { ! kp.put(ATTR_ORIGIN_HANDLING,originHandling); ! } else { ! throw new IllegalArgumentException("Invalid origin handling: " + originHandling); ! } } --- 226,243 ---- /* How should 'origin' be handled */ public final static String ATTR_ORIGIN_HANDLING = "origin-handling"; ! enum OriginHandling { ! NONE, // No origin information ! PROCESSOR, // Use processor setting -- ATTR_ORIGIN ! 
INDEX // Use index information, each hit on index should contain origin ! } ! public final static OriginHandling DEFAULT_ORIGIN_HANDLING = OriginHandling.NONE; { setOriginHandling(DEFAULT_ORIGIN_HANDLING); } ! public OriginHandling getOriginHandling() { ! return (OriginHandling) kp.get(ATTR_ORIGIN_HANDLING); } ! public void setOriginHandling(OriginHandling originHandling) { ! kp.put(ATTR_ORIGIN_HANDLING,originHandling); } *************** *** 291,296 **** // Matching method ! String matchingMethod = getMatchingMethod(); ! lookupByURL = matchingMethod.equals(DEFAULT_MATCHING_METHOD); // Track per host stats --- 280,285 ---- // Matching method ! MatchingMethod matchingMethod = getMatchingMethod(); ! lookupByURL = matchingMethod == MatchingMethod.URL; // Track per host stats *************** *** 298,306 **** // Origin handling. ! String originHandling = getOriginHandling(); ! if(originHandling.equals(ORIGIN_HANDLING_NONE)==false){ useOrigin = true; ! if(originHandling.equals(ORIGIN_HANDLING_INDEX)){ useOriginFromIndex = true; } } --- 287,297 ---- // Origin handling. ! OriginHandling originHandling = getOriginHandling(); ! if (originHandling != OriginHandling.NONE) { useOrigin = true; ! logger.fine("Use origin"); ! if (originHandling == OriginHandling.INDEX) { useOriginFromIndex = true; + logger.fine("Use origin from index"); } } *************** *** 419,424 **** duplicate.get(DigestIndexer.FIELD_ORIGIN)!=null){ // Index contains origin, use it. ! annotation += ":\"" + duplicate.get( ! DigestIndexer.FIELD_ORIGIN) + "\""; } else { String tmp = getOrigin(); --- 410,414 ---- duplicate.get(DigestIndexer.FIELD_ORIGIN)!=null){ // Index contains origin, use it. ! annotation += ":\"" + duplicate.get(DigestIndexer.FIELD_ORIGIN) + "\""; } else { String tmp = getOrigin(); *************** *** 438,442 **** // TODO: Reconsider this curi.setContentSize(0); ! } else { // A hack to have Heritrix count this as a duplicate. // TODO: Get gojomo to change how Heritrix decides CURIs are duplicates. --- 428,432 ---- // TODO: Reconsider this curi.setContentSize(0); ! } else if (lookupByURL) { // A hack to have Heritrix count this as a duplicate. // TODO: Get gojomo to change how Heritrix decides CURIs are duplicates. *************** *** 462,472 **** history[i] = history[i-1]; } Map oldVisit = new HashMap(); ! oldVisit.put(A_CONTENT_DIGEST, curi.getContentDigestSchemeString()); history[1]=oldVisit; curi.getData().put(A_FETCH_HISTORY,history); ! } // Mark as duplicate for other processors curi.getData().put(A_CONTENT_STATE_KEY, CONTENT_UNCHANGED); --- 452,463 ---- history[i] = history[i-1]; } + // Fake the 'last' entry Map oldVisit = new HashMap(); ! oldVisit.put(A_CONTENT_DIGEST, curi.getContentDigest()); history[1]=oldVisit; curi.getData().put(A_FETCH_HISTORY,history); ! } // TODO: Handle matching on digest // Mark as duplicate for other processors curi.getData().put(A_CONTENT_STATE_KEY, CONTENT_UNCHANGED); |
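With the matching-method and origin-handling settings turned into enums, Spring's standard property binding converts the string in the profile (for example value="NONE" or value="URL") to the matching constant by name, and any value outside the allowed set fails at configuration time instead of being silently ignored. A small self-contained sketch of that behaviour, using a local copy of the OriginHandling constants purely for illustration:

public class EnumSettingSketch {
    // Mirrors the OriginHandling enum added in the diff above; declared locally
    // so this example compiles on its own.
    enum OriginHandling { NONE, PROCESSOR, INDEX }

    public static void main(String[] args) {
        // What the container effectively does with <property name="originHandling" value="NONE"/>.
        OriginHandling fromProfile = OriginHandling.valueOf("NONE");
        System.out.println(fromProfile); // NONE

        // The old free-form string is no longer accepted, so a stale profile fails fast.
        try {
            OriginHandling.valueOf("No origin information");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}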
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:23
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/conf/jobs/profile-deduplicator/.svn/text-base In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840/src/main/conf/jobs/profile-deduplicator/.svn/text-base Added Files: profile-crawler-beans.cxml.svn-base Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: profile-crawler-beans.cxml.svn-base --- <?xml version="1.0" encoding="UTF-8"?> <!-- HERITRIX 3 CRAWL JOB CONFIGURATION FILE This is a relatively minimal configuration suitable for many crawls. Commented-out beans and properties are provided as an example; values shown in comments reflect the actual defaults which are in effect without specification. (To change from the default behavior, uncomment AND alter the shown values.) --> <beans xmlns="http://www.springframework.org/schema/beans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:context="http://www.springframework.org/schema/context" xmlns:aop="http://www.springframework.org/schema/aop" xmlns:tx="http://www.springframework.org/schema/tx" xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd http://www.springframework.org/schema/aop http://www.springframework.org/schema/aop/spring-aop-2.5.xsd http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-2.5.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-2.5.xsd"> <context:annotation-config/> <!-- OVERRIDES Values elsewhere in the configuration may be replaced ('overridden') by a Properties map declared in a PropertiesOverrideConfigurer, using a dotted-bean-path to address individual bean properties. This allows us to collect a few of the most-often changed values in an easy-to-edit format here at the beginning of the model configuration. 
--> <!-- overrides from a text property list --> <bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer"> <property name="properties"> <value> # This Properties map is specified in the Java 'property list' text format # http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29 metadata.operatorContactUrl=ENTER_AN_URL_WITH_YOUR_CONTACT_INFO_HERE_FOR_WEBMASTERS_AFFECTED_BY_YOUR_CRAWL metadata.jobName=basic metadata.description=Basic crawl starting with useful defaults ##..more?..## </value> </property> </bean> <!-- overrides from declared <prop> elements, more easily allowing multiline values or even declared beans --> <bean id="longerOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer"> <property name="properties"> <props> <prop key="seeds.textSource.value"> # URLS HERE http://example.example/example </prop> </props> </property> </bean> <!-- CRAWL METADATA: including identification of crawler/operator --> <bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName"> <property name="operatorContactUrl" value="[see override above]"/> <property name="jobName" value="[see override above]"/> <property name="description" value="[see override above]"/> <!-- <property name="operator" value=""/> --> <!-- <property name="operatorFrom" value=""/> --> <!-- <property name="organization" value=""/> --> <!-- <property name="audience" value=""/> --> <!-- <property name="userAgentTemplate" value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/> --> </bean> <!-- SEEDS: crawl starting points ConfigString allows simple, inline specification of a moderate number of seeds; see below comment for example of using an arbitrarily-large external file. --> <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule"> <property name="textSource"> <bean class="org.archive.spring.ConfigString"> <property name="value"> <value> # [see override above] </value> </property> </bean> </property> <!-- <property name='sourceTagSeeds' value='false'/> --> </bean> <!-- SEEDS ALTERNATE APPROACH: specifying external seeds.txt file Use either the above, or this, but not both. --> <!-- <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule"> <property name="textSource"> <bean class="org.archive.spring.ConfigFile"> <property name="path" value="seeds.txt" /> </bean> </property> <property name='sourceTagSeeds' value='false'/> </bean> --> <!-- SCOPE: rules for which discovered URIs to crawl; order is very important because last decision returned other than 'NONE' wins. --> <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence"> <property name="rules"> <list> <!-- Begin by REJECTing all... --> <bean class="org.archive.modules.deciderules.RejectDecideRule"> </bean> <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... --> <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"> <!-- <property name="seedsAsSurtPrefixes" value="true" /> --> <!-- <property name="alsoCheckVia" value="true" /> --> <!-- <property name="surtsSourceFile" value="" /> --> <!-- <property name="surtsDumpFile" value="surts.dump" /> --> </bean> <!-- ...but REJECT those more than a configured link-hop-count from start... --> <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule"> <!-- <property name="maxHops" value="20" /> --> </bean> <!-- ...but ACCEPT those more than a configured link-hop-count from start... 
--> <bean class="org.archive.modules.deciderules.TransclusionDecideRule"> <!-- <property name="maxTransHops" value="2" /> --> <!-- <property name="maxSpeculativeHops" value="1" /> --> </bean> <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... --> <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"> <property name="decision" value="REJECT"/> <property name="seedsAsSurtPrefixes" value="false"/> <property name="surtsDumpFile" value="negative-surts.dump" /> <!-- <property name="surtsSourceFile" value="" /> --> </bean> <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... --> <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule"> <property name="decision" value="REJECT"/> <!-- <property name="listLogicalOr" value="true" /> --> <!-- <property name="regexList"> <list> </list> </property> --> </bean> <!-- ...and REJECT those with suspicious repeating path-segments... --> <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule"> <!-- <property name="maxRepetitions" value="2" /> --> </bean> <!-- ...and REJECT those with more than threshold number of path-segments... --> <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule"> <!-- <property name="maxPathDepth" value="20" /> --> </bean> <!-- ...but always ACCEPT those marked as prerequisitee for another URI... --> <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule"> </bean> </list> </property> </bean> <!-- PROCESSING CHAINS Much of the crawler's work is specified by the sequential application of swappable Processor modules. These Processors are collected into three 'chains. The CandidateChain is applied to URIs being considered for inclusion, before a URI is enqueued for collection. The FetchChain is applied to URIs when their turn for collection comes up. The DispositionChain is applied after a URI is fetched and analyzed/link-extracted. --> <!-- CANDIDATE CHAIN --> <!-- processors declared as named beans --> <bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper"> </bean> <bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer"> <!-- <property name="preferenceDepthHops" value="-1" /> --> <!-- <property name="preferenceEmbedHops" value="1" /> --> <!-- <property name="canonicalizationPolicy"> <ref bean="canonicalizationPolicy" /> </property> --> <!-- <property name="queueAssignmentPolicy"> <ref bean="queueAssignmentPolicy" /> </property> --> <!-- <property name="uriPrecedencePolicy"> <ref bean="uriPrecedencePolicy" /> </property> --> <!-- <property name="costAssignmentPolicy"> <ref bean="costAssignmentPolicy" /> </property> --> </bean> <!-- assembled into ordered CandidateChain bean --> <bean id="candidateProcessors" class="org.archive.modules.CandidateChain"> <property name="processors"> <list> <!-- apply scoping rules to each individual candidate URI... --> <ref bean="candidateScoper"/> <!-- ...then prepare those ACCEPTed for enqueuing to frontier. 
--> <ref bean="preparer"/> </list> </property> </bean> <!-- FETCH CHAIN --> <!-- processors declared as named beans --> <bean id="preselector" class="org.archive.crawler.prefetch.Preselector"> <!-- <property name="recheckScope" value="false" /> --> <!-- <property name="blockAll" value="false" /> --> <!-- <property name="blockByRegex" value="" /> --> <!-- <property name="allowByRegex" value="" /> --> </bean> <bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer"> <!-- <property name="ipValidityDurationSeconds" value="21600" /> --> <!-- <property name="robotsValidityDurationSeconds" value="86400" /> --> <!-- <property name="calculateRobotsOnly" value="false" /> --> </bean> <bean id="fetchDns" class="org.archive.modules.fetcher.FetchDNS"> <!-- <property name="acceptNonDnsResolves" value="false" /> --> <!-- <property name="digestContent" value="true" /> --> <!-- <property name="digestAlgorithm" value="sha1" /> --> </bean> <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP"> <!-- <property name="maxLengthBytes" value="0" /> --> <!-- <property name="timeoutSeconds" value="1200" /> --> <!-- <property name="maxFetchKBSec" value="0" /> --> <!-- <property name="defaultEncoding" value="ISO-8859-1" /> --> <!-- <property name="shouldFetchBodyRule"> <bean class="org.archive.modules.deciderules.AcceptDecideRule"/> </property> --> <!-- <property name="soTimeoutMs" value="20000" /> --> <!-- <property name="sendIfModifiedSince" value="true" /> --> <!-- <property name="sendIfNoneMatch" value="true" /> --> <!-- <property name="sendConnectionClose" value="true" /> --> <!-- <property name="sendReferer" value="true" /> --> <!-- <property name="sendRange" value="false" /> --> <!-- <property name="ignoreCookies" value="false" /> --> <!-- <property name="sslTrustLevel" value="OPEN" /> --> <!-- <property name="acceptHeaders"> <list> </list> </property> --> <!-- <property name="httpBindAddress" value="" /> --> <!-- <property name="httpProxyHost" value="" /> --> <!-- <property name="httpProxyPort" value="0" /> --> <!-- <property name="digestContent" value="true" /> --> <!-- <property name="digestAlgorithm" value="sha1" /> --> </bean> <bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP"> </bean> <bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML"> <!-- <property name="extractJavascript" value="true" /> --> <!-- <property name="extractValueAttributes" value="true" /> --> <!-- <property name="ignoreFormActionUrls" value="false" /> --> <!-- <property name="extractOnlyFormGets" value="true" /> --> <!-- <property name="treatFramesAsEmbedLinks" value="true" /> --> <!-- <property name="ignoreUnexpectedHtml" value="true" /> --> <!-- <property name="maxElementLength" value="1024" /> --> <!-- <property name="maxAttributeNameLength" value="1024" /> --> <!-- <property name="maxAttributeValueLength" value="16384" /> --> </bean> <bean id="extractorCss" class="org.archive.modules.extractor.ExtractorCSS"> </bean> <bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS"> </bean> <bean id="extractorSwf" class="org.archive.modules.extractor.ExtractorSWF"> </bean> <!-- assembled into ordered FetchChain bean --> <bean id="fetchProcessors" class="org.archive.modules.FetchChain"> <property name="processors"> <list> <!-- recheck scope, if so enabled... --> <ref bean="preselector"/> <!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... --> <ref bean="preconditions"/> <!-- ...fetch if DNS URI... 
--> <ref bean="fetchDns"/> <!-- ...fetch if HTTP URI... --> <ref bean="fetchHttp"/> <!-- ...extract oulinks from HTTP headers... --> <ref bean="extractorHttp"/> <!-- ...extract oulinks from HTML content... --> <ref bean="extractorHtml"/> <!-- ...extract oulinks from CSS content... --> <ref bean="extractorCss"/> <!-- ...extract oulinks from Javascript content... --> <ref bean="extractorJs"/> <!-- ...extract oulinks from Flash content... --> <ref bean="extractorSwf"/> </list> </property> </bean> <!-- DISPOSITION CHAIN --> <!-- processors declared as named beans --> <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor"> <!-- <property name="compress" value="true" /> --> <!-- <property name="prefix" value="IAH" /> --> <!-- <property name="suffix" value="${HOSTNAME}" /> --> <!-- <property name="maxFileSizeBytes" value="1000000000" /> --> <!-- <property name="poolMaxActive" value="1" /> --> <!-- <property name="poolMaxWaitMs" value="300000" /> --> <!-- <property name="skipIdenticalDigests" value="false" /> --> <!-- <property name="maxTotalBytesToWrite" value="0" /> --> <!-- <property name="directory" value="." /> --> <!-- <property name="storePaths"> <list> <value>warcs</value> </list> </property> --> <!-- <property name="writeRequests" value="true" /> --> <!-- <property name="writeMetadata" value="true" /> --> <!-- <property name="writeRevisitForIdenticalDigests" value="true" /> --> <!-- <property name="writeRevisitForNotModified" value="true" /> --> </bean> <bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor"> <!-- <property name="seedsRedirectNewSeeds" value="true" /> --> </bean> <bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor"> <!-- <property name="delayFactor" value="5.0" /> --> <!-- <property name="minDelayMs" value="3000" /> --> <!-- <property name="respectCrawlDelayUpToSeconds" value="300" /> --> <!-- <property name="maxDelayMs" value="30000" /> --> <!-- <property name="maxPerHostBandwidthUsageKbSec" value="0" /> --> </bean> <!-- assembled into ordered DispositionChain bean --> <bean id="dispositionProcessors" class="org.archive.modules.DispositionChain"> <property name="processors"> <list> <!-- write to aggregate archival files... --> <ref bean="warcWriter"/> <!-- ...send each outlink candidate URI to CandidatesChain, and enqueue those ACCEPTed to the frontier... 
--> <ref bean="candidates"/> <!-- ...then update stats, shared-structures, frontier decisions --> <ref bean="disposition"/> </list> </property> </bean> <!-- CRAWLCONTROLLER: Control interface, unifying context --> <bean id="crawlController" class="org.archive.crawler.framework.CrawlController"> <!-- <property name="maxToeThreads" value="25" /> --> <!-- <property name="pauseAtStart" value="true" /> --> <!-- <property name="pauseAtFinish" value="false" /> --> <!-- <property name="recorderInBufferBytes" value="524288" /> --> <!-- <property name="recorderOutBufferBytes" value="16384" /> --> <!-- <property name="scratchDir" value="scratch" /> --> </bean> <!-- FRONTIER: Record of all URIs discovered and queued-for-collection --> <bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier"> <!-- <property name="holdQueues" value="true" /> --> <!-- <property name="queueTotalBudget" value="-1" /> --> <!-- <property name="balanceReplenishAmount" value="3000" /> --> <!-- <property name="errorPenaltyAmount" value="100" /> --> <!-- <property name="precedenceFloor" value="255" /> --> <!-- <property name="queuePrecedencePolicy"> <bean class="org.archive.crawler.frontier.precedence.BaseQueuePrecedencePolicy" /> </property> --> <!-- <property name="snoozeLongMs" value="300000" /> --> <!-- <property name="retryDelaySeconds" value="900" /> --> <!-- <property name="maxRetries" value="30" /> --> <!-- <property name="recoveryLogEnabled" value="true" /> --> <!-- <property name="maxOutlinks" value="6000" /> --> <!-- <property name="outbound"> <bean class="java.util.concurrent.ArrayBlockingQueue"> <constructor-arg value="200"/> <constructor-arg value="true"/> </bean> </property> --> <!-- <property name="inbound"> <bean class="java.util.concurrent.ArrayBlockingQueue"> <constructor-arg value="40000"/> <constructor-arg value="true"/> </bean> </property> --> <!-- <property name="dumpPendingAtClose" value="false" /> --> </bean> <!-- URI UNIQ FILTER: Used by frontier to remember already-included URIs --> <bean id="uriUniqFilter" class="org.archive.crawler.util.BdbUriUniqFilter"> </bean> <!-- OPTIONAL BUT RECOMMENDED BEANS --> <!-- ACTIONDIRECTORY: disk directory for mid-crawl operations Running job will watch directory for new files with URIs, scripts, and other data to be processed during a crawl. --> <bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory"> <!-- <property name="actionDir" value="action" /> --> <!-- <property name="initialDelaySeconds" value="10" /> --> <!-- <property name="delaySeconds" value="30" /> --> </bean> <!-- CRAWLLIMITENFORCER: stops crawl when it reaches configured limits --> <bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer"> <!-- <property name="maxBytesDownload" value="0" /> --> <!-- <property name="maxDocumentsDownload" value="0" /> --> <!-- <property name="maxTimeSeconds" value="0" /> --> </bean> <!-- CHECKPOINTSERVICE: checkpointing assistance --> <bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService"> <!-- <property name="checkpointIntervalMinutes" value="-1"/> --> <!-- <property name="checkpointsDir" value="checkpoints"/> --> </bean> <!-- OPTIONAL BEANS Uncomment and expand as needed, or if non-default alternate implementations are preferred. 
--> <!-- CANONICALIZATION POLICY --> <!-- <bean id="canonicalizationPolicy" class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy"> <property name="rules"> <list> <bean class="org.archive.modules.canonicalize.LowercaseRule" /> <bean class="org.archive.modules.canonicalize.StripUserinfoRule" /> <bean class="org.archive.modules.canonicalize.StripWWWNRule" /> <bean class="org.archive.modules.canonicalize.StripSessionIDs" /> <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" /> <bean class="org.archive.modules.canonicalize.FixupQueryString" /> </list> </property> </bean> --> <!-- QUEUE ASSIGNMENT POLICY --> <!-- <bean id="queueAssignmentPolicy" class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy"> <property name="forceQueueAssignment" value="" /> <property name="deferToPrevious" value="true" /> <property name="parallelQueues" value="1" /> </bean> --> <!-- URI PRECEDENCE POLICY --> <!-- <bean id="uriPrecedencePolicy" class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy"> </bean> --> <!-- COST ASSIGNMENT POLICY --> <!-- <bean id="costAssignmentPolicy" class="org.archive.crawler.frontier.UnitCostAssignmentPolicy"> </bean> --> <!-- CREDENTIAL STORE: HTTP authentication or FORM POST credentials --> <!-- <bean id="credentialStore" class="org.archive.modules.credential.CredentialStore"> </bean> --> <!-- REQUIRED STANDARD BEANS It will be very rare to replace or reconfigure the following beans. --> <!-- STATISTICSTRACKER: standard stats/reporting collector --> <bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName"> <!-- <property name="reportsDir" value="reports" /> --> <!-- <property name="liveHostReportSize" value="20" /> --> <!-- <property name="intervalSeconds" value="20" /> --> <!-- <property name="keepSnapshotsCount" value="5" /> --> <!-- <property name="liveHostReportSize" value="20" /> --> </bean> <!-- CRAWLERLOGGERMODULE: shared logging facility --> <bean id="loggerModule" class="org.archive.crawler.reporting.CrawlerLoggerModule"> <!-- <property name="path" value="logs" /> --> <!-- <property name="crawlLogPath" value="crawl.log" /> --> <!-- <property name="alertsLogPath" value="alerts.log" /> --> <!-- <property name="progressLogPath" value="progress-statistics.log" /> --> <!-- <property name="uriErrorsLogPath" value="uri-errors.log" /> --> <!-- <property name="runtimeErrorsLogPath" value="runtime-errors.log" /> --> <!-- <property name="nonfatalErrorsLogPath" value="nonfatal-errors.log" /> --> </bean> <!-- SHEETOVERLAYMANAGER: manager of sheets of contextual overlays Autowired to include any SheetForSurtPrefix or SheetForDecideRuled beans --> <bean id="sheetOverlaysManager" autowire="byType" class="org.archive.crawler.spring.SheetOverlaysManager"> </bean> <!-- BDBMODULE: shared BDB-JE disk persistence manager --> <bean id="bdb" class="org.archive.bdb.BdbModule"> <!-- <property name="dir" value="state" /> --> <!-- <property name="cachePercent" value="60" /> --> <!-- <property name="useSharedCache" value="true" /> --> <!-- <property name="expectedConcurrency" value="25" /> --> </bean> <!-- BDBCOOKIESTORAGE: disk-based cookie storage for FetchHTTP --> <bean id="cookieStorage" class="org.archive.modules.fetcher.BdbCookieStorage"> <!-- <property name="cookiesLoadFile"><null/></property> --> <!-- <property name="cookiesSaveFile"><null/></property> --> <!-- <property name="bdb"> <ref bean="bdb"/> </property> --> </bean> <!-- SERVERCACHE: shared cache of server/host info --> <bean 
id="serverCache" class="org.archive.modules.net.BdbServerCache"> <!-- <property name="bdb"> <ref bean="bdb"/> </property> --> </bean> <!-- CONFIG PATH CONFIGURER: required helper making crawl paths relative to crawler-beans.cxml file, and tracking crawl files for web UI --> <bean id="configPathConfigurer" class="org.archive.spring.ConfigPathConfigurer"> </bean> </beans> |
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:20
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840/src/main/java/is/landsbokasafn/deduplicator Added Files: DeDuplicator.java CrawlLogIterator.java DigestIndexer.java CrawlDataIterator.java DeDupFetchHTTP.java CommandLineParser.java overview.html DedupAttributeConstants.java CrawlDataItem.java Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: DedupAttributeConstants.java --- package is.landsbokasafn.deduplicator; /** * Lifted from H1 AdaptiveRevisitAttributeConstants and limited to what DeDuplicator was using. * * */ public interface DedupAttributeConstants { /** No knowledge of URI content. Possibly not fetched yet, unable * to check if different or an error occurred on last fetch attempt. */ public static final int CONTENT_UNKNOWN = -1; /** URI content has not changed between the two latest, successfully * completed fetches. */ public static final int CONTENT_UNCHANGED = 0; /** URI content had changed between the two latest, successfully completed * fetches. By definition, content has changed if there has only been one * successful fetch made. */ public static final int CONTENT_CHANGED = 1; /** * Key to use getting state of crawluri from the CrawlURI data. */ public static final String A_CONTENT_STATE_KEY = "revisit-state"; } --- NEW FILE: DeDupFetchHTTP.java --- /* DeDupFetchHTTP * * Created on 10.04.2006 * * Copyright (C) 2006 National and University Library of Iceland * * This file is part of the DeDuplicator (Heritrix add-on module). * * DeDuplicator is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * DeDuplicator is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with DeDuplicator; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package is.landsbokasafn.deduplicator; import java.io.IOException; import java.text.SimpleDateFormat; import java.util.logging.Level; import java.util.logging.Logger; import org.archive.modules.fetcher.FetchHTTP; import dk.netarkivet.common.utils.SparseRangeFilter; /** * An extentsion of Heritrix's {@link org.archive.crawler.fetcher.FetchHTTP} * processor for downloading HTTP documents. This extension adds a check after * the content header has been downloaded that compares the 'last-modified' and * or 'last-etag' values from the header against information stored in an * appropriate index. 
* * @author Kristinn Sigurðsson * @see is.hi.bok.deduplicator.DigestIndexer * @see org.archive.crawler.fetcher.FetchHTTP */ public class DeDupFetchHTTP extends FetchHTTP { // // private static final long serialVersionUID = // ArchiveUtils.classnameBasedUID(DeDupFetchHTTP.class,1); // // private static Logger logger = Logger.getLogger(FetchHTTP.class.getName()); // // protected IndexSearcher index; // protected String mimefilter = DEFAULT_MIME_FILTER; // protected boolean blacklist = true; // // SimpleDateFormat sdfLastModified; // SimpleDateFormat sdfIndexDate; // // protected long processedURLs = 0; // protected long unchangedURLs = 0; // // protected boolean useSparseRangeFilter = DEFAULT_USE_SPARSE_RANGE_FILTER; // // // Settings. // public static final String ATTR_DECISION_SCHEME = "decision-scheme"; // public static final String SCHEME_TIMESTAMP = "Timestamp only"; // public static final String SCHEME_ETAG = "Etag only"; // public static final String SCHEME_TIMESTAMP_AND_ETAG = "Timestamp AND Etag"; // public static final String SCHEME_TIMESTAMP_OR_ETAG = "Timestamp OR Etag"; // public static final String[] AVAILABLE_DECISION_SCHEMES = { // SCHEME_TIMESTAMP, // SCHEME_ETAG, // SCHEME_TIMESTAMP_AND_ETAG, // SCHEME_TIMESTAMP_OR_ETAG // }; // public static final String DEFAULT_DECISION_SCHEME = // SCHEME_TIMESTAMP; // // public static final String ATTR_INDEX_LOCATION = "index-location"; // public static final String DEFAULT_INDEX_LOCATION = ""; // // /** The filter on mime types. This is either a blacklist or whitelist // * depending on ATTR_FILTER_MODE. // */ // public final static String ATTR_MIME_FILTER = "mime-filter"; // public final static String DEFAULT_MIME_FILTER = "^text/.*"; // // /** Is the mime filter a blacklist (do not apply processor to what matches) // * or whitelist (apply processor only to what matches). // */ // public final static String ATTR_FILTER_MODE = "filter-mode"; // public final static String[] AVAILABLE_FILTER_MODES = { // "Blacklist", // "Whitelist" // }; // public final static String DEFAULT_FILTER_MODE = AVAILABLE_FILTER_MODES[0]; // // /** Should we use sparse queries (uses less memory at a cost to performance? **/ // public final static String ATTR_USE_SPARSE_RANGE_FILTER = "use-sparse-range-filter"; // public final static Boolean DEFAULT_USE_SPARSE_RANGE_FILTER = Boolean.FALSE; // // public DeDupFetchHTTP(String name){ // super(name); // setDescription("Fetch HTTP processor that aborts downloading of " + // "unchanged documents. 
This processor extends the standard " + // "FetchHTTP processor, adding a check after the header is " + // "downloaded where the header information for 'last-modified' " + // "and 'etag' is compared against values stored in a Lucene " + // "index built using the DigestIndexer.\n Note that the index " + // "must have been built indexed by URL and the Timestamp " + // "and/or Etag info must have been included in the index!"); // Type t; // t = new SimpleType( // ATTR_DECISION_SCHEME, // "The different schmes for deciding when to re-download a " + // "page given an old version of the same page (or rather " + // "meta-data on it)\n " + // "Timestamp only: Download when a datestamp is missing " + // "in either the downloaded header or index or if the header " + // "datestamp is newer then the one in the index.\n " + // "Etag only: Download when the Etag is missing in either the" + // "header download or the index or the header Etag and the one " + // "in the index differ.\n " + // "Timestamp AND Etag: When both datestamp and Etag are " + // "available in both the header download and the index, " + // "download if EITHER of them indicates change." + // "Timestamp OR Etag: When both datestamp and Etag are " + // "available in both the header download and the index, " + // "download only if BOTH of them indicate change.", // DEFAULT_DECISION_SCHEME,AVAILABLE_DECISION_SCHEMES); // addElementToDefinition(t); // t = new SimpleType( // ATTR_INDEX_LOCATION, // "Location of index (full path). Can not be changed at run " + // "time.", // DEFAULT_INDEX_LOCATION); // t.setOverrideable(false); // addElementToDefinition(t); // t = new SimpleType( // ATTR_MIME_FILTER, // "A regular expression that the mimetype of all documents " + // "will be compared against. Only those that pass will be " + // "considered. Others are given a pass. " + // "\nIf the attribute filter-mode is " + // "set to 'Blacklist' then all the documents whose mimetype " + // "matches will be ignored by this processor. If the filter-" + // "mode is set to 'Whitelist' only those documents whose " + // "mimetype matches will be processed.", // DEFAULT_MIME_FILTER); // t.setOverrideable(false); // t.setExpertSetting(true); // addElementToDefinition(t); // t = new SimpleType( // ATTR_FILTER_MODE, // "Determines if the mime-filter acts as a blacklist (declares " + // "what should be ignored) or whitelist (declares what should " + // "be processed).", // DEFAULT_FILTER_MODE,AVAILABLE_FILTER_MODES); // t.setOverrideable(false); // t.setExpertSetting(true); // addElementToDefinition(t); // // t = new SimpleType( // ATTR_USE_SPARSE_RANGE_FILTER, // "If set to true, then Lucene queries use a custom 'sparse' " + // "range filter. This uses less memory at the cost of some " + // "lost performance. Suitable for very large indexes.", // DEFAULT_USE_SPARSE_RANGE_FILTER); // t.setOverrideable(false); // t.setExpertSetting(true); // addElementToDefinition(t); // } // // protected boolean checkMidfetchAbort( // CrawlURI curi, HttpRecorderMethod method, HttpConnection conn) { // // We'll check for prerequisites here since there is no way to know // // if the super method returns false because of a prereq or because // // all filters accepeted. // if(curi.isPrerequisite()){ // return false; // } // // // Run super to allow filters to also abort. Also this method has // // been pressed into service as a general 'stuff to do at this point' // boolean ret = super.checkMidfetchAbort(curi, method, conn); // // // Ok, now check for duplicates. 
// if(isDuplicate(curi)){ // ret = true; // unchangedURLs++; // curi.putInt(A_CONTENT_STATE_KEY, CONTENT_UNCHANGED); // curi.addAnnotation("header-duplicate"); // // } // // return ret; // } // // /** // * Compare the header infomation for 'last-modified' and/or 'etag' against // * data in the index. // * @param curi The Crawl URI being processed. // * @return True if header infomation indicates that the document has not // * changed since the crawl that the index is based on was performed. // */ // protected boolean isDuplicate(CrawlURI curi) { // boolean ret = false; // if(curi.getContentType() != null && // curi.getContentType().matches(mimefilter) != blacklist){ // processedURLs++; // // Ok, passes mime-filter // HttpMethod method = (HttpMethod)curi.getObject(A_HTTP_TRANSACTION); // // Check the decision scheme. // String scheme = (String)getUncheckedAttribute( // curi,ATTR_DECISION_SCHEME); // // Document doc = lookup(curi); // // if(doc != null){ // // Found a hit. Do the necessary evalution. // if(scheme.equals(SCHEME_TIMESTAMP)){ // ret = datestampIndicatesNonChange(method,doc); // } else if(scheme.equals(SCHEME_ETAG)){ // ret = etagIndicatesNonChange(method,doc); // } else { // // if(scheme.equals(SCHEME_TIMESTAMP_AND_ETAG)){ // ret = datestampIndicatesNonChange(method,doc) // && etagIndicatesNonChange(method,doc); // } else if(scheme.equals(SCHEME_TIMESTAMP_OR_ETAG)){ // ret = datestampIndicatesNonChange(method,doc) // || etagIndicatesNonChange(method,doc); // } else { // logger.log(Level.SEVERE, "Unknown decision sceme: " + scheme); // } // } // } // } // return ret; // } // // /** // * Checks if the 'last-modified' in the HTTP header and compares it against // * the timestamp in the supplied Lucene document. If both dates are found // * and the header's date is older then the datestamp indicates non-change. // * Otherwise a change must be assumed. // * @param method HTTPMethod that allows access to the relevant HTTP header // * @param doc The Lucene document to compare against // * @return True if a the header and document data indicates a non-change. // * False otherwise. // */ // protected boolean datestampIndicatesNonChange( // HttpMethod method, Document doc) { // String headerDate = null; // if (method.getResponseHeader("last-modified") != null) { // headerDate = method.getResponseHeader("last-modified").getValue(); // } // String indexDate = doc.get(DigestIndexer.FIELD_TIMESTAMP); // // if(headerDate != null && indexDate != null){ // try { // // If both dates exist and last-modified is before the index // // date then we assume no change has occured. // return (sdfLastModified.parse(headerDate)).before( // sdfIndexDate.parse(indexDate)); // } catch (Exception e) { // // Any exceptions parsing the date should be interpreted as // // missing date information. // // ParseException and NumberFormatException are the most // // likely exceptions to occur. // return false; // } // } // return false; // } // // /** // * Checks if the 'etag' in the HTTP header and compares it against // * the etag in the supplied Lucene document. If both dates are found // * and match then the datestamp indicate non-change. // * Otherwise a change must be assumed. // * @param method HTTPMethod that allows access to the relevant HTTP header // * @param doc The Lucene document to compare against // * @return True if a the header and document data indicates a non-change. // * False otherwise. 
// */ // protected boolean etagIndicatesNonChange( // HttpMethod method, Document doc) { // String headerEtag = null; // if (method.getResponseHeader("last-etag") != null) { // headerEtag = method.getResponseHeader("last-etag").getValue(); // } // String indexEtag = doc.get(DigestIndexer.FIELD_ETAG); // // if(headerEtag != null && indexEtag != null){ // // If both etags exist and are identical then we assume no // // change has occured. // return headerEtag.equals(indexEtag); // } // return false; // } // // /** // * Searches the index for the URL of the given CrawlURI. If multiple hits // * are found the most recent one is returned if the index included the // * timestamp, otherwise a random one is returned. // * If no hit is found null is returned. // * @param curi The CrawlURI to search for // * @return the index Document matching the URI or null if none was found // */ // protected Document lookup(CrawlURI curi) { // try{ // Query query = null; // if(useSparseRangeFilter){ // query = new ConstantScoreQuery(new SparseRangeFilter( // DigestIndexer.FIELD_URL,curi.toString(),curi.toString(), // true,true)); // } else { // query = new ConstantScoreQuery(new RangeFilter( // DigestIndexer.FIELD_URL,curi.toString(),curi.toString(), // true,true)); // } // // Hits hits = index.search(query); // Document doc = null; // if(hits != null && hits.length() > 0){ // // If there are multiple hits, use the one with the most // // recent date. // Document docToEval = null; // for(int i=0 ; i<hits.length() ; i++){ // doc = hits.doc(i); // // The format of the timestamp ("yyyyMMddHHmmssSSS") allows // // us to do a greater then (later) or lesser than (earlier) // // comparison of the strings. // String timestamp = doc.get(DigestIndexer.FIELD_TIMESTAMP); // if(docToEval == null || timestamp == null // || docToEval.get(DigestIndexer.FIELD_TIMESTAMP) // .compareTo(timestamp)>0){ // // Found a more recent hit or timestamp is null // // NOTE: Either all hits should have a timestamp or // // none. This implementation will cause the last // // URI in the hit list to be returned if there is no // // timestamp. 
// docToEval = doc; // } // } // return docToEval; // } // } catch(IOException e){ // logger.log(Level.SEVERE,"Error accessing index.",e); // } // return null; // } // // public void finalTasks() { // super.finalTasks(); // try { // index.close(); // } catch (IOException e) { // logger.log(Level.SEVERE,"Error closing index",e); // } // } // // public void initialTasks() { // super.initialTasks(); // // Index location // try { // String indexLocation = (String)getAttribute(ATTR_INDEX_LOCATION); // index = new IndexSearcher(indexLocation); // } catch (Exception e) { // logger.log(Level.SEVERE,"Unable to find/open index.",e); // } // // // Mime filter // try { // mimefilter = (String)getAttribute(ATTR_MIME_FILTER); // } catch (Exception e) { // logger.log(Level.SEVERE,"Unable to get attribute " + // ATTR_MIME_FILTER,e); // } // // // Filter mode (blacklist (default) or whitelist) // try { // blacklist = ((String)getAttribute(ATTR_FILTER_MODE)).equals( // DEFAULT_FILTER_MODE); // } catch (Exception e) { // logger.log(Level.SEVERE,"Unable to get attribute " + // ATTR_FILTER_MODE,e); // } // // // Date format of last-modified is EEE, dd MMM yyyy HH:mm:ss z // sdfLastModified = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z"); // // Date format of indexDate is yyyyMMddHHmmssSSS // sdfIndexDate = new SimpleDateFormat("yyyyMMddHHmmssSSS"); // // // Range Filter type // try { // useSparseRangeFilter = ((Boolean)getAttribute( // ATTR_USE_SPARSE_RANGE_FILTER)).booleanValue(); // } catch (Exception e) { // logger.log(Level.SEVERE,"Unable to get attribute " + // ATTR_USE_SPARSE_RANGE_FILTER,e); // useSparseRangeFilter = DEFAULT_USE_SPARSE_RANGE_FILTER; // } // // } // // public String report() { // StringBuffer ret = new StringBuffer(); // ret.append("Processor: is.hi.bok.deduplicator.DeDupFetchHTTP\n"); // ret.append(" URLs compared against index: " + processedURLs + "\n"); // ret.append(" URLs judged unchanged: " + unchangedURLs + "\n"); // ret.append(" processor extends (parent report)\n"); // ret.append(super.report()); // return ret.toString(); // } } --- NEW FILE: CrawlDataItem.java --- /* CrawlDataItem * * Created on 10.04.2006 * * Copyright (C) 2006 National and University Library of Iceland * * This file is part of the DeDuplicator (Heritrix add-on module). * * DeDuplicator is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * DeDuplicator is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with DeDuplicator; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package is.landsbokasafn.deduplicator; /** * A base class for individual items of crawl data that should be added to the * index. * * @author Kristinn Sigurðsson */ public class CrawlDataItem { /** * The proper formating of {@link #setURL(String)} and {@link #getURL()} */ public static final String dateFormat = "yyyyMMddHHmmssSSS"; protected String URL; protected String contentDigest; protected String timestamp; protected String etag; protected String mimetype; protected String origin; protected boolean duplicate; /** * Constructor. 
Creates a new CrawlDataItem with all its data initialized * to null. */ public CrawlDataItem(){ URL = null; contentDigest = null; timestamp = null; etag = null; mimetype = null; origin = null; duplicate = false; } /** * Constructor. Creates a new CrawlDataItem with all its data initialized * via the constructor. * * @param URL The URL for this CrawlDataItem * @param contentDigest A content digest of the document found at the URL * @param timestamp Date of when the content digest was valid for that URL. * Format: yyyyMMddHHmmssSSS * @param etag Etag for the URL * @param mimetype MIME type of the document found at the URL * @param origin The origin of the CrawlDataItem (the exact meaning of the * origin is outside the scope of this class and it may be * any String value) * @param duplicate True if this CrawlDataItem was marked as duplicate */ public CrawlDataItem(String URL, String contentDigest, String timestamp, String etag, String mimetype, String origin, boolean duplicate){ this.URL = URL; this.contentDigest = contentDigest; this.timestamp = timestamp; this.etag = etag; this.mimetype = mimetype; this.origin = origin; this.duplicate = duplicate; } /** * Returns the URL * @return the URL */ public String getURL() { return URL; } /** * Set the URL * @param URL the new URL */ public void setURL(String URL){ this.URL = URL; } /** * Returns the documents content digest * @return the documents content digest */ public String getContentDigest(){ return contentDigest; } /** * Set the content digest * @param contentDigest The new value of the content digest */ public void setContentDigest(String contentDigest){ this.contentDigest = contentDigest; } /** * Returns a timestamp for when the URL was fetched in the format: * yyyyMMddHHmmssSSS * @return the time of the URLs fetching */ public String getTimestamp(){ return timestamp; } /** * Set a new timestamp. * @param timestamp The new timestamp. It should be in the format: * yyyyMMddHHmmssSSS */ public void setTimestamp(String timestamp){ this.timestamp = timestamp; } /** * Returns the etag that was associated with the document. * <p> * If etag is unavailable null will be returned. * @return the etag. */ public String getEtag(){ return etag; } /** * Set a new Etag * @param etag The new etag */ public void setEtag(String etag){ this.etag = etag; } /** * Returns the mimetype that was associated with the document. * @return the mimetype. */ public String getMimeType(){ return mimetype; } /** * Set new MIME type. * @param mimetype The new MIME type */ public void setMimeType(String mimetype){ this.mimetype = mimetype; } /** * Returns the "origin" that was associated with the document. * @return the origin (may be null if none was provided for the document) */ public String getOrigin() { return origin; } /** * Set new origin * @param origin A new origin. */ public void setOrigin(String origin){ this.origin = origin; } /** * Returns whether the CrawlDataItem was marked as duplicate. * @return true if duplicate, false otherwise */ public boolean isDuplicate() { return duplicate; } /** * Set whether duplicate or not. * @param duplicate true if duplicate, false otherwise */ public void setDuplicate(boolean duplicate) { this.duplicate = duplicate; } } --- NEW FILE: CrawlDataIterator.java --- /* CrawlDataIterator * * Created on 10.04.2006 * * Copyright (C) 2006 National and University Library of Iceland * * This file is part of the DeDuplicator (Heritrix add-on module). 
* * DeDuplicator is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * DeDuplicator is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with DeDuplicator; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package is.landsbokasafn.deduplicator; import java.io.IOException; /** * An abstract base class for implementations of iterators that iterate over * different sets of crawl data (i.e. crawl.log, ARC, WARC etc.) * * @author Kristinn Sigurðsson */ public abstract class CrawlDataIterator { String source; /** * Constructor. * * @param source The location of the crawl data. The meaning of this * value may vary based on the implementation of concrete * subclasses. Typically it will refer to a directory or a * file. */ public CrawlDataIterator(String source){ this.source = source; } /** * Are there more elements? * @return true if there are more elements, false otherwise * @throws IOException If an error occurs accessing the crawl data. */ public abstract boolean hasNext() throws IOException; /** * Get the next {@link CrawlDataItem}. * @return the next CrawlDataItem. If there are no further elements then * null will be returned. * @throws IOException If an error occurs accessing the crawl data. */ public abstract CrawlDataItem next() throws IOException; /** * Close any resources held open to read the crawl data. * @throws IOException If an error occurs closing access to crawl data. */ public abstract void close() throws IOException; /** * A short, human readable, string about what source this iterator uses. * I.e. "Iterator for Heritrix style crawl.log" etc. * @return A short, human readable, string about what source this iterator * uses. */ public abstract String getSourceType(); } --- NEW FILE: overview.html --- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <head> <title>DeDuplicator</title> <meta name="author" content="Kristinn Sigurdsson" > <meta http-equiv="content-type" content="text/html; charset=UTF-8" > <meta http-equiv="Content-Script-Type" content="text/javascript" > <meta http-equiv="Content-Style-Type" content="text/css" > </head> <body> <h1>DeDuplicator Overview.</h1> <h2>Getting started</h2> <h3>Building an index</h3> <ol> <li>A functional installation of Heritrix is required for this software to work. While Heritrix can be deployed on non-Linux operating systems that requires some degree of work as the bundled scripts are written for Linux. The same applies to this software and the following instructions assume that Heritrix is installed on a Linux machine under $HERITRIX_HOME.</li> <li>Install the DeDuplicator software. The jar files should be included in $HERITRIX_HOME/lib/ while the dedupdigest script should be added to $HERITRIX_HOME/bin/. If you've downloaded a .tar.gz bundle, explode it into $HERITRIX_HOME and all the files will be correctly deployed. 
<em>NOTE:</em> Heritrix can not be running at the same time as the DeDuplicator software is run.</li> <li>Make the dedupdigest script executable with <code>chmod u+x $HERITRIX_HOME/bin/dedupdigest</code></li> <li>Run <code>$HERITRIX_HOME/bin/dedupdigest --help</code> This will display the usage information for the indexing.<br> The program takes two arguments, the source data (crawl.log usually) and the target directory where the index will be written (will be created if not present). Several options are provided to custom tailor the type of index.</li> <li>Create an index. A typical index can be built with<br> <code>$HERITRIX_HOME/bin/dedupdigest -o URL -s -t <location of crawl.log> <index output directory></code><br> This will create an index that is indexed by URL only (not by the content digest) and includes equivalent URLs and timestamps.</li> </ol> <h3>Using the index</h3> <ol> <li>Having built an appropriate index, launch Heritrix. Make sure that the installation of Heritrix that you launched has the two JARs that come with the DeDuplicator (deduplicator-[version].jar and lucene-[version].jar) if it is not the same one used for creating the index.</li> <li>Configure a crawl job as normal except add the DeDuplicator processor to the processing chain at some point <em>after</em> the HTTPFetcher processor and prior to any processor which should be skipped if a duplicate is detected. When the DeDuplicator finds a duplicate the processing moves straight to the PostProcessing chain. So if you insert it at the top of the Extractor chain you can skip both link extraction and writing to disk. If you do not wish to skip link extraction you can insert the processor at the end of the link extraction chain etc.</li> <li>The DeDuplicator processor has several configurable parameters. <ol> <li><em>enabled</em> Standard Heritrix property for processors. Should be true. Setting it to false will disable the processor.</li> <li><em>index-location</em> The most important setting. A full path to the directory that contains the index (output directory of the indexing.)</li> <li><em>matching-method</em> Whether to lookup URLs or content digests first when looking for matches. This setting depends on how the index was built (indexing mode). If it was set to BOTH then either setting will work. Otherwise it must be set according to the indexing mode.</li> <li><em>try-equivalent</em> Should equivalent URLs be tried if an exact URL and content digest match is not found. Using equivalent matches means that duplicate documents whose URLs differ only in the parameter list or because of www[0-9]* prefixes are detected.</li> <li><em>mime-filter</em> Which documents to process</li> <li><em>filter-mode</em></li> <li><em>analysis-mode</em> Enables analysis of the usefulness and accuracy of header information in predicting change and non-change in documents. For statistical gathering purposes only.</li> <li><em>log-level</em> Enables more logging.</li> <li><em>stats-per-host</em> Maintains statistics per host in addition to the crawl wide stats.</li> </ol> </li> <li>Once the processor has been configured the crawl can be started and run normally. 
Information about the processor is available via the Processor report in the Heritrix GUI (this is saved to processors-report.txt at the end of a crawl).<br> Duplicate URLs will still show up in the crawl log but with a note 'duplicate' in the annotation field at the end of the log line.</li> </ol> </body> </html> --- NEW FILE: DeDuplicator.java --- /* DeDuplicator * * Created on 10.04.2006 * * Copyright (C) 2006-2010 National and University Library of Iceland * * This file is part of the DeDuplicator (Heritrix add-on module). * * DeDuplicator is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * DeDuplicator is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with DeDuplicator; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package is.landsbokasafn.deduplicator; import static is.landsbokasafn.deduplicator.DedupAttributeConstants.A_CONTENT_STATE_KEY; import static is.landsbokasafn.deduplicator.DedupAttributeConstants.CONTENT_UNCHANGED; import static org.archive.modules.recrawl.RecrawlAttributeConstants.A_CONTENT_DIGEST; import static org.archive.modules.recrawl.RecrawlAttributeConstants.A_FETCH_HISTORY; import java.io.File; import java.io.IOException; import java.text.ParseException; import java.text.SimpleDateFormat; import java.util.ArrayList; import java.util.Arrays; import java.util.Date; import java.util.HashMap; import java.util.Iterator; import java.util.List; import java.util.Locale; import java.util.Map; import java.util.logging.Level; import java.util.logging.Logger; import org.apache.commons.httpclient.HttpMethod; import org.apache.lucene.document.Document; import org.apache.lucene.search.ConstantScoreQuery; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.Query; import org.apache.lucene.search.ScoreDoc; import org.apache.lucene.search.TermRangeFilter; import org.apache.lucene.search.TopScoreDocCollector; import org.apache.lucene.store.FSDirectory; import org.archive.modules.CrawlURI; import org.archive.modules.ProcessResult; import org.archive.modules.Processor; import org.archive.modules.net.ServerCache; import org.archive.util.ArchiveUtils; import org.archive.util.Base32; import org.springframework.beans.factory.InitializingBean; import org.springframework.beans.factory.annotation.Autowired; import dk.netarkivet.common.utils.SparseRangeFilter; /** * Heritrix compatible processor. * <p> * Will determine if CrawlURIs are <i>duplicates</i>. * <p> * Duplicate detection can only be performed <i>after</i> the fetch processors * have run. 
* * @author Kristinn Sigurðsson */ @SuppressWarnings({"serial", "unchecked"}) public class DeDuplicator extends Processor implements InitializingBean { private static Logger logger = Logger.getLogger(DeDuplicator.class.getName()); private static final int MAX_HITS = 1000; // Spring configurable parameters /* Location of Lucene Index to use for lookups */ private final static String ATTR_INDEX_LOCATION = "index-location"; public String getIndexLocation() { return (String) kp.get(ATTR_INDEX_LOCATION); } public void setIndexLocation(String indexLocation) { kp.put(ATTR_INDEX_LOCATION,indexLocation); } /* The matching method in use (by url or content digest) */ private final static String ATTR_MATCHING_METHOD = "matching-method"; private final static List<String> AVAILABLE_MATCHING_METHODS = new ArrayList<String>(Arrays.asList(new String[]{ "URL", "Content digest" })); private final static String DEFAULT_MATCHING_METHOD = AVAILABLE_MATCHING_METHODS.get(0); { setMatchingMethod(DEFAULT_MATCHING_METHOD); } public String getMatchingMethod() { return (String) kp.get(ATTR_MATCHING_METHOD); } public void setMatchingMethod(String matchinMethod) { if (AVAILABLE_MATCHING_METHODS.contains(matchinMethod)) { kp.put(ATTR_MATCHING_METHOD,matchinMethod); } else { throw new IllegalArgumentException("Invalid matching method: " + matchinMethod); } } /* On duplicate, should jump to which part of processing chain? * If not set, nothing is skipped. Otherwise this should be the identity of the processor to jump to. */ public final static String ATTR_JUMP_TO = "jump-to"; public String getJumpTo(){ return (String)kp.get(ATTR_JUMP_TO); } public void setJumpTo(String jumpTo){ kp.put(ATTR_JUMP_TO, jumpTo); } /* Origin of duplicate URLs. May be overridden by info from index*/ public final static String ATTR_ORIGIN = "origin"; { setOrigin(""); } public String getOrigin() { return (String) kp.get(ATTR_ORIGIN); } public void setOrigin(String origin) { kp.put(ATTR_ORIGIN,origin); } /* If an exact match is not made, should the processor try * to find an equivalent match? */ public final static String ATTR_EQUIVALENT = "try-equivalent"; { setTryEquivalent(false); } public boolean getTryEquivalent(){ return (Boolean)kp.get(ATTR_EQUIVALENT); } public void setTryEquivalent(boolean tryEquivalent){ kp.put(ATTR_EQUIVALENT,tryEquivalent); } /* The filter on mime types. This is either a blacklist or whitelist * depending on ATTR_FILTER_MODE. */ public final static String ATTR_MIME_FILTER = "mime-filter"; public final static String DEFAULT_MIME_FILTER = "^text/.*"; { setMimeFilter(DEFAULT_MIME_FILTER); } public String getMimeFilter(){ return (String)kp.get(ATTR_MIME_FILTER); } public void setMimeFilter(String mimeFilter){ kp.put(ATTR_MIME_FILTER, mimeFilter); } /* Is the mime filter a blacklist (do not apply processor to what matches) * or whitelist (apply processor only to what matches). */ public final static String ATTR_FILTER_MODE = "filter-mode"; { setBlacklist(true); } public boolean getBlacklist(){ return (Boolean)kp.get(ATTR_FILTER_MODE); } public void setBlacklist(boolean blacklist){ kp.put(ATTR_FILTER_MODE, blacklist); } /* Analysis mode. */ public final static String ATTR_ANALYZE_TIMESTAMP = "analyze-timestamp"; { setAnalyzeTimestamp(false); } public boolean getAnalyzeTimestamp() { return (Boolean) kp.get(ATTR_ANALYZE_TIMESTAMP); } public void setAnalyzeTimestamp(boolean analyzeTimestamp) { kp.put(ATTR_ANALYZE_TIMESTAMP,analyzeTimestamp); } /* Should the content size information be set to zero when a duplicate is found? 
*/ public final static String ATTR_CHANGE_CONTENT_SIZE = "change-content-size"; { setChangeContentSize(false); } public boolean getChangeContentSize(){ return (Boolean)kp.get(ATTR_CHANGE_CONTENT_SIZE); } public void setChangeContentSize(boolean changeContentSize){ kp.put(ATTR_CHANGE_CONTENT_SIZE,changeContentSize); } /* Should statistics be tracked per host? **/ public final static String ATTR_STATS_PER_HOST = "stats-per-host"; { setStatsPerHost(false); } public boolean getStatsPerHost(){ return (Boolean)kp.get(ATTR_STATS_PER_HOST); } public void setStatsPerHost(boolean statsPerHost){ kp.put(ATTR_STATS_PER_HOST,statsPerHost); } /* Should we use sparse queries (uses less memory at a cost to performance? */ public final static String ATTR_USE_SPARSE_RANGE_FILTER = "use-sparse-range-filter"; { setUseSparseRengeFilter(false); } public boolean getUseSparseRengeFilter(){ return (Boolean)kp.get(ATTR_USE_SPARSE_RANGE_FILTER); } public void setUseSparseRengeFilter(boolean useSparseRengeFilter){ kp.put(ATTR_USE_SPARSE_RANGE_FILTER, useSparseRengeFilter); } /* How should 'origin' be handled */ public final static String ATTR_ORIGIN_HANDLING = "origin-handling"; public final static String ORIGIN_HANDLING_NONE = "No origin information"; public final static String ORIGIN_HANDLING_PROCESSOR = "Use processor setting"; public final static String ORIGIN_HANDLING_INDEX = "Use index information"; public final static List<String> AVAILABLE_ORIGIN_HANDLING = new ArrayList<String>(Arrays.asList(new String[]{ ORIGIN_HANDLING_NONE, ORIGIN_HANDLING_PROCESSOR, ORIGIN_HANDLING_INDEX })); public final static String DEFAULT_ORIGIN_HANDLING = ORIGIN_HANDLING_NONE; { setOriginHandling(DEFAULT_ORIGIN_HANDLING); } public String getOriginHandling() { return (String) kp.get(ATTR_ORIGIN); } public void setOriginHandling(String originHandling) { if (AVAILABLE_ORIGIN_HANDLING.contains(originHandling)) { kp.put(ATTR_ORIGIN_HANDLING,originHandling); } else { throw new IllegalArgumentException("Invalid origin handling: " + originHandling); } } // Spring configured access ot Heritrix resources // Gain access to the ServerCache for host based statistics. protected ServerCache serverCache; public ServerCache getServerCache() { return this.serverCache; } @Autowired public void setServerCache(ServerCache serverCache) { this.serverCache = serverCache; } // Member variables. protected IndexSearcher searcher = null; protected boolean lookupByURL = true; protected boolean statsPerHost = false; protected boolean useOrigin = false; protected boolean useOriginFromIndex = false; protected Statistics stats = null; protected HashMap<String, Statistics> perHostStats = null; public void afterPropertiesSet() throws Exception { // Index location String indexLocation = getIndexLocation(); try { searcher = new IndexSearcher(FSDirectory.open(new File(indexLocation))); } catch (Exception e) { throw new IllegalArgumentException("Unable to find/open index at " + indexLocation,e); } // Matching method String matchingMethod = getMatchingMethod(); lookupByURL = matchingMethod.equals(DEFAULT_MATCHING_METHOD); // Track per host stats statsPerHost = getStatsPerHost(); // Origin handling. 
String originHandling = getOriginHandling(); if(originHandling.equals(ORIGIN_HANDLING_NONE)==false){ useOrigin = true; if(originHandling.equals(ORIGIN_HANDLING_INDEX)){ useOriginFromIndex = true; } } // Initialize some internal variables: stats = new Statistics(); if (statsPerHost) { perHostStats = new HashMap<String, Statistics>(); } } @Override protected boolean shouldProcess(CrawlURI curi) { if (curi.isSuccess() == false) { // Early return. No point in doing comparison on failed downloads. logger.finest("Not handling " + curi.toString() + ", did not succeed."); return false; } if (curi.isPrerequisite()) { // Early return. Prerequisites are exempt from checking. logger.finest("Not handling " + curi.toString() + ", prerequisite."); return false; } if (curi.toString().startsWith("http")==false) { // Early return. Non-http documents are not handled at present logger.finest("Not handling " + curi.toString() + ", non-http."); return false; } if(curi.getContentType() == null){ // No content type means we can not handle it. logger.finest("Not handling " + curi.toString() + ", missing content (mime) type"); return false; } if(curi.getContentType().matches(getMimeFilter()) == getBlacklist()){ // Early return. Does not pass the mime filter logger.finest("Not handling " + curi.toString() + ", excluded by mimefilter (" + curi.getContentType() + ")."); return false; } if(curi.getData().containsKey(A_CONTENT_STATE_KEY) && ((Integer)curi.getData().get(A_CONTENT_STATE_KEY)).intValue()==CONTENT_UNCHANGED){ // Early return. A previous processor or filter has judged this // CrawlURI as having unchanged content. logger.finest("Not handling " + curi.toString() + ", already flagged as unchanged."); return false; } return true; } @Override protected void innerProcess(CrawlURI puri) { throw new AssertionError(); } @Override protected ProcessResult innerProcessResult(CrawlURI curi) throws InterruptedException { ProcessResult processResult = ProcessResult.PROCEED; // Default. Continue as normal logger.finest("Processing " + curi.toString() + "(" + curi.getContentType() + ")"); stats.handledNumber++; stats.totalAmount += curi.getContentSize(); Statistics currHostStats = null; if(statsPerHost){ synchronized (perHostStats) { String host = getServerCache().getHostFor(curi.getUURI()).getHostName(); currHostStats = perHostStats.get(host); if(currHostStats==null){ currHostStats = new Statistics(); perHostStats.put(host,currHostStats); } } currHostStats.handledNumber++; currHostStats.totalAmount += curi.getContentSize(); } Document duplicate = null; if(lookupByURL){ duplicate = lookupByURL(curi,currHostStats); } else { duplicate = lookupByDigest(curi,currHostStats); } if (duplicate != null){ // Perform tasks common to when a duplicate is found. // Increment statistics counters stats.duplicateAmount += curi.getContentSize(); stats.duplicateNumber++; if(statsPerHost){ currHostStats.duplicateAmount+=curi.getContentSize(); currHostStats.duplicateNumber++; } String jumpTo = getJumpTo(); // Duplicate. Skip part of processing chain? if(jumpTo!=null){ processResult = ProcessResult.jump(jumpTo); } // Record origin? String annotation = "duplicate"; if(useOrigin){ // TODO: Save origin in the CrawlURI so that other processors // can make use of it. (Future: WARC) if(useOriginFromIndex && duplicate.get(DigestIndexer.FIELD_ORIGIN)!=null){ // Index contains origin, use it. 
annotation += ":\"" + duplicate.get( DigestIndexer.FIELD_ORIGIN) + "\""; } else { String tmp = getOrigin(); // Check if an origin value is actually available if(tmp != null && tmp.trim().length() > 0){ // It is available, add it to the log line. annotation += ":\"" + tmp + "\""; } } } // Make note in log curi.getAnnotations().add(annotation); if(getChangeContentSize()){ // Set content size to zero, we are not planning to // 'write it to disk' // TODO: Reconsider this curi.setContentSize(0); } else { // A hack to have Heritrix count this as a duplicate. // TODO: Get gojomo to change how Heritrix decides CURIs are duplicates. int targetHistoryLength = 2; Map[] history = (HashMap[]) (curi.containsDataKey(A_FETCH_HISTORY) ? curi.getData().get(A_FETCH_HISTORY) : new HashMap[targetHistoryLength]); // Create space if(history.length != targetHistoryLength) { HashMap[] newHistory = new HashMap[targetHistoryLength]; System.arraycopy( history,0, newHistory,0, Math.min(history.length,newHistory.length)); history = newHistory; } // rotate all history entries up one slot except the newest // insert from index at [1] for(int i = history.length-1; i >1; i--) { history[i] = history[i-1]; } Map oldVisit = new HashMap(); oldVisit.put(A_CONTENT_DIGEST, curi.getContentDigestSchemeString()); history[1]=oldVisit; curi.getData().put(A_FETCH_HISTORY,history); } // Mark as duplicate for other processors curi.getData().put(A_CONTENT_STATE_KEY, CONTENT_UNCHANGED); } if(getAnalyzeTimestamp()){ doAnalysis(curi,currHostStats, duplicate!=null); } return processResult; } /** * Process a CrawlURI looking up in the index by URL * @param curi The CrawlURI to process * @param currHostStats A statistics object for the current host. * If per host statistics tracking is enabled this * must be non null and the method will increment * appropriate counters on it. * @return The result of the lookup (a Lucene document). If a duplicate is * not found null is returned. */ protected Document lookupByURL(CrawlURI curi, Statistics currHostStats){ // Look the CrawlURI's URL up in the index. try { Query query = queryField(DigestIndexer.FIELD_URL, curi.toString()); TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; Document doc = null; String currentDigest = getDigestAsString(curi); if(hits != null && hits.length > 0){ // Typically there should only be one hit, but we'll allow for // multiple hits. for(int i=0 ; i<hits.length ; i++){ // Multiple hits on same exact URL should be rare // See if any have matching content digests doc = searcher.doc(hits[i].doc); String oldDigest = doc.get(DigestIndexer.FIELD_DIGEST); if(oldDigest.equalsIgnoreCase(currentDigest)){ stats.exactURLDuplicates++; if(statsPerHost){ currHostStats.exactURLDuplicates++; } logger.finest("Found exact match for " + curi.toString()); // If we found a hit, no need to look at other hits. return doc; } } } if(getTryEquivalent()) { // No exact hits. Let's try lenient matching. 
String normalizedURL = DigestIndexer.stripURL(curi.toString()); query = queryField(DigestIndexer.FIELD_URL_NORMALIZED, normalizedURL); collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); hits = collector.topDocs().scoreDocs; for(int i=0 ; i<hits.length ; i++){ doc = searcher.doc(hits[i].doc); String indexDigest = doc.get(DigestIndexer.FIELD_DIGEST); if(indexDigest.equals(currentDigest)){ // Make note in log String equivURL = doc.get( DigestIndexer.FIELD_URL); curi.getAnnotations().add("equivalentURL:\"" + equivURL + "\""); // Increment statistics counters stats.equivalentURLDuplicates++; if(statsPerHost){ currHostStats.equivalentURLDuplicates++; } logger.finest("Found equivalent match for " + curi.toString() + ". Normalized: " + normalizedURL + ". Equivalent to: " + equivURL); //If we found a hit, no need to look at more. return doc; } } } } catch (IOException e) { logger.log(Level.SEVERE,"Error accessing index.",e); } // If we make it here then this is not a duplicate. return null; } /** * Process a CrawlURI looking up in the index by content digest * @param curi The CrawlURI to process * @param currHostStats A statistics object for the current host. * If per host statistics tracking is enabled this * must be non null and the method will increment * appropriate counters on it. * @return The result of the lookup (a Lucene document). If a duplicate is * not found null is returned. */ protected Document lookupByDigest(CrawlURI curi, Statistics currHostStats) { Document duplicate = null; String currentDigest = null; Object digest = curi.getContentDigest(); if (digest != null) { currentDigest = Base32.encode((byte[])digest); } Query query = queryField(DigestIndexer.FIELD_DIGEST, currentDigest); try { TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; StringBuffer mirrors = new StringBuffer(); mirrors.append("mirrors: "); String url = curi.toString(); String normalizedURL = getTryEquivalent() ? DigestIndexer.stripURL(url) : null; if(hits != null && hits.length > 0){ // Can definitely be more then one // Note: We may find an equivalent match before we find an // (existing) exact match. // TODO: Ensure that an exact match is recorded if it exists. for(int i=0 ; i<hits.length && duplicate==null ; i++){ Document doc = searcher.doc(hits[i].doc); String indexURL = doc.get(DigestIndexer.FIELD_URL); // See if the current hit is an exact match. if(url.equals(indexURL)){ duplicate = doc; stats.exactURLDuplicates++; if(statsPerHost){ currHostStats.exactURLDuplicates++; } logger.finest("Found exact match for " + curi.toString()); } // If not, then check if it is an equivalent match (if // equivalent matches are allowed). if(duplicate == null && getTryEquivalent()){ String indexNormalizedURL = doc.get(DigestIndexer.FIELD_URL_NORMALIZED); if(normalizedURL.equals(indexNormalizedURL)){ duplicate = doc; stats.equivalentURLDuplicates++; if(statsPerHost){ currHostStats.equivalentURLDuplicates++; } curi.getAnnotations().add("equivalentURL:\"" + indexURL + "\""); logger.finest("Found equivalent match for " + curi.toString() + ". Normalized: " + normalizedURL + ". Equivalent to: " + indexURL); } } if(duplicate == null){ // Will only be used if no exact (or equivalent) match // is found. 
mirrors.append(indexURL + " "); } } if(duplicate == null){ stats.mirrorNumber++; if (statsPerHost) { currHostStats.mirrorNumber++; } logger.log(Level.FINEST,"Found mirror URLs for " + curi.toString() + ". " + mirrors); } } } catch (IOException e) { logger.log(Level.SEVERE,"Error accessing index.",e); } return duplicate; } public String report() { StringBuffer ret = new StringBuffer(); ret.append("Processor: is.hi.bok.digest.DeDuplicator\n"); ret.append(" Function: Abort processing of duplicate records\n"); ret.append(" - Lookup by " + (lookupByURL?"url":"digest") + " in use\n"); ret.append(" Total handled: " + stats.handledNumber + "\n"); ret.append(" Duplicates found: " + stats.duplicateNumber + " " + getPercentage(stats.duplicateNumber,stats.handledNumber) + "\n"); ret.append(" Bytes total: " + stats.totalAmount + " (" + ArchiveUtils.formatBytesForDisplay(stats.totalAmount) + ")\n"); ret.append(" Bytes discarded: " + stats.duplicateAmount + " (" + ArchiveUtils.formatBytesForDisplay(stats.duplicateAmount) + ") " + getPercentage(stats.duplicateAmount, stats.totalAmount) + "\n"); ret.append(" New (no hits): " + (stats.handledNumber- (stats.mirrorNumber+stats.exactURLDuplicates+stats.equivalentURLDuplicates)) + "\n"); ret.append(" Exact hits: " + stats.exactURLDuplicates + "\n"); ret.append(" Equivalent hits: " + stats.equivalentURLDuplicates + "\n"); if(lookupByURL==false){ ret.append(" Mirror hits: " + stats.mirrorNumber + "\n"); } if(getAnalyzeTimestamp()){ ret.append("... [truncated message content] |
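Editor's note: the lookupByURL code above captures the core of the processor: query the Lucene index on the URL field, then compare the stored content digest against the digest of the document just fetched. The stand-alone sketch below illustrates that lookup pattern against an index built by DigestIndexer, written for the Lucene 3.0.2 API the project's pom declares. It is an approximation, not the production code path: the processor builds its queries through an internal helper (queryField, lost to the truncation above), so the plain TermQuery and the assumption that the URL field is stored untokenized are editorial assumptions.

import java.io.File;
import java.io.IOException;

import is.landsbokasafn.deduplicator.DigestIndexer;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;

/**
 * Minimal sketch: is the given URL, with the given content digest, a
 * duplicate according to a DeDuplicator index? Not the production code
 * path -- see DeDuplicator.lookupByURL above for the real logic.
 */
public class IndexLookupSketch {
    public static boolean isDuplicate(String indexDir, String url, String digest)
            throws IOException {
        IndexSearcher searcher =
                new IndexSearcher(FSDirectory.open(new File(indexDir)));
        try {
            // Plain term query on the URL field; the processor itself goes
            // through an internal query helper, so this is only an approximation.
            TermQuery query = new TermQuery(new Term(DigestIndexer.FIELD_URL, url));
            TopScoreDocCollector collector = TopScoreDocCollector.create(10, false);
            searcher.search(query, collector);
            for (ScoreDoc hit : collector.topDocs().scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                // Same URL seen before with the same digest => unchanged content.
                if (digest.equalsIgnoreCase(doc.get(DigestIndexer.FIELD_DIGEST))) {
                    return true;
                }
            }
            return false;
        } finally {
            searcher.close();
        }
    }
}

Equivalent-URL and digest-first matching follow the same shape, only keyed on DigestIndexer.FIELD_URL_NORMALIZED and DigestIndexer.FIELD_DIGEST, as the lookupByDigest method in the message above shows.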
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:20
Update of /cvsroot/deduplicator/deduplicator3/src/main/conf/jobs/profile-deduplicator/.svn In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840/src/main/conf/jobs/profile-deduplicator/.svn Added Files: all-wcprops entries Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: all-wcprops --- K 25 svn:wc:ra_dav:version-url V 95 /svnroot/archive-crawler/!svn/ver/6909/trunk/heritrix3/dist/src/main/conf/jobs/profile-defaults END profile-crawler-beans.cxml K 25 svn:wc:ra_dav:version-url V 122 /svnroot/archive-crawler/!svn/ver/6909/trunk/heritrix3/dist/src/main/conf/jobs/profile-defaults/profile-crawler-beans.cxml END --- NEW FILE: entries --- 10 dir 6911 https://kri...@ar.../svnroot/archive-crawler/trunk/heritrix3/dist/src/main/conf/jobs/profile-defaults https://kri...@ar.../svnroot/archive-crawler 2010-07-02T00:58:23.790893Z 6909 gojomo svn:special svn:externals svn:needs-lock daa5b2f2-a927-0410-8b2d-f5f262fa301a profile-crawler-beans.cxml file 2010-07-06T11:17:37.858000Z 1a97a3dd7c73e8edbe45a1c464218ccb 2010-07-02T00:58:23.790893Z 6909 gojomo 23397 |
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:20
Update of /cvsroot/deduplicator/deduplicator3/src/test/java/is/landsbokasafn/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840/src/test/java/is/landsbokasafn/deduplicator Added Files: CrawlLogIteratorTest.java DeDuplicatorTest.java Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: CrawlLogIteratorTest.java --- package is.landsbokasafn.deduplicator; import java.io.File; import java.io.IOException; import junit.framework.TestCase; public class CrawlLogIteratorTest extends TestCase { public void testParseLine() throws IOException{ File testFile = new File("test"); testFile.createNewFile(); CrawlLogIterator cli = new CrawlLogIterator("test"); String lineValidWithoutAnnotation = "2006-10-17T14:22:29.343Z 200 29764 http://www.bok.hi.is/image.gif E http://www.bok.hi.is/ image/gif #008 20061017142229253+74 YA3G7O6TNMHXA5WWDSIZJDNXV56WDRCA - -"; String lineValidWithoutOrigin = "2006-10-17T14:22:29.391Z 200 7951 http://www.bok.hi.is/ X http://bok.hi.is/ text/html #029 20061017142228950+364 SBRY3NIKXYAIKSCJ5QL2F6AE4GG7P6VR - 3t"; String lineValidWithOrigin = "2006-10-17T14:22:29.399Z 200 18803 http://www.bok.hi.is/ X http://bok.hi.is/ text/html #041 20061017142229087+180 OHCVML7NJ4STPQSRRWY7WWJL6T5H2R6L - duplicate:\"ORIGIN\",3t"; String lineTruncated = "2006-10-17T14:22:29.399Z 200 18803 http://www.bok.hi.is/ X http://bok.hi."; String lineValidWithDigestPrefix = "2006-10-17T14:22:29.343Z 200 29764 http://www.bok.hi.is/image.gif E http://www.bok.hi.is/ image/gif #008 20061017142229253+74 sha1:YA3G7O6TNMHXA5WWDSIZJDNXV56WDRCA - -"; CrawlDataItem tmp = cli.parseLine(lineValidWithoutAnnotation); assertNotNull(tmp); tmp = cli.parseLine(lineValidWithoutOrigin); assertNotNull(tmp); assertNull(tmp.getOrigin()); tmp = cli.parseLine(lineValidWithOrigin); assertNotNull(tmp); assertEquals("ORIGIN", tmp.getOrigin()); tmp = cli.parseLine(lineTruncated); assertNull(tmp); tmp = cli.parseLine(lineValidWithDigestPrefix); assertEquals("YA3G7O6TNMHXA5WWDSIZJDNXV56WDRCA", tmp.getContentDigest()); cli.close(); testFile.delete(); //Cleanup } } --- NEW FILE: DeDuplicatorTest.java --- package is.landsbokasafn.deduplicator; import junit.framework.TestCase; public class DeDuplicatorTest extends TestCase { public void testGetPercentage() throws Exception{ assertEquals("2.5%",DeDuplicator.getPercentage(5,200)); } } |
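Editor's note: CrawlLogIteratorTest above exercises parseLine on individual crawl.log lines; in normal use the class is consumed through the CrawlDataIterator contract (hasNext, next, close) checked in with the same batch. The sketch below walks a crawl.log and prints the fields the indexer cares about. The log path is a placeholder, the iterator is assumed to be publicly constructible from a path (as the test suggests), and DigestIndexer, not this loop, is what normally consumes the stream.

import java.io.IOException;

import is.landsbokasafn.deduplicator.CrawlDataItem;
import is.landsbokasafn.deduplicator.CrawlLogIterator;

/**
 * Minimal sketch of the CrawlDataIterator contract: iterate a Heritrix
 * crawl.log and print the fields DigestIndexer would index.
 */
public class CrawlLogDumpSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder location -- point this at a real Heritrix crawl.log.
        CrawlLogIterator it = new CrawlLogIterator("/path/to/crawl.log");
        try {
            while (it.hasNext()) {
                CrawlDataItem item = it.next();
                if (item == null) {
                    // Per the CrawlDataIterator contract, null means no further items.
                    break;
                }
                System.out.println(item.getTimestamp() + " "
                        + item.getContentDigest() + " " + item.getURL());
            }
        } finally {
            it.close();
        }
    }
}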
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:20
Update of /cvsroot/deduplicator/deduplicator3 In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840 Added Files: .project pom.xml .cvsignore .classpath Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: .cvsignore --- target --- NEW FILE: .project --- <?xml version="1.0" encoding="UTF-8"?> <projectDescription> <name>DeDuplicator3</name> <comment></comment> <projects> </projects> <buildSpec> <buildCommand> <name>org.eclipse.jdt.core.javabuilder</name> <arguments> </arguments> </buildCommand> <buildCommand> <name>org.maven.ide.eclipse.maven2Builder</name> <arguments> </arguments> </buildCommand> </buildSpec> <natures> <nature>org.eclipse.jdt.core.javanature</nature> <nature>org.maven.ide.eclipse.maven2Nature</nature> </natures> </projectDescription> --- NEW FILE: pom.xml --- <?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>is.landsbokasafn</groupId> <artifactId>deduplicator</artifactId> <name>DeDuplicator3 (Heritrix 3 add-on module)</name> <version>3.0.0-SNAPSHOT</version> <description> An add-on module for the web crawler Heritrix 3 that offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls. </description> <url>http://deduplicator.sourceforge.net/</url> <issueManagement> <system>SourceForge Trackers</system> <url>http://sourceforge.net/tracker/?group_id=181565</url> </issueManagement> <mailingLists> <mailingList> <name> Crawler Discussion List (General Heritrix Discussion) </name> <subscribe> mailto:arc...@ya... </subscribe> <unsubscribe> mailto:arc...@ya... 
</unsubscribe> <post>mailto:arc...@ya...</post> <archive> http://groups.yahoo.com/group/archive-crawler/ </archive> </mailingList> <mailingList> <name>DeDuplicator CVS Commits</name> <subscribe> http://lists.sourceforge.net/lists/listinfo/deduplicator-cvs </subscribe> <unsubscribe> http://lists.sourceforge.net/lists/listinfo/deduplicator-cvs </unsubscribe> <archive> http://sourceforge.net/mailarchive/forum.php?forum=deduplicator-cvs </archive> </mailingList> </mailingLists> <developers> <developer> <id>Kristinn</id> <name>Kristinn Sigurðsson</name> <email>kristsi at bok.hi.is</email> <organization> National and University Library of Iceland </organization> <roles> <role>Developer</role> </roles> <timezone>+0</timezone> </developer> </developers> <contributors> <contributor> <name>Lars Clausen</name> <email>lc at statsbiblioteket.dk</email> <organization>Netarkivet.dk</organization> <timezone>+1</timezone> </contributor> <contributor> <name>Maximilian Schoefmann</name> <email>schoefma at cip.ifi.lmu.de</email> </contributor> <contributor> <name>Kare Fiedler Christiansen</name> <email>kfc at statsbiblioteket.dk</email> <organization>Netarkivet.dk</organization> <timezone>+1</timezone> </contributor> </contributors> <scm> <connection> scm:cvs:pserver:anonymous:@deduplicator.cvs.sourceforge.net:/cvsroot/deduplicator:deduplicator3 </connection> <developerConnection> scm:cvs:ext:dev...@de...:/cvsroot/deduplicator:deduplicator3 </developerConnection> <url> http://deduplicator.cvs.sourceforge.net/deduplicator/deduplicator3/ </url> </scm> <organization> <name>National and University Library of Iceland</name> <url>http://www.landsbokasafn.is</url> </organization> <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> </properties> <build> <finalName>${project.artifactId}-${project.version}-${buildNumber}</finalName> <plugins> <!-- this is a java 1.6 project --> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-compiler-plugin</artifactId> <configuration> <source>1.6</source> <target>1.6</target> <encoding>UTF-8</encoding> </configuration> </plugin> <plugin> <groupId>org.codehaus.mojo</groupId> <artifactId>buildnumber-maven-plugin</artifactId> <version>1.0-beta-1</version> <executions> <execution> <phase>validate</phase> <goals> <goal>create</goal> </goals> </execution> </executions> <configuration> <format>{0,date,yyyyMMdd}</format> <items> <item>timestamp</item> </items> <doCheck>false</doCheck> <doUpdate>false</doUpdate> </configuration> </plugin> <plugin> <artifactId>maven-assembly-plugin</artifactId> <configuration> <descriptors> <descriptor> src/main/assembly/dist.xml </descriptor> <descriptor> src/main/assembly/src.xml </descriptor> </descriptors> <finalName>${project.artifactId}-${project.version}-${buildNumber}</finalName> </configuration> <executions> <execution> <id>make-assembly</id> <phase>package</phase> <goals> <goal>single</goal> </goals> </execution> </executions> </plugin> <plugin> <artifactId>maven-site-plugin</artifactId> <configuration> <locales>en</locales> </configuration> </plugin> </plugins> </build> <dependencies> <dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>3.8.1</version> <scope>test</scope> </dependency> <dependency> <groupId>org.archive.heritrix</groupId> <artifactId>heritrix-commons</artifactId> <version>3.0.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.archive.heritrix</groupId> <artifactId>heritrix-modules</artifactId> <version>3.0.0</version> 
<scope>provided</scope> </dependency> <dependency> <groupId>org.archive.heritrix</groupId> <artifactId>heritrix-engine</artifactId> <version>3.0.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> <version>3.0.2</version> </dependency> </dependencies> <reporting> <plugins> <plugin> <artifactId>maven-javadoc-plugin</artifactId> </plugin> <plugin> <groupId>org.codehaus.mojo</groupId> <artifactId>jxr-maven-plugin</artifactId> <configuration> <overview> ${basedir}/src/main/java/is/landsbokasafn/deduplicator/overview.html </overview> <version>true</version> </configuration> </plugin> </plugins> </reporting> <distributionManagement> <site> <id>website</id> <url> scp://deduplicator.sourceforge.net/home/groups/d/de/deduplicator/htdocs/ </url> </site> </distributionManagement> </project> --- NEW FILE: .classpath --- <?xml version="1.0" encoding="UTF-8"?> <classpath> <classpathentry kind="src" output="target/classes" path="src/main/java"/> <classpathentry kind="src" output="target/test-classes" path="src/test/java"/> <classpathentry kind="con" path="org.maven.ide.eclipse.MAVEN2_CLASSPATH_CONTAINER"/> <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/> <classpathentry kind="output" path="target/classes"/> </classpath> |