deduplicator-cvs Mailing List for DeDuplicator (Heritrix add-on)
Brought to you by: kristinn_sig
Archive of messages by month (message counts in parentheses):

Year | Jan | Feb | Mar | Apr | May | Jun | Jul  | Aug | Sep | Oct | Nov  | Dec
-----|-----|-----|-----|-----|-----|-----|------|-----|-----|-----|------|-----
2006 |     |     |     |     |     |     |      |     |     |     | (14) | (4)
2007 |     |     |     |     | (2) | (3) |      |     | (1) |     |      |
2008 | (2) |     |     |     | (8) |     | (14) |     |     |     |      |
2009 |     |     |     |     | (3) | (1) |      |     |     |     |      |
2010 |     |     |     |     |     |     | (60) |     |     |     |      |
From: Kristinn S. <kri...@us...> - 2010-07-29 16:49:39
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator
In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv30877/src/main/java/is/landsbokasafn/deduplicator

Modified Files:
	CrawlLogIterator.java

Log Message:
Missing size in log is now handled correctly by omitting the relevant URL. Missing size is always an indication that the visit failed.

Index: CrawlLogIterator.java
===================================================================
RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/CrawlLogIterator.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** CrawlLogIterator.java	27 Jul 2010 09:09:46 -0000	1.2
--- CrawlLogIterator.java	29 Jul 2010 16:49:31 -0000	1.3
***************
*** 165,172 ****
          // Index 2: File size
          long size = -1;
          try {
              size = Long.parseLong(lineParts[2]);
          } catch (NumberFormatException e) {
!             System.err.println("Error parsing size for: " + line);
          }
--- 165,177 ----
          // Index 2: File size
          long size = -1;
+         if (lineParts[2].equals("-")) {
+             // If size is missing then this URL was not successfully visited. Skip in index
+             return null;
+         }
          try {
              size = Long.parseLong(lineParts[2]);
          } catch (NumberFormatException e) {
!             System.err.println("Error parsing size for: " + line +
!                 " Item: " + lineParts[2] + " Message: " + e.getMessage());
          }
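The revision 1.3 logic above is small enough to restate on its own. Below is a minimal, self-contained sketch of the same size-field handling, intended only as an illustration: the class name, the whitespace splitting, and the sample log lines are assumptions made for the example, while the index-2 size field, the "-" convention, and the error message wording come from the diff.

    import java.util.regex.Pattern;

    /**
     * Illustrative sketch (not project code) of the size-field handling added in
     * CrawlLogIterator revision 1.3: a "-" in the size column of a Heritrix
     * crawl.log marks a failed visit, so the entry is skipped rather than indexed.
     */
    public class SizeFieldSketch {
        private static final Pattern WHITESPACE = Pattern.compile("\\s+");

        /** Returns the parsed size, or null if the entry should be skipped. */
        static Long parseSize(String crawlLogLine) {
            String[] lineParts = WHITESPACE.split(crawlLogLine.trim());
            if (lineParts.length < 3) {
                return null; // malformed line, nothing usable to index
            }
            String sizeField = lineParts[2]; // index 2: file size
            if (sizeField.equals("-")) {
                // Missing size: the URL was not successfully visited. Skip it.
                return null;
            }
            try {
                return Long.parseLong(sizeField);
            } catch (NumberFormatException e) {
                System.err.println("Error parsing size for: " + crawlLogLine
                        + " Item: " + sizeField + " Message: " + e.getMessage());
                return null;
            }
        }

        public static void main(String[] args) {
            // Invented example lines: timestamp, status code, size, URL, ...
            System.out.println(parseSize("2010-07-29T16:49:31.000Z 200 5120 http://example.org/"));
            System.out.println(parseSize("2010-07-29T16:49:31.000Z -404 - http://example.org/missing"));
        }
    }

Run as-is, the first call prints 5120 and the second prints null, mirroring how the patched iterator drops failed visits instead of reporting a parse error for "-".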
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:55
Update of /cvsroot/deduplicator/deduplicator3/src/site/apt
In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site/apt

Added Files:
	release.apt release3.apt format.apt started.apt license.apt index.apt

Log Message:
Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer.

--- NEW FILE: started.apt ---

Getting Started
~~~~~~~~~~~~~~~

* Building an index
~~~~~~~~~~~~~~~~~~~

 [[1]] A functional installation of Heritrix is required for this software to work. While Heritrix can be deployed on non-Linux operating systems, doing so requires some degree of work, as the bundled scripts are written for Linux. The same applies to this software, and the following instructions assume that Heritrix is installed on a Linux machine under <<<$HERITRIX_HOME>>>.

 [[2]] Install the DeDuplicator software. The jar files should be placed in <<<$HERITRIX_HOME/lib/>>> and the dedupdigest script should be added to <<<$HERITRIX_HOME/bin/>>>. If you've downloaded a .tar.gz (.zip) bundle, explode it into <<<$HERITRIX_HOME>>> and all the files will be correctly deployed.

 [[3]] Make the dedupdigest script executable with <<<chmod u+x $HERITRIX_HOME/bin/dedupdigest>>>.

 [[4]] Run <<<$HERITRIX_HOME/bin/dedupdigest --help>>>. This displays the usage information for the indexing. The program takes two arguments: the source data (usually a crawl.log) and the target directory where the index will be written (created if not present). Several options are provided to tailor the type of index.

 [[5]] Create an index. A typical index can be built with <<<$HERITRIX_HOME/bin/dedupdigest -o URL -s -t <location of crawl.log> <index output directory>>>>. This creates an index that is indexed by URL only (not by the content digest) and includes equivalent URLs and timestamps.

* Using the index
~~~~~~~~~~~~~~~~~

 [[1]] Having built an appropriate index, launch Heritrix. If it is not the same installation used for creating the index, make sure the installation you launch has the two JARs that come with the DeDuplicator (deduplicator-[version].jar and lucene-[version].jar).

 [[2]] Configure a crawl job as normal, except add the DeDuplicator processor to the processing chain at some point <<after>> the HTTPFetcher processor and prior to any processor that should be skipped when a duplicate is detected. When the DeDuplicator finds a duplicate, processing moves straight to the PostProcessing chain. So if you insert it at the top of the Extractor chain, you can skip both link extraction and writing to disk. If you do not wish to skip link extraction, you can insert the processor at the end of the link extraction chain, and so on.

 [[3]] The DeDuplicator processor has several configurable parameters:

  * <<enabled>> Standard Heritrix property for processors. Should be true; setting it to false disables the processor.

  * <<index-location>> The most important setting. A full path to the directory that contains the index (the output directory of the indexing).

  * <<matching-method>> Whether to look up URLs or content digests first when looking for matches. This setting depends on how the index was built (indexing mode). If that was set to BOTH, then either setting will work; otherwise it must be set according to the indexing mode.

  * <<try-equivalent>> Should equivalent URLs be tried if an exact URL and content digest match is not found.
    Using equivalent matches means that duplicate documents whose URLs differ only in the parameter list, or because of www[0-9]* prefixes, are detected.

  * <<mime-filter>> Which documents to process.

  * <<filter-mode>>

  * <<analysis-mode>> Enables analysis of the usefulness and accuracy of header information in predicting change and non-change in documents. For statistical gathering purposes only.

  * <<log-level>> Enables more logging.

  * <<stats-per-host>> Maintains statistics per host in addition to the crawl-wide stats.

 [[4]] Once the processor has been configured, the crawl can be started and run normally. Information about the processor is available via the Processor report in the Heritrix GUI (this is saved to processors-report.txt at the end of a crawl). Duplicate URLs will still show up in the crawl log, but with the note 'duplicate' in the annotation field at the end of the log line.

--- NEW FILE: index.apt ---

The DeDuplicator (Heritrix add-on module)

* Release information
~~~~~~~~~~~~~~~~~~~~~

 Current stable release is {{{release.html#0.4.0}0.4.0}}.

 All releases, including interim (potentially unstable) releases, can be found here: {{{release.html}Release History of DeDuplicator for Heritrix 1}} and here: {{{release3.html}Release History of DeDuplicator for Heritrix 3}}.

* News
~~~~~~

** DeDuplicator for Heritrix 3 - 23/07/2010
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Version 3.0.0-SNAPSHOT-20100727 is now available {{{release3.html}here}}. This version is compiled against Heritrix 3.0.0. It also updates to Lucene 3.0.2 (from 2.0.0). Please note that changes in the Lucene library mean that memory usage will be approximately 40% greater than before; memory usage appears to be approximately 5 bytes per URL in the index, compared to 3.6 bytes per URL previously. Query times have, however, improved significantly and are now effectively constant, regardless of index size. For large indexes this can mean 10-30 times shorter query times. Building indexes is also much faster (approximately 3-4 times as fast). Currently the DeDupFetchHTTP processor has not been converted.

 This release heralds the end of the existing DeDuplicator, built against Heritrix 1.14. One final release (1.0.0) will be made soon with some accumulated bugfixes. A release candidate is available {{{release.html}here}}.

** Version 0.4.0 released / Future plans - 15/07/2008
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Version 0.4.0 includes numerous tweaks and patches introduced since 0.2.0. Notable changes:

  * Support for the changed crawl.log format that Heritrix introduced in 1.12.0.

  * Improved memory usage for large indexes.

  * Can now exclude duplicate URIs from a new index.

  * Various bug fixes.

 This will be the last version of the DeDuplicator that is built against Heritrix 1.10.0. Building against that version of Heritrix has made the DeDuplicator compatible with almost all 1.x versions of Heritrix. Note though that 0.4.0 is built with Java 1.5, unlike 0.2.0, which was built with Java 1.4.2.

 In version 1.12.0 Heritrix added some useful features that the DeDuplicator should make use of, most notably marking content as 'not novel' (i.e. duplicate). Also, 1.14.0 has rudimentary WARC support, and the aim is to have the DeDuplicator support writing to WARC files. Therefore, any future versions will be built against Heritrix 1.14.0. Support for Heritrix 2.0 is planned, but there is no set timeframe for it.
 This requires considerable changes to the DeDuplicator and will likely not be implemented until Heritrix 2.x is sufficiently mature that it is used routinely instead of 1.x for large-scale production crawls.

** Support for Heritrix 1.12.0 - 1/06/2007
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 A new interim release has been uploaded to deal with the changed crawl.log format in Heritrix 1.12.0. 0.4.0 will be the final release for Heritrix up to version 1.12.0 and should be released soon. Heritrix version 2.0.0, currently in development, will greatly change Heritrix's API and so will require significant changes to the DeDuplicator. Look for the first interim release built against the new Heritrix API as soon as the changes are moved into the trunk of the Heritrix project, probably sometime this month.

** Moved to Sourceforge.net - 7/11/2006
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The project has now been moved to Sourceforge.net. The code has been moved to SF's CVS and anonymous access is now possible. The initial commit was of version 0.2.0. Change history prior to 0.2.0 will be discarded, except that we will keep the packaged PreRelease versions that were made. Along with the public CVS, SourceForge also provides {{{http://sourceforge.net/tracker/?group_id=181565}bug and RFE trackers}}. The project website has also been moved to {{http://deduplicator.sourceforge.net/}}. Stable releases will now be distributed via SourceForge, while interim builds will continue to be made available on the {{{release.html}Release History}} page (until continuous integration is set up).

** Managing duplicates across sequential crawls - 31/10/2006
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 On September 21, Kristinn Sigurðsson presented a paper on the DeDuplicator titled 'Managing duplicates across sequential crawls' at the {{{http://iwaw.net/06}6th International Web Archiving Workshop}}, held in conjunction with the {{{http://ecdl2006.org}10th ECDL}} in Alicante, Spain. The paper is available in the {{{http://www.iwaw.net/06/PDF/iwaw06-proceedings.pdf}Workshop Proceedings}} and can also be downloaded by itself directly from here: {{{http://vefsofnun.bok.hi.is/upload/3/ManagingDuplicatesAcrossSequentialCrawls.pdf}Managing duplicates across sequential crawls}}.

--- NEW FILE: release.apt ---

Release History - DeDuplicator for Heritrix 1
~~~~~~~~~~~~~~~

 The following is a list of releases of the DeDuplicator for Heritrix 1. For Heritrix 3 see {{{release3.html}here}}. Stable releases are clearly labeled as such and can be downloaded via our {{{https://sourceforge.net/project/showfiles.php?group_id=181565}SourceForge download page}}. Any other release may contain unstable/untested elements. They are provided since a continuous build process is not currently available.

 The most recent stable release is {{{release.html#0.4.0}0.4.0}}.

* {1.0.0-RC1} (Release candidate for 1.0.0)
~~~~~~~~~~~~~~~~~~

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-1.0.0-RC1-bin.tar.gz?download}deduplicator-1.0.0-RC1-bin.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-1.0.0-RC1-bin.zip?download}deduplicator-1.0.0-RC1-bin.zip}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-1.0.0-RC1-src.tar.gz?download}deduplicator-1.0.0-RC1-src.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-1.0.0-RC1-src.zip?download}deduplicator-1.0.0-RC1-src.zip}}

 Incorporates a few minor bugfixes from version 0.4.0.
 Namely, it fixes a bug when doing matches by digest and a bug in how it hooked into the Heritrix 1.12 (and up) way of marking content as duplicate.

* {0.4.0} (Stable)
~~~~~~~~~~~~~~~~~~

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.4.0-bin.tar.gz?download}deduplicator-0.4.0-bin.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.4.0-bin.zip?download}deduplicator-0.4.0-bin.zip}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.4.0-src.tar.gz?download}deduplicator-0.4.0-src.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.4.0-src.zip?download}deduplicator-0.4.0-src.zip}}

 Incorporates the numerous tweaks and patches introduced since 0.2.0 (see the comments for the interim builds below for details). Tested in production crawls. No known issues. This is the last version to be built against Heritrix 1.10.0 (fully compatible with any Heritrix version from 1.6-1.14, but not 2.x). Unlike 0.2.0, it is built with Java 1.5, not 1.4.2.

* 0.3.0-20080527
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080527-bin.tar.gz}deduplicator-0.3.0-20080527-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080527-bin.zip}deduplicator-0.3.0-20080527-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080527-src.tar.gz}deduplicator-0.3.0-20080527-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080527-src.zip}deduplicator-0.3.0-20080527-src.zip}}

 Applied patches from Kåre Fiedler Christiansen. It includes a new 'SparseRangeFilter' that now optionally replaces Lucene's RangeFilter when making queries. This reduces memory usage at a cost to performance. A minor NPE bugfix patch was also included. The module is now compiled against Java 1.5. Some 1.5-specific changes have been made, mostly using generics, along with some cleanup of warnings. This version is largely untested!

* 0.3.0-20080129
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080129-bin.tar.gz}deduplicator-0.3.0-20080129-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080129-bin.zip}deduplicator-0.3.0-20080129-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080129-src.tar.gz}deduplicator-0.3.0-20080129-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20080129-src.zip}deduplicator-0.3.0-20080129-src.zip}}

 Applied a patch from Lars Clausen. The issue, as explained by Lars: "We've run across a scaling issue in the use of TermQuery for Lucene indexes of 400+ million entries. TermQuery uses norms, which spends one byte of memory per entry in the index. Even turning off norms on the index doesn't help, since TermQuery in the most friendly way creates a fake array of norms. This patch changes the deduplicator to use a ConstantScoreQuery with a RangeFilter, which avoids most of the memory usage and doesn't seem to affect the speed." The patch is untested at this time. (A short sketch of the ConstantScoreQuery/RangeFilter approach follows this entry.)
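Lars's description translates almost directly into Lucene calls, so a brief illustration may help. The sketch below targets the Lucene 2.x API that the Heritrix 1 DeDuplicator used at the time; the field names "url" and "digest" and the searcher setup are assumptions made for this example and are not taken from the project's DigestIndexer or DeDuplicator sources.

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    /**
     * Illustrative sketch of the two lookup strategies discussed above,
     * written against the Lucene 2.x API. Field names are assumptions.
     */
    public class LookupSketch {

        // TermQuery: straightforward, but scoring touches norms (roughly one
        // byte per indexed document), which is what hurt on 400+ million
        // entry indexes.
        static Query termLookup(String url) {
            return new TermQuery(new Term("url", url));
        }

        // ConstantScoreQuery over a RangeFilter whose bounds are both the
        // sought term: matches the same single entry without allocating
        // norms, trading a little speed for a large memory saving.
        static Query rangeLookup(String url) {
            return new ConstantScoreQuery(new RangeFilter("url", url, url, true, true));
        }

        static String digestFor(IndexSearcher searcher, String url) throws IOException {
            TopDocs hits = searcher.search(rangeLookup(url), null, 1);
            if (hits.totalHits == 0) {
                return null; // URL not present in the index of the previous crawl
            }
            Document doc = searcher.doc(hits.scoreDocs[0].doc);
            return doc.get("digest"); // content digest recorded for this URL
        }
    }

The design trade-off is the one Lars names: the constant-score form gives up TermQuery's scoring (which the DeDuplicator does not need for exact lookups) in exchange for not materializing a norms array sized to the whole index.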
* 0.3.0-20070601
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20070601-bin.tar.gz}deduplicator-0.3.0-20070601-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20070601-bin.zip}deduplicator-0.3.0-20070601-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20070601-src.tar.gz}deduplicator-0.3.0-20070601-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20070601-src.zip}deduplicator-0.3.0-20070601-src.zip}}

 While still compiled against version 1.10.0 of Heritrix, this release now handles the changed crawl.log format of Heritrix 1.12.0, which prefixes the content digest with the name of the scheme (a short sketch of handling this prefix follows the 0.2.0 entry below).

* 0.3.0-20061218
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061218-bin.tar.gz}deduplicator-0.3.0-20061218-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061218-bin.zip}deduplicator-0.3.0-20061218-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061218-src.tar.gz}deduplicator-0.3.0-20061218-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061218-src.zip}deduplicator-0.3.0-20061218-src.zip}}

 Added (patch by Maximilian Schoefmann) the ability to exclude URLs marked as duplicates in the crawl.log from the index.

* 0.3.0-20061031
~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061031-bin.tar.gz}deduplicator-0.3.0-20061031-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061031-bin.zip}deduplicator-0.3.0-20061031-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061031-src.tar.gz}deduplicator-0.3.0-20061031-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-0.3.0-20061031-src.zip}deduplicator-0.3.0-20061031-src.zip}}

 Fixed a bug (reported by Lars Clausen) in CrawlLogIterator where malformed lines in the crawl.log would cause an exception. This is now handled gracefully. Also added unit tests for CrawlLogIterator.parseLine(); to facilitate that, parseLine() is now a static method.

* {0.2.0} (Stable)
~~~~~~~~~~~~~~~~~~

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.2.0-bin.tar.gz?download}deduplicator-0.2.0-bin.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.2.0-bin.zip?download}deduplicator-0.2.0-bin.zip}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.2.0-src.tar.gz?download}deduplicator-0.2.0-src.tar.gz}}

  * {{{http://prdownloads.sourceforge.net/deduplicator/deduplicator-0.2.0-src.zip?download}deduplicator-0.2.0-src.zip}}

 First official release. September 13, 2006.
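The Heritrix 1.12.0 format change mentioned under 0.3.0-20070601 is easy to picture: the digest column changed from a bare value to one prefixed with the scheme name (for example sha1:). The following is a minimal, hedged sketch of coping with both forms; the helper name and the sample digest are invented for the example and are not the CrawlLogIterator code.

    /**
     * Heritrix 1.12.0 started writing the content digest as "sha1:ABC..."
     * instead of a bare "ABC...". A reader that wants to accept both formats
     * can simply strip an optional scheme prefix. Illustrative helper only.
     */
    public class DigestFieldSketch {
        static String stripScheme(String digestField) {
            int colon = digestField.indexOf(':');
            // Keep only the value part when a scheme prefix such as "sha1:" is present.
            return colon >= 0 ? digestField.substring(colon + 1) : digestField;
        }

        public static void main(String[] args) {
            System.out.println(stripScheme("sha1:GL3GLDBLFG7L42TFAOXEXNBTLBMLFZBL"));
            System.out.println(stripScheme("GL3GLDBLFG7L42TFAOXEXNBTLBMLFZBL"));
        }
    }

Both calls print the same bare digest, which is what lets one index serve crawl.log files written before and after the 1.12.0 change.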
* PreRelease20060808
~~~~~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060808-bin.tar.gz}deduplicator-PreRelease20060808-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060808-bin.zip}deduplicator-PreRelease20060808-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060808-src.tar.gz}deduplicator-PreRelease20060808-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060808-src.zip}deduplicator-PreRelease20060808-src.zip}}

 Fixed a bug (reported by Lars Clausen) in CrawlLogIterator where entries for files exceeding 10GB would not be parsed correctly: the crawl.log format assumes that the byte size string can never be longer than 10 characters, but 10 GB requires 11 characters, causing the URL to be shifted down the line.

* PreRelease20060717
~~~~~~~~~~~~~~~~~~~~

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060718-bin.tar.gz}deduplicator-PreRelease20060718-bin.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060718-bin.zip}deduplicator-PreRelease20060718-bin.zip}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060718-src.tar.gz}deduplicator-PreRelease20060718-src.tar.gz}}

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-PreRelease20060718-src.zip}deduplicator-PreRelease20060718-src.zip}}

 CrawlLogIterator refactored some more. Project now built with Maven 2.0. First separate source release.

* Older
~~~~~~~

 The following releases were made prior to the implementation of the Maven automatic build/release process. Consequently, only .tar.gz archives of the binaries are available. Note that the jar files also contain source files.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060717.tar.gz}deduplicator-prerelease20060717.tar.gz}}

    * CrawlLogIterator refactored to make subclassing easier (patch from Lars Clausen). Added setters to CrawlDataItem. Improved Javadoc.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060623.tar.gz}deduplicator-prerelease20060623.tar.gz}}

    * DigestIndexer can now be used by other classes. Moved to Lucene 2.0. Improved bash script.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060606.tar.gz}deduplicator-prerelease20060606.tar.gz}}

    * Adds 'origin' and makes overriding of content size configurable.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060601.tar.gz}deduplicator-prerelease20060601.tar.gz}}

    * Adds DeDupFetchHTTP.

  * {{{http://vefsofnun.bok.hi.is/deduplicator_release/deduplicator-prerelease20060516.tar.gz}deduplicator-prerelease20060516.tar.gz}}

    * Initial preview release.

--- NEW FILE: format.apt ---

The APT format
~~~~~~~~~~~~~~

 In the following section, boxes containing text in typewriter-like font are examples of APT source.

* Document structure
~~~~~~~~~~~~~~~~~~~~

 A short APT document is contained in a single text file. A longer document may be contained in an ordered list of text files. For instance, the first text file contains section 1, the second text file contains section 2, and so on.

 [Note:] Splitting the APT document into several text files on a section boundary is not mandatory. The split may occur anywhere. However, doing so is recommended because a text file containing a section is by itself a valid APT document.
A file contains a sequence of paragraphs and ``displays'' (non paragraphs such as tables) separated by open lines. A paragraph is simply a sequence of consecutive text lines. +------------------------------------------------------------------------+ First line of first paragraph. Second line of first paragraph. Third line of first paragraph. Line 1 of paragraph 2 (separated from first paragraph by an open line). Line 2 of paragraph 2. +------------------------------------------------------------------------+ The indentation of the first line of a paragraph is the main method used by an APT processor to recognize the type of the paragraph. For example, a section title must not be indented at all. A ``plain'' paragraph must be indented by a certain amount of space. For example, a plain paragraph which is not contained in a list may be indented by two spaces. +-------------------------------------------------+ My section title (not indented). My paragraph first line (indented by 2 spaces). +-------------------------------------------------+ Indentation is not rigid. Any amount of space will do. You don't even need to use a consistent indentation all over your document. What really matters for an APT processor is whether the paragraph is not indented at all or, when inside a list, whether a paragraph is more or less indented than the first item of the list (more about this later). +-------------------------------------------------------+ First paragraph has its first line indented by four spaces. Then the author did even bother to indent the other lines of the paragraph. Second paragraph contains several lines which are all indented by two spaces. This style is much nicer than the one used for the previous paragraph. +-------------------------------------------------------+ Note that tabs are expanded with a tab width set to 8. * Document elements ~~~~~~~~~~~~~~~~~~~ ** Block level elements ~~~~~~~~~~~~~~~~~~~~~~~ *** Title ~~~~~~~~~~ A title is optional. If used, it must appear as the first block of the document. +----------------------------------------------------------------------------+ ------ Title ------ Author ------ Date +----------------------------------------------------------------------------+ A title block is indented (centering it is nicer). It begins with a line containing at least 3 dashes (<<<--->>>). After the first <<<--->>> line, one or several consecutive lines of text (implicit line break after each line) specify the title of the document. This text may immediately be followed by another <<<--->>> line and one or several consecutive lines of text which specifies the author of the document. The author sub-block may optionaly be followed by a date sub-block using the same syntax. The following example is used for a document with an title and a date but with no declared author. +----------------------------------------------------------------------------+ ------ Title ------ ------ Date ------ +----------------------------------------------------------------------------+ The last line is ignored. It is just there to make the block nicer. *** Paragraph ~~~~~~~~~~~~~ Paragraphs other than the title block may appear before the first section. +----------------------+ Paragraph 1, line 1. Paragraph 1, line 2. Paragraph 2, line 1. Paragraph 2, line 2. +----------------------+ Paragraphs are indented. They have already been described in the {{document structure}} section. *** Section ~~~~~~~~~~~ Sections are created by inserting section titles into the document. 
Simple documents need not contain sections. +-----------------------------------+ Section title * Sub-section title ** Sub-sub-section title *** Sub-sub-sub-section title **** Sub-sub-sub-sub-section title +-----------------------------------+ Section titles are not indented. A sub-section title begins with one asterisk (<<<*>>>), a sub-sub-section title begins with two asterisks (<<<**>>>), and so forth up to four sub-section levels. *** List ~~~~~~~~ +---------------------------------------+ * List item 1. * List item 2. Paragraph contained in list item 2. * Sub-list item 1. * Sub-list item 2. * List item 3. +---------------------------------------+ List items are indented and begin with a asterisk (<<<*>>>). Plain paragraphs more indented than the first list item are nested in that list. Displays such as tables (not indented) are always nested in the current list. To nest a list inside a list, indent its first item more than its parent list. To end a list, add a paragraph or list item less indented than the current list. Section titles always end a list. Displays cannot end a list but the <<<[]>>> pseudo-element may be used to force the end of a list. +------------------------------------+ * List item 3. Force end of list: [] -------------------------------------------- Verbatim text not contained in list item 3 -------------------------------------------- +------------------------------------+ In the previous example, without the <<<[]>>>, the verbatim text (not indented as all displays) would have been contained in list item 3. A single <<<[]>>> may be used to end several nested lists at the same time. The indentation of <<<[]>>> may be used to specify exactly which lists should be ended. Example: +------------------------------------+ * List item 1. * List item 2. * Sub-list item 1. * Sub-list item 2. [] ------------------------------------------------------------------- Verbatim text contained in list item 2, but not in sub-list item 2 ------------------------------------------------------------------- +------------------------------------+ There are three kind of lists, the bulleted lists we have already described, the numbered lists and the definition lists. +-----------------------------------------+ [[1]] Numbered item 1. [[A]] Numbered item A. [[B]] Numbered item B. [[2]] Numbered item 2. +-----------------------------------------+ A numbered list item begins with a label beetween two square brackets. The label of the first item establishes the numbering scheme for the whole list: [<<<[[1\]\]>>>] Decimal numbering: 1, 2, 3, 4, etc. [<<<[[a\]\]>>>] Lower-alpha numbering: a, b, c, d, etc. [<<<[[A\]\]>>>] Upper-alpha numbering: A, B, C, D, etc. [<<<[[i\]\]>>>] Lower-roman numbering: i, ii, iii, iv, etc. [<<<[[I\]\]>>>] Upper-roman numbering: I, II, III, IV, etc. The labels of the items other than the first one are ignored. It is recommended to take the time to type the correct label for each item in order to keep the APT source document readable. +-------------------------------------------+ [Defined term 1] of definition list 2. [Defined term 2] of definition list 2. +-------------------------------------------+ A definition list item begins with a defined term: text between square brackets. *** Verbatim text ~~~~~~~~~~~~~~~~~ +----------------------------------------+ ---------------------------------------- Verbatim text, preformatted, escaped. ---------------------------------------- +----------------------------------------+ A verbatim block is not indented. 
It begins with a non indented line containing at least 3 dashes (<<<--->>>). It ends with a similar line. <<<+-->>> instead of <<<--->>> draws a box around verbatim text. Like in HTML, verbatim text is preformatted. Unlike HTML, verbatim text is escaped: inside a verbatim display, markup is not interpreted by the APT processor. *** Figure ~~~~~~~~~~ +---------------------------+ [Figure name] Figure caption +---------------------------+ A figure block is not indented. It begins with the figure name between square brackets. The figure name is optionally followed by some text: the figure caption. The figure name is the pathname of the file containing the figure but without an extension. Example: if your figure is contained in <<</home/joe/docs/mylogo.jpeg>>>, the figure name is <<</home/joe/docs/mylogo>>>. If the figure name comes from a relative pathname (recommended practice) rather than from an absolute pathname, this relative pathname is taken to be relative to the directory of the current APT document (a la HTML) rather than relative to the current working directory. Why not leave the file extension in the figure name? This is better explained by an example. You need to convert an APT document to PostScript and your figure name is <<</home/joe/docs/mylogo>>>. A APT processor will first try to load <<</home/joe/docs/mylogo.eps>>>. When the desired format is not found, a APT processor tries to convert one of the existing formats. In our example, the APT processor tries to convert <<</home/joe/docs/mylogo.jpeg>>> to encapsulated PostScript. *** Table ~~~~~~~~~ A table block is not indented. It begins with a non indented line containing an asterisk and at least 2 dashes (<<<*-->>>). It ends with a similar line. The first line is not only used to recognize a table but also to specify column justification. In the following example, * the second asterisk (<<<*>>>) is used to specify that column 1 is centered, * the plus sign (<<<+>>>) specifies that column 2 is left aligned, * the colon (<<<:>>>) specifies that column 3 is right aligned. [] +---------------------------------------------+ *----------*--------------+----------------: | Centered | Left-aligned | Right-aligned | | cell 1,1 | cell 1,2 | cell 1,3 | *----------*--------------+----------------: | cell 2,1 | cell 2,2 | cell 2,3 | *----------*--------------+----------------: Table caption +---------------------------------------------+ Rows are separated by a non indented line beginning with <<<*-->>>. An optional table caption (non indented text) may immediately follow the table. Rows may contain single line or multiple line cells. Each line of cell text is separated from the adjacent cell by the pipe character (<<<|>>>). (<<<|>>> may be used in the cell text if quoted: <<<\\|>>>.) The last <<<|>>> is only used to make the table nicer. The first <<<|>>> is not only used to make the table nicer, but also to specify that a grid is to be drawn around table cells. The following example shows a simple table with no grid and no caption. +---------------+ *-----*------* cell | cell *-----*------* cell | cell *-----*------* +---------------+ *** Horizontal rule ~~~~~~~~~~~~~~~~~~~ +---------------------+ ===================== +---------------------+ A non indented line containing at least 3 equal signs (<<<===>>>). *** Page break ~~~~~~~~~~~~~~ +---+ ^L +---+ A non indented line containing a single form feed character (Control-L). 
** Text level elements ~~~~~~~~~~~~~~~~~~~~~~ *** Font ~~~~~~~~ +-----------------------------------------------------+ <Italic> font. <<Bold>> font. <<<Monospaced>>> font. +-----------------------------------------------------+ Text between \< and > must be rendered in italic. Text between \<\< and >> must be rendered in bold. Text between \<\<\< and >>> must be rendered using a monospaced, typewriter-like font. Font elements may appear anywhere except inside other font elements. It is not recommended to use font elements inside titles, section titles, links and defined terms because a APT processor automatically applies appropriate font styles to these elements. *** Anchor and link ~~~~~~~~~~~~~~~~~~~ +-----------------------------------------------------------------+ {Anchor}. Link to {{anchor}}. Link to {{http://www.pixware.fr}}. Link to {{{anchor}showing alternate text}}. Link to {{{http://www.pixware.fr}Pixware home page}}. +-----------------------------------------------------------------+ Text between curly braces (<<<\{}>>>) specifies an anchor. Text between double curly braces (<<<\{\{}}>>>) specifies a link. It is an error to create a link element that does not refer to an anchor of the same name. The name of an anchor/link is its text with all non alphanumeric characters stripped. This rule does not apply to links to <external> anchors. Text beginning with <<<http:/>>>, <<<https:/>>>, <<<ftp:/>>>, <<<file:/>>>, <<<mailto:>>>, <<<../>>>, <<<./>>> (<<<..\\>>> and <<<.\\>>> on Windows) is recognized as an external anchor name. When the construct <<\{\{\{>><name><<}>><text><<}}>> is used, the link text <text> may differ from the link name <name>. Anchor/link elements may appear anywhere except inside other anchor/link elements. Section titles are implicitly defined anchors. *** Line break ~~~~~~~~~~~~~~ +-------------+ Force line\ break. +-------------+ A backslash character (<<<\\>>>) followed by a newline character. Line breaks must not be used inside titles and tables (which are line oriented blocks with implicit line breaks). *** Non breaking space ~~~~~~~~~~~~~~~~~~~~~~ +----------------------+ Non\ breaking\ space. +----------------------+ A backslash character (<<<\\>>>) followed by a space character. *** Special character ~~~~~~~~~~~~~~~~~~~~~ +---------------------------------------------------------------------------+ Escaped special characters: \~, \=, \-, \+, \*, \[, \], \<, \>, \{, \}, \\. +---------------------------------------------------------------------------+ In certain contexts, these characters have a special meaning and therefore must be escaped if needed as is. They are escaped by adding a backslash in front of them. The backslash may itself be escaped by adding another backslash in front of it. Note that an asterisk, for example, needs to be escaped only if its begins a paragraph. (<<<*>>> has no special meaning in the middle of a paragraph.) +--------------------------------------+ Copyright symbol: \251, \xA9, \u00a9. +--------------------------------------+ Latin-1 characters (whatever is the encoding of the APT document) may be specified by their codes using a backslash followed by one to three octal digits or by using the <<<\x>>><NN> notation, where <NN> are two hexadecimal digits. Unicode characters may be specified by their codes using the <<<\u>>><NNNN> notation, where <NNNN> are four hexadecimal digits. *** Comment ~~~~~~~~~~~ +---------------+ ~~Commented out. 
+---------------+ Text found after two tildes (<<<\~~>>>) is ignored up to the end of line. A line of <<<~>>> is often used to ``underline'' section titles in order to make them stand out of other paragraphs. * The APT format at a glance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ------------------------------------------------------------------------------ ------ Title ------ Author ------ Date Paragraph 1, line 1. Paragraph 1, line 2. Paragraph 2, line 1. Paragraph 2, line 2. Section title * Sub-section title ** Sub-sub-section title *** Sub-sub-sub-section title **** Sub-sub-sub-sub-section title * List item 1. * List item 2. Paragraph contained in list item 2. * Sub-list item 1. * Sub-list item 2. * List item 3. Force end of list: [] +------------------------------------------+ Verbatim text not contained in list item 3 +------------------------------------------+ [[1]] Numbered item 1. [[A]] Numbered item A. [[B]] Numbered item B. [[2]] Numbered item 2. List numbering schemes: [[1]], [[a]], [[A]], [[i]], [[I]]. [Defined term 1] of definition list. [Defined term 2] of definition list. +-------------------------------+ Verbatim text in a box +-------------------------------+ --- instead of +-- suppresses the box around verbatim text. [Figure name] Figure caption *----------*--------------+----------------: | Centered | Left-aligned | Right-aligned | | cell 1,1 | cell 1,2 | cell 1,3 | *----------*--------------+----------------: | cell 2,1 | cell 2,2 | cell 2,3 | *----------*--------------+----------------: Table caption No grid, no caption: *-----*------* cell | cell *-----*------* cell | cell *-----*------* Horizontal line: ======================================================================= ^L New page. <Italic> font. <<Bold>> font. <<<Monospaced>>> font. {Anchor}. Link to {{anchor}}. Link to {{http://www.pixware.fr}}. Link to {{{anchor}showing alternate text}}. Link to {{{http://www.pixware.fr}Pixware home page}}. Force line\ break. Non\ breaking\ space. Escaped special characters: \~, \=, \-, \+, \*, \[, \], \<, \>, \{, \}, \\. Copyright symbol: \251, \xA9, \u00a9. ~~Commented out. ------------------------------------------------------------------------------ --- NEW FILE: license.apt --- License +------------------------------------------------------------------------+ DeDuplicator is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser Public license (LGPL) reproduced below. DeDuplicator includes the libraries it depends upon. The libraries used can be found under the 'lib' directory. GNU LESSER GENERAL PUBLIC LICENSE Version 2.1, February 1999 Copyright (C) 1991, 1999 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. [This is the first released version of the Lesser GPL. It also counts as the successor of the GNU Library Public License, version 2, hence the version number 2.1.] Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This license, the Lesser General Public License, applies to some specially designated software packages--typically libraries--of the Free Software Foundation and other authors who decide to use it. 
You can use it too, but we suggest you first think carefully about whether this license or the ordinary General Public License is the better strategy to use in any particular case, based on the explanations below. When we speak of free software, we are referring to freedom of use, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish); that you receive source code or can get it if you want it; that you can change the software and use pieces of it in new free programs; and that you are informed that you can do these things. To protect your rights, we need to make restrictions that forbid distributors to deny you these rights or to ask you to surrender these rights. These restrictions translate to certain responsibilities for you if you distribute copies of the library or if you modify it. For example, if you distribute copies of the library, whether gratis or for a fee, you must give the recipients all the rights that we gave you. You must make sure that they, too, receive or can get the source code. If you link other code with the library, you must provide complete object files to the recipients, so that they can relink them with the library after making changes to the library and recompiling it. And you must show them these terms so they know their rights. We protect your rights with a two-step method: (1) we copyright the library, and (2) we offer you this license, which gives you legal permission to copy, distribute and/or modify the library. To protect each distributor, we want to make it very clear that there is no warranty for the free library. Also, if the library is modified by someone else and passed on, the recipients should know that what they have is not the original version, so that the original author's reputation will not be affected by problems that might be introduced by others. Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license. Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries, and is quite different from the ordinary General Public License. We use this license for certain libraries in order to permit linking those libraries into non-free programs. When a program is linked with a library, whether statically or using a shared library, the combination of the two is legally speaking a combined work, a derivative of the original library. The ordinary General Public License therefore permits such linking only if the entire combination fits its criteria of freedom. The Lesser General Public License permits more lax criteria for linking other code with the library. We call this license the "Lesser" General Public License because it does Less to protect the user's freedom than the ordinary General Public License. It also provides other free software developers Less of an advantage over competing non-free programs. These disadvantages are the reason we use the ordinary General Public License for many libraries. 
However, the Lesser license provides advantages in certain special circumstances. For example, on rare occasions, there may be a special need to encourage the widest possible use of a certain library, so that it becomes a de-facto standard. To achieve this, non-free programs must be allowed to use the library. A more frequent case is that a free library does the same job as widely used non-free libraries. In this case, there is little to gain by limiting the free library to free software only, so we use the Lesser General Public License. In other cases, permission to use a particular library in non-free programs enables a greater number of people to use a large body of free software. For example, permission to use the GNU C Library in non-free programs enables many more people to use the whole GNU operating system, as well as its variant, the GNU/Linux operating system. Although the Lesser General Public License is Less protective of the users' freedom, it does ensure that the user of a program that is linked with the Library has the freedom and the wherewithal to run that program using a modified version of the Library. The precise terms and conditions for copying, distribution and modification follow. Pay close attention to the difference between a "work based on the library" and a "work that uses the library". The former contains code derived from the library, whereas the latter must be combined with the library in order to run. GNU LESSER GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any software library or other program which contains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Lesser General Public License (also called "this License"). Each licensee is addressed as "you". A "library" means a collection of software functions and/or data prepared so as to be conveniently linked with application programs (which use some of those functions and data) to form executables. The "Library", below, refers to any such software library or work which has been distributed under these terms. A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".) "Source code" for a work means the preferred form of the work for making modifications to it. For a library, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the library. Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Library is not restricted, and output from such a program is covered only if its contents constitute a work based on the Library (independent of the use of the Library in a tool for writing it). Whether that is true depends on what the Library does and what the program that uses the Library does. 1. 
You may copy and distribute verbatim copies of the Library's complete source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Library. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Library or any portion of it, thus forming a work based on the Library, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a software library. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License. d) If a facility in the modified Library refers to a function or a table of data to be supplied by an application program that uses the facility, other than as an argument passed when the facility is invoked, then you must make a good faith effort to ensure that, in the event an application does not supply such function or table, the facility still operates, and performs whatever part of its purpose remains meaningful. (For example, a function in a library to compute square roots has a purpose that is entirely well-defined independent of the application. Therefore, Subsection 2d requires that any application-supplied function or table used by this function must be optional: if the application does not supply it, the square root function must still compute square roots.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Library, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Library, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Library. In addition, mere aggregation of another work not based on the Library with the Library (or with a work based on the Library) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may opt to apply the terms of the ordinary GNU General Public License instead of this License to a given copy of the Library. To do this, you must alter all the notices that refer to this License, so that they refer to the ordinary GNU General Public License, version 2, instead of to this License. (If a newer version than version 2 of the ordinary GNU General Public License has appeared, then you can specify that version instead if you wish.) Do not make any other change in these notices. 
Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU General Public License applies to all subsequent copies and derivative works made from that copy. This option is useful when you wish to copy part of the code of the Library into a program that is not a library. 4. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. If distribution of object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place satisfies the requirement to distribute the source code, even though third parties are not compelled to copy the source along with the object code. 5. A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License. However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables. When a "work that uses the Library" uses material from a header file that is part of the Library, the object code for the work may be a derivative work of the Library even though the source code is not. Whether this is true is especially significant if the work can be linked without the Library, or if the work is itself a library. The threshold for this to be true is not precisely defined by law. If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.) Otherwise, if the work is a derivative of the Library, you may distribute the object code for the work under the terms of Section 6. Any executables containing that work also fall under Section 6, whether or not they are linked directly with the Library itself. 6. As an exception to the Sections above, you may also combine or link a "work that uses the Library" with the Library to produce a work containing portions of the Library, and distribute that work under terms of your choice, provided that the terms permit modification of the work for the customer's own use and reverse engineering for debugging such modifications. You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License. You must supply a copy of this License. If the work during execution displays copyright notices, you must include the copyright notice for the Library among them, as well as a reference directing the user to the copy of this License. 
Also, you must do one of these things: a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.) b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with. c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution. d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place. e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy. For an executable, the required form of the "work that uses the Library" must include any data and utility programs needed for reproducing the executable from it. However, as a special exception, the materials to be distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. It may happen that this requirement contradicts the license restrictions of other proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Library together in an executable that you distribute. 7. You may place library facilities that are a work based on the Library side-by-side in a single library together with other library facilities not covered by this License, and distribute such a combined library, provided that the separate distribution of the work based on the Library and of the other library facilities is otherwise permitted, and provided that you do these two things: a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities. This must be distributed under the terms of the Sections above. b) Give prominent notice with the combined library of the fact that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. 8. You may not copy, modify, sublicense, link with, or distribute the Library except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or distribute the Library is void, and will automatically terminate your rights under this License. 
However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 9. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Library or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Library (or any work based on the Library), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Library or works based on it. 10. Each time you redistribute the Library (or any work based on the Library), the recipient automatically receives a license from the original licensor to copy, distribute, link with or modify the Library subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties with this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Library at all. For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 12. If the distribution and/or use of the Library is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Library under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 13. The Free Software Foundation may publish revised and/or new versions of the Lesser General Public License from time to time. 
Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Library specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Library does not s... [truncated message content] |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:55
|
Update of /cvsroot/deduplicator/deduplicator3/src/site In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site Added Files: site.xml Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: site.xml --- <?xml version="1.0" encoding="ISO-8859-1"?> <project name="DeDuplicator"> <skin> <groupId>org.apache.maven.skins</groupId> <artifactId>maven-default-skin</artifactId> <version>1.0</version> </skin> <bannerLeft> <name>DeDuplicator</name> <src>images/dedup.png</src> <href>http://vefsofnun.bok.hi.is/deduplicator</href> </bannerLeft> <bannerRight> <name>National and University Library of Iceland</name> <src>images/lbs.gif</src> <href>http://landsbokasafn.is</href> </bannerRight> <poweredBy> <logo name="SourceForge.net Logo" href="http://sourceforge.net" img="http://sflogo.sourceforge.net/sflogo.php?group_id=181565&type=1"/> <logo name="Lucene Logo" href="http://lucene.apache.org" img="images/lucene.jpg"/> <logo name="Build with Maven 2" href="http://maven.apache.org/" img="images/logos/maven-feather.png"/> </poweredBy> <publishDate format="MMMM d, yyyy"/> <body> <links> <item name="Heritrix" href="http://crawler.archive.org/" /> <item name="Lucene" href="http://lucene.apache.org/" /> <item name="SourceForge" href="http://sourceforge.net/projects/deduplicator/" /> </links> <menu name="DeDuplicator"> <item name="Welcome" href="index.html"/> <item name="FAQ" href="faq.html"/> <item name="Releases" href="release.html"/> <item name="License" href="license.html"/> <item name="Getting started" href="started.html"/> <item name="Javadoc" href="apidocs/index.html"/> </menu> <menu ref="reports" /> </body> </project> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/main/java/is/landsbokasafn/deduplicator Modified Files: CommandLineParser.java DeDuplicator.java CrawlLogIterator.java DigestIndexer.java DeDupFetchHTTP.java CrawlDataItem.java Log Message: Added site. Improvments on how the Lucene index is accessed Added size filter to DigestIndexer. Index: DeDupFetchHTTP.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/DeDupFetchHTTP.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** DeDupFetchHTTP.java 14 Jul 2010 16:19:11 -0000 1.1 --- DeDupFetchHTTP.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 23,36 **** package is.landsbokasafn.deduplicator; - import java.io.IOException; - import java.text.SimpleDateFormat; - import java.util.logging.Level; - import java.util.logging.Logger; - import org.archive.modules.fetcher.FetchHTTP; - - import dk.netarkivet.common.utils.SparseRangeFilter; - /** * An extentsion of Heritrix's {@link org.archive.crawler.fetcher.FetchHTTP} --- 23,28 ---- Index: CrawlDataItem.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/CrawlDataItem.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** CrawlDataItem.java 14 Jul 2010 16:19:11 -0000 1.1 --- CrawlDataItem.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 44,47 **** --- 44,48 ---- protected String origin; protected boolean duplicate; + protected long size; /** *************** *** 57,60 **** --- 58,62 ---- origin = null; duplicate = false; + size = -1; } *************** *** 75,79 **** */ public CrawlDataItem(String URL, String contentDigest, String timestamp, ! String etag, String mimetype, String origin, boolean duplicate){ this.URL = URL; this.contentDigest = contentDigest; --- 77,81 ---- */ public CrawlDataItem(String URL, String contentDigest, String timestamp, ! String etag, String mimetype, String origin, boolean duplicate, long size){ this.URL = URL; this.contentDigest = contentDigest; *************** *** 83,86 **** --- 85,89 ---- this.origin = origin; this.duplicate = duplicate; + this.size = size; } *************** *** 201,203 **** --- 204,222 ---- } + /** + * Get the size of the CrawlDataItem. + * @return The size or -1 if the size could not be determined. 
+ */ + public long getSize() { + return size; + } + + /** + * Set the size of the CrawlDataItem + * @param size The size or -1 if the size is indeterminate + */ + public void setSize(long size) { + this.size = size; + } + } Index: DeDuplicator.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/DeDuplicator.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** DeDuplicator.java 21 Jul 2010 14:02:58 -0000 1.2 --- DeDuplicator.java 27 Jul 2010 09:09:46 -0000 1.3 *************** *** 32,41 **** import java.text.ParseException; import java.text.SimpleDateFormat; - import java.util.ArrayList; - import java.util.Arrays; import java.util.Date; import java.util.HashMap; import java.util.Iterator; - import java.util.List; import java.util.Locale; import java.util.Map; --- 32,38 ---- *************** *** 45,48 **** --- 42,46 ---- import org.apache.commons.httpclient.HttpMethod; import org.apache.lucene.document.Document; + import org.apache.lucene.index.Term; import org.apache.lucene.search.ConstantScoreQuery; import org.apache.lucene.search.IndexSearcher; *************** *** 61,66 **** import org.springframework.beans.factory.annotation.Autowired; - import dk.netarkivet.common.utils.SparseRangeFilter; - /** * Heritrix compatible processor. --- 59,62 ---- *************** *** 79,84 **** Logger.getLogger(DeDuplicator.class.getName()); - private static final int MAX_HITS = 1000; - // Spring configurable parameters --- 75,78 ---- *************** *** 197,201 **** } public void setChangeContentSize(boolean changeContentSize){ ! kp.put(ATTR_CHANGE_CONTENT_SIZE,changeContentSize); } --- 191,195 ---- } public void setChangeContentSize(boolean changeContentSize){ ! kp.put(ATTR_CHANGE_CONTENT_SIZE, changeContentSize); } *************** *** 483,489 **** // Look the CrawlURI's URL up in the index. try { ! Query query = queryField(DigestIndexer.FIELD_URL, ! curi.toString()); ! TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; --- 477,483 ---- // Look the CrawlURI's URL up in the index. try { ! Query query = queryField(DigestIndexer.FIELD_URL, curi.toString()); ! TopScoreDocCollector collector = TopScoreDocCollector.create( ! searcher.docFreq(new Term(DigestIndexer.FIELD_URL, curi.toString())), false); searcher.search(query, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; *************** *** 516,522 **** // No exact hits. Let's try lenient matching. String normalizedURL = DigestIndexer.stripURL(curi.toString()); ! query = queryField(DigestIndexer.FIELD_URL_NORMALIZED, ! normalizedURL); ! collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); hits = collector.topDocs().scoreDocs; --- 510,516 ---- // No exact hits. Let's try lenient matching. String normalizedURL = DigestIndexer.stripURL(curi.toString()); ! query = queryField(DigestIndexer.FIELD_URL_NORMALIZED, normalizedURL); ! collector = TopScoreDocCollector.create( ! searcher.docFreq(new Term(DigestIndexer.FIELD_URL_NORMALIZED, normalizedURL)), false); searcher.search(query,collector); hits = collector.topDocs().scoreDocs; *************** *** 569,573 **** Query query = queryField(DigestIndexer.FIELD_DIGEST, currentDigest); try { ! 
TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; --- 563,568 ---- Query query = queryField(DigestIndexer.FIELD_DIGEST, currentDigest); try { ! int hitsOnDigest = searcher.docFreq(new Term(DigestIndexer.FIELD_DIGEST,currentDigest)); ! TopScoreDocCollector collector = TopScoreDocCollector.create(hitsOnDigest, false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; *************** *** 592,597 **** currHostStats.exactURLDuplicates++; } ! logger.finest("Found exact match for " + ! curi.toString()); } --- 587,591 ---- currHostStats.exactURLDuplicates++; } ! logger.finest("Found exact match for " + curi.toString()); } *************** *** 754,760 **** boolean isDuplicate) { try{ ! Query query = queryField(DigestIndexer.FIELD_URL, ! curi.toString()); ! TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; --- 748,754 ---- boolean isDuplicate) { try{ ! Query query = queryField(DigestIndexer.FIELD_URL, curi.toString()); ! TopScoreDocCollector collector = TopScoreDocCollector.create( ! searcher.docFreq(new Term(DigestIndexer.FIELD_URL, curi.toString())), false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; *************** *** 874,884 **** protected Query queryField(String fieldName, String value) { Query query = null; - if(getUseSparseRengeFilter()){ - query = new ConstantScoreQuery( - new SparseRangeFilter(fieldName, value, value, true, true)); - } else { query = new ConstantScoreQuery( new TermRangeFilter(fieldName, value, value, true, true)); - } return query; --- 868,873 ---- Index: CrawlLogIterator.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/CrawlLogIterator.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** CrawlLogIterator.java 14 Jul 2010 16:19:11 -0000 1.1 --- CrawlLogIterator.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 163,167 **** // Index 1: status return code (ignore) ! // Index 2: File size (ignore) // Index 3: URL --- 163,173 ---- // Index 1: status return code (ignore) ! // Index 2: File size ! long size = -1; ! try { ! size = Long.parseLong(lineParts[2]); ! } catch (NumberFormatException e) { ! System.err.println("Error parsing size for: " + line); ! } // Index 3: URL *************** *** 215,220 **** } // Got a valid item. ! return new CrawlDataItem( ! url,digest,timestamp,null,mime,origin,duplicate); } return null; --- 221,225 ---- } // Got a valid item. ! return new CrawlDataItem(url, digest, timestamp, null, mime, origin, duplicate, size); } return null; Index: CommandLineParser.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/CommandLineParser.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** CommandLineParser.java 14 Jul 2010 16:19:11 -0000 1.1 --- CommandLineParser.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 118,121 **** --- 118,127 ---- "index.")); + opt = new Option("l","minsize", true, + "If set (with a value greather than zero), documents with a known size smaller than the " + + "value given here will be omitted from the index. 
Minimum size should be specified in bytes."); + opt.setArgName("minsize"); + this.options.addOption(opt); + PosixParser parser = new PosixParser(); try { Index: DigestIndexer.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/DigestIndexer.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** DigestIndexer.java 14 Jul 2010 16:19:11 -0000 1.1 --- DigestIndexer.java 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 163,168 **** boolean verbose) throws IOException { ! return writeToIndex(dataIt, mimefilter, blacklist, defaultOrigin, ! verbose,false); } --- 163,167 ---- boolean verbose) throws IOException { ! return writeToIndex(dataIt, mimefilter, blacklist, defaultOrigin, verbose, false, -1); } *************** *** 184,189 **** * @param verbose If true then progress information will be sent to * System.out. ! * @param skipDuplicates Do not add URLs that are marked as duplicates to ! * the index * @return The number of items added to the index. * @throws IOException If an error occurs writing the index. --- 183,190 ---- * @param verbose If true then progress information will be sent to * System.out. ! * @param skipDuplicates Do not add URLs that are marked as duplicates to the index ! * @param minSize The minimum size of documents added to the index. Documents ! * smaller than this are ignored. Documents with unknown size (CrawlDataItem size set to -1) ! * are not subject to this limit. A value of lesser than or equal to zero disables this feature. * @return The number of items added to the index. * @throws IOException If an error occurs writing the index. *************** *** 195,199 **** String defaultOrigin, boolean verbose, ! boolean skipDuplicates) throws IOException { --- 196,201 ---- String defaultOrigin, boolean verbose, ! boolean skipDuplicates, ! long minSize) throws IOException { *************** *** 202,207 **** while (dataIt.hasNext()) { CrawlDataItem item = dataIt.next(); ! if(!(skipDuplicates && item.duplicate) && ! item.mimetype.matches(mimefilter) != blacklist){ // Ok, we wish to index this URL/Digest count++; --- 204,210 ---- while (dataIt.hasNext()) { CrawlDataItem item = dataIt.next(); ! if ( !(skipDuplicates && item.duplicate) && // Check for duplicates ! item.mimetype.matches(mimefilter) != blacklist && // Apply mime-filter ! (item.size==-1 || item.size > minSize)) { // Apply size filter // Ok, we wish to index this URL/Digest count++; *************** *** 212,216 **** Document doc = new Document(); ! // Add URL to index. doc.add(new Field( FIELD_URL, --- 215,219 ---- Document doc = new Document(); ! // Add URL to document. doc.add(new Field( FIELD_URL, *************** *** 229,233 **** } ! // Add digest to index doc.add(new Field( FIELD_DIGEST, --- 232,236 ---- } ! // Add digest to document doc.add(new Field( FIELD_DIGEST, *************** *** 237,241 **** Field.Index.NOT_ANALYZED : Field.Index.NO) )); ! if(timestamp){ doc.add(new Field( --- 240,245 ---- Field.Index.NOT_ANALYZED : Field.Index.NO) )); ! ! // Include timestamp? if(timestamp){ doc.add(new Field( *************** *** 246,249 **** --- 250,254 ---- )); } + // Include etag? if(etag && item.getEtag()!=null){ doc.add(new Field( *************** *** 254,257 **** --- 259,263 ---- )); } + // Set origin if(defaultOrigin!=null){ String tmp = item.getOrigin(); *************** *** 272,277 **** } if(verbose){ ! System.out.println("Indexed " + count + " items (skipped " + ! 
skipped + ")"); } return count; --- 278,282 ---- } if(verbose){ ! System.out.println("Indexed " + count + " items (skipped " + skipped + ")"); } return count; *************** *** 327,330 **** --- 332,336 ---- String origin = null; boolean skipDuplicates = false; + long size = -1; // Process the options *************** *** 344,347 **** --- 350,354 ---- case 'r' : origin = opt.getValue(); break; case 'd' : skipDuplicates = true; break; + case 'l' : size = Long.parseLong(opt.getValue()); break; } } *************** *** 387,391 **** // Create the index ! di.writeToIndex(iterator,mimefilter,blacklist,origin,true,skipDuplicates); // Clean-up --- 394,398 ---- // Create the index ! di.writeToIndex(iterator, mimefilter, blacklist, origin, true, skipDuplicates, size); // Clean-up |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/fml In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site/fml Added Files: faq.fml Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: faq.fml --- <?xml version="1.0"?> <faqs id="General FAQ"> <part id="General"> <faq id="what"> <question>What is the DeDuplicator?</question> <answer> <p> The DeDuplicator is an add-on module for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls. </p> </answer> </faq> </part> </faqs> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/resources/images In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site/resources/images Added Files: dedup.png lbs.gif lucene.jpg Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: lbs.gif --- (This appears to be a binary file; contents omitted.) --- NEW FILE: lucene.jpg --- (This appears to be a binary file; contents omitted.) --- NEW FILE: dedup.png --- (This appears to be a binary file; contents omitted.) |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/xdoc In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/site/xdoc Added Files: xdoc.xml Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: xdoc.xml --- <?xml version="1.0"?> <document> <properties> <title>Welcome</title> <author email="de...@ma...">The Maven Team</author> </properties> <body> <section name="Welcome to an XDOC file!"> <p> This is some text for the xdoc file. </p> </section> </body> </document> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/dk/netarkivet/common/utils In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/src/main/java/dk/netarkivet/common/utils Removed Files: SparseRangeFilter.java SparseBitSet.java Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- SparseBitSet.java DELETED --- --- SparseRangeFilter.java DELETED --- |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3 In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048 Modified Files: .project Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. Index: .project =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/.project,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** .project 14 Jul 2010 16:19:11 -0000 1.1 --- .project 27 Jul 2010 09:09:46 -0000 1.2 *************** *** 7,10 **** --- 7,15 ---- <buildSpec> <buildCommand> + <name>org.eclipse.wst.jsdt.core.javascriptValidator</name> + <arguments> + </arguments> + </buildCommand> + <buildCommand> <name>org.eclipse.jdt.core.javabuilder</name> <arguments> *************** *** 20,23 **** --- 25,29 ---- <nature>org.eclipse.jdt.core.javanature</nature> <nature>org.maven.ide.eclipse.maven2Nature</nature> + <nature>org.eclipse.wst.jsdt.core.jsNature</nature> </natures> </projectDescription> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:54
|
Update of /cvsroot/deduplicator/deduplicator3/.settings In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv23048/.settings Added Files: .jsdtscope org.eclipse.wst.jsdt.ui.superType.container org.eclipse.wst.jsdt.ui.superType.name Log Message: Added site. Improvements on how the Lucene index is accessed. Added size filter to DigestIndexer. --- NEW FILE: org.eclipse.wst.jsdt.ui.superType.name --- Window --- NEW FILE: org.eclipse.wst.jsdt.ui.superType.container --- org.eclipse.wst.jsdt.launching.baseBrowserLibrary --- NEW FILE: .jsdtscope --- <?xml version="1.0" encoding="UTF-8"?> <classpath> <classpathentry kind="con" path="org.eclipse.wst.jsdt.launching.JRE_CONTAINER"/> <classpathentry kind="con" path="org.eclipse.wst.jsdt.launching.WebProject"> <attributes> <attribute name="hide" value="true"/> </attributes> </classpathentry> <classpathentry kind="con" path="org.eclipse.wst.jsdt.launching.baseBrowserLibrary"/> <classpathentry kind="output" path=""/> </classpath> |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/apt In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/apt Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/apt added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/fml In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/fml Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/fml added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/xdoc In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/xdoc Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/xdoc added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/resources In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/resources Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/resources added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 09:09:48
|
Update of /cvsroot/deduplicator/deduplicator3/src/site/resources/images In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv22995/src/site/resources/images Log Message: Directory /cvsroot/deduplicator/deduplicator3/src/site/resources/images added to the repository |
From: Kristinn S. <kri...@us...> - 2010-07-27 08:53:09
|
Update of /cvsroot/deduplicator/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv20868 Modified Files: pom.xml Log Message: 1.0.0-RC1 Index: pom.xml =================================================================== RCS file: /cvsroot/deduplicator/deduplicator/pom.xml,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** pom.xml 28 May 2009 14:53:35 -0000 1.13 --- pom.xml 27 Jul 2010 08:53:00 -0000 1.14 *************** *** 7,11 **** <artifactId>deduplicator</artifactId> <name>DeDuplicator (Heritrix add-on module)</name> ! <version>0.5.0</version> <description> An add-on module for the web crawler Heritrix that offers a --- 7,11 ---- <artifactId>deduplicator</artifactId> <name>DeDuplicator (Heritrix add-on module)</name> ! <version>1.0.0-RC1</version> <description> An add-on module for the web crawler Heritrix that offers a *************** *** 13,17 **** series of snapshot crawls. </description> ! <url>http://vefsofnun.bok.hi.is/deduplicator</url> <issueManagement> <system>SourceForge Trackers</system> --- 13,17 ---- series of snapshot crawls. </description> ! <url>http://deduplicator.sourceforge.net/</url> <issueManagement> <system>SourceForge Trackers</system> *************** *** 50,54 **** <developer> <id>Kristinn</id> ! <name>Kristinn Sigurðsson</name> <email>kristsi at bok.hi.is</email> <organization> --- 50,54 ---- <developer> <id>Kristinn</id> ! <name>Kristinn Sigurðsson</name> <email>kristsi at bok.hi.is</email> <organization> *************** *** 73,77 **** </contributor> <contributor> ! <name>Kåre Fiedler Christiansen</name> <email>kfc at statsbiblioteket.dk</email> <organization>Netarkivet.dk</organization> --- 73,77 ---- </contributor> <contributor> ! <name>Kare Fiedler Christiansen</name> <email>kfc at statsbiblioteket.dk</email> <organization>Netarkivet.dk</organization> *************** *** 97,100 **** --- 97,110 ---- <finalName>${project.artifactId}-${project.version}-${buildNumber}</finalName> <plugins> + <!-- this is a java 1.5 project --> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-compiler-plugin</artifactId> + <configuration> + <source>1.5</source> + <target>1.5</target> + <encoding>UTF-8</encoding> + </configuration> + </plugin> <plugin> <groupId>org.codehaus.mojo</groupId> *************** *** 129,133 **** </descriptor> </descriptors> ! <finalName>${project.artifactId}-${project.version}-${buildNumber}</finalName> </configuration> <executions> --- 139,143 ---- </descriptor> </descriptors> ! <finalName>${project.artifactId}-${project.version}</finalName> </configuration> <executions> *************** *** 185,189 **** <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> ! <version>2.0.0</version> <scope>compile</scope> </dependency> --- 195,199 ---- <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> ! <version>2.2.0</version> <scope>compile</scope> </dependency> |
From: Kristinn S. <kri...@us...> - 2010-07-26 10:05:12
|
Update of /cvsroot/deduplicator/deduplicator/src/main/java/is/hi/bok/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv27795/src/main/java/is/hi/bok/deduplicator Modified Files: DeDuplicator.java Log Message: Bugfix in how we make Heritrix accept that a CrawlURI is in effect a duplicate. Index: DeDuplicator.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator/src/main/java/is/hi/bok/deduplicator/DeDuplicator.java,v retrieving revision 1.8 retrieving revision 1.9 diff -C2 -d -r1.8 -r1.9 *** DeDuplicator.java 3 Jun 2009 10:00:43 -0000 1.8 --- DeDuplicator.java 26 Jul 2010 10:05:00 -0000 1.9 *************** *** 602,606 **** } AList oldVisit = new HashtableAList(); ! oldVisit.putString(CoreAttributeConstants.A_CONTENT_DIGEST, curi.getContentDigestSchemeString()); history[1]=oldVisit; --- 602,606 ---- } AList oldVisit = new HashtableAList(); ! oldVisit.putString(CoreAttributeConstants.A_CONTENT_DIGEST, curi.getContentDigestString()); history[1]=oldVisit; |
From: Kristinn S. <kri...@us...> - 2010-07-21 14:04:36
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/conf/jobs/profile-deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv16558/src/main/conf/jobs/profile-deduplicator Modified Files: profile-crawler-beans.cxml Log Message: Made settings that have a limited set of options into enums. Index: profile-crawler-beans.cxml =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/conf/jobs/profile-deduplicator/profile-crawler-beans.cxml,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** profile-crawler-beans.cxml 14 Jul 2010 16:19:11 -0000 1.1 --- profile-crawler-beans.cxml 21 Jul 2010 14:04:28 -0000 1.2 *************** *** 268,272 **** <!-- <property name="statsPerHost" value="false" /> --> <!-- <property name="useSparseRengeFilter" value="false" /> --> ! <!-- <property name="originHandling" value="No origin information" /> --> </bean> --- 268,272 ---- <!-- <property name="statsPerHost" value="false" /> --> <!-- <property name="useSparseRengeFilter" value="false" /> --> ! <!-- <property name="originHandling" value="NONE" /> --> </bean> |
From: Kristinn S. <kri...@us...> - 2010-07-21 14:03:09
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv16168/src/main/java/is/landsbokasafn/deduplicator Modified Files: DeDuplicator.java Log Message: Made settings that have a limited set of options into enums. Fixed a bug with how the 'last' entry is faked to have Heritrix realize that the curi was deemed a duplicate. Fixed a bug that prevented origin info from being correctly added to crawl.log annotations. Index: DeDuplicator.java =================================================================== RCS file: /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator/DeDuplicator.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** DeDuplicator.java 14 Jul 2010 16:19:11 -0000 1.1 --- DeDuplicator.java 21 Jul 2010 14:02:58 -0000 1.2 *************** *** 95,115 **** /* The matching method in use (by url or content digest) */ private final static String ATTR_MATCHING_METHOD = "matching-method"; ! private final static List<String> AVAILABLE_MATCHING_METHODS = new ArrayList<String>(Arrays.asList(new String[]{ ! "URL", ! "Content digest" ! })); ! private final static String DEFAULT_MATCHING_METHOD = AVAILABLE_MATCHING_METHODS.get(0); { setMatchingMethod(DEFAULT_MATCHING_METHOD); } ! public String getMatchingMethod() { ! return (String) kp.get(ATTR_MATCHING_METHOD); } ! public void setMatchingMethod(String matchinMethod) { ! if (AVAILABLE_MATCHING_METHODS.contains(matchinMethod)) { ! kp.put(ATTR_MATCHING_METHOD,matchinMethod); ! } else { ! throw new IllegalArgumentException("Invalid matching method: " + matchinMethod); ! } } --- 95,111 ---- /* The matching method in use (by url or content digest) */ private final static String ATTR_MATCHING_METHOD = "matching-method"; ! enum MatchingMethod { ! URL, ! DIGEST ! } ! private final static MatchingMethod DEFAULT_MATCHING_METHOD = MatchingMethod.URL; { setMatchingMethod(DEFAULT_MATCHING_METHOD); } ! public MatchingMethod getMatchingMethod() { ! return (MatchingMethod) kp.get(ATTR_MATCHING_METHOD); } ! public void setMatchingMethod(MatchingMethod matchinMethod) { ! kp.put(ATTR_MATCHING_METHOD, matchinMethod); } *************** *** 230,254 **** /* How should 'origin' be handled */ public final static String ATTR_ORIGIN_HANDLING = "origin-handling"; ! public final static String ORIGIN_HANDLING_NONE = "No origin information"; ! public final static String ORIGIN_HANDLING_PROCESSOR = "Use processor setting"; ! public final static String ORIGIN_HANDLING_INDEX = "Use index information"; ! public final static List<String> AVAILABLE_ORIGIN_HANDLING = new ArrayList<String>(Arrays.asList(new String[]{ ! ORIGIN_HANDLING_NONE, ! ORIGIN_HANDLING_PROCESSOR, ! ORIGIN_HANDLING_INDEX ! })); ! public final static String DEFAULT_ORIGIN_HANDLING = ORIGIN_HANDLING_NONE; { setOriginHandling(DEFAULT_ORIGIN_HANDLING); } ! public String getOriginHandling() { ! return (String) kp.get(ATTR_ORIGIN); } ! public void setOriginHandling(String originHandling) { ! if (AVAILABLE_ORIGIN_HANDLING.contains(originHandling)) { ! kp.put(ATTR_ORIGIN_HANDLING,originHandling); ! } else { ! throw new IllegalArgumentException("Invalid origin handling: " + originHandling); ! } } --- 226,243 ---- /* How should 'origin' be handled */ public final static String ATTR_ORIGIN_HANDLING = "origin-handling"; ! enum OriginHandling { ! NONE, // No origin information ! PROCESSOR, // Use processor setting -- ATTR_ORIGIN ! 
INDEX // Use index information, each hit on index should contain origin ! } ! public final static OriginHandling DEFAULT_ORIGIN_HANDLING = OriginHandling.NONE; { setOriginHandling(DEFAULT_ORIGIN_HANDLING); } ! public OriginHandling getOriginHandling() { ! return (OriginHandling) kp.get(ATTR_ORIGIN_HANDLING); } ! public void setOriginHandling(OriginHandling originHandling) { ! kp.put(ATTR_ORIGIN_HANDLING,originHandling); } *************** *** 291,296 **** // Matching method ! String matchingMethod = getMatchingMethod(); ! lookupByURL = matchingMethod.equals(DEFAULT_MATCHING_METHOD); // Track per host stats --- 280,285 ---- // Matching method ! MatchingMethod matchingMethod = getMatchingMethod(); ! lookupByURL = matchingMethod == MatchingMethod.URL; // Track per host stats *************** *** 298,306 **** // Origin handling. ! String originHandling = getOriginHandling(); ! if(originHandling.equals(ORIGIN_HANDLING_NONE)==false){ useOrigin = true; ! if(originHandling.equals(ORIGIN_HANDLING_INDEX)){ useOriginFromIndex = true; } } --- 287,297 ---- // Origin handling. ! OriginHandling originHandling = getOriginHandling(); ! if (originHandling != OriginHandling.NONE) { useOrigin = true; ! logger.fine("Use origin"); ! if (originHandling == OriginHandling.INDEX) { useOriginFromIndex = true; + logger.fine("Use origin from index"); } } *************** *** 419,424 **** duplicate.get(DigestIndexer.FIELD_ORIGIN)!=null){ // Index contains origin, use it. ! annotation += ":\"" + duplicate.get( ! DigestIndexer.FIELD_ORIGIN) + "\""; } else { String tmp = getOrigin(); --- 410,414 ---- duplicate.get(DigestIndexer.FIELD_ORIGIN)!=null){ // Index contains origin, use it. ! annotation += ":\"" + duplicate.get(DigestIndexer.FIELD_ORIGIN) + "\""; } else { String tmp = getOrigin(); *************** *** 438,442 **** // TODO: Reconsider this curi.setContentSize(0); ! } else { // A hack to have Heritrix count this as a duplicate. // TODO: Get gojomo to change how Heritrix decides CURIs are duplicates. --- 428,432 ---- // TODO: Reconsider this curi.setContentSize(0); ! } else if (lookupByURL) { // A hack to have Heritrix count this as a duplicate. // TODO: Get gojomo to change how Heritrix decides CURIs are duplicates. *************** *** 462,472 **** history[i] = history[i-1]; } Map oldVisit = new HashMap(); ! oldVisit.put(A_CONTENT_DIGEST, curi.getContentDigestSchemeString()); history[1]=oldVisit; curi.getData().put(A_FETCH_HISTORY,history); ! } // Mark as duplicate for other processors curi.getData().put(A_CONTENT_STATE_KEY, CONTENT_UNCHANGED); --- 452,463 ---- history[i] = history[i-1]; } + // Fake the 'last' entry Map oldVisit = new HashMap(); ! oldVisit.put(A_CONTENT_DIGEST, curi.getContentDigest()); history[1]=oldVisit; curi.getData().put(A_FETCH_HISTORY,history); ! } // TODO: Handle matching on digest // Mark as duplicate for other processors curi.getData().put(A_CONTENT_STATE_KEY, CONTENT_UNCHANGED); |
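With the matching-method and origin-handling settings turned into enums, Spring's standard property binding converts the string in the profile (for example value="NONE" or value="URL") to the matching constant by name, and any value outside the allowed set fails at configuration time instead of being silently ignored. A small self-contained sketch of that behaviour, using a local copy of the OriginHandling constants purely for illustration:

public class EnumSettingSketch {
    // Mirrors the OriginHandling enum added in the diff above; declared locally
    // so this example compiles on its own.
    enum OriginHandling { NONE, PROCESSOR, INDEX }

    public static void main(String[] args) {
        // What the container effectively does with <property name="originHandling" value="NONE"/>.
        OriginHandling fromProfile = OriginHandling.valueOf("NONE");
        System.out.println(fromProfile); // NONE

        // The old free-form string is no longer accepted, so a stale profile fails fast.
        try {
            OriginHandling.valueOf("No origin information");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}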
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:23
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/conf/jobs/profile-deduplicator/.svn/text-base In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840/src/main/conf/jobs/profile-deduplicator/.svn/text-base Added Files: profile-crawler-beans.cxml.svn-base Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: profile-crawler-beans.cxml.svn-base --- <?xml version="1.0" encoding="UTF-8"?> <!-- HERITRIX 3 CRAWL JOB CONFIGURATION FILE This is a relatively minimal configuration suitable for many crawls. Commented-out beans and properties are provided as an example; values shown in comments reflect the actual defaults which are in effect without specification. (To change from the default behavior, uncomment AND alter the shown values.) --> <beans xmlns="http://www.springframework.org/schema/beans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:context="http://www.springframework.org/schema/context" xmlns:aop="http://www.springframework.org/schema/aop" xmlns:tx="http://www.springframework.org/schema/tx" xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd http://www.springframework.org/schema/aop http://www.springframework.org/schema/aop/spring-aop-2.5.xsd http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-2.5.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-2.5.xsd"> <context:annotation-config/> <!-- OVERRIDES Values elsewhere in the configuration may be replaced ('overridden') by a Properties map declared in a PropertiesOverrideConfigurer, using a dotted-bean-path to address individual bean properties. This allows us to collect a few of the most-often changed values in an easy-to-edit format here at the beginning of the model configuration. 
--> <!-- overrides from a text property list --> <bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer"> <property name="properties"> <value> # This Properties map is specified in the Java 'property list' text format # http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29 metadata.operatorContactUrl=ENTER_AN_URL_WITH_YOUR_CONTACT_INFO_HERE_FOR_WEBMASTERS_AFFECTED_BY_YOUR_CRAWL metadata.jobName=basic metadata.description=Basic crawl starting with useful defaults ##..more?..## </value> </property> </bean> <!-- overrides from declared <prop> elements, more easily allowing multiline values or even declared beans --> <bean id="longerOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer"> <property name="properties"> <props> <prop key="seeds.textSource.value"> # URLS HERE http://example.example/example </prop> </props> </property> </bean> <!-- CRAWL METADATA: including identification of crawler/operator --> <bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName"> <property name="operatorContactUrl" value="[see override above]"/> <property name="jobName" value="[see override above]"/> <property name="description" value="[see override above]"/> <!-- <property name="operator" value=""/> --> <!-- <property name="operatorFrom" value=""/> --> <!-- <property name="organization" value=""/> --> <!-- <property name="audience" value=""/> --> <!-- <property name="userAgentTemplate" value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/> --> </bean> <!-- SEEDS: crawl starting points ConfigString allows simple, inline specification of a moderate number of seeds; see below comment for example of using an arbitrarily-large external file. --> <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule"> <property name="textSource"> <bean class="org.archive.spring.ConfigString"> <property name="value"> <value> # [see override above] </value> </property> </bean> </property> <!-- <property name='sourceTagSeeds' value='false'/> --> </bean> <!-- SEEDS ALTERNATE APPROACH: specifying external seeds.txt file Use either the above, or this, but not both. --> <!-- <bean id="seeds" class="org.archive.modules.seeds.TextSeedModule"> <property name="textSource"> <bean class="org.archive.spring.ConfigFile"> <property name="path" value="seeds.txt" /> </bean> </property> <property name='sourceTagSeeds' value='false'/> </bean> --> <!-- SCOPE: rules for which discovered URIs to crawl; order is very important because last decision returned other than 'NONE' wins. --> <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence"> <property name="rules"> <list> <!-- Begin by REJECTing all... --> <bean class="org.archive.modules.deciderules.RejectDecideRule"> </bean> <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... --> <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"> <!-- <property name="seedsAsSurtPrefixes" value="true" /> --> <!-- <property name="alsoCheckVia" value="true" /> --> <!-- <property name="surtsSourceFile" value="" /> --> <!-- <property name="surtsDumpFile" value="surts.dump" /> --> </bean> <!-- ...but REJECT those more than a configured link-hop-count from start... --> <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule"> <!-- <property name="maxHops" value="20" /> --> </bean> <!-- ...but ACCEPT those more than a configured link-hop-count from start... 
--> <bean class="org.archive.modules.deciderules.TransclusionDecideRule"> <!-- <property name="maxTransHops" value="2" /> --> <!-- <property name="maxSpeculativeHops" value="1" /> --> </bean> <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... --> <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule"> <property name="decision" value="REJECT"/> <property name="seedsAsSurtPrefixes" value="false"/> <property name="surtsDumpFile" value="negative-surts.dump" /> <!-- <property name="surtsSourceFile" value="" /> --> </bean> <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... --> <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule"> <property name="decision" value="REJECT"/> <!-- <property name="listLogicalOr" value="true" /> --> <!-- <property name="regexList"> <list> </list> </property> --> </bean> <!-- ...and REJECT those with suspicious repeating path-segments... --> <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule"> <!-- <property name="maxRepetitions" value="2" /> --> </bean> <!-- ...and REJECT those with more than threshold number of path-segments... --> <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule"> <!-- <property name="maxPathDepth" value="20" /> --> </bean> <!-- ...but always ACCEPT those marked as prerequisitee for another URI... --> <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule"> </bean> </list> </property> </bean> <!-- PROCESSING CHAINS Much of the crawler's work is specified by the sequential application of swappable Processor modules. These Processors are collected into three 'chains. The CandidateChain is applied to URIs being considered for inclusion, before a URI is enqueued for collection. The FetchChain is applied to URIs when their turn for collection comes up. The DispositionChain is applied after a URI is fetched and analyzed/link-extracted. --> <!-- CANDIDATE CHAIN --> <!-- processors declared as named beans --> <bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper"> </bean> <bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer"> <!-- <property name="preferenceDepthHops" value="-1" /> --> <!-- <property name="preferenceEmbedHops" value="1" /> --> <!-- <property name="canonicalizationPolicy"> <ref bean="canonicalizationPolicy" /> </property> --> <!-- <property name="queueAssignmentPolicy"> <ref bean="queueAssignmentPolicy" /> </property> --> <!-- <property name="uriPrecedencePolicy"> <ref bean="uriPrecedencePolicy" /> </property> --> <!-- <property name="costAssignmentPolicy"> <ref bean="costAssignmentPolicy" /> </property> --> </bean> <!-- assembled into ordered CandidateChain bean --> <bean id="candidateProcessors" class="org.archive.modules.CandidateChain"> <property name="processors"> <list> <!-- apply scoping rules to each individual candidate URI... --> <ref bean="candidateScoper"/> <!-- ...then prepare those ACCEPTed for enqueuing to frontier. 
--> <ref bean="preparer"/> </list> </property> </bean> <!-- FETCH CHAIN --> <!-- processors declared as named beans --> <bean id="preselector" class="org.archive.crawler.prefetch.Preselector"> <!-- <property name="recheckScope" value="false" /> --> <!-- <property name="blockAll" value="false" /> --> <!-- <property name="blockByRegex" value="" /> --> <!-- <property name="allowByRegex" value="" /> --> </bean> <bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer"> <!-- <property name="ipValidityDurationSeconds" value="21600" /> --> <!-- <property name="robotsValidityDurationSeconds" value="86400" /> --> <!-- <property name="calculateRobotsOnly" value="false" /> --> </bean> <bean id="fetchDns" class="org.archive.modules.fetcher.FetchDNS"> <!-- <property name="acceptNonDnsResolves" value="false" /> --> <!-- <property name="digestContent" value="true" /> --> <!-- <property name="digestAlgorithm" value="sha1" /> --> </bean> <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP"> <!-- <property name="maxLengthBytes" value="0" /> --> <!-- <property name="timeoutSeconds" value="1200" /> --> <!-- <property name="maxFetchKBSec" value="0" /> --> <!-- <property name="defaultEncoding" value="ISO-8859-1" /> --> <!-- <property name="shouldFetchBodyRule"> <bean class="org.archive.modules.deciderules.AcceptDecideRule"/> </property> --> <!-- <property name="soTimeoutMs" value="20000" /> --> <!-- <property name="sendIfModifiedSince" value="true" /> --> <!-- <property name="sendIfNoneMatch" value="true" /> --> <!-- <property name="sendConnectionClose" value="true" /> --> <!-- <property name="sendReferer" value="true" /> --> <!-- <property name="sendRange" value="false" /> --> <!-- <property name="ignoreCookies" value="false" /> --> <!-- <property name="sslTrustLevel" value="OPEN" /> --> <!-- <property name="acceptHeaders"> <list> </list> </property> --> <!-- <property name="httpBindAddress" value="" /> --> <!-- <property name="httpProxyHost" value="" /> --> <!-- <property name="httpProxyPort" value="0" /> --> <!-- <property name="digestContent" value="true" /> --> <!-- <property name="digestAlgorithm" value="sha1" /> --> </bean> <bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP"> </bean> <bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML"> <!-- <property name="extractJavascript" value="true" /> --> <!-- <property name="extractValueAttributes" value="true" /> --> <!-- <property name="ignoreFormActionUrls" value="false" /> --> <!-- <property name="extractOnlyFormGets" value="true" /> --> <!-- <property name="treatFramesAsEmbedLinks" value="true" /> --> <!-- <property name="ignoreUnexpectedHtml" value="true" /> --> <!-- <property name="maxElementLength" value="1024" /> --> <!-- <property name="maxAttributeNameLength" value="1024" /> --> <!-- <property name="maxAttributeValueLength" value="16384" /> --> </bean> <bean id="extractorCss" class="org.archive.modules.extractor.ExtractorCSS"> </bean> <bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS"> </bean> <bean id="extractorSwf" class="org.archive.modules.extractor.ExtractorSWF"> </bean> <!-- assembled into ordered FetchChain bean --> <bean id="fetchProcessors" class="org.archive.modules.FetchChain"> <property name="processors"> <list> <!-- recheck scope, if so enabled... --> <ref bean="preselector"/> <!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... --> <ref bean="preconditions"/> <!-- ...fetch if DNS URI... 
--> <ref bean="fetchDns"/> <!-- ...fetch if HTTP URI... --> <ref bean="fetchHttp"/> <!-- ...extract oulinks from HTTP headers... --> <ref bean="extractorHttp"/> <!-- ...extract oulinks from HTML content... --> <ref bean="extractorHtml"/> <!-- ...extract oulinks from CSS content... --> <ref bean="extractorCss"/> <!-- ...extract oulinks from Javascript content... --> <ref bean="extractorJs"/> <!-- ...extract oulinks from Flash content... --> <ref bean="extractorSwf"/> </list> </property> </bean> <!-- DISPOSITION CHAIN --> <!-- processors declared as named beans --> <bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor"> <!-- <property name="compress" value="true" /> --> <!-- <property name="prefix" value="IAH" /> --> <!-- <property name="suffix" value="${HOSTNAME}" /> --> <!-- <property name="maxFileSizeBytes" value="1000000000" /> --> <!-- <property name="poolMaxActive" value="1" /> --> <!-- <property name="poolMaxWaitMs" value="300000" /> --> <!-- <property name="skipIdenticalDigests" value="false" /> --> <!-- <property name="maxTotalBytesToWrite" value="0" /> --> <!-- <property name="directory" value="." /> --> <!-- <property name="storePaths"> <list> <value>warcs</value> </list> </property> --> <!-- <property name="writeRequests" value="true" /> --> <!-- <property name="writeMetadata" value="true" /> --> <!-- <property name="writeRevisitForIdenticalDigests" value="true" /> --> <!-- <property name="writeRevisitForNotModified" value="true" /> --> </bean> <bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor"> <!-- <property name="seedsRedirectNewSeeds" value="true" /> --> </bean> <bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor"> <!-- <property name="delayFactor" value="5.0" /> --> <!-- <property name="minDelayMs" value="3000" /> --> <!-- <property name="respectCrawlDelayUpToSeconds" value="300" /> --> <!-- <property name="maxDelayMs" value="30000" /> --> <!-- <property name="maxPerHostBandwidthUsageKbSec" value="0" /> --> </bean> <!-- assembled into ordered DispositionChain bean --> <bean id="dispositionProcessors" class="org.archive.modules.DispositionChain"> <property name="processors"> <list> <!-- write to aggregate archival files... --> <ref bean="warcWriter"/> <!-- ...send each outlink candidate URI to CandidatesChain, and enqueue those ACCEPTed to the frontier... 
--> <ref bean="candidates"/> <!-- ...then update stats, shared-structures, frontier decisions --> <ref bean="disposition"/> </list> </property> </bean> <!-- CRAWLCONTROLLER: Control interface, unifying context --> <bean id="crawlController" class="org.archive.crawler.framework.CrawlController"> <!-- <property name="maxToeThreads" value="25" /> --> <!-- <property name="pauseAtStart" value="true" /> --> <!-- <property name="pauseAtFinish" value="false" /> --> <!-- <property name="recorderInBufferBytes" value="524288" /> --> <!-- <property name="recorderOutBufferBytes" value="16384" /> --> <!-- <property name="scratchDir" value="scratch" /> --> </bean> <!-- FRONTIER: Record of all URIs discovered and queued-for-collection --> <bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier"> <!-- <property name="holdQueues" value="true" /> --> <!-- <property name="queueTotalBudget" value="-1" /> --> <!-- <property name="balanceReplenishAmount" value="3000" /> --> <!-- <property name="errorPenaltyAmount" value="100" /> --> <!-- <property name="precedenceFloor" value="255" /> --> <!-- <property name="queuePrecedencePolicy"> <bean class="org.archive.crawler.frontier.precedence.BaseQueuePrecedencePolicy" /> </property> --> <!-- <property name="snoozeLongMs" value="300000" /> --> <!-- <property name="retryDelaySeconds" value="900" /> --> <!-- <property name="maxRetries" value="30" /> --> <!-- <property name="recoveryLogEnabled" value="true" /> --> <!-- <property name="maxOutlinks" value="6000" /> --> <!-- <property name="outbound"> <bean class="java.util.concurrent.ArrayBlockingQueue"> <constructor-arg value="200"/> <constructor-arg value="true"/> </bean> </property> --> <!-- <property name="inbound"> <bean class="java.util.concurrent.ArrayBlockingQueue"> <constructor-arg value="40000"/> <constructor-arg value="true"/> </bean> </property> --> <!-- <property name="dumpPendingAtClose" value="false" /> --> </bean> <!-- URI UNIQ FILTER: Used by frontier to remember already-included URIs --> <bean id="uriUniqFilter" class="org.archive.crawler.util.BdbUriUniqFilter"> </bean> <!-- OPTIONAL BUT RECOMMENDED BEANS --> <!-- ACTIONDIRECTORY: disk directory for mid-crawl operations Running job will watch directory for new files with URIs, scripts, and other data to be processed during a crawl. --> <bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory"> <!-- <property name="actionDir" value="action" /> --> <!-- <property name="initialDelaySeconds" value="10" /> --> <!-- <property name="delaySeconds" value="30" /> --> </bean> <!-- CRAWLLIMITENFORCER: stops crawl when it reaches configured limits --> <bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer"> <!-- <property name="maxBytesDownload" value="0" /> --> <!-- <property name="maxDocumentsDownload" value="0" /> --> <!-- <property name="maxTimeSeconds" value="0" /> --> </bean> <!-- CHECKPOINTSERVICE: checkpointing assistance --> <bean id="checkpointService" class="org.archive.crawler.framework.CheckpointService"> <!-- <property name="checkpointIntervalMinutes" value="-1"/> --> <!-- <property name="checkpointsDir" value="checkpoints"/> --> </bean> <!-- OPTIONAL BEANS Uncomment and expand as needed, or if non-default alternate implementations are preferred. 
--> <!-- CANONICALIZATION POLICY --> <!-- <bean id="canonicalizationPolicy" class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy"> <property name="rules"> <list> <bean class="org.archive.modules.canonicalize.LowercaseRule" /> <bean class="org.archive.modules.canonicalize.StripUserinfoRule" /> <bean class="org.archive.modules.canonicalize.StripWWWNRule" /> <bean class="org.archive.modules.canonicalize.StripSessionIDs" /> <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" /> <bean class="org.archive.modules.canonicalize.FixupQueryString" /> </list> </property> </bean> --> <!-- QUEUE ASSIGNMENT POLICY --> <!-- <bean id="queueAssignmentPolicy" class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy"> <property name="forceQueueAssignment" value="" /> <property name="deferToPrevious" value="true" /> <property name="parallelQueues" value="1" /> </bean> --> <!-- URI PRECEDENCE POLICY --> <!-- <bean id="uriPrecedencePolicy" class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy"> </bean> --> <!-- COST ASSIGNMENT POLICY --> <!-- <bean id="costAssignmentPolicy" class="org.archive.crawler.frontier.UnitCostAssignmentPolicy"> </bean> --> <!-- CREDENTIAL STORE: HTTP authentication or FORM POST credentials --> <!-- <bean id="credentialStore" class="org.archive.modules.credential.CredentialStore"> </bean> --> <!-- REQUIRED STANDARD BEANS It will be very rare to replace or reconfigure the following beans. --> <!-- STATISTICSTRACKER: standard stats/reporting collector --> <bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName"> <!-- <property name="reportsDir" value="reports" /> --> <!-- <property name="liveHostReportSize" value="20" /> --> <!-- <property name="intervalSeconds" value="20" /> --> <!-- <property name="keepSnapshotsCount" value="5" /> --> <!-- <property name="liveHostReportSize" value="20" /> --> </bean> <!-- CRAWLERLOGGERMODULE: shared logging facility --> <bean id="loggerModule" class="org.archive.crawler.reporting.CrawlerLoggerModule"> <!-- <property name="path" value="logs" /> --> <!-- <property name="crawlLogPath" value="crawl.log" /> --> <!-- <property name="alertsLogPath" value="alerts.log" /> --> <!-- <property name="progressLogPath" value="progress-statistics.log" /> --> <!-- <property name="uriErrorsLogPath" value="uri-errors.log" /> --> <!-- <property name="runtimeErrorsLogPath" value="runtime-errors.log" /> --> <!-- <property name="nonfatalErrorsLogPath" value="nonfatal-errors.log" /> --> </bean> <!-- SHEETOVERLAYMANAGER: manager of sheets of contextual overlays Autowired to include any SheetForSurtPrefix or SheetForDecideRuled beans --> <bean id="sheetOverlaysManager" autowire="byType" class="org.archive.crawler.spring.SheetOverlaysManager"> </bean> <!-- BDBMODULE: shared BDB-JE disk persistence manager --> <bean id="bdb" class="org.archive.bdb.BdbModule"> <!-- <property name="dir" value="state" /> --> <!-- <property name="cachePercent" value="60" /> --> <!-- <property name="useSharedCache" value="true" /> --> <!-- <property name="expectedConcurrency" value="25" /> --> </bean> <!-- BDBCOOKIESTORAGE: disk-based cookie storage for FetchHTTP --> <bean id="cookieStorage" class="org.archive.modules.fetcher.BdbCookieStorage"> <!-- <property name="cookiesLoadFile"><null/></property> --> <!-- <property name="cookiesSaveFile"><null/></property> --> <!-- <property name="bdb"> <ref bean="bdb"/> </property> --> </bean> <!-- SERVERCACHE: shared cache of server/host info --> <bean 
id="serverCache" class="org.archive.modules.net.BdbServerCache"> <!-- <property name="bdb"> <ref bean="bdb"/> </property> --> </bean> <!-- CONFIG PATH CONFIGURER: required helper making crawl paths relative to crawler-beans.cxml file, and tracking crawl files for web UI --> <bean id="configPathConfigurer" class="org.archive.spring.ConfigPathConfigurer"> </bean> </beans> |
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:20
|
Update of /cvsroot/deduplicator/deduplicator3/src/main/java/is/landsbokasafn/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840/src/main/java/is/landsbokasafn/deduplicator Added Files: DeDuplicator.java CrawlLogIterator.java DigestIndexer.java CrawlDataIterator.java DeDupFetchHTTP.java CommandLineParser.java overview.html DedupAttributeConstants.java CrawlDataItem.java Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: DedupAttributeConstants.java --- package is.landsbokasafn.deduplicator; /** * Lifted from H1 AdaptiveRevisitAttributeConstants and limited to what DeDuplicator was using. * * */ public interface DedupAttributeConstants { /** No knowledge of URI content. Possibly not fetched yet, unable * to check if different or an error occurred on last fetch attempt. */ public static final int CONTENT_UNKNOWN = -1; /** URI content has not changed between the two latest, successfully * completed fetches. */ public static final int CONTENT_UNCHANGED = 0; /** URI content had changed between the two latest, successfully completed * fetches. By definition, content has changed if there has only been one * successful fetch made. */ public static final int CONTENT_CHANGED = 1; /** * Key to use getting state of crawluri from the CrawlURI data. */ public static final String A_CONTENT_STATE_KEY = "revisit-state"; } --- NEW FILE: DeDupFetchHTTP.java --- /* DeDupFetchHTTP * * Created on 10.04.2006 * * Copyright (C) 2006 National and University Library of Iceland * * This file is part of the DeDuplicator (Heritrix add-on module). * * DeDuplicator is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * DeDuplicator is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with DeDuplicator; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package is.landsbokasafn.deduplicator; import java.io.IOException; import java.text.SimpleDateFormat; import java.util.logging.Level; import java.util.logging.Logger; import org.archive.modules.fetcher.FetchHTTP; import dk.netarkivet.common.utils.SparseRangeFilter; /** * An extentsion of Heritrix's {@link org.archive.crawler.fetcher.FetchHTTP} * processor for downloading HTTP documents. This extension adds a check after * the content header has been downloaded that compares the 'last-modified' and * or 'last-etag' values from the header against information stored in an * appropriate index. 
* * @author Kristinn Sigurðsson * @see is.hi.bok.deduplicator.DigestIndexer * @see org.archive.crawler.fetcher.FetchHTTP */ public class DeDupFetchHTTP extends FetchHTTP { // // private static final long serialVersionUID = // ArchiveUtils.classnameBasedUID(DeDupFetchHTTP.class,1); // // private static Logger logger = Logger.getLogger(FetchHTTP.class.getName()); // // protected IndexSearcher index; // protected String mimefilter = DEFAULT_MIME_FILTER; // protected boolean blacklist = true; // // SimpleDateFormat sdfLastModified; // SimpleDateFormat sdfIndexDate; // // protected long processedURLs = 0; // protected long unchangedURLs = 0; // // protected boolean useSparseRangeFilter = DEFAULT_USE_SPARSE_RANGE_FILTER; // // // Settings. // public static final String ATTR_DECISION_SCHEME = "decision-scheme"; // public static final String SCHEME_TIMESTAMP = "Timestamp only"; // public static final String SCHEME_ETAG = "Etag only"; // public static final String SCHEME_TIMESTAMP_AND_ETAG = "Timestamp AND Etag"; // public static final String SCHEME_TIMESTAMP_OR_ETAG = "Timestamp OR Etag"; // public static final String[] AVAILABLE_DECISION_SCHEMES = { // SCHEME_TIMESTAMP, // SCHEME_ETAG, // SCHEME_TIMESTAMP_AND_ETAG, // SCHEME_TIMESTAMP_OR_ETAG // }; // public static final String DEFAULT_DECISION_SCHEME = // SCHEME_TIMESTAMP; // // public static final String ATTR_INDEX_LOCATION = "index-location"; // public static final String DEFAULT_INDEX_LOCATION = ""; // // /** The filter on mime types. This is either a blacklist or whitelist // * depending on ATTR_FILTER_MODE. // */ // public final static String ATTR_MIME_FILTER = "mime-filter"; // public final static String DEFAULT_MIME_FILTER = "^text/.*"; // // /** Is the mime filter a blacklist (do not apply processor to what matches) // * or whitelist (apply processor only to what matches). // */ // public final static String ATTR_FILTER_MODE = "filter-mode"; // public final static String[] AVAILABLE_FILTER_MODES = { // "Blacklist", // "Whitelist" // }; // public final static String DEFAULT_FILTER_MODE = AVAILABLE_FILTER_MODES[0]; // // /** Should we use sparse queries (uses less memory at a cost to performance? **/ // public final static String ATTR_USE_SPARSE_RANGE_FILTER = "use-sparse-range-filter"; // public final static Boolean DEFAULT_USE_SPARSE_RANGE_FILTER = Boolean.FALSE; // // public DeDupFetchHTTP(String name){ // super(name); // setDescription("Fetch HTTP processor that aborts downloading of " + // "unchanged documents. 
This processor extends the standard " + // "FetchHTTP processor, adding a check after the header is " + // "downloaded where the header information for 'last-modified' " + // "and 'etag' is compared against values stored in a Lucene " + // "index built using the DigestIndexer.\n Note that the index " + // "must have been built indexed by URL and the Timestamp " + // "and/or Etag info must have been included in the index!"); // Type t; // t = new SimpleType( // ATTR_DECISION_SCHEME, // "The different schmes for deciding when to re-download a " + // "page given an old version of the same page (or rather " + // "meta-data on it)\n " + // "Timestamp only: Download when a datestamp is missing " + // "in either the downloaded header or index or if the header " + // "datestamp is newer then the one in the index.\n " + // "Etag only: Download when the Etag is missing in either the" + // "header download or the index or the header Etag and the one " + // "in the index differ.\n " + // "Timestamp AND Etag: When both datestamp and Etag are " + // "available in both the header download and the index, " + // "download if EITHER of them indicates change." + // "Timestamp OR Etag: When both datestamp and Etag are " + // "available in both the header download and the index, " + // "download only if BOTH of them indicate change.", // DEFAULT_DECISION_SCHEME,AVAILABLE_DECISION_SCHEMES); // addElementToDefinition(t); // t = new SimpleType( // ATTR_INDEX_LOCATION, // "Location of index (full path). Can not be changed at run " + // "time.", // DEFAULT_INDEX_LOCATION); // t.setOverrideable(false); // addElementToDefinition(t); // t = new SimpleType( // ATTR_MIME_FILTER, // "A regular expression that the mimetype of all documents " + // "will be compared against. Only those that pass will be " + // "considered. Others are given a pass. " + // "\nIf the attribute filter-mode is " + // "set to 'Blacklist' then all the documents whose mimetype " + // "matches will be ignored by this processor. If the filter-" + // "mode is set to 'Whitelist' only those documents whose " + // "mimetype matches will be processed.", // DEFAULT_MIME_FILTER); // t.setOverrideable(false); // t.setExpertSetting(true); // addElementToDefinition(t); // t = new SimpleType( // ATTR_FILTER_MODE, // "Determines if the mime-filter acts as a blacklist (declares " + // "what should be ignored) or whitelist (declares what should " + // "be processed).", // DEFAULT_FILTER_MODE,AVAILABLE_FILTER_MODES); // t.setOverrideable(false); // t.setExpertSetting(true); // addElementToDefinition(t); // // t = new SimpleType( // ATTR_USE_SPARSE_RANGE_FILTER, // "If set to true, then Lucene queries use a custom 'sparse' " + // "range filter. This uses less memory at the cost of some " + // "lost performance. Suitable for very large indexes.", // DEFAULT_USE_SPARSE_RANGE_FILTER); // t.setOverrideable(false); // t.setExpertSetting(true); // addElementToDefinition(t); // } // // protected boolean checkMidfetchAbort( // CrawlURI curi, HttpRecorderMethod method, HttpConnection conn) { // // We'll check for prerequisites here since there is no way to know // // if the super method returns false because of a prereq or because // // all filters accepeted. // if(curi.isPrerequisite()){ // return false; // } // // // Run super to allow filters to also abort. Also this method has // // been pressed into service as a general 'stuff to do at this point' // boolean ret = super.checkMidfetchAbort(curi, method, conn); // // // Ok, now check for duplicates. 
// if(isDuplicate(curi)){ // ret = true; // unchangedURLs++; // curi.putInt(A_CONTENT_STATE_KEY, CONTENT_UNCHANGED); // curi.addAnnotation("header-duplicate"); // // } // // return ret; // } // // /** // * Compare the header infomation for 'last-modified' and/or 'etag' against // * data in the index. // * @param curi The Crawl URI being processed. // * @return True if header infomation indicates that the document has not // * changed since the crawl that the index is based on was performed. // */ // protected boolean isDuplicate(CrawlURI curi) { // boolean ret = false; // if(curi.getContentType() != null && // curi.getContentType().matches(mimefilter) != blacklist){ // processedURLs++; // // Ok, passes mime-filter // HttpMethod method = (HttpMethod)curi.getObject(A_HTTP_TRANSACTION); // // Check the decision scheme. // String scheme = (String)getUncheckedAttribute( // curi,ATTR_DECISION_SCHEME); // // Document doc = lookup(curi); // // if(doc != null){ // // Found a hit. Do the necessary evalution. // if(scheme.equals(SCHEME_TIMESTAMP)){ // ret = datestampIndicatesNonChange(method,doc); // } else if(scheme.equals(SCHEME_ETAG)){ // ret = etagIndicatesNonChange(method,doc); // } else { // // if(scheme.equals(SCHEME_TIMESTAMP_AND_ETAG)){ // ret = datestampIndicatesNonChange(method,doc) // && etagIndicatesNonChange(method,doc); // } else if(scheme.equals(SCHEME_TIMESTAMP_OR_ETAG)){ // ret = datestampIndicatesNonChange(method,doc) // || etagIndicatesNonChange(method,doc); // } else { // logger.log(Level.SEVERE, "Unknown decision sceme: " + scheme); // } // } // } // } // return ret; // } // // /** // * Checks if the 'last-modified' in the HTTP header and compares it against // * the timestamp in the supplied Lucene document. If both dates are found // * and the header's date is older then the datestamp indicates non-change. // * Otherwise a change must be assumed. // * @param method HTTPMethod that allows access to the relevant HTTP header // * @param doc The Lucene document to compare against // * @return True if a the header and document data indicates a non-change. // * False otherwise. // */ // protected boolean datestampIndicatesNonChange( // HttpMethod method, Document doc) { // String headerDate = null; // if (method.getResponseHeader("last-modified") != null) { // headerDate = method.getResponseHeader("last-modified").getValue(); // } // String indexDate = doc.get(DigestIndexer.FIELD_TIMESTAMP); // // if(headerDate != null && indexDate != null){ // try { // // If both dates exist and last-modified is before the index // // date then we assume no change has occured. // return (sdfLastModified.parse(headerDate)).before( // sdfIndexDate.parse(indexDate)); // } catch (Exception e) { // // Any exceptions parsing the date should be interpreted as // // missing date information. // // ParseException and NumberFormatException are the most // // likely exceptions to occur. // return false; // } // } // return false; // } // // /** // * Checks if the 'etag' in the HTTP header and compares it against // * the etag in the supplied Lucene document. If both dates are found // * and match then the datestamp indicate non-change. // * Otherwise a change must be assumed. // * @param method HTTPMethod that allows access to the relevant HTTP header // * @param doc The Lucene document to compare against // * @return True if a the header and document data indicates a non-change. // * False otherwise. 
// */ // protected boolean etagIndicatesNonChange( // HttpMethod method, Document doc) { // String headerEtag = null; // if (method.getResponseHeader("last-etag") != null) { // headerEtag = method.getResponseHeader("last-etag").getValue(); // } // String indexEtag = doc.get(DigestIndexer.FIELD_ETAG); // // if(headerEtag != null && indexEtag != null){ // // If both etags exist and are identical then we assume no // // change has occured. // return headerEtag.equals(indexEtag); // } // return false; // } // // /** // * Searches the index for the URL of the given CrawlURI. If multiple hits // * are found the most recent one is returned if the index included the // * timestamp, otherwise a random one is returned. // * If no hit is found null is returned. // * @param curi The CrawlURI to search for // * @return the index Document matching the URI or null if none was found // */ // protected Document lookup(CrawlURI curi) { // try{ // Query query = null; // if(useSparseRangeFilter){ // query = new ConstantScoreQuery(new SparseRangeFilter( // DigestIndexer.FIELD_URL,curi.toString(),curi.toString(), // true,true)); // } else { // query = new ConstantScoreQuery(new RangeFilter( // DigestIndexer.FIELD_URL,curi.toString(),curi.toString(), // true,true)); // } // // Hits hits = index.search(query); // Document doc = null; // if(hits != null && hits.length() > 0){ // // If there are multiple hits, use the one with the most // // recent date. // Document docToEval = null; // for(int i=0 ; i<hits.length() ; i++){ // doc = hits.doc(i); // // The format of the timestamp ("yyyyMMddHHmmssSSS") allows // // us to do a greater then (later) or lesser than (earlier) // // comparison of the strings. // String timestamp = doc.get(DigestIndexer.FIELD_TIMESTAMP); // if(docToEval == null || timestamp == null // || docToEval.get(DigestIndexer.FIELD_TIMESTAMP) // .compareTo(timestamp)>0){ // // Found a more recent hit or timestamp is null // // NOTE: Either all hits should have a timestamp or // // none. This implementation will cause the last // // URI in the hit list to be returned if there is no // // timestamp. 
// docToEval = doc; // } // } // return docToEval; // } // } catch(IOException e){ // logger.log(Level.SEVERE,"Error accessing index.",e); // } // return null; // } // // public void finalTasks() { // super.finalTasks(); // try { // index.close(); // } catch (IOException e) { // logger.log(Level.SEVERE,"Error closing index",e); // } // } // // public void initialTasks() { // super.initialTasks(); // // Index location // try { // String indexLocation = (String)getAttribute(ATTR_INDEX_LOCATION); // index = new IndexSearcher(indexLocation); // } catch (Exception e) { // logger.log(Level.SEVERE,"Unable to find/open index.",e); // } // // // Mime filter // try { // mimefilter = (String)getAttribute(ATTR_MIME_FILTER); // } catch (Exception e) { // logger.log(Level.SEVERE,"Unable to get attribute " + // ATTR_MIME_FILTER,e); // } // // // Filter mode (blacklist (default) or whitelist) // try { // blacklist = ((String)getAttribute(ATTR_FILTER_MODE)).equals( // DEFAULT_FILTER_MODE); // } catch (Exception e) { // logger.log(Level.SEVERE,"Unable to get attribute " + // ATTR_FILTER_MODE,e); // } // // // Date format of last-modified is EEE, dd MMM yyyy HH:mm:ss z // sdfLastModified = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z"); // // Date format of indexDate is yyyyMMddHHmmssSSS // sdfIndexDate = new SimpleDateFormat("yyyyMMddHHmmssSSS"); // // // Range Filter type // try { // useSparseRangeFilter = ((Boolean)getAttribute( // ATTR_USE_SPARSE_RANGE_FILTER)).booleanValue(); // } catch (Exception e) { // logger.log(Level.SEVERE,"Unable to get attribute " + // ATTR_USE_SPARSE_RANGE_FILTER,e); // useSparseRangeFilter = DEFAULT_USE_SPARSE_RANGE_FILTER; // } // // } // // public String report() { // StringBuffer ret = new StringBuffer(); // ret.append("Processor: is.hi.bok.deduplicator.DeDupFetchHTTP\n"); // ret.append(" URLs compared against index: " + processedURLs + "\n"); // ret.append(" URLs judged unchanged: " + unchangedURLs + "\n"); // ret.append(" processor extends (parent report)\n"); // ret.append(super.report()); // return ret.toString(); // } } --- NEW FILE: CrawlDataItem.java --- /* CrawlDataItem * * Created on 10.04.2006 * * Copyright (C) 2006 National and University Library of Iceland * * This file is part of the DeDuplicator (Heritrix add-on module). * * DeDuplicator is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * DeDuplicator is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with DeDuplicator; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package is.landsbokasafn.deduplicator; /** * A base class for individual items of crawl data that should be added to the * index. * * @author Kristinn Sigurðsson */ public class CrawlDataItem { /** * The proper formating of {@link #setURL(String)} and {@link #getURL()} */ public static final String dateFormat = "yyyyMMddHHmmssSSS"; protected String URL; protected String contentDigest; protected String timestamp; protected String etag; protected String mimetype; protected String origin; protected boolean duplicate; /** * Constructor. 
Creates a new CrawlDataItem with all its data initialized * to null. */ public CrawlDataItem(){ URL = null; contentDigest = null; timestamp = null; etag = null; mimetype = null; origin = null; duplicate = false; } /** * Constructor. Creates a new CrawlDataItem with all its data initialized * via the constructor. * * @param URL The URL for this CrawlDataItem * @param contentDigest A content digest of the document found at the URL * @param timestamp Date of when the content digest was valid for that URL. * Format: yyyyMMddHHmmssSSS * @param etag Etag for the URL * @param mimetype MIME type of the document found at the URL * @param origin The origin of the CrawlDataItem (the exact meaning of the * origin is outside the scope of this class and it may be * any String value) * @param duplicate True if this CrawlDataItem was marked as duplicate */ public CrawlDataItem(String URL, String contentDigest, String timestamp, String etag, String mimetype, String origin, boolean duplicate){ this.URL = URL; this.contentDigest = contentDigest; this.timestamp = timestamp; this.etag = etag; this.mimetype = mimetype; this.origin = origin; this.duplicate = duplicate; } /** * Returns the URL * @return the URL */ public String getURL() { return URL; } /** * Set the URL * @param URL the new URL */ public void setURL(String URL){ this.URL = URL; } /** * Returns the documents content digest * @return the documents content digest */ public String getContentDigest(){ return contentDigest; } /** * Set the content digest * @param contentDigest The new value of the content digest */ public void setContentDigest(String contentDigest){ this.contentDigest = contentDigest; } /** * Returns a timestamp for when the URL was fetched in the format: * yyyyMMddHHmmssSSS * @return the time of the URLs fetching */ public String getTimestamp(){ return timestamp; } /** * Set a new timestamp. * @param timestamp The new timestamp. It should be in the format: * yyyyMMddHHmmssSSS */ public void setTimestamp(String timestamp){ this.timestamp = timestamp; } /** * Returns the etag that was associated with the document. * <p> * If etag is unavailable null will be returned. * @return the etag. */ public String getEtag(){ return etag; } /** * Set a new Etag * @param etag The new etag */ public void setEtag(String etag){ this.etag = etag; } /** * Returns the mimetype that was associated with the document. * @return the mimetype. */ public String getMimeType(){ return mimetype; } /** * Set new MIME type. * @param mimetype The new MIME type */ public void setMimeType(String mimetype){ this.mimetype = mimetype; } /** * Returns the "origin" that was associated with the document. * @return the origin (may be null if none was provided for the document) */ public String getOrigin() { return origin; } /** * Set new origin * @param origin A new origin. */ public void setOrigin(String origin){ this.origin = origin; } /** * Returns whether the CrawlDataItem was marked as duplicate. * @return true if duplicate, false otherwise */ public boolean isDuplicate() { return duplicate; } /** * Set whether duplicate or not. * @param duplicate true if duplicate, false otherwise */ public void setDuplicate(boolean duplicate) { this.duplicate = duplicate; } } --- NEW FILE: CrawlDataIterator.java --- /* CrawlDataIterator * * Created on 10.04.2006 * * Copyright (C) 2006 National and University Library of Iceland * * This file is part of the DeDuplicator (Heritrix add-on module). 
* * DeDuplicator is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * DeDuplicator is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with DeDuplicator; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package is.landsbokasafn.deduplicator; import java.io.IOException; /** * An abstract base class for implementations of iterators that iterate over * different sets of crawl data (i.e. crawl.log, ARC, WARC etc.) * * @author Kristinn Sigurðsson */ public abstract class CrawlDataIterator { String source; /** * Constructor. * * @param source The location of the crawl data. The meaning of this * value may vary based on the implementation of concrete * subclasses. Typically it will refer to a directory or a * file. */ public CrawlDataIterator(String source){ this.source = source; } /** * Are there more elements? * @return true if there are more elements, false otherwise * @throws IOException If an error occurs accessing the crawl data. */ public abstract boolean hasNext() throws IOException; /** * Get the next {@link CrawlDataItem}. * @return the next CrawlDataItem. If there are no further elements then * null will be returned. * @throws IOException If an error occurs accessing the crawl data. */ public abstract CrawlDataItem next() throws IOException; /** * Close any resources held open to read the crawl data. * @throws IOException If an error occurs closing access to crawl data. */ public abstract void close() throws IOException; /** * A short, human readable, string about what source this iterator uses. * I.e. "Iterator for Heritrix style crawl.log" etc. * @return A short, human readable, string about what source this iterator * uses. */ public abstract String getSourceType(); } --- NEW FILE: overview.html --- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html> <head> <title>DeDuplicator</title> <meta name="author" content="Kristinn Sigurdsson" > <meta http-equiv="content-type" content="text/html; charset=UTF-8" > <meta http-equiv="Content-Script-Type" content="text/javascript" > <meta http-equiv="Content-Style-Type" content="text/css" > </head> <body> <h1>DeDuplicator Overview.</h1> <h2>Getting started</h2> <h3>Building an index</h3> <ol> <li>A functional installation of Heritrix is required for this software to work. While Heritrix can be deployed on non-Linux operating systems that requires some degree of work as the bundled scripts are written for Linux. The same applies to this software and the following instructions assume that Heritrix is installed on a Linux machine under $HERITRIX_HOME.</li> <li>Install the DeDuplicator software. The jar files should be included in $HERITRIX_HOME/lib/ while the dedupdigest script should be added to $HERITRIX_HOME/bin/. If you've downloaded a .tar.gz bundle, explode it into $HERITRIX_HOME and all the files will be correctly deployed. 
<em>NOTE:</em> Heritrix can not be running at the same time as the DeDuplicator software is run.</li> <li>Make the dedupdigest script executable with <code>chmod u+x $HERITRIX_HOME/bin/dedupdigest</code></li> <li>Run <code>$HERITRIX_HOME/bin/dedupdigest --help</code> This will display the usage information for the indexing.<br> The program takes two arguments, the source data (crawl.log usually) and the target directory where the index will be written (will be created if not present). Several options are provided to custom tailor the type of index.</li> <li>Create an index. A typical index can be built with<br> <code>$HERITRIX_HOME/bin/dedupdigest -o URL -s -t <location of crawl.log> <index output directory></code><br> This will create an index that is indexed by URL only (not by the content digest) and includes equivalent URLs and timestamps.</li> </ol> <h3>Using the index</h3> <ol> <li>Having built an appropriate index, launch Heritrix. Make sure that the installation of Heritrix that you launched has the two JARs that come with the DeDuplicator (deduplicator-[version].jar and lucene-[version].jar) if it is not the same one used for creating the index.</li> <li>Configure a crawl job as normal except add the DeDuplicator processor to the processing chain at some point <em>after</em> the HTTPFetcher processor and prior to any processor which should be skipped if a duplicate is detected. When the DeDuplicator finds a duplicate the processing moves straight to the PostProcessing chain. So if you insert it at the top of the Extractor chain you can skip both link extraction and writing to disk. If you do not wish to skip link extraction you can insert the processor at the end of the link extraction chain etc.</li> <li>The DeDuplicator processor has several configurable parameters. <ol> <li><em>enabled</em> Standard Heritrix property for processors. Should be true. Setting it to false will disable the processor.</li> <li><em>index-location</em> The most important setting. A full path to the directory that contains the index (output directory of the indexing.)</li> <li><em>matching-method</em> Whether to lookup URLs or content digests first when looking for matches. This setting depends on how the index was built (indexing mode). If it was set to BOTH then either setting will work. Otherwise it must be set according to the indexing mode.</li> <li><em>try-equivalent</em> Should equivalent URLs be tried if an exact URL and content digest match is not found. Using equivalent matches means that duplicate documents whose URLs differ only in the parameter list or because of www[0-9]* prefixes are detected.</li> <li><em>mime-filter</em> Which documents to process</li> <li><em>filter-mode</em></li> <li><em>analysis-mode</em> Enables analysis of the usefulness and accuracy of header information in predicting change and non-change in documents. For statistical gathering purposes only.</li> <li><em>log-level</em> Enables more logging.</li> <li><em>stats-per-host</em> Maintains statistics per host in addition to the crawl wide stats.</li> </ol> </li> <li>Once the processor has been configured the crawl can be started and run normally. 
Information about the processor is available via the Processor report in the Heritrix GUI (this is saved to processors-report.txt at the end of a crawl).<br> Duplicate URLs will still show up in the crawl log but with a note 'duplicate' in the annotation field at the end of the log line.</li> </ol> </body> </html> --- NEW FILE: DeDuplicator.java --- /* DeDuplicator * * Created on 10.04.2006 * * Copyright (C) 2006-2010 National and University Library of Iceland * * This file is part of the DeDuplicator (Heritrix add-on module). * * DeDuplicator is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser Public License as published by * the Free Software Foundation; either version 2.1 of the License, or * any later version. * * DeDuplicator is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Lesser Public License for more details. * * You should have received a copy of the GNU Lesser Public License * along with DeDuplicator; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ package is.landsbokasafn.deduplicator; import static is.landsbokasafn.deduplicator.DedupAttributeConstants.A_CONTENT_STATE_KEY; import static is.landsbokasafn.deduplicator.DedupAttributeConstants.CONTENT_UNCHANGED; import static org.archive.modules.recrawl.RecrawlAttributeConstants.A_CONTENT_DIGEST; import static org.archive.modules.recrawl.RecrawlAttributeConstants.A_FETCH_HISTORY; import java.io.File; import java.io.IOException; import java.text.ParseException; import java.text.SimpleDateFormat; import java.util.ArrayList; import java.util.Arrays; import java.util.Date; import java.util.HashMap; import java.util.Iterator; import java.util.List; import java.util.Locale; import java.util.Map; import java.util.logging.Level; import java.util.logging.Logger; import org.apache.commons.httpclient.HttpMethod; import org.apache.lucene.document.Document; import org.apache.lucene.search.ConstantScoreQuery; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.Query; import org.apache.lucene.search.ScoreDoc; import org.apache.lucene.search.TermRangeFilter; import org.apache.lucene.search.TopScoreDocCollector; import org.apache.lucene.store.FSDirectory; import org.archive.modules.CrawlURI; import org.archive.modules.ProcessResult; import org.archive.modules.Processor; import org.archive.modules.net.ServerCache; import org.archive.util.ArchiveUtils; import org.archive.util.Base32; import org.springframework.beans.factory.InitializingBean; import org.springframework.beans.factory.annotation.Autowired; import dk.netarkivet.common.utils.SparseRangeFilter; /** * Heritrix compatible processor. * <p> * Will determine if CrawlURIs are <i>duplicates</i>. * <p> * Duplicate detection can only be performed <i>after</i> the fetch processors * have run. 
* * @author Kristinn Sigurðsson */ @SuppressWarnings({"serial", "unchecked"}) public class DeDuplicator extends Processor implements InitializingBean { private static Logger logger = Logger.getLogger(DeDuplicator.class.getName()); private static final int MAX_HITS = 1000; // Spring configurable parameters /* Location of Lucene Index to use for lookups */ private final static String ATTR_INDEX_LOCATION = "index-location"; public String getIndexLocation() { return (String) kp.get(ATTR_INDEX_LOCATION); } public void setIndexLocation(String indexLocation) { kp.put(ATTR_INDEX_LOCATION,indexLocation); } /* The matching method in use (by url or content digest) */ private final static String ATTR_MATCHING_METHOD = "matching-method"; private final static List<String> AVAILABLE_MATCHING_METHODS = new ArrayList<String>(Arrays.asList(new String[]{ "URL", "Content digest" })); private final static String DEFAULT_MATCHING_METHOD = AVAILABLE_MATCHING_METHODS.get(0); { setMatchingMethod(DEFAULT_MATCHING_METHOD); } public String getMatchingMethod() { return (String) kp.get(ATTR_MATCHING_METHOD); } public void setMatchingMethod(String matchinMethod) { if (AVAILABLE_MATCHING_METHODS.contains(matchinMethod)) { kp.put(ATTR_MATCHING_METHOD,matchinMethod); } else { throw new IllegalArgumentException("Invalid matching method: " + matchinMethod); } } /* On duplicate, should jump to which part of processing chain? * If not set, nothing is skipped. Otherwise this should be the identity of the processor to jump to. */ public final static String ATTR_JUMP_TO = "jump-to"; public String getJumpTo(){ return (String)kp.get(ATTR_JUMP_TO); } public void setJumpTo(String jumpTo){ kp.put(ATTR_JUMP_TO, jumpTo); } /* Origin of duplicate URLs. May be overridden by info from index*/ public final static String ATTR_ORIGIN = "origin"; { setOrigin(""); } public String getOrigin() { return (String) kp.get(ATTR_ORIGIN); } public void setOrigin(String origin) { kp.put(ATTR_ORIGIN,origin); } /* If an exact match is not made, should the processor try * to find an equivalent match? */ public final static String ATTR_EQUIVALENT = "try-equivalent"; { setTryEquivalent(false); } public boolean getTryEquivalent(){ return (Boolean)kp.get(ATTR_EQUIVALENT); } public void setTryEquivalent(boolean tryEquivalent){ kp.put(ATTR_EQUIVALENT,tryEquivalent); } /* The filter on mime types. This is either a blacklist or whitelist * depending on ATTR_FILTER_MODE. */ public final static String ATTR_MIME_FILTER = "mime-filter"; public final static String DEFAULT_MIME_FILTER = "^text/.*"; { setMimeFilter(DEFAULT_MIME_FILTER); } public String getMimeFilter(){ return (String)kp.get(ATTR_MIME_FILTER); } public void setMimeFilter(String mimeFilter){ kp.put(ATTR_MIME_FILTER, mimeFilter); } /* Is the mime filter a blacklist (do not apply processor to what matches) * or whitelist (apply processor only to what matches). */ public final static String ATTR_FILTER_MODE = "filter-mode"; { setBlacklist(true); } public boolean getBlacklist(){ return (Boolean)kp.get(ATTR_FILTER_MODE); } public void setBlacklist(boolean blacklist){ kp.put(ATTR_FILTER_MODE, blacklist); } /* Analysis mode. */ public final static String ATTR_ANALYZE_TIMESTAMP = "analyze-timestamp"; { setAnalyzeTimestamp(false); } public boolean getAnalyzeTimestamp() { return (Boolean) kp.get(ATTR_ANALYZE_TIMESTAMP); } public void setAnalyzeTimestamp(boolean analyzeTimestamp) { kp.put(ATTR_ANALYZE_TIMESTAMP,analyzeTimestamp); } /* Should the content size information be set to zero when a duplicate is found? 
*/ public final static String ATTR_CHANGE_CONTENT_SIZE = "change-content-size"; { setChangeContentSize(false); } public boolean getChangeContentSize(){ return (Boolean)kp.get(ATTR_CHANGE_CONTENT_SIZE); } public void setChangeContentSize(boolean changeContentSize){ kp.put(ATTR_CHANGE_CONTENT_SIZE,changeContentSize); } /* Should statistics be tracked per host? **/ public final static String ATTR_STATS_PER_HOST = "stats-per-host"; { setStatsPerHost(false); } public boolean getStatsPerHost(){ return (Boolean)kp.get(ATTR_STATS_PER_HOST); } public void setStatsPerHost(boolean statsPerHost){ kp.put(ATTR_STATS_PER_HOST,statsPerHost); } /* Should we use sparse queries (uses less memory at a cost to performance? */ public final static String ATTR_USE_SPARSE_RANGE_FILTER = "use-sparse-range-filter"; { setUseSparseRengeFilter(false); } public boolean getUseSparseRengeFilter(){ return (Boolean)kp.get(ATTR_USE_SPARSE_RANGE_FILTER); } public void setUseSparseRengeFilter(boolean useSparseRengeFilter){ kp.put(ATTR_USE_SPARSE_RANGE_FILTER, useSparseRengeFilter); } /* How should 'origin' be handled */ public final static String ATTR_ORIGIN_HANDLING = "origin-handling"; public final static String ORIGIN_HANDLING_NONE = "No origin information"; public final static String ORIGIN_HANDLING_PROCESSOR = "Use processor setting"; public final static String ORIGIN_HANDLING_INDEX = "Use index information"; public final static List<String> AVAILABLE_ORIGIN_HANDLING = new ArrayList<String>(Arrays.asList(new String[]{ ORIGIN_HANDLING_NONE, ORIGIN_HANDLING_PROCESSOR, ORIGIN_HANDLING_INDEX })); public final static String DEFAULT_ORIGIN_HANDLING = ORIGIN_HANDLING_NONE; { setOriginHandling(DEFAULT_ORIGIN_HANDLING); } public String getOriginHandling() { return (String) kp.get(ATTR_ORIGIN); } public void setOriginHandling(String originHandling) { if (AVAILABLE_ORIGIN_HANDLING.contains(originHandling)) { kp.put(ATTR_ORIGIN_HANDLING,originHandling); } else { throw new IllegalArgumentException("Invalid origin handling: " + originHandling); } } // Spring configured access ot Heritrix resources // Gain access to the ServerCache for host based statistics. protected ServerCache serverCache; public ServerCache getServerCache() { return this.serverCache; } @Autowired public void setServerCache(ServerCache serverCache) { this.serverCache = serverCache; } // Member variables. protected IndexSearcher searcher = null; protected boolean lookupByURL = true; protected boolean statsPerHost = false; protected boolean useOrigin = false; protected boolean useOriginFromIndex = false; protected Statistics stats = null; protected HashMap<String, Statistics> perHostStats = null; public void afterPropertiesSet() throws Exception { // Index location String indexLocation = getIndexLocation(); try { searcher = new IndexSearcher(FSDirectory.open(new File(indexLocation))); } catch (Exception e) { throw new IllegalArgumentException("Unable to find/open index at " + indexLocation,e); } // Matching method String matchingMethod = getMatchingMethod(); lookupByURL = matchingMethod.equals(DEFAULT_MATCHING_METHOD); // Track per host stats statsPerHost = getStatsPerHost(); // Origin handling. 
String originHandling = getOriginHandling(); if(originHandling.equals(ORIGIN_HANDLING_NONE)==false){ useOrigin = true; if(originHandling.equals(ORIGIN_HANDLING_INDEX)){ useOriginFromIndex = true; } } // Initialize some internal variables: stats = new Statistics(); if (statsPerHost) { perHostStats = new HashMap<String, Statistics>(); } } @Override protected boolean shouldProcess(CrawlURI curi) { if (curi.isSuccess() == false) { // Early return. No point in doing comparison on failed downloads. logger.finest("Not handling " + curi.toString() + ", did not succeed."); return false; } if (curi.isPrerequisite()) { // Early return. Prerequisites are exempt from checking. logger.finest("Not handling " + curi.toString() + ", prerequisite."); return false; } if (curi.toString().startsWith("http")==false) { // Early return. Non-http documents are not handled at present logger.finest("Not handling " + curi.toString() + ", non-http."); return false; } if(curi.getContentType() == null){ // No content type means we can not handle it. logger.finest("Not handling " + curi.toString() + ", missing content (mime) type"); return false; } if(curi.getContentType().matches(getMimeFilter()) == getBlacklist()){ // Early return. Does not pass the mime filter logger.finest("Not handling " + curi.toString() + ", excluded by mimefilter (" + curi.getContentType() + ")."); return false; } if(curi.getData().containsKey(A_CONTENT_STATE_KEY) && ((Integer)curi.getData().get(A_CONTENT_STATE_KEY)).intValue()==CONTENT_UNCHANGED){ // Early return. A previous processor or filter has judged this // CrawlURI as having unchanged content. logger.finest("Not handling " + curi.toString() + ", already flagged as unchanged."); return false; } return true; } @Override protected void innerProcess(CrawlURI puri) { throw new AssertionError(); } @Override protected ProcessResult innerProcessResult(CrawlURI curi) throws InterruptedException { ProcessResult processResult = ProcessResult.PROCEED; // Default. Continue as normal logger.finest("Processing " + curi.toString() + "(" + curi.getContentType() + ")"); stats.handledNumber++; stats.totalAmount += curi.getContentSize(); Statistics currHostStats = null; if(statsPerHost){ synchronized (perHostStats) { String host = getServerCache().getHostFor(curi.getUURI()).getHostName(); currHostStats = perHostStats.get(host); if(currHostStats==null){ currHostStats = new Statistics(); perHostStats.put(host,currHostStats); } } currHostStats.handledNumber++; currHostStats.totalAmount += curi.getContentSize(); } Document duplicate = null; if(lookupByURL){ duplicate = lookupByURL(curi,currHostStats); } else { duplicate = lookupByDigest(curi,currHostStats); } if (duplicate != null){ // Perform tasks common to when a duplicate is found. // Increment statistics counters stats.duplicateAmount += curi.getContentSize(); stats.duplicateNumber++; if(statsPerHost){ currHostStats.duplicateAmount+=curi.getContentSize(); currHostStats.duplicateNumber++; } String jumpTo = getJumpTo(); // Duplicate. Skip part of processing chain? if(jumpTo!=null){ processResult = ProcessResult.jump(jumpTo); } // Record origin? String annotation = "duplicate"; if(useOrigin){ // TODO: Save origin in the CrawlURI so that other processors // can make use of it. (Future: WARC) if(useOriginFromIndex && duplicate.get(DigestIndexer.FIELD_ORIGIN)!=null){ // Index contains origin, use it. 
annotation += ":\"" + duplicate.get( DigestIndexer.FIELD_ORIGIN) + "\""; } else { String tmp = getOrigin(); // Check if an origin value is actually available if(tmp != null && tmp.trim().length() > 0){ // It is available, add it to the log line. annotation += ":\"" + tmp + "\""; } } } // Make note in log curi.getAnnotations().add(annotation); if(getChangeContentSize()){ // Set content size to zero, we are not planning to // 'write it to disk' // TODO: Reconsider this curi.setContentSize(0); } else { // A hack to have Heritrix count this as a duplicate. // TODO: Get gojomo to change how Heritrix decides CURIs are duplicates. int targetHistoryLength = 2; Map[] history = (HashMap[]) (curi.containsDataKey(A_FETCH_HISTORY) ? curi.getData().get(A_FETCH_HISTORY) : new HashMap[targetHistoryLength]); // Create space if(history.length != targetHistoryLength) { HashMap[] newHistory = new HashMap[targetHistoryLength]; System.arraycopy( history,0, newHistory,0, Math.min(history.length,newHistory.length)); history = newHistory; } // rotate all history entries up one slot except the newest // insert from index at [1] for(int i = history.length-1; i >1; i--) { history[i] = history[i-1]; } Map oldVisit = new HashMap(); oldVisit.put(A_CONTENT_DIGEST, curi.getContentDigestSchemeString()); history[1]=oldVisit; curi.getData().put(A_FETCH_HISTORY,history); } // Mark as duplicate for other processors curi.getData().put(A_CONTENT_STATE_KEY, CONTENT_UNCHANGED); } if(getAnalyzeTimestamp()){ doAnalysis(curi,currHostStats, duplicate!=null); } return processResult; } /** * Process a CrawlURI looking up in the index by URL * @param curi The CrawlURI to process * @param currHostStats A statistics object for the current host. * If per host statistics tracking is enabled this * must be non null and the method will increment * appropriate counters on it. * @return The result of the lookup (a Lucene document). If a duplicate is * not found null is returned. */ protected Document lookupByURL(CrawlURI curi, Statistics currHostStats){ // Look the CrawlURI's URL up in the index. try { Query query = queryField(DigestIndexer.FIELD_URL, curi.toString()); TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; Document doc = null; String currentDigest = getDigestAsString(curi); if(hits != null && hits.length > 0){ // Typically there should only be one hit, but we'll allow for // multiple hits. for(int i=0 ; i<hits.length ; i++){ // Multiple hits on same exact URL should be rare // See if any have matching content digests doc = searcher.doc(hits[i].doc); String oldDigest = doc.get(DigestIndexer.FIELD_DIGEST); if(oldDigest.equalsIgnoreCase(currentDigest)){ stats.exactURLDuplicates++; if(statsPerHost){ currHostStats.exactURLDuplicates++; } logger.finest("Found exact match for " + curi.toString()); // If we found a hit, no need to look at other hits. return doc; } } } if(getTryEquivalent()) { // No exact hits. Let's try lenient matching. 
String normalizedURL = DigestIndexer.stripURL(curi.toString()); query = queryField(DigestIndexer.FIELD_URL_NORMALIZED, normalizedURL); collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); hits = collector.topDocs().scoreDocs; for(int i=0 ; i<hits.length ; i++){ doc = searcher.doc(hits[i].doc); String indexDigest = doc.get(DigestIndexer.FIELD_DIGEST); if(indexDigest.equals(currentDigest)){ // Make note in log String equivURL = doc.get( DigestIndexer.FIELD_URL); curi.getAnnotations().add("equivalentURL:\"" + equivURL + "\""); // Increment statistics counters stats.equivalentURLDuplicates++; if(statsPerHost){ currHostStats.equivalentURLDuplicates++; } logger.finest("Found equivalent match for " + curi.toString() + ". Normalized: " + normalizedURL + ". Equivalent to: " + equivURL); //If we found a hit, no need to look at more. return doc; } } } } catch (IOException e) { logger.log(Level.SEVERE,"Error accessing index.",e); } // If we make it here then this is not a duplicate. return null; } /** * Process a CrawlURI looking up in the index by content digest * @param curi The CrawlURI to process * @param currHostStats A statistics object for the current host. * If per host statistics tracking is enabled this * must be non null and the method will increment * appropriate counters on it. * @return The result of the lookup (a Lucene document). If a duplicate is * not found null is returned. */ protected Document lookupByDigest(CrawlURI curi, Statistics currHostStats) { Document duplicate = null; String currentDigest = null; Object digest = curi.getContentDigest(); if (digest != null) { currentDigest = Base32.encode((byte[])digest); } Query query = queryField(DigestIndexer.FIELD_DIGEST, currentDigest); try { TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS, false); searcher.search(query,collector); ScoreDoc[] hits = collector.topDocs().scoreDocs; StringBuffer mirrors = new StringBuffer(); mirrors.append("mirrors: "); String url = curi.toString(); String normalizedURL = getTryEquivalent() ? DigestIndexer.stripURL(url) : null; if(hits != null && hits.length > 0){ // Can definitely be more then one // Note: We may find an equivalent match before we find an // (existing) exact match. // TODO: Ensure that an exact match is recorded if it exists. for(int i=0 ; i<hits.length && duplicate==null ; i++){ Document doc = searcher.doc(hits[i].doc); String indexURL = doc.get(DigestIndexer.FIELD_URL); // See if the current hit is an exact match. if(url.equals(indexURL)){ duplicate = doc; stats.exactURLDuplicates++; if(statsPerHost){ currHostStats.exactURLDuplicates++; } logger.finest("Found exact match for " + curi.toString()); } // If not, then check if it is an equivalent match (if // equivalent matches are allowed). if(duplicate == null && getTryEquivalent()){ String indexNormalizedURL = doc.get(DigestIndexer.FIELD_URL_NORMALIZED); if(normalizedURL.equals(indexNormalizedURL)){ duplicate = doc; stats.equivalentURLDuplicates++; if(statsPerHost){ currHostStats.equivalentURLDuplicates++; } curi.getAnnotations().add("equivalentURL:\"" + indexURL + "\""); logger.finest("Found equivalent match for " + curi.toString() + ". Normalized: " + normalizedURL + ". Equivalent to: " + indexURL); } } if(duplicate == null){ // Will only be used if no exact (or equivalent) match // is found. 
mirrors.append(indexURL + " "); } } if(duplicate == null){ stats.mirrorNumber++; if (statsPerHost) { currHostStats.mirrorNumber++; } logger.log(Level.FINEST,"Found mirror URLs for " + curi.toString() + ". " + mirrors); } } } catch (IOException e) { logger.log(Level.SEVERE,"Error accessing index.",e); } return duplicate; } public String report() { StringBuffer ret = new StringBuffer(); ret.append("Processor: is.hi.bok.digest.DeDuplicator\n"); ret.append(" Function: Abort processing of duplicate records\n"); ret.append(" - Lookup by " + (lookupByURL?"url":"digest") + " in use\n"); ret.append(" Total handled: " + stats.handledNumber + "\n"); ret.append(" Duplicates found: " + stats.duplicateNumber + " " + getPercentage(stats.duplicateNumber,stats.handledNumber) + "\n"); ret.append(" Bytes total: " + stats.totalAmount + " (" + ArchiveUtils.formatBytesForDisplay(stats.totalAmount) + ")\n"); ret.append(" Bytes discarded: " + stats.duplicateAmount + " (" + ArchiveUtils.formatBytesForDisplay(stats.duplicateAmount) + ") " + getPercentage(stats.duplicateAmount, stats.totalAmount) + "\n"); ret.append(" New (no hits): " + (stats.handledNumber- (stats.mirrorNumber+stats.exactURLDuplicates+stats.equivalentURLDuplicates)) + "\n"); ret.append(" Exact hits: " + stats.exactURLDuplicates + "\n"); ret.append(" Equivalent hits: " + stats.equivalentURLDuplicates + "\n"); if(lookupByURL==false){ ret.append(" Mirror hits: " + stats.mirrorNumber + "\n"); } if(getAnalyzeTimestamp()){ ret.append("... [truncated message content] |
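Editor's note: the lookupByURL code above captures the core of the processor: query the Lucene index on the URL field, then compare the stored content digest against the digest of the document just fetched. The stand-alone sketch below illustrates that lookup pattern against an index built by DigestIndexer, written for the Lucene 3.0.2 API the project's pom declares. It is an approximation, not the production code path: the processor builds its queries through an internal helper (queryField, lost to the truncation above), so the plain TermQuery and the assumption that the URL field is stored untokenized are editorial assumptions.

import java.io.File;
import java.io.IOException;

import is.landsbokasafn.deduplicator.DigestIndexer;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;

/**
 * Minimal sketch: is the given URL, with the given content digest, a
 * duplicate according to a DeDuplicator index? Not the production code
 * path -- see DeDuplicator.lookupByURL above for the real logic.
 */
public class IndexLookupSketch {
    public static boolean isDuplicate(String indexDir, String url, String digest)
            throws IOException {
        IndexSearcher searcher =
                new IndexSearcher(FSDirectory.open(new File(indexDir)));
        try {
            // Plain term query on the URL field; the processor itself goes
            // through an internal query helper, so this is only an approximation.
            TermQuery query = new TermQuery(new Term(DigestIndexer.FIELD_URL, url));
            TopScoreDocCollector collector = TopScoreDocCollector.create(10, false);
            searcher.search(query, collector);
            for (ScoreDoc hit : collector.topDocs().scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                // Same URL seen before with the same digest => unchanged content.
                if (digest.equalsIgnoreCase(doc.get(DigestIndexer.FIELD_DIGEST))) {
                    return true;
                }
            }
            return false;
        } finally {
            searcher.close();
        }
    }
}

Equivalent-URL and digest-first matching follow the same shape, only keyed on DigestIndexer.FIELD_URL_NORMALIZED and DigestIndexer.FIELD_DIGEST, as the lookupByDigest method in the message above shows.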
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:20
Update of /cvsroot/deduplicator/deduplicator3/src/main/conf/jobs/profile-deduplicator/.svn In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840/src/main/conf/jobs/profile-deduplicator/.svn Added Files: all-wcprops entries Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: all-wcprops --- K 25 svn:wc:ra_dav:version-url V 95 /svnroot/archive-crawler/!svn/ver/6909/trunk/heritrix3/dist/src/main/conf/jobs/profile-defaults END profile-crawler-beans.cxml K 25 svn:wc:ra_dav:version-url V 122 /svnroot/archive-crawler/!svn/ver/6909/trunk/heritrix3/dist/src/main/conf/jobs/profile-defaults/profile-crawler-beans.cxml END --- NEW FILE: entries --- 10 dir 6911 https://kri...@ar.../svnroot/archive-crawler/trunk/heritrix3/dist/src/main/conf/jobs/profile-defaults https://kri...@ar.../svnroot/archive-crawler 2010-07-02T00:58:23.790893Z 6909 gojomo svn:special svn:externals svn:needs-lock daa5b2f2-a927-0410-8b2d-f5f262fa301a profile-crawler-beans.cxml file 2010-07-06T11:17:37.858000Z 1a97a3dd7c73e8edbe45a1c464218ccb 2010-07-02T00:58:23.790893Z 6909 gojomo 23397 |
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:20
Update of /cvsroot/deduplicator/deduplicator3/src/test/java/is/landsbokasafn/deduplicator In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840/src/test/java/is/landsbokasafn/deduplicator Added Files: CrawlLogIteratorTest.java DeDuplicatorTest.java Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: CrawlLogIteratorTest.java --- package is.landsbokasafn.deduplicator; import java.io.File; import java.io.IOException; import junit.framework.TestCase; public class CrawlLogIteratorTest extends TestCase { public void testParseLine() throws IOException{ File testFile = new File("test"); testFile.createNewFile(); CrawlLogIterator cli = new CrawlLogIterator("test"); String lineValidWithoutAnnotation = "2006-10-17T14:22:29.343Z 200 29764 http://www.bok.hi.is/image.gif E http://www.bok.hi.is/ image/gif #008 20061017142229253+74 YA3G7O6TNMHXA5WWDSIZJDNXV56WDRCA - -"; String lineValidWithoutOrigin = "2006-10-17T14:22:29.391Z 200 7951 http://www.bok.hi.is/ X http://bok.hi.is/ text/html #029 20061017142228950+364 SBRY3NIKXYAIKSCJ5QL2F6AE4GG7P6VR - 3t"; String lineValidWithOrigin = "2006-10-17T14:22:29.399Z 200 18803 http://www.bok.hi.is/ X http://bok.hi.is/ text/html #041 20061017142229087+180 OHCVML7NJ4STPQSRRWY7WWJL6T5H2R6L - duplicate:\"ORIGIN\",3t"; String lineTruncated = "2006-10-17T14:22:29.399Z 200 18803 http://www.bok.hi.is/ X http://bok.hi."; String lineValidWithDigestPrefix = "2006-10-17T14:22:29.343Z 200 29764 http://www.bok.hi.is/image.gif E http://www.bok.hi.is/ image/gif #008 20061017142229253+74 sha1:YA3G7O6TNMHXA5WWDSIZJDNXV56WDRCA - -"; CrawlDataItem tmp = cli.parseLine(lineValidWithoutAnnotation); assertNotNull(tmp); tmp = cli.parseLine(lineValidWithoutOrigin); assertNotNull(tmp); assertNull(tmp.getOrigin()); tmp = cli.parseLine(lineValidWithOrigin); assertNotNull(tmp); assertEquals("ORIGIN", tmp.getOrigin()); tmp = cli.parseLine(lineTruncated); assertNull(tmp); tmp = cli.parseLine(lineValidWithDigestPrefix); assertEquals("YA3G7O6TNMHXA5WWDSIZJDNXV56WDRCA", tmp.getContentDigest()); cli.close(); testFile.delete(); //Cleanup } } --- NEW FILE: DeDuplicatorTest.java --- package is.landsbokasafn.deduplicator; import junit.framework.TestCase; public class DeDuplicatorTest extends TestCase { public void testGetPercentage() throws Exception{ assertEquals("2.5%",DeDuplicator.getPercentage(5,200)); } } |
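Editor's note: CrawlLogIteratorTest above exercises parseLine on individual crawl.log lines; in normal use the class is consumed through the CrawlDataIterator contract (hasNext, next, close) checked in with the same batch. The sketch below walks a crawl.log and prints the fields the indexer cares about. The log path is a placeholder, the iterator is assumed to be publicly constructible from a path (as the test suggests), and DigestIndexer, not this loop, is what normally consumes the stream.

import java.io.IOException;

import is.landsbokasafn.deduplicator.CrawlDataItem;
import is.landsbokasafn.deduplicator.CrawlLogIterator;

/**
 * Minimal sketch of the CrawlDataIterator contract: iterate a Heritrix
 * crawl.log and print the fields DigestIndexer would index.
 */
public class CrawlLogDumpSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder location -- point this at a real Heritrix crawl.log.
        CrawlLogIterator it = new CrawlLogIterator("/path/to/crawl.log");
        try {
            while (it.hasNext()) {
                CrawlDataItem item = it.next();
                if (item == null) {
                    // Per the CrawlDataIterator contract, null means no further items.
                    break;
                }
                System.out.println(item.getTimestamp() + " "
                        + item.getContentDigest() + " " + item.getURL());
            }
        } finally {
            it.close();
        }
    }
}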
From: Kristinn S. <kri...@us...> - 2010-07-14 16:19:20
Update of /cvsroot/deduplicator/deduplicator3 In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv6840 Added Files: .project pom.xml .cvsignore .classpath Log Message: Initial check-in of v3. Compiles and runs. Need to do more regression testing and review how Lucene index is used. Also improve documentation. --- NEW FILE: .cvsignore --- target --- NEW FILE: .project --- <?xml version="1.0" encoding="UTF-8"?> <projectDescription> <name>DeDuplicator3</name> <comment></comment> <projects> </projects> <buildSpec> <buildCommand> <name>org.eclipse.jdt.core.javabuilder</name> <arguments> </arguments> </buildCommand> <buildCommand> <name>org.maven.ide.eclipse.maven2Builder</name> <arguments> </arguments> </buildCommand> </buildSpec> <natures> <nature>org.eclipse.jdt.core.javanature</nature> <nature>org.maven.ide.eclipse.maven2Nature</nature> </natures> </projectDescription> --- NEW FILE: pom.xml --- <?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>is.landsbokasafn</groupId> <artifactId>deduplicator</artifactId> <name>DeDuplicator3 (Heritrix 3 add-on module)</name> <version>3.0.0-SNAPSHOT</version> <description> An add-on module for the web crawler Heritrix 3 that offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls. </description> <url>http://deduplicator.sourceforge.net/</url> <issueManagement> <system>SourceForge Trackers</system> <url>http://sourceforge.net/tracker/?group_id=181565</url> </issueManagement> <mailingLists> <mailingList> <name> Crawler Discussion List (General Heritrix Discussion) </name> <subscribe> mailto:arc...@ya... </subscribe> <unsubscribe> mailto:arc...@ya... 
</unsubscribe> <post>mailto:arc...@ya...</post> <archive> http://groups.yahoo.com/group/archive-crawler/ </archive> </mailingList> <mailingList> <name>DeDuplicator CVS Commits</name> <subscribe> http://lists.sourceforge.net/lists/listinfo/deduplicator-cvs </subscribe> <unsubscribe> http://lists.sourceforge.net/lists/listinfo/deduplicator-cvs </unsubscribe> <archive> http://sourceforge.net/mailarchive/forum.php?forum=deduplicator-cvs </archive> </mailingList> </mailingLists> <developers> <developer> <id>Kristinn</id> <name>Kristinn Sigurðsson</name> <email>kristsi at bok.hi.is</email> <organization> National and University Library of Iceland </organization> <roles> <role>Developer</role> </roles> <timezone>+0</timezone> </developer> </developers> <contributors> <contributor> <name>Lars Clausen</name> <email>lc at statsbiblioteket.dk</email> <organization>Netarkivet.dk</organization> <timezone>+1</timezone> </contributor> <contributor> <name>Maximilian Schoefmann</name> <email>schoefma at cip.ifi.lmu.de</email> </contributor> <contributor> <name>Kare Fiedler Christiansen</name> <email>kfc at statsbiblioteket.dk</email> <organization>Netarkivet.dk</organization> <timezone>+1</timezone> </contributor> </contributors> <scm> <connection> scm:cvs:pserver:anonymous:@deduplicator.cvs.sourceforge.net:/cvsroot/deduplicator:deduplicator3 </connection> <developerConnection> scm:cvs:ext:dev...@de...:/cvsroot/deduplicator:deduplicator3 </developerConnection> <url> http://deduplicator.cvs.sourceforge.net/deduplicator/deduplicator3/ </url> </scm> <organization> <name>National and University Library of Iceland</name> <url>http://www.landsbokasafn.is</url> </organization> <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> </properties> <build> <finalName>${project.artifactId}-${project.version}-${buildNumber}</finalName> <plugins> <!-- this is a java 1.6 project --> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-compiler-plugin</artifactId> <configuration> <source>1.6</source> <target>1.6</target> <encoding>UTF-8</encoding> </configuration> </plugin> <plugin> <groupId>org.codehaus.mojo</groupId> <artifactId>buildnumber-maven-plugin</artifactId> <version>1.0-beta-1</version> <executions> <execution> <phase>validate</phase> <goals> <goal>create</goal> </goals> </execution> </executions> <configuration> <format>{0,date,yyyyMMdd}</format> <items> <item>timestamp</item> </items> <doCheck>false</doCheck> <doUpdate>false</doUpdate> </configuration> </plugin> <plugin> <artifactId>maven-assembly-plugin</artifactId> <configuration> <descriptors> <descriptor> src/main/assembly/dist.xml </descriptor> <descriptor> src/main/assembly/src.xml </descriptor> </descriptors> <finalName>${project.artifactId}-${project.version}-${buildNumber}</finalName> </configuration> <executions> <execution> <id>make-assembly</id> <phase>package</phase> <goals> <goal>single</goal> </goals> </execution> </executions> </plugin> <plugin> <artifactId>maven-site-plugin</artifactId> <configuration> <locales>en</locales> </configuration> </plugin> </plugins> </build> <dependencies> <dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>3.8.1</version> <scope>test</scope> </dependency> <dependency> <groupId>org.archive.heritrix</groupId> <artifactId>heritrix-commons</artifactId> <version>3.0.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.archive.heritrix</groupId> <artifactId>heritrix-modules</artifactId> <version>3.0.0</version> 
<scope>provided</scope> </dependency> <dependency> <groupId>org.archive.heritrix</groupId> <artifactId>heritrix-engine</artifactId> <version>3.0.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> <version>3.0.2</version> </dependency> </dependencies> <reporting> <plugins> <plugin> <artifactId>maven-javadoc-plugin</artifactId> </plugin> <plugin> <groupId>org.codehaus.mojo</groupId> <artifactId>jxr-maven-plugin</artifactId> <configuration> <overview> ${basedir}/src/main/java/is/landsbokasafn/deduplicator/overview.html </overview> <version>true</version> </configuration> </plugin> </plugins> </reporting> <distributionManagement> <site> <id>website</id> <url> scp://deduplicator.sourceforge.net/home/groups/d/de/deduplicator/htdocs/ </url> </site> </distributionManagement> </project> --- NEW FILE: .classpath --- <?xml version="1.0" encoding="UTF-8"?> <classpath> <classpathentry kind="src" output="target/classes" path="src/main/java"/> <classpathentry kind="src" output="target/test-classes" path="src/test/java"/> <classpathentry kind="con" path="org.maven.ide.eclipse.MAVEN2_CLASSPATH_CONTAINER"/> <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/> <classpathentry kind="output" path="target/classes"/> </classpath> |