harvestlinks

David Fisher

The HarvestLinks application extracts all links (and their link text) from a collection
of web pages. It can be used to gather anchor text and in-links for HTML and TREC Web data.
These can then be added to an index as "inlink" fields, for use in
direct retrieval or in [pagerank] calculations.

Note that harvestlinks outputs only those hyperlinks that point to pages inside the collection,
and ignores those that point outside it. The result is a closed Web graph, which has consequences
when it is used for [pagerank] computation.
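
As an illustration of what "closed" means here, the following Python sketch filters a list of links down to those whose target is a page inside the collection. It is not the harvestlinks implementation; the function name, the (source, target, text) tuple layout and the example URLs are all made up for this example.

```python
def closed_graph(links, collection_urls):
    """Keep only the links whose target URL is a page in the collection."""
    inside = set(collection_urls)
    return [(src, dst, text) for (src, dst, text) in links if dst in inside]


# Hypothetical data: two pages in the collection, one link leaving it.
collection = ["http://a.example/", "http://b.example/"]
links = [
    ("http://a.example/", "http://b.example/", "page b"),
    ("http://a.example/", "http://outside.example/", "elsewhere"),
    ("http://b.example/", "http://a.example/", "page a"),
]

kept = closed_graph(links, collection)
for src, dst, text in kept:
    print(src, "->", dst, "(" + text + ")")
```

The link to the outside page is dropped, so any link-based score computed from the result sees only links between pages of the collection.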

The two required parameters for the harvestlinks application are:
* corpus: The path to the directory holding the corpus files you're trying to index
* output: The path to a directory where the link harvesting output should go

For example, running this from the command line might look like:
$ ./harvestlinks -corpus=/path/to/corpus -output=/path/to/output
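
If you run harvestlinks from a script, the argument list can be assembled programmatically. The Python sketch below only builds the arguments and does not run anything. The helper name is hypothetical, and it assumes that the optional parameters are passed in the same -name=value form as corpus and output.

```python
def harvestlinks_command(corpus, output, binary="./harvestlinks", **options):
    """Build an argument list in the -name=value form shown above."""
    args = [binary, f"-corpus={corpus}", f"-output={output}"]
    for name, value in options.items():
        # Booleans are written as true/false (assumed spelling on the command line).
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"-{name}={value}")
    return args


# "class" is a Python keyword, so it would have to be passed as **{"class": "warc"}.
cmd = harvestlinks_command("/path/to/corpus", "/path/to/output",
                           mergethreads=4, delete=False)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the binary path is correct for your installation.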

Once you have gathered your links, you must tell the indexer to index them along with your source data.
In your index parameter file, you should add the following to your <corpus> parameter set:
<inlink>/path/to/output/sorted</inlink>

(Here "sorted" is the directory of that name under the harvestlinks output directory.) You must also declare the field, so that the indexer knows about the inlink fields:
<field><name>inlink</name></field>

This will allow you to perform retrieval tasks on the anchor text.
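
To show how the two fragments might fit into a complete index parameter file, here is a Python sketch that generates one. Only the <inlink> and <field> elements come from the text above. The <parameters> root, the <index> and <path> elements and the placement of <field> at the top level are assumptions about the indexer's parameter format, and the function name and paths are placeholders.

```python
import xml.etree.ElementTree as ET


def build_index_parameters(index_path, corpus_path, harvest_output):
    """Return an index parameter file (as a string) that includes the harvested links."""
    params = ET.Element("parameters")               # assumed root element
    ET.SubElement(params, "index").text = index_path  # assumed element

    corpus = ET.SubElement(params, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path  # assumed element
    # The "sorted" directory under the harvestlinks output directory.
    ET.SubElement(corpus, "inlink").text = harvest_output.rstrip("/") + "/sorted"

    # Declare the inlink field so that the indexer knows about it.
    field = ET.SubElement(params, "field")
    ET.SubElement(field, "name").text = "inlink"

    return ET.tostring(params, encoding="unicode")


xml_text = build_index_parameters("/path/to/index", "/path/to/corpus", "/path/to/output")
print(xml_text)
```

Check the generated file against the indexer documentation before using it, since the layout outside the two documented elements is assumed.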

The full set of parameters for harvestlinks is:
  • corpus: (required) The path to the directory holding the corpus files you're trying to index
  • output: (required) The path to a directory where the link harvesting output should go
  • class: (optional) The file class of the corpus. One of trecweb (the default) or warc.
  • redirect: (optional) specifies a redirect file that maps from source to target URLs to create aliases for links. The redirect file is a text file with one entry per line in the form of:

    [SOURCE_URL] [TARGET_URL]

Here the source URL is the original URL to look for, and the target URL is the one that is searched for in its place.

  • mergethreads: (optional) specifies the number of threads to use for the file sort and merge operations (default 4; fewer than 8 recommended)
  • delete: (optional) set to false to not delete any existing directories in the output directory (default true: do delete)
  • harvest: (optional) perform the harvesting step (default true, set to false to skip)
  • sort: (optional) perform the sorting/merge step (default true, set to false to skip)
  • clean: (optional) perform cleaning of temporary files after sort (default true, set to false to skip)
  • combine: (optional) perform final combination of links (default true, set to false to skip)
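
The redirect file format described above can be read with a few lines of Python. This is only a sketch of the aliasing idea, not how harvestlinks processes the file. The function names are hypothetical, and it assumes that the two URLs on a line are separated by whitespace and that a single lookup (no chains of redirects) is enough.

```python
def load_redirects(lines):
    """Parse 'SOURCE_URL TARGET_URL' lines into a source -> target mapping."""
    redirects = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        source, target = parts
        redirects[source] = target
    return redirects


def resolve(url, redirects):
    """Return the target URL for a redirected source URL, else the URL itself."""
    return redirects.get(url, url)


# Hypothetical redirect file contents.
redirect_lines = [
    "http://old.example/a http://new.example/a",
    "",
    "malformed-line",
]
redirects = load_redirects(redirect_lines)
print(resolve("http://old.example/a", redirects))
print(resolve("http://other.example/", redirects))
```

To read a real file, pass an open file object, for example load_redirects(open("redirects.txt")), where the file name is a placeholder.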
