The HarvestLinks application extracts all links (and link text) from a collection
of web pages. It can be used to gather anchor text and in-links for HTML and TREC Web data.
These can in turn be added to an index in the form of "inlink" fields, for use in
direct retrieval or in [pagerank] calculations.
Note that harvestlinks outputs only those hyperlinks that point to pages inside the collection
and ignores those that point outside. It therefore creates a closed Web graph, which has consequences
when the output is used for [pagerank] computation.
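To make the closed-graph behavior concrete, here is a minimal Python sketch. It is not the harvestlinks implementation; the function name `closed_inlinks`, the data layout, and the example URLs are all hypothetical and chosen only for illustration. It inverts a small set of out-links into in-links and drops every link whose target is not itself a page in the collection.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: page URL -> list of (target URL, anchor text) pairs.
Link = Tuple[str, str]


def closed_inlinks(outlinks: Dict[str, List[Link]]) -> Dict[str, List[Link]]:
    """Invert out-links into in-links, keeping only targets inside the collection."""
    collection = set(outlinks)
    inlinks: Dict[str, List[Link]] = {url: [] for url in collection}
    for source, links in outlinks.items():
        for target, anchor in links:
            if target in collection:
                # Record the link under its target page, with its anchor text.
                inlinks[target].append((source, anchor))
            # Links that leave the collection are silently ignored.
    return inlinks


# Toy collection of two pages; one link points outside the collection.
pages = {
    "http://a.example/": [
        ("http://b.example/", "page B"),
        ("http://outside.example/", "elsewhere"),
    ],
    "http://b.example/": [("http://a.example/", "back to A")],
}
result = closed_inlinks(pages)
```

In this toy example the link to `http://outside.example/` never appears in `result`, so any link-based score computed from it sees only the links between collection pages.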
The two required parameters for the harvestlinks application are:
* corpus: The path to the directory holding the corpus files you're trying to index
* output: The path to a directory where the link harvesting output should go
For example, running this from the command line might look like:
$ ./harvestlinks -corpus=/path/to/corpus -output=/path/to/output
Once you have gathered your links, you must tell the indexer to index them along with your source data.
In your index parameter file, you should add the path of the harvested links to your <corpus> parameter set. This path is the directory named "sorted" under the output directory for harvestlinks. You should also declare the inlink field, so that the indexer knows about the inlink fields.
This will allow you to perform retrieval tasks on the anchor text.
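As a rough illustration of the two parameter-file additions described above, the following Python sketch edits a parameter file in memory. The element names are assumptions and should be checked against the indexer's own documentation: an `inlink` element inside `<corpus>` holding the "sorted" directory, a `field` element with a `name` of `inlink`, and a `parameters` root element. The helper `add_inlink_settings` is hypothetical.

```python
import xml.etree.ElementTree as ET


def add_inlink_settings(params: ET.Element, sorted_dir: str) -> ET.Element:
    """Add the harvested-links path and the inlink field to a parameter tree."""
    corpus = params.find("corpus")
    if corpus is None:
        corpus = ET.SubElement(params, "corpus")
    # ASSUMPTION: the corpus-level element for harvested links is named "inlink".
    ET.SubElement(corpus, "inlink").text = sorted_dir
    # ASSUMPTION: fields are declared as <field><name>...</name></field>.
    field = ET.SubElement(params, "field")
    ET.SubElement(field, "name").text = "inlink"
    return params


# Start from a minimal parameter file (the root element name is assumed).
params = ET.fromstring(
    "<parameters><corpus><path>/path/to/corpus</path></corpus></parameters>"
)
add_inlink_settings(params, "/path/to/output/sorted")
xml_text = ET.tostring(params, encoding="unicode")
print(xml_text)
```

The paths are the placeholder paths from the command-line example; substitute the real corpus and output directories.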
Where the source URL is the original URL to be found and the target URL will be what is searched for instead of the original source URL.