
CrawlSpec

John Arcoman

Crawl Specifications

The crawler cockpit is the user interface to the crawler system developed
within the ARCOMEM project. It provides the means to set up crawl campaigns,
monitor their progress and peruse intermediate results. Campaign configuration
is transferred from the crawler cockpit to the ARCOMEM analysis system via
crawl specifications: data structures that store the information determining
how the system will guide the crawl.

Crawl specifications are immutable: once written, they cannot be changed.
When changes are made to a specification, a new crawl specification is
generated instead. This is a pragmatic decision that allows individual
analysis results to be linked directly to specific crawl specifications; if
specifications were mutable, that link could not be maintained reliably. Each
new crawl specification is given a new identifier and contains
"sub-specifications" for configuring the different parts of the ARCOMEM
system.
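The copy-on-change behaviour can be pictured with a minimal sketch. The class
and field names here are illustrative only, not the actual ARCOMEM API:
"changing" a value produces a new object under a fresh identifier, leaving the
original (and anything linked to it) untouched.

```java
import java.net.URI;

// Illustrative only: a cut-down immutable specification. The real
// CrawlSpecification carries many more fields and sub-specifications.
final class ImmutableSpec {
    final URI identifier;
    final String crawlStartDate;

    ImmutableSpec( URI identifier, String crawlStartDate ) {
        this.identifier = identifier;
        this.crawlStartDate = crawlStartDate;
    }

    // Changing the start date returns a brand-new specification with a new
    // identifier; the original object is never modified, so analysis results
    // that reference it remain valid.
    ImmutableSpec withStartDate( URI newIdentifier, String newStartDate ) {
        return new ImmutableSpec( newIdentifier, newStartDate );
    }
}
```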

Crawl Specification

The crawl specification design has been implemented as Java classes in the
ARCOMEM project's crawl-specification sub-module. These allow Java code to
generate crawl specifications and convert them into and out of JSON or RDF.
There is a full example of their use in the jUnit test
CrawlSpecificationTest in the crawl-specification sub-module of the ARCOMEM
codebase.

    // Instantiate the container
    this.crawlSpecification = new CrawlSpecification();
    // Main crawl specification
    this.crawlSpecification.campaignIdentifier =
            URI.create( RDF.ARCO+"campaign1" );
    this.crawlSpecification.crawlSpecificationIdentifier =
            URI.create( RDF.ARCO+"crawlSpecification1" );
    this.crawlSpecification.crawlStartDate = "2012-12-01";
    this.crawlSpecification.crawlEndDate = "2012-12-31";
    // API Crawler configuration
    this.crawlSpecification.apiCrawlSpec =
            new APICrawlSpecification();
    this.crawlSpecification.apiCrawlSpec.crawlPeriod = 6;
    this.crawlSpecification.apiCrawlSpec.keywords =
            new ArrayList<String>( Arrays.asList(
                "barack", "obama", "usa" ) );
    . . .

Storing and accessing these crawl specification objects has been made easier
through the
CrawlSpecificationFactory
class in the ARCOMEM core framework. The framework will deal with getting an
appropriate crawl specification serializer and retrieving a crawl specification
when the loadCrawlSpecification()
method is called.

The API Crawler is not developed in Java; it is based on a Python codebase, so
these Java classes are not available to it. To make the crawl specification API more
language agnostic, the crawl specification getters and setters have been
delegated to an HTTP service that uses JSON for external communication. The
Java CrawlSpecification
class has a method that allows conversion directly to and from the JSON required
for this service. The Python implementation must produce JSON that conforms to
the JSON format required for the service. The block below shows the example
from the CrawlSpecificationTest jUnit test that’s in the crawl-specification
sub-module:

    {
      "campaignIdentifier": "http://crawlspec.arcomem.eu/campaign1",
      "crawlSpecificationIdentifier": "http://crawlspec.arcomem.eu/crawlSpecification1",
      "crawlStartDate": "2012-12-01",
      "crawlEndDate": "2012-12-31",
      "apiCrawlSpec": {
        "crawlPeriod": 6.0,
        "keywords": ["barack", "obama", "usa"]
      },
      "htmlCrawlSpec": {
        "seedURLs": ["http://news.bbc.co.uk/", "http://telegraph.co.uk/"],
        "languagePreferences": [
          {"languageIdentifier": "en", "weight": 1.0},
          {"languageIdentifier": "*", "weight": 0.2}
        ]
      },
      "onlineAnalysisConfig": {
        "urlPatterns": [{"weight": 1.0, "pattern": "*news*"}],
        "applicationScopes": [
          {"applicationTag": "blog", "weight": 0.4},
          {"applicationTag": "news", "weight": 1.0}
        ],
        "entityPreferences": [
          {"entityURI": "http://dbpedia.org/resource/Barack_Obama",
           "entityName": "Barack Obama", "weight": 1.0}
        ],
        "topicPreferences": [{"topicIdentifier": "healthcare", "weight": 1.0}]
      },
      "offlineAnalysiConfig": {}
    }

Table 2 - JSON elements required to write a crawl specification
Crawl Specification
    campaignIdentifier (String, URI): The URI of this campaign
    crawlSpecificationIdentifier (String, URI): The URI of this crawl specification
    crawlStartDate (String, date): The start date of the crawl
    crawlEndDate (String, date): The end date of the crawl
    apiCrawlSpec (Object): The API crawler configuration
    htmlCrawlSpec (Object): The HTML crawler specification
    onlineAnalysisConfig (Object): The configuration for the online analysis engine
    offlineAnalysisConfig (Object): The configuration for the offline analysis engine

API Crawler Configuration
    crawlPeriod (Double): The number of hours between each API crawl run
    keywords (Array[String]): Keywords used to search the API being crawled

HTML Crawler Configuration
    seedURLs (Array[String]): The URLs used to seed the crawl
    languagePreferences (Array[Object]): A set of priority values for the
        preferred languages. Each object needs languageIdentifier (an HTTP
        language identifier) and weight (a double between 0 and 1).

Online Analysis Configuration
    urlPatterns (Array[Object]): A list of URL patterns that identify relevant
        documents. Each object needs pattern (a regex to match against) and
        weight (a double that weights the scores of documents matching the
        pattern).
    applicationScopes (Array[Object]): Each object defines a weight for a
        particular application. The object needs applicationTag, naming the
        application (blog, news, etc.), and weight (a double that weights the
        scores of documents from that application).
    entityPreferences (Array[Object]): Each object defines a weight for a
        particular entity URI found within a document. The object needs
        entityURI and entityName (the URI and string representing the entity)
        and weight (a double that weights the scores of documents containing
        that entity).
    topicPreferences (Array[Object]): Each object defines a weight for a
        particular topic found within a document. The object needs
        topicIdentifier (an identifier representing the topic) and weight (a
        double that weights the scores of documents matching that topic).
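Since non-Java components (such as the Python API Crawler) must emit this JSON
themselves, it can help to see the nesting from Table 2 built from plain data
structures. The sketch below uses only the Java standard library and a tiny
hand-rolled emitter covering just the JSON subset this format needs (objects,
arrays, strings, numbers); it is not the project's serializer, and the example
only populates the first few fields.

```java
import java.util.*;

// Minimal JSON emitter for the subset used by crawl specifications:
// maps become objects, lists become arrays, plus strings and numbers.
final class SpecJson {
    static String write( Object v ) {
        StringBuilder sb = new StringBuilder();
        emit( v, sb );
        return sb.toString();
    }

    private static void emit( Object v, StringBuilder sb ) {
        if (v instanceof Map) {
            sb.append('{');
            boolean first = true;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) v).entrySet()) {
                if (!first) sb.append(',');
                first = false;
                sb.append('"').append(e.getKey()).append("\":");
                emit( e.getValue(), sb );
            }
            sb.append('}');
        } else if (v instanceof List) {
            sb.append('[');
            boolean first = true;
            for (Object o : (List<?>) v) {
                if (!first) sb.append(',');
                first = false;
                emit( o, sb );
            }
            sb.append(']');
        } else if (v instanceof String) {
            // Note: a real implementation must also escape string contents.
            sb.append('"').append(v).append('"');
        } else {
            sb.append(v); // numbers
        }
    }

    // Assemble the top of the structure from Table 2, in document order.
    static Map<String, Object> exampleSpec() {
        Map<String, Object> api = new LinkedHashMap<>();
        api.put( "crawlPeriod", 6.0 );
        api.put( "keywords", Arrays.asList( "barack", "obama", "usa" ) );

        Map<String, Object> spec = new LinkedHashMap<>();
        spec.put( "campaignIdentifier",
                  "http://crawlspec.arcomem.eu/campaign1" );
        spec.put( "crawlSpecificationIdentifier",
                  "http://crawlspec.arcomem.eu/crawlSpecification1" );
        spec.put( "crawlStartDate", "2012-12-01" );
        spec.put( "crawlEndDate", "2012-12-31" );
        spec.put( "apiCrawlSpec", api );
        return spec;
    }
}
```

Calling SpecJson.write( SpecJson.exampleSpec() ) reproduces the beginning of
the example document above.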

Crawl Specification Access

Access to the crawl specification is provided by the framework.
A CrawlSpecification
interface is defined which defines functional methods for accessing the crawl spec.
Implementations of this interface provide access to crawl specifications
stored in different ways; in particular, there are implementations for crawl specs
stored in files, HDFS files and triple-stores. The type of crawl spec access is selected
in the framework configuration file with the key "crawlspec.class"
(see Configuration Options below).
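For example, the file-based implementation might be selected like this. Both
the property-file layout and the implementation class name below are
illustrative assumptions; only the "crawlspec.class" key itself is documented
here, so check the actual class names in the framework.

    # Framework configuration (illustrative): selects which
    # CrawlSpecification implementation the factory instantiates.
    # The class name below is a placeholder, not a real ARCOMEM class.
    crawlspec.class = eu.arcomem.framework.crawlspec.FileCrawlSpecification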

To get an instance of the crawl specification for your framework setup, use the
CrawlSpecificationFactory
to instantiate it. If you have a map-reduce Context you can
also pass that to the factory in case the crawl specification loader needs to be
configured.

CrawlSpecification cs = CrawlSpecificationFactory.loadCrawlSpecification();
List<URL> urls = cs.getSeedURLs();

or if you have a map-reduce context:

CrawlSpecification cs = CrawlSpecificationFactory.loadCrawlSpecification( context );

Related

Wiki: Architecture
Wiki: TryIt
