
CrawlSpec

John Arcoman

Crawl Specifications

The crawler cockpit is the user interface to the crawler system developed
within the ARCOMEM project. It provides the means to set up crawl campaigns,
monitor their progress and peruse intermediate results. Campaign configuration
is transferred from the crawler cockpit to the ARCOMEM analysis system via
crawl specifications: data structures that store the information determining
how the system will guide the crawl.

Crawl specifications are immutable: once written, they cannot be changed.
When changes are made to a specification, a new crawl specification is
generated instead. This is a pragmatic decision that allows individual
analysis results to be linked directly to specific crawl specifications; if
specifications were mutable, that link could not be maintained reliably. Each
new crawl specification is given a new identifier and contains
"sub-specifications" for configuring the different parts of the ARCOMEM
system.
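The copy-on-change behaviour can be pictured with a minimal sketch. The class
and field names here are illustrative only, not the actual ARCOMEM API:
"changing" a value produces a new object under a fresh identifier, leaving the
original (and anything linked to it) untouched.

```java
import java.net.URI;

// Illustrative only: a cut-down immutable specification. The real
// CrawlSpecification carries many more fields and sub-specifications.
final class ImmutableSpec {
    final URI identifier;
    final String crawlStartDate;

    ImmutableSpec( URI identifier, String crawlStartDate ) {
        this.identifier = identifier;
        this.crawlStartDate = crawlStartDate;
    }

    // Changing the start date returns a brand-new specification with a new
    // identifier; the original object is never modified, so analysis results
    // that reference it remain valid.
    ImmutableSpec withStartDate( URI newIdentifier, String newStartDate ) {
        return new ImmutableSpec( newIdentifier, newStartDate );
    }
}
```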

Crawl Specification

The crawl specification design has been implemented as Java classes in the
ARCOMEM project's crawl-specification sub-module. These allow Java code to
generate crawl specifications and convert them into and out of JSON or RDF.
There is a full example of their use in the jUnit test
CrawlSpecificationTest in the crawl-specification sub-module of the ARCOMEM
codebase.

    // Instantiate the container
    this.crawlSpecification = new CrawlSpecification();
    // Main crawl specification
    this.crawlSpecification.campaignIdentifier =
            URI.create( RDF.ARCO+"campaign1" );
    this.crawlSpecification.crawlSpecificationIdentifier =
            URI.create( RDF.ARCO+"crawlSpecification1" );
    this.crawlSpecification.crawlStartDate = "2012-12-01";
    this.crawlSpecification.crawlEndDate = "2012-12-31";
    // API Crawler configuration
    this.crawlSpecification.apiCrawlSpec =
            new APICrawlSpecification();
    this.crawlSpecification.apiCrawlSpec.crawlPeriod = 6;
    this.crawlSpecification.apiCrawlSpec.keywords =
            new ArrayList<String>( Arrays.asList(
                "barack", "obama", "usa" ) );
    . . .

Storing and accessing these crawl specification objects has been made easier
through the
CrawlSpecificationFactory
class in the ARCOMEM core framework. The framework will deal with getting an
appropriate crawl specification serializer and retrieving a crawl specification
when the loadCrawlSpecification()
method is called.

The API Crawler is not developed in Java; it is based on a Python codebase, so
these Java classes are not available to it. To make the crawl specification API more
language agnostic, the crawl specification getters and setters have been
delegated to an HTTP service that uses JSON for external communication. The
Java CrawlSpecification
class has a method that allows conversion directly to and from the JSON required
for this service. The Python implementation must produce JSON that conforms to
the JSON format required for the service. The block below shows the example
from the CrawlSpecificationTest jUnit test that’s in the crawl-specification
sub-module:

    {
      "campaignIdentifier": "http://crawlspec.arcomem.eu/campaign1",
      "crawlSpecificationIdentifier": "http://crawlspec.arcomem.eu/crawlSpecification1",
      "crawlStartDate": "2012-12-01",
      "crawlEndDate": "2012-12-31",
      "apiCrawlSpec": {
        "crawlPeriod": 6.0,
        "keywords": ["barack", "obama", "usa"]
      },
      "htmlCrawlSpec": {
        "seedURLs": ["http://news.bbc.co.uk/", "http://telegraph.co.uk/"],
        "languagePreferences": [
          {"languageIdentifier": "en", "weight": 1.0},
          {"languageIdentifier": "*", "weight": 0.2}
        ]
      },
      "onlineAnalysisConfig": {
        "urlPatterns": [{"weight": 1.0, "pattern": "*news*"}],
        "applicationScopes": [
          {"applicationTag": "blog", "weight": 0.4},
          {"applicationTag": "news", "weight": 1.0}
        ],
        "entityPreferences": [
          {"entityURI": "http://dbpedia.org/resource/Barack_Obama",
           "entityName": "Barack Obama", "weight": 1.0}
        ],
        "topicPreferences": [{"topicIdentifier": "healthcare", "weight": 1.0}]
      },
      "offlineAnalysiConfig": {}
    }

Table 2 - JSON elements required to write a crawl specification
Crawl Specification
    campaignIdentifier (String, URI): The URI of this campaign
    crawlSpecificationIdentifier (String, URI): The URI of this crawl specification
    crawlStartDate (String, date): The start date of the crawl
    crawlEndDate (String, date): The end date of the crawl
    apiCrawlSpec (Object): The API crawler configuration
    htmlCrawlSpec (Object): The HTML crawler specification
    onlineAnalysisConfig (Object): The configuration for the online analysis engine
    offlineAnalysisConfig (Object): The configuration for the offline analysis engine

API Crawler Configuration
    crawlPeriod (Double): The number of hours between each API crawl run
    keywords (Array[String]): Keywords used to search the API being crawled

HTML Crawler Configuration
    seedURLs (Array[String]): The URLs used to seed the crawl
    languagePreferences (Array[Object]): A set of priority values for the
        preferred languages. Each object needs languageIdentifier (an HTTP
        language identifier) and weight (a double between 0 and 1).

Online Analysis Configuration
    urlPatterns (Array[Object]): A list of URL patterns that identify relevant
        documents. Each object needs pattern (a regex to match against) and
        weight (a double that weights the scores of documents matching the
        pattern).
    applicationScopes (Array[Object]): Each object defines a weight for a
        particular application. The object needs applicationTag, naming the
        application (blog, news, etc.), and weight (a double that weights the
        scores of documents from that application).
    entityPreferences (Array[Object]): Each object defines a weight for a
        particular entity URI found within a document. The object needs
        entityURI and entityName (the URI and string representing the entity)
        and weight (a double that weights the scores of documents containing
        that entity).
    topicPreferences (Array[Object]): Each object defines a weight for a
        particular topic found within a document. The object needs
        topicIdentifier (an identifier representing the topic) and weight (a
        double that weights the scores of documents matching that topic).
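Since non-Java components (such as the Python API Crawler) must emit this JSON
themselves, it can help to see the nesting from Table 2 built from plain data
structures. The sketch below uses only the Java standard library and a tiny
hand-rolled emitter covering just the JSON subset this format needs (objects,
arrays, strings, numbers); it is not the project's serializer, and the example
only populates the first few fields.

```java
import java.util.*;

// Minimal JSON emitter for the subset used by crawl specifications:
// maps become objects, lists become arrays, plus strings and numbers.
final class SpecJson {
    static String write( Object v ) {
        StringBuilder sb = new StringBuilder();
        emit( v, sb );
        return sb.toString();
    }

    private static void emit( Object v, StringBuilder sb ) {
        if (v instanceof Map) {
            sb.append('{');
            boolean first = true;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) v).entrySet()) {
                if (!first) sb.append(',');
                first = false;
                sb.append('"').append(e.getKey()).append("\":");
                emit( e.getValue(), sb );
            }
            sb.append('}');
        } else if (v instanceof List) {
            sb.append('[');
            boolean first = true;
            for (Object o : (List<?>) v) {
                if (!first) sb.append(',');
                first = false;
                emit( o, sb );
            }
            sb.append(']');
        } else if (v instanceof String) {
            // Note: a real implementation must also escape string contents.
            sb.append('"').append(v).append('"');
        } else {
            sb.append(v); // numbers
        }
    }

    // Assemble the top of the structure from Table 2, in document order.
    static Map<String, Object> exampleSpec() {
        Map<String, Object> api = new LinkedHashMap<>();
        api.put( "crawlPeriod", 6.0 );
        api.put( "keywords", Arrays.asList( "barack", "obama", "usa" ) );

        Map<String, Object> spec = new LinkedHashMap<>();
        spec.put( "campaignIdentifier",
                  "http://crawlspec.arcomem.eu/campaign1" );
        spec.put( "crawlSpecificationIdentifier",
                  "http://crawlspec.arcomem.eu/crawlSpecification1" );
        spec.put( "crawlStartDate", "2012-12-01" );
        spec.put( "crawlEndDate", "2012-12-31" );
        spec.put( "apiCrawlSpec", api );
        return spec;
    }
}
```

Calling SpecJson.write( SpecJson.exampleSpec() ) reproduces the beginning of
the example document above.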

Crawl Specification Access

Access to the crawl specification is provided by the framework.
A CrawlSpecification
interface is defined which defines functional methods for accessing the crawl spec.
Implementations of this interface provide access to crawl specifications
stored in different ways; in particular, there are implementations for crawl specs
stored in files, HDFS files and triple-stores. The type of crawl spec access is selected
in the framework configuration file with the key "crawlspec.class"
(see Configuration Options below).
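For example, the file-based implementation might be selected like this. Both
the property-file layout and the implementation class name below are
illustrative assumptions; only the "crawlspec.class" key itself is documented
here, so check the actual class names in the framework.

    # Framework configuration (illustrative): selects which
    # CrawlSpecification implementation the factory instantiates.
    # The class name below is a placeholder, not a real ARCOMEM class.
    crawlspec.class = eu.arcomem.framework.crawlspec.FileCrawlSpecification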

To get an instance of the crawl specification for your framework setup, use the
CrawlSpecificationFactory
to instantiate it. If you have a map-reduce Context you can
also pass that to the factory in case the crawl specification loader needs to be
configured.

CrawlSpecification cs = CrawlSpecificationFactory.loadCrawlSpecification();
List<URL> urls = cs.getSeedURLs();

or if you have a map-reduce context:

CrawlSpecification cs = CrawlSpecificationFactory.loadCrawlSpecification( context );

Related

Wiki: Architecture
Wiki: TryIt
