The crawler cockpit is the user interface to the crawler system developed
within ARCOMEM. It provides the means to set up crawl campaigns, monitor
their progress and peruse intermediate results.
Campaign configurations are transferred from the crawler cockpit to the
ARCOMEM analysis system via crawl specifications: data structures that store
the information used to guide the crawl.
Crawl specifications are immutable: once written, they cannot be changed.
When changes are made to a specification, a new crawl specification is
generated instead. This is a pragmatic decision that allows individual
analysis results to be linked directly to the specific crawl specification
that produced them, which would not be possible if specifications were
mutable. Each new crawl specification is given a new identifier and contains
“sub-specifications” for the configuration of the different parts of the
ARCOMEM system.
The crawl specification design has been implemented as Java classes in the
ARCOMEM project's crawl-specification sub-module. These allow crawl
specifications to be generated in Java and converted into and out of JSON or
RDF. There is a full example of their use in the jUnit test
CrawlSpecificationTest in the crawl-specification sub-module of the ARCOMEM
codebase.
```java
// Instantiate the container
this.crawlSpecification = new CrawlSpecification();

// Main crawl specification
this.crawlSpecification.campaignIdentifier =
    URI.create( RDF.ARCO + "campaign1" );
this.crawlSpecification.crawlSpecificationIdentifier =
    URI.create( RDF.ARCO + "crawlSpecification1" );
this.crawlSpecification.crawlStartDate = "2012-12-01";
this.crawlSpecification.crawlEndDate = "2012-12-31";

// API Crawler configuration
this.crawlSpecification.apiCrawlSpec = new APICrawlSpecification();
this.crawlSpecification.apiCrawlSpec.crawlPeriod = 6;
this.crawlSpecification.apiCrawlSpec.keywords = new ArrayList<String>(
    Arrays.asList( new String[]{ "barack", "obama", "usa" } ) );
// ...
```
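Because specifications are immutable, a revised specification is represented
as a new object with a fresh identifier rather than an in-place edit. The
sketch below illustrates the idea using the public fields shown above; the
deepCopy() helper is hypothetical and not part of the ARCOMEM API:

```java
// Hypothetical sketch: deriving a revised crawl specification.
// deepCopy() is an assumed helper, not an actual ARCOMEM method.
CrawlSpecification revised = deepCopy( this.crawlSpecification );

// The campaign stays the same, but the revised specification gets a new
// identifier so analysis results can be linked to it directly.
revised.crawlSpecificationIdentifier =
    URI.create( RDF.ARCO + "crawlSpecification2" );
revised.crawlEndDate = "2013-01-31";
```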
Storing and accessing these crawl specification objects has been made easier
through the CrawlSpecificationFactory class in the ARCOMEM core framework.
When the loadCrawlSpecification() method is called, the framework takes care
of selecting an appropriate crawl specification serializer and retrieving the
crawl specification.
The API Crawler is not developed in Java: it is based on a Python codebase,
so these Java classes are not available to it. To make the crawl
specification API more language agnostic, the crawl specification getters and
setters have been delegated to an HTTP service that uses JSON for external
communication. The Java CrawlSpecification class has a method that converts
directly to and from the JSON required by this service; the Python
implementation must produce JSON that conforms to the same format. The block
below shows the example from the CrawlSpecificationTest jUnit test in the
crawl-specification sub-module:
{"campaignIdentifier":"http://crawlspec.arcomem.eu/campaign1","crawlSpecificationIdentifier":"http://crawlspec.arcomem.eu/crawlSpecification1","crawlStartDate":"2012-12-01","crawlEndDate":"2012-12-31","apiCrawlSpec":{"crawlPeriod":6.0,"keywords":["barack","obama","usa"]},"htmlCrawlSpec":{"seedURLs":["http://news.bbc.co.uk/","http://telegraph.co.uk/"],"languagePreferences":[{"languageIdentifier":"en","weight":1.0},{"languageIdentifier":"*","weight":0.2}]},"onlineAnalysisConfig":{"urlPatterns":[{"weight":1.0,"pattern":"*news*"}],"applicationScopes":[{"applicationTag":"blog","weight":0.4},{"applicationTag":"news","weight":1.0}],"entityPreferences":[{"entityURI":"http://dbpedia.org/resource/Barack_Obama","entityName":"Barack Obama","weight":1.0}],"topicPreferences":[{"topicIdentifier":"healthcare","weight":1.0}]},"offlineAnalysiConfig":{}}
Table 2 - JSON elements required to write a crawl specification
Name | Type | Description |
---|---|---|
**Crawl Specification** | | |
campaignIdentifier | String (URI) | The URI of this campaign |
crawlSpecificationIdentifier | String (URI) | The URI of this crawl specification |
crawlStartDate | String (Date) | The start date of the crawl |
crawlEndDate | String (Date) | The end date of the crawl |
apiCrawlSpec | Object | The API crawler configuration |
htmlCrawlSpec | Object | The HTML crawler specification |
onlineAnalysisConfig | Object | The configuration for the online analysis engine |
offlineAnalysisConfig | Object | The configuration for the offline analysis engine |
**API Crawler Configuration** | | |
crawlPeriod | Double | The number of hours between each API crawl run |
keywords | Array[String] | Keywords used to search the API being crawled |
**HTML Crawler Configuration** | | |
seedURLs | Array[String] | The URLs used to seed the crawl |
languagePreferences | Array[Object] | A set of priority values for all preferred languages. Each object needs to have languageIdentifier (as an HTTP language identifier) and weight – a double between 0 and 1. |
**Online Analysis Configuration** | | |
urlPatterns | Array[Object] | A list of URL patterns which will be considered relevant documents. Each object needs to have pattern (a regex to match against) and weight - a double which weights the scores of the documents that match the pattern. |
applicationScopes | Array[Object] | Each object defines weights for particular applications. The object needs applicationTag defining the application (blog, news, etc.), and weight – a double which weights the scores of the documents from that application. |
entityPreferences | Array[Object] | Each object defines weights for particular entity URIs found within the document. The object needs entityURI, and entityName – the URI and string representing the entity – and weight – a double which weights the scores of documents containing those entities. |
topicPreferences | Array[Object] | Each object defines weights for particular topics found within the document. The object needs topicIdentifier – an identifier that represents a topic – and weight – a double which weights the scores of documents matching those topics. |
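To complement the API crawler example above, the sketch below populates the
HTML crawler portion of a specification. The class names
HTMLCrawlSpecification and LanguagePreference, and the LanguagePreference
constructor, are assumed by analogy with APICrawlSpecification and are
hypothetical; check the crawl-specification sub-module for the actual types:

```java
// Hypothetical sketch: populating the HTML crawler configuration.
// HTMLCrawlSpecification and LanguagePreference are assumed class names.
this.crawlSpecification.htmlCrawlSpec = new HTMLCrawlSpecification();
this.crawlSpecification.htmlCrawlSpec.seedURLs = new ArrayList<String>(
    Arrays.asList( "http://news.bbc.co.uk/", "http://telegraph.co.uk/" ) );

// Prefer English documents; give all other languages a low weight
LanguagePreference en  = new LanguagePreference( "en", 1.0 );
LanguagePreference any = new LanguagePreference( "*", 0.2 );
this.crawlSpecification.htmlCrawlSpec.languagePreferences =
    new ArrayList<LanguagePreference>( Arrays.asList( en, any ) );
```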
Access to the crawl specification is provided by the framework. A
CrawlSpecification interface defines functional methods for accessing the
crawl specification, and implementations of this interface provide access to
crawl specifications stored in different ways. In particular, there are
implementations for accessing crawl specifications stored in files, HDFS
files and triple-stores. The type of crawl specification access is defined in
the framework configuration file with the key "crawlspec.class"
(see Configuration Options below).
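For example, a configuration entry selecting a file-based implementation
might look like the line below; the class name is hypothetical and will
depend on the implementations actually shipped with the framework:

```
crawlspec.class=eu.arcomem.framework.FileCrawlSpecification
```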
To get an instance of the crawl specification for your framework setup, use
the CrawlSpecificationFactory to instantiate it. If you have a map-reduce
Context, you can also pass that to the factory in case the crawl
specification loader needs to be configured.
```java
CrawlSpecification cs = CrawlSpecificationFactory.loadCrawlSpecification();
List<URL> urls = cs.getSeedURLs();
```
or if you have a map-reduce context:
```java
CrawlSpecification cs = CrawlSpecificationFactory.loadCrawlSpecification( context );
```