The knowledge base is implemented on top of ARCOMEM's scalable HBase triple
store, H2RDF. The high-level functions to query it are in the rdfstore
project. It is implemented in Java, with a Python binding, and is released
under the GPLv3.
The Knowledge Base provides storage, indexing and retrieval mechanisms for all
the semantic data produced and utilised by the rest of the architectural
components. More specifically, it indexes and stores the RDF triples that
derive from the annotation of Web Objects, as performed by the online and
offline processing modules, and offers SPARQL querying capabilities while
maintaining scalability and high performance.
It can be accessed through its Java or Python API.
H2RDF relies on HBase and Zookeeper. It was tested with
Zookeeper 3.4.3.
To execute SPARQL queries, go to your master server and start a Zookeeper
quorum by running:

```
tar xzf zookeeper-3.4.3_light.tar.gz
zookeeper-3.4.3/bin/zkServer.sh start
```
On all worker nodes, run the following (replacing $master with the DNS name of
your Zookeeper master node):

```
hdfs dfs -mkdir /user/hbase/bulkAPITriples
hdfs dfs -chmod 777 /user/hbase/bulkAPITriples
tar xzf ApiCalls.tgz
hadoop jar H2RDF.jar concurrent.SyncPrimitive qTest $master c
```
To import RDF data, upload your N-Triples file to HDFS and run:

```
hadoop jar H2RDF.jar sampler.SamplerEx input_path HBaseTable
```
For more information, see
the public repository.
Three logical layers are used: the client API (Java or Python), a middle layer
that routes requests over Zookeeper queues, and the H2RDF triple store itself.
This leaves triple manipulation to the middle layer and lets the clients deal
only with high-level operations. Moreover, communication with the middle layer
uses a protocol with implementations in both Java and Python.
The middle layer relies on Zookeeper queues (implemented on top of the ZK file
system) for communication with the clients: each client uses a pair of queues,
one to send requests and the other to receive responses. On the other side, an
independent Java process pops each request, decodes it, calls the
corresponding triple store queries, encodes the reply and enqueues it.
JSON-RPC encoding could be used for the RPC parameters and responses.
Currently, a custom JSON format is implemented.
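As a rough sketch of how such a worker loop can be built on Zookeeper queues (illustrative only: the znode layout under `/queues/...` and the `RequestHandler` interface are assumptions, not the actual rdfstore classes):

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative worker loop for one client's queue pair: pop the oldest request,
// hand it to the triple store code, and enqueue the encoded reply.
public class QueueWorkerSketch {

    /** Placeholder for the code that decodes the JSON request, queries H2RDF and encodes the reply. */
    public interface RequestHandler {
        byte[] handle(byte[] requestJson);
    }

    public static void serveClient(ZooKeeper zk, String clientId, RequestHandler handler)
            throws Exception {
        String requestQueue = "/queues/" + clientId + "/requests";   // assumed layout
        String responseQueue = "/queues/" + clientId + "/responses"; // assumed layout

        while (true) {
            List<String> pending = zk.getChildren(requestQueue, false);
            if (pending.isEmpty()) {
                Thread.sleep(100);      // a real worker would set a watch instead of polling
                continue;
            }
            Collections.sort(pending);  // sequential znodes: serve the oldest request first
            String node = requestQueue + "/" + pending.get(0);

            byte[] request = zk.getData(node, false, null);
            byte[] reply = handler.handle(request);

            // Enqueue the reply for the client and remove the served request.
            zk.create(responseQueue + "/resp-", reply,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            zk.delete(node, -1);
        }
    }
}
```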
See rdfstore/README_API_call.txt
for more details on how to implement new
RPCs.
See the triple store connector for more
information on the RDF and JSON used for populating the triple store.
The figure above presents an overview of H2RDF's architecture. The system
imports RDF triples into three different HBase indices, using either the HBase
API or highly efficient Map-Reduce bulk import jobs.
Users can execute multiple SPARQL queries over the imported data; each query is
parsed by Jena's SPARQL parser to check its syntax and to build the query graph.
Our Join Planner iterates over the query graph and greedily chooses the next
join to execute, considering the selectivity and cost of all possible joins.
Joins are executed by the Join Executor module, which decides which
algorithm will be used for each join, out of a selection of Map-Reduce-based
and centralised join algorithms. While centralised joins are executed on a
single cluster node, distributed joins launch Map-Reduce jobs. After all joins
are executed, query results are stored in HDFS files and can be accessed by
iterators implemented in the client code.
H2RDF is also designed to execute concurrent SPARQL queries and to achieve
high query throughput by utilising all cluster resources, especially in
the case of centralised joins.
H2RDF utilises a Zookeeper quorum (a replicated key-value store) to offer a
multi-language client API and to schedule client requests across the available
server resources. Apache Zookeeper is a good fit here because it handles
synchronisation in distributed environments and provides client bindings for
several programming languages. We use
Apache Zookeeper to implement a distributed request queue. Clients post
requests to the queue while server instances constantly check the queue and
grab requests for execution. Requests are abstracted in order to provide a
high-level API that can be easily extended with new functionality. Each
request object posted to the Zookeeper queue contains a byte array of data
that consists of the request type as well as its input data, serialised in
JSON format. The implemented requests can be seen in the following table.
Name | Type | Input | Output | Description |
---|---|---|---|---|
executeQuery | 0 | Database table name, query string | Query results | Execute a regular SPARQL query |
getICS | 3 | Campaign Id | Serialised crawl specification | Returns the last crawl specification for the specified campaign Id |
putICS | 4 | Serialised crawl specification | True/false | Upload a new crawl specification |
bulkPutTriples | 5 | Database table name, NTriples file | True/false | Append the input triples to an HDFS file that will be used as input to the MapReduce bulk import job |
bulkLoadTriples | 6 | Database table name | True/false | Launch a bulk import MapReduce job to load the gathered triples to the database |
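As a rough illustration of the request format described above, a client could serialise an `executeQuery` request (type 0) and post it to the request queue along the following lines; the JSON field names and the queue path are assumptions, not the actual wire format used by rdfstore:

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical client-side sketch: wrap a SPARQL query in a JSON envelope that
// carries the request type and its input, then enqueue it for the servers.
public class RequestClientSketch {

    public static String postExecuteQuery(ZooKeeper zk, String table, String sparql)
            throws Exception {
        String json = "{\"type\": 0, \"input\": {"
                + "\"table\": \"" + table + "\", "
                + "\"query\": \"" + sparql.replace("\"", "\\\"") + "\"}}";

        // A persistent sequential znode acts as a queue entry; server instances
        // watching the queue pick it up, execute it and enqueue the result.
        return zk.create("/queues/requests/req-",
                         json.getBytes(StandardCharsets.UTF_8),
                         ZooDefs.Ids.OPEN_ACL_UNSAFE,
                         CreateMode.PERSISTENT_SEQUENTIAL);
    }
}
```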
The client code is implemented in both Java and Python in order to facilitate
communication between the different ARCOMEM modules and the triple store.
Furthermore, the server-side request handlers are reloaded before the
execution of each request, so API functionality can be added or removed
without restarting the H2RDF cluster.
The Java client offers more functionality and easier access to the database.
The main class used to connect to H2RDF is `gr.ntua.h2rdf.client.Store`.
The following code creates a store object that is connected to H2RDF:

```java
String address = "myserver.com";
String table = "MyDatabase";
String user = "UserName";

H2RDFConf conf = new H2RDFConf(address, table, user);
H2RDFFactory h2fact = new H2RDFFactory();
Store store = h2fact.connectStore(conf);
```
To connect to the database, users need to provide three configuration
parameters: the address of the server, the name of the database table, and the
user name.
Once connected, the store object can be used to invoke all the implemented
API methods. The functions provided by the Store object are:
Name | Input | Output | Description |
---|---|---|---|
add | Triple (Jena object that represents an RDF triple) | void | Add an RDF triple to the store. It uses the specified loader class. There are 3 types of loaders: 1) HBASE_SEQUENTIAL: adds the triples sequentially using the HBase API; 2) HBASE_BULK: does client-side buffering of the triples in order to use HBase bulk API operations; 3) BULK: gathers the triples into HDFS files and launches a MapReduce job to import them into HBase |
exec | SPARQL query string | ResultSet | Executes the SPARQL query and returns a ResultSet object that can be used to iterate over the result |
execOpenRDF | SPARQL query string | QueryResult<BindingSet> | Executes the SPARQL query and returns a QueryResult object that can be used to iterate over the result. This method is implemented in order to provide the same querying interface as openRDF and limit the effort of integrating modules designed to work using openRDF. |
setLoader | String (type of loader) | void | Set the loader used by the store object (HBASE_SEQUENTIAL, HBASE_BULK or BULK) |
putICS | Crawl specification object | void | Uploads a new crawl specification. |
getICS | URI of the campaign Id | Crawl specification object | Returns the last crawl specification for the specified campaign Id |
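For example, once a store object has been created as above, triples can be loaded and queries executed roughly as follows. This is a sketch based on the table above: the package locations of `H2RDFConf`/`H2RDFFactory`, the openRDF package names and the old Jena (`com.hp.hpl.jena`) node constructors are assumptions to be checked against the rdfstore sources:

```java
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryResult;

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;

import gr.ntua.h2rdf.client.H2RDFConf;      // assumed to live next to Store
import gr.ntua.h2rdf.client.H2RDFFactory;   // assumed to live next to Store
import gr.ntua.h2rdf.client.Store;

// Round-trip sketch: insert one triple, then query it back through the
// openRDF-compatible interface and iterate over the bindings.
public class StoreUsageSketch {

    public static void main(String[] args) throws Exception {
        H2RDFConf conf = new H2RDFConf("myserver.com", "MyDatabase", "UserName");
        Store store = new H2RDFFactory().connectStore(conf);

        // Buffer triples on the client side and use the HBase bulk API operations
        // (loader names as listed in the table above).
        store.setLoader("HBASE_BULK");
        store.add(Triple.create(
                Node.createURI("http://example.org/doc1"),
                Node.createURI("http://purl.org/dc/terms/title"),
                Node.createLiteral("An archived page")));

        // Query through execOpenRDF, which mirrors the openRDF querying interface.
        String sparql =
                "SELECT ?s ?title WHERE { ?s <http://purl.org/dc/terms/title> ?title }";
        QueryResult<BindingSet> results = store.execOpenRDF(sparql);
        while (results.hasNext()) {
            BindingSet row = results.next();
            System.out.println(row.getValue("s") + "\t" + row.getValue("title"));
        }
    }
}
```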
`rdfstore` also has a Python binding to most high-level operations under
`src/main/python`. The library modules come with examples of how to create an
ICS (`ics_example.py`) and how to call statistics functions (`api_examples.py`).
It requires the `python-zookeeper` package, which is not available for Debian
Squeeze. You can force the installation of the Wheezy packages after getting
them from the Debian site or from our `quick_start` repository:

```
dpkg -i --force-depends cdh4-repository_1.0_all.deb \
    libzookeeper-mt2_3.3.5+dfsg1-2_amd64.deb \
    python-zookeeper_3.3.5+dfsg1-2_amd64.deb
```
For instance, to create an ICS, edit the ICS in `ics_example.py`, update the
Zookeeper server and port if needed, and run it:

```
cd src/main/python
vi ics_example.py
python ics_example.py
```
This will write the ICS to the triple store, retrieve it from the triple store
and print the answer.