The knowledge base is implemented on top of ARCOMEM's scalable HBase triple
store, H2RDF. The high-level functions to query it are in the rdfstore
project. It is implemented in Java, with a Python binding, and is released
under the GPLv3.
The Knowledge Base provides storage, indexing and retrieval mechanisms for all
the semantic data produced and utilised by the rest of the architectural
components. More specifically, it indexes and stores the RDF triples that
derive from the annotation of Web Objects, as performed by the online and
offline processing modules, and offers SPARQL querying capabilities while
maintaining scalability and high performance.
It can be accessed through its Java or Python API.
H2RDF relies on HBase and Zookeeper. It was tested with
Zookeeper 3.4.3.
To execute SPARQL queries, go to your master server and start a Zookeeper
quorum by running:

```
tar xzf zookeeper-3.4.3_light.tar.gz
zookeeper-3.4.3/bin/zkServer.sh start
```
On all worker nodes, run the following (replacing $master with the DNS name of
your Zookeeper master node):

```
hdfs dfs -mkdir /user/hbase/bulkAPITriples
hdfs dfs -chmod 777 /user/hbase/bulkAPITriples
tar xzf ApiCalls.tgz
hadoop jar H2RDF.jar concurrent.SyncPrimitive qTest $master c
```
To import RDF data, upload your N-Triples file to HDFS and run:

```
hadoop jar H2RDF.jar sampler.SamplerEx input_path HBaseTable
```
For more information, see
the public repository.
Three logical layers are used: the client API (Java or Python), a middle layer
that routes requests over Zookeeper queues, and the H2RDF triple store itself.
This leaves triple manipulation to the middle layer and lets the clients deal
only with high-level operations. Moreover, communication with the middle layer
uses a protocol with implementations in both Java and Python.
The middle layer relies on Zookeeper queues (implemented on top of the ZK file
system) for communication with the clients: each client uses a pair of queues,
one to send requests and the other to receive responses. On the other side, an
independent Java process pops each request, decodes it, calls the
corresponding triple store queries, encodes the reply and enqueues it.
JSON-RPC encoding could be used for the RPC parameters and responses.
Currently, a custom JSON format is implemented.
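As a rough sketch of how such a worker loop can be built on Zookeeper queues (illustrative only: the znode layout under `/queues/...` and the `RequestHandler` interface are assumptions, not the actual rdfstore classes):

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative worker loop for one client's queue pair: pop the oldest request,
// hand it to the triple store code, and enqueue the encoded reply.
public class QueueWorkerSketch {

    /** Placeholder for the code that decodes the JSON request, queries H2RDF and encodes the reply. */
    public interface RequestHandler {
        byte[] handle(byte[] requestJson);
    }

    public static void serveClient(ZooKeeper zk, String clientId, RequestHandler handler)
            throws Exception {
        String requestQueue = "/queues/" + clientId + "/requests";   // assumed layout
        String responseQueue = "/queues/" + clientId + "/responses"; // assumed layout

        while (true) {
            List<String> pending = zk.getChildren(requestQueue, false);
            if (pending.isEmpty()) {
                Thread.sleep(100);      // a real worker would set a watch instead of polling
                continue;
            }
            Collections.sort(pending);  // sequential znodes: serve the oldest request first
            String node = requestQueue + "/" + pending.get(0);

            byte[] request = zk.getData(node, false, null);
            byte[] reply = handler.handle(request);

            // Enqueue the reply for the client and remove the served request.
            zk.create(responseQueue + "/resp-", reply,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            zk.delete(node, -1);
        }
    }
}
```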
See rdfstore/README_API_call.txt
for more details on how to implement new
RPCs.
See the triple store connector for more
information on the RDF and JSON used for populating the triple store.
The figure above presents an overview of H2RDF's architecture. The system
imports RDF triples into three different HBase indices, using either the HBase
API or highly efficient Map-Reduce bulk import jobs.
Users can execute multiple SPARQL queries over the imported data; each query is
parsed by Jena's SPARQL parser to check its syntax and to build the query graph.
Our Join Planner iterates over the query graph and greedily chooses the next
join to execute, considering the selectivity and cost of all possible joins.
Joins are executed by the Join Executor module, which decides which
algorithm will be used for each join, out of a selection of Map-Reduce-based
and centralised join algorithms. While centralised joins are executed on a
single cluster node, distributed joins launch Map-Reduce jobs. After all joins
are executed, query results are stored in HDFS files and can be accessed by
iterators implemented in the client code.
H2RDF is also designed to execute concurrent SPARQL queries and to achieve
high query throughput by utilising all cluster resources, especially in
the case of centralised joins.
H2RDF utilises a Zookeeper quorum (a replicated key-value store) to offer a
multi-language client API and to schedule client requests across the available
server resources. Apache Zookeeper is a good fit here because it handles
synchronisation in distributed environments and provides client bindings for
several programming languages. We use
Apache Zookeeper to implement a distributed request queue. Clients post
requests to the queue while server instances constantly check the queue and
grab requests for execution. Requests are abstracted in order to provide a
high-level API that can be easily extended with new functionality. Each
request object posted to the Zookeeper queue contains a byte array of data
that consists of the request type as well as its input data, serialised in
JSON format. The implemented requests can be seen in the following table.
Name | Type | Input | Output | Description |
---|---|---|---|---|
executeQuery | 0 | Database table name, query string | Query results | Execute a regular SPARQL query |
getICS | 3 | Campaign Id | Serialised crawl specification | Returns the last crawl specification for the specified campaign Id |
putICS | 4 | Serialised crawl specification | True/false | Upload a new crawl specification |
bulkPutTriples | 5 | Database table name, NTriples file | True/false | Append the input triples to an HDFS file that will be used as input to the MapReduce bulk import job |
bulkLoadTriples | 6 | Database table name | True/false | Launch a bulk import MapReduce job to load the gathered triples to the database |
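As a rough illustration of the request format described above, a client could serialise an `executeQuery` request (type 0) and post it to the request queue along the following lines; the JSON field names and the queue path are assumptions, not the actual wire format used by rdfstore:

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical client-side sketch: wrap a SPARQL query in a JSON envelope that
// carries the request type and its input, then enqueue it for the servers.
public class RequestClientSketch {

    public static String postExecuteQuery(ZooKeeper zk, String table, String sparql)
            throws Exception {
        String json = "{\"type\": 0, \"input\": {"
                + "\"table\": \"" + table + "\", "
                + "\"query\": \"" + sparql.replace("\"", "\\\"") + "\"}}";

        // A persistent sequential znode acts as a queue entry; server instances
        // watching the queue pick it up, execute it and enqueue the result.
        return zk.create("/queues/requests/req-",
                         json.getBytes(StandardCharsets.UTF_8),
                         ZooDefs.Ids.OPEN_ACL_UNSAFE,
                         CreateMode.PERSISTENT_SEQUENTIAL);
    }
}
```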
The client code is implemented in both Java and Python in order to facilitate
communication between the different ARCOMEM modules and the triple store.
Furthermore, the server-side request handlers are reloaded before the
execution of each request, so API functionality can be added or removed
without restarting the H2RDF cluster.
The Java client offers more functionality and easier access to the database.
The main class used to connect to H2RDF is `gr.ntua.h2rdf.client.Store`.
The following code creates a store object that is connected to H2RDF:

```java
String address = "myserver.com";
String table = "MyDatabase";
String user = "UserName";

H2RDFConf conf = new H2RDFConf(address, table, user);
H2RDFFactory h2fact = new H2RDFFactory();
Store store = h2fact.connectStore(conf);
```
To connect to the database, users need to provide three configuration
parameters: the address of the server, the name of the database table, and the
user name.
Once connected, the store object can be used to invoke all the implemented
API methods. The functions provided by the Store object are:
Name | Input | Output | Description |
---|---|---|---|
add | Triple (Jena object that represents an RDF triple) | void | Add an RDF triple to the store. It uses the specified loader class. There are 3 types of loaders: 1) HBASE_SEQUENTIAL: adds the triples sequentially using the HBase API; 2) HBASE_BULK: does client-side buffering of the triples in order to use HBase bulk API operations; 3) BULK: gathers the triples into HDFS files and launches a MapReduce job to import them into HBase |
exec | SPARQL query string | ResultSet | Executes the SPARQL query and returns a ResultSet object that can be used to iterate over the result |
execOpenRDF | SPARQL query string | QueryResult<BindingSet> | Executes the SPARQL query and returns a QueryResult object that can be used to iterate over the result. This method is implemented in order to provide the same querying interface as openRDF and limit the effort of integrating modules designed to work using openRDF. |
setLoader | String (type of loader) | void | Set the loader used by the store object (HBASE_SEQUENTIAL, HBASE_BULK or BULK) |
putICS | Crawl specification object | void | Uploads a new crawl specification. |
getICS | URI of the campaign Id | Crawl specification object | Returns the last crawl specification for the specified campaign Id |
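For example, once a store object has been created as above, triples can be loaded and queries executed roughly as follows. This is a sketch based on the table above: the package locations of `H2RDFConf`/`H2RDFFactory`, the openRDF package names and the old Jena (`com.hp.hpl.jena`) node constructors are assumptions to be checked against the rdfstore sources:

```java
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryResult;

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;

import gr.ntua.h2rdf.client.H2RDFConf;      // assumed to live next to Store
import gr.ntua.h2rdf.client.H2RDFFactory;   // assumed to live next to Store
import gr.ntua.h2rdf.client.Store;

// Round-trip sketch: insert one triple, then query it back through the
// openRDF-compatible interface and iterate over the bindings.
public class StoreUsageSketch {

    public static void main(String[] args) throws Exception {
        H2RDFConf conf = new H2RDFConf("myserver.com", "MyDatabase", "UserName");
        Store store = new H2RDFFactory().connectStore(conf);

        // Buffer triples on the client side and use the HBase bulk API operations
        // (loader names as listed in the table above).
        store.setLoader("HBASE_BULK");
        store.add(Triple.create(
                Node.createURI("http://example.org/doc1"),
                Node.createURI("http://purl.org/dc/terms/title"),
                Node.createLiteral("An archived page")));

        // Query through execOpenRDF, which mirrors the openRDF querying interface.
        String sparql =
                "SELECT ?s ?title WHERE { ?s <http://purl.org/dc/terms/title> ?title }";
        QueryResult<BindingSet> results = store.execOpenRDF(sparql);
        while (results.hasNext()) {
            BindingSet row = results.next();
            System.out.println(row.getValue("s") + "\t" + row.getValue("title"));
        }
    }
}
```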
`rdfstore` also has a Python binding to most high-level operations under
`src/main/python`. The library modules come with examples of how to create an
ICS (`ics_example.py`) and how to call statistics functions (`api_examples.py`).
It requires the `python-zookeeper` package, which is not available for Debian
Squeeze. You can force the installation of the Wheezy packages after getting
them from the Debian site or from our `quick_start` repository:

```
dpkg -i --force-depends cdh4-repository_1.0_all.deb \
    libzookeeper-mt2_3.3.5+dfsg1-2_amd64.deb \
    python-zookeeper_3.3.5+dfsg1-2_amd64.deb
```
For instance, to create an ICS, edit the ICS in `ics_example.py`, update the
Zookeeper server and port if needed, and run it:

```
cd src/main/python
vi ics_example.py
python ics_example.py
```
This will write the ICS to the triple store, retrieve it from the triple store
and print the answer.