ARCOMEM Wiki

Semantic and social web crawling

Brought to you by: arcomem

The Data Model

The ARCOMEM data model is a set of concepts that are linked by specific relationships.
The concepts and relationships have been modelled with Protégé and are represented as
an OWL and RDF ontology that is available through the ARCOMEM website.

The ARCOMEM Data Model

The ontology stipulates which relationships are valid for which concepts and
whether there are any constraints on those relationships. Instances of the
relationships can be represented as triples. For example, to state that
a web resource is an image, you would link an instance of a web resource (the subject)
to an instance of an image (the object) using the hasWebObject relationship (the predicate):

    arco:webresource1 arco:hasWebObject arco:image1

Here, the arco: prefix is shorthand for the base URI of the ontology; that is, the
full relationship (predicate) URI is actually

    http://www.gate.ac.uk/ontologies/arcomem/data-model/hasWebObject

...but that's much too long to type!
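
To make the prefix expansion concrete, the following is a minimal sketch that expands arco: names into full URIs and prints the example triple as one N-Triples statement. ArcoPrefixExample, expand() and toNTriple() are hypothetical names introduced for illustration and are not part of the ARCOMEM code; the base URI is taken from the predicate URI quoted above.

```java
import java.util.Map;

// Hypothetical helper (not part of ARCOMEM): expands prefixed names such as
// "arco:hasWebObject" into full URIs and formats one N-Triples statement.
class ArcoPrefixExample
{
    // Base URI, as implied by the full predicate URI quoted in the text
    static final String ARCO = "http://www.gate.ac.uk/ontologies/arcomem/data-model/";

    // Known prefixes; only "arco" is used in this document
    static final Map<String, String> PREFIXES = Map.of( "arco", ARCO );

    /** Expand "prefix:local" to a full URI; throws if the prefix is missing or unknown. */
    static String expand( final String prefixedName )
    {
        final int colon = prefixedName.indexOf( ':' );
        if( colon < 0 )
            throw new IllegalArgumentException( "No prefix in: " + prefixedName );

        final String base = PREFIXES.get( prefixedName.substring( 0, colon ) );
        if( base == null )
            throw new IllegalArgumentException( "Unknown prefix in: " + prefixedName );

        return base + prefixedName.substring( colon + 1 );
    }

    /** Format subject, predicate and object (all prefixed names) as an N-Triples line. */
    static String toNTriple( final String s, final String p, final String o )
    {
        return "<" + expand( s ) + "> <" + expand( p ) + "> <" + expand( o ) + "> .";
    }

    public static void main( final String[] args )
    {
        System.out.println(
            toNTriple( "arco:webresource1", "arco:hasWebObject", "arco:image1" ) );
    }
}
```

In practice an RDF library would handle namespaces for you; the sketch only shows what the shorthand stands for.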

Triples like this can be written directly to a knowledge base using a
TripleStoreConnector (see the Triple Store Connectors section for information
on creating and using a triple store connector).

The Data Model Java Classes

The data model has been converted (automatically) to Java classes. These Java
classes provide a means for interacting with instances of the data model.
They provide validation automatically, where applicable, and can be
serialized (using the OpenIMAJ RDFSerializer) into triples compatible
with the ontology.

With these classes it is possible to instantiate objects from the data model
as if they were Java objects and set and get the properties using regular
getters and setters. These getters and setters may provide validation.
For example,

    WebResourceImpl webResource = new WebResourceImpl();
    webResource.setURI( key.toString() );

    ImageImpl image = new ImageImpl();
    image.setURI( generateRandomURI() );
    webResource.getContainsWebObject().add( image );

Here we are just creating and setting properties for the top-level image
web resources as they arrive at a mapper. We use the getContainsWebObject()
method to add the image web object to the list of web objects associated with
the resource. The data model allows an unrestricted one-to-many relation
between WebResources and WebObjects, and this is translated into Java
through the use of Lists.
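
The List-based mapping of a one-to-many relation can be illustrated with a small self-contained sketch. The classes below are simplified, hypothetical stand-ins written for this example; the real WebResourceImpl and ImageImpl are generated from the ontology and will differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-ins for the generated data model classes.
class OneToManySketch
{
    /** Anything that can be attached to a web resource. */
    interface WebObject
    {
        String getURI();
    }

    static class Image implements WebObject
    {
        private String uri;

        public String getURI() { return uri; }
        public void setURI( final String uri ) { this.uri = uri; }
    }

    static class WebResource
    {
        private String uri;

        // Unrestricted one-to-many relation, modelled as a growable list
        private final List<WebObject> containsWebObject = new ArrayList<>();

        public String getURI() { return uri; }
        public void setURI( final String uri ) { this.uri = uri; }

        // Returns the live list, so callers add to it directly
        public List<WebObject> getContainsWebObject() { return containsWebObject; }
    }

    public static void main( final String[] args )
    {
        final WebResource resource = new WebResource();
        resource.setURI( "http://www.bbc.co.uk/" );

        // The image URI here is a made-up placeholder
        final Image image = new Image();
        image.setURI( "http://example.org/image/1" );
        resource.getContainsWebObject().add( image );

        System.out.println( resource.getContainsWebObject().size() );
    }
}
```

Because the getter exposes the list itself, there is no separate addWebObject() method in this sketch; a relation with a cardinality constraint would presumably be generated differently.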

To write this information to the triple-store we can utilise the OpenIMAJ
RDFSerializer class (developed for ARCOMEM) which takes a set of Java classes
and serialises them to RDF. The data model Java classes have been generated in
such a way that the serialiser can generate RDF that conforms to the ontology.

    // Write the n-triples to a String (could write to File)
    final StringWriter sw = new StringWriter();

    // Write to n-triples (could use other writers here)
    // See OpenRDF for the available writers
    final NTriplesWriter tw = new NTriplesWriter( sw );

    // Override the addTriple() method in a new RDFSerializer
    // so that it sends the triples to the OpenRDF triple writer
    final RDFSerializer rs = new RDFSerializer()
    {
        @Override
        public void addTriple( final Statement s )
        {
            try
            {
                tw.handleStatement( s );
            }
            catch( final RDFHandlerException e )
            {
                e.printStackTrace();
            }
        }
    };

    // We won’t output the Java class names into the RDF
    rs.setOutputClassNames( false );

    // Start the RDF writing
    tw.startRDF();

    // Start the serializing
    rs.serialize( webResource, webResource.getURI() );

    // End the RDF writing
    tw.endRDF();

    // The RDF is now in the sw StringWriter

The code above sets up an NTriplesWriter (part of OpenRDF), which takes
Statements (triples) and writes valid n-triples. The RDFSerializer class
calls the addTriple(Statement) method every time it generates a triple while
serializing the object. Each triple could instead be sent to a
TripleStoreConnector#writeTriple() method, but in this example we send it to
the NTriplesWriter (which itself is writing to a StringWriter).

A Note about the URI Scheme for Resources

When stored in HBase, every resource has a key. For web pages, this key is the URL of the
web page, although other key schemes may be used, for example to represent parts of Twitter feeds.
However, this key does not uniquely identify a single resource; it identifies a set of resources.
This set consists of the content of that URL as it was crawled at different moments in time.
For example, http://www.bbc.co.uk/ looks very different now than it did 10 years ago. So, when
resources are referred to in the knowledge base, it's important to know which resource we are talking about.

So, the ARCOMEM technical team have agreed on a scheme which allows reference to a specific
versioned resource within a specific crawl database. It is this URI that should be used
within the triple store for referencing web-resources.

The scheme is as follows:

    hbase://<table>/<timestamp>/<key>

The scheme (hbase://) refers to the fact that the data is stored within HBase. The
table is the table name of the crawled data; the timestamp is the timestamp
that HBase uses to refer to a specific version; the key is the HBase key that identifies the row.

For example:

    hbase://crawl_data/20121112151302/http://www.bbc.co.uk/

Until there is a way of globally identifying an HBase instance, there cannot be
a server portion to the URI, hence this is always empty.
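
As an illustration of the scheme, the sketch below builds and parses such URIs. HBaseResourceUri is a hypothetical helper written for this page, not an ARCOMEM class. It assumes that the table name and timestamp contain no '/' characters, and it keeps the timestamp as a String because its exact format is not specified beyond the example above.

```java
// Hypothetical helper (not part of ARCOMEM) for hbase://<table>/<timestamp>/<key> URIs.
class HBaseResourceUri
{
    static final String SCHEME = "hbase://";

    final String table;
    final String timestamp;   // kept as text; format assumed from the example
    final String key;         // may itself be a URL containing '/' characters

    HBaseResourceUri( final String table, final String timestamp, final String key )
    {
        this.table = table;
        this.timestamp = timestamp;
        this.key = key;
    }

    /** Build the URI string for this versioned resource. */
    String format()
    {
        return SCHEME + table + "/" + timestamp + "/" + key;
    }

    /** Split a URI into table, timestamp and key; everything after the second '/' is the key. */
    static HBaseResourceUri parse( final String uri )
    {
        if( !uri.startsWith( SCHEME ) )
            throw new IllegalArgumentException( "Not an hbase:// URI: " + uri );

        final String rest = uri.substring( SCHEME.length() );
        final int first = rest.indexOf( '/' );
        final int second = first < 0 ? -1 : rest.indexOf( '/', first + 1 );
        if( second < 0 )
            throw new IllegalArgumentException( "Expected <table>/<timestamp>/<key>: " + uri );

        return new HBaseResourceUri(
            rest.substring( 0, first ),
            rest.substring( first + 1, second ),
            rest.substring( second + 1 ) );
    }

    public static void main( final String[] args )
    {
        final HBaseResourceUri u =
            parse( "hbase://crawl_data/20121112151302/http://www.bbc.co.uk/" );
        System.out.println( u.table + " | " + u.timestamp + " | " + u.key );
    }
}
```

Only the first two separators are treated as delimiters, so keys that are themselves URLs survive a round trip through parse() and format() unchanged.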
