
GettingStarted

This page will help you get started with bigdata and is focused on its embedded use as an RDF database.

See the NanoSparqlServer for easy steps to deploy a bigdata SPARQL endpoint, either using an embedded Jetty server or as a WAR.

See Using_Bigdata_with_the_OpenRDF_Sesame_HTTP_Server for the significantly more complicated procedure required to deploy inside of the Sesame WAR.

See the CommonProblems page for a FAQ on common problems and how to fix them.

Where do I start?

I can answer this question best with another question - do you know how to use the Sesame 2 API? We have implemented the Sesame 2 API over bigdata. Sesame is an open source framework for storage, inferencing and querying of RDF data, much like Jena. The best place to start would be to head to openrdf.org[1], download Sesame 2.2, read their most excellent User Guide[2] (specifically Chapter 8 - “The Repository API”), and maybe try writing some code using their pre-packaged memory or disk based triple stores. If you have a handle on this you are 90% of the way to being able to use the bigdata RDF store.

[1] http://www.openrdf.org
[2] http://www.openrdf.org/doc/sesame2/users/

Where do I get the code?

Download

You can download the WAR from the bigdata sourceforge project page.

SVN

You can check out bigdata from SVN. Find the branch you want to use and substitute its name for "BRANCH" in the command below.

svn co https://bigdata.svn.sourceforge.net/svnroot/bigdata/branches/BRANCH bigdata

Ok, I understand how to use Sesame 2. What now?

If you understand Sesame 2 then you are no doubt familiar with the concept of a SAIL (Storage and Inference Layer). Well, we have implemented a SAIL over bigdata. So all you have to do is take the code you’ve written for the Sesame 2 API and instantiate a different SAIL class, specifically:

com.bigdata.rdf.sail.BigdataSail

You can get this Sesame 2 implementation either by downloading the source tree from SVN (see above) or by downloading the binary and/or source release from Sourceforge[1].

I would highly recommend checking out the bigdata trunk from SVN directly into Eclipse as its own project, because you will get a .classpath and .project that will automatically build everything for you.

There are several project modules at this time: bigdata (indices, journals, services, etc), bigdata-jini (jini integration providing for distributed services), bigdata-rdf (the RDFS++ database), and bigdata-sails (the Sesame 2.0 integration for the RDFS++ database). Each module bundles all necessary dependencies in its lib subdirectory.

If you are concerned about the size of the distribution, note the following dependencies are required only for the scale-out architecture:

- jini
- zookeeper

If you are doing a scale-up installation, then you do not need any of the jars in the bigdata-jini/lib directory.

In addition, ICU is required only if you want to take advantage of compressed Unicode sort keys. This is a great feature if you are using Unicode and care about this sort of thing, and it is available for both scale-up and scale-out deployments. ICU will be used by default if the ICU dependencies are on the classpath. See the com.bigdata.btree.keys package for further notes on ICU and Unicode options. For the brave, ICU also has an optional JNI library.

Removing jini and zookeeper can save you 10 MB. Removing ICU can save you 30 MB.

The fastutil dependency is also quite large. We plan to prune it in subsequent releases to only the class files bigdata actually needs.

[1] http://sourceforge.net/project/showfiles.php?group_id=191861

Is it really that easy?

No, of course not; life is never that easy. Bigdata currently has 70 configurable options, which makes it extremely flexible, yet somewhat bewildering. (This is why we encourage you to keep us in the loop as you evaluate bigdata, so that we can make sure you’re getting the most out of the database. Or better yet, buy a support contract.) Luckily, we’ve created some configuration files that represent various common “modes” with which you might want to run bigdata:

- Full Feature Mode. This turns on all of bigdata’s goodies - statement identifiers, free-text index, incremental inference and truth maintenance. This is how you would use bigdata in a system that requires statement-level provenance, free-text search, and incremental load and retraction.
- RDF-Only Mode. This turns off all inference and truth maintenance, for when you just need to store triples.
- Fast Load Mode. This is how we run bigdata when we are evaluating load and query performance, for example with the LUBM harness. This turns off some features that are unnecessary for this type of evaluation (statement identifiers and the free text index), which increases throughput. This mode still does inference, but it is database-at-once instead of incremental. It also turns off the recording of justification chains, meaning it is an extremely inefficient mode if you need to retract statements (all inferences would have to be wiped and re-computed). This is a highly specialized mode for highly specialized problem sets.

You can find these and other modes in the form of properties files in the bigdata source tree, in the “bigdata-sails” module, at:

bigdata-sails/src/samples/com/bigdata/samples[1]

Or let us help you devise the mode that is right for your particular problem. Of course we will always answer questions, but also please consider buying a support contract!

[1] http://bigdata.svn.sourceforge.net/viewvc/bigdata/trunk/bigdata-sails/src/samples/com/bigdata/samples/

So how do I put the database in triple store versus quad store mode?

We've set up three modes for bigdata that configure the store properly for triples, triples with provenance, and quads. Look for the TRIPLES_MODE, TRIPLES_MODE_WITH_PROVENANCE, and QUADS_MODE on AbstractTripleStore.Options and BigdataSail.Options.

Currently bigdata does not support inference or provenance for quads, so those features are automatically turned off in QUADS_MODE.
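
As a rough sketch of what picking a mode might look like in a Properties object: each mode is treated here as a single boolean flag. The literal key strings below are assumptions based on the constant names and the AbstractTripleStore class name; in real code, use the TRIPLES_MODE, TRIPLES_MODE_WITH_PROVENANCE, and QUADS_MODE constants themselves rather than string literals, and check the Options javadoc for their exact semantics.

```java
import java.util.Properties;

/** Sketch only: select one store mode by setting a single boolean property. */
final class StoreModeSketch {

    // ASSUMPTION: key prefix guessed from the AbstractTripleStore class name.
    // Prefer the constants on AbstractTripleStore.Options / BigdataSail.Options.
    static final String PREFIX = "com.bigdata.rdf.store.AbstractTripleStore.";

    enum Mode {
        TRIPLES("triplesMode"),
        TRIPLES_WITH_PROVENANCE("triplesModeWithProvenance"),
        QUADS("quadsMode");

        final String key; // full (assumed) property name for this mode

        Mode(String suffix) {
            this.key = PREFIX + suffix;
        }
    }

    /** Returns a fresh Properties object with exactly one mode flag set to "true". */
    static Properties forMode(Mode mode) {
        Properties p = new Properties();
        p.setProperty(mode.key, "true");
        return p;
    }

    public static void main(String[] args) {
        // Prints the single property that was set for quads mode.
        System.out.println(forMode(Mode.QUADS));
    }
}
```

The resulting Properties object would then be handed to the BigdataSail constructor in the same way as the properties loaded from one of the sample files.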

Ok, I’ve picked the bigdata configuration setting I want to work with. Help me write some code.

It’s easy. For the most part it’s the same as any Sesame 2 repository. This code is taken from bigdata-sails/src/samples/com/bigdata/samples/SampleCode.java

// use one of our pre-configured option-sets or "modes"
Properties properties =
    sampleCode.loadProperties("fullfeature.properties");

// create a backing file for the database
File journal = File.createTempFile("bigdata", ".jnl");
properties.setProperty(
    BigdataSail.Options.FILE,
    journal.getAbsolutePath()
    );

// instantiate a sail and a Sesame repository
BigdataSail sail = new BigdataSail(properties);
Repository repo = new BigdataSailRepository(sail);
repo.initialize();
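
The loadProperties helper used above lives in SampleCode and is not shown on this page. A minimal sketch of what such a helper might do, using only the JDK, is below. The class and method names here are hypothetical, and the real helper may differ; the sketch simply reads a .properties resource that sits next to a class on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Sketch only: load a mode file such as "fullfeature.properties" from the classpath. */
final class LoadPropertiesSketch {

    /** Loads a resource located relative to the given class (hypothetical helper). */
    static Properties loadProperties(Class<?> anchor, String resource) {
        try (InputStream in = anchor.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IllegalArgumentException("resource not found: " + resource);
            }
            return load(in);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Parses standard .properties syntax from a stream (ISO-8859-1, per the JDK). */
    static Properties load(InputStream in) {
        Properties p = new Properties();
        try {
            p.load(in);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return p;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a properties file; the key is made up for the demo.
        byte[] text = "example.option=true\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(load(new ByteArrayInputStream(text)));
    }
}
```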

We now have a Sesame repository that is ready to use. Anytime we want to “do” anything (load data, query, delete, etc), we need to obtain a connection to the repository. This is how I usually use the Sesame API:

RepositoryConnection cxn = repo.getConnection();
cxn.setAutoCommit(false);
try {

    ... // do something interesting

    cxn.commit();
} catch (Exception ex) {
    cxn.rollback();
    throw ex;
} finally {
    // close the repository connection
    cxn.close();
}

Make sure to always use autoCommit=false! Otherwise the SAIL automatically does a commit after every single operation! This causes severe performance degradation and also causes the bigdata journal to grow very large.
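
To make the cost concrete, here is a toy illustration of the arithmetic. CountingConnection is a made-up stand-in that only counts commits; it is not the Sesame RepositoryConnection and says nothing about how bigdata implements commits.

```java
/** Toy sketch: count commits with and without auto-commit. Not the Sesame API. */
final class CommitCountSketch {

    /** Hypothetical stand-in for a connection; it only tracks how often it commits. */
    static final class CountingConnection {
        private final boolean autoCommit;
        private int commits;

        CountingConnection(boolean autoCommit) {
            this.autoCommit = autoCommit;
        }

        void add(String statement) {
            if (autoCommit) {
                commits++; // every single operation becomes its own commit
            }
        }

        void commit() {
            commits++;
        }

        int commits() {
            return commits;
        }
    }

    /** Adds the given number of statements and reports how many commits happened. */
    static int commitsFor(int statements, boolean autoCommit) {
        CountingConnection cxn = new CountingConnection(autoCommit);
        for (int i = 0; i < statements; i++) {
            cxn.add("stmt-" + i);
        }
        if (!autoCommit) {
            cxn.commit(); // one explicit commit for the whole batch
        }
        return cxn.commits();
    }

    public static void main(String[] args) {
        System.out.println("autoCommit=true:  " + commitsFor(1000, true) + " commits");
        System.out.println("autoCommit=false: " + commitsFor(1000, false) + " commits");
    }
}
```

Loading 1000 statements costs 1000 commits in the first case and one in the second, which is why batching work between explicit commits matters.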

Inside that “do something interesting” section you might want to add a statement:

Resource s = new URIImpl("http://www.bigdata.com/rdf#Mike");
URI p = new URIImpl("http://www.bigdata.com/rdf#loves");
Value o = new URIImpl("http://www.bigdata.com/rdf#RDF");
Statement stmt = new StatementImpl(s, p, o);
cxn.add(stmt);

Or maybe you’d like to load an entire RDF document:

String baseURL = ... // the base URL for the document
InputStream is = ... // input stream to the document
Reader reader = new InputStreamReader(new BufferedInputStream(is));
cxn.add(reader, baseURL, RDFFormat.RDFXML);

Once you have data loaded you might want to read some data from your database. Note that by casting the statement to a “BigdataStatement”, you can get at additional information like the statement type (Explicit, Axiom, or Inferred):

URI uri = ... // a Resource that you’d like to know more about
RepositoryResult<Statement> stmts =
    cxn.getStatements(uri, null, null, true /* includeInferred */);
while (stmts.hasNext()) {
    Statement stmt = stmts.next();
    Resource s = stmt.getSubject();
    URI p = stmt.getPredicate();
    Value o = stmt.getObject();
    // do something with the statement

    // cast to BigdataStatement to get at additional information
    BigdataStatement bdStmt = (BigdataStatement) stmt;
    if (bdStmt.isExplicit()) {
        // do one thing
    } else if (bdStmt.isInferred()) {
        // do another thing
    } else { // bdStmt.isAxiom()
        // do something else
    }
}

Of course, one of the most interesting things you can do is run high-level queries against the database. Sesame 2 repositories support the open-standard query language SPARQL[1] and a native Sesame query language, SeRQL[2]. Formulating high-level queries is outside the scope of this document, but assuming you have formulated your query you can execute it as follows:

final QueryLanguage ql = ... // the query language
final String query = ... // a “select” query
TupleQuery tupleQuery = cxn.prepareTupleQuery(ql, query);
tupleQuery.setIncludeInferred(true /* includeInferred */);
TupleQueryResult result = tupleQuery.evaluate();
// do something with the results

Personally, I find “construct” queries more useful because they let you grab a real subgraph from your database:

// construct queries can't guarantee distinct results, so collect them into a set
final Set<Statement> results = new LinkedHashSet<Statement>();
final GraphQuery graphQuery = cxn.prepareGraphQuery(ql, query);
graphQuery.setIncludeInferred(true /* includeInferred */);
graphQuery.evaluate(new StatementCollector(results));
// do something with the results
for (Statement stmt : results) {
    ...
}

While we’re at it, using the bigdata free text index is as simple as writing a high-level query. Bigdata uses a magic predicate to indicate that the free-text index should be used to find bindings for a particular variable in a high-level query. The free-text index is a Lucene-style index that matches whole words or prefixes.

RepositoryConnection cxn = repo.getConnection();
cxn.setAutoCommit(false);
try {
    cxn.add(new URIImpl("http://www.bigdata.com/A"), RDFS.LABEL,
            new LiteralImpl("Yellow Rose"));
    cxn.add(new URIImpl("http://www.bigdata.com/B"), RDFS.LABEL,
            new LiteralImpl("Red Rose"));
    cxn.add(new URIImpl("http://www.bigdata.com/C"), RDFS.LABEL,
            new LiteralImpl("Old Yellow House"));
    cxn.add(new URIImpl("http://www.bigdata.com/D"), RDFS.LABEL,
            new LiteralImpl("Loud Yell"));
    cxn.commit();
} catch (Exception ex) {
    cxn.rollback();
    throw ex;
} finally {
    // close the repository connection
    cxn.close();
}

String query = "select ?x where { ?x <"+BNS.SEARCH+"> \"Yell\" . }";
executeSelectQuery(repo, query, QueryLanguage.SPARQL);
// will match A, C, and D
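
If it is not obvious why A, C, and D match while B does not, the following plain-Java sketch mimics the "whole words or prefixes" rule on the four labels. It is only an illustration of the matching rule as described here, not bigdata's index implementation; the whitespace tokenization and case folding are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch only: a label matches if any of its words starts with the search term. */
final class PrefixMatchSketch {

    /** True if some whitespace-separated token of the label starts with the query. */
    static boolean matches(String label, String query) {
        String q = query.toLowerCase(Locale.ROOT); // ASSUMPTION: case-insensitive
        for (String token : label.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.startsWith(q)) {
                return true; // covers whole-word matches as well as prefixes
            }
        }
        return false;
    }

    /** Returns the keys whose labels match, in insertion order. */
    static List<String> search(Map<String, String> labels, String query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : labels.entrySet()) {
            if (matches(e.getValue(), query)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    /** The four labels from the sample above, keyed by the last URI segment. */
    static Map<String, String> sampleLabels() {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("A", "Yellow Rose");
        labels.put("B", "Red Rose");
        labels.put("C", "Old Yellow House");
        labels.put("D", "Loud Yell");
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(search(sampleLabels(), "Yell"));
    }
}
```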

You can find all of this code and more in the source tree at bigdata-sails/src/samples/com/bigdata/samples.[3]

[1] http://www.w3.org/TR/rdf-sparql-query/
[2] http://www.openrdf.org/doc/sesame/users/ch06.html
[3] http://bigdata.svn.sourceforge.net/viewvc/bigdata/trunk/bigdata-sails/src/samples/com/bigdata/samples/

You claim that you've "solved" the provenance problem for RDF with statement identifiers. Can you show me how that works?

Sure. The concept here is that RDF is very bad for making statements about statements. Well, at least it used to be. With the introduction of the concept of named graphs, we can now exploit the context position in a clever way to allow statements about statements without cumbersome reification. All that was required was a custom extension to RDF/XML to model quads. This is best illustrated through an example. Let's start with some RDF/XML:

<rdf:Description rdf:about="#Mike" >
    <rdfs:label bigdata:sid="_S1">Mike</rdfs:label>
    <bigdata:loves bigdata:sid="_S2" rdf:resource="#RDF" />
</rdf:Description>

<rdf:Description rdf:nodeID="_S1" >
    <bigdata:source>www.systap.com</bigdata:source>
</rdf:Description>

<rdf:Description rdf:nodeID="_S2" >
    <bigdata:source>www.systap.com</bigdata:source>
</rdf:Description>

You can see that we first assert two statements, assigning each a "sid" or statement identifier in the form of a bnode. Then we can use that bnode ID to make statements about the statements. In this case, we simply assert the source. We could assert all sorts of other things as well, including access control information, author, date, etc. Bigdata then maps these bnode IDs into internal statement identifiers. Each explicit statement in the database gets a unique statement identifier. You can then write a SPARQL query using the named graph feature to get at this information. So if I wanted to write a query to get at all the provenance information for the statement { Mike, loves, RDF }, it would look as follows:

String NS = "http://www.bigdata.com/rdf#";
String MIKE = NS + "Mike";
String LOVES = NS + "loves";
String RDF = NS + "RDF";
String query =
    "construct { ?sid ?p ?o } " +
    "where { " +
    "  ?sid ?p ?o ." +
    "  graph ?sid { <"+MIKE+"> <"+LOVES+"> <"+RDF+"> } " +
    "}";
executeConstructQuery(repo, query, QueryLanguage.SPARQL);
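
If you need this kind of query for more than one statement, the string building can be pulled into a small helper. The sketch below is just that, with a hypothetical class and method name; it produces the same query shape as above for any subject, predicate, and object given as absolute URIs, and does nothing beyond a basic sanity check on its inputs.

```java
/** Sketch only: build the provenance query for an arbitrary { s, p, o } statement. */
final class SidQuerySketch {

    /** Wraps a URI in angle brackets, rejecting characters that would break the query. */
    private static String ref(String uri) {
        if (uri.indexOf('<') >= 0 || uri.indexOf('>') >= 0 || uri.indexOf(' ') >= 0) {
            throw new IllegalArgumentException("not a plain URI: " + uri);
        }
        return "<" + uri + ">";
    }

    /** Returns a construct query that fetches everything asserted about the statement's sid. */
    static String provenanceQuery(String s, String p, String o) {
        return "construct { ?sid ?p ?o } "
             + "where { "
             + "  ?sid ?p ?o . "
             + "  graph ?sid { " + ref(s) + " " + ref(p) + " " + ref(o) + " } "
             + "}";
    }

    public static void main(String[] args) {
        String ns = "http://www.bigdata.com/rdf#";
        System.out.println(provenanceQuery(ns + "Mike", ns + "loves", ns + "RDF"));
    }
}
```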

This example is codified with the rest of the sample code in bigdata-sails/src/samples/com/bigdata/samples[1].

[1] http://bigdata.svn.sourceforge.net/viewvc/bigdata/trunk/bigdata-sails/src/samples/com/bigdata/samples/

Do you support SPARQL query hints?

You can embed query hints into SPARQL queries that will be passed into the query engine as a set of key-value pairs. These query hints can then be used by the query engine to parameterize the execution of particular queries. This is currently only useful if you plan to extend the query engine to handle your particular key-value pair; there are no pre-packaged sets of query hints at this time. See com.bigdata.rdf.store.BD.QUERY_HINTS_NAMESPACE for more details.

Anything else I need to know?

Make sure you are running with the -server JVM option and if possible, expand the heap size using the -Xmx option as well. You should see extremely good load and query performance. If you are not, please contact us and let us help you get the most out of our product.
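
A quick way to confirm that these options actually took effect is to ask the JVM at startup. The sketch below uses only standard JDK calls; the class name is made up, and whether the VM name mentions "Server" depends on the JVM vendor and version, so treat that part as a hint rather than a guarantee.

```java
/** Sketch only: report the VM name and the maximum heap the JVM will use. */
final class JvmCheckSketch {

    /** Maximum heap in MiB, as controlled by -Xmx (or the JVM default). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** The VM name, e.g. a string containing "Server VM" on some JVMs. */
    static String vmName() {
        return System.getProperty("java.vm.name", "unknown");
    }

    public static void main(String[] args) {
        System.out.println("VM: " + vmName());
        System.out.println("Max heap (MiB): " + maxHeapMiB());
    }
}
```

For example, launching with something like java -server -Xmx4g (the 4g figure is only a placeholder) should be reflected in the reported maximum heap.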
