
Using BigData in the Sesame Http Server

2011-08-26
2014-02-19
  • Michael Szalay

    Michael Szalay - 2011-08-26

    Hi all

    the following change in the OpenRDF Sesame Http Server made my application much much faster:

        if (request.getMethod().equalsIgnoreCase("GET")) {
            if ((repository instanceof SailRepository)
                    && (((SailRepository) repository).getSail() instanceof BigdataSail)) {
                // GET against a bigdata-backed repository: use a read-only connection.
                logger.info("About to create readonly connection.");
                SailRepository s = (SailRepository) repository;
                BigdataSail sail = (BigdataSail) s.getSail();
                BigdataSailConnection readOnlyConnection = sail.getReadOnlyConnection();
                BigdataSailRepository rd = new BigdataSailRepository(sail);
                repositoryCon = new BigdataSailRepositoryConnection(rd, readOnlyConnection);
                logger.info("Created bigdata readonly connection.");
            }
            else {
                // GET against any other repository: fall back to the default connection.
                logger.info("About to create readwrite connection.");
                repositoryCon = repository.getConnection();
                logger.info("Created readwrite connection.");
            }
        }
        else {
            // Any non-GET request (e.g. POST) may write, so use the default connection.
            logger.info("Create write connection because of post.");
            repositoryCon = repository.getConnection();
        }
    

    (in the file RepositoryInterceptor.java, project sesame-http-server-spring).

    It opens a read-only connection when the repository is a bigdata repository and the request is a GET.
    Therefore, concurrent reads are much more efficient.

    What is your opinion of doing that? Are there any negative side-effects?

    Regards Michael

     
  • Bryan Thompson

    Bryan Thompson - 2011-08-26

    This is a good idea.  One of the benefits of the NanoSparqlServer is that it is a native bigdata application and does such optimizations automatically.  Bigdata is fully concurrent for read-only connections and will deliver much better concurrent query performance with this modification.

    Unfortunately, there does not appear to be any way to get this "fix" into the openrdf code.  Maybe an extended transaction API?

     
  • Michael Szalay

    Michael Szalay - 2011-08-26

    I would like to use the NanoSparqlServer, but our app is coded against the OpenRDF Repository API… we use HttpRepository.
    I think we cannot access Nano with that repository, right?
    Is there any library to do that?

     
  • Bryan Thompson

    Bryan Thompson - 2011-08-26

    The NanoSparqlServer provides a general purpose RESTful interface for SPARQL query and update, but it does not expose the sorts of openrdf internal APIs that you are using.  We try to avoid writing applications to the OpenRDF repository API, preferring to operate on RDF/SPARQL instead.

    Thanks,
    Bryan

     
  • Michael Szalay

    Michael Szalay - 2011-09-01

    Another question: the read access with my patch is very fast now. However, writes are blocking the repository.
    Is there a configuration option in bigdata to prevent a single write connection from blocking the entire database?

     
  • Bryan Thompson

    Bryan Thompson - 2011-09-01

    It depends on what your write workload is like.  Using the BigdataSail without full read-write transactions provides higher throughput for updates, especially if the updates are large.  However, there is only one "unisolated" connection and concurrent writers will block until they acquire it.

    If you have a lot of small updates and would like them to proceed concurrently, you can configure the sail for full read-write transaction support (there is a difference in the index structures, so you need to re-load the data into the transactional sail).  The MVCC architecture allows us to reconcile add/add conflicts across transactions, but add/remove updates cannot be reconciled and one of the transactions will fail.

    The other way to achieve higher throughput is to merge your updates.  Each unisolated transaction will perform a full database commit when it completes.  That commit is often the main source of latency for small updates.  If you combine your updates so that they are bulkier, then you will spend more time actually writing data on the database and less time waiting for the disk to sync. This gives you a net higher throughput.  You can think of this as "application side" group commit.
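
The "application side" group commit described above can be sketched roughly as follows. This is a minimal illustration, not bigdata or OpenRDF code: `UpdateBatcher`, its `commitSink` callback, and the batch size are hypothetical names and parameters chosen for the example. The sink stands in for whatever writes a batch to the store and commits once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Toy illustration of application-side group commit: buffer small updates
 * and hand them to the store as one batch, so one commit covers many updates.
 * UpdateBatcher is a hypothetical helper, not part of bigdata or OpenRDF.
 */
final class UpdateBatcher<T> {
    private final int batchSize;
    // Assumed to write the whole batch and perform exactly one commit per call.
    private final Consumer<List<T>> commitSink;
    private final List<T> pending = new ArrayList<>();
    private int commits = 0;

    UpdateBatcher(int batchSize, Consumer<List<T>> commitSink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.commitSink = commitSink;
    }

    /** Queue one update and flush automatically once the batch is full. */
    synchronized void add(T update) {
        pending.add(update);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Pass all pending updates to the sink as one batch; no-op when empty. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        commitSink.accept(new ArrayList<>(pending));
        pending.clear();
        commits++;
    }

    /** Number of batches handed to the sink so far. */
    synchronized int commitCount() {
        return commits;
    }
}
```

In a Sesame-based application the sink would presumably add the batch through one `RepositoryConnection` and call `commit()` once. A real version would probably also flush on a timer so that a half-full batch does not wait indefinitely.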

    All of the above applies only to the single machine version of bigdata.  A cluster is shard-wise ACID for updates and supports commit groups natively.

    If you want to provide a little more information about your application workload and performance targets I might be able to offer some more advice.

    Thanks,
    Bryan

     
  • Michael Szalay

    Michael Szalay - 2011-09-04

    Thanks Bryan.

    We have an architecture with several jobs writing data and several jobs and end-user apps reading data.
    Most of the data (about 80%) is very "live": it comes in and disappears again. So the kind of "optimistic lock" you are describing will cause a lot of transaction failures, I fear.

    Regards Michael

     
  • Bryan Thompson

    Bryan Thompson - 2011-09-04

    If you have a heavy mixture of adds and removes then MVCC (which is what we use) will cause a lot of transaction failures.  2PL does not work well with graphs as they lack a suitable hierarchical structure (e.g., nothing corresponds really to row, page, table; you jump from statement to graph).  The MVCC reconciliation is at the level of the individual statement, not the B+Tree page, so conflicts only occur for specific statements.
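
To make the statement-level rule from this thread concrete, here is a toy model of the check: two transactions conflict only when they touch the same statement with opposite operations. It is not bigdata's implementation. `WriteSetCheck`, the `Op` enum, and the use of plain strings as statements are all hypothetical, and treating remove/remove as reconcilable is an assumption of the sketch, since the thread only discusses add/add and add/remove.

```java
import java.util.Map;

/** Toy model of statement-level write-set validation; not bigdata's code. */
final class WriteSetCheck {
    enum Op { ADD, REMOVE }

    private WriteSetCheck() {
    }

    /**
     * Returns true when both write sets could commit under the rule sketched
     * here: touching different statements never conflicts, add/add on the same
     * statement reconciles, and add/remove on the same statement does not.
     * Remove/remove is treated as reconcilable (an assumption of this sketch).
     */
    static boolean canReconcile(Map<String, Op> tx1, Map<String, Op> tx2) {
        for (Map.Entry<String, Op> entry : tx1.entrySet()) {
            Op other = tx2.get(entry.getKey());
            if (other != null && other != entry.getValue()) {
                return false; // same statement, opposite operations
            }
        }
        return true; // disjoint statements or identical operations
    }
}
```

With a workload where most statements are added and then removed shortly afterwards, the `false` branch would be hit often, which matches the concern about frequent transaction failures.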

    Overall, it sounds like the architecture is basically being used as a workflow queue.  We have done this sort of thing with persistent queues in JBoss by passing around "thick" graphs serialized on the queues and then materializing the data after a series of transforms in a target knowledge base.  This was for an entity extraction / co-resolution application.  We were able to query against the long term KB on the basis of the information in the graphs as they moved through the workflow, build on the information in those "thick" graphs, and then finally add their data to the KB.  However, your application sounds like it has a heavier mixture of deletes, unless those are just removing data which is moving through workflow states. 

    Feel free to contact me directly if you would like to talk more.

    Thanks,
    Bryan

    http://www.systap.com/contact.htm

     
