Does Bigdata 1.1 have a known issue with handling queries during a lengthy import/update? We're using the NanoSparqlServer with Bigdata 1.1, and are seeing the journal file expand drastically when doing a simple file update/import with an N3-formatted file of about 13M triples. The journal grows from 11GB to over 200GB, at which point the disk runs out of space.
The file is streamed using the url http://bigdata.somewhere.org/sparql?context-uri=http://some.org/context/1 using POST and with Content-Type header set correctly.
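For illustration, a request like the one described above could be built roughly as follows. This is only a sketch, not the poster's actual client: the function names `build_load_request` and `send` are hypothetical, and `text/rdf+n3` is an assumption about which N3 MIME type the server accepts.

```python
import urllib.parse
import urllib.request


def build_load_request(endpoint, context_uri, data, content_type="text/rdf+n3"):
    """Build a POST request that loads RDF data into the given context.

    `data` may be bytes or an open binary file object; urllib streams
    file objects using chunked transfer encoding.
    The default content type is an assumption, not taken from the thread.
    """
    # context-uri is passed as a URL query parameter, as in the post above.
    query = urllib.parse.urlencode({"context-uri": context_uri})
    return urllib.request.Request(
        f"{endpoint}?{query}",
        data=data,
        method="POST",
        headers={"Content-Type": content_type},
    )


def send(request):
    """Send the request and return the HTTP status code (hypothetical helper)."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

To stream a file rather than hold it in memory, pass an open binary file object (for example `open("update.n3", "rb")`, a made-up filename) as `data`.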
This import/update is happening in a nightly job, and there are some status pings hitting Bigdata during this time - hence my suspicion that those pings (causing queries) may be what's behind the dramatic journal file expansion that we're seeing. Is that possible?
I should add that when compacting the blown-up journal file (showing as 191GB on disk), Bigdata realizes that it's not really nearly that big:
bytes used: before=15788307610, after=11709998572, reducedBy=26%
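As a quick sanity check on those numbers, the arithmetic can be reproduced as below. The byte counts come from the compaction output above; reading "191GB" as 191 * 10^9 bytes is an assumption.

```python
# Byte counts copied from the compaction output above.
before = 15_788_307_610
after = 11_709_998_572

reclaimed = before - after
reduced_by_pct = 100 * reclaimed / before
print(f"reclaimed {reclaimed} bytes ({reduced_by_pct:.0f}%)")

# Assumption: "191GB on disk" means 191 * 10**9 bytes.
on_disk = 191 * 10**9
used_pct = 100 * before / on_disk
print(f"bytes in use before compaction: {used_pct:.1f}% of the file on disk")
```

In other words, only about 8% of the blown-up file held data that was actually in use.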
Ola, there are no known bugs related to RWStore recycling, long running imports, etc. You need to make sure that there are no query connections that are being held open across long running loads. Query connections that are left open will pin history and prevent the RWStore from recycling storage. You also should be using the recycler mode (minReleaseAge=1) rather than session protection (the default). This is all covered on the wiki. Please see the links from the home page on the RWStore and transaction support. Every time we have looked at this, it has always come down to an application problem. You can use the com.bigdata.txLog to help you analyze this. minReleaseAge and the txLog are described on the wiki.
The number of index pages that are touched during an update has to do with the distribution of the keys in those indices. A pathological case would have randomly distributed keys (UUIDs…).
Thanks, Bryan. Just to clarify, "Query connections that are left open" - would they show up on the NanoSparqlServer status page (http://bigdata.somewhere.org:8080/triplemap_store/status) in the "Running query count" section?
The com.bigdata.txLog is at a lower level. It reports directly on what the RWStore is observing. The running query count reports on the entries in a concurrent hash map for the query engine.
Are you asking about bigdata 1.1.0? Or about the SPARQL 1.1 support in bigdata? If you are talking about bigdata 1.1.0, please upgrade.
After upgrading to bigdata 1.2.2 and implementing the config change you suggested (com.bigdata.service.AbstractTransactionService.minReleaseAge=1), we're not seeing an impact on the rapid growth of the journal file during import.
We extracted a small portion of the data that is being loaded and have run some tests with it. The sample data is a 1GB file with 9,237,437 triples (and the longest triple in the file is 564 chars). After importing about a quarter of this file (we monitor the bytes sent to the NanoSparqlServer), the journal file size goes from 12GB (initial size) to 25GB in a few minutes. At that point we killed tomcat so as not to run out of disk space. At that growth rate, the journal file would have reached 60GB if we had let the file fully import.
The file we're importing contains simple labels, timestamps (e.g. "2012-09-17T03:00:00.000-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>), and triples that connect existing subjects in the triplestore (i.e. <existingUriA> <relatedTo> <existingUriB>).
Is it normal for this type of file to grow the size of the journal file so dramatically?
I am confused by your question. It says that you "are not" seeing a problem in the first paragraph and then goes on to describe some problem in the 2nd. Maybe you had a typo in the first paragraph?
The journal is very space efficient when used correctly. I am not sure what is mis-configured in your setup. There is extensive documentation both on the wiki and in the javadoc for the platform that will help you to diagnose your problem, including the txLog that was discussed above.
Sorry, that first paragraph was poorly worded. What I meant is that the problem remained the same after we implemented the changes you suggested: the journal file kept growing at an equally rapid pace.
I should have looked closer at the wiki earlier, my apologies. This section pretty much sums up why Bigdata will not work for our use case: https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=CommonProblems#Problem:_I_am_using_the_Journal_and_the_file_size_grows_very_quickly. We have a very dynamic data set and will be updating/adding data quite frequently - as Bigdata cannot handle this, we will have to find a different triplestore to meet our needs.
I'll update that section. It is old and refers to the WORM storage model. The RWStore backend does NOT have this problem. It is space efficient under any update pattern, but it can not recycle storage until you release open connections.
Hmm, so perhaps the import/update itself constitutes an open connection…? Perhaps if we break up the import in smaller chunks, Bigdata will be able to reclaim the space after importing each chunk. I'll give that a try.
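A rough sketch of the chunking idea follows. It assumes the data can be split on line boundaries, i.e. one complete triple per line with no @prefix directives or multi-line statements (true for N-Triples, not for N3 in general). `chunk_lines` is a hypothetical helper, and whether each chunk ends in its own commit depends on how the server handles each request.

```python
from itertools import islice


def chunk_lines(lines, chunk_size):
    """Yield strings that each contain at most `chunk_size` input lines.

    Assumes every line is a complete statement, so that each chunk is
    a valid document on its own.
    """
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield "".join(chunk)


# Example: three statements split into chunks of two.
sample = [
    "<a> <relatedTo> <b> .\n",
    "<b> <relatedTo> <c> .\n",
    "<c> <relatedTo> <a> .\n",
]
for i, body in enumerate(chunk_lines(sample, 2)):
    print(f"chunk {i}: {body.count(chr(10))} statement(s)")
```

Each yielded string would then be sent as its own POST, so that no single load spans the whole file.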
A writer will recycle allocations made during the write operation, but it can not recycle allocations that are currently committed until the next commit point.
Here is how to think about this. You have some data. It is committed. The database has given you a guarantee that you can read that committed data. Therefore, the committed data MUST remain available and storage associated with the committed data MUST NOT be recycled until (A) there is no longer any operation reading on that commit point; and (B) there is a new commit point.
Storage usage will go up if you are loading data onto the store. It needs to have both the old data (still visible through the last commit point) and the new data (currently being written). The storage for the old data will be recycled, but not until after the new data has been written, and not if there is a 2nd open connection.
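The two conditions above can be written down as a toy predicate. This is only a model of the rule as stated in this thread, not the RWStore's actual implementation; commit points are represented as plain integers and `can_recycle` is a hypothetical name.

```python
def can_recycle(commit_point, latest_commit_point, reader_commit_points):
    """Return True if storage for `commit_point` may be recycled.

    Toy model of the rule described above:
      (A) no operation is still reading from that commit point, and
      (B) a newer commit point exists.
    """
    no_readers = commit_point not in reader_commit_points
    superseded = latest_commit_point > commit_point
    return no_readers and superseded


# During a load: commit point 1 is the latest, so its data must stay.
print(can_recycle(1, latest_commit_point=1, reader_commit_points=set()))
# After the load commits (commit point 2), with no open readers.
print(can_recycle(1, latest_commit_point=2, reader_commit_points=set()))
# Same, but a second connection is still reading commit point 1.
print(can_recycle(1, latest_commit_point=2, reader_commit_points={1}))
```

The third case is the one that matters here: a connection left open on the old commit point keeps its storage pinned even after the load has committed.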
Thank you, Bryan - I will keep this in mind as we try to refactor how we're importing updates.
It will wind up using an appropriate amount of storage for the workload. You just have to bear in mind that an initial bulk load will be denser on the disk. If you go back and write more data, it will fill up unused allocators and then extend the store and start writing on that extension. The /status page can show you the RWStore allocator summary. So can DumpJournal. In the development branch, you can also run DumpJournal from the NSS on the live store.