We have a journal file that for some reason has grown unreasonably large (it's around 96GB at the moment). We suspect that at some point a process added a whole bunch of triples and that those were subsequently removed from the bigdata store. But the journal file size remains large. Is it possible to compact/shrink/optimize a journal file? We're using the NanoSparqlServer version of Bigdata.
Ola,
The RWStore reserves space when it needs to grow the file. The allocations are spread throughout each new extent. The addresses of those allocations are in the B+Tree nodes. It is not possible to "compact" the file in place.
However, there is a CompactJournalUtility that can be used to generate a new Journal from an existing Journal. It will only copy the most recent committed state into the new journal. It will do this for all indices on the source journal.
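Something along these lines should produce the compacted copy (I am writing this from memory, so treat it as a sketch - check the utility's usage output for the exact package name and arguments, and make sure the journal is not open elsewhere):

java -cp bigdata.jar com.bigdata.journal.CompactJournalUtility source.jnl compacted.jnl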
There is also an "Export/Import" utility described on the wiki (https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration#Export)
Neither of those utilities can be run while the NanoSparqlServer is running.
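Off the top of my head, the export side looks roughly like the following (again a sketch - the package name, the -outdir option, and the argument order are from memory, so see the wiki page above for the authoritative invocation):

java -cp bigdata.jar com.bigdata.rdf.sail.ExportKB -outdir /tmp/export RWStore.properties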
Thanks,
Bryan
Thanks, Bryan. I tried the export/import method - the export worked great, and I have a data.xml.gz file in TRIX format. Presumably then I should be able to import that into a NanoSparqlServer instance with the following:
curl -X POST -H "Content-Type: application/trix" -d @data.xml http://localhost:8282/bigdata/sparql
This looks like it's working, in that it takes a long time on a 36G input file. When the curl command finally completes, I get no response back and no errors in the logs. But there is nothing in the triplestore - a simple query returns no results and an ESTCARD query returns 0 for rangeCount.
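The ESTCARD check was a fast range count request of roughly this form (assuming the usual NanoSparqlServer REST syntax for ESTCARD):

curl 'http://localhost:8282/bigdata/sparql?ESTCARD'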
Ola,
Can you replicate your problem and file a ticket including a sample data file and the command (as above) that you are using?
Are you sure that the @data.xml file was appropriately formatted? Per the curl man page, -d defaults to ASCII mode and expects the file contents to be URL-encoded. Maybe you should be using --data-binary? I am not sure. I don't use curl that much.
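Something like this, perhaps (untested; --data-binary posts the file bytes exactly as they are, without stripping newlines):

curl -X POST -H "Content-Type: application/trix" --data-binary @data.xml http://localhost:8282/bigdata/sparql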
Thanks,
Bryan
We've added a listener to support incremental feedback on SPARQL UPDATE operations. However, we have not yet integrated this into the REST response for HTTP POSTs for SPARQL UPDATE. Currently, the HTTP status code indicates whether or not the operation was successful. However, if we start writing on the response in order to provide incremental feedback on the UPDATE progress, then the response can be committed (at the servlet level, that means it has flushed the first buffer to the client). After the response is committed, we cannot change the status code.
I think that we will probably deal with this by offering a URL query parameter that can be used to request incremental updates on the progress of SPARQL UPDATE operations. When specified, you will see a 200 (OK) status code and the server will write an XHTML response document that details the progress of the UPDATE request. That way we will not break the expectations of RESTful clients that expect the status code to indicate the outcome of the request (success or failure).
Bryan
Unfortunately, I cannot post the sample data I was using as it contains proprietary information. I'm assuming the file is properly formatted, as I didn't modify it after doing the export (ExportKB). So unless you think the bigdata export may have a problem, I'd say it's safe to assume the data is OK.
Ola,
I do not need your data. But please post some data that demonstrates the problem. Also, note that the curl documentation says it expects the file contents to be URL-encoded. Export definitely does not do that. I suggest that you try this on a small file and try a binary encoding.
Bryan
Thanks, Bryan - I will try that before posting an issue.