We have a journal file that for some reason has grown unreasonably large (it's around 96GB at the moment). We suspect that at some point a process added a whole bunch of triples and that those were subsequently removed from the bigdata store. But the journal file size remains large. Is it possible to compact/shrink/optimize a journal file? We're using the NanoSparqlServer version of Bigdata.
Ola,
The RWStore reserves space when it needs to grow the file. The allocations are spread throughout each new extent. The addresses of those allocations are in the B+Tree nodes. It is not possible to "compact" the file in place.
However, there is a CompactJournalUtility that can be used to generate a new Journal from an existing Journal. It will only copy the most recent committed state into the new journal. It will do this for all indices on the source journal.
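Something along these lines should produce the compacted copy (I am writing this from memory, so treat it as a sketch - check the utility's usage output for the exact package name and arguments, and make sure the journal is not open elsewhere):

java -cp bigdata.jar com.bigdata.journal.CompactJournalUtility source.jnl compacted.jnl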
There is also an "Export/Import" utility described on the wiki (https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=DataMigration#Export)
Neither of those utilities can be run while the NanoSparqlServer is running.
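Off the top of my head, the export side looks roughly like the following (again a sketch - the package name, the -outdir option, and the argument order are from memory, so see the wiki page above for the authoritative invocation):

java -cp bigdata.jar com.bigdata.rdf.sail.ExportKB -outdir /tmp/export RWStore.properties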
Thanks,
Bryan
Thanks, Bryan. I tried the export/import method - the export worked great, and I have a data.xml.gz file in TRIX format. Presumably then I should be able to import that into a NanoSparqlServer instance with the following:
curl -X POST -H "Content-Type: application/trix" -d @data.xml http://localhost:8282/bigdata/sparql
This looks like it's working, in that it takes a long time on a 36G input file. When the curl command finally completes, I get no response back and no errors in the logs. But there is nothing in the triplestore - a simple query returns no results and an ESTCARD query returns 0 for rangeCount.
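The ESTCARD check was a fast range count request of roughly this form (assuming the usual NanoSparqlServer REST syntax for ESTCARD):

curl 'http://localhost:8282/bigdata/sparql?ESTCARD'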
Ola,
Can you replicate your problem and file a ticket including a sample data file and the command (as above) that you are using?
Are you sure that the @data.xml file was appropriately formatted? Per the curl man page, -d defaults to ASCII mode and expects the file contents to be URL-encoded. Maybe you should be using --data-binary? I am not sure. I don't use curl that much.
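Something like this, perhaps (untested; --data-binary posts the file bytes exactly as they are, without stripping newlines):

curl -X POST -H "Content-Type: application/trix" --data-binary @data.xml http://localhost:8282/bigdata/sparql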
Thanks,
Bryan
We've added a listener to support incremental feedback on SPARQL UPDATE operations. However, we have not yet integrated this into the REST response for HTTP POSTs for SPARQL UPDATE. Currently, the HTTP status code indicates whether or not the operation was successful. However, if we start writing on the response in order to provide incremental feedback on the UPDATE progress, then the response can be committed (at the servlet level, that means it has flushed the first buffer to the client). After the response is committed, we cannot change the status code.
I think that we will probably deal with this by offering a URL query parameter that can be used to request incremental updates on the progress of SPARQL UPDATE operations. When specified, you will see a 200 (OK) status code and the server will write an XHTML response document that details the progress of the UPDATE request. That way we will not break the expectations of RESTful clients that expect the status code to indicate the outcome of the request (success or failure).
Bryan
Unfortunately, I cannot post the sample data I was using as it contains proprietary information. I'm assuming the file is properly formatted, as I didn't modify it after doing the export (ExportKB). So unless you think the bigdata export may have a problem, I'd say it's safe to assume the data is OK.
Ola,
I do not need your data. But please post some data that demonstrates the problem. Also, note that the curl documentation says it expects the file contents to be URL-encoded. Export definitely does not do that. I suggest that you try this on a small file and try a binary encoding.
Bryan
Thanks, Bryan - I will try that before posting an issue.