From: Grant I. <gsi...@ap...> - 2007-07-21 03:12:40
|
Are there any best practices or guidelines for using Aperture, especially with regard to implementing the CrawlerHandler? Thanks, Grant |
From: Antoni M. <ant...@df...> - 2007-07-24 08:55:41
|
Grant Ingersoll writes:
> Are there any best practices or guidelines for using Aperture,
> especially with regard to implementing the CrawlerHandler?
>
> Thanks,
> Grant

There is no document that describes these issues. The crawler handler is the interface between a crawler and your application. You receive the data objects from Aperture and can do anything you please with them.

But since you ask, I'd say people usually have problems with the underlying RDF2Go framework. Each RDF Model needs to be closed, and all iterators returned by queries need to be closed too. This implies the following pieces of advice:

1. Acquaint yourself with opening and closing models. Learn the difference between an RDFContainer working with a shared model and one having exclusive access to the model. Always close all models you use.

2. The crawler handler methods that accept a DataObject should always dispose the object after processing. Note that a DataObject wraps a metadata RDFContainer, which wraps a Model, which wraps an underlying RDF store implementation (e.g. a Sesame repository, which can have its own layered structure). All of these elements may be closed independently. Be aware of that fact.

3. Surround all queries to the model (e.g. findStatements, sparqlSelect, sparqlConstruct etc.) with a try/catch block. Close the ClosableIterators in a finally clause. Failing to close an iterator can lead to deadlocks.

4. Crawling large directories can yield quite a lot of information. Be aware that the simple RDF2Go.getModelFactory().createModel() method creates an in-memory model. If you have 2GB of PDFs and you extract the full text of them all into an RDF Model, it will likely consume all available memory. Use persistent storage for these tasks (e.g. by wrapping a RepositoryModel around a Sesame NativeStore).

Antoni Mylka
ant...@df... |
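[Editor's note] The iterator-closing pattern from point 3 can be sketched as follows. This is a self-contained illustration only: the real RDF2Go `ClosableIterator` and `Model` interfaces are replaced here by simplified stand-in types (`ClosableIterator`, `ToyModel`, `countStatements` are all hypothetical), so the sketch compiles without the Aperture/RDF2Go jars.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for RDF2Go's ClosableIterator; the real
// interface lives in org.ontoware.aifbcommons.collection.
interface ClosableIterator<T> extends Iterator<T> {
    void close();
}

class IteratorClosingSketch {

    // A toy "model" whose query result must be closed after use,
    // mimicking Model.findStatements(...).
    static class ToyModel {
        final List<String> statements = new ArrayList<>();
        boolean iteratorClosed = false;

        void add(String statement) { statements.add(statement); }

        ClosableIterator<String> findStatements() {
            final Iterator<String> it = statements.iterator();
            return new ClosableIterator<String>() {
                public boolean hasNext() { return it.hasNext(); }
                public String next()     { return it.next(); }
                public void remove()     { it.remove(); }
                public void close()      { iteratorClosed = true; }
            };
        }
    }

    // The pattern from point 3: iterate in a try block, close in a
    // finally clause, so the iterator is released even if processing
    // throws. With the real API, an unclosed iterator can deadlock.
    static int countStatements(ToyModel model) {
        ClosableIterator<String> it = model.findStatements();
        try {
            int count = 0;
            while (it.hasNext()) {
                it.next();
                count++;
            }
            return count;
        } finally {
            it.close();
        }
    }

    public static void main(String[] args) {
        ToyModel model = new ToyModel();
        model.add("doc1 hasTitle 'a'");
        model.add("doc1 hasText '...'");
        System.out.println(countStatements(model)); // prints 2
        System.out.println(model.iteratorClosed);   // prints true
    }
}
```

With the real RDF2Go API, the same shape applies to every `findStatements`, `sparqlSelect` and `sparqlConstruct` call.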
From: Grant I. <gsi...@ap...> - 2007-09-23 18:31:32
|
On Jul 24, 2007, at 4:55 AM, Antoni Mylka wrote:
> [snip]
>
> 4. Crawling large directories can yield quite a lot of information.
> Be aware that the simple RDF2Go.getModelFactory().createModel()
> method creates an in-memory model. If you have 2GB of PDFs and you
> extract the full text of them all into an RDF Model, it will likely
> consume all available memory. Use persistent storage for these tasks
> (e.g. by wrapping a RepositoryModel around a Sesame NativeStore).

Is there an example of wrapping persistent storage anywhere? |
From: <ant...@po...> - 2007-09-24 11:39:19
|
Grant Ingersoll writes:
> Is there an example of wrapping persistent storage anywhere?

You can't do it with the RDF2Go API alone. You need to instantiate a persistent store yourself using the API of the underlying RDF storage framework and then wrap it in the appropriate implementation of the Model/ModelSet interface. In the case of Sesame it would look like this:

    private ModelSet getPersistentModelSet(File directory)
            throws RepositoryException {
        NativeStore nativeStore = new NativeStore(directory);
        Repository repository = new SailRepository(nativeStore);
        repository.initialize();
        ModelSet modelSet = new RepositoryModelSet(repository);
        return modelSet;
    }

Antoni Mylka
ant...@gm... |
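[Editor's note] A hedged usage sketch of the method above, not from the original thread: the directory path is an example only, and the open()/close() calls follow the RDF2Go convention that a ModelSet must be opened before use and closed when done (closing releases the underlying NativeStore).

    // Hypothetical usage; path and crawl wiring are examples only.
    ModelSet modelSet = getPersistentModelSet(new File("/tmp/aperture-store"));
    modelSet.open();
    try {
        // ... hand the ModelSet (or Models obtained from it) to your
        // crawler handler and run the crawl ...
    } finally {
        modelSet.close(); // releases the persistent store
    }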
From: Christiaan F. <chr...@ad...> - 2007-09-25 08:16:58
|
Antoni Myłka wrote:
> Grant Ingersoll writes:
>> Is there an example of wrapping persistent storage anywhere?
>
> You can't do it with the RDF2Go API alone. You need to instantiate a
> persistent store yourself using the API of the underlying RDF storage
> framework and then wrap it in the appropriate implementation of the
> Model/ModelSet interface. In the case of Sesame it would look like this:
>
>     private ModelSet getPersistentModelSet(File directory)
>             throws RepositoryException {
>         NativeStore nativeStore = new NativeStore(directory);
>         Repository repository = new SailRepository(nativeStore);
>         repository.initialize();
>         ModelSet modelSet = new RepositoryModelSet(repository);
>         return modelSet;
>     }

I guess a Wiki page with some more info on this would certainly be helpful (who's volunteering? ;) ). For example, the above code is OK, but I would still recommend that people use in-memory Models during crawling (e.g. wrapped in the RDFContainer returned by CrawlerHandler.getRDFContainer) and add them to the ModelSet once their contents are complete, rather than using ModelSet.getModel(URI), as that makes for safer and faster code (no transaction on the persistent store for every single statement).

Regards,

Chris
-- |
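[Editor's note] The batching idea recommended above can be sketched with self-contained stand-in classes (the real RDF2Go Model/ModelSet and Sesame types are replaced by hypothetical `InMemoryModel` and `PersistentModelSet` here): writing each statement straight to the persistent store costs one transaction per statement, while buffering a data object's metadata in memory and adding the finished model costs one transaction total.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins that only illustrate the batching pattern; the real
// RDF2Go ModelSet and a Sesame NativeStore behave differently in detail.
class BatchingSketch {

    // Cheap in-memory buffer, analogous to an in-memory RDF2Go Model.
    static class InMemoryModel {
        final List<String> statements = new ArrayList<>();
        void addStatement(String s) { statements.add(s); }
    }

    // Persistent store stand-in that counts its "transactions".
    static class PersistentModelSet {
        final List<String> statements = new ArrayList<>();
        int transactions = 0;

        // One transaction per statement: slow for large crawls.
        void addStatement(String s) {
            transactions++;
            statements.add(s);
        }

        // One transaction for a whole finished model: the pattern
        // recommended in the mail above.
        void addModel(InMemoryModel model) {
            transactions++;
            statements.addAll(model.statements);
        }
    }

    public static void main(String[] args) {
        PersistentModelSet store = new PersistentModelSet();

        // Buffer one data object's metadata in memory during crawling...
        InMemoryModel buffer = new InMemoryModel();
        buffer.addStatement("doc1 hasTitle 'a'");
        buffer.addStatement("doc1 hasText '...'");

        // ...then add it to the persistent store in a single step.
        store.addModel(buffer);

        System.out.println(store.statements.size()); // prints 2
        System.out.println(store.transactions);      // prints 1
    }
}
```

The same shape applies with the real types: build the metadata in an in-memory Model inside the RDFContainer, then add its contents to the persistent ModelSet once the data object is complete.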