avoid data duplication


  • Anonymous

    Hi, I'm using the ExampleImapCrawler with Virtuoso as the database, but I get duplicated data in the database when I run the crawler more than once.

    Reading the crawler wiki, I found this:

    "If you'd like to perform incremental crawling of really big data sources containing thousands of data objects, you might need an AccessData implementation that doesn't store everything it knows in main memory. For this end we created ModelAccessData. An AccessData implementation that stores everything in an RDF2Go Model. With it you can create a persistent Model backed by a Sesame NativeStore (as described in ) and pass it to the ModelAccessData constructor. This will give you a scalable solution for really big data sets."

    I think that will solve my problem. Does anyone have a code example of incremental crawling?


  • Antoni Mylka

    Try running the ExampleImapCrawler with the -accessDataStore parameter. Give it a folder name. The folder will be used to store a database of emails which have already been seen on previous crawls. This should solve your problem.

    In general, the information necessary for proper incremental crawling is stored in an AccessData object. Aperture provides two implementations of the AccessData interface - AccessDataImpl (for in-memory storage, simple yet not scalable) and ModelAccessData (can be backed by a Model which in turn is backed by a persistent Sesame repository).

    Use the PersistentStoreCrawlingExample.createPersistentModelSet method to create a persistent ModelSet. Then:

    AccessData ad = new ModelAccessData(modelSet.getDefaultModel());

    Once the crawler is using this AccessData, it should behave correctly across multiple crawls.
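
    To illustrate why a persistent store stops the duplicates, here is a minimal, self-contained sketch of the idea. It does not use Aperture at all: SeenStore, markIfNew and crawl are hypothetical names made up for this example, and the real AccessData implementations do more than remember IDs. It only shows that if the "already seen" information survives between runs, a second crawl reports only new items.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical stand-in for a persistent AccessData (NOT Aperture's API):
 * remembers which IDs were seen, and keeps that memory in a file between runs.
 */
public class SeenStore {
    private final Path file;
    private final Set<String> seen = new HashSet<>();

    /** Loads the IDs recorded by earlier runs, if the file already exists. */
    public SeenStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            seen.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    /** Returns true only the first time an ID is offered, and appends it to the file. */
    public boolean markIfNew(String id) throws IOException {
        if (!seen.add(id)) {
            return false; // already known from this run or an earlier one
        }
        Files.write(file,
                (id + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }

    /** One simulated crawl: returns only the IDs not reported by earlier crawls. */
    public static List<String> crawl(Path storeFile, List<String> ids) throws IOException {
        SeenStore store = new SeenStore(storeFile);
        List<String> fresh = new ArrayList<>();
        for (String id : ids) {
            if (store.markIfNew(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }
}
```

    Calling SeenStore.crawl twice with the same file and the same IDs returns everything the first time and nothing the second time. With a purely in-memory store, the second run would start empty and report everything again, which is what produces duplicates in the database.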