Updating Autofocus Crawler

P Foomer
2010-10-19
2013-05-13
  • P Foomer
    2010-10-19

    Hi
    I am trying to migrate the Autofocus crawler libraries from Aperture 1.1 to 1.5, so I can add my own crawler.

    It appears not to be crawling (i.e. not identifying any new data for the repository). Are the changes between the two versions documented anywhere?

     
  • Antoni Mylka
    2010-10-20

    :)

    It's been quite some time since Aperture 1.1. Autofocus isn't available for download anymore.

    The best (only) documentation of changes is the CHANGELOG file in the distribution. I forwarded your question to the Autofocus authors. I'll get back to you when I know more.

     
  • Antoni Mylka
    2010-10-20

    It seems that AutoFocus is no longer supported.

    AFAIR there have been no significant API changes between 1.1 and 1.5, so it "should" work. That being said, without the source code there is little I can do to help.

     
  • P Foomer
    2010-10-20

    Hi

    Actually, from 1.3 onwards it does not work. I have written a crawler for MediaWiki, which uses new libraries, Aperture included.

    The Autofocus crawler used a RepositoryAccessData class which, after much searching, appears to have been superseded by different classes in later Aperture versions; extra methods are required.

    I ran the persistent repository test program, and it stores things differently (judging by the files in the repository directories) from the original Aperture used in Autofocus.

    Really I was looking for design documentation for aperture to try and determine how to modify autofocus to work.

    Currently the modified Autofocus CrawlingRepository (which appears to be the interface between Autofocus and Aperture) runs without error but does not crawl correctly, i.e. there is no increase in found terms and no results when searching for the terms.

    Running the Extraction test program supplied by Aperture results in the terms being seen, so that appears to work.

     
  • P Foomer
    2010-10-22

    Hi

    It seems I either have to persevere to get AF working with the new Aperture, which would be ideal as AF is just what I require, or look for an alternative to AF. Is there one?

     
  • P Foomer
    2010-10-23

    Well, replacing the commented-out lines below in the AutoFocus CrawlingRepository class with the ModelAccessData lines that follow them

        // setup an AccessData instance
        //RepositoryAccessData accessData = new RepositoryAccessData(
        //        connection,
        //        CrawlSchema.ACCESS_DATA_CONTEXT);

        // this works but the delete problem remains

        // First we need a model that will store the data source configuration
        Model model = RDF2Go.getModelFactory().createModel();
        // Don't forget to open it before it can be used
        model.open();

        ModelAccessData accessData = new ModelAccessData(model);

    works. However, removing a file from the source does not result in the term being removed on a refresh of the source.

    The only way to remove a term is to do a full rescan.

    This worked OK with Aperture 1.0.1 (as supplied with AF), using the commented-out lines above. With the new approach, using the Model class and Aperture 1.5.0, the problem appears.

    Any idea why this should be?
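
One thing that may be worth ruling out (an assumption, not a diagnosis): a refresh can only report deletions if the record of what the previous crawl saw survives until the refresh. If the model created above is an in-memory one that starts empty on each run, that record would be lost. The sketch below illustrates the effect in plain Java. It does not use Aperture or RDF2Go; SeenIds and refresh are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class RefreshSketch {

    /** Hypothetical stand-in for an access-data store: remembers the IDs seen by the last crawl. */
    static class SeenIds {
        private final Set<String> ids = new HashSet<>();

        Set<String> snapshot() {
            return new HashSet<>(ids);
        }

        void replaceWith(Set<String> current) {
            ids.clear();
            ids.addAll(current);
        }
    }

    /** Returns the IDs recorded by the previous crawl that are missing from the source now. */
    static Set<String> refresh(SeenIds store, Set<String> currentSource) {
        Set<String> removed = new TreeSet<>(store.snapshot());
        removed.removeAll(currentSource);
        store.replaceWith(currentSource);
        return removed;
    }

    public static void main(String[] args) {
        Set<String> before = new HashSet<>(Arrays.asList("a.txt", "b.txt"));
        Set<String> after = new HashSet<>(Arrays.asList("a.txt"));

        // Case 1: the same store is used for the first crawl and for the refresh.
        SeenIds kept = new SeenIds();
        refresh(kept, before);
        System.out.println("kept store, removed: " + refresh(kept, after));

        // Case 2: a new, empty store is created for the refresh.
        refresh(new SeenIds(), before);
        System.out.println("fresh store, removed: " + refresh(new SeenIds(), after));
    }
}
```

In this toy model only the first case reports b.txt as removed. Whether the same thing happens inside CrawlingRepository depends on how and where the access data is kept between crawls, which the snippet above does not show.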

     
  • Antoni Mylka
    2010-10-25

    But do you first index a source with the NEW Aperture and the NEW access data, then delete a file and refresh the source?

    Or is it that you index a source with the OLD Aperture/access data, then change the source code and try to refresh?

     
  • P Foomer
    2010-10-25

    Hi

    No, it is the first case, and it's the same when I tried the HTTP crawler.

    To confirm: the only way to make the index see the change is to do a complete rescan (delete, rescan) as opposed to a refresh.

    The only change from the original AF class is the code I posted earlier using the Model class.

    I also diffed the latest tarball from the svn and applied the changes. Now I am running into problems with the FileDataStore class (and other Datastore classes), which is not on the svn trunk, as it appears to be generated.

    The last time this happened I downloaded the svn trunk and ran mvn which appeared to create the files (hence my other posts re ttl files and svn not working), so now I am at an impasse!!

    My wikipedia crawler is based in part on the http crawler and extra bits to read the wiki without reading all the other stuff you get if you just crawl it with the http crawler.

    I am using Netbeans as I do not want to spend time learning another IDE (too old!!)

     
  • P Foomer
    2010-10-26

    Retrieved the latest trunk from svn using Ubuntu (see other thread).

    Diffed trunk with my copy of the code under NetBeans (after generating with mvn to get the DataSource classes).

    Compiled,
    cleaned indexes,
    added test file,
    found search item,
    removed test file,
    refreshed index,
    term still in index,
    rebuilt index resulting in term removed.

    Same as in my previous comments: I cannot remove terms on a refresh of the index, only on a clean and rebuild of the index.

    Help!!

     
  • Antoni Mylka
    2010-10-28

    You seem to have AutoFocus source code, downloaded in the days when it was open source and available for download.

    I don't, and it's difficult for me to help you.

    When we introduced SubCrawlers, the AccessData interface and all its implementations in the Aperture codebase underwent an overhaul. The RepositoryAccessData implementation in AutoFocus was outside Aperture at that time and did not get the same attention. (I did the aperture-core overhaul and had nothing to do with AutoFocus.)

    You could either fix RepositoryAccessData to make it pass the AccessDataTest: just create a subclass of AccessDataTest, similar to TestAccessDataImpl, and tweak RepositoryAccessData until all the tests pass,

    or

    remove all references to RepositoryAccessData from the AutoFocus codebase and replace them with ModelAccessData from Aperture.

    Then see what happens when a crawler finds a modified file, e.g. set a breakpoint in the objectModified method of the CrawlerHandler implementation in AutoFocus and see why the object modification is not registered correctly. If objectNotModified is not called at all, then the bug is in the crawler or the AccessData implementation.
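
As an alternative to breakpoints, the callbacks can be logged by wrapping the handler in a tracing decorator. The sketch below shows the pattern in plain Java. It does not implement Aperture's CrawlerHandler; CrawlEvents and its four methods are a hypothetical, reduced stand-in, and the real interface has more methods with different signatures.

```java
import java.util.ArrayList;
import java.util.List;

public class TracingHandlerSketch {

    /** Hypothetical, reduced stand-in for a crawler callback interface. */
    interface CrawlEvents {
        void onNew(String id);
        void onChanged(String id);
        void onUnchanged(String id);
        void onRemoved(String id);
    }

    /** Handler that ignores every event; stands in for the application's real handler. */
    static class NoOpHandler implements CrawlEvents {
        public void onNew(String id) { }
        public void onChanged(String id) { }
        public void onUnchanged(String id) { }
        public void onRemoved(String id) { }
    }

    /** Records each callback in order, then forwards it to the wrapped handler. */
    static class TracingHandler implements CrawlEvents {
        final List<String> log = new ArrayList<>();
        private final CrawlEvents delegate;

        TracingHandler(CrawlEvents delegate) {
            this.delegate = delegate;
        }

        public void onNew(String id) { log.add("new " + id); delegate.onNew(id); }
        public void onChanged(String id) { log.add("changed " + id); delegate.onChanged(id); }
        public void onUnchanged(String id) { log.add("unchanged " + id); delegate.onUnchanged(id); }
        public void onRemoved(String id) { log.add("removed " + id); delegate.onRemoved(id); }
    }

    public static void main(String[] args) {
        TracingHandler handler = new TracingHandler(new NoOpHandler());

        // Simulated refresh: one file untouched, one file deleted from the source.
        handler.onUnchanged("file:a.txt");
        handler.onRemoved("file:b.txt");

        for (String line : handler.log) {
            System.out.println(line);
        }
    }
}
```

If a refresh after deleting a file produces no "removed" entry in such a log, the removal is never reported to the handler, which points at the crawler or the access data rather than at the indexing side.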

    I don't have the source code to AutoFocus 5.0; it's not open source anymore and Aduna doesn't support it. If you find a bug in Aperture, file an issue on the tracker. As for getting an old version of AutoFocus (or any Aperture application for that matter) to work with new Aperture, I'm at a loss.
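
The first option mentioned above (making RepositoryAccessData pass AccessDataTest) follows the contract-test pattern: one set of checks is run against every implementation of an interface. A rough, self-contained illustration of that pattern follows. It does not use Aperture's AccessDataTest or AccessData; KeyValueStore, MapStore and passesContract are hypothetical stand-ins, and the real test suite covers far more behaviour.

```java
import java.util.HashMap;
import java.util.Map;

public class ContractTestSketch {

    /** Hypothetical, much-reduced stand-in for an access-data interface. */
    interface KeyValueStore {
        void put(String id, String key, String value);
        String get(String id, String key);
        void remove(String id);
        boolean isKnownId(String id);
    }

    /** Simple implementation backed by nested maps. */
    static class MapStore implements KeyValueStore {
        private final Map<String, Map<String, String>> data = new HashMap<>();

        public void put(String id, String key, String value) {
            data.computeIfAbsent(id, k -> new HashMap<>()).put(key, value);
        }

        public String get(String id, String key) {
            Map<String, String> entry = data.get(id);
            return entry == null ? null : entry.get(key);
        }

        public void remove(String id) {
            data.remove(id);
        }

        public boolean isKnownId(String id) {
            return data.containsKey(id);
        }
    }

    /** The contract: any implementation under test must pass these checks. */
    static boolean passesContract(KeyValueStore store) {
        store.put("file:a.txt", "date", "1");
        boolean stored = store.isKnownId("file:a.txt")
                && "1".equals(store.get("file:a.txt", "date"));

        store.remove("file:a.txt");
        boolean removed = !store.isKnownId("file:a.txt")
                && store.get("file:a.txt", "date") == null;

        return stored && removed;
    }

    public static void main(String[] args) {
        System.out.println("MapStore passes: " + passesContract(new MapStore()));
    }
}
```

The same passesContract call could then be pointed at a second implementation, which is the role a subclass of the shared test class plays in the suggestion above.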