From: Grant I. <gsi...@ap...> - 2008-08-20 21:04:06

I have followed the tutorial on using a persistent store at
http://aperture.wiki.sourceforge.net/PersistentCrawlingTutorial and it makes
sense as to what I need to do to enable a persistent store.

However, I am wondering if there is a way to only store the URI information.
My use case is that I am indexing the extracted content, so I have no need
for the triples, etc. after the content is extracted. However, I want the
persistent store to keep track of the URIs so that I can get notifications
about new/changed/removed objects. Any suggestions on how to do this?

Thanks,
Grant

From: Antoni M. <ant...@gm...> - 2008-08-20 22:09:17

Grant Ingersoll pisze:
> However, I am wondering if there is a way to only store the URI
> information. [...] I want the persistent store to keep track of the URIs so
> that I can get notifications about new/changed/removed objects.

The information required for the crawler to perform incremental crawling
(new/changed/unchanged/deleted objects) is encapsulated behind the AccessData
interface. There is an in-memory implementation (AccessDataImpl) and an
in-memory implementation that can serialize its contents to a gzipped XML
file (FileAccessData).

You can choose between the FileAccessData and the ModelAccessData backed by a
disk-based repository (the NativeStore). The first solution is faster and
simpler, but your access data will grow linearly with the number of objects.
If you want a more scalable approach, settle for the NativeStore:

    File folder = ... // the folder where the info will be stored
    Repository repo = new SailRepository(new NativeStore(folder));
    repo.initialize();
    Model model = new RepositoryModel(repo);
    model.open();
    ModelAccessData accessData = new ModelAccessData(model);
    accessData.initialize();

Then, when you have a crawler, simply call

    crawler.setAccessData(accessData);

before calling crawl(). On the first crawl, the crawler will populate the
access data with the appropriate info. If you use the same access data
instance on the second crawl, the crawler will report
new/modified/unmodified/deleted objects.

Note that one access data instance cannot be shared between multiple data
sources. You can still reuse the same folder, though, by storing information
from separate data sources in separate contexts in the native store; just
pass a context URI when initializing the repository model:

    File folder = ... // the folder where the info will be stored
    Repository repo = new SailRepository(new NativeStore(folder));
    repo.initialize();
    Model model = new RepositoryModel(contextUri, repo);
    model.open();
    ModelAccessData accessData = new ModelAccessData(model);
    accessData.initialize();

Hope this helps.

Antoni Mylka
ant...@gm...

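PS: if the simpler file-based route is enough for you, the setup is shorter.
A minimal sketch, assuming FileAccessData takes its persistence file in the
constructor - check the javadocs of your Aperture version:

    File dataFile = new File("accessdata.xml.gz"); // file name is just an example
    AccessData accessData = new FileAccessData(dataFile);
    accessData.initialize(); // loads previously stored entries, if the file exists

    crawler.setAccessData(accessData);
    crawler.crawl(); // the crawler itself calls store() at the end of the crawl
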
From: Leo S. <leo...@df...> - 2008-08-21 10:03:36

Hi Grant,

Antoni is right: if you just need the URIs, you can use the NativeStore and
Sesame. You can also *re-implement* your own AccessData class on top of MySQL
or some other relational database if that is what you dig - then we would be
happy to add this to the available AccessData implementations.

best
Leo

It was Antoni Myłka who said at the right time 21.08.2008 00:09 the following
words:
> The information required for the crawler to perform incremental crawling
> (new/changed/unchanged/deleted objects) is encapsulated behind the
> AccessData interface. [...]

--
DI Leo Sauermann       http://www.dfki.de/~sauermann
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz DFKI GmbH

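PS: to give an idea of how little the storage side of such an implementation
needs, here is a rough sketch of a relational backend. The JDBC URL,
credentials and table layout are placeholders, and a real implementation
would still have to cover the whole AccessData interface (initialize, store,
clear, getSize and friends):

    // placeholders: adjust URL / credentials to your setup
    Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/aperture", "user", "password");

    // one row per (resource id, key) pair
    conn.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS access_data ("
            + " id VARCHAR(512) NOT NULL,"
            + " k  VARCHAR(64)  NOT NULL,"
            + " v  VARCHAR(1024),"
            + " PRIMARY KEY (id, k))");

    // roughly what storing a (key, value) pair for a resource would do
    PreparedStatement put = conn.prepareStatement(
            "REPLACE INTO access_data (id, k, v) VALUES (?, ?, ?)");
    put.setString(1, "file:///home/user/doc.txt");
    put.setString(2, "date");
    put.setString(3, "1219246800000");
    put.executeUpdate();
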
From: Grant I. <gsi...@ap...> - 2008-08-21 14:30:18

On Aug 20, 2008, at 6:09 PM, Antoni Myłka wrote:
> Note that one access data instance cannot be shared between multiple data
> sources. You can still reuse the same folder, though, by storing information
> from separate data sources in separate contexts in the native store; just
> pass a context URI when initializing the repository model.

Thanks for the help, this mostly makes sense.

So, can the contextURI be my DataSource ID? As in DataSource.getID()? I
guess I will try it.

Also, does it mean that I can share the SailRepository between data sources?
Is that thread-safe?

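In other words, something along these lines (a sketch; I'm assuming
DataSource.getID() returns an RDF2Go URI that can be used directly as the
context):

    // one shared repository, one context per data source
    URI contextUri = dataSource.getID();
    Model model = new RepositoryModel(contextUri, repo);
    model.open();
    ModelAccessData accessData = new ModelAccessData(model);
    accessData.initialize();
    crawler.setAccessData(accessData);
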
From: Grant I. <gsi...@ap...> - 2008-08-21 19:38:54

More fun w/ AccessData:

I did something like:

    //Initialize the Crawler w/ the ModelAccessData
    crawler.crawl();
    int num = crawler.getAccessData().getSize();

And I get:

    SEVERE: java.lang.IllegalStateException: AccessData not initialized, call initialize() first
        at org.semanticdesktop.aperture.accessor.base.ModelAccessData.checkInitialization(ModelAccessData.java:97)
        at org.semanticdesktop.aperture.accessor.base.ModelAccessData.getSize(ModelAccessData.java:224)

It seems that the call by the crawler to the store() method causes the time
stamp used for initialization to be set to -1, thus causing the
checkInitialization to fail.

The following workaround works, but it just seems a bit weird to have to
initialize the AccessData immediately after crawling, since I initialized it
immediately before crawling when I set the AccessData on the Crawler:

    crawler.crawl();
    crawler.getAccessData().initialize();
    int num = crawler.getAccessData().getSize();

A possible note to add to the javadoc on crawl():

    /**
     * Starts crawling the domain defined in the DataSource of this Crawler. If this is a
     * subsequent run of this method, it will only report the differences with the previous
     * run, unless the previous scan results have been cleared. Any CrawlerListeners
     * registered on this Crawler will get notified about the crawling progress.
     *
     * NOTE: If using an AccessData object, the AccessData.store() method will be called by
     * the crawler, which means it will need to be initialized again for next use.
     */

Another question: how costly is initialize()?

From: Antoni M. <ant...@gm...> - 2008-08-21 21:57:54

Grant Ingersoll pisze:
> The following workaround works, but it just seems a bit weird to have to
> initialize the AccessData immediately after crawling, since I initialized it
> immediately before crawling when I set the AccessData on the Crawler.

AccessData is not intended to be used outside the crawler. If you want to get
the number of crawled objects, use

    crawler.getCrawlReport().getNewCount()
    crawler.getCrawlReport().getUnchangedCount()

etc. See the javadocs for the CrawlReport interface for more info.

> A possible note to add to the javadoc on crawl():
> [...]
> NOTE: If using an AccessData object, the AccessData.store() method will be
> called by the crawler, which means it will need to be initialized again for
> next use.

That may be a good idea. I'll include it.

> Another question: how costly is initialize()?

The way we do it now, i.e. in AccessDataImpl and ModelAccessData, initialize()
only involves setting one variable. In FileAccessData initialization involves
reading and parsing the file into memory, so it can potentially be costly if
the file is large.

Antoni Mylka
ant...@gm...

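PS: in code, the pattern is simply the following (the getters shown are the
ones from the CrawlReport interface used elsewhere in this thread):

    crawler.crawl();
    CrawlReport report = crawler.getCrawlReport();
    System.out.println("new: " + report.getNewCount()
            + ", changed: " + report.getChangedCount()
            + ", unchanged: " + report.getUnchangedCount()
            + ", removed: " + report.getRemovedCount());
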
From: Grant I. <gsi...@ap...> - 2008-08-21 22:50:09

On Aug 21, 2008, at 5:57 PM, Antoni Myłka wrote:
> AccessData is not intended to be used outside the crawler. If you want to
> get the number of crawled objects, use
>
>     crawler.getCrawlReport().getNewCount()
>     crawler.getCrawlReport().getUnchangedCount()
>
> etc. See the javadocs for the CrawlReport interface for more info.

Ah, very cool. Had missed that and was implementing my own counters. Thanks!

-Grant

From: Grant I. <gsi...@ap...> - 2008-08-25 15:17:02

On Aug 21, 2008, at 5:57 PM, Antoni Myłka wrote:
> AccessData is not intended to be used outside the crawler. If you want to
> get the number of crawled objects, use
>
>     crawler.getCrawlReport().getNewCount()
>     crawler.getCrawlReport().getUnchangedCount()
>
> etc. See the javadocs for the CrawlReport interface for more info.

My CallbackHandler doesn't necessarily process all objects it receives. What
do people think of a change to the interface that reported back whether the
object was "accepted" or not, with the crawl stats then incremented
accordingly? If so, I can work up a patch.

-Grant

From: Antoni M. <ant...@gm...> - 2008-08-26 06:39:55

2008/8/25 Grant Ingersoll <gsi...@ap...>:
> My CallbackHandler doesn't necessarily process all objects it receives. What
> do people think of a change to the interface that reported back whether the
> object was "accepted" or not, with the crawl stats then incremented
> accordingly? If so, I can work up a patch.

You mean you'd like getNewCount() to report the count of new objects that
have been 'accepted' by the CrawlerHandler? Could you elaborate a bit more on
the use case for this?

--
Antoni Myłka
ant...@gm...

From: Grant I. <gsi...@ap...> - 2008-08-26 11:09:24

On Aug 26, 2008, at 2:40 AM, Antoni Mylka wrote:
> You mean you'd like getNewCount() to report the count of new objects that
> have been 'accepted' by the CrawlerHandler? Could you elaborate a bit more
> on the use case for this?

Yeah, the accepted items. It is separate from whether an item is "touched" by
the crawler, I suppose, so I guess we wouldn't want to change the semantics of
the other counts. I can just keep the counter in my callback handler, too. I
guess I was thinking that if, for instance, one doesn't do any operations on
the Folder objects, one might not want them "counted".

From: Antoni M. <ant...@gm...> - 2008-08-27 19:47:27

Grant Ingersoll pisze:
> Yeah, the accepted items. It is separate from whether an item is "touched"
> by the crawler, I suppose, so I guess we wouldn't want to change the
> semantics of the other counts. I can just keep the counter in my callback
> handler, too. I guess I was thinking that if, for instance, one doesn't do
> any operations on the Folder objects, one might not want them "counted".

:)

I don't want to sound unhelpful, but whoever 'might' not want them "counted"
'might' also count the 'accepted' objects by him/herself :)

The use case is IMHO weak, the additional confusion this would introduce is
high, and changing the semantics of the CrawlReport in this way is
backwards-incompatible - even if we say that the objects are 'accepted' by
default and only an additional call to some 'reject' method would trigger
this behavior.

I'd rather not do this in Aperture.

Antoni Mylka
ant...@gm...

From: Grant I. <gsi...@ap...> - 2008-08-27 20:04:34

On Aug 27, 2008, at 3:47 PM, Antoni Myłka wrote:
> The use case is IMHO weak, the additional confusion this would introduce is
> high, and changing the semantics of the CrawlReport in this way is
> backwards-incompatible.
>
> I'd rather not do this in Aperture.

No worries, I will implement it in my handler. After sending it, I was
thinking more along the lines of having two counts, one for "touched" and one
for "accepted" in the CrawlReport, but like I said, I can get it via my
handler, too.

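For the record, the counter I have in mind is tiny - roughly the following.
The shouldIndex() filter is application-specific and purely illustrative, and
I'm assuming folders show up as FolderDataObject instances:

    private static class CountingHandler extends CrawlerHandlerBase {
        private int accepted = 0;

        @Override
        public void objectNew(Crawler crawler, DataObject object) {
            if (shouldIndex(object)) {
                accepted++;
                // ... hand the object over to the indexer here ...
            }
            object.dispose();
        }

        public int getAcceptedCount() {
            return accepted;
        }

        private boolean shouldIndex(DataObject object) {
            // e.g. skip folders entirely
            return !(object instanceof FolderDataObject);
        }
    }
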
From: Grant I. <gsi...@ap...> - 2008-08-27 15:44:48

On Aug 25, 2008, at 11:16 AM, Grant Ingersoll wrote:
> On Aug 21, 2008, at 5:57 PM, Antoni Myłka wrote:
>> AccessData is not intended to be used outside the crawler.

So, what's the preferred way of clearing out the AccessData? I have two
cases: one where I want to delete all contexts, and one where I want to
delete a specific context. These are separate events from a crawl, but need
to be safe with respect to the repository model (though that doesn't
necessarily mean Aperture needs to handle the safety).

Essentially, what I want to do is:
- Stop the crawl (got that)
- Delete the thing the crawl was creating (got that)
- Delete everything in the persistent store (don't have that)

I see Crawler.clear() and Model.removeAll() as possibilities. There is also
the option of just deleting the underlying directory and re-instantiating the
Sail Repository, but that seems pretty harsh.

Thanks,
Grant

From: Antoni M. <ant...@gm...> - 2008-08-27 20:02:04

Grant Ingersoll pisze:
> So, what's the preferred way of clearing out the AccessData? I have two
> cases: one where I want to delete all contexts, and one where I want to
> delete a specific context.
>
> I see Crawler.clear() and Model.removeAll() as possibilities. There is also
> the option of just deleting the underlying directory and re-instantiating
> the Sail Repository, but that seems pretty harsh.

If you want to delete the content of the AccessData, there is more than one
way.

1. accessData.clear() - the preferred way, independent of the AccessData
   implementation. Note that it doesn't actually delete the files used for
   persistence; e.g. for a FileAccessData, doing clear() and store() would
   store an empty file.
2. model.removeAll() - works for a ModelAccessData. A model wraps a single
   context, so if you have more ModelAccessData instances backed by separate
   contexts in a single repository (a single ModelSet), calling
   model.removeAll() will only delete a single context and not affect the
   other ones.
3. modelSet.removeAll() - a ModelSet wraps all contexts in a repository.
   Calling this method will remove everything (though it will not remove the
   actual folder where the NativeStore files are located, it will only remove
   their content).
4. Removing the actual files and reinitializing the access data - will also
   work, but as you said it's a bit harsh, though if your AccessData is
   really big, you might want to consider it for performance reasons.
5. crawler.clear() - if you do this, the crawler handler will be notified
   about clearStarted(), clearingObject() etc. The result is exactly the same
   as accessData.clear(). I personally don't like it though, as it places the
   responsibility for clearing the data outside the entity that is actually
   capable of doing it. It's featured on
   http://aperture.wiki.sourceforge.net/ApertureArchitectureCleanup as a
   candidate for a face-lift in Aperture 2.0.

All kinds of comments welcome.

Antoni Mylka
ant...@gm...

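PS: in code, options 1 and 5 are just a single call each; the difference is
only in who gets notified (a sketch, assuming accessData is the same
ModelAccessData instance the crawler was configured with):

    // option 1: clears the underlying context directly, no callbacks
    accessData.clear();

    // option 5: same end result, but the CrawlerHandler receives the
    // clearStarted(), clearingObject(), etc. notifications
    crawler.clear();
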
From: Grant I. <gsi...@ap...> - 2008-08-27 21:41:11

On Aug 27, 2008, at 4:02 PM, Antoni Myłka wrote:
> 3. modelSet.removeAll() - a ModelSet wraps all contexts in a repository.
>    Calling this method will remove everything (though it will not remove the
>    actual folder where the NativeStore files are located, it will only
>    remove their content).

Here's what I tried:

    ModelSet modelSet = new RepositoryModelSet(repo);
    modelSet.open();
    modelSet.removeAll();
    modelSet.commit();
    modelSet.close();

But when I crawl again, I don't get that the files in question are new, as I
would expect. Do I have to somehow shut down the underlying Repository or
something like that? It seems like I still have active connections open on
the Repository; is it possible that is the cause?

From: Antoni M. <ant...@gm...> - 2008-08-27 22:02:48

Grant Ingersoll pisze:
> Here's what I tried:
>
>     ModelSet modelSet = new RepositoryModelSet(repo);
>     modelSet.open();
>     modelSet.removeAll();
>     modelSet.commit();
>     modelSet.close();
>
> But when I crawl again, I don't get that the files in question are new, as I
> would expect.

The RepositoryModelSet always works in autocommit mode; the commit() method
has no effect. Please check whether the statements are actually removed, e.g.

    System.out.println(modelSet.size());
    modelSet.removeAll();
    System.out.println(modelSet.size());

or

    modelSet.dump();
    modelSet.removeAll();
    modelSet.dump();

Then you could make sure that the model that backs your ModelAccessData is
actually empty before you begin your next crawl. Even if there are any active
connections, they shouldn't play any role here.

Antoni Mylka
ant...@gm...

From: Christiaan F. <chr...@ad...> - 2008-08-28 08:47:56

Antoni Myłka wrote:
> If you want to delete the content of the AccessData, there is more than one
> way.
>
> 1. accessData.clear() - the preferred way [...]
> 2. model.removeAll() - works for a ModelAccessData [...]
> 3. modelSet.removeAll() - a ModelSet wraps all contexts in a repository [...]
> 4. Removing the actual files and reinitializing the access data [...]
> 5. crawler.clear() - the crawler handler will be notified about
>    clearStarted(), clearingObject() etc. [...]

1 and 5 are preferred IMO, as they do the job at the right level of
abstraction. 2, 3 and 4 introduce the risk that someday someone changes a
ModelXYZ implementation in a way that breaks these approaches.

Concerning 5: I originally introduced the clear method in Crawler for two
reasons:

- The Crawler is the one that puts the data in the AccessData, so I thought
  it should also be the one that takes it out. I may at that time still have
  been thinking about sharing AccessData instances between Crawlers. I.e.,
  *thought* about it, I said nothing about properly implementing it ;) In
  that case, this would have been the only option.
- Like you said, the CrawlerHandler gets notified about the deletions, which
  is useful if you want to synchronize these deletions with other deletions.
  Clearing the AccessData instance directly means that you have to know
  exactly what else needs to be cleared. In practice this is often also a
  simple .clear() on some other information store, though. In that case,
  approach 1 gives you much better performance (no iteration over things that
  you are about to delete).

Regards,

Chris

From: Grant I. <gsi...@ap...> - 2008-08-28 10:55:42

On Aug 28, 2008, at 4:48 AM, Christiaan Fluit wrote:
> 1 and 5 are preferred IMO, as they do the job at the right level of
> abstraction. 2, 3 and 4 introduce the risk that someday someone changes a
> ModelXYZ implementation in a way that breaks these approaches.

#1 seems to contradict what Antoni said earlier: "AccessData is not intended
to be used outside the crawler."

-Grant

From: Grant I. <gsi...@ap...> - 2008-08-28 16:22:20

Thanks so much for all of your help on this one! I am still a bit confused
about what is going on. Here is some code that I wrote that I think
illustrates my confusion with all of this (built in the Aperture trunk
src/test/org/semanticdesktop/aperture directory).

Could someone try it out and tell me what I am doing wrong? That is, why is
model2 empty the second time the method second() is called (after doing
crawler.clear(); substituting in model.removeAll() has the same results), but
if I comment out that single line, model2 stays intact and everything works
as expected (i.e. the second DataSource finds no new objects)? I was under
the impression, from Antoni's first email on this thread (way back when),
that I could create separate ModelAccessDatas in the same native store as
long as I have separate contexts. I'm guessing that I am doing this wrong,
but I am not sure how else to do it.

Thanks,
Grant

package org.semanticdesktop.aperture;

import org.semanticdesktop.aperture.crawler.filesystem.FileSystemCrawler;
import org.semanticdesktop.aperture.crawler.CrawlReport;
import org.semanticdesktop.aperture.crawler.Crawler;
import org.semanticdesktop.aperture.crawler.base.CrawlerHandlerBase;
import org.semanticdesktop.aperture.datasource.filesystem.FileSystemDataSource;
import org.semanticdesktop.aperture.rdf.RDFContainer;
import org.semanticdesktop.aperture.rdf.RDFContainerFactory;
import org.semanticdesktop.aperture.rdf.impl.RDFContainerFactoryImpl;
import org.semanticdesktop.aperture.accessor.impl.DefaultDataAccessorRegistry;
import org.semanticdesktop.aperture.accessor.base.ModelAccessData;
import org.semanticdesktop.aperture.accessor.DataObject;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryException;
import org.openrdf.sail.nativerdf.NativeStore;
import org.openrdf.rdf2go.RepositoryModel;
import org.ontoware.rdf2go.model.Model;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Date;

public class PersistentStoreTest {

    public static void main(String[] args) throws IOException, RepositoryException {
        File tmpDir = new File(System.getProperty("java.io.tmpdir"));
        File file = new File(tmpDir, "store");
        file.mkdirs();
        Repository repo = new SailRepository(new NativeStore(file));
        repo.initialize();
        RDFContainerFactory factory = new RDFContainerFactoryImpl();

        File tmpLoc = new File(tmpDir, "pst_" + new Date().getTime());
        tmpLoc.mkdirs();
        File fuzzy = new File(tmpLoc, "fuzzy.txt");
        writeStringToFile(fuzzy, "Fuzzy-wuzzy was a bear. Fuzzy-wuzzy had no hair");
        File wuzzy = new File(tmpLoc, "wuzzy.txt");
        writeStringToFile(wuzzy, "Wuzzy-Fuzzy was a bear. Wuzzy-fuzzy had no hair");

        FileSystemDataSource fsds = new FileSystemDataSource();
        RDFContainer container = factory.newInstance("file://" + tmpLoc);
        fsds.setConfiguration(container);
        fsds.setRootFolder(tmpLoc.getAbsolutePath());
        FileSystemCrawler crawler = new FileSystemCrawler();
        crawler.setDataSource(fsds);
        crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
        Model model = new RepositoryModel(fsds.getID(), repo);
        model.open();
        dumpModel(model, "model");
        ModelAccessData mad = new ModelAccessData(model);
        mad.initialize();
        crawler.setAccessData(mad);
        MyCrawlerHandler crawlerHandler = new MyCrawlerHandler();
        crawler.setCrawlerHandler(crawlerHandler);
        crawler.crawl();
        CrawlReport cr = crawler.getCrawlReport();
        System.out.println("New: " + cr.getNewCount() + " changed: " + cr.getChangedCount()
                + " deleted: " + cr.getRemovedCount());
        dumpModel(model, "model");

        File tmpLoc2 = new File(tmpDir, "pst_2_" + new Date().getTime());
        tmpLoc2.mkdirs();
        File goldilocks = new File(tmpLoc2, "goldilocks.txt");
        writeStringToFile(goldilocks, "Goldilocks was a picky eater.");
        File humpty = new File(tmpLoc2, "humpty.txt");
        writeStringToFile(humpty, "Humpty Dumpty was pushed.");

        second(repo, factory, tmpLoc2);

        // Now, delete the first Model, then crawl the second again
        System.out.println("");
        System.out.println("Clearing the first location");
        dumpModel(model, "model -- before");
        crawler.clear();
        dumpModel(model, "model -- after");
        System.out.println("Done clearing");
        System.out.println("");
        second(repo, factory, tmpLoc2);
    }

    private static void dumpModel(Model model, String modelName) {
        System.out.println("Dump: " + modelName);
        model.dump();
        System.out.println("Done dumping: " + modelName);
    }

    private static void second(Repository repo, RDFContainerFactory factory, File tmpLoc2)
            throws IOException {
        FileSystemCrawler crawler;
        MyCrawlerHandler crawlerHandler;
        CrawlReport cr;
        System.out.println("");
        System.out.println("Second Location");

        FileSystemDataSource fsds2 = new FileSystemDataSource();
        RDFContainer container2 = factory.newInstance("file://" + tmpLoc2);
        fsds2.setConfiguration(container2);
        fsds2.setRootFolder(tmpLoc2.getAbsolutePath());
        crawler = new FileSystemCrawler();
        crawler.setDataSource(fsds2);
        crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
        Model model2 = new RepositoryModel(fsds2.getID(), repo);
        model2.open();
        dumpModel(model2, "model2");
        ModelAccessData mad2 = new ModelAccessData(model2);
        mad2.initialize();
        crawler.setAccessData(mad2);
        crawlerHandler = new MyCrawlerHandler();
        crawler.setCrawlerHandler(crawlerHandler);
        crawler.crawl();
        cr = crawler.getCrawlReport();
        System.out.println("New: " + cr.getNewCount() + " changed: " + cr.getChangedCount()
                + " deleted: " + cr.getRemovedCount());
        dumpModel(model2, "model2");
        model2.close();
    }

    private static void writeStringToFile(File file, String text) throws IOException {
        FileWriter writer = new FileWriter(file);
        writer.write(text);
        writer.close();
    }

    private static class MyCrawlerHandler extends CrawlerHandlerBase {
        @Override
        public void objectChanged(Crawler crawler, DataObject object) {
            System.out.println("Object Changed: " + object.getID());
            object.dispose();
        }

        @Override
        public void objectNew(Crawler crawler, DataObject object) {
            System.out.println("Object New: " + object.getID());
            object.dispose();
        }

        @Override
        public void objectRemoved(Crawler crawler, String url) {
            System.out.println("Object Removed: " + url);
        }
    }
}

From: Grant I. <gsi...@ap...> - 2008-09-04 14:32:33

Is this a bug in RDF2Go or am I doing something wrong?

On Aug 28, 2008, at 12:21 PM, Grant Ingersoll wrote:
> Thanks so much for all of your help on this one! I am still a bit confused
> about what is going on. Here is some code that I wrote that I think
> illustrates my confusion with all of this (built in the Aperture trunk
> src/test/org/semanticdesktop/aperture directory).
>
> Could someone try it out and tell me what I am doing wrong?
> [full test case snipped]

From: Antoni M. <ant...@gm...> - 2008-09-04 19:23:08

Grant Ingersoll pisze:
> Is this a bug in RDF2Go or am I doing something wrong?

Sorry for the late reply. I couldn't figure out what the problem was a week
ago, and moved to a new flat in the meantime :). Now I have returned to it
and found the culprit.

Your code is correct. It is a bug in the RDF2Go adapter implementation: the
removeAll() method of the RepositoryModel class deletes everything from all
contexts, which is wrong. A model should only work on a SINGLE context. I'll
try to pull the appropriate strings to get this fixed asap.

Thanks for the report, and once again sorry for the delay.

Antoni Mylka
ant...@gm...

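PS: until the fix is released, one possible workaround is to clear the single
context directly through the Sesame connection instead of calling
model.removeAll(). A sketch, assuming you keep a reference to the underlying
Repository and to the context URI:

    RepositoryConnection connection = repo.getConnection();
    try {
        // clears only the named context, leaving the other data sources alone
        connection.clear(connection.getValueFactory().createURI(contextUri.toString()));
    } finally {
        connection.close();
    }
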
From: Antoni M. <ant...@gm...> - 2008-09-04 20:21:18

Antoni Myłka pisze:
> Your code is correct. It is a bug in the RDF2Go adapter implementation: the
> removeAll() method of the RepositoryModel class deletes everything from all
> contexts, which is wrong. A model should only work on a SINGLE context. I'll
> try to pull the appropriate strings to get this fixed asap.

Created an issue on the RDF2Go JIRA:

http://octopus13.fzi.de:8080/browse/RTGO-56

Antoni Mylka
ant...@gm...

From: Grant I. <gsi...@ap...> - 2008-09-06 20:01:43

On Sep 4, 2008, at 4:21 PM, Antoni Myłka wrote:
> Created an issue on the RDF2Go JIRA.
>
> http://octopus13.fzi.de:8080/browse/RTGO-56

Thanks, Antoni. I'm glad I'm not crazy, at least not on this one. Would the
suggested workaround be to keep separate Repositories for each Data Source?

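In other words, roughly this per data source instead of one shared NativeStore
(a sketch; the way the per-source folder name is derived is just an example):

    // one NativeStore folder (and Repository) per data source, so clearing
    // one source's model can never touch another source's data
    File folder = new File(baseDir, URLEncoder.encode(dataSource.getID().toString(), "UTF-8"));
    Repository repo = new SailRepository(new NativeStore(folder));
    repo.initialize();
    Model model = new RepositoryModel(repo); // no context needed: the store is per-source
    model.open();
    ModelAccessData accessData = new ModelAccessData(model);
    accessData.initialize();
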
From: Antoni M. <ant...@gm...> - 2008-09-08 17:42:45

Grant Ingersoll pisze:
> Thanks, Antoni. I'm glad I'm not crazy, at least not on this one. Would the
> suggested workaround be to keep separate Repositories for each Data Source?

It depends on your deadlines. The workaround suggested "by me" is to wait for
the next RDF2Go release. My patch has been applied; I can update the relevant
jar from the trunk if that fixes your problem. A proper release is a matter
of days.

Antoni Mylka
ant...@gm...

From: Grant I. <gsi...@ap...> - 2008-09-17 18:49:24

On Sep 8, 2008, at 1:37 PM, Antoni Myłka wrote:
> It depends on your deadlines. The workaround suggested "by me" is to wait
> for the next RDF2Go release. My patch has been applied; I can update the
> relevant jar from the trunk if that fixes your problem. A proper release is
> a matter of days.

Any status on this? I just updated, but didn't see a jar. Where can I grab
the JAR?
