From: Christiaan F. <chr...@ad...> - 2005-12-12 10:38:52
While working on the IMAPCrawler I have run into some serious problems with RDFContainer, in particular its Map-based methods. The main problem is that all generated DataObjects may share the same RDFContainer instance, whose described URI is reset every time a container is requested from the RDFContainerFactory. This is problematic because I create several DataObjects at once and sometimes need to add or remove a metadata statement later, after other DataObjects and their metadata have already been created. When I retrieve the RDFContainer of a previously created DataObject in order to add or remove something, I have no guarantee that its described URI is still the same. This makes the behaviour of the put and get methods undefined in that case: they may silently operate on another Resource.

Since RDFContainer offers hardly any real triple-oriented methods (which would let me specify the subject), there is no pure RDFContainer workaround; for example, there are no longer any get methods taking a subject as argument. Finishing all processing of one DataObject before moving on to the next is also hardly an option in my code, as some metadata statements of a DataObject can only be known once all of its child DataObjects have been fully created.

For now I will work around this problem in the IMAP code by retrieving the underlying Repository and operating on it directly, so that I have full control over my triples. Clearly this makes the code Sesame-dependent, which we will need to solve somehow in the future.

Another issue with the RDFContainer API: I am starting to have second thoughts about the implementation strategy of the put methods. Right now these methods overwrite the existing value when there is a single value for that subject-predicate pair, to mimic java.util.Map's behaviour.
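To make the first problem (the shared container whose described URI is reset by the factory) concrete, here is a minimal mock sketch. All class and method names are hypothetical stand-ins, not the actual Aperture API; the point is only that a put through an earlier DataObject's container silently attaches the statement to whatever URI was set last:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mock of the shared-instance situation; not the real API.
public class SharedContainerDemo {

    static class Statement {
        final String subject, predicate, object;
        Statement(String s, String p, String o) { subject = s; predicate = p; object = o; }
    }

    static class SharedContainer {
        String describedUri;  // reset by the factory on every request
        final List<Statement> statements = new ArrayList<>();

        // Map-like put: the subject is always the *current* described URI,
        // whatever it happens to be at call time.
        void put(String predicate, String value) {
            statements.add(new Statement(describedUri, predicate, value));
        }
    }

    public static void main(String[] args) {
        SharedContainer container = new SharedContainer();

        container.describedUri = "imap://folder/msg1";  // "factory call" for DataObject 1
        container.put("title", "First message");

        container.describedUri = "imap://folder/msg2";  // "factory call" for DataObject 2
        container.put("title", "Second message");

        // Later: DataObject 1 tries to add a statement through "its" container,
        // but the shared described URI now points at msg2.
        container.put("childCount", "3");

        Statement last = container.statements.get(2);
        // The childCount statement is attached to msg2, not msg1.
        System.out.println(last.subject + " " + last.predicate + " " + last.object);
    }
}
```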
This is starting to feel wrong to me: it is quite contrary to how Semantic Web applications typically work, where anyone can make any kind of statement about anything. It is not the purpose of, say, an HTMLExtractor to decide that it has a better title for a specific document and that the existing recorded title should be overwritten. If you want to keep such values apart, you can always use contexts and decide later which one you favor. As an added bonus, adding information becomes faster (remember that replaceInternal was by far the most expensive operation in the demo file crawler), because half of its work, looking up existing statements, is no longer necessary.

With these two issues combined, I would be in favor of a more RDF2Go-like, triple-oriented API in the future, but with support for the data types we introduced in the RDFContainer API, i.e. specialized methods for handling Dates, ints, etc. These really ease development and provide centralized control over how such data types are modeled in RDF.

Any opinions on these issues?

Chris
--
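As a sketch of what such a triple-oriented API might look like, here is a small self-contained mock. Every call takes an explicit subject (so a shared instance can never operate on the wrong resource), values are added rather than overwritten, and typed convenience methods centralize how Dates and ints become literals. All names and the literal notation are illustrative assumptions, not a proposal for concrete signatures:

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;

// Hypothetical triple-oriented container sketch; names are illustrative only.
public class TripleStoreSketch {

    static class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    final List<Triple> triples = new ArrayList<>();

    // Plain triple addition: no lookup of existing values, so nothing is
    // overwritten and no replaceInternal-style query is needed.
    public void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    // Typed convenience methods model the data type in one central place.
    public void add(String subject, String predicate, int value) {
        add(subject, predicate, Integer.toString(value) + "^^xsd:integer");
    }

    public void add(String subject, String predicate, Date value) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        add(subject, predicate, fmt.format(value) + "^^xsd:dateTime");
    }

    // Triple-oriented getter: the caller names the subject explicitly.
    public List<String> get(String subject, String predicate) {
        List<String> result = new ArrayList<>();
        for (Triple t : triples) {
            if (t.s.equals(subject) && t.p.equals(predicate)) {
                result.add(t.o);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TripleStoreSketch store = new TripleStoreSketch();
        String doc = "imap://folder/msg1";
        store.add(doc, "title", "Title from the IMAP envelope");
        store.add(doc, "title", "Title from an HTMLExtractor");  // both titles are kept
        store.add(doc, "size", 1024);
        System.out.println(store.get(doc, "title").size());  // 2
    }
}
```

With contexts added as a fourth component, the two titles could later be disambiguated by provenance, as suggested above.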