From: Christiaan F. <chr...@ad...> - 2005-12-12 10:38:52
While working on the IMAPCrawler I have run into some serious problems with RDFContainer, in particular its Map-based methods. The main problem is that all generated DataObjects may share the same RDFContainer instance, whose described URI is reset every time a container is requested from the RDFContainerFactory. This is problematic because I create several DataObjects at once and sometimes need to add or remove a metadata statement later, after other DataObjects and their metadata have already been created. When I retrieve the RDFContainer of a previously created DataObject in order to add or remove something, I have no guarantee that its described URI is still the same. This makes the behaviour of the put and get methods undefined in that case: they may silently operate on another Resource.

Since RDFContainer offers hardly any real triple-oriented methods (which would let me specify the subject), there is no pure RDFContainer workaround; for example, there are no longer any get methods taking a subject as argument. Finishing all processing of one DataObject before moving on to the next is also hardly an option in my code, as some metadata statements of a DataObject can only be known once all of its child DataObjects have been fully created.

For now I will work around this problem in the IMAP code by retrieving the underlying Repository and operating on it directly, so that I have full control over my triples. Clearly this makes the code Sesame-dependent, which we will need to solve somehow in the future.

Another issue with the RDFContainer API: I am starting to have second thoughts about the implementation strategy of the put methods. Right now these methods overwrite the existing value when there is a single value for that subject-predicate pair, to mimic java.util.Map's behaviour.
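To make the first problem (the shared container whose described URI is reset by the factory) concrete, here is a minimal mock sketch. All class and method names are hypothetical stand-ins, not the actual Aperture API; the point is only that a put through an earlier DataObject's container silently attaches the statement to whatever URI was set last:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mock of the shared-instance situation; not the real API.
public class SharedContainerDemo {

    static class Statement {
        final String subject, predicate, object;
        Statement(String s, String p, String o) { subject = s; predicate = p; object = o; }
    }

    static class SharedContainer {
        String describedUri;  // reset by the factory on every request
        final List<Statement> statements = new ArrayList<>();

        // Map-like put: the subject is always the *current* described URI,
        // whatever it happens to be at call time.
        void put(String predicate, String value) {
            statements.add(new Statement(describedUri, predicate, value));
        }
    }

    public static void main(String[] args) {
        SharedContainer container = new SharedContainer();

        container.describedUri = "imap://folder/msg1";  // "factory call" for DataObject 1
        container.put("title", "First message");

        container.describedUri = "imap://folder/msg2";  // "factory call" for DataObject 2
        container.put("title", "Second message");

        // Later: DataObject 1 tries to add a statement through "its" container,
        // but the shared described URI now points at msg2.
        container.put("childCount", "3");

        Statement last = container.statements.get(2);
        // The childCount statement is attached to msg2, not msg1.
        System.out.println(last.subject + " " + last.predicate + " " + last.object);
    }
}
```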
This is starting to feel wrong to me: it is quite contrary to how Semantic Web applications typically work, where anyone can make any kind of statement about anything. It is not the purpose of, say, an HTMLExtractor to decide that it has a better title for a specific document and that the existing recorded title should be overwritten. If you want to keep such values apart, you can always use contexts and decide later which one you favor. As an added bonus, adding information becomes faster (remember that replaceInternal was by far the most expensive operation in the demo file crawler), because half of its work, looking up existing statements, is no longer necessary.

With these two issues combined, I would be in favor of a more RDF2Go-like, triple-oriented API in the future, but with support for the data types we introduced in the RDFContainer API, i.e. specialized methods for handling Dates, ints, etc. These really ease development and provide centralized control over how such data types are modeled in RDF.

Any opinions on these issues?

Chris
--
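As a sketch of what such a triple-oriented API might look like, here is a small self-contained mock. Every call takes an explicit subject (so a shared instance can never operate on the wrong resource), values are added rather than overwritten, and typed convenience methods centralize how Dates and ints become literals. All names and the literal notation are illustrative assumptions, not a proposal for concrete signatures:

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;

// Hypothetical triple-oriented container sketch; names are illustrative only.
public class TripleStoreSketch {

    static class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    final List<Triple> triples = new ArrayList<>();

    // Plain triple addition: no lookup of existing values, so nothing is
    // overwritten and no replaceInternal-style query is needed.
    public void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    // Typed convenience methods model the data type in one central place.
    public void add(String subject, String predicate, int value) {
        add(subject, predicate, Integer.toString(value) + "^^xsd:integer");
    }

    public void add(String subject, String predicate, Date value) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        add(subject, predicate, fmt.format(value) + "^^xsd:dateTime");
    }

    // Triple-oriented getter: the caller names the subject explicitly.
    public List<String> get(String subject, String predicate) {
        List<String> result = new ArrayList<>();
        for (Triple t : triples) {
            if (t.s.equals(subject) && t.p.equals(predicate)) {
                result.add(t.o);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TripleStoreSketch store = new TripleStoreSketch();
        String doc = "imap://folder/msg1";
        store.add(doc, "title", "Title from the IMAP envelope");
        store.add(doc, "title", "Title from an HTMLExtractor");  // both titles are kept
        store.add(doc, "size", 1024);
        System.out.println(store.get(doc, "title").size());  // 2
    }
}
```

With contexts added as a fourth component, the two titles could later be disambiguated by provenance, as suggested above.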