Re: [Aperture-devel] Proposal for extending the DataSource interface

I'm not comfortable with this approach, but perhaps I don't understand 
the entire problem well enough. Some observations:

- When DataObjects are produced during crawling, their metadata includes 
statements relating the DataObject to the DataSource. Therefore, you do 
not need to ask a DataSource whether it "contains" a DataObject, 
provided that you have stored this fact and are able to look it up once 
you decide to open the original binary resource. In my view, storing 
this link is always necessary, as URIs are merely identifiers: they do 
not contain all information necessary to obtain the original resource 
(e.g., user credentials and other account information are usually kept 
out). Does this approach not work for you? Do you have use cases where 
you only have a URI and no DataSource linked to it?

- Right now, DataSources are declarative by nature (or that's how *I* 
think about them, perhaps others have a different opinion). It is the 
Crawler implementor who decides which DataObject URIs to produce. Adding 
code for containment checking to DataSource requires a common agreement 
on what the DataObject URIs for a given DataSource type look like. I'm 
still wondering whether this is actually a good idea or whether a 
Crawler implementor should be given complete freedom in designing URI 
formats.

- I can imagine several Crawler implementations per DataSource type and 
several DataAccessor implementations per URI scheme, so putting get 
methods for these in DataSource does not sound like a good approach. 
Also, these methods would probably not be needed if you have stored the 
DataObject-->DataSource link.
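
To make the first point concrete, here is a minimal sketch of the kind of 
stored link I have in mind. None of this is Aperture API: SourceLinkIndex, 
SourceConfig, recordCrawled and sourceOf are hypothetical names, and a 
real application would persist the link (for example in its metadata 
store) rather than keep it in memory.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; none of these types are Aperture API.
class SourceLinkIndex {

    // Stand-in for a configured data source. It carries the details that
    // are deliberately kept out of the URIs themselves.
    static final class SourceConfig {
        final String id;
        final String accessorName; // accessor the application chose for this source

        SourceConfig(String id, String accessorName) {
            this.id = id;
            this.accessorName = accessorName;
        }
    }

    // In-memory link from data object URI to the source that produced it.
    private final Map<URI, SourceConfig> sourceByObject = new HashMap<>();

    // Called while crawling: remember which source produced the object.
    void recordCrawled(URI objectUri, SourceConfig source) {
        sourceByObject.put(objectUri, source);
    }

    // Called when opening a resource: look the link up instead of asking
    // every source whether it "contains" the URI.
    Optional<SourceConfig> sourceOf(URI objectUri) {
        return Optional.ofNullable(sourceByObject.get(objectUri));
    }

    public static void main(String[] args) {
        SourceLinkIndex index = new SourceLinkIndex();
        SourceConfig outlook = new SourceConfig("outlook-1", "OutlookAccessor");
        URI item = URI.create("outlook://mailbox/item/42"); // made-up URI format

        index.recordCrawled(item, outlook);

        // The application, not the DataSource, decides which accessor to use.
        index.sourceOf(item).ifPresent(
                s -> System.out.println(s.id + " via " + s.accessorName));
    }
}
```

With such a lookup the application never has to parse a URI to find its 
source, so Crawler implementors stay free to choose their URI formats. A 
URI that was never crawled simply yields an empty result.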

Now these are just objections to your plan, not a constructive solution 
for what you want to achieve. Given these remarks, has your view on the 
design you have in mind changed?

Regards,

Chris
--

Antoni Mylka wrote:
> An Aperture facade in Gnowsis had an accessResource method that accepted
> a URI. This method returned a DataObject regardless of where the resource
> was located (file, website, etc.). We would like to have a similar method in
> Nepomuk.
> 
> The algorithm was:
> 
> 1. Find an accessor for the scheme of the URI.
> 2. Find the datasource that contains this URI.
> 3. Use the accessor and return the DataObject (backed by an appropriate
> source).
> 
> The implementation:
> 1. It used the registry. If there was no accessor for the URI scheme, it
> performed step 2 and then checked whether the data source was an instance
> of OutlookDataSource. If so, it used the OutlookAccessor. This worked only
> for http and file URIs, and for those URIs that happened to belong to an
> OutlookDataSource.
> 
> 2. This was done with a containsUri method. It took the configuration of a
> DataSource, extracted the DomainBoundaries and checked whether a URI falls
> within the DomainBoundaries. This worked only for data sources with
> domain boundaries specified.
> 
> In order to have a better solution we suggest the following:
> 
> 1. Add a containsUri(URI uri) method to the DataSource interface. It
> could use simple rules (e.g. begins with file:// and ends with an
> alphanumeric character - to distinguish files from parts of files) or
> use the configuration (e.g. outlook uris must begin with the prefix
> specified as the rootUrl configuration property, file uris might be
> further restricted with DomainBoundaries, ical uris begin with the path
> to the ical calendar file).
> 
> 2. Add a getDataAccessor method to the DataSource interface. Some data 
> sources aren't directly assigned to a URI scheme (e.g. an Outlook 
> data source). Therefore their accessors can't be obtained through the 
> DataAccessorRegistry and require hacks like
> 
>   if (source instanceof OutlookDataSource)
>      accessor = new OutlookDataAccessor();
> 
> Such an accessor could be more tightly coupled with a particular source.
> 
> 3. Add a getDataOpener method to the DataSource interface. For the same 
> reason as in 2. The most pressing example is the OutlookOpener.
> 
> If nobody objects, I would implement it in the near future.