From: Christiaan F. <chr...@ad...> - 2007-06-14 09:45:55
I'm not comfortable with this approach, but perhaps I don't understand the entire problem too well. Some observations:

- When DataObjects are produced during crawling, their metadata include statements relating the DataObject to the DataSource. Therefore, you are not required to ask a DataSource whether it "contains" a DataObject, provided that you have stored this fact and are able to look it up once you decide to open the original binary resource. In my view, this is always necessary, as URIs are merely identifiers: they do not contain all the information necessary to obtain the original resource (e.g., user credentials and other account information are usually kept out). Does this approach not work out for you? Do you have use cases where you only have a URI and no DataSource linked to it?

- Right now, DataSources are declarative by nature (or that's how *I* think about them, perhaps others have a different opinion). It is the Crawler implementor who decides which DataObject URIs to produce. Adding code for containment checking to DataSource requires a common agreement on what the DataObject URIs for a given DataSource type look like. I'm still wondering whether this is actually a good idea or whether a Crawler implementor should be given complete freedom in designing URI formats.

- I can imagine several Crawler implementations per DataSource type and several DataAccessor implementations per URI scheme, so putting get methods for these in DataSource does not sound like a good approach. Also, this method would probably not be needed if you have stored the DataObject-->DataSource link.

Now these are just objections to your plan, not a constructive solution for what you want to achieve. Given these remarks, has your view on the design you have in mind changed?

Regards,

Chris
--

Antoni Mylka wrote:
> An Aperture facade in Gnowsis had an accessResource method that accepted
> a URI. This method returned a DataObject regardless of where this URI
> was (file, website etc...).
> We would like to have a similar method in Nepomuk.
>
> The algorithm was:
>
> 1. Find an accessor for the scheme of the URI.
> 2. Find the datasource that contains this URI.
> 3. Use the accessor and return the DataObject (backed by an appropriate
> source).
>
> The implementation:
>
> 1. It used the registry. If there was no accessor for the URI scheme, it
> performed step 2 and then checked whether the data source was an instance
> of OutlookDataSource. Then it used the OutlookAccessor. This worked only
> for http and file URIs, and for those URIs that happen to belong to an
> OutlookDataSource.
>
> 2. This was done with a containsUri method. It took a configuration of a
> DataSource, extracted the DomainBoundaries and checked whether a URI falls
> within the DomainBoundaries. This worked only for data sources with
> domain boundaries specified.
>
> In order to have a better solution we suggest the following:
>
> 1. Add a containsUri(URI uri) method to the DataSource interface. It
> could use simple rules (e.g. begins with file:// and ends with an
> alphanumeric character - to distinguish files from parts of files) or
> use the configuration (e.g. outlook URIs must begin with the prefix
> specified as the rootUrl configuration property, file URIs might be
> further restricted with DomainBoundaries, ical URIs begin with the path
> to the ical calendar file).
>
> 2. Add a getDataAccessor method to the DataSource interface. Some data
> sources aren't directly assigned to a URI scheme (e.g. an outlook
> accessor). Therefore they can't be accessed through the
> DataAccessorRegistry and require hacks like
>
> if (source instanceof OutlookDataSource)
>     accessor = new OutlookDataAccessor();
>
> Such an accessor could be more tightly coupled with a particular source.
>
> 3. Add a getDataOpener method to the DataSource interface, for the same
> reason as in 2. The most pressing example is the OutlookOpener.
>
> If nobody objects, I would implement it in the near future.
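
For readers following the thread, the alternative raised in the reply (record the DataObject-->DataSource link at crawl time and look it up later, instead of asking each DataSource whether it contains a URI) might look roughly like the Java sketch below. Everything in it is an assumption made for illustration: SourceRef, Accessor, LinkStore and this accessResource signature are hypothetical stand-ins, not Aperture's actual interfaces, and the outlook URI is invented.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are Aperture's real interfaces.
final class SourceLinkSketch {

    /** Stand-in for a configured data source: an id plus a type name such as "outlook". */
    static final class SourceRef {
        final String id;
        final String type;

        SourceRef(String id, String type) {
            this.id = id;
            this.type = type;
        }
    }

    /** Stand-in for an accessor; returns a description instead of a real DataObject. */
    interface Accessor {
        String access(URI uri, SourceRef source);
    }

    /** Keeps the DataObject -> DataSource link reported during crawling. */
    static final class LinkStore {
        private final Map<URI, SourceRef> links = new HashMap<>();

        void record(URI objectUri, SourceRef source) {
            links.put(objectUri, source);
        }

        /** Returns the recorded source, or null if the URI was never crawled. */
        SourceRef sourceOf(URI objectUri) {
            return links.get(objectUri);
        }
    }

    /** Resolves a URI through the stored link, then picks an accessor by source type. */
    static String accessResource(URI uri, LinkStore links, Map<String, Accessor> accessorsByType) {
        SourceRef source = links.sourceOf(uri);
        if (source == null) {
            throw new IllegalArgumentException("No DataSource recorded for " + uri);
        }
        Accessor accessor = accessorsByType.get(source.type);
        if (accessor == null) {
            throw new IllegalStateException("No accessor registered for source type " + source.type);
        }
        return accessor.access(uri, source);
    }

    public static void main(String[] args) {
        LinkStore links = new LinkStore();
        URI mail = URI.create("outlook://mailbox/item/42"); // invented example URI
        links.record(mail, new SourceRef("work-mail", "outlook"));

        Map<String, Accessor> accessors = new HashMap<>();
        accessors.put("outlook", (uri, source) -> "opened " + uri + " via " + source.id);

        System.out.println(accessResource(mail, links, accessors));
    }
}
```

In this sketch the accessor is chosen by the type of the recorded source rather than by the URI scheme, so no instanceof check is needed and the DataSource itself stays declarative. The trade-off is that a URI that was never crawled has no recorded source and cannot be resolved, which is the use case the reply asks about.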