Crawl

Access

Open

Map a source to a URI

Filesystem

ok, there is a crawler for the FilesystemDataSource

OK, there is a scheme-based FileAccessor

OK, there is a FileOpener; it uses the operating system's file type registry

Possible: the URI must be below the rootFolder and fall within the DomainBoundaries




should be implemented in the datasource instance though...
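The filesystem mapping rule above (URI below the rootFolder and within the domain boundaries) can be sketched roughly as follows. This is only an illustration: the class name FilesystemSourceMatcher, the owns method and the Predicate used as a stand-in for the domain boundaries are assumptions, not part of the existing API.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Predicate;

// Hypothetical helper; the names here are illustrative only.
final class FilesystemSourceMatcher {
    private final Path rootFolder;
    // Stand-in for the source's domain boundaries: any test on the URI string.
    private final Predicate<String> domainBoundaries;

    FilesystemSourceMatcher(Path rootFolder, Predicate<String> domainBoundaries) {
        this.rootFolder = rootFolder.toAbsolutePath().normalize();
        this.domainBoundaries = domainBoundaries;
    }

    // True if the URI could have come from this filesystem source.
    boolean owns(URI uri) {
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            return false;
        }
        Path candidate;
        try {
            candidate = Paths.get(uri).toAbsolutePath().normalize();
        } catch (IllegalArgumentException e) {
            // Not a usable file URI (e.g. opaque, or with an authority part).
            return false;
        }
        // Path.startsWith compares whole path elements, so /docs-old is not below /docs.
        return candidate.startsWith(rootFolder) && domainBoundaries.test(uri.toString());
    }
}
```

A datasource instance could hold such a matcher and answer the mapping question itself, which is what the note above asks for.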

Web (http)

ok, there is a crawler for the WebDataSource

OK, the crawler can use fileAccessors and HttpAccessors; both work

OK, there is an HttpOpener, which brings up the browser. When the FileOpener is invoked on an HTML file, the browser is shown too

Possible: the URI must fall within the domain boundaries OR be below the rootUrl (if there are no domain boundaries)




should be implemented in the datasource instance though
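One reading of the web mapping rule above is: if domain boundaries are configured they decide, otherwise the rootUrl decides. A minimal sketch under that reading follows. The class WebSourceMatcher and the use of a plain list of include patterns as the "domain boundaries" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; include patterns stand in for the domain boundaries.
final class WebSourceMatcher {
    private final String rootUrl;
    private final List<Pattern> includePatterns;

    WebSourceMatcher(String rootUrl, List<Pattern> includePatterns) {
        this.rootUrl = rootUrl;
        this.includePatterns = new ArrayList<>(includePatterns);
    }

    boolean owns(String uri) {
        if (includePatterns.isEmpty()) {
            // No boundaries configured: fall back to "below the rootUrl".
            return isBelowRoot(uri);
        }
        for (Pattern p : includePatterns) {
            if (p.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }

    private boolean isBelowRoot(String uri) {
        // Append a slash so that /docs does not accidentally match /docs-old.
        String base = rootUrl.endsWith("/") ? rootUrl : rootUrl + "/";
        return uri.equals(rootUrl) || uri.startsWith(base);
    }
}
```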

Imap

there is an ImapCrawler

The crawler plays the role of a DataAccessor. This lets the crawler and the accessor share the same connection, which is nice, but it doesn't work with a registry: there is no ImapAccessorFactory. An accessor instance is tied to a crawler instance, which is tied to a datasource instance, and there is currently no way to use it independently. We'd need a getDataAccessor method in the datasource, or we could implement the crawler factory and the accessor factory as a single class that lets all crawlers and accessors share a single connection pool (which wouldn't be too difficult).

There is no ImapOpener. If an email account is managed with a mail program (Outlook, Thunderbird), it should be accessed with an (Outlook|Thunderbird)DataSource and an appropriate application-specific opener. To open an imap:// URI with a ThunderbirdImapOpener we need to assume that this particular account has been set up in Thunderbird. Thunderbird could then take care of the credentials; the opener doesn't get the datasource as a parameter and can't do that by itself. It is possible though, because IMAP URIs have a fixed scheme. The Outlook/Thunderbird opener would be application-specific. We could always try to use the Windows mail API: that would work with any mail application, but would be limited to Windows.

Possible: the URI must begin with imap:// and have an appropriate hostname and username (and fall within the domain boundaries). The password can then be extracted from the datasource configuration.




should be implemented in the datasource instance though
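The IMAP check described above (scheme, hostname and username) might look roughly like this. It assumes URIs of the form imap://user@host/folder, which is an assumption about what the crawler emits; ImapSourceMatcher is a hypothetical name, and the domain boundaries test is left out for brevity.

```java
import java.net.URI;

// Hypothetical helper; assumes URIs shaped like imap://user@host/folder.
final class ImapSourceMatcher {
    private final String hostname;
    private final String username;

    ImapSourceMatcher(String hostname, String username) {
        this.hostname = hostname;
        this.username = username;
    }

    boolean owns(URI uri) {
        if (!"imap".equalsIgnoreCase(uri.getScheme())) {
            return false;
        }
        String userInfo = uri.getUserInfo();
        String host = uri.getHost();
        if (userInfo == null || host == null) {
            return false;
        }
        // Drop an optional ";AUTH=..." suffix after the user name.
        int semicolon = userInfo.indexOf(';');
        String user = semicolon >= 0 ? userInfo.substring(0, semicolon) : userInfo;
        // Host names are case-insensitive, user names are compared exactly.
        return hostname.equalsIgnoreCase(host) && username.equals(user);
    }
}
```

Once a source has claimed a URI this way, the password can be looked up in that source's configuration, as noted above.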

AppleAddressbook

There is a crawler

It is impossible to create an accessor because there is no scheme to associate it with

It is impossible to create an opener, due to the lack of a fixed scheme

Possible: every URI beginning with urn:semdesk:appleaddressbook can be assumed to come from an Apple address book, because there is ONE Apple address book on each desktop


Modify the crawler to use urn:semdesk:appleaddressbook and let the accessor registry work with prefixes, not schemes

Modify the crawler to use urn:semdesk:appleaddressbook and let the opener registry work with prefixes, not schemes

should be implemented in the datasource instance though
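The idea of letting a registry work with prefixes instead of schemes can be sketched as a longest-prefix lookup. PrefixRegistry below is a hypothetical, generic stand-in and does not reflect the existing accessor or opener registries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical registry keyed by URI prefix rather than by scheme.
final class PrefixRegistry<T> {
    private final Map<String, T> byPrefix = new HashMap<>();

    void register(String prefix, T factory) {
        byPrefix.put(prefix, factory);
    }

    // Returns the entry registered under the longest prefix of the URI, if any.
    Optional<T> lookup(String uri) {
        String best = null;
        for (String prefix : byPrefix.keySet()) {
            if (uri.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return Optional.ofNullable(best == null ? null : byPrefix.get(best));
    }
}
```

Scheme-based registration stays available as a special case: registering "file:" or "http:" behaves like the current scheme lookup, while "urn:semdesk:appleaddressbook" and "urn:semdesk:outlook" can be told apart even though both share the urn scheme.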

Thunderbird Addressbook

There is a crawler

It is impossible to create an accessor, because the crawler doesn't use a fixed scheme.

It is impossible to create an opener, due to the lack of a fixed scheme

Possible: the address book crawler works with files, so the path to the file should be part of the URI. The source could check whether the path is correct.


When implementing the accessor some care must be taken. The crawler wouldn't be able to use this accessor, because that would make crawling a single file quadratic in complexity. It is quite possible though.

Modify the crawler to use urn:semdesk:thunderbirdaddresbook and let the opener registry work with prefixes, not schemes. It would only work when Thunderbird is installed, and would have to accommodate various OSes.

should be implemented in the datasource instance though
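The quadratic-complexity concern for file-backed sources can be illustrated with a toy model. It assumes an accessor that has to parse the whole file to return one entry; SingleFileSource and its counter are purely hypothetical and only count how many entries get parsed.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a source whose entries all live in one file.
final class SingleFileSource {
    private final List<String> entries;
    private long entriesParsed = 0;

    SingleFileSource(List<String> entries) {
        this.entries = new ArrayList<>(entries);
    }

    // Shared parsing routine: reads every entry in the file.
    private List<String> parseFile() {
        entriesParsed += entries.size();
        return new ArrayList<>(entries);
    }

    // Accessor-style call: parses the whole file to return a single entry.
    String access(int index) {
        return parseFile().get(index);
    }

    // Crawling through the accessor: n calls, each parsing n entries.
    long crawlViaAccessor() {
        entriesParsed = 0;
        for (int i = 0; i < entries.size(); i++) {
            access(i);
        }
        return entriesParsed;
    }

    // Crawling directly: one pass over the file.
    long crawlInOnePass() {
        entriesParsed = 0;
        parseFile(); // a real crawler would report each entry here
        return entriesParsed;
    }
}
```

For 100 entries the accessor-based crawl parses 10,000 entries and the single pass parses 100. Both paths share parseFile, which is how the accessor and the crawler could avoid duplicating code.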

Ical File

There is a crawler

It is impossible to create an accessor, because the crawler doesn't use a fixed scheme.

It is impossible to create an opener, due to the lack of a fixed scheme

Possible. The path to the iCal file would be part of the URI. The source could check whether it corresponds to the one specified in the data source configuration.


When implementing the accessor some care must be taken. The crawler wouldn't be able to use this accessor, because that would make crawling a single file quadratic in complexity. It is quite possible though to do it without duplicating code.

When the scheme is fixed, an opener would be possible. It would have to be application specific (e.g. KontactIcalOpener, or OutlookIcalOpener). There is no generic way to do it.

should be implemented in the datasource instance though

Outlook, mail, addressbook and calendar

There is a crawler

There is an accessor, but there is no accessor factory (because there is no fixed scheme).

There is an Opener, but there is no OpenerFactory, due to the lack of a fixed scheme

There is only one Outlook in the system, so every URI that begins with urn:semdesk:outlook belongs to the datasource

Modify the crawler to use urn:semdesk:outlook and let the accessor registry work with prefixes, not schemes

When the scheme is fixed, creating a factory will be trivial

should be implemented in the datasource instance though

Thunderbird mail

(MBOX crawler)

Not implemented yet at all

We need a crawler. It would either read mbox files or communicate with Thunderbird directly. It would need some intelligence, because each folder is usually stored in a different file, but together they make up the entire mailbox. Also, Thunderbird can have multiple accounts: do we want each account to be a different data source, or throw everything into a single one (as with Outlook)?

we would also need an accessor

and an opener

and a way to map an email to a thunderbird data source