From: Leo S. <leo...@gn...> - 2010-09-27 17:30:56
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> </head> <body bgcolor="#ffffff" text="#000000"> Hi Aperture,<br> <br> Veronica got back to me in a private mail after evaluating it: the proposed change comes too late to be done as part of her thesis.<br> <br> Hey, now that we have the changes documented, would someone else want to jump on this topic?<br> ... it would be nice to have a configurable Outlook crawler.... plz... someone?<br> <br> best<br> Leo<br> <br> It was Leo Sauermann who said at the right time 22.09.2010 14:38 the following words: <blockquote cite="mid:4C9...@gn..." type="cite"> <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"> Hi Veronica,<br> <br> what is missing in Outlook are the folder patterns that the file crawler and web crawler implement using domainBoundaries.<br> With Outlook, you need something different, because the URI scheme does not reflect the folder structure of Outlook (more on that below).<br> <br> <br> To be compatible with the Aperture framework as such, I propose that the behaviour must be <br> "Specify special folders to crawl. Crawl the specified folders only: crawl the items inside the specified folders and their subfolders. Also include all parent folders between the datasource root folder and the specified folders, to know where the specified folders are in the hierarchy".<br> <br> To achieve this, the crawler must go into many subfolders to find the actual folder to be crawled, but that is ok. 
it just needs to NOT crawl the items in parent folders.<br> <br> There are two things to be done. To understand my example code, read up on the DomainBoundaries class and the related RDF config (and if you fail to understand them, ask again on the list what domain boundaries are, but I think you should be able to understand the logic of DomainBoundaries).<br> <a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://sourceforge.net/apps/trac/aperture/browser/aperture/trunk/core/src/main/java/org/semanticdesktop/aperture/datasource/config/DomainBoundaries.java">http://sourceforge.net/apps/trac/aperture/browser/aperture/trunk/core/src/main/java/org/semanticdesktop/aperture/datasource/config/DomainBoundaries.java</a><br> <br> There is a drawback though:<br> DomainBoundaries are NOT EXACTLY what will happen in Outlook.<br> <br> So actually, I propose below to use the concept of "include a folder path name match" instead of domain boundaries, but it is good to understand domain boundaries, so read up on them.<br> The reason why I can't use domain boundaries is that in our OutlookCrawler, the items are identified not by their folder path but only by their class and EntryID from Outlook, <br> so a mail stored in the folder root/My Data/Inbox/project/aperture/old will be identified by something like outlook://email/12123123123.<br> <br> Hence "include configured folder names". <br> That is what users typically want. It's very common to select Outlook folders in mobile-phone syncing tools ..."select folders to synch to your nokia"... etc... I have used such sync tools a lot, also for Google Calendar sync, address book sync, etc... 
all these tools show you a folder tree.<br> <br> <br> The code changes you need to make are in one class:<br> <br> *** crawling special folders ***<br> <br> In the crawler,<br> <a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://sourceforge.net/apps/trac/aperture/browser/aperture/trunk/core/src/main/java/org/semanticdesktop/aperture/outlook/OutlookCrawler.java">http://sourceforge.net/apps/trac/aperture/browser/aperture/trunk/core/src/main/java/org/semanticdesktop/aperture/outlook/OutlookCrawler.java</a><br> <br> The method to change is <br> <span class="kd"></span><span class="o"><br> <a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://sourceforge.net/apps/trac/aperture/browser/aperture/trunk/core/src/main/java/org/semanticdesktop/aperture/outlook/OutlookCrawler.java#L274">http://sourceforge.net/apps/trac/aperture/browser/aperture/trunk/core/src/main/java/org/semanticdesktop/aperture/outlook/OutlookCrawler.java#L274</a><br> <br> Change the method to something like this (parentheses only to separate the code from the rest of the mail):<br> <br> </span><span class="kd">private</span> <span class="kt">boolean</span> <span class="nf">crawlContainer</span><span class="o">(</span>OutlookResource<span class="o">.</span><span class="na">Folder</span> folder<span class="o">,</span> OutlookResource parent<span class="o">)</span> <span class="o"></span><br> <span class="o">{<br> </span>logger<span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"crawling folder: "</span><span class="o">+</span>folder<span class="o">.</span><span class="na">getUri</span><span class="o">());</span><br> <span class="o">boolean crawlThis = checkFolderName(folder); // crawl the items in this folder?<br> boolean crawlSub = (crawlThis || checkIsParentOf</span><span class="o">FolderName</span><span class="o">(folder)); // go into subfolders either if this folder is crawled, or if this folder is potentially the parent of a folder that 
will eventually be crawled<br> <br> if (!crawlSub) {<br> // crawlSub already includes crawlThis, so this is a dead end: neither this folder nor its subfolders are to be crawled, stop<br> logger.finer("not in domain boundaries, skipping this folder");<br> return true; // the return is not because of stopRequested, but because of domain boundaries.<br> }<br> // this is unchanged - the folder itself is always reported, so the parent hierarchy stays intact even if the items of the parent folders are not crawled<br> // data of folder<br> crawlSingleResource(folder, parent);<br> <br> // items inside folder<br> if (</span><span class="o">crawlThis) {</span><br> <span class="o"> if (!crawlSubItems(folder))<br> return false;<br> }<br> <br> // subfolders<br> return crawlSubFolders(folder);<br> }<br> <br> <br> <br> Of course, the two methods </span><span class="o">checkFolderName </span>and <span class="o">checkIsParentOf</span><span class="o">FolderName</span><span class="o"> </span>still have to be implemented; check out how the web crawler and the file crawler do this with domain boundaries. Basically it will be:<br> <br> <br> <span class="o"></span><span class="o">boolean checkFolderName(OutlookResource.Folder folder)<br> {<br> String foldername = .... get the full name including parents of that folder, something like "root/My Items/Contacts"<br> if (configuredFolders.isEmpty())<br> return true; // if there is nothing configured, crawl all.<br> for (String s : configuredFolders)<br> if (foldername.equals(s) || foldername.startsWith(s + "/")) // a configured folder itself, or one of its subfolders (see the proposed behaviour above)<br> return true;<br> return false;<br> }<br> <br> </span><span class="o">boolean checkIsParentOf</span><span class="o">FolderName</span><span class="o"></span><span class="o">(OutlookResource.Folder folder)<br> {<br> String foldername = .... 
get the full name including parents of that folder, something like "root/My Items/Contacts"<br> if (configuredFolders.isEmpty())<br> return true; // if there is nothing configured, crawl all.<br> for (String s : configuredFolders)<br> if (s.startsWith(foldername + "/")) // HERE IS THE DIFFERENCE: a configured folder lies somewhere below this folder<br> return true;<br> return false;<br> }</span><span class="o"></span><br> <span class="o"><br> </span>Note that <span class="o">checkIsParentOf</span><span class="o">FolderName</span><span class="o"></span><span class="o"> </span><span class="o">means: is the current folder path (something like "root/My Items/Contacts"</span>) a prefix of one of the <span class="o">configuredFolders</span>?<br> Note that using Outlook folder labels SUCKS and can only be a hack, Outlook really does everything using EntryIDs. But for you, using the label/name of a folder hierarchy may just work fine, and it's easier to debug and configure.<br> Note that you have to somehow provide <span class="o">configuredFolders. Preferably using OutlookDataSource!!! No hacks here, it should go into RDF and into </span><span class="o">OutlookDataSource</span><br> <span class="o"><br> <br> *** configuration ***<br> <br> You need to configure the crawler for the above setup.<br> You may need a GUI for the user to select the folders to crawl.<br> That is tricky.<br> <br> I know that all these sync-your-mobile-with-outlook applications are using a quite generic treeview control of Outlook, but I never found this ActiveX control within the Outlook ActiveX list. There has to be some generic control showing the overall tree of all things in Outlook, including nice icons. I have seen it many times, and I guess it must be a standard component. <br> If you don't find this, you will probably have to write the domain boundaries config by hand, something like this in n3/turtle (please check the proper RDF using the ontology or by looking into the filecrawler / webcrawler code):<br> "... 
an OutlookDataSource;<br> includeFolder "</span><span class="o">root/My Items/Contacts", "</span><span class="o">root/My Items/Inbox";</span><span class="o"><br> ...<br> </span>"<br> <br> That means you must find an ontology for includeFolder; there is one connected to OutlookDataSource (look for that class, I don't know where it is, it must be somewhere).<br> <br> Overall, if you do this, you MUST ensure that the thing still CRAWLS EVERYTHING if the domain boundaries are NOT CONFIGURED at all.<br> If you promise to write your code like this, I will help you further :-)<br> Otherwise the code won't be of much use to us.<br> <br> Please read up on domain boundaries before asking any questions, you need to understand that principle before you can understand this mail.<br> <br> best<br> Leo<br> <br> It was Antoni Mylka who said at the right time 17.09.2010 01:10 the following words: <blockquote cite="mid:4C9...@gm..." type="cite"> <pre wrap="">On 2010-08-26 16:33, Verónica Rivera Pelayo wrote: </pre> <blockquote type="cite"> <pre wrap="">Dear Developers, I'm using the OutlookCrawler from Aperture 1.5 and I have a problem when the email corpus is big (several years' worth of emails). The crawling takes more than one day to perform... My first idea to solve it was not crawling all the OutlookResources (Folders, Calendar, Appointments...). Is it possible to filter which OutlookResources are crawled? I had a look at the code, but didn't find anything useful. Is there maybe another solution to my problem? </pre> </blockquote> <pre wrap="">AFAIK no, this is a limitation of the current Outlook Crawler: it is either all or nothing. It is possible to implement this, though. You'd need to have a look at the Outlook object model on MSDN, and at the way it is accessed in our Java-based OutlookCrawler via the Jacob library. Imagine writing it in VBA and then just translate the VBA to Jacob invocations in Java. If you're willing to try to implement this, we'd be more than happy to help. 
Antoni Mylka <a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:ant...@gm...">ant...@gm...</a> ------------------------------------------------------------------------------ Start uncovering the many advantages of virtual appliances and start using them to simplify application deployment and accelerate your shift to cloud computing. <a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://p.sf.net/sfu/novell-sfdev2dev">http://p.sf.net/sfu/novell-sfdev2dev</a> _______________________________________________ Aperture-devel mailing list <a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:Ape...@li...">Ape...@li...</a> <a moz-do-not-send="true" class="moz-txt-link-freetext" href="https://lists.sourceforge.net/lists/listinfo/aperture-devel">https://lists.sourceforge.net/lists/listinfo/aperture-devel</a> </pre> </blockquote> <br> <br> <pre class="moz-signature" cols="72">-- Leo Sauermann, Dr. CEO and Founder mail: <a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:leo...@gn...">leo...@gn...</a> mobile: +43 6991 gnowsis <a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://www.gnowsis.com">http://www.gnowsis.com</a> helping people remember, so join our newsletter <a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://www.gnowsis.com/about/content/newsletter">http://www.gnowsis.com/about/content/newsletter</a> ____________________________________________________ </pre> </blockquote> </body> </html>