               A Recoll Indexing Filter for Lotus Notes Databases

This filter application provides a mechanism for extracting and indexing all of
the documents contained in a Lotus Notes database with the Recoll text search 
tool on a Linux system.

Requirements:
=============

You must have the Linux version of the IBM Lotus Notes desktop client installed
on the computer where you install this filter. The filter uses the Java API that
is installed as part of the client in order to access the database files.

You must also, of course, have Jean-Francois Dockes' Recoll text search tool 
installed.

Installation:
=============

    If you are updating from a previous release, only steps 1 and 2 are required.

    1)  The filter is packaged as a ZIP file. Download and unpack this ZIP file 
        to a temporary directory on your computer:
    
            https://sourceforge.net/projects/rcollnotesfiltr/files/latest/download
    
    2)  Copy these files to the Recoll filters directory, usually this is the
        /usr/share/recoll/filters directory:

            rcllnotes.jar
            rcllnotes
            rclOpenNotesClient
         
    3)  Add the following line to the mimemap file in your Recoll configuration
        directory, usually this is the ~/.recoll/mimemap file:

            .nsf = application/x-extension-nsf 

        See the examples/mimemap file contained in the ZIP file for an example.
        
    4)  Add the following line to the [index] section of the mimeconf file in
        your Recoll configuration directory, usually this is the 
        ~/.recoll/mimeconf file:

            application/x-extension-nsf = execm rcllnotes 

        See the examples/mimeconf file contained in the ZIP file for an example.
        
    5)  Add the following line to the [stored] section of the fields file in
        your Recoll configuration directory, usually this is the 
        ~/.recoll/fields file:

            notesurl= 

        See the examples/fields file contained in the ZIP file for an example.
        
    6)  Add the following line to the top of the mimeview file in your Recoll 
        configuration directory, usually this is the ~/.recoll/mimeview file.
        This would go above the [view] section in the file, if there is one:
        
            xallexcepts = application/pdf application/postscript application/x-dvi \
             text/html|gnuinfo text/html|chm text/html|epub text/html|notesdoc

        If your mimeview file already contains this line, please just add the 
        text/html|notesdoc string to the end.

        Add the following line to the [view] section of the mimeview file:

            text/html|notesdoc = /usr/share/recoll/filters/rclOpenNotesClient %f

        If you do not have Recoll installed in the default location, please 
        adjust the path to the rclOpenNotesClient script above accordingly.

        See the examples/mimeview file contained in the ZIP file for examples.
        
    7)  Copy the examples/.rcllnotes configuration file to your home directory
        as ~/.rcllnotes. Edit that file and put your Lotus Notes password into
        it in the location specified in the file. You may also enable the other
        settings in this file if you wish; they are all optional and are not
        required for the filter to function properly in most cases. Almost all
        of those optional settings will cause the indexing process to run more
        slowly than it normally would.
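
A quick way to double-check the configuration edits from the steps above is a
small script like the following. This is only a sketch and is not part of the
filter package: the function name missing_settings is made up for illustration,
and it assumes your Recoll configuration directory is ~/.recoll. It only looks
for the expected text somewhere in each file; it does not verify that a line
sits in the right section.

```python
from pathlib import Path

# Text that the installation steps ask you to add, keyed by config file name.
# The fragments are copied from the README; everything else here is a
# hypothetical helper, not something shipped with the filter.
EXPECTED = {
    "mimemap": ".nsf = application/x-extension-nsf",
    "mimeconf": "application/x-extension-nsf = execm rcllnotes",
    "fields": "notesurl=",
    "mimeview": "text/html|notesdoc",
}


def missing_settings(config_dir: Path) -> list[str]:
    """Return the names of config files that lack their expected fragment."""
    missing = []
    for filename, fragment in EXPECTED.items():
        path = config_dir / filename
        text = path.read_text(encoding="utf-8") if path.is_file() else ""
        # Join lines that were continued with a trailing backslash.
        text = text.replace("\\\n", "")
        # Drop comment lines and ignore all whitespace differences.
        lines = [
            "".join(line.split())
            for line in text.splitlines()
            if not line.lstrip().startswith("#")
        ]
        wanted = "".join(fragment.split())
        if not any(wanted in line for line in lines):
            missing.append(filename)
    return missing


if __name__ == "__main__":
    # Assumes the default configuration directory; adjust if yours differs.
    print(missing_settings(Path.home() / ".recoll"))
```

An empty list means every fragment was found. Any file names that are printed
point to the step that still needs attention.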

The installation is now complete. When recollindex runs it should index all of 
the Lotus Notes databases on your computer. Since these databases can be very 
large and may contain thousands of documents, do not be surprised if recollindex
runs for much longer than you have previously been used to.

Remember that you can control which files are indexed through the preference 
settings in the Recoll GUI or by directly editing your Recoll config file.

Also, if you run recollindex interactively and watch the console output, do not 
be surprised if you see long pauses in recollindex processing while it is 
working on a Lotus Notes database. The rcllnotes filter uses a Java application 
to extract all of the documents from the database as part of the process and you 
will not see any console output from recollindex while that is happening. For 
(very) large databases this pause can be (very) lengthy.

The ~/.rcllnotes configuration file contains a debug log setting that you can 
enable to instruct the Java application to output detailed information about its 
activity to a log file. You can monitor that log file during processing to see 
what the Java application is doing if you are concerned. This log can also be 
useful in troubleshooting problems.

Features, Quirks, and FAQs:
===========================

    * What is indexed? - Notes documents and their attachments are converted 
      into HTML format by this filter, and with the help of the rest of the 
      Recoll filters the resulting text is indexed by recollindex. This filter
      ignores graphics that are embedded in Notes documents as there doesn’t 
      seem to be much meaningful metadata associated with them. This filter does
      process graphic images that are attachments however, since they might 
      contain interesting metadata tags.
      
    * The Preview and Open links in search results lists - The Open link works. 
      The Preview link does not, most of the time. Indexed attachments are 
      opened in their respective applications. Notes documents are opened in the
      Notes client.

      The Preview link will probably never work reliably with Notes documents or 
      attachments. Recoll compares the time a document was indexed with the
      current modification time of the Notes database file it came from, and
      refuses to open the preview if the database file is newer. Given the
      dynamic nature of Notes database files and the many actions that can
      update them, including just reading from them, it’s unlikely that these
      two timestamps will match, and therefore unlikely that the Preview link
      will work.
      
    * The indexing process takes a long time - Yes it does. The process of 
      extracting documents from a Notes database is slow work. There can be 
      thousands of documents and attachments in a database, even tens of 
      thousands. The recollindex application, and therefore this filter, runs
      with the lowest possible priority on the system in order to have as little
      impact as possible on the other work you are doing. All of these factors
      contribute to the length of time it can take to index a Lotus Notes
      database. I have a mail archive database which contains just under 10,000
      documents and attachment files. It takes 15 minutes to extract and index
      this file and that's on a system with an 8-way i7 Intel CPU and 32GB of
      RAM. Have patience.

      Consider setting up a separate index and indexing run for Notes databases.
      See "Configurations, multiple indexes" in the Recoll manual for guidance 
      on that.
      
      http://www.lesbonscomptes.com/recoll/usermanual/rcl.indexing.html#RCL.INDEXING.INTRODUCTION.CONFIG 
      
      I have one index for my normal files, a second index for my active Notes
      databases, and a third index for my inactive archive databases. Each index
      is updated by recollindex on a different schedule specified in my crontab
      file. My Recoll search GUI is configured to search all three of these
      indexes as if they were one.

      The only potential “gotcha” with this configuration is when two instances
      of recollindex running at the same time both try to open the same Notes
      database at the same time. When that happens there will likely be a 
      deadlock with both indexing processes hanging indefinitely. I've seen it
      happen. The trick is to be sure that the topdirs and skippedPaths settings
      in your respective recoll.conf files prevent any overlaps in the files
      that will be processed by each of the instances you are setting up.

    * Performance impacts of multi-threading in Recoll 1.19 - In version 1.19
      Recoll became multi-threaded.  This means that Recoll will invoke multiple
      instances of the rcllnotes filter simultaneously.  Those multiple
      instances can chew up a lot of CPU and memory and can slow down your 
      system.  If this happens to you, you will want to read about the 
      configuration options that allow you to control the number of "file
      conversion and data extraction" threads that are spawned by Recoll.  See
      the details here:
      
      http://www.lesbonscomptes.com/recoll/usermanual/usermanual.html#RCL.INSTALL.CONFIG.RECOLLCONF.IDXTHREADS
      
      The value that needs to be adjusted is the first number in the "thrQSizes"
      and "thrTCounts" strings.  You will need to experiment to find the right
      values for your particular system; there is no single "correct" answer.
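
The deadlock warning in the multiple-index item above comes down to making sure
that no two indexes cover the same directory tree. The sketch below shows one
way to check this before scheduling the indexing runs. It is an illustration
only: the function names and the example paths are hypothetical, you have to
supply the topdirs values from each recoll.conf yourself, and skippedPaths is
deliberately ignored, so a reported overlap may be harmless if a skippedPaths
entry already excludes it.

```python
import os
from itertools import combinations


def nested(a: str, b: str) -> bool:
    """True if the two paths are equal or one contains the other."""
    a, b = os.path.abspath(a), os.path.abspath(b)
    return os.path.commonpath([a, b]) in (a, b)


def overlapping_topdirs(indexes: dict[str, list[str]]) -> list[tuple[str, str, str, str]]:
    """Compare the topdirs of every pair of indexes.

    `indexes` maps an index name (any label you like) to its topdirs list.
    Returns (index_a, dir_a, index_b, dir_b) for each overlapping pair.
    """
    clashes = []
    for (name_a, dirs_a), (name_b, dirs_b) in combinations(indexes.items(), 2):
        for dir_a in dirs_a:
            for dir_b in dirs_b:
                if nested(dir_a, dir_b):
                    clashes.append((name_a, dir_a, name_b, dir_b))
    return clashes


if __name__ == "__main__":
    # Hypothetical layout: three indexes, two of which overlap.
    example = {
        "files": ["/home/me/Documents"],
        "active": ["/home/me/notes/data"],
        "archive": ["/home/me/notes/data/archive"],
    }
    for clash in overlapping_topdirs(example):
        print("overlap:", clash)
```

Any overlap that is reported needs either a narrower topdirs entry or a
matching skippedPaths entry in one of the two configurations.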
      
Release History:
================

    * 1.0.0 - Initial release - 2012-11-12
    * 1.1.0 - Updated to accept additional title field types - 2012-11-30
            - Fixed minor typo in Installation section of README
    * 1.2.0 - Updated for Notes 9 - 2013-08-01
            - Updated the RecollFilter Java application that extracts documents
              from databases to use a new thread model documented with Notes 9.
              This is supposed to be backward compatible with Notes 8.5 but I
              am not able to test that.
            - Added "-Xmx1g" JVM option in rcllnotes filter script to increase 
              maximum heap size to 1GB in order to eliminate exceptions when 
              processing larger Notes DBs.
              
How The Filter Works:
=====================

    1)  Assuming that the various ~/.recoll/mime* files are set up properly
        (see the Installation section above), when recollindex encounters a 
        Lotus Notes database (*.nsf) file it invokes the rcllnotes filter to 
        open the file.
    
    2)  The filter invokes a Java application that opens the database file and
        reads every document in the database in turn, extracting it in XML 
        format. During the extraction process the document's XML is transformed
        into HTML using an XSLT stylesheet. The Java application then walks 
        through this HTML, extracting any attachment files from the HTML. It 
        then pipes these attachment files back to the rcllnotes filter as 
        individual files in base64 text format. After handling all of the 
        attachments, the Java application pipes the remaining Notes document 
        HTML back to the rcllnotes filter. It repeats this process until all of
        the database file’s documents have been extracted.
    
    3)  The rcllnotes filter takes the stream of data that is piped to it by the
        Java application and parses it into individual files. The filter pipes 
        each of these files back to recollindex for indexing. The attachment 
        files that are part of the stream are converted from their base64 format
        into their original binary format before they are piped to recollindex. 
        Recollindex may submit these attachment files to other filters for 
        further processing, depending on their mime type. Recollindex ultimately 
        inserts the contents of the Notes documents and the attachments into its 
        index where it is available for searching.
    
    4)  The rcllnotes filter is also invoked by the recoll GUI application when
        you elect to open an attachment or Notes document that has been 
        presented to you as part of a search result. In this case the filter is
        given both the *.nsf file’s name and the Lotus Notes document UNID and
        asked to retrieve the document. It does so, fetching that individual 
        document and subjecting it to the same XML/XSLT conversion and attachment
        processing described above before returning the results to the recoll 
        application. If the file you selected is an attachment, recoll will open
        it using its normal process for that type of file. If it is a Notes 
        document then the recoll application passes the HTML representation of 
        the document that the filter produced on to the command specified for
        Notes documents in the mimeview file, the rclOpenNotesClient script. 
        That Python script extracts the NotesURL for the document from inside 
        the HTML file and invokes the Notes client, passing the NotesURL to it 
        on the command line. The client will then start up, if it isn't already 
        started, and open that document. If Notes is already started, the 
        requested document will simply be opened in the existing client window.
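
To make the last step more concrete, here is a rough sketch of how a NotesURL
could be pulled out of the HTML that the filter produces. This is not the code
in the rclOpenNotesClient script. It assumes that the URL is carried in a
<meta name="notesurl" content="..."> tag, which is a guess based on the
"notesurl" stored field from the Installation section; the real HTML layout may
differ. The names find_notes_url and _NotesUrlFinder are made up for this
example.

```python
from html.parser import HTMLParser
from typing import Optional


class _NotesUrlFinder(HTMLParser):
    """Remembers the content of the first <meta name="notesurl"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta" or self.url is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "notesurl":
            self.url = attributes.get("content")


def find_notes_url(html: str) -> Optional[str]:
    """Return the NotesURL found in the HTML text, or None if there is none."""
    finder = _NotesUrlFinder()
    finder.feed(html)
    return finder.url


if __name__ == "__main__":
    # Hypothetical document; the URL below is invented for the example.
    sample = (
        '<html><head><meta name="notesurl" '
        'content="notes:///archive.nsf/0/ABC123?OpenDocument"></head>'
        "<body>Hello</body></html>"
    )
    print(find_notes_url(sample))
```

The URL that is found would then be handed to the Notes client on its command
line, as described in step 4. The client command itself is not shown because
this README does not name it.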

Support:
========

There is no formal support for this code. I will provide "best-effort" support 
insofar as my real life will allow. I use the discussion forum and the bug 
tracking system on SourceForge for these purposes. Please make use of them
if you have problems or questions: http://rcollnotesfiltr.sourceforge.net/

Licensing:
==========

This code is licensed under the terms of the GNU GPL v3 license. 
Source: README, updated 2013-08-06