A Recoll Indexing Filter for Lotus Notes Databases
This filter application provides a mechanism for extracting and indexing all of
the documents contained in a Lotus Notes database with the Recoll text search
tool on a Linux system.
Requirements:
=============
You must have the Linux version of the IBM Lotus Notes desktop client installed
on the computer where you install this filter. The filter uses the Java API that
is installed as part of the client in order to access the database files.
You must also, of course, have Jean-Francois Dockes' Recoll text search tool
installed.
Installation:
=============
If you are updating from a previous release, only steps 1 and 2 are required.
1) The filter is packaged as a ZIP file. Download and unpack this ZIP file
to a temporary directory on your computer:
https://sourceforge.net/projects/rcollnotesfiltr/files/latest/download
2) Copy these files to the Recoll filters directory, usually
/usr/share/recoll/filters:
rcllnotes.jar
rcllnotes
rclOpenNotesClient
3) Add the following line to the mimemap file in your Recoll configuration
directory, usually ~/.recoll/mimemap:
.nsf = application/x-extension-nsf
See the examples/mimemap file contained in the ZIP file for an example.
4) Add the following line to the [index] section of the mimeconf file in
your Recoll configuration directory, usually ~/.recoll/mimeconf:
application/x-extension-nsf = execm rcllnotes
See the examples/mimeconf file contained in the ZIP file for an example.
5) Add the following line to the [stored] section of the fields file in
your Recoll configuration directory, usually ~/.recoll/fields:
notesurl=
See the examples/fields file contained in the ZIP file for an example.
6) Add the following line to the top of the mimeview file in your Recoll
configuration directory, usually ~/.recoll/mimeview. It goes above the
[view] section of the file, if there is one:
xallexcepts = application/pdf application/postscript application/x-dvi text/html|gnuinfo text/html|chm text/html|epub text/html|notesdoc
If your mimeview file already contains this line, please just add the
text/html|notesdoc string to the end.
Add the following line to the [view] section of the mimeview file:
text/html|notesdoc = /usr/share/recoll/filters/rclOpenNotesClient %f
If you do not have Recoll installed in the default location, please
adjust the path to the rclOpenNotesClient script above accordingly.
See the examples/mimeview file contained in the ZIP file for examples.
7) Copy the examples/.rcllnotes configuration file to your home directory
as ~/.rcllnotes. Edit that file and put your Lotus Notes password into it
in the location specified in the file. You may also enable the other
settings in this file if you wish. They are all optional and are not
required for the filter to function properly in most cases. Almost all
of those optional settings will cause the indexing process to run more
slowly than it normally would.
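Before the first indexing run, it can help to confirm that the entries from
steps 3 to 6 actually landed in the configuration files. The following is a
minimal sketch, not part of the filter package. The function names are made up
for illustration, and it only checks that each entry appears somewhere in the
file. It does not verify that the entry is in the right section.

```python
from pathlib import Path

# Expected entries, copied from installation steps 3 to 6.
EXPECTED = {
    "mimemap": ".nsf = application/x-extension-nsf",
    "mimeconf": "application/x-extension-nsf = execm rcllnotes",
    "fields": "notesurl=",
    "mimeview": "text/html|notesdoc",
}


def normalize(line):
    """Drop all whitespace so that 'a = b' and 'a=b' compare equal."""
    return "".join(line.split())


def missing_entries(config_dir):
    """Return the names of the config files that lack their expected entry."""
    missing = []
    for name, needle in EXPECTED.items():
        path = Path(config_dir) / name
        text = ""
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
        # Join backslash-continued lines before comparing.
        text = text.replace("\\\n", "")
        lines = [
            normalize(line)
            for line in text.splitlines()
            if not line.lstrip().startswith("#")
        ]
        if not any(normalize(needle) in line for line in lines):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumes the default configuration directory, ~/.recoll.
    for name in missing_entries(Path.home() / ".recoll"):
        print(f"{name}: expected entry not found: {EXPECTED[name]}")
```

Running the script prints one line per file that still needs attention and
prints nothing when all four entries are present.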
The installation is now complete. When recollindex runs it should index all of
the Lotus Notes databases on your computer. Since these databases can be very
large and may contain thousands of documents, do not be surprised if recollindex
runs for much longer than you are used to.
Remember that you can control which files are indexed through the preference
settings in the Recoll GUI or by directly editing your Recoll config file.
Also, if you run recollindex interactively and watch the console output, do not
be surprised if you see long pauses in recollindex processing while it is
working on a Lotus Notes database. The rcllnotes filter uses a Java application
to extract all of the documents from the database as part of the process and you
will not see any console output from recollindex while that is happening. For
(very) large databases this pause can be (very) lengthy.
The ~/.rcllnotes configuration file contains a debug log setting that you can
enable to instruct the Java application to output detailed information about its
activity to a log file. You can monitor that log file during processing to see
what the Java application is doing if you are concerned. This log can also be
useful in troubleshooting problems.
Features, Quirks, and FAQs:
===========================
* What is indexed? - Notes documents and their attachments are converted
into HTML format by this filter, and with the help of the rest of the
Recoll filters the resulting text is indexed by recollindex. This filter
ignores graphics that are embedded in Notes documents, as there doesn’t
seem to be much meaningful metadata associated with them. It does, however,
process graphic images that are attachments, since they might contain
interesting metadata tags.
* The Preview and Open links in search results lists - The Open link works.
The Preview link does not, most of the time. Indexed attachments are
opened in their respective applications. Notes documents are opened in the
Notes client.
The Preview link will probably never work reliably with Notes documents or
attachments. The Recoll application compares the time a document was
indexed with the current modification time of the Notes database file it
came from, and refuses to open the preview if the database file is newer.
Given the dynamic nature of Notes database files and the many actions that
can update them, including just reading from them, it is unlikely that
these two timestamps will match, and therefore unlikely that the Preview
link will work.
* The indexing process takes a long time - Yes it does. The process of
extracting documents from a Notes database is slow work. There can be
thousands of documents and attachments in a database, even tens of
thousands. The recollindex application, and therefore this filter, runs
with the lowest possible priority on the system in order to have as little
impact as possible on the other work you are doing. All of these factors
contribute to the length of time it can take to index a Lotus Notes
database. I have a mail archive database which contains just under 10,000
documents and attachment files. It takes 15 minutes to extract and index
this file and that's on a system with an 8-way i7 Intel CPU and 32GB of
RAM. Have patience.
Consider setting up a separate index and indexing run for Notes databases.
See "Configurations, multiple indexes" in the Recoll manual for guidance
on that.
http://www.lesbonscomptes.com/recoll/usermanual/rcl.indexing.html#RCL.INDEXING.INTRODUCTION.CONFIG
I have one index for my normal files, a second index for my active Notes
databases, and a third index for my inactive archive databases. Each index
is updated by recollindex on a different schedule specified in my crontab
file. My Recoll search GUI is configured to search all three of these
indexes as if they were one.
The only potential “gotcha” with this configuration arises when two
instances of recollindex try to open the same Notes database at the same
time. When that happens there will likely be a
deadlock with both indexing processes hanging indefinitely. I've seen it
happen. The trick is to be sure that the topdirs and skippedPaths settings
in your respective recoll.conf files prevent any overlaps in the files
that will be processed by each of the instances you are setting up.
* Performance impacts of multi-threading in Recoll 1.19 - In version 1.19
Recoll became multi-threaded. This means that Recoll will invoke multiple
instances of the rcllnotes filter simultaneously. Those multiple
instances can chew up a lot of CPU and memory and can slow down your
system. If this happens to you, you will want to read about the
configuration options that allow you to control the number of "file
conversion and data extraction" threads that are spawned by Recoll. See
the details here:
http://www.lesbonscomptes.com/recoll/usermanual/usermanual.html#RCL.INSTALL.CONFIG.RECOLLCONF.IDXTHREADS
The value that needs to be adjusted is the first number in the "thrQSizes"
and "thrTCounts" strings. You will need to experiment to find the right
values for your particular system. There is no single "correct" answer.
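For the multiple-index setup described in the FAQ above, the deadlock risk
comes from two configurations covering the same Notes database. The sketch
below is one way to spot overlapping topdirs before scheduling the indexing
runs. It is an illustration only: the function names and the directory names
in the demo are hypothetical, the topdirs lists are typed in by hand rather
than parsed from recoll.conf, and skippedPaths is not taken into account.

```python
import os


def is_within(path, parent):
    """True if path equals parent or lies somewhere beneath it."""
    path = os.path.abspath(os.path.expanduser(path))
    parent = os.path.abspath(os.path.expanduser(parent))
    return os.path.commonpath([path, parent]) == parent


def overlapping_topdirs(topdirs_a, topdirs_b):
    """Return (a, b) pairs where one directory tree contains the other."""
    return [
        (a, b)
        for a in topdirs_a
        for b in topdirs_b
        if is_within(a, b) or is_within(b, a)
    ]


if __name__ == "__main__":
    # Hypothetical topdirs values for two separate Recoll configurations.
    active = ["~/notes/active"]
    archive = ["~/notes/archive", "~/notes/active/old"]
    for a, b in overlapping_topdirs(active, archive):
        print(f"overlap: {a} and {b}")
```

Any pair that is reported needs either a narrower topdirs value or a matching
skippedPaths entry in one of the two configurations.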
Release History:
================
* 1.0.0 - Initial release - 2012-12-11
* 1.1.0 - Updated to accept additional title field types - 2012-30-11
- Fixed minor typo in Installation section of README
* 1.2.0 - Updated for Notes 9 - 2013-08-01
- Updated the RecollFilter Java application that extracts documents
from databases to use a new thread model documented with Notes 9.
This is supposed to be backward compatible with Notes 8.5 but I
am not able to test that.
- Added "-Xmx1g" JVM option in rcllnotes filter script to increase
maximum heap size to 1GB in order to eliminate exceptions when
processing larger Notes DBs.
How The Filter Works:
=====================
1) Assuming that the various ~/.recoll/mime* files are set up properly
(see the Installation section above), when recollindex encounters a
Lotus Notes database (*.nsf) file it invokes the rcllnotes filter to
open the file.
2) The filter invokes a Java application that opens the database file and
reads every document in the database in turn, extracting it in XML
format. During the extraction process the document's XML is transformed
into HTML using an XSLT stylesheet. The Java application then walks
through this HTML, extracting any attachment files from the HTML. It
then pipes these attachment files back to the rcllnotes filter as
individual files in base64 text format. After handling all of the
attachments, the Java application pipes the remaining Notes document
HTML back to the rcllnotes filter. It repeats this process until all of
the database file’s documents have been extracted.
3) The rcllnotes filter takes the stream of data that is piped to it by the
Java application and parses it into individual files. The filter pipes
each of these files back to recollindex for indexing. The attachment
files that are part of the stream are converted from their base64 format
into their original binary format before they are piped to recollindex.
Recollindex may submit these attachment files to other filters for
further processing, depending on their mime type. Recollindex ultimately
inserts the contents of the Notes documents and the attachments into its
index, where they are available for searching.
4) The rcllnotes filter is also invoked by the recoll GUI application when
you elect to open an attachment or Notes document that has been
presented to you as part of a search result. In this case the filter is
given both the *.nsf file’s name and the Lotus Notes document UNID and
asked to retrieve the document. It does so, fetching that individual
document and subjecting it to the same XML/XSLT conversion and attachment
processing described above before returning the results to the recoll
application. If the file you selected is an attachment, recoll will open
it using its normal process for that type of file. If it is a Notes
document then the recoll application passes the HTML representation of
the document that the filter produced on to the command specified for
Notes documents in the mimeview file, the rclOpenNotesClient script.
That Python script extracts the NotesURL for the document from inside
the HTML file and invokes the Notes client, passing the NotesURL to it
on the command line. The client will then start up, if it isn't already
started, and open that document. If Notes is already started, the
requested document will simply be opened in the existing client window.
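Two of the steps above can be sketched in a few lines of Python: turning a
base64-encoded attachment back into binary (step 3) and pulling the NotesURL
out of a document's HTML (step 4). This is a rough illustration, not the code
shipped with the filter. In particular, the sketch assumes that the NotesURL
is carried in a <meta name="notesurl"> tag, which this README does not
confirm, and the function and class names are invented for the example.

```python
import base64
from html.parser import HTMLParser


def decode_attachment(b64_text):
    """Convert base64 text (line breaks allowed) back to the original bytes."""
    return base64.b64decode("".join(b64_text.split()), validate=True)


class NotesUrlFinder(HTMLParser):
    """Remember the content of the first <meta name="notesurl"> tag.

    The tag name and layout are assumptions made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.notes_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.notes_url is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "notesurl":
            self.notes_url = attr.get("content")


def find_notes_url(html_text):
    """Return the NotesURL found in html_text, or None if there is none."""
    finder = NotesUrlFinder()
    finder.feed(html_text)
    return finder.notes_url
```

A script like rclOpenNotesClient would then hand the returned URL to the Notes
client on its command line, as described in step 4.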
Support:
========
There is no formal support for this code. I will provide "best-effort" support
insofar as my real life allows. I use the discussion forum and the bug
tracking system on SourceForge for these purposes. Please make use of them
if you have problems or questions: http://rcollnotesfiltr.sourceforge.net/
Licensing:
==========
This code is licensed under the terms of the GNU GPL v3 license.