scott gerard - 2006-12-25

How do I index additional fields in HTML?  For example, I would like to include the <title> in the index.  I have not been able to get the JTidyHtmlIndexer to do this at all. I was able to get NekoHtmlIndexer to do it with the following:

<fields>
   <luceneField name="fullText" xpathSelect="//*" type="Text" ocurSep="|" />
   <luceneField name="title" xpathSelect="//TITLE" type="Text" ocurSep="|" />
</fields>

But this only works if the <title> tag is not namespace qualified in the source document.  Some of my HTML files explicitly declare a namespace.

   <html xmlns="http://www.w3.org/1999/xhtml">

In this case, xpathSelect="//TITLE" is incorrect because it does not specify a namespace.  I tried

   <luceneField name="title" xpathSelect="//xhtml:TITLE" type="Text" ocurSep="|" />

This does not work because there is no declaration for the xhtml namespace prefix.  While debugging through the code, I found that only namespaces declared in the INDEXED document are passed to Jaxen.  But the xhtml prefix needs to be declared in the LiusConfig.xml document for the xpathSelect expression to always work.
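
To illustrate the point about prefix declarations, here is a minimal standalone sketch. It uses the JDK's javax.xml.xpath API as a stand-in, not Jaxen and not LIUS, so the class name NamespacePrefixSketch and the select helper are hypothetical. The sample documents use upper-case element names only to match the expressions above. It shows that the prefix has to be bound on the XPath side, independently of whatever the indexed document declares:

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of LIUS or Jaxen.
class NamespacePrefixSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Evaluates an XPath expression against an XML string and returns the
    // string value of the first matching node ("" if nothing matches).
    static String select(String expression, String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace matching
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "xhtml" prefix on the XPath side, independently of
            // any prefixes declared in the document being queried.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String qualified = "<HTML xmlns=\"" + XHTML_NS + "\">"
                + "<HEAD><TITLE>Hello</TITLE></HEAD></HTML>";
        System.out.println("[" + select("//TITLE", qualified) + "]");
        System.out.println("[" + select("//xhtml:TITLE", qualified) + "]");
    }
}
```

The document here declares only a default namespace and no xhtml prefix, yet the prefixed expression matches because the binding comes from the XPath configuration. The unprefixed //TITLE matches nothing in that document.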

I also tried

   <luceneField name="title" xpathSelect="//*[local-name(.)='TITLE']" type="Text" ocurSep="|" />

which should work regardless of the namespace of the indexed document.  However, I get the error "ERROR [main] (XmlFileIndexer.java:148) - Function :local-name".
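
As a sanity check on the expression itself, here is a small sketch of the local-name() approach outside LIUS. It again uses the JDK's javax.xml.xpath API rather than Jaxen, and the class name LocalNameSketch and the titleOf helper are hypothetical. It only shows the behaviour I expect from XPath 1.0; it says nothing about why Jaxen reports the function error inside LIUS:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of LIUS or Jaxen.
class LocalNameSketch {
    // Returns the text of the first element whose local name is TITLE,
    // whatever namespace (if any) that element belongs to.
    static String titleOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath()
                    .evaluate("//*[local-name()='TITLE']", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String plain = "<HTML><HEAD><TITLE>Plain</TITLE></HEAD></HTML>";
        String qualified = "<HTML xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<HEAD><TITLE>Qualified</TITLE></HEAD></HTML>";
        System.out.println(titleOf(plain));
        System.out.println(titleOf(qualified));
    }
}
```

No prefix binding is needed here, because local-name() compares only the part of the element name after any prefix.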

Your help is much appreciated.

Scott