Just a quick update – I have built upon my XSLT work from last week by integrating the Java Aperture library with VuFind. This makes it possible to harvest documents such as PDFs or Word files and extract their text contents directly into the Solr index. It was easier to get working than I expected, though I did run into one apparent bug in Aperture’s shell scripts under Linux! See notes here:
It may be useful to do something similar for SolrMarc-based imports – see http://vufind.org/jira/browse/VUFIND-274 for details.
Let me know if you have questions about this – I’m sure if anyone starts using this in earnest, we’ll need to make some further adjustments for improved stability… but as a proof of concept, it seems to work quite nicely!