[Aperture-devel] Lucene support in Aperture

Hello Aperturians

I've been looking at the issue tracker and noticed the oldest open
issue: http://tinyurl.com/sf1362628. Feeding the full text into a
Lucene index seems to be a common use case. I've written two class
stubs, LuceneIndexUtil and LuceneIndexUtilTest. They compile under
Aperture with lucene-2.4.0-core added to the classpath (and the
appropriate imports). IMHO a utility class is a better approach than a
complete CrawlerHandler implementation: it is potentially usable in
any CrawlerHandler, whereas a complete implementation would be harder
to integrate into an app that already has an elaborate CrawlerHandler
of its own.

I'm sure that some of you have code lying around that does these
tricks. All contributions welcome: warm smiles for copy-pasted
snippets torn out of context and devoid of unit tests, eternal glory
for a complete LuceneIndexUtil class with heavy tests, use of
InferenceUtil, flattening of the property hierarchies, and proper
support for add/replace/remove.
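
To make "flattening the property hierarchies" concrete, here is a rough
sketch of what I mean, assuming the sub-property relations are already
available as a plain map. It does not use InferenceUtil; the class
name, the method name and the ex: properties in the usage note are
made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Aperture or Lucene. */
final class PropertyFlattener {

    /**
     * Returns the given property plus all of its direct and indirect
     * super-properties, found by walking the child -> parents map.
     */
    static Set<String> withSuperProperties(String property,
            Map<String, Set<String>> parents) {
        Set<String> result = new LinkedHashSet<String>();
        collect(property, parents, result);
        return result;
    }

    private static void collect(String property,
            Map<String, Set<String>> parents, Set<String> seen) {
        if (!seen.add(property)) {
            return; // already visited, which also guards against cycles
        }
        Set<String> direct = parents.get(property);
        if (direct == null) {
            return; // no known super-properties
        }
        for (String parent : direct) {
            collect(parent, parents, seen);
        }
    }
}
```

If ex:title is a sub-property of ex:label, a value of ex:title would
be indexed under both fields, so that a query on the general property
also finds values stated with the more specific one.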

My own resources are limited and I'm sure you have much more
experience in bridging RDF with Lucene than I do.

What's the best (i.e. most practical) approach to mapping RDF
properties to Lucene field names?
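
One possible answer, as a sketch only: keep a table of known
namespaces and use "prefix.localName" as the field name, falling back
to the full URI for anything unknown. The FieldNames class and its
methods are hypothetical, not existing Aperture or Lucene API. The
separator is a dot because a colon separates field and term in the
Lucene query syntax and would have to be escaped in queries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Aperture or Lucene. */
final class FieldNames {

    // namespace URI -> short prefix chosen by the application
    private final Map<String, String> prefixes =
        new LinkedHashMap<String, String>();

    void register(String namespace, String prefix) {
        prefixes.put(namespace, prefix);
    }

    /**
     * Returns "prefix.localName" if the property lies in a registered
     * namespace, otherwise the unchanged property URI.
     */
    String fieldNameFor(String propertyUri) {
        for (Map.Entry<String, String> entry : prefixes.entrySet()) {
            String namespace = entry.getKey();
            if (propertyUri.startsWith(namespace)
                    && propertyUri.length() > namespace.length()) {
                return entry.getValue() + "."
                    + propertyUri.substring(namespace.length());
            }
        }
        return propertyUri;
    }
}
```

With the NIE namespace registered under "nie", NIE.plainTextContent
would end up in a field called nie.plainTextContent.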

This request has been open for three years now. It's time to get it done.

All kinds of comments welcome.

-- 
Antoni Myłka
ant...@gm...

------ BEGIN THE CLASS STUB ----
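import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.vocabulary.NIE;

/** Feeds the full text of Aperture DataObjects into a Lucene index. */
public class LuceneIndexUtil {
    private final IndexWriter indexWriter;
    public LuceneIndexUtil(IndexWriter indexWriter) {
        this.indexWriter = indexWriter;
    }
    public void addObject(DataObject object) throws IOException {
        Document d = new Document();
        // The URI is stored verbatim (not analyzed) so it can serve as a key.
        d.add(new Field("uri",object.getID().toString(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        String content = object.getMetadata().getString(
            NIE.plainTextContent);
        // Objects without extracted full text get only the "uri" field.
        if (content != null) {
            d.add(new Field("content",content,
                Field.Store.YES,Field.Index.ANALYZED));
        }
        indexWriter.addDocument(d);
    }
    // TODO: not implemented yet
    public void replaceObject(DataObject object) throws IOException { }
    // TODO: not implemented yet
    public void removeObject(String url) throws IOException {  }
}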
------ END THE CLASS STUB ----

------ BEGIN THE TEST STUB ----
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.openrdf.model.URI;
import org.openrdf.model.impl.URIImpl;
import org.semanticdesktop.aperture.ApertureTestBase;
import org.semanticdesktop.aperture.accessor.DataObject;
import org.semanticdesktop.aperture.accessor.base.DataObjectBase;
import org.semanticdesktop.aperture.vocabulary.NIE;

public class LuceneIndexUtilTest extends ApertureTestBase {
	public void testAddBasicFullTextObject() throws Exception {
	    URI uri = new URIImpl("file:///home/antheque/file.txt");
	    DataObject object = new DataObjectBase(uri,null,
	        createRDFContainer(uri));
	    object.getMetadata().add(NIE.plainTextContent,
	        "This is the plain text content");

	    util.addObject(object);
	    writer.commit();
	    writer.close();
	    reader = IndexReader.open(dir);
	    // the writer is already closed at this point, so ask the reader
	    assertEquals(1,reader.numDocs());
	    Query query = parser.parse("plain text");
	    searcher = new IndexSearcher(reader);
	    TopDocs docs = searcher.search(query,null,1);
	    assertEquals(1, docs.totalHits);
	    Document d = reader.document(docs.scoreDocs[0].doc);
	    assertEquals("file:///home/antheque/file.txt",d.get("uri"));
	    assertEquals("This is the plain text content",d.get("content"));
	    object.dispose();
	}
	public void setUp() throws Exception {
		dir = new RAMDirectory();
		analyzer = new StandardAnalyzer();
		writer = new IndexWriter(dir,analyzer,true,
		    IndexWriter.MaxFieldLength.LIMITED);
		util = new LuceneIndexUtil(writer);
		parser = new QueryParser("content",analyzer);
	}
	public void tearDown() throws Exception {
		if (searcher != null) {searcher.close();}
		searcher = null;
		if (reader != null) {reader.close();}
		reader = null;
		writer.close();
		writer = null;
		dir.close();
		dir = null;
	}
	private Directory dir;
	private IndexReader reader;
	private IndexWriter writer;
	private IndexSearcher searcher;
	private Analyzer analyzer;
	private QueryParser parser;
	private LuceneIndexUtil util;
}
--------- END THE TEST STUB -------