From: Antoni M. <ant...@gm...> - 2008-11-13 17:28:32
Hello Aperturians,

I've been looking at the issue tracker and noticed the oldest open issue: http://tinyurl.com/sf1362628. This seems to be a common use case: feeding the full text into a Lucene index.

I've written two class stubs, LuceneIndexUtil and LuceneIndexUtilTest. They compile under Aperture with lucene-2.4.0-core added to the classpath (and appropriate imports). IMHO a utility class like this is the better approach, because it's potentially usable everywhere, in all CrawlerHandlers; a complete CrawlerHandler implementation would be harder to integrate into an app that already has an elaborate CrawlerHandler.

I'm sure that some of you have code lying around that does these tricks. All contributions welcome: warm smiles for copy-pasted snippets torn out of context and devoid of unit tests, eternal glory for a complete LuceneIndexUtil class with heavy tests, use of InferenceUtil, flattening of the property hierarchies, and proper support for add/replace/remove.

My own resources are limited and I'm sure you have much more experience bridging RDF with Lucene than I do. What's the best (i.e. most practical) approach to mapping RDF properties to Lucene field names?

This request has been there for three years now. It's time to get it done. All kinds of comments welcome.

--
Antoni Myłka
ant...@gm...

------ BEGIN THE CLASS STUB ----
public class LuceneIndexUtil {

    private IndexWriter indexWriter;

    public LuceneIndexUtil(IndexWriter indexWriter) {
        this.indexWriter = indexWriter;
    }

    /** Adds the object's URI and, if present, its plain-text content as one Lucene document. */
    public void addObject(DataObject object) throws IOException {
        Document d = new Document();
        // The URI is stored unanalyzed so it can be matched exactly later.
        d.add(new Field("uri", object.getID().toString(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        String content = object.getMetadata().getString(NIE.plainTextContent);
        if (content != null) {
            d.add(new Field("content", content,
                    Field.Store.YES, Field.Index.ANALYZED));
        }
        indexWriter.addDocument(d);
    }

    public void replaceObject(DataObject object) throws IOException {
        // not implemented yet
    }

    public void removeObject(String url) throws IOException {
        // not implemented yet
    }
}
------ END THE CLASS STUB ----

------ BEGIN THE TEST STUB ----
public class LuceneIndexUtilTest extends ApertureTestBase {

    public void testAddBasicFullTextObject() throws Exception {
        URI uri = new URIImpl("file:///home/antheque/file.txt");
        DataObject object = new DataObjectBase(uri, null, createRDFContainer(uri));
        object.getMetadata().add(NIE.plainTextContent,
                "This is the plain text content");
        util.addObject(object);
        writer.commit();
        writer.close();

        reader = IndexReader.open(dir);
        // Count via the reader; the writer has already been closed at this point.
        assertEquals(1, reader.numDocs());

        Query query = parser.parse("plain text");
        searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(query, null, 1);
        assertEquals(1, docs.totalHits);

        Document d = reader.document(docs.scoreDocs[0].doc);
        assertEquals("file:///home/antheque/file.txt", d.get("uri"));
        assertEquals("This is the plain text content", d.get("content"));
        object.dispose();
    }

    public void setUp() throws Exception {
        dir = new RAMDirectory();
        analyzer = new StandardAnalyzer();
        writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);
        util = new LuceneIndexUtil(writer);
        parser = new QueryParser("content", analyzer);
    }

    public void tearDown() throws Exception {
        if (searcher != null) { searcher.close(); }
        searcher = null;
        if (reader != null) { reader.close(); }
        reader = null;
        writer.close();
        writer = null;
        dir.close();
        dir = null;
    }

    private Directory dir;
    private IndexReader reader;
    private IndexWriter writer;
    private IndexSearcher searcher;
    private Analyzer analyzer;
    private QueryParser parser;
    private LuceneIndexUtil util;
}
--------- END THE TEST STUB -------
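
On the field-name question above, here is one possible starting point, offered only as a sketch for discussion. It shortens a property URI to "prefix.localName" when the namespace has been registered and falls back to the full URI otherwise. The class FieldNameMapper and its methods register and toFieldName are hypothetical names made up for this illustration; they are not part of Aperture or Lucene. The "." separator is also just an assumption, chosen because ":" already separates field and term in the Lucene query syntax.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Aperture or Lucene): maps RDF property
 * URIs to short field names of the form "prefix.localName".
 */
final class FieldNameMapper {

    // namespace URI -> short prefix, e.g. ".../nie#" -> "nie"
    private final Map<String, String> prefixes = new HashMap<String, String>();

    /** Registers a prefix for a namespace; returns this for chaining. */
    FieldNameMapper register(String namespace, String prefix) {
        prefixes.put(namespace, prefix);
        return this;
    }

    /** Returns "prefix.localName" for known namespaces, else the URI unchanged. */
    String toFieldName(String propertyUri) {
        // The namespace ends at the last '#' or '/', whichever comes later.
        int split = Math.max(propertyUri.lastIndexOf('#'),
                             propertyUri.lastIndexOf('/'));
        if (split < 0 || split == propertyUri.length() - 1) {
            return propertyUri; // no local name to split off
        }
        String prefix = prefixes.get(propertyUri.substring(0, split + 1));
        if (prefix == null) {
            return propertyUri; // unknown namespace: keep the full URI
        }
        return prefix + "." + propertyUri.substring(split + 1);
    }

    public static void main(String[] args) {
        String nie = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#";
        FieldNameMapper mapper = new FieldNameMapper().register(nie, "nie");
        System.out.println(mapper.toFieldName(nie + "plainTextContent"));
        System.out.println(mapper.toFieldName("http://example.org/unknown#prop"));
    }
}
```

With a mapper like this, addObject could name the full-text field via toFieldName(NIE.plainTextContent.toString()) instead of the hard-coded "content". Whether short names or full URIs work better in practice is exactly the kind of comment I'm hoping for.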