I need a way to get more detailed information from the search results, and it would also be great if search could output to html.
To that end, I wrote a short Java front-end program that calls Lucene's search (via Multivalent), then uses Multivalent's pdfinfo to extract each document's author and title tags and extracttext to extract the first 10 lines of text. The program writes all of this information to an HTML file that somewhat resembles Google's search results page.
So far, this output is far more useful than a single filename. However, my program has two drawbacks.
1) It is incredibly slow. This is probably because I have to call multivalent once to perform the actual search, then two more times per result (to extract the pdf info and the text). Even when I cut the pdfinfo step out of the loop, my computer still takes a substantial amount of time simply returning the results from extracttext (maybe between 5 and 15 seconds per pdf file).
2) I haven't yet found a clean way to get filenames that were not returned from the previous search because the number of results was maxed out. I can probably fix this by simply saving all of the pdf file names first, and then allowing the user to extract info from say, 10 files at a time, but it would add additional steps and user intervention to what I hoped would be one-command-line operation.
It seems to me that it might be possible to specify all of this (report more information, output to HTML, and show results x through y) when the initial search is conducted. I think this would at least begin to address the first problem, speed. However, I don't completely understand the relationships between Lucene, Multivalent, and the Multivalent tools, so it might not be feasible to do this programmatically.
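To illustrate the "results x through y" idea, here is a minimal sketch that assumes the full list of matching filenames has already been collected from one search. The class name `ResultPager` and the method `page` are hypothetical and are not part of Lucene or Multivalent.

```java
import java.util.List;

class ResultPager {
    /**
     * Returns results from..to (1-based, inclusive), clamped to the
     * bounds of the list. An out-of-range request yields an empty list.
     */
    static <T> List<T> page(List<T> all, int from, int to) {
        int start = Math.max(from, 1) - 1;   // convert to a 0-based start index
        int end = Math.min(to, all.size());  // exclusive end index for subList
        if (start >= end) {
            return List.of();
        }
        return all.subList(start, end);
    }

    public static void main(String[] args) {
        // Stand-in for the filenames returned by a single search.
        List<String> files = List.of("a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf");
        System.out.println(page(files, 2, 4)); // prints [b.pdf, c.pdf, d.pdf]
    }
}
```

With something like this, the expensive per-file extraction only has to run for the slice being displayed, and the range could be taken from command-line arguments so the whole thing stays a single command.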
Would it be possible to add this functionality to multivalent? Or, would it be possible for me to write a 'behavior' to do this? Thank you for responding.
I would be happy to share my code if someone else thinks it might be useful.
The tools are targeted toward command-line use. Calling main() is possible but not recommended. Programmatic interfaces have been requested and are in the pipeline.
For search, everybody wants a different custom display of results. You can drive Lucene yourself. Lucene can store arbitrary fields in addition to the indexed text, so at indexing time you could add source text, filename, title, and other metadata. Then you could generate HTML instantly with whatever data you'd like, without re-opening each PDF at search time.
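As a rough sketch of the HTML step, assume the filename, title, author, and a text snippet have already been read back from the stored fields of each hit. The example deliberately makes no Lucene calls, since the exact API depends on the Lucene version in use. `ResultPageWriter`, `Hit`, `escape`, and `render` are hypothetical names for illustration only.

```java
import java.util.List;

class ResultPageWriter {
    /** Hypothetical holder for the values stored alongside each indexed PDF. */
    static final class Hit {
        final String filename, title, author, snippet;

        Hit(String filename, String title, String author, String snippet) {
            this.filename = filename;
            this.title = title;
            this.author = author;
            this.snippet = snippet;
        }
    }

    /** Escapes the characters that are unsafe in HTML text and attribute values. */
    static String escape(String s) {
        if (s == null) {
            return "";
        }
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds a simple results page: one list item per hit. */
    static String render(String query, List<Hit> hits) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>Results for ")
            .append(escape(query))
            .append("</title></head><body>\n<ol>\n");
        for (Hit h : hits) {
            // Fall back to the filename when the PDF has no title tag.
            String title = (h.title == null || h.title.isEmpty()) ? h.filename : h.title;
            html.append("<li><a href=\"").append(escape(h.filename)).append("\">")
                .append(escape(title)).append("</a>");
            if (h.author != null && !h.author.isEmpty()) {
                html.append(" by ").append(escape(h.author));
            }
            html.append("<br>").append(escape(h.snippet)).append("</li>\n");
        }
        html.append("</ol>\n</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for stored field values.
        List<Hit> hits = List.of(
            new Hit("docs/report.pdf", "Annual Report", "J. Smith", "First lines of text..."),
            new Hit("docs/untitled.pdf", "", "", "Text from a PDF with no metadata"));
        System.out.print(render("report", hits));
    }
}
```

Because everything shown comes from values stored at indexing time, the page can be written without launching a separate tool for each result, which is where the per-file delay described above comes from.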