Re: [Hebmorph-thinktank] hebmorph searching the bible

Hi,

On 10/06/2011 15:55, Efraim Feinstein wrote:

> I'll dig up some unfiltered examples over the weekend. As I said, the 
> interface I'm using is eXist, so I'm not sure exactly where the 
> extraneous results are coming from.
>
> What would be useful to help debug it?
Sorry, but I'm not familiar with eXist. Perhaps try looking at the score 
of each hit, which can give you an idea of how far off the scoring is, 
or at Lucene's Explanation objects. Usually common sense and an 
understanding of the lower-level workings are the best debugger...

Common culprits are the tf/idf algorithm, which takes document length 
into account so that short documents score higher (not what you want 
for Tanach), and lemma density (similar words with different meanings, 
and ambiguity).
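
To make the document-length effect concrete, here is a rough sketch in 
Python of a single-term score in the style of Lucene's classic tf/idf 
similarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), 
length norm = 1 / sqrt(number of terms)). It leaves out query 
normalization, boosts and coord, and the function name and the sample 
numbers are made up for illustration, so treat it as an approximation 
and not as what Lucene or HebMorph actually runs.

```python
import math


def classic_term_score(term_freq, doc_len, num_docs, doc_freq,
                       use_length_norm=True):
    """Simplified single-term tf/idf score (hypothetical helper).

    Omits query normalization, boosts and coord on purpose.
    """
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    # Length norm shrinks the score as the document gets longer.
    length_norm = 1.0 / math.sqrt(doc_len) if use_length_norm else 1.0
    return tf * idf * length_norm


# Illustrative numbers only: a 5-term verse containing the term once,
# and a 500-term chapter containing it three times.
short_verse = classic_term_score(term_freq=1, doc_len=5,
                                 num_docs=1000, doc_freq=50)
long_chapter = classic_term_score(term_freq=3, doc_len=500,
                                  num_docs=1000, doc_freq=50)

# The same two documents with the length norm switched off.
short_no_norm = classic_term_score(1, 5, 1000, 50, use_length_norm=False)
long_no_norm = classic_term_score(3, 500, 1000, 50, use_length_norm=False)

print(short_verse > long_chapter)    # length norm favors the short verse
print(long_no_norm > short_no_norm)  # without it, term frequency decides
```

With the length norm on, the short verse outranks the long chapter even 
though the chapter has more matches; with it off, the ranking flips.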

There are some configuration options that fine-tune HebMorph searches, 
and I'll be blogging about those next week.

> I was actually pleasantly surprised at how *well* it works with 
> Biblical Hebrew, considering that it is based on modern Hebrew 
> spelling and grammar. It certainly does much better than any other 
> analyzers. What would it take to add to the dictionary? Although I 
> don't have lots of time to work on this, we do have a reasonably 
> complete public domain biblical dictionary (that is, word list + parts 
> of speech). It wouldn't help with the unique biblical grammatical 
> forms, non-Academia spelling, or Aramaic, but it could get us a bit 
> farther along.
Actually, I recently added simple external dictionary support (to the 
.NET version), and it proved to be very useful for one of our users. It 
didn't use POS, though. I'll be blogging about that as well, and I 
suggest we continue this discussion after I have that published.

Itamar.