So, to review:
The main problem with Extended DisMax (eDismax) is that it doesn’t properly apply the default operator when a query contains a NOT or - clause.
So if you enter either of these:
apples oranges -bananas
apples oranges NOT bananas
you would expect your search results to be the same as those for:
apples AND oranges NOT bananas
but instead you get:
apples OR oranges NOT bananas
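To make the difference concrete, here is a toy sketch in Python. The four records and the helper names are invented for illustration, and it models only the Boolean reading of the two queries, not Solr's actual matching or scoring.

```python
# Toy records (hypothetical): each maps an ID to the set of terms it contains.
DOCS = {
    "doc1": {"apples", "oranges"},
    "doc2": {"apples"},
    "doc3": {"apples", "oranges", "bananas"},
    "doc4": {"oranges", "bananas"},
}


def match_and(terms, excluded):
    """IDs of records containing every term and none of the excluded terms."""
    return sorted(
        name for name, words in DOCS.items()
        if terms <= words and not (excluded & words)
    )


def match_or(terms, excluded):
    """IDs of records containing at least one term and none of the excluded terms."""
    return sorted(
        name for name, words in DOCS.items()
        if (terms & words) and not (excluded & words)
    )


# "apples AND oranges NOT bananas": what the user expects.
expected = match_and({"apples", "oranges"}, {"bananas"})

# "apples OR oranges NOT bananas": what comes back instead.
actual = match_or({"apples", "oranges"}, {"bananas"})

print(expected)
print(actual)
```

In this toy data the OR reading also returns doc2, which mentions apples but not oranges. The excluded term is honoured in both cases; the problem is the extra, looser matches.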
On today’s dev call, we discussed the possibility of detecting the - or NOT operators, then falling back to the old Lucene code to get around this limitation. Alas, the plot thickens.
First of all, falling back to the old code would be a mistake, because it does not currently handle NOT properly. Not surprising, because it’s not real DisMax. It creates a whole bunch of queries and ORs them together, so you will frequently get results back that include the term you are attempting to exclude. There is no easy fix for this, aside from writing our own DisMax query generator in PHP, which would be an exercise in madness.
Another interesting discovery is that the basic DisMax handler does process the - operator appropriately, so while a current instance of VuFind will break with “apples oranges NOT bananas”, it will yield correct results for “apples oranges -bananas”. So this is definitely a regression if we move to eDismax. Maybe not a significant one, since library users are much more likely to use the broken NOT syntax than the working - syntax.
This all leaves me even more uncertain about the best road forward – switching to eDismax breaks something that is already broken, just in a different way. If the Solr team fixes the underlying problem that is causing this behavior, then we’ll be in great shape. In the meantime, it seems we have these options:
1.) Stick with the status quo, but add the option to turn on eDismax if desired
2.) Switch to eDismax, on the assumption that the benefits outweigh the drawbacks
3.) Write some sort of crude query parser to insert AND operators into queries containing NOT or -. We can probably make the most common cases work fairly easily, but doing it correctly would require a lot of effort, and that may be a waste of time given that this is a workaround for a bug and not something that we need in an ideal world.
4.) Write code to use the regular DisMax handler instead of eDismax for queries containing the - operator and no other operators. This will give optimal functionality for a small number of edge cases. Not worthwhile in my opinion, but maybe worth mentioning.
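For a sense of what option 3 might involve, here is a minimal sketch of the "crude" version, written in Python for brevity rather than in VuFind's PHP. The function name insert_default_and is hypothetical, and the sketch assumes the default operator is AND. It only handles bare terms and double-quoted phrases. Parentheses, field prefixes, boosts, and anything else in the query syntax are not handled, which is exactly the part that would take real effort.

```python
import re

# Hypothetical helper for illustration; not part of VuFind or Solr.
OPERATORS = {"AND", "OR", "NOT"}


def insert_default_and(query):
    """Insert explicit ANDs between adjacent terms, but only when the
    query contains a NOT operator or a '-' prefixed term."""
    # Split on whitespace, keeping double-quoted phrases (optionally
    # prefixed with + or -) together as single tokens.
    tokens = re.findall(r'[+-]?"[^"]*"|\S+', query)

    def is_negated(token):
        return token.startswith("-") and len(token) > 1

    if not any(t == "NOT" or is_negated(t) for t in tokens):
        return query  # nothing to work around; leave the query alone

    out = []
    for token in tokens:
        previous = out[-1] if out else None
        if (
            previous is not None
            and previous not in OPERATORS
            and token not in OPERATORS
            and not is_negated(token)
        ):
            # Two plain terms side by side: make the default AND explicit.
            out.append("AND")
        out.append(token)
    return " ".join(out)


print(insert_default_and("apples oranges -bananas"))
print(insert_default_and("apples oranges NOT bananas"))
```

Queries that already spell out their operators, such as "apples OR oranges -bananas", pass through unchanged, as do queries with no negation at all.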
I’d really like to get this wrapped up, but the best option is not obvious.