From: Shad L. <sh...@sh...> - 2013-06-30 06:47:40
Hey everyone,

In the process of beefing up G3.1's rest module, I added a search REST resource. Along the way, I started looking more closely at the search module and had some thoughts. Notably, I'd like to change how Search::add_query_terms() works.

Summary: In 3.0.x, we use Search::add_query_terms() to add wildcarded terms to the search. Example: a search of "foo bars" becomes "foo bars foo* bar*" after Search::add_query_terms(). I believe the philosophy here is this: "search for a more generic term, but put exact matches first." In general, I like this philosophy, but I'm not entirely sure we've implemented it the best way. So, I have a new proposal that I think is both simpler and more predictable.

--

First thought: search was written with a bunch of hand-coded MySQL queries, which seems less than ideal. In particular, this recently led to a bug because we didn't count whitespace carefully enough. To be fair, that's partly because it's kinda hard to do when all the segments are hand-coded. Since Gallery sits atop Kohana's stack of OO tools that do the dirty work for us, I refactored it to use them. Voilà - no more counting whitespace :-).

This also gave me a chance to understand *why* we build the "fulltext" query the way we do. Maybe it's just my naïveté with MySQL, but it took me quite a while to understand why we used "IN BOOLEAN MODE" for one part of the query but not the other. Finally, I got it:
- "boolean" mode allows special operators (+, -, *, ...), so it's best for *finding* which items match the search. But, it doesn't give us a useful score.
- "natural language" mode gives us a useful score, so it's best for *ordering* the found items.

This approach makes good sense to me, and I took the liberty of adding comments to Search::_build_query_base() so the next newbie doesn't have to pore over Oracle docs like I did to figure this out :-).

So, onto Search::add_query_terms().
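For readers who want to see the "find with boolean mode, order with natural language mode" idea concretely, here is a minimal sketch of the query shape. It is Python rather than Gallery's PHP, and it is not the actual Search::_build_query_base() code: the function name, the table and column names (items, search_records, data, item_id), and the %s placeholder style are all assumptions made for illustration.

```python
def build_search_query(natural_terms, boolean_terms):
    """Return (sql, params) for a two-mode MySQL fulltext search.

    Hypothetical helper: the table and column names are assumed for
    illustration and may not match Gallery's real schema.
    """
    sql = (
        "SELECT items.*, "
        # Natural language mode: yields a relevance score used for ordering.
        "MATCH(search_records.data) AGAINST (%s) AS score "
        "FROM items "
        "JOIN search_records ON items.id = search_records.item_id "
        # Boolean mode: honors operators such as +, - and *, used for finding.
        "WHERE MATCH(search_records.data) AGAINST (%s IN BOOLEAN MODE) "
        "ORDER BY score DESC"
    )
    # Parameters stay separate from the SQL text so the driver can quote them.
    return sql, (natural_terms, boolean_terms)


sql, params = build_search_query("foo bars", "foo* bar*")
```

The same search text is passed twice in different forms: the plain terms feed the scoring expression, while the wildcarded terms feed the WHERE clause.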
Here's how 3.0.x does it:
- user search box - foo bars
- natural language - foo bars foo* bar*
- boolean mode - foo bars foo* bar*

Notes:
- it makes no sense to send wildcards to a natural language query as they're ignored. The result is a query of "foo bars foo bar", which oddly doubles foo.
- it makes no sense to send "foo foo*" to a boolean query, as foo* already matches everything foo does.

Another example for 3.0.x, which illustrates how plurals are handled:
- user search box - entry alumnus
- natural language - entry alumnus entry* alumnu*
- boolean mode - entry alumnus entry* alumnu*

A more nuanced example for 3.0.x, which illustrates how special operators aren't considered (note: Search::add_query_terms() imposes a 5 term limit):
- user search box - +(required terms) foo bars
- natural language - +(required terms) foo bars required*
- boolean mode - +(required terms) foo bars required*

Another example for 3.0.x, which illustrates how quotes make adding extra terms a pain:
- user search box - "exact match only"
- natural language - "exact match only" match* only"*
- boolean mode - "exact match only" match* only"*

--

Here's my proposal:
- natural language - exact same as user search box. MySQL will automatically ignore special operators.
- boolean mode - add wildcards to existing terms, add no *new* terms. Also, use Inflector::singular() and Inflector::plural() to figure out wildcard placement.

Example:
- user search box - foo bars entry alumnus +required "exact phrase"
- natural language - foo bars entry alumnus +required "exact phrase"
  [MySQL sees "foo bars entry alumnus required exact phrase"]
- boolean mode - foo* bar* entr* alumn* +required* "exact phrase"

The result keeps the same philosophy: search for a more generic term (boolean mode), but put exact matches first (natural language mode).

Notes:
- Simpler logic removes the need to figure out how to extend/limit extra terms, be smart with parentheses and quotes while extending, etc.
- Use of the Inflector class lets us be more clever with irregular plurals.
- This still isn't i18n-savvy, but I suspect that trying to be comprehensively i18n-savvy is a deep rabbit hole.
- To at least be clear about our lack of i18n, maybe we should add an admin option for wildcarding modes. (Prefix mode: none, add wildcard; suffix mode: none, add wildcard, add smart wildcard (English only)).
- If we go to the trouble of adding an admin screen, maybe it makes sense to fold in my short_search_fix module (enables search terms of <4 characters on shared hosting installations).

Thoughts?

Take care,
Shad
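A rough sketch of the proposed boolean-mode rewriting, in Python rather than Gallery's PHP. Everything here is an assumption for illustration: the function names are hypothetical, the tiny singular()/plural() helpers are naive English-only stand-ins for Kohana's Inflector, and "wildcard placement" is interpreted as the longest common prefix of a word's singular and plural forms, which is one reading that reproduces the example above.

```python
import os
import re

# Toy stand-in for Inflector's irregular-word list (singular -> plural).
IRREGULAR = {"alumnus": "alumni"}


def singular(word):
    """Naive English singularizer (illustrative only)."""
    for single, many in IRREGULAR.items():
        if word == many:
            return single
    if word in IRREGULAR:
        return word
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


def plural(word):
    """Naive English pluralizer (illustrative only)."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def wildcard_stem(word):
    """Longest prefix shared by the singular and plural forms of word."""
    single = singular(word)
    return os.path.commonprefix([single, plural(single)])


def to_boolean_terms(query):
    """Add a trailing wildcard to each existing term; add no new terms."""
    out = []
    # A token is either a double-quoted phrase or a run of non-whitespace.
    for token in re.findall(r'"[^"]*"|\S+', query):
        if token.startswith('"'):
            out.append(token)  # quoted phrases pass through untouched
            continue
        m = re.fullmatch(r'([+\-~<>(]*)(\w+)(\)*)', token)
        if m is None:
            out.append(token)  # anything unusual is left as typed
            continue
        lead, word, tail = m.groups()
        out.append(lead + wildcard_stem(word) + "*" + tail)
    return " ".join(out)


result = to_boolean_terms('foo bars entry alumnus +required "exact phrase"')
# result == 'foo* bar* entr* alumn* +required* "exact phrase"'
```

The natural language side needs no counterpart, since under the proposal it receives the user's text unchanged.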