Search Improvements

By September 29th, 2010

We’ve made some changes to the new SourceForge Enrichment Center Query Response System over the last few days.

First and foremost, we restored mailing list search, which we rogue ninjas broke a few weeks ago. We had to rebuild the entire Lucene index, and full-text analysis of a decade's worth of open source projects’ mailing list messages took a while. Code monkeys are very chatty while their code compiles! 🙂 Then we re-synchronized our query analyzers to the new index, which should improve term-matching.

Speaking of term-matching, we added a more intelligent min-should-match parameter to our project search. So, searching with multiple query terms should have a stronger effect on the results – we will see fewer results, and they will be more relevant to all the query terms.
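
For the curious, here is a rough sketch of the idea behind a min-should-match rule. It is not our actual search configuration: the function name `min_should_match_filter`, the sample projects, and the whitespace tokenizing are all made up for illustration. A document is kept only if it contains at least a minimum number of the distinct query terms.

```python
def min_should_match_filter(query, docs, min_should_match):
    """Keep documents containing at least `min_should_match` distinct query terms.

    Hypothetical helper: real analyzers do stemming, stop words, and so on;
    this sketch only lowercases and splits on whitespace.
    """
    terms = set(query.lower().split())
    kept = []
    for name, text in docs.items():
        words = set(text.lower().split())
        # Count how many distinct query terms appear in this document.
        if len(terms & words) >= min_should_match:
            kept.append(name)
    return kept


# Made-up project descriptions, purely for illustration.
docs = {
    "alpha": "java image editor",
    "beta": "python image viewer",
    "gamma": "java build tool",
}

loose = min_should_match_filter("java image editor", docs, 1)
strict = min_should_match_filter("java image editor", docs, 2)
print(loose)   # every project shares at least one term with the query
print(strict)  # only projects matching two or more terms survive
```

Raising the threshold from 1 to 2 drops the projects that matched a single term, which is the "fewer results, more relevant to all the query terms" behavior described above.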

We’ve also refined the project ratings scoring. We changed the ratings field from a simple average to the lower bound of the Wilson score confidence interval for a Bernoulli parameter. It’s occasionally off by a millionth or two – what with free will and all – but we won’t see 100% single-rating projects scoring above 99% hundred-rating projects anymore. In fact, we won’t see any percentages – just the count of thumbs-up and thumbs-down ratings.
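
To make the ratings change concrete, here is a minimal sketch of the Wilson score lower bound in Python. The function name `wilson_lower_bound` is ours for this example, and the confidence level (95%, z = 1.96) is an assumption; the production scoring may use different values.

```python
import math


def wilson_lower_bound(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval for the thumbs-up fraction.

    `z` is the normal quantile for the chosen confidence level;
    1.96 (about 95%) is an assumed value, not necessarily the one in production.
    """
    n = ups + downs
    if n == 0:
        return 0.0  # no ratings yet, so no evidence either way
    p = ups / n  # observed fraction of thumbs-up
    denominator = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / denominator


single = wilson_lower_bound(1, 0)    # one rating, 100% positive
hundred = wilson_lower_bound(99, 1)  # one hundred ratings, 99% positive
print(round(single, 4), round(hundred, 4))
```

With these assumed settings, the single-rating project scores roughly 0.21 and the hundred-rating project roughly 0.95, so a lone thumbs-up no longer outranks a long, nearly spotless track record.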

Finally, we adjusted the project relevance algorithm to address issues we heard from users like, “… lots of abandoned projects now appearing near the top of the default search. If a project has nothing to download and has had no activity for years could it be made less ‘relevant’?” Yes, yes it can. We scaled down the term boost and refined the file and downloads boost functions so files and downloads will lend more to projects’ scores.

We like to hear your feedback – here on this post, by email, or by IRC. Thank you for helping us help you help us all.
