mdn-backend Mailing List for Media Distribution Network


Here is one idea for doing search. It's a very simple, keyword-based
algorithm that ranks first by number of distinct entries found, then by
contiguous hits in search order, then by hits. This does plaintext searching
only, no HTML fanciness.

I've also attached a sample dataset. See what you guys think.

search.php is here:
http://students.washington.edu/areusch/search.txt
(rename it to .php; it is posted as .txt because the web server would
otherwise run the PHP instead of serving the source)
example-dataset.sql is here:
http://students.washington.edu/areusch/example-dataset.sql

The algorithm relies on PHP's string sorting functions because PHP has no
built-in Comparable interface; rather than write one, I settled for a less
efficient algorithm for the time being. On my machine (1.6GHz processor, 1GB
RAM), the add function took about 5 minutes to add the dataset I'm sending,
which is about 10000 entries. Running
time php search.php search red hot chili peppers
gives 217 results and the following time output:
real    0m0.136s
user    0m0.060s
sys     0m0.008s

This is awfully slow for our application, but I believe we can gain a lot by
writing a system that doesn't do string sorts. The lines that do the string
sorts are marked in the code.

A basic description of the algorithm:

Add assigns the input file a new document id, tokenizes it (here we could
apply a stemming lib) and, for each token, appends a reference string to the
reverse index for that token like <docID>:<tokenOrdinal>SPACE where
tokenOrdinal is the number of tokens from the start of the file. So for the
following input with ID #1030:
A quick dog barked at the other dog.

The reverse index looks like:
A 1030:0
quick 1030:1
dog 1030:2 1030:7
barked 1030:3
at 1030:4
the 1030:5
other 1030:6
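
The add step above can be sketched as follows. This is a minimal Python
illustration rather than the PHP in search.php; the names tokenize and
add_document, and the rule of stripping surrounding punctuation, are
assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (assumed rule)."""
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"'()")
        if word:
            tokens.append(word)
    return tokens


def add_document(index, doc_id, text):
    """Append a '<docID>:<tokenOrdinal>' reference for every token."""
    for ordinal, token in enumerate(tokenize(text)):
        index[token].append(f"{doc_id}:{ordinal}")


# Hypothetical in-memory index; the real code stores this in MySQL.
index = defaultdict(list)
add_document(index, 1030, "A quick dog barked at the other dog.")

for token, refs in index.items():
    print(token, " ".join(refs))
```

A stemming library, if we add one, would slot into tokenize before the
token is returned.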

Search retrieves the reverse index entry for each search term and appends
the string :<query token ordinal> to each reference string. It then sorts
them numerically (looking at the full numbers contiguously, which is the
ultra-slow part), so that if you searched for words that are right next to
each other, their reference strings appear right next to each other. Then
the algorithm enumerates the sorted list and does some stats, which are then
stored in the closest thing to a priority queue that exists in PHP (another
array which I later sort by the same method).
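
Here is a rough Python sketch of the search step, not a port of search.php.
It swaps the string sort for a sort on integer tuples, and it assumes that
a "contiguous hit" means a hit whose document position and query position
both follow the previous hit by exactly one. The function name and the
tiny index are made up for the example.

```python
def search(index, query_tokens):
    """Return (doc_id, distinct_terms, contiguous, hits), best match first."""
    # Collect (doc_id, token_ordinal, query_ordinal) for every matching term.
    hits = []
    for query_ordinal, term in enumerate(query_tokens):
        for ref in index.get(term, []):
            doc_id, token_ordinal = ref.split(":")
            hits.append((int(doc_id), int(token_ordinal), query_ordinal))

    # Sorting groups hits by document, then by position in the document.
    hits.sort()

    stats = {}
    prev = None
    for doc_id, token_ordinal, query_ordinal in hits:
        entry = stats.setdefault(
            doc_id, {"terms": set(), "contiguous": 0, "hits": 0}
        )
        entry["terms"].add(query_ordinal)
        entry["hits"] += 1
        # Assumed definition of contiguity: adjacent in both text and query.
        if prev == (doc_id, token_ordinal - 1, query_ordinal - 1):
            entry["contiguous"] += 1
        prev = (doc_id, token_ordinal, query_ordinal)

    # Rank by distinct terms found, then contiguous hits, then total hits.
    ranked = sorted(
        stats.items(),
        key=lambda kv: (len(kv[1]["terms"]), kv[1]["contiguous"], kv[1]["hits"]),
        reverse=True,
    )
    return [
        (doc_id, len(s["terms"]), s["contiguous"], s["hits"])
        for doc_id, s in ranked
    ]


# Hypothetical index: document 1 contains the whole phrase in order.
index = {
    "red": ["1:0", "2:3"],
    "hot": ["1:1", "2:0"],
    "chili": ["1:2"],
    "peppers": ["1:3"],
}
results = search(index, ["red", "hot", "chili", "peppers"])
print(results)
```

The final sorted() call plays the role of the priority queue; a heap would
do the same job if we only want the top few results.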

Comments? We can certainly speed up the sorting using some type of bucketing
system with Comparable classes.
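
To illustrate what bucketing with comparable objects might look like, here
is a small Python sketch. The Ref class and bucket_by_document are
hypothetical names; a dataclass with order=True stands in for a Comparable
interface, and one bucket per document ID is only one possible bucketing
scheme.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class Ref:
    """A reference that compares field by field, as integers."""
    doc_id: int
    token_ordinal: int
    query_ordinal: int


def parse_ref(ref, query_ordinal):
    """Turn '<docID>:<tokenOrdinal>' into a comparable Ref."""
    doc_id, token_ordinal = ref.split(":")
    return Ref(int(doc_id), int(token_ordinal), query_ordinal)


def bucket_by_document(refs):
    """Group refs by document, then sort each small bucket on its own."""
    buckets = defaultdict(list)
    for ref in refs:
        buckets[ref.doc_id].append(ref)
    for bucket in buckets.values():
        bucket.sort()
    return buckets


raw = ["1030:10", "1030:2", "7:5"]

# A plain string sort puts 1030:10 before 1030:2.
print(sorted(raw))

buckets = bucket_by_document([parse_ref(r, 0) for r in raw])
# Within the bucket, ordinal 2 now comes before ordinal 10.
print([r.token_ordinal for r in buckets[1030]])
```

Comparing integers also avoids the padding tricks needed to make strings
such as 1030:2 and 1030:10 sort in numeric order.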

To use the sample code, get command-line PHP and MySQL, create a SQL database
and a user for it (or just use root as I did), then set up the top of
search.php accordingly. Run php search.php setup to create the tables and
you're good to go. Or, instead of running search.php setup, you can import
the dataset by feeding it to mysql (it's just SQL commands). The dataset is
a set of 10000 songs by title/artist/album.

Hopefully the list doesn't scrub attachments....oops it did, see above for
links.

Andrew

