So it would be nice to add a "search similar" feature
to look for docs
similar to a particular result.
How do you do this?
I've seen two approaches in the literature. Both are
based on matrix algebra!
1) Take the list of word frequencies in the document.
Lots of them will be
zeros for words that aren't in the doc.
Take this vector and compare it to other document
vectors. To cut down on
the search, it's probably best to pick a few words that
appear in the document and use those to pick out potential
matches from the DB.
The comparison is pretty simple: take the dot product
of the two vectors.
So the answer is a number that's very high if they
share every word...
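
Here is a rough Python sketch of approach 1, just to make the idea
concrete. The function names, the whitespace tokenizer, and the choice of
the document's most frequent words as the "few words" used for weeding are
all my own assumptions, not something taken from the literature; a real
version would probably skip very common words and pull candidates from the
DB rather than from a dict.

```python
from collections import Counter

def word_vector(text):
    """Sparse word-frequency vector: only words that occur are stored."""
    return Counter(text.lower().split())

def dot(a, b):
    """Dot product of two sparse vectors; absent words count as zero."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(count * b[word] for word, count in a.items() if word in b)

def similar_docs(query_text, docs, num_key_words=3):
    """Rank docs (a dict of name -> text) by dot product with query_text.

    Only docs containing at least one of the query's key words are scored.
    Assumption: "key words" are simply the query's most frequent words.
    """
    query = word_vector(query_text)
    key_words = [word for word, _ in query.most_common(num_key_words)]
    scores = {}
    for name, text in docs.items():
        vec = word_vector(text)
        if any(word in vec for word in key_words):
            scores[name] = dot(query, vec)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the vectors are stored sparsely, the zeros for words that aren't
in a doc never have to be written down or multiplied.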
2) Take the list of pages that link to this document.
Compare this vector to other document vectors. Since we
can assemble these
vectors quickly, we probably don't need to "weed" as
much.
Again, take the dot product of the two vectors. Pages
that are linked from
the same places will score highly.
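
A rough sketch of approach 2 in Python, assuming the links are available
as (source, target) pairs. The names backlink_index and shared_link_scores
are hypothetical. Since each backlink vector only holds 0s and 1s, the dot
product of two of them is just the number of linking pages they have in
common, so a set intersection does the job.

```python
def backlink_index(links):
    """Invert (source, target) link pairs into target -> set of sources."""
    index = {}
    for source, target in links:
        index.setdefault(target, set()).add(source)
    return index

def shared_link_scores(page, index):
    """Score every other page by how many linking pages it shares with page.

    The size of the set intersection equals the dot product of the two
    0/1 backlink vectors.
    """
    mine = index.get(page, set())
    scores = {}
    for other, sources in index.items():
        if other == page:
            continue
        shared = len(mine & sources)
        if shared:
            scores[other] = shared
    # Highest score first; ties broken by page name for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, if pages "a" and "b" are both linked from the same two hubs,
shared_link_scores("a", index) puts "b" at the top with a score of 2.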
The first seems to make more sense than the second.
After all, documents on
the same subjects should share common words. But the
second may be more
accurate, and is easier to compute (we can easily keep
a backlink list).
After all, if pages are linked from the same places, they're probably
pretty related. Of course this will require some side-