[Google-hack-developers] Prath Raveendranathan Thesis Defense


COMPUTER SCIENCE COLLOQUIUM

Identifying Sets of Related Words on the World Wide Web

PRATHEEPAN RAVEENDRANATHAN
Computer Science Graduate Student

Thursday, June 9, 2005
1:00 p.m.

HELLER HALL 306

ABSTRACT

As the Internet keeps growing, the number of Web pages indexed by
commercial search engines such as Google increases rapidly. Currently,
Google reports that it indexes over 8 billion Web pages. The information
available through the Web is very diverse, ranging from publications to
electronic encyclopedias to information about products. In short, the
Web is vast. Until recently, the Web had not been used to acquire
information about words in order to better understand Natural Language.
However, we believe that there is a need to develop methods that take
advantage of the huge amount of information on the Web. Hence, this thesis
focuses on finding sets of related words by using the World Wide Web.

This thesis presents three new methods for using Web search results to
find sets of related words. We rely on the Google API to obtain search
engine results, but in principle these methods can be used with any search
engine. The methods rely on pattern matching techniques in addition to
various measures of relatedness that we have developed.

In addition to finding sets of related words, we also explore the problem
of Sentiment Classification. This was motivated by a desire to find a
practical application for the sets of related words we discover. To that
end, we extend the Pointwise Mutual Information - Information Retrieval
(PMI-IR) measure described in (Turney, 2002) so that it can be used with
Google to discover sets of related words. These sets are then used as
seeds in our Sentiment Classification algorithm.