From: ted p. <tpederse@d.umn.edu> - 2005-06-06 13:02:36
COMPUTER SCIENCE COLLOQUIUM

Identifying Sets of Related Words on the World Wide Web

PRATHEEPAN RAVEENDRANATHAN
Computer Science Graduate Student

Thursday, June 9, 2005
1:00 p.m.
HELLER HALL 306

ABSTRACT

As the Internet keeps growing, the number of Web pages indexed by commercial search engines such as Google increases rapidly. Currently, Google reports that it indexes over 8 billion Web pages. The information available through the Web is very diverse, ranging from publications to electronic encyclopedias to information about products. In short, the Web is vast. Until recently, the Web had not been used to acquire information about words in order to better understand Natural Language. However, we believe that there is a need to develop methods that take advantage of the huge amount of information on the Web. Hence, this thesis focuses on finding sets of related words by using the World Wide Web.

This thesis presents three new methods for using Web search results to find sets of related words. We rely on the Google API to obtain search engine results, but in principle these methods can be used with any search engine. The methods rely on pattern matching techniques in addition to various measures of relatedness that we have developed.

In addition to finding sets of related words, we also explore the problem of Sentiment Classification. This was motivated by a desire to find a practical application for the sets of related words we discover. To that end, we extend the Pointwise Mutual Information - Information Retrieval (PMI-IR) measure described in (Turney, 2002) to be used with Google in order to discover sets of related words. These sets are then used as seeds in our Sentiment Classification algorithm.
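For readers unfamiliar with PMI-IR, the following is a minimal sketch of how a PMI score can be estimated from search engine hit counts, and how Turney (2002) combines two such scores into a semantic orientation value. It is not the thesis's extended measure. The function names, the smoothing default, and every count in the usage note are assumptions made for illustration, and no search engine is actually queried.

```python
import math


def pmi_from_hits(hits_both, hits_w1, hits_w2, total_pages):
    """Estimate PMI(w1, w2) = log2(p(w1, w2) / (p(w1) * p(w2))) from hit counts.

    Each probability is approximated as hits / total_pages, so the ratio
    reduces to (hits_both * total_pages) / (hits_w1 * hits_w2).
    """
    if min(hits_both, hits_w1, hits_w2, total_pages) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((hits_both * total_pages) / (hits_w1 * hits_w2))


def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg, smoothing=0.01):
    """Orientation of a phrase as PMI(phrase, pos) - PMI(phrase, neg).

    near_pos / near_neg: hits for the phrase co-occurring with the positive
    or negative seed word. hits_pos / hits_neg: hits for each seed word alone.
    The phrase's own hit count and the index size cancel in the difference.
    The smoothing term (an assumed default) avoids division by zero.
    """
    numerator = (near_pos + smoothing) * hits_neg
    denominator = (near_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)
```

With made-up counts, pmi_from_hits(100, 1000, 1000, 10000) gives 0.0, meaning the two words co-occur exactly as often as chance would predict. A positive value from semantic_orientation suggests the phrase leans toward the positive seed word, and a negative value toward the negative one.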