[Htmlparser-user] Web Crawler Thesis Project Using HTML Parser To collect links
From: Neftali P. <pap...@ya...> - 2009-08-21 17:42:44
Hi everyone,

I am Neftali Papelleras, an Engineering student at the University of San Carlos, Cebu City, Philippines. I am currently working on my thesis project, which involves web crawling. The project is titled "A Web Extraction Tool to Monitor Websites" and is implemented in Java. I am in the first month of this one-year project and still at the information-gathering stage.

The first question I need to answer is how to create a Java-based web crawler. The second is how to retrieve the contents of each web page. The last is how to retrieve the links from a given page's source.

My first idea was to use Java regular expressions to extract the links from the page source, but I now understand that this is not the right way to do it. That is why I came to HTML Parser, which I believe is the right tool for this.

I know Java, but not at an advanced level; I mostly know the concepts. I have written several programs already (the most recent was a chat system), but I am still not confident in my Java skills. I am very eager to learn, though, and I am starting again now.

I have downloaded version 1.6 of HTML Parser and browsed through its folders and files. I tried to write a very simple parser program using the HTML Parser API, but I was confused about where and how to start.

I am hoping that someone on this list can provide a simple program that illustrates how to retrieve the links from a web page's source (HTML text). I could then follow the program and work my way toward understanding how to use the API.

Looking forward to a good response.

Respectfully,
neftali
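
For reference, the last task described above (collecting the links in a piece of HTML text with a real parser instead of regular expressions) can be sketched roughly as follows. Note that this is NOT the HTML Parser 1.6 API: it is a minimal sketch that assumes only the HTML parser bundled with the JDK (javax.swing.text.html), so that it runs without extra jars. The class name LinkExtractor and the method extractLinks are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; uses the JDK's built-in HTML parser,
// not the HTML Parser library discussed on this list.
public class LinkExtractor {

    /** Returns the href value of every <a> start tag found in the given HTML text. */
    public static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<String>();

        // The parser calls back into this object for each start tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        // The third argument tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<a href=\"/b\">B</a>"
                + "</body></html>";
        // Print each extracted link on its own line.
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The links come back exactly as written in the page, so relative ones such as /b would still need to be resolved against the page's URL (for example with java.net.URL) before a crawler could follow them.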