[Htmlparser-user] Web Crawler Thesis Project Using HTML Parser To collect links

Hi everyone.

I am Neftali Papelleras, an Engineering student at the University of San Carlos, Cebu City, Philippines. I am currently working on my thesis project, which involves web crawling. The project is titled "A Web Extraction Tool to Monitor Websites" and is implemented in Java. I am still in the first month of this one-year thesis project, and still at the information-gathering stage.

The first question I need to answer is how to create a Java-based web crawler. The next is how to retrieve the web content of every page. The last is how to retrieve the links from a given web source. The first thing that came to mind was to use Java regular expressions to pull the links out of the page source, but I now understand that this is not the right way to do it. That is why I came to HTML Parser, because I understand that parsing the HTML is the right approach.

I know Java, but not at an advanced level; I mostly know the concepts. Though I have already written several programs, the latest being a chat system, I am still not confident in my Java skills. But I am very eager to learn, and I am starting again now.

I have already downloaded version 1.6 of HTML Parser and browsed through its folders and files. I attempted to create a very simple parser program using the HTML Parser API, but unfortunately I was confused about where and how to start. I am hoping that this organization can provide a simple program that illustrates how to retrieve the links from a web page's source (HTML text). I could then follow the program through, and that would eventually lead me to an understanding of how to use this API.
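
For illustration, the kind of program being asked for might look like the sketch below. It is only a sketch under stated assumptions: it does not use the HTML Parser API at all, but the HTML parser that ships with the JDK (`ParserDelegator` with an `HTMLEditorKit.ParserCallback`), so that it runs without any extra jar. The class name `LinkLister`, the method name `extractLinks`, and the sample HTML are hypothetical. It shows the general idea of letting a parser report each `<a>` tag instead of matching the text with regular expressions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class: collects the href values of <a> tags in HTML text.
// Assumption: uses the JDK's built-in Swing HTML parser, not the HTML Parser library.
class LinkLister {

    // Returns the href attribute of every <a> tag, in document order.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser calls back once for every start tag it recognizes.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail, but the API declares IOException.
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample page source.
        String html = "<html><body>"
                + "<a href=\"http://example.com/a\">First</a>"
                + "<p>Some text</p>"
                + "<a href=\"/b\">Second</a>"
                + "</body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Relative links such as `/b` come back exactly as written in the page, so a crawler would still need to resolve them against the page's own URL (for example with `java.net.URI.resolve`) before following them.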

Looking forward to a good response from this organization.

Respectfully,
neftali
