Re: [Htmlparser-user] Htmlparser-user Digest, Vol 35, Issue 4


Good Day!

I just woke up; it is 8:30 in the morning. I'm very glad to have already received a reply from this organization with very helpful information. I will look at it later this morning, as I have a seminar to attend at the university.

Thank you very much! I really appreciate this help :)

I will check in here from time to time if I get stuck on a problem regarding the topic.

Respectfully,
neftali

________________________________
From: "htm...@li..." <htm...@li...>
To: htm...@li...
Sent: Saturday, August 22, 2009 4:56:24 AM
Subject: Htmlparser-user Digest, Vol 35, Issue 4

Send Htmlparser-user mailing list submissions to
    htm...@li...

To subscribe or unsubscribe via the World Wide Web, visit
    https://lists.sourceforge.net/lists/listinfo/htmlparser-user
or, via email, send a message with subject or body 'help' to
    htm...@li...urceforge..net

You can reach the person managing the list at
    htm...@li...

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Htmlparser-user digest...."

Today's Topics:

   1. Need Suggestions to get Started in HTML parsing (tamizh vendan)
   2. Re: Need Suggestions to get Started in HTML parsing
      (Derrick Oswald)
   3. Web Crawler Thesis Project Using HTML Parser To collect links
      (Neftali Papelleras)
   4. Web Crawler Thesis Project Using HTML Parser To collect links
      (Neftali Papelleras)
   5. Re: Web Crawler Thesis Project Using HTML Parser To collect
      links (Derrick Oswald)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 Aug 2009 20:42:04 +0530
From: tamizh vendan <tam...@gm...>
Subject: [Htmlparser-user] Need Suggestions to get Started in HTML
    parsing
To: htm...@li...
Message-ID:
    <b98...@ma...>
Content-Type: text/plain; charset="iso-8859-1"

I am a newbie to HTML parsing. I know both Java and HTML well. I would like to
construct a DOM tree from the HTML code of a web page. It would be helpful
if someone could explain how to get started and kindly provide some
tutorial or article links. Please provide sample programs if possible. Thanks in
advance.
-------------- next part --------------
An HTML attachment was scrubbed...

------------------------------

Message: 2
Date: Wed, 19 Aug 2009 19:18:39 +0200
From: Derrick Oswald <der...@gm...>
Subject: Re: [Htmlparser-user] Need Suggestions to get Started in HTML
    parsing
To: htmlparser user list <htm...@li...>
Message-ID:
    <16a...@ma...>
Content-Type: text/plain; charset="iso-8859-1"

Have a look at the mainline in Parser.java:
http://htmlparser.svn.sourceforge.net/viewvc/htmlparser/trunk/parser/src/main/java/org/htmlparser/Parser.java?revision=8&view=markup

That program prints the result, but the return value of parser.parse(filter) is a
NodeList, which is your (nested) DOM tree.

Also look for other main methods in the code.

On Wed, Aug 19, 2009 at 5:12 PM, tamizh vendan <tam...@gm...> wrote:

>
> I am a newbie to HTML parsing. I know both Java and HTML well. I would like
> to construct a DOM tree from the HTML code of a web page. It would be
> helpful if someone could explain how to get started and kindly provide some
> tutorial or article links. Please provide sample programs if possible. Thanks in
> advance.
>
>

------------------------------

Message: 3
Date: Fri, 21 Aug 2009 10:40:19 -0700 (PDT)
From: Neftali Papelleras <pap...@ya...>
Subject: [Htmlparser-user] Web Crawler Thesis Project Using HTML
    Parser To    collect links
To: htm...@li...
Cc: pap...@ya...
Message-ID: <661...@we...>
Content-Type: text/plain; charset="utf-8"

Hi everyone.

I am Neftali Papelleras, an Engineering student from the University of San Carlos, Cebu City, Philippines. I am currently working on my thesis project, which involves web crawling. The project is titled "A Web Extraction Tool to Monitor Websites" and is implemented in Java. I am still in the first month of this one-year thesis project, and still at the information-gathering stage.

The first question I need to answer is how to create a Java-based web crawler. The next is how to retrieve the web content of every web page. The last is how to retrieve links from a given web source. The first thing that came to mind was to use Java regular expressions to retrieve the links from a given web source, but I now understand that this is not the right way to do it. That is why I came to HTML Parser, because I know this is the right way.

I know Java, but not at an advanced level; I just know the concepts. Though I have already written several programs, the latest being a chat system, I am still not confident in my Java skills. But I am very eager to learn, and I am starting again now.

I have already downloaded version 1.6 of HTML Parser and have browsed through its folders and files. I attempted to create a very simple parser program using the HTML Parser API, but unfortunately I was confused about where and how to start. I am hoping that this organization can provide a simple program that illustrates how to retrieve a link from a given web page source (HTML text). I can then follow the program, which should eventually lead me to an understanding of how to use this API.
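
To make the task concrete, here is a minimal sketch of retrieving links from HTML text. It is only an illustration of the idea: it does not use the HTML Parser 1.6 API at all, but the HTML parser that ships with the JDK (javax.swing.text.html), so that it runs without extra libraries. The class name LinkCollector and the method extractLinks are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; uses the JDK's built-in HTML parser,
// not the HTML Parser library discussed in this thread.
public class LinkCollector {

    /** Returns the href value of every <a> tag in the given HTML text, in document order. */
    public static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser calls back for each start tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {          // skip anchors without an href
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<p>no link here</p>"
                + "<a href=\"/b\">B</a>"
                + "</body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The sketch returns href values exactly as written in the page, so relative links such as "/b" would still need to be resolved against the page's URL before a crawler could follow them.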

Looking forward to a good response from this organization.

Respectfully,
neftali


------------------------------

Message: 4
Date: Fri, 21 Aug 2009 10:42:32 -0700 (PDT)
From: Neftali Papelleras <pap...@ya...>
Subject: [Htmlparser-user] Web Crawler Thesis Project Using HTML
    Parser To    collect links
To: htm...@li...
Cc: pap...@ya...
Message-ID: <269...@we...>
Content-Type: text/plain; charset="utf-8"