Thread: RE: [Htmlparser-user] Efficient parsing - help needed

Ash,

For your requirement of reading the entire HTML and storing it on disk
in an identical format, I suggest that you not use the HTMLParser. For
the present, I suggest that you do it on your own using readers and
writers. The changes you suggested are quite good. However, the
toHTML() method does not exactly replicate the input HTML, so if you
are using it for that purpose you are better off with the approach
given above.
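
A minimal sketch of the reader/writer approach, using only java.io. The
class name RawCopy and the in-memory sample input are assumptions made
for illustration; in practice the Reader would wrap the HTTP response
stream and the Writer would wrap a file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper, not part of HTMLParser.
class RawCopy {
    // Copies every character from in to out unchanged and returns the count.
    static long copy(Reader in, Writer out) throws IOException {
        char[] buf = new char[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Sample input standing in for a downloaded page.
        String html = "<HTML>\r\n<BODY bgcolor=white>Hi</BODY>\r\n</HTML>";
        StringWriter sink = new StringWriter();
        long copied = copy(new StringReader(html), sink);
        System.out.println(copied + " chars copied, identical="
                + html.equals(sink.toString()));
    }
}
```

Because nothing interprets the characters on the way through, tags,
attribute quoting and line terminators all reach the disk as they were
received.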

For parsing HTML, however, this parser is great: it works beautifully,
it is easy to use (as Somik describes below), and you can switch
individual parsers on and off as required.

Regards,

Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457

-----Original Message-----
From: jtrek4 [mailto:jt...@ya...]
Sent: Monday, January 06, 2003 5:07 PM
To: htmlparser-user
Cc: jtrek4
Subject: Re: [Htmlparser-user] Efficient parsing - help needed

Hi Somik,

Thanks for the help. 

> You can use toHTML() to do this..
> HTMLNode node;
> for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
>    node = e.nextHTMLNode();
>    writeToDisk(node.toHTML());
> }

I tried this, but toHTML() modifies the contents,
wrongly in some cases. I have posted a bug regarding
this:
http://sourceforge.net/tracker/index.php?func=detail&aid=663038&group_id=24399&atid=381399

I have one suggestion to make: overloaded
constructors in HTMLParser with the following signatures:
public HTMLParser(java.lang.String resourceLocn, HTMLParserFeedback feedback, Writer writer)

public HTMLParser(java.lang.String resourceLocn, Writer writer)

with corresponding overloaded constructors in
HTMLReader:
public HTMLReader(java.io.Reader in, int len, Writer writer)

public HTMLReader(java.io.Reader in, java.lang.String url, Writer writer)

This will give the users a way to save the response to
disk as it is received. Of course, there is another
option of taking a String file name argument, but the
user may want to specify the file encoding as well (as
is the case with me). So the java.io.Writer is a
better option.

This should not take much time to implement, as you
just need to check if the writer has been supplied and
once you read a line using the readLine() method in
HTMLReader, write this string to the writer using the
println method and call flush(). This gives the added
advantage to the user of preserving line breaks at the
original points.
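
A rough sketch of the idea, written against java.io.BufferedReader
rather than the real HTMLReader. The class name TeeLineReader is
hypothetical, and the sketch assumes HTMLReader.readLine() behaves like
BufferedReader.readLine():

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical stand-in for the proposed change; not the library's HTMLReader.
class TeeLineReader extends BufferedReader {
    private final PrintWriter copy; // null when no writer was supplied

    TeeLineReader(Reader in, Writer writer) {
        super(in);
        this.copy = (writer == null) ? null : new PrintWriter(writer);
    }

    @Override
    public String readLine() throws IOException {
        String line = super.readLine();
        if (line != null && copy != null) {
            copy.println(line); // re-adds a line break after each line
            copy.flush();
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        StringWriter saved = new StringWriter();
        TeeLineReader reader = new TeeLineReader(
                new StringReader("<html>\n<body>\n</html>"), saved);
        int lines = 0;
        while (reader.readLine() != null) {
            lines++;
        }
        System.out.println(lines + " lines read; saved copy:");
        System.out.print(saved);
    }
}
```

To control the file encoding, the writer passed in could be, for
example, new OutputStreamWriter(new FileOutputStream("page.html"),
"UTF-8"). One caveat: BufferedReader.readLine() strips the line
terminator and println() writes the platform's own, so the breaks stay
at the original points but the terminator characters may differ from
the source.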

What do you think?

Also, when can we expect the next release?

Warm Regards,
Ash

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user