htmlparser-developer Mailing List for HTML Parser (Page 21)

Brought to you by: derrickoswald

htmlparser-developer — The developer mailing list of the htmlparser project


[Htmlparser-developer] Charset tests failing

From: Somik R. <so...@ya...> - 2002-12-27 07:17:56

Hi Derrick,
    I was working on the latest parser, and just found that the charset
tests are failing (I didn't modify anything in HTMLParser).
    Could you check what might have gone wrong? Also, I was wondering
if it might be better to have the charset pages on our domain
(http://htmlparser.sourceforge.net) - we could put up encoded pages and
they'd always be there.

Regards
Somik

RE: [Htmlparser-developer] toPlainTextString() feedback requested

From: <dha...@or...> - 2002-12-26 09:32:50

Attachments: BDY.RTF

Hi,

I agree with Sam when he says "don't fix it when it's not broken".
But at the same time I see the need to make code better, more readable
and simpler. However, as suggested below, it seems that the Visitor
pattern is going to make things difficult to understand. And I too, like
Sam, had a problem with HTMLParser initially. It may be the
documentation, but this whole tag-scanner thing was extremely confusing
in the beginning (though subsequently I have loved scanners so much that I
have even written a few myself without any problem). Hence I think that
even if the visitor pattern is used, the user must have some simple
easy-to-use methods to get his work done rather than having to understand
the Visitor pattern.

Just like Sam, this was my two cents.

Cheers,
Dhaval

-----Original Message-----
From: gaijin [mailto:ga...@yh...]
Sent: Tuesday, December 24, 2002 12:21 PM
To: htmlparser-developer
Cc: gaijin; htmlparser-user
Subject: Re: [Htmlparser-developer] toPlainTextString() feedback
requested

Hi Somik and Joshua

Joshua Kerievsky wrote:

>>Could you explain why you want to refactor these methods?  Remember the
>>danger of premature refactoring ... you lose flexibility that then has
>>to be re-added later on, making more work in the long run.
>>
>There's a good deal of duplicate code in the way the two toHTML methods and the
>toPlainTextString method do their work.  The central theme is information
>accumulation/alteration.  That involves outputting tag and node results and
>recursing through tags.  The refactoring to Visitor allows us to
>
>* remove many lines of duplicate code, spread across many classes
>* remove hard-coded accumulation/alteration logic, thereby making it easier
>for clients to get the data they need
>
>Visitor takes some getting used to.  I rarely use the pattern. In this case,
>IMO, it was a good fit.
>
The Visitor pattern sounds interesting, and I look forward to hearing 
more about it.  However, duplicated code itself is not IMO necessarily 
an evil. It all depends on whether one thinks that the duplicated 
components are going to diverge in functionality in the future.  If you 
are sure they are not, then fine, refactor away.

I guess my surprise at your (or perhaps Somik's) focus on refactoring 
comes from the fact that while the htmlparser is a great piece of 
software, the javadocs and other documentation could use some attention.

For example, I can't find any explanation in the javadocs or otherwise
of how the filters are supposed to work with the different scanners, or
what values they are allowed to take.

I generally work to "if it's not broken don't fix it", but I often add
"before you start fixing it, make sure your documentation is up to date".

Using the Visitor pattern may make it easier for clients to get the data
they need, but given that the htmlparser is "working" (well it works for
me), I would say that the more urgent issue here is making sure all the
documentation is up to date.   I have a lot of positive things to say
about htmlparser, so don't take it the wrong way, when I say that the
biggest problem I've had in using it in the last few weeks is inadequate
javadocs.

>>Is there some efficiency reason why you want to refactor these methods
>>or is it just for neatness?
>>    
>>
>
>Duplication removal is reason #1. 
>
As I mention above.  One should be careful of duplication removal for 
the sake of it.

> Removal of hard-coded logic is reason #2.
>
This is a good reason.  However I get the feeling that introduction of 
these Visitor classes will make the system conceptually more difficult 
to use rather than easier.  I would feel better if the current set up 
was more fully documented before more complexity was added.

And even if the Visitor pattern is used, I would recommend leaving 
methods like toPlainTextString() etc in place, but just making them 
short cut implementations to certain kinds of visitor-using methods. 
 This will allow people who have yet to grasp the Visitor  pattern 
something to work with.

If you are keen to see lots of people using htmlparser, I think that you
don't want people to have to come to terms with too many new concepts at
once.  You say yourself that the Visitor pattern takes some getting used
to.  I think the whole scanner concept takes some getting used to ....

>Simplicity is reason #3: there is little reason to fatten the interfaces of
>tag and node classes with various data accumulation/alteration methods when
>one method and a variety of concrete Visitors can do the job with much less
>code.
>
Well, I would agree if you could guarantee that there will be no
divergence whatsoever in how the different methods will be used.  If you
can create a flexible enough implementation of the Visitor pattern then
I guess that will support any possible divergence in the separate
methods. However, I think there is a reason to have a fatter interface,
in that convenience methods lower the barrier to entry for new users.

Perhaps ideally one has a well implemented Visitor pattern that supports
raw method access, and a number of convenience methods?

A well implemented Visitor pattern will, I assume, support all sorts of
different operations, but I would feel much happier if the htmlparser
had a complete javadoc and documentation review before any refactoring
took place.  People are trying to use the existing system and having
trouble not because of the lack of refactoring, but a lack of well
described methods.  Well I say people, I mean me, I don't know if anyone
else feels the same.  Maybe it's just me :-)

Just my two cents.

CHEERS> SAM

-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
Htmlparser-developer mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 21:49:42

Hi Kaarle,

> You seem to be very active today!
> 
> So am I. Merry Christmas to you all!
> 

Merry Christmas to everyone and to you! 

Cheers,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Kaarle K. <kaa...@ik...> - 2002-12-24 21:02:14

At 15:01 24.12.2002 -0500, you wrote:
>Somik Raha wrote:

You seem to be very active today!

So am I. Merry Christmas to you all!

Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Derrick O. <Der...@ro...> - 2002-12-24 20:00:21

Somik Raha wrote:

>Hi Derrick,
>    You've raised some interesting points..
>
>    Looking at it from an end-user perspective, prior to using the visitor,
>the code for collecting text from a page would be :
>
>StringBuffer textCollectionBuffer = new StringBuffer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString());
>}
>String result = textCollectionBuffer.toString();
>
>After putting in the visitor, the code looks like :
>
>Renderer plainTextRenderer = new PlainTextRenderer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    e.nextHTMLNode().renderWith(plainTextRenderer);
>}
>String result = plainTextRenderer.getResult();
>
>IMO, from a user perspective, there's almost no extra complexity. You'd
>anyway collect results into a string buffer, while the renderer now does
>that for you.
>
>If it was just for the plain text renderer, we would not be justified in
>this refactoring. However, the code for getting html out would look like :
>Renderer htmlRenderer = new HTMLRenderer();
>for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>    e.nextHTMLNode().renderWith(htmlRenderer);
>}
>String result = htmlRenderer.getResult();
>
>And now, when ripping webpages, we would like to modify the links and images
>to a base location in the local disk. We could have modifying html renderer,
>like this:
>
>Renderer  modifyingLinksAndImagesRenderer = new
>ModifyingLinksAndImagesRenderer();
>...
>
>Any modification in the mechanism of rendering is now easily handled, and we
>don't have to modify the parser per se, for every new user requirement.
>
The first two examples gain you nothing from using a visitor pattern. 
 The plainTextRenderer and HTMLRenderer call the toPlainTextString() and 
toHTML() methods because only the tag knows how to render itself into 
text or HTML (see your comment below).

Your last example indicates the visitor class knows which parameters 
have links and which parameters have images for each tag, and I would 
suggest this knowledge should be in the tag class itself.  If the base 
class (HTMLTag I presume) had an abstract method signature 'toHTML(URL 
baseURL)', then code within the class would adjust relative link and 
image references based on its value, and the current behaviour of 
'toHTML()' would just call toHTML(null).  The code is not duplicated, 
only refactored to handle a base URL.
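
A minimal sketch of the 'toHTML(baseURL)' idea described above, for
illustration only. The class names SketchTag and SketchLinkTag are
hypothetical stand-ins, not htmlparser classes, and the sketch takes a
java.net.URI rather than a URL purely to avoid checked exceptions:

```java
import java.net.URI;

// Hypothetical stand-in for the HTMLTag base class discussed in the thread.
abstract class SketchTag {
    /** Renders the tag; relative references are resolved against baseUri when it is non-null. */
    abstract String toHTML(URI baseUri);

    /** The no-argument form simply delegates with no base, leaving references untouched. */
    String toHTML() {
        return toHTML(null);
    }
}

// Hypothetical link tag: the tag itself knows which attribute holds a reference.
class SketchLinkTag extends SketchTag {
    private final String href;
    private final String text;

    SketchLinkTag(String href, String text) {
        this.href = href;
        this.text = text;
    }

    @Override
    String toHTML(URI baseUri) {
        // Resolve only when a base was supplied; otherwise emit the reference as parsed.
        String target = (baseUri == null) ? href : baseUri.resolve(href).toString();
        return "<a href=\"" + target + "\">" + text + "</a>";
    }
}
```

With this shape a caller never needs to know which attributes carry
references; it only chooses whether to pass a base.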

>
>  
>
>>It breaks encapsulation.  It presumes that an exterior class (the
>>visitor) knows how to do something to a class better than the class
>>itself does.
>>    
>>
>
>In light of the above, this is not a bad thing. At the same time, putting in
>the visitor has caused/will cause refactoring of the tag classes so that
>most of the logic is still encapsulated in the tags, and the visitor only
>makes calls to the tag's concrete methods.
>e.g. The visitor still does NOT know how to render a tag's html. Such logic
>continues to be encapsulated within the tag. The visitor only deals with a
>high-level abstraction.
>
So why bother with the extra overhead of the visitor class?  Logically, 
either the visitor has code that differs from the tag code or it 
doesn't.  If it doesn't, then cut the Gordian knot and use the simplest 
mechanism and forget the visitor.  If it does differ, the code within 
the visitor either uses the methods of the visitee only or it has some 
new methods with logic and data that are external to the visitee.  In 
the former case, the code that only uses methods on the visitee could 
be a method on the visitee's superclass.  If it has new methods, either 
the visitor's extra logic and data are specific to each visitee or they 
are of a generic nature.  If they are specific to the visitee, then they 
should be within the visitee class and not in a catch-all visitor class, 
otherwise if they are of a general nature they might as well be moved to 
the superclass (or a utility class if they don't belong there).  There 
is really no need for a visitor pattern.

>
>  
>
>>It makes the system brittle.  The visitor class has a method defined for
>>every subclass of visited object.  Adding, removing or refactoring any
>>visited class breaks the visitors.
>>    
>>
>
>A valid point indeed. To counter, in our implementation, there is a lot of
>duplication, and quite a few classes really don't need their own toHTML()
>method. The parent class - HTMLTag, should take care of single tags. Those
>tags with children always have duplicated code to recurse through their
>children. We can refactor such functionality out, so that tags with
>children conform to a recursion interface. Using such an interface, you won't
>have the recursion loops in each and every node class, but you could handle
>it uniformly outside. So, the visitor does not need to take care of each and
>every subclass. It will take care of only some special ones, while
>visiting the rest as a simple HTMLTag.
>  
>
The duplication should be refactored to superclasses.  You may need 
a subclass of HTMLTag that has a 'container' interface, but why not put 
that interface on HTMLTag itself and have a default implementation that 
is an empty container?  Then FORM and FRAMESET and other container tags 
derive from an abstract class (HTMLContainerTag ?) that derives from 
HTMLTag but overrides this behaviour and provides real recursion support.
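
The container arrangement suggested above might look roughly like the
following sketch. All class names here (BaseTag, TextLeaf, ContainerTag,
FormSketch) are hypothetical and only illustrate the shape of the idea;
they are not the library's classes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical base tag: every tag is a container, empty by default.
class BaseTag {
    private final String name;

    BaseTag(String name) { this.name = name; }

    String getName() { return name; }

    /** Default implementation: a tag with no children. */
    List<BaseTag> children() { return Collections.emptyList(); }

    /** The recursion is written once here instead of in every subclass. */
    String toPlainTextString() {
        StringBuilder sb = new StringBuilder();
        for (BaseTag child : children()) {
            sb.append(child.toPlainTextString());
        }
        return sb.toString();
    }
}

// Hypothetical leaf that carries text and has no children.
class TextLeaf extends BaseTag {
    private final String text;

    TextLeaf(String text) { super("#text"); this.text = text; }

    @Override
    String toPlainTextString() { return text; }
}

// Hypothetical abstract container: overrides the empty default with real children.
abstract class ContainerTag extends BaseTag {
    private final List<BaseTag> kids = new ArrayList<>();

    ContainerTag(String name) { super(name); }

    ContainerTag add(BaseTag child) { kids.add(child); return this; }

    @Override
    List<BaseTag> children() { return Collections.unmodifiableList(kids); }
}

// A FORM-like tag gets recursion support just by extending the container.
class FormSketch extends ContainerTag {
    FormSketch() { super("FORM"); }
}
```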

>  
>
>>It's obscure.  Looking at the visit loop tells you nothing of what is
>>being performed.  The only indication of the task might be the name of
>>the visitor class.
>>    
>>
>
>That is again not a bad thing - for well-intentioned naming reduces the need
>for excessive documentation. You should be able to look at a class name and
>have a fairly good idea of what it does. A look at the API of the class
>would confirm your guess.
>
>At the same time, looking at the visit loop above, you'd see that it looks
>intuitive enough - the loop is not in the visitor, it is still in the
>client code.
>  
>
But the StringBuffer is now hidden and it's not obvious that a 
getResult() method call returns its contents, so you have an 
indirection to look up the visitor class.

>  
>
>>It's redundant.  In every example I've seen (I've never seen it used in
>>real applications), the result could have been coded in a simpler
>>fashion with correct class design.
>>    
>>
>
>This is the point we're really thinking about. I've personally been thinking
>on this for some months - for we did have realistic requirements of being
>able to modify links and images in a ripper, in a uniform manner. In fact,
>the current version of the parser already has a visitor. The Sample Programs
>make no mention of the word "visitor" to avoid controversy and confusion
>:), but if you look at
>http://htmlparser.sourceforge.net/samples/ripper.html - this is what's
>happening. I must add that although this visitor does provide a solution, I
>wasn't very happy with it. We've been able to refactor this out, and provide
>a uniform mechanism now.
>  
>
That's not really a visitor example, that's a callback, which is 
obviated if the toHTML(URL) signature is used.

>  
>
>>I would only use the visitor pattern if either the set of public methods
>>on the visited nodes completely specify their possible behavior and
>>allowed the client to fully manipulate them at the highest level of
>>abstraction (i.e. everything you can do to the visited classes is
>>already coded and you don't want them to change)
>>    
>>
>
>I think we're talking the same language here - this is also our vision. The
>code you see on the samples page is an example of that (although not a very
>good one). We're working on those lines, and the first step was to introduce
>such levels of abstraction.
>
>Regards,
>Somik
>

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 18:48:57

Hi Derrick,
    You've raised some interesting points..

    Looking at it from an end-user perspective, prior to using the visitor,
the code for collecting text from a page would be :

StringBuffer textCollectionBuffer = new StringBuffer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    textCollectionBuffer.append(e.nextHTMLNode().toPlainTextString());
}
String result = textCollectionBuffer.toString();

After putting in the visitor, the code looks like :

Renderer plainTextRenderer = new PlainTextRenderer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    e.nextHTMLNode().renderWith(plainTextRenderer);
}
String result = plainTextRenderer.getResult();

IMO, from a user perspective, there's almost no extra complexity. You'd
anyway collect results into a string buffer, while the renderer now does
that for you.

If it was just for the plain text renderer, we would not be justified in
this refactoring. However, the code for getting html out would look like :
Renderer htmlRenderer = new HTMLRenderer();
for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
    e.nextHTMLNode().renderWith(htmlRenderer);
}
String result = htmlRenderer.getResult();

And now, when ripping webpages, we would like to modify the links and images
to point to a base location on the local disk. We could have a modifying HTML
renderer, like this:

Renderer  modifyingLinksAndImagesRenderer = new
ModifyingLinksAndImagesRenderer();
...

Any modification in the mechanism of rendering is now easily handled, and we
don't have to modify the parser per se, for every new user requirement.
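
To make the renderer idea above concrete, here is a minimal sketch of how
such a renderer could be wired up. Only renderWith, getResult,
toPlainTextString and toHTML are names from this mail; the types RNode,
RRenderer, RPlainTextRenderer, RHtmlRenderer, RBold and RendererDemo are
hypothetical, and the real implementation may differ:

```java
import java.util.List;

// Hypothetical node type standing in for the parser's node classes.
abstract class RNode {
    abstract String toPlainTextString();
    abstract String toHTML();

    // Entry point used by client loops: hand this node to the renderer.
    void renderWith(RRenderer renderer) {
        renderer.render(this);
    }
}

// Hypothetical renderer contract: accumulate output, then report it.
interface RRenderer {
    void render(RNode node);
    String getResult();
}

// Accumulates plain text by asking each node for its own text.
class RPlainTextRenderer implements RRenderer {
    private final StringBuilder out = new StringBuilder();
    public void render(RNode node) { out.append(node.toPlainTextString()); }
    public String getResult() { return out.toString(); }
}

// Accumulates HTML; the node still decides how its own HTML looks.
class RHtmlRenderer implements RRenderer {
    private final StringBuilder out = new StringBuilder();
    public void render(RNode node) { out.append(node.toHTML()); }
    public String getResult() { return out.toString(); }
}

// Hypothetical concrete node used only to exercise the sketch.
class RBold extends RNode {
    private final String text;
    RBold(String text) { this.text = text; }
    String toPlainTextString() { return text; }
    String toHTML() { return "<b>" + text + "</b>"; }
}

class RendererDemo {
    // The loop stays in client code; only the renderer changes between uses.
    static String renderAll(List<RNode> nodes, RRenderer renderer) {
        for (RNode node : nodes) {
            node.renderWith(renderer);
        }
        return renderer.getResult();
    }
}
```

In this sketch the rendering knowledge stays inside the node, and the
renderer only accumulates results.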

> It breaks encapsulation.  It presumes that an exterior class (the
> visitor) knows how to do something to a class better than the class
> itself does.

In light of the above, this is not a bad thing. At the same time, putting in
the visitor has caused/will cause refactoring of the tag classes so that
most of the logic is still encapsulated in the tags, and the visitor only
makes calls to the tag's concrete methods.
e.g. The visitor still does NOT know how to render a tag's html. Such logic
continues to be encapsulated within the tag. The visitor only deals with a
high-level abstraction.

> It makes the system brittle.  The visitor class has a method defined for
> every subclass of visited object.  Adding, removing or refactoring any
> visited class breaks the visitors.

A valid point indeed. To counter, in our implementation, there is a lot of
duplication, and quite a few classes really don't need their own toHTML()
method. The parent class - HTMLTag, should take care of single tags. Those
tags with children always have duplicated code to recurse through their
children. We can refactor such functionality out, so that tags with
children conform to a recursion interface. Using such an interface, you won't
have the recursion loops in each and every node class, but you could handle
it uniformly outside. So, the visitor does not need to take care of each and
every subclass. It will take care of only some special ones, while
visiting the rest as a simple HTMLTag.

> It's obscure.  Looking at the visit loop tells you nothing of what is
> being performed.  The only indication of the task might be the name of
> the visitor class.

That is again not a bad thing - for well-intentioned naming reduces the need
for excessive documentation. You should be able to look at a class name and
have a fairly good idea of what it does. A look at the API of the class
would confirm your guess.

At the same time, looking at the visit loop above, you'd see that it looks
intuitive enough - the loop is not in the visitor, it is still in the
client code.

> It's redundant.  In every example I've seen (I've never seen it used in
> real applications), the result could have been coded in a simpler
> fashion with correct class design.

This is the point we're really thinking about. I've personally been thinking
on this for some months - for we did have realistic requirements of being
able to modify links and images in a ripper, in a uniform manner. In fact,
the current version of the parser already has a visitor. The Sample Programs
make no mention of the word "visitor" to avoid controversy and confusion
:), but if you look at
http://htmlparser.sourceforge.net/samples/ripper.html - this is what's
happening. I must add that although this visitor does provide a solution, I
wasn't very happy with it. We've been able to refactor this out, and provide
a uniform mechanism now.

> I would only use the visitor pattern if either the set of public methods
> on the visited nodes completely specify their possible behavior and
> allowed the client to fully manipulate them at the highest level of
> abstraction (i.e. everything you can do to the visited classes is
> already coded and you don't want them to change)

I think we're talking the same language here - this is also our vision. The
code you see on the samples page is an example of that (although not a very
good one). We're working on those lines, and the first step was to introduce
such levels of abstraction.

Regards,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Derrick O. <Der...@ro...> - 2002-12-24 17:47:30

IMHO the visitor pattern is not a good one. Here are the reasons.

It breaks encapsulation.  It presumes that an exterior class (the 
visitor) knows how to do something to a class better than the class 
itself does.

It makes the system brittle.  The visitor class has a method defined for 
every subclass of visited object.  Adding, removing or refactoring any 
visited class breaks the visitors.

It's obscure.  Looking at the visit loop tells you nothing of what is 
being performed.  The only indication of the task might be the name of 
the visitor class.

It's redundant.  In every example I've seen (I've never seen it used in 
real applications), the result could have been coded in a simpler 
fashion with correct class design.

I would only use the visitor pattern if either the set of public methods 
on the visited nodes completely specify their possible behavior and 
allowed the client to fully manipulate them at the highest level of 
abstraction (i.e. everything you can do to the visited classes is 
already coded and you don't want them to change), or the visitor pattern 
had already been coded into the visited nodes and for some reason they 
could not be modified further (i.e. the code was available in binary 
form only).  I don't think either of these applies here.
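
For readers unfamiliar with the pattern, the brittleness point above can be
seen in a textbook-style visitor. This is a generic sketch with
hypothetical names (NodeVisitor, VNode, VLink, VImage, UrlCollector), not
code from htmlparser:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical visitor contract: one method per concrete node class.
interface NodeVisitor {
    void visitLink(VLink link);
    void visitImage(VImage image);
    // Supporting a new node class means adding a method here, and every
    // existing NodeVisitor implementation stops compiling until it is updated.
}

// Hypothetical visited hierarchy.
abstract class VNode {
    abstract void accept(NodeVisitor visitor);
}

class VLink extends VNode {
    final String href;
    VLink(String href) { this.href = href; }
    void accept(NodeVisitor visitor) { visitor.visitLink(this); }
}

class VImage extends VNode {
    final String src;
    VImage(String src) { this.src = src; }
    void accept(NodeVisitor visitor) { visitor.visitImage(this); }
}

// One concrete visitor: collects every URL it is shown, in visit order.
class UrlCollector implements NodeVisitor {
    final List<String> urls = new ArrayList<>();
    public void visitLink(VLink link) { urls.add(link.href); }
    public void visitImage(VImage image) { urls.add(image.src); }
}
```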

my $0.02

Derrick

Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )

From: Sam J. <ga...@yh...> - 2002-12-24 17:05:00

Hi Somik

Somik Raha wrote:

>Hi Sam,
>  
>
>>Can you give me an example of the hard coded rules you are using now,
>>and a couple of examples of dirty html pages that cause them to be
>>sub-optimal.
>>    
>>
>
>Here are some tags :
>[1] From neurogrid.com (debugging last year :)
><a href="mailto:sa...@ne...?subject=Site Comments">Mail Us<a>
>
>[2] From freshmeat.net
><a>revision</a>
>
>[3] From fedpage.com
><a href="registration.asp?EventID=1272"><img border="0"
>src="\images\register.gif"</a>
>
>[4] From yahoo.com
><a href=s/8741><img
>src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16
>width=16 border=0></img></td><td nowrap> &nbsp;
><a href=s/7509><b>Yahoo! Movies</b></a>
>
>As you can see, dirty html hardly looks predictable. Especially when links
>are not closed correctly, the scanner has to guess when it should close the
>tag.
>
>And this is only for the link tag. For normal tags,
>[1] <sometag key1=value key2="value2 key3 = value3>
>[2] <sometag key1="<sometag>" key2="<!-- skdlskld -->">
>
>The above two tags demonstrate a classic dilemma. If we ignore inverted
>commas, we cannot handle case 2, where the contents within inverted commas
>is valid text and not tags. All these examples are accepted by IE.
>
>All of these problems are currently handled by the parser, and I was looking
>to simplify the brain of the parser.
>
Could you explain how all of these examples are handled by the current 
parser?  Are you using some kind of specific rule to handle each case? 
Perhaps you can cut and paste a bit of code to the list to illustrate.

The more precisely you can describe the operation of the existing parser 
when handling these kinds of cases, the more likely I can come up with a 
learner that will meet your needs.

CHEERS> SAM

Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )

From: Somik R. <so...@ya...> - 2002-12-24 07:28:40

Hi Sam,
> Can you give me an example of the hard coded rules you are using now,
> and a couple of examples of dirty html pages that cause them to be
> sub-optimal.

Here are some tags :
[1] From neurogrid.com (debugging last year :)
<a href="mailto:sa...@ne...?subject=Site Comments">Mail Us<a>

[2] From freshmeat.net
<a>revision</a>

[3] From fedpage.com
<a href="registration.asp?EventID=1272"><img border="0"
src="\images\register.gif"</a>

[4] From yahoo.com
<a href=s/8741><img
src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16
width=16 border=0></img></td><td nowrap> &nbsp;
<a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, dirty html hardly looks predictable. Especially when links
are not closed correctly, the scanner has to guess when it should close the
tag.

And this is only for the link tag. For normal tags,
[1] <sometag key1=value key2="value2 key3 = value3>
[2] <sometag key1="<sometag>" key2="<!-- skdlskld -->">

The above two tags demonstrate a classic dilemma. If we ignore inverted
commas, we cannot handle case 2, where the contents within inverted commas
is valid text and not tags. All these examples are accepted by IE.
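
One way to picture the dilemma is a tag-end finder that honours double
quotes but falls back to the first '>' when a quote is never closed. This
is only an illustrative heuristic under that assumption; the class
TagEndFinder is hypothetical and is not how the parser is actually
implemented:

```java
// Hypothetical helper illustrating the quoted-attribute dilemma.
final class TagEndFinder {
    private TagEndFinder() { }

    /**
     * Returns the index of the '>' taken to close the tag that starts at
     * 'start', or -1 if the input contains no '>' after 'start'.
     */
    static int findTagEnd(String html, int start) {
        boolean inQuotes = false;
        int firstGt = -1;
        for (int i = start + 1; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // toggle on every double quote
            } else if (c == '>') {
                if (firstGt < 0) {
                    firstGt = i;                // remember the earliest candidate
                }
                if (!inQuotes) {
                    return i;                   // a '>' outside quotes ends the tag (case 2)
                }
            }
        }
        // A quote was opened but never closed (case 1): fall back to the
        // first '>' seen rather than swallowing the rest of the input.
        return firstGt;
    }
}
```

For both sample tags above, this sketch returns the index of the final
character of the tag.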

All of these problems are currently handled by the parser, and I was looking
to simplify the brain of the parser.

> Using learning in a system to increase efficiency is usually very
> difficult to do well.  Learning systems basically have more flexibility
> than other systems, but as a consequence you have more free parameters.
>  It is easy to add a learning framework but then spend all your time
> just trying to adjust the system parameters, and then to discover that
> exploring the space of possible parameters for your learner is just too
> expensive.
>
> Nonetheless I am always fascinated by the problem of adding learning to a
> system, precisely because it is so difficult to do well.  If you can
> give me some concrete examples, I will do my best to help you select an
> appropriate learning mechanism.

Thanks!

> Interesting.  The ant scripts for neurogrid were originally made by Rick
> Knowles, and I'm still only just getting a really good feel for ant.  I
> remember trying to set up something in my ng scripts that would add the
> date to the jar file name, like you sometimes do, and failing.  My own
> fault really; I rarely read the manual and always try to learn by
> modifying the operation of an existing system (kind of an evolutionary
> approach ....)

That's what I did too - but I started with the examples on the ant website (I
think they believe in the approach too)

Cheers,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 07:01:48

Hi Sam,

> I guess my surprise at your (or perhaps Somik's) focus on refactoring
> comes from the fact that while the htmlparser is a great piece of
> software, the javadocs and other documentation could use some attention.
>  For example, I can't find any explanation in the javadocs or otherwise
> of how the filters are supposed to work with the different scanners, or
> what values they are allowed to take.

The last few weeks, I've been only adding docs...
The Sample Programs are new - you will find an example of using filters here
:
http://htmlparser.sourceforge.net/samples/linksEmbedded.html

From the javadoc :
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/HTMLNode.html
(check collectInto)

I guess more javadoc should be written about this, but I didn't bother to
write it because the filters were mostly used for demonstration from the
command line. When you type java -jar htmlparser.jar, you will see a help
menu for each filter.

It's only with the introduction of Collection Parameter that we've been using
filter strings for actual collection of data, and that has been documented.
If it's not important, it's not documented :).

> I generally work to "if it's not broken don't fix it", but I often add
> "before you start fixing it, make sure your documentation is up to date".
> Using the Visitor pattern may make it easier for clients to get the data
> they need, but given that the htmlparser is "working" (well it works for
> me), I would say that the more urgent issue here is making sure all the
> documentation is up to date.   I have a lot of positive things to say
> about htmlparser, so don't take it the wrong way, when I say that the
> biggest problem I've had in using it in the last few weeks is inadequate
> javadocs.

Sam - feel free to post as often as you like on the list. I'd be glad to
help you out. The lack of adequate documentation is of course my
responsibility and in light of my explanation above, if you can suggest
other areas of inadequate docs, I'll look into it.

> >Duplication removal is reason #1.
> >
> As I mention above.  One should be careful of duplication removal for
> the sake of it.
>
> > Removal of hard-coded logic is reason #2.
> >
> This is a good reason.  However I get the feeling that introduction of
> these Visitor classes will make the system conceptually more difficult
> to use rather than easier.  I would feel better if the current set up
> was more fully documented before more complexity was added.
>
> And even if the Visitor pattern is used, I would recommend leaving
> methods like toPlainTextString() etc in place, but just making them
> short cut implementations to certain kinds of visitor-using methods.
>  This will allow people who have yet to grasp the Visitor  pattern
> something to work with.

Don't worry, we'll leave toPlainTextString() alone - in fact, we've already
begun doing the short cut implementations. :)
The purpose of my initial mail was not to alarm you, but to know more about
realistic "customer" stories.
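
A rough sketch of what such a short cut implementation could look like:
the convenience method stays on the node and simply drives a visitor
internally. The names PlainTextCollector, ShortcutNode, accept and
visitText are hypothetical; only toPlainTextString() is taken from the
discussion:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical visitor that accumulates plain text.
class PlainTextCollector {
    private final StringBuilder out = new StringBuilder();

    void visitText(String text) { out.append(text); }

    String getResult() { return out.toString(); }
}

// Hypothetical node offering both the visitor entry point and the short cut.
class ShortcutNode {
    private final String text;
    private final List<ShortcutNode> children = new ArrayList<>();

    ShortcutNode(String text) { this.text = text; }

    ShortcutNode add(ShortcutNode child) { children.add(child); return this; }

    // Visitor entry point: report own text, then recurse into children.
    void accept(PlainTextCollector visitor) {
        visitor.visitText(text);
        for (ShortcutNode child : children) {
            child.accept(visitor);
        }
    }

    // Convenience method kept for callers who do not want to touch visitors.
    String toPlainTextString() {
        PlainTextCollector collector = new PlainTextCollector();
        accept(collector);
        return collector.getResult();
    }
}
```

Existing callers of toPlainTextString() see no difference, while new
callers can pass their own visitor to accept().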

> If you are keen to see lots of people using htmlparser, I think that you
> don't want people to have to come to terms with too many new concepts at
> once.  You say yourself that the Visitor pattern takes some getting used
> to.  I think the whole scanner concept takes some getting used to ....
>

By the way, have you gone through the Sample Programs? I would have thought
that this new addition would make life very simple. If not, we'd probably
need more docs..

Regards,
Somik

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-24 06:35:40

Hi Somik and Joshua

Joshua Kerievsky wrote:

>>Could you explain why you want to refactor these methods?  Remember the
>>danger of premature refactoring ... you lose flexibility that then has
>>to be re-added later on, making more work in the long run.
>>    
>>
>There's a good deal of duplicate code in the way the two toHTML methods and the
>toPlainTextString method do their work.  The central theme is information
>accumulation/alteration.  That involves outputting tag and node results and
>recursing through tags.  The refactoring to Visitor allows us to
>
>* remove many lines of duplicate code, spread across many classes
>* remove hard-coded accumulation/alteration logic, thereby making it easier
>for clients to get the data they need
>
>Visitor takes some getting used to.  I rarely use the pattern. In this case,
>IMO, it was a good fit.
>
The Visitor pattern sounds interesting, and I look forward to hearing 
more about it.  However, duplicated code itself is not IMO necessarily 
an evil. It all depends on whether one thinks that the duplicated 
components are going to diverge in functionality in the future.  If you 
are sure they are not, then fine, refactor away.

I guess my surprise at your (or perhaps Somik's) focus on refactoring 
comes from the fact that while the htmlparser is a great piece of 
software, the javadocs and other documentation could use some attention. 
 For example, I can't find any explanation in the javadocs or otherwise 
of how the filters are supposed to work with the different scanners, or 
what values they are allowed to take.

I generally work to "if it's not broken don't fix it", but I often add 
"before you start fixing it, make sure your documentation is up to date".

Using the Visitor pattern may make it easier for clients to get the data 
they need, but given that the htmlparser is "working" (well it works for 
me), I would say that the more urgent issue here is making sure all the 
documentation is up to date.   I have a lot of positive things to say 
about htmlparser, so don't take it the wrong way, when I say that the 
biggest problem I've had in using it in the last few weeks is inadequate 
javadocs.

>>Is there some efficiency reason why you want to refactor these methods
>>or is it just for neatness?
>>    
>>
>
>Duplication removal is reason #1. 
>
As I mention above.  One should be careful of duplication removal for 
the sake of it.

> Removal of hard-coded logic is reason #2.
>
This is a good reason.  However I get the feeling that introduction of 
these Visitor classes will make the system conceptually more difficult 
to use rather than easier.  I would feel better if the current set up 
was more fully documented before more complexity was added.

And even if the Visitor pattern is used, I would recommend leaving 
methods like toPlainTextString() etc in place, but just making them 
short cut implementations to certain kinds of visitor-using methods. 
 This will allow people who have yet to grasp the Visitor  pattern 
something to work with.

If you are keen to see lots of people using htmlparser, I think that you 
don't want people to have to come to terms with too many new concepts at 
once.  You say yourself that the Visitor pattern takes some getting used 
to.  I think the whole scanner concept takes some getting used to ....

>Simplicity is reason #3: there is little reason to fatten the interfaces of
>tag and node classes with various data accumulation/alteration methods when
>one method and a variety of concrete Visitors can do the job with much less
>code.
>
well I would agree if you could guarantee that there will be no 
divergence whatsoever in how the different methods will be used.  If you 
can create a flexible enough implementation of the Visitor pattern then 
I guess that will support any possible divergence in the separate 
methods. However, I think there is a reason to have a fatter interface, 
in that convenience methods lower the barrier to entry for new users.  

Perhaps ideally one has a well implemented Visitor pattern that supports 
a raw method access, and a number of convenience methods?

A well implemented Visitor pattern will, I assume, support all sorts of 
different operations, but I would feel much happier if the htmlparser 
had a complete javadoc and documentation review before any refactoring 
took place.  People are trying to use the existing system and having 
trouble not because of the lack of refactoring, but a lack of well 
described methods.  Well I say people, I mean me, I don't know if anyone 
else feels the same.  Maybe it's just me :-)

Just my two cents.

CHEERS> SAM

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Joshua K. <jo...@in...> - 2002-12-24 05:40:22

> Could you explain why you want to refactor these methods?  Remember the
> danger of premature refactoring ... you lose flexibility that then has
> to be re-added later on, making more work in the long run.

Hi Sam,

There's a good deal of duplicate code in the way the two toHTML methods and the
toPlainTextString method do their work.  The central theme is information
accumulation/alteration.  That involves outputting tag and node results and
recursing through tags.  The refactoring to Visitor allows us to

* remove many lines of duplicate code, spread across many classes
* remove hard-coded accumulation/alteration logic, thereby making it easier
for clients to get the data they need

Visitor takes some getting used to.  I rarely use the pattern. In this case,
IMO, it was a good fit.

> Is there some efficiency reason why you want to refactor these methods
> or is it just for neatness?

Duplication removal is reason #1.  Removal of hard-coded logic is reason #2.
Simplicity is reason #3: there is little reason to fatten the interfaces of
tag and node classes with various data accumulation/alteration methods when
one method and a variety of concrete Visitors can do the job with much less
code.

best regards
jk
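
[Editor's sketch] The refactoring described in this message can be illustrated with a minimal Java example. It is only a sketch under stated assumptions: the names NodeVisitor, Node, TextNode, TagNode and PlainTextVisitor are hypothetical stand-ins, not the parser's actual classes. It shows one accept method replacing per-class accumulation logic, with toPlainTextString() kept as a thin shortcut over a visitor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the parser's node classes.
interface NodeVisitor {
    void visitText(TextNode text);
    void visitTag(TagNode tag);
}

abstract class Node {
    // The single entry point that replaces per-class accumulation methods.
    abstract void accept(NodeVisitor visitor);

    // Convenience shortcut kept for callers who do not want to write a visitor.
    String toPlainTextString() {
        PlainTextVisitor visitor = new PlainTextVisitor();
        accept(visitor);
        return visitor.result();
    }
}

final class TextNode extends Node {
    final String text;

    TextNode(String text) { this.text = text; }

    @Override void accept(NodeVisitor visitor) { visitor.visitText(this); }
}

final class TagNode extends Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    TagNode(String name, Node... kids) {
        this.name = name;
        for (Node kid : kids) children.add(kid);
    }

    // Visit this tag first, then recurse through its children.
    @Override void accept(NodeVisitor visitor) {
        visitor.visitTag(this);
        for (Node child : children) child.accept(visitor);
    }
}

// One concrete visitor: accumulates text and ignores tags.
final class PlainTextVisitor implements NodeVisitor {
    private final StringBuilder buffer = new StringBuilder();

    @Override public void visitText(TextNode text) { buffer.append(text.text); }
    @Override public void visitTag(TagNode tag) { /* tags contribute no text */ }

    String result() { return buffer.toString(); }
}
```

A second visitor (for example one that rebuilds HTML) would reuse the same accept traversal, which is where the duplication removal would come from in this sketch.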

Re: [Htmlparser-developer] New Developer Introduction - Joshua Kerievsky

From: Joshua K. <jo...@in...> - 2002-12-24 04:29:33

Thanks for the intro Somik.  I'd hardly consider myself a legend of OO -
merely a practitioner.  I'm excited to help out here.  Thanks for having
me.  --jk

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 03:37:22

Hi Sam,

> Could you explain why you want to refactor these
> methods?  Remember the 
> danger of premature refactoring ... you lose
> flexibility that then has 
> to be re-added later on, making more work in the
> long run.

There seems to be so much duplication in the parser
that only Merciless_Refactoring can
help clean the code.
 
> Is there some efficiency reason why you want to
> refactor these methods 
> or is it just for neatness?

Readability and efficiency are important factors - far
more important is to have a simple way of getting data
from the parser. The idea is to pass in your own
visitors into the parser, which could either extract
text, or html. If you need to modify them, then you
can modify these visitors to customize behaviour. The
weight of the parser would reduce, code duplication
would not be there, and it would be easier to
customize.

> Either way, I would certainly recommend a
> deprecation step.

Will keep that in mind.

Cheers,
Somik

__________________________________________________
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com

Re: [Htmlparser-developer] toPlainTextString() feedback requested

From: Sam J. <ga...@yh...> - 2002-12-24 02:39:57

Hi Somik,

I tend to use toPlainTextString() in all three ways described below.  I 
use the first to generate a plain text overview, the second for 
debugging statements, and the final one for when I want to add my own 
additional formatting to the final text.

Could you explain why you want to refactor these methods?  Remember the 
danger of premature refactoring ... you lose flexibility that then has 
to be re-added later on, making more work in the long run.

Is there some efficiency reason why you want to refactor these methods 
or is it just for neatness?

Either way, I would certainly recommend a deprecation step.

CHEERS> SAM

Somik Raha wrote:

>Hi Folks,
>  We are performing a series of refactorings, which
>will involve merging of the functionality of
>toPlainTextString() and toHTML(HTMLRenderer), and
>perhaps, toHTML() itself..
>
>  We'd like to know user stories of how you've been
>using toPlainTextString().
>
>Do you enumerate through the nodes and :
>[1] collect all the text data using
>toPlainTextString() into a buffer? 
>
>[2] make calls to toPlainTextString() but don't
>collect the data ? 
>
>[3] collect all text data using toPlainTextString()
>into a buffer, but only after having processed the
>string in some way ? 
>
>Another question - if we remove toPlainTextString()
>[and provide you an alternative], will it hurt, or
>would you prefer a deprecation step (in 1.3) before
>removal (in 1.4) ?
>
>It would be good for us to know more details about
>your applications, so that we can perform our design
>modifications correctly. Feel free to tell us your
>stories in detail. (Remember to click Reply All so the
>replies go to both lists).
>
>Regards,
>Somik
>
>
>
>-------------------------------------------------------
>This sf.net email is sponsored by:ThinkGeek
>Welcome to geek heaven.
>http://thinkgeek.com/sf
>_______________________________________________
>Htmlparser-developer mailing list
>Htm...@li...
>https://lists.sourceforge.net/lists/listinfo/htmlparser-developer
>
>
>  
>

[Htmlparser-developer] New Developer Introduction - Joshua Kerievsky

From: Somik R. <so...@ya...> - 2002-12-24 02:26:39

Hi Folks,
  It gives me great pleasure to introduce Joshua
Kerievsky, founder of Industrial Logic, Inc. 
Josh is a legendary OO programmer, and a great XP
coach. 

  Josh is actually writing a book on refactoring to
patterns
(http://www.industriallogic.com/xp/refactoring/), and
he is considering using the parser as an example of a
framework where refactoring to a visitor could be
applied. 

  We in fact spent much of today improving the design
and merging the functionality of toPlainTextString()
and toHTML(HTMLRenderer) into a single
acceptVisitor(HTMLVisitor) method.

  We are also looking at merging functionality of
toHTML() itself..

  Pairing with Josh has given me insight into so many
possible refactorings on the current codebase.

   Josh -> Welcome to the developer team of
HTMLParser.

Regards,
Somik


[Htmlparser-developer] toPlainTextString() feedback requested

From: Somik R. <so...@ya...> - 2002-12-24 02:20:37

Hi Folks,
  We are performing a series of refactorings, which
will involve merging of the functionality of
toPlainTextString() and toHTML(HTMLRenderer), and
perhaps, toHTML() itself..

  We'd like to know user stories of how you've been
using toPlainTextString().

Do you enumerate through the nodes and :
[1] collect all the text data using
toPlainTextString() into a buffer? 

[2] make calls to toPlainTextString() but don't
collect the data ? 

[3] collect all text data using toPlainTextString()
into a buffer, but only after having processed the
string in some way ? 

Another question - if we remove toPlainTextString()
[and provide you an alternative], will it hurt, or
would you prefer a deprecation step (in 1.3) before
removal (in 1.4) ?

It would be good for us to know more details about
your applications, so that we can perform our design
modifications correctly. Feel free to tell us your
stories in detail. (Remember to click Reply All so the
replies go to both lists).

Regards,
Somik
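
[Editor's sketch] The deprecation step asked about above (deprecate in 1.3, remove in 1.4) could look roughly like this in Java. The class TextHolder and the replacement method name collectText() are hypothetical and chosen only for illustration; nothing here is taken from the parser's source.

```java
// Hypothetical node class; names are illustrative, not the parser's API.
class TextHolder {
    private final String text;

    TextHolder(String text) { this.text = text; }

    /** Replacement entry point (hypothetical name). */
    String collectText() { return text; }

    /**
     * Old entry point, kept for one release so callers can migrate.
     *
     * @deprecated use {@link #collectText()} instead.
     */
    @Deprecated
    String toPlainTextString() { return collectText(); }
}
```

Callers of the old method keep working but get a compiler warning, which gives them one release cycle to switch before the method is removed.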


Re: [Htmlparser-developer] charset

From: Somik R. <so...@ya...> - 2002-12-24 02:05:00

Hi Derrick,
  Great going!
  
> 1) Why aren't the standard scanners registered by
> the HTMLParser 
> constructor? It would seem that proper parsing of
> HTML would require 
> these always be registered. Unless there's a set for
> HTML 3.2 and one 
> for HTML 4.1 or something. Some of the tests rely on
> this 'empty scanner 
> list' behaviour to return the expected number of
> nodes.

That is because we allow folks to custom build their own
parser. e.g. if you are only interested in text
content, you wouldn't wish to register any of the
scanners at all. If you are interested only in links,
you would register only the link scanner. Passing control
to a scanner implies extra processing time
(sometimes twice that of the no-scanner scenario).
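
[Editor's sketch] The opt-in registration described above can be pictured with a small Java example. TagScanner, MiniParser and LinkScanner are hypothetical names invented for this sketch and do not mirror the real HTMLParser classes; the point is only that a tag is handed to a scanner when one has been registered for it, and otherwise comes back as a generic tag with no extra work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical scanner contract: claims a tag name and builds a specialised result.
interface TagScanner {
    boolean accepts(String tagName);
    String scan(String rawTag);
}

final class MiniParser {
    private final List<TagScanner> scanners = new ArrayList<>();

    // Nothing is registered by default; callers opt in to the scanners they need.
    void addScanner(TagScanner scanner) { scanners.add(scanner); }

    /** Classifies one raw tag such as "<A HREF=x>"; the first matching scanner wins. */
    String classify(String rawTag) {
        String name = tagName(rawTag);
        for (TagScanner scanner : scanners) {
            if (scanner.accepts(name)) return scanner.scan(rawTag);
        }
        return "generic:" + name; // no scanner registered: no extra processing
    }

    /** Extracts the upper-cased tag name from a raw tag. */
    static String tagName(String rawTag) {
        String body = rawTag.replaceAll("^<|>$", "").trim();
        int space = body.indexOf(' ');
        return (space < 0 ? body : body.substring(0, space)).toUpperCase(Locale.ROOT);
    }
}

// A scanner that only cares about anchor tags.
final class LinkScanner implements TagScanner {
    @Override public boolean accepts(String tagName) { return "A".equals(tagName); }
    @Override public String scan(String rawTag) { return "link:" + rawTag; }
}
```

With no scanners registered every tag is generic; registering only LinkScanner specialises anchors and leaves everything else untouched.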
 
> 2) Mark and reset are still supported somewhat. If
> you mark before 
> calling elements() and reset() after exhausting the
> input stream, all is 
> well. But mark() in the midst of parsing is useless
> and dangerous 
> because of the BufferedReader. Perhaps the
> documentation should reflect 
> this, or a way to do it properly worked out.
> 
Hmm.. I thought I'd added this doc in. If it's not
enough, feel free to modify.

> Aside notes:
> 
> Nobody's checking that the contents of the document
> are "text/html", 
> perhaps this check should be added.
> 
> These changes still leave the 'Reader' constructors
> as a separate group 
> which is used by a majority of the test cases. The
> 'new testing 
> framework' should try to use the same code path as
> most of the  
> real-world use cases.  Perhaps, test cases could be
> loaded from 
> htmlparser.sourceforge.net like
> HTMLParserTest.testURLWithSpaces() does.
> 
That would cause a serious performance hit while
running 284 tests - I'd simply stop testing if it took
too long to run the tests..

Regards,
Somik


Re: [Htmlparser-developer] charset

From: Somik R. <so...@ya...> - 2002-12-24 02:00:26

Hi Derrick,
--- Derrick Oswald <Der...@ro...> wrote:
> g'day,
> 
> I've dropped code to handle charset parameters, but
> this raised a couple 
> of issues. First an explanation of the changes.
> 
> HTMLParser constructors were added that take a
> URLConnection, so that 
> the input stream could be reacquired as needed.
> For consistency the 'from file' handling is
> identical to a URL now.
> 
> The HTTP header is now examined for the charset
> parameter and the input 
> stream is converted to a reader with that encoding.
> However, a lot of sites lie. They don't specify the
> charset in the HTTP 
> header, even though the HTML is encoded with a
> non-standard encoding.
> So, we have to pre-read the header portion (meta
> tags) and restart if 
> the charset in the meta tags is different than in
> the HTTP header.
> 
> Ya, ya, pre-reading is fraught with peril, and
> that's what I'm on about 
> here...
> 
> 1) Why aren't the standard scanners registered by
> the HTMLParser 
> constructor? It would seem that proper parsing of
> HTML would require 
> these always be registered. Unless there's a set for
> HTML 3.2 and one 
> for HTML 4.1 or something. Some of the tests rely on
> this 'empty scanner 
> list' behaviour to return the expected number of
> nodes.
> 
> 2) Mark and reset are still supported somewhat. If
> you mark before 
> calling elements() and reset() after exhausting the
> input stream, all is 
> well. But mark() in the midst of parsing is useless
> and dangerous 
> because of the BufferedReader. Perhaps the
> documentation should reflect 
> this, or a way to do it properly worked out.
> 
> Aside notes:
> 
> Nobody's checking that the contents of the document
> are "text/html", 
> perhaps this check should be added.
> 
> These changes still leave the 'Reader' constructors
> as a separate group 
> which is used by a majority of the test cases. The
> 'new testing 
> framework' should try to use the same code path as
> most of the  
> real-world use cases.  Perhaps, test cases could be
> loaded from 
> htmlparser.sourceforge.net like
> HTMLParserTest.testURLWithSpaces() does.
> 
> Derrick
> 
> 
> 
>

[Htmlparser-developer] charset

From: Derrick O. <Der...@ro...> - 2002-12-23 20:47:40

g'day,

I've dropped code to handle charset parameters, but this raised a couple 
of issues. First an explanation of the changes.

HTMLParser constructors were added that take a URLConnection, so that 
the input stream could be reacquired as needed.
For consistency the 'from file' handling is identical to a URL now.

The HTTP header is now examined for the charset parameter and the input 
stream is converted to a reader with that encoding.
However, a lot of sites lie. They don't specify the charset in the HTTP 
header, even though the HTML is encoded with a non-standard encoding.
So, we have to pre-read the header portion (meta tags) and restart if 
the charset in the meta tags is different than in the HTTP header.
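
[Editor's sketch] The two-step charset decision described above (trust the HTTP header first, restart if the meta tags disagree) might be sketched in Java as follows. CharsetSniffer and its method names are hypothetical, and the ISO-8859-1 fallback is an assumption of this sketch rather than something stated in the message.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class CharsetSniffer {
    // Assumed fallback when no usable charset parameter is present.
    static final String DEFAULT = "ISO-8859-1";

    /** Extracts the charset parameter from a Content-Type value, or returns DEFAULT. */
    static String fromContentType(String contentType) {
        if (contentType == null) return DEFAULT;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    if (Charset.isSupported(name)) return name;
                } catch (IllegalArgumentException e) {
                    // Illegal or empty charset name: fall through to the default.
                }
            }
        }
        return DEFAULT;
    }

    /** True when the meta-tag charset names a different encoding than the header charset. */
    static boolean mustRestart(String headerCharset, String metaCharset) {
        return !Charset.forName(headerCharset).equals(Charset.forName(metaCharset));
    }
}
```

The same fromContentType helper can be applied to the content attribute of a meta http-equiv tag, since it has the same "type; charset=name" shape; comparing Charset objects rather than strings avoids a needless restart when the two sources differ only in case.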

Ya, ya, pre-reading is fraught with peril, and that's what I'm on about 
here...

1) Why aren't the standard scanners registered by the HTMLParser 
constructor? It would seem that proper parsing of HTML would require 
these always be registered. Unless there's a set for HTML 3.2 and one 
for HTML 4.1 or something. Some of the tests rely on this 'empty scanner 
list' behaviour to return the expected number of nodes.

2) Mark and reset are still supported somewhat. If you mark before 
calling elements() and reset() after exhausting the input stream, all is 
well. But mark() in the midst of parsing is useless and dangerous 
because of the BufferedReader. Perhaps the documentation should reflect 
this, or a way to do it properly worked out.
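
[Editor's sketch] The safe pattern in point 2 (mark before reading anything, reset only after the input is exhausted) can be shown with plain java.io classes. MarkResetDemo is a hypothetical helper written for this note; it does not involve the parser itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class MarkResetDemo {
    /** Reads the whole input twice: mark at the start, drain, reset, drain again. */
    static String[] readTwice(String html) {
        try {
            BufferedReader reader = new BufferedReader(new StringReader(html));
            // The read-ahead limit must cover everything read before reset().
            reader.mark(html.length() + 1);
            String first = drain(reader);
            reader.reset();
            String second = drain(reader);
            return new String[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads characters until end of stream.
    private static String drain(BufferedReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) sb.append((char) c);
        return sb.toString();
    }
}
```

The read-ahead limit passed to mark() has to be at least as large as the amount read before reset(); for a stream of unknown length that bound is not available up front, which is one reason marking in the midst of parsing is hard to support.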

Aside notes:

Nobody's checking that the contents of the document are "text/html", 
perhaps this check should be added.

These changes still leave the 'Reader' constructors as a separate group 
which is used by a majority of the test cases. The 'new testing 
framework' should try to use the same code path as most of the  
real-world use cases.  Perhaps, test cases could be loaded from 
htmlparser.sourceforge.net like HTMLParserTest.testURLWithSpaces() does.

Derrick

Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )

From: Sam J. <ga...@yh...> - 2002-12-23 06:07:54

Hi Somik,

Somik Raha wrote:

>By the way... I'd written earlier about this - what is your opinion on using
>Bayesian networks to have a rule-based learning system, that gets better
>over time ? i.e. right now the tag identification mechanism is linear-
>there is only so far that can go. But with the sort of dirty html we get,
>the system has to be self-learning. I am thinking of an approach where we'd
>try to eliminate a lot of the hard-coded rules, with a learning network. Of
>course, we'd have our tests to verify that we haven't broken anything, and
>from there, it should only get better. It would be great to have your
>insight on this.
>
I think I don't quite understand enough about htmlparser to see how 
Bayesian networks would be applicable.  I have only recently worked out 
how your scanners work, or rather, that you have scanners for different 
types of tags and can then avoid processing those tags that you are not 
interested in.  You say above that your tag identification mechanism is 
linear, but linear with respect to what?

Can you give me an example of the hard coded rules you are using now, 
and a couple of examples of dirty html pages that cause them to be 
sub-optimal.

Using learning in a system to increase efficiency is usually very 
difficult to do well.  Learning systems basically have more flexibility 
than other systems, but as a consequence you have more free parameters. 
 It is easy to add a learning framework but then spend all your time 
just trying to adjust the system parameters, and then to discover that 
exploring the space of possible parameters for your learner is just too 
expensive.

Nonetheless I am always fascinated by the problem of adding learning to a 
system, precisely because it is so difficult to do well.  If you can 
give me some concrete examples, I will do my best to help you select an 
appropriate learning mechanism.

>>p.s. I'm impressed by the frequency with which you are releasing
>>htmlParser, and your process of having multiple candidates etc.  I
>>struggle to release often as the release process itself still seems a
>>little cumbersome (sourceforge has got better) ....  have you any tips
>>for streamlining it ....?  I guess what I really need is some ant methods
>>like
>>
>>ant release-bug-fix version
>>ant create-new-version-release
>>ant create-new-candidate-release
>>
>>which handle all the necessary communication with sourceforge,
>>uploading, packaging and handling of release numbers ....
>
>Ha ha! I am not sure if you'll believe this, but I was inspired to structure
>the htmlparser project based on the neurogrid project-  you had ant scripts
>long before we did. 
>
Interesting.  The ant scripts for neurogrid were originally made by Rick 
Knowles, and I'm still only just getting a really good feel for ant.  I 
remember trying to set up something in my ng scripts that would add the 
date to the jar file name, like you sometimes do, and failing.  My own 
fault really; I rarely read the manual and always try to learn by 
modifying the operation of an existing system (kind of an evolutionary 
approach ....)

>Of course, ant scripts are so important to do the job
>automatically - but I like keeping things simple -in the sense, there is no
>separate bug-fix version, but the next integration release (Candidate).
>
>I am not yet a fan of branches - they're ok if they don't live more than two
>weeks (I've been thinking real hard about it for a while). I'm planning to
>get the production release out this week - so we can all move on to 1.3
>(instead of having two versions - we'll live with 1.3 integration releases).
>I'd hate to make the same bug fixes twice.
>
ok, but how much do you use the ant tasks to collect together all the 
things required for a release, or do you access sourceforge yourself each 
time? Given that the code is ready to go, how long does it take you to 
do a release?  5 minutes, 30?  I'm imagining an ant task that would 
require one command and then you'd leave it to run ....  Really I will 
have to sit down and spend some time looking at your ant scripts again 
to work that out, but I thought it might be interesting to hear from you 
about whether the release process feels cumbersome or not...

CHEERS> SAM

[Htmlparser-developer] HTMLParser 1.2 (Production Release) is out

From: Somik R. <so...@ya...> - 2002-12-22 03:24:26

Hi Folks,
    Finally, after 8 months of hard work, we have the next production
release of the parser. 1.2 has tons of bug fixes and features.

    The change log difference between 1.2 and 1.1 is too big to be listed in
this mail - check the change log when you are downloading (it's also in the
download package). Documentation has been considerably improved (the Sample
programs would be the place to start). There's a section on the patterns in
action as well. You can modify the rendering process for links and images,
as well as provide collecting parameters to pick up nodes that you wish
(currently images and links supported).

    Below is the change log (as compared to last week's integration release)
:
Production Release 1.2
----------------------
[1] Rewrote HTMLLinkProcessor.extract() so URL class does all the heavy
lifting
[2] Partially fixed bug 654746 - HTMLLinkScanner error, code review needed
[3] Rendering bug fixed - allowing uniform rendering for links and images
[4] Fixed bug 655917, made HTMLParameterParser.parseParameters() thread-safe
[5] Refactored HTMLFormTag (introduced POST and GET static members)
[6] Bug fixed in HTMLFormTag.getInputTag() (NullPointerException when input
tag has no name)
[7] Added ability to get textarea tag from HTMLFormTag.
[8] Added search capability in HTMLFormTag
[9] Fixed bug 655627 - JSP tags with < sign (for loops) were not being
parsed correctly
[10] Fixed bug 655603 - JSP tags within src of script not recognized
correctly when using
single apostrophes
[11] Fixed bug 655580 - JSP tags within title tags not recognized correctly
[12] Fixed bug 655599 - Erroneous end-of-line characters were being added in
string nodes
[13] Fixed bug 656870 - HTMLFormScanner goes into infinite loop if a
previous link has not been closed

    Thanks to Derrick Oswald and Dhaval Udani for their work on the last few
releases. Thanks to Joe Robins for pointing out an important bug in
HTMLFormScanner. A special mention for Dhaval - all his bug reports come
with testcases making it really easy for us to reproduce the bug and fix
them.

Regards,
Somik

[Htmlparser-developer] version 1.2

From: Derrick O. <Der...@ro...> - 2002-12-22 03:10:18

Wow, Somik's been busy.
He wiped the slate of bugs clean and has officially released version 1.2:

https://sourceforge.net/project/showfiles.php?group_id=24399&release_id=129477

Let's give him a round of applause.

Now roll up your sleeves and make a glorious 1.3!

Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )

From: Somik R. <so...@ya...> - 2002-12-21 08:47:08

Hi Sam

> It seems that really we want something like:
>
> 1.  Specify use case, and data-in/data-out
> 2.  Automatically generate test code and shell implementation code
> 3.  Run test code to fail
> 4.  fill in implementation details ...etc
>

I like your idea. You can check this week's release - you'll find searching
support for form tags which allow you to pick up input tags, textarea tags,
select tags, ...

By the way... I'd written earlier about this - what is your opinion on using
Bayesian networks to have a rule-based learning system, that gets better
over time ? i.e. right now the tag identification mechanism is linear-
there is only so far that can go. But with the sort of dirty html we get,
the system has to be self-learning. I am thinking of an approach where we'd
try to eliminate a lot of the hard-coded rules, with a learning network. Of
course, we'd have our tests to verify that we haven't broken anything, and
from there, it should only get better. It would be great to have your
insight on this.

> p.s. I'm impressed by the frequency with which you are releasing
> htmlParser, and your process of having multiple candidates etc.  I
> struggle to release often as the release process itself still seems a
> little cumbersome (sourceforge has got better) ....  have you any tips
> for streamlining it ....?  I guess what I really need is some ant methods
> like
>
> ant release-bug-fix version
> ant create-new-version-release
> ant create-new-candidate-release
>
> which handle all the necessary communication with sourceforge,
> uploading, packaging and handling of release numbers ....

Ha ha! I am not sure if you'll believe this, but I was inspired to structure
the htmlparser project based on the neurogrid project-  you had ant scripts
long before we did. Of course, ant scripts are so important to do the job
automatically - but I like keeping things simple -in the sense, there is no
separate bug-fix version, but the next integration release (Candidate).

I am not yet a fan of branches - they're ok if they don't live more than two
weeks (I've been thinking real hard about it for a while). I'm planning to
get the production release out this week - so we can all move on to 1.3
(instead of having two versions - we'll live with 1.3 integration releases).
I'd hate to make the same bug fixes twice.

Regards,
Somik

Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )

From: Sam J. <ga...@yh...> - 2002-12-21 06:39:59

Hi Somik

Somik Raha wrote:

>Hi Sam,
>  This is interesting - because I was doing much the same
>thing last evening. I've added a lot of searching
>methods into HTMLFormTag - as I was using this to do
>test-first development of an XSLT stylesheet. 
>  I was happy with the results - I was actually able
>to develop the stylesheet test-first.
>
Sounds cool.

>>I have started creating a TestCodeGenerator using
>>HtmlParser.  I am 
>>using an HTMLFormScanner to strip out all the form
>>details, and from 
>>this I hope to generate testing and handling java
>>classes ....
>>    
>>
>
>I will probably add some more utility methods into
>HTMLParserTestCase that will make life easier - but
>even in its current form, you might find it useful.
>
I think it will be.  At first I was thinking I wanted lots of additional 
methods, but actually I can predicate behaviour based on the TYPE of the 
input tags.  I'm getting output like this:

   [java] DEBUG [main] (TestCodeGenerator.java:296) - pTagName: FORM
   [java] DEBUG [main] (TestCodeGenerator.java:302) - attr: ACTION, 
value: $FORM_ACTION
   [java] DEBUG [main] (TestCodeGenerator.java:302) - attr: NAME, value: 
$EDIT_FORM
   [java] DEBUG [main] (TestCodeGenerator.java:302) - attr: METHOD, 
value: POST
   [java] DEBUG [main] (TestCodeGenerator.java:306) - x_inputs.size(): 7
   [java] DEBUG [main] (TestCodeGenerator.java:321) - pTagName: INPUT
   [java] DEBUG [main] (TestCodeGenerator.java:327) - attr: VALUE, 
value: add_bookmark
   [java] DEBUG [main] (TestCodeGenerator.java:327) - attr: NAME, value: 
$ACTION
   [java] DEBUG [main] (TestCodeGenerator.java:327) - attr: TYPE, value: 
hidden

and I think I can generate the necessary test and code shells from this.
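
[Editor's sketch] One way to turn attribute dumps like the one above into a test shell is to key the generated comments on each input's TYPE. The Java below is only a guess at the shape of such a generator: TestShellGenerator, the form name and the input names are all hypothetical, and the emitted skeleton assumes a JUnit-style fail() in the generated code.

```java
import java.util.Map;

final class TestShellGenerator {
    /** Builds the source of a test method skeleton from a form's input names and TYPEs. */
    static String generate(String formName, Map<String, String> inputTypes) {
        StringBuilder src = new StringBuilder();
        src.append("public void test").append(formName).append("() {\n");
        for (Map.Entry<String, String> input : inputTypes.entrySet()) {
            // Hidden inputs carry values set by the page; other types need test data.
            String hint = "hidden".equalsIgnoreCase(input.getValue())
                    ? "preset by the page"
                    : "supply a test value";
            src.append("    // ").append(input.getKey())
               .append(" (").append(input.getValue()).append("): ")
               .append(hint).append('\n');
        }
        // The shell fails until someone fills in the real assertions.
        src.append("    fail(\"not implemented\");\n}\n");
        return src.toString();
    }
}
```

Because the generated method fails until it is filled in, it fits the "run test code to fail" step of a test-first workflow.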

>I've documented it here :
>http://htmlparser.sourceforge.net/design/tests.html
>
Interesting.  I'm becoming more and more committed to a test-first 
philosophy myself, i.e.

1.  write test
2. run it and watch it fail to confirm test will fail when 
implementation is not present
3. write the implementation
4. try and get implementation to pass test
5. repeat as appropriate

Although actually my post was more about automatic test creation rather 
than testing itself.  I recently found this:

http://www.junitdoclet.org/

which automatically creates test shells out of code.  And I found myself 
wanting the equivalent thing for httpUnit.  Either way round I think if 
people are going to do testing they need support to make it easier to 
create the tests.  And if they are working on the tests first they need 
support creating the implementation.

It seems that really we want something like:

1.  Specify use case, and data-in/data-out
2.  Automatically generate test code and shell implementation code
3.  Run test code to fail
4.  fill in implementation details ...etc

Anyways.....

CHEERS> SAM

p.s. I'm impressed by the frequency with which you are releasing 
htmlParser, and your process of having multiple candidates etc.  I 
struggle to release often as the release process itself still seems a 
little cumbersome (sourceforge has got better) ....  have you any tips 
for streamlining it ....?  I guess what I really need is some ant methods like

ant release-bug-fix version
ant create-new-version-release
ant create-new-candidate-release

which handle all the necessary communication with sourceforge, 
uploading, packaging and handling of release numbers ....

14 messages has been excluded from this view by a project administrator.
