htmlparser-developer Mailing List for HTML Parser (Page 15)


Hi Derrick,
    I've made a few code changes (removed nextHTMLNode) and updated the
contributors page.
    Just to let you know that you need to update the docs directory.

Regards,
Somik
StringBean handles the "<LI>" tag quite well. I am
still studying the code because it is quite different from
the previous StringExtractor.

I found two pages for which it gives wrong results:
1. It doesn't output the form content of
http://www.cnnfn.com/best/bpretire/bpretire_form.html

2. When processing the following page it throws
java.lang.OutOfMemoryError.

http://www.cnnfn.com/commentary/

Thanks

Ling Ma


Done. Sorry I didn't do it earlier.

Regards,
Somik
----- Original Message -----
From: "Derrick Oswald" <Der...@ro...>
To: <htm...@li...>
Sent: Friday, April 25, 2003 3:56 PM
Subject: Re: [Htmlparser-developer] future directions

> Somik,
>
> For the release this weekend, which Johan Naudts is keen to get, you'll
> need to change my Project Role to Project Manager in the Members section
> on the Admin page, so I can access the File Release System.
>
> I'll try the WikiGrabber tonight.
>
> Derrick
>
> Somik Raha wrote:
>
> >Hi Derrick,
> >    Sorry for not replying earlier, I have been out of town, and the last
> >few days have been really busy.
> >
> >
> >
> >>I didn't mean to imply that the existing testcases be discarded. No,
> >>they are excellent.
> >>But do they have good code coverage? Do they represent a significant
> >>portion of valid HTML constructs? Can they be said to probe the invalid
> >>HTML space in a meaningful way? The only way to do this is via
> >>automation. Perhaps the fit framework can be used as the front end for
> >>it, but nobody is going to hand-craft thousands of tests.
> >>
> >>
> >
> >I agree with you. The FIT framework can reduce the coding burden if it ends
> >up as a frontend for most of our acceptance tests.
> >
> >I need a break, yes, and I do think that you can lead the project far beyond
> >my ideas. I am not attached to the work already done, and in fact, don't
> >want to leave you with some of the poor design that I did a long time back.
> >So, I will focus on cleaning up the core areas, which will be useful in the
> >long run.
> >
> >It will be great if you try making the releases from this week- so we can
> >make a smooth transition. Before running the script, I usually run the
> >WikiGrabber (first delete the docs directory) which brings in the latest
> >docs from our Wiki.
> >
> >Regards,
> >Somik
> >
> >
> >
> >
> >
>
>
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

I've uploaded an interesting (well, interesting to those who really need 
to get out more) report on test coverage:
    http://htmlparser.sourceforge.net/HTMLParser_Coverage.html

Although large (nearly 2MB), the report shows line by line test case 
code coverage when running org.htmlparser.tests.AllTests.
It indicates that the tests cover 85% of the executable lines in the 
source tree (excluding the org.htmlparser.tests packages).
I think this is very good, since a lot of the unexecuted code lies 
within catch blocks and convenience methods.

Derrick

Somik,

For the release this weekend, which Johan Naudts is keen to get, you'll 
need to change my Project Role to Project Manager in the Members section 
on the Admin page, so I can access the File Release System.

I'll try the WikiGrabber tonight.

Derrick

Somik Raha wrote:

>Hi Derrick,
>    Sorry for not replying earlier, I have been out of town, and the last
>few days have been really busy.
>
>  
>
>>I didn't mean to imply that the existing testcases be discarded. No,
>>they are excellent.
>>But do they have good code coverage? Do they represent a significant
>>portion of valid HTML constructs? Can they be said to probe the invalid
>>HTML space in a meaningful way? The only way to do this is via
>>automation. Perhaps the fit framework can be used as the front end for
>>it, but nobody is going to hand-craft thousands of tests.
>>    
>>
>
>I agree with you. The FIT framework can reduce the coding burden if it ends
>up as a frontend for most of our acceptance tests.
>
>I need a break, yes, and I do think that you can lead the project far beyond
>my ideas. I am not attached to the work already done, and in fact, don't
>want to leave you with some of the poor design that I did a long time back.
>So, I will focus on cleaning up the core areas, which will be useful in the
>long run.
>
>It will be great if you try making the releases from this week- so we can
>make a smooth transition. Before running the script, I usually run the
>WikiGrabber (first delete the docs directory) which brings in the latest
>docs from our Wiki.
>
>Regards,
>Somik
>
>
>
>  
>

Hi Derrick,
    Sorry for not replying earlier, I have been out of town, and the last
few days have been really busy.

> I didn't mean to imply that the existing testcases be discarded. No,
> they are excellent.
> But do they have good code coverage? Do they represent a significant
> portion of valid HTML constructs? Can they be said to probe the invalid
> HTML space in a meaningful way? The only way to do this is via
> automation. Perhaps the fit framework can be used as the front end for
> it, but nobody is going to hand-craft thousands of tests.

I agree with you. The FIT framework can reduce the coding burden if it ends
up as a frontend for most of our acceptance tests.

I need a break, yes, and I do think that you can lead the project far beyond
my ideas. I am not attached to the work already done, and in fact, don't
want to leave you with some of the poor design that I did a long time back.
So, I will focus on cleaning up the core areas, which will be useful in the
long run.

It will be great if you try making the releases from this week- so we can
make a smooth transition. Before running the script, I usually run the
WikiGrabber (first delete the docs directory) which brings in the latest
docs from our Wiki.

Regards,
Somik

Hi,

I want to parse a LABEL tag and replace the data between the start and end 
tags. I am able to obtain the data using the getLabel() method. However, to 
replace it, there is no corresponding setLabel() method, so I used the 
setText() method of the Tag class. After that I printed my tag using the toHtml() 
method, but I got back the previous text, not the text I had set. 

On closer inspection of the Tag class, I realized that the newly set value is not 
considered by toHtml() at all, and there are no test cases for 
setText() in TagTest.

I have logged this as a bug (#726913) and attached a test condition as well.

The following is also required:
1. toHtml() must reflect the changed text.
2. LabelTag should have a setLabel() method corresponding to getLabel(), which 
internally calls setText() of the Tag class.
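
A minimal sketch of the two requested changes, for illustration only. SketchTag and SketchLabelTag are hypothetical stand-ins, not the real org.htmlparser Tag and LabelTag classes, and the sketch assumes the fix is simply to render from the current text on every toHtml() call:

```java
class LabelSketchDemo {
    public static void main(String[] args) {
        SketchLabelTag tag = new SketchLabelTag("Old label");
        tag.setLabel("New label");
        // The output should carry the replaced text, not the parsed one.
        System.out.println(tag.toHtml());
    }
}

// Hypothetical stand-in for a tag with text between start and end tags.
class SketchTag {
    private final String name;
    private String text;

    SketchTag(String name, String text) {
        this.name = name;
        this.text = text;
    }

    String getText() {
        return text;
    }

    void setText(String text) {
        this.text = text;
    }

    // Renders from the current field value, so setText() is always reflected.
    String toHtml() {
        return "<" + name + ">" + text + "</" + name + ">";
    }
}

// Hypothetical stand-in for LabelTag.
class SketchLabelTag extends SketchTag {
    SketchLabelTag(String label) {
        super("LABEL", label);
    }

    String getLabel() {
        return getText();
    }

    // Counterpart to getLabel(); delegates to setText() as proposed above.
    void setLabel(String label) {
        setText(label);
    }
}
```

A regression test along these lines in TagTest would call the setter and then compare toHtml() against the new text.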

Dhaval

-----Original Message-----
From: Udani, Dhaval H. 
Sent: Thursday, April 17, 2003 4:15 PM
To: htmlparser-developer
Cc: Udani, Dhaval H.
Subject: [Htmlparser-developer] Label Tag

Hi,

Is there a setLabel() method in LabelTag to go with the 
getLabel() method? I need functionality to change the label of a LABEL tag. 

How would I achieve this?

Dhaval


Hi,

A few things to point out:

1. FormScanner contains an unused variable, textAreaVector.
2. FormTag contains a NodeList instance variable for INPUT tags and one for 
TEXTAREA tags, but none for SELECT tags. One needs to be added.

Dhaval

-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Sunday, April 20, 2003 8:36 AM
To: htmlparser-announce; htmlparser-developer; htmlparser-user
Cc: somik
Subject: [Htmlparser-developer] Integration Release 1.3-20030420 is out

Hi Folks,
    This week's release is out. From the change log:

Integration Build 1.3 - 20030420
--------------------------------
[1] Fixed bug #722046 StringExtractor.extractStrings misses most of the text,
change to use a StringBean to dig into tables.
[2] add checking in Translate to eliminate bug #722835
StringIndexOutOfBoundsException exception
[3] added line-break condition in assertXmlEquals
[4] added fit testing framework
[5] added parent association for each node
[6] added digupStringNode() and findPositionOf(Node) to CompositeTag
[7] Fixed bug 723835 in LinkExtractor

We have some powerful searching capability with this release.
From any node, you can find the parent composite tag, and navigate thru the
entire html structure. This is useful in scenarios like :

Search for data that lies close to a certain piece of text.

e.g. ... <table>
                <tr>
                    <td>
                        <b>Name:</b><i>John Doe</i>
                    </td>
                </tr>
            </table>

We can extract John Doe, by using our knowledge of its expected position.
If we assume that the contents are inside a table tag, here's what a program
could look like:

parser.registerScanners();
Node[] nodes = parser.extractAllNodesThatAre(TableTag.class);
// Let's assume our data is in the second table
TableTag table = (TableTag) nodes[1];

// Find the position of Name.
StringNode[] stringNodes = table.digupStringNode("Name");

// We assume that the first node that matched is the one we want. We
// navigate to its parent. The cast is there because findPositionOf()
// was added to CompositeTag (change log item [6]).
CompositeTag parentOfName = (CompositeTag) stringNodes[0].getParent();

// From the parent, we shall find out the position of "Name"
int posOfName = parentOfName.findPositionOf(stringNodes[0]);

// It's easy now to navigate to John Doe, as we know it is 3 positions away
Node expectedName = parentOfName.childAt(posOfName + 3);

This can be useful for writing tests for your pages or extracting position
based info - new possibilities open up for semantic searches.

Regards,
Somik


Yeah, documentation in the change log would have been nice. I would at least 
have changed my code before rerunning it. Somik, I agree with you regarding its 
need, and I am sorry about sounding irritated in the last mail, but it's kind of 
frustrating, you'd agree.

-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Thursday, April 17, 2003 11:45 PM
To: htmlparser-developer
Cc: somik
Subject: Re: [Htmlparser-developer] Form Scanner

Hi Dhaval,
   There's a reason for this - the scanners that are
connected to sub-scanners should demonstrate the
coupling. So, when you do addScanner(new
FormScanner(parser)) - the input scanner, select scanner,
etc. are also added automatically. The same thing
happens in the TableScanner (for the row and column
scanners).

   This addition cannot be done if the parser is not
passed in. 

   I agree with you though about backward
compatibility. We should have at least documented it
in the change log, or kept the dummy c'tor in there.

Regards,
Somik
--- dha...@or... wrote:
> Hi,
> 
> The latest version of the Parser is not backward compatible with the one
> released previously.
>
> FormScanner does not have a no-args constructor any longer. This is causing my
> code to break, and I have to go back to the code once again and fix it. Can we
> at least maintain some kind of backward compatibility here? Code not working
> between versions released within a week of each other is really nerve-racking.
> 
> Regards,
> 
> Dhaval Udani
> Senior Analyst
> M-Line, QPEG
> OrbiTech Solutions Ltd.
> +91-22-28290019 Extn. 1457
> 
> 
> 
> 

Somik,

I guess I'm flattered that you think highly of my contributions, but I 
stand on the shoulders of giants.  You're probably feeling burnt out 
after two years and want to get your life back.  I suggest this.  If no 
one else comes forward, grant me the Project Manager role and forget 
about it for the summer (if you can).  I'll try to be the "Somik" by 
doing the builds, answering email, performing triage and fixing bugs, 
contributing to discussions, and so on, but it's unlikely that the 
quality will be as good. I'm not naive enough to think I'm the one with 
the vision to move it forward.  If it's a rest you want, I can only 
promise a short respite.  We'll both consider it again in September, and 
if you still feel burnt out, you can drop your Project Manager role 
altogether, in favour of me or somebody else.  Or, if I haven't 
completely misread you, you can pick it up again after your sabbatical.

p.s.
I didn't mean to imply that the existing testcases be discarded. No, 
they are excellent.
But do they have good code coverage? Do they represent a significant 
portion of valid HTML constructs? Can they be said to probe the invalid 
HTML space in a meaningful way? The only way to do this is via 
automation. Perhaps the fit framework can be used as the front end for 
it, but nobody is going to hand-craft thousands of tests.

Derrick

Somik Raha wrote:

><snip>
>
>
>The testcases in the parser are anything but ad-hoc. I have specifically
>avoided going down the path of the DTD. Lots of parsers exist which follow
>the DTD to the letter and fail miserably on real-world html. It is not
>difficult to construct a "correct" htmlparser using a grammar and javacc.
>What differentiates us is that we have a massive testcase database from
>real-world html which took almost 3 years to accumulate. It is this alone
>that gives us the confidence to totally change the design at any given time
>and still know that we're doing it right. This is what allowed me to
>redesign the CompositeTagScanner, and not break anything new. We are at a
>position where we can go ahead and redesign whatever we wish.
>
>With this release, I have tried to incorporate the Fit framework
>(http://fit.c2.com) which makes writing tests a breeze. Now, to write new
>tests, you will only use a html editor like Netscape Composer or M$ Word and
>you can run them! I have made some tests for the AttributeParser (and I
>already found a bug).
>
>I think we can think of making release 1.3 very soon - and probably decide
>to seal it from now. My recommended to-do list for 1.3 is :
>[1] Redesign StringParser
>[2] Redesign AttributeParser
>[3] Redesign NodeReader
>
>These three look particularly scary right now.
>
>On to project management issues, I am planning to step down from the role of
>the project lead. Your contribution has been substantial and has added so
>much value to the parser. You also have a solid vision to take this project
>to v1.4 - I really like your ideas about semantic capabilities and have been
>thinking on the same lines. I hope you will volunteer for this
>responsibility. There is not much extra work involved - the build scripts
>have been there for a while - all you have to do is make releases whenever
>you feel (one week is usually good). I will of course be there to help in
>whatever way I can.
>
>Regards,
>Somik
>
>
>  
>

At 08:13 20.4.2003 -0700, you wrote:
>Hi Kaarle,
>     Happy Easter!

I put this in an HTML page and checked it with IE 6.0. It shows a picture, but
the balloon with the text shows

great
pic

i.e. with newline. It does not strip the new line.

Kaarle

> > As I run the JUnit I get only error in test
> > org.htmlparser.tests.parserHelperTests.AttributeParserTest
> >
> > The results dir contains AttributeParser.html
> >
> > with a source like this:
> > ============
> > img src="somepic.jpg" alt="great
> > pic" width="10" height=20
> > =============
> > With a new line in the middle of an attribute value. I don't think that's
> > allowed in html??
>
>I thought that the parser should be able to figure out when to strip the new
>line.. You could always test this in IE and see if it is picked up (I think
>it is).
>
> > I guess you have somewhere some instructions on fit that I should know to
> > look at this test?
>
>fit.c2.com.
>To make life easy, I've added a class that allows you to run fit tests
>inside junit - check the resources package - fit.jar, expand it to go to
>runner.FitRunner. And simply run that as a testcase. It will run the tests
>in the specs directory and put the results in the results directory. It can
>flag errors at the file level after which you have to open the results to
>get detailed info.
>
>Regards,
>Somik
>
>
>

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844 

Hi Kaarle,
    Happy Easter!

> As I run the JUnit I get only error in test
> org.htmlparser.tests.parserHelperTests.AttributeParserTest
>
> The results dir contains AttributeParser.html
>
> with a source like this:
> ============
> img src="somepic.jpg" alt="great
> pic" width="10" height=20
> =============
> With a new line in the middle of an attribute value. I don't think that's
> allowed in html??

I thought that the parser should be able to figure out when to strip the new
line.. You could always test this in IE and see if it is picked up (I think
it is).
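
For reference, HTML 4.01 (section 6.2, on CDATA attribute values) says user agents should ignore line feeds and replace each carriage return or tab with a single space, so a line break inside an attribute value is not in itself invalid. A minimal sketch of that rule follows. AttributeValueSketch and normalize are hypothetical names, not part of htmlparser, and this is not a claim about what AttributeParser does or what any particular browser does:

```java
class AttributeValueSketch {
    // Applies the HTML 4.01 CDATA rule: drop line feeds, and turn each
    // carriage return or tab into a single space.
    static String normalize(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\n') {
                continue;
            }
            if (c == '\r' || c == '\t') {
                out.append(' ');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The alt value from the failing test, with a bare line feed in it.
        System.out.println(normalize("great\npic"));
    }
}
```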

> I guess you have somewhere some instructions on fit that I should know to
> look at this test?

fit.c2.com.
To make life easy, I've added a class that allows you to run fit tests
inside junit - check the resources package - fit.jar, expand it to go to
runner.FitRunner. And simply run that as a testcase. It will run the tests
in the specs directory and put the results in the results directory. It can
flag errors at the file level after which you have to open the results to
get detailed info.

Regards,
Somik

At 20:07 19.4.2003 -0700, you wrote:
>Hi Kaarle,
>     That is actually a different bug - take a look at the test again (look
>at results.html).

Happy Easter to you all!

When I run the JUnit tests I get only one error, in the test
org.htmlparser.tests.parserHelperTests.AttributeParserTest

The results dir contains AttributeParser.html

with a source like this:
============
img src="somepic.jpg" alt="great
pic" width="10" height=20
=============
With a new line in the middle of an attribute value. I don't think that's 
allowed in html??

I guess you have somewhere some instructions on fit that I should know to 
look at this test?

regards
Kaarle

>Regards,
>Somik
>----- Original Message -----
>From: "Kaarle Kaila" <kaa...@ik...>
>To: <htm...@li...>
>Sent: Saturday, April 19, 2003 1:56 PM
>Subject: [Htmlparser-developer] bug [ 722917 ] AttributeParser not parsing
>correctly
>
>
> > I have been assigned for the [ 722917 ] AttributeParser not parsing correctly
> > error in htmlparser. I have answered my opinion already a few times that
> > why should
> > htmlparser need to be able to parse uncompiled asp or jsp pages.
> >
> > If not I would like to know on what kind of html-page can this text be found?
> >
> > <a href=\"<%=Application(\"sURL\")%>/literature/index.htm>
> >
> > Can you answer this to me before I begin to look into this?
> >
> > Kaarle
> >
> > ---------------------------------------------
> > Kaarle Kaila
> > http://www.iki.fi/kaila
> > mailto:kaa...@ik...
> > tel: +358 50 3725844

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844 

Hi Derrick,
    I am glad you brought this up.

> The htmlparser project is quite successful, with many, many users (over
> 15,000 downloads) and now 17 developers.

We actually ended up in pos 18 when I last checked. All of the credit goes
to all the developers (and a significant part to you) and the user community
who have given us feedback..

    I agree with most of your ideas. I shall therefore only write about
where I don't agree.

<Testing>
>
> The existing ad hoc test cases and examples of dirty HTML are good for
> development, but perhaps a more thorough suite of tests needs to be
> created for real world use.
>
> Based on the strict HTML document type definition at
> http://www.w3.org/TR/html4/strict.dtd, a test case generator could be
> constructed that would read the dtd, create a tree of possible HTML
> syntax, and then based on that correct HTML and a few dozen permutation
> possibilities (i.e. insert string content, dropping a node, injecting a
> bogus node, injecting a comment, altering a node by a character,
> truncating input, starting midway through a construct etc.) throw tens
> of thousands of tests at the parser and record the test HTML and any
> unexpected failure mode, i.e. "<head><body>Hello world</bod" - failed to
> replace </bod with </body>.
>
 </Testing>

The testcases in the parser are anything but ad-hoc. I have specifically
avoided going down the path of the DTD. Lots of parsers exist which follow
the DTD to the letter and fail miserably on real-world html. It is not
difficult to construct a "correct" htmlparser using a grammar and javacc.
What differentiates us is that we have a massive testcase database from
real-world html which took almost 3 years to accumulate. It is this alone
that gives us the confidence to totally change the design at any given time
and still know that we're doing it right. This is what allowed me to
redesign the CompositeTagScanner, and not break anything new. We are at a
position where we can go ahead and redesign whatever we wish.

With this release, I have tried to incorporate the Fit framework
(http://fit.c2.com) which makes writing tests a breeze. Now, to write new
tests, you will only use a html editor like Netscape Composer or M$ Word and
you can run them! I have made some tests for the AttributeParser (and I
already found a bug).

I think we can think of making release 1.3 very soon - and probably decide
to seal it from now. My recommended to-do list for 1.3 is :
[1] Redesign StringParser
[2] Redesign AttributeParser
[3] Redesign NodeReader

These three look particularly scary right now.

On to project management issues, I am planning to step down from the role of
the project lead. Your contribution has been substantial and has added so
much value to the parser. You also have a solid vision to take this project
to v1.4 - I really like your ideas about semantic capabilities and have been
thinking on the same lines. I hope you will volunteer for this
responsibility. There is not much extra work involved - the build scripts
have been there for a while - all you have to do is make releases whenever
you feel (one week is usually good). I will of course be there to help in
whatever way I can.

Regards,
Somik

Hi Kaarle,
    That is actually a different bug - take a look at the test again (look
at results.html).

Regards,
Somik
----- Original Message -----
From: "Kaarle Kaila" <kaa...@ik...>
To: <htm...@li...>
Sent: Saturday, April 19, 2003 1:56 PM
Subject: [Htmlparser-developer] bug [ 722917 ] AttributeParser not parsing
correctly

> I have been assigned for the [ 722917 ] AttributeParser not parsing correctly
> error in htmlparser. I have answered my opinion already a few times that
> why should
> htmlparser need to be able to parse uncompiled asp or jsp pages.
>
> If not I would like to know on what kind of html-page can this text be found?
> <a href=\"<%=Application(\"sURL\")%>/literature/index.htm>
>
> Can you answer this to me before I begin to look into this?
>
> Kaarle
>
> ---------------------------------------------
> Kaarle Kaila
> http://www.iki.fi/kaila
> mailto:kaa...@ik...
> tel: +358 50 3725844

Hi Folks,
    This week's release is out. From the change log:

Integration Build 1.3 - 20030420
--------------------------------
[1] Fixed bug #722046 StringExtractor.extractStrings misses most of the text,
change to use a StringBean to dig into tables.
[2] add checking in Translate to eliminate bug #722835
StringIndexOutOfBoundsException exception
[3] added line-break condition in assertXmlEquals
[4] added fit testing framework
[5] added parent association for each node
[6] added digupStringNode() and findPositionOf(Node) to CompositeTag
[7] Fixed bug 723835 in LinkExtractor

We have some powerful searching capability with this release.
From any node, you can find the parent composite tag, and navigate thru the
entire html structure. This is useful in scenarios like :

Search for data that lies close to a certain piece of text.

e.g. ... <table>
                <tr>
                    <td>
                        <b>Name:</b><i>John Doe</i>
                    </td>
                </tr>
            </table>

We can extract John Doe, by using our knowledge of its expected position.
If we assume that the contents are inside a table tag, here's what a program
could look like:

parser.registerScanners();
Node[] nodes = parser.extractAllNodesThatAre(TableTag.class);
// Let's assume our data is in the second table
TableTag table = (TableTag) nodes[1];

// Find the position of Name.
StringNode[] stringNodes = table.digupStringNode("Name");

// We assume that the first node that matched is the one we want. We
// navigate to its parent. The cast is there because findPositionOf()
// was added to CompositeTag (change log item [6]).
CompositeTag parentOfName = (CompositeTag) stringNodes[0].getParent();

// From the parent, we shall find out the position of "Name"
int posOfName = parentOfName.findPositionOf(stringNodes[0]);

// It's easy now to navigate to John Doe, as we know it is 3 positions away
Node expectedName = parentOfName.childAt(posOfName + 3);

This can be useful for writing tests for your pages or extracting position
based info - new possibilities open up for semantic searches.
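
The same idea can be tried without the parser in a self-contained sketch. It assumes, as the "3 positions away" comment implies, that the parent keeps start tags, text and end tags in one flat child list. The list of strings and the childNear helper are hypothetical stand-ins for real nodes and for findPositionOf()/childAt():

```java
import java.util.Arrays;
import java.util.List;

class PositionSketch {
    // Returns the child that sits `offset` positions after the first child
    // containing `marker`, or null if there is no such child.
    static String childNear(List<String> children, String marker, int offset) {
        for (int i = 0; i < children.size(); i++) {
            if (children.get(i).contains(marker)) {
                int target = i + offset;
                boolean inRange = target >= 0 && target < children.size();
                return inRange ? children.get(target) : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Flat children of the <td> in the example table.
        List<String> td = Arrays.asList(
                "<b>", "Name:", "</b>", "<i>", "John Doe", "</i>");
        System.out.println(childNear(td, "Name", 3));
    }
}
```

The offset is brittle by design: if the page layout changes, the lookup returns the wrong child or null, which is what makes it usable as a page test.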

Regards,
Somik

I have been assigned bug [ 722917 ] "AttributeParser not parsing correctly"
in htmlparser. I have already asked a few times why htmlparser should need
to be able to parse uncompiled ASP or JSP pages.

If not, I would like to know on what kind of HTML page this text can be found:

<a href=\"<%=Application(\"sURL\")%>/literature/index.htm>

Can you answer this to me before I begin to look into this?

Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844 

The htmlparser project is quite successful, with many, many users (over 
15,000 downloads) and now 17 developers.
It's time to consider where it goes from here. Here are some thoughts.

<Restructuring>

To initiate discussion, I propose a restructuring. One model of language 
processing identifies three levels so I'll briefly define these in terms 
htmlparser people can relate to:

lexical level
- low level, identifies character encoding, whitespace, tokens, lines
- currently implemented in Parser, NodeReader and Tag

syntactic level
- mid level, identifies tags, nesting, attributes, text
- this is the raison d'être of the htmlparser, currently implemented in 
scanners, especially TagScanner, AttributeParser, 
CompositeTagScannerHelper, StringParser and TagParser

semantic level
- high level, identifies meaning
- currently not implemented, but this is where all the 'action' is, 
regarding page layout, script, JSP, tables and other actual content

So far, htmlparser has not been in the semantic business, but I would 
think that the best parsing can be done when higher level knowledge is 
brought to bear. This borders on AI and applies when missing end tags 
and malformed tags or attributes are encountered.  It seems logical to 
create three packages to contain the three levels and extract the 
corresponding functionality out of the many places it currently is and 
put it into the appropriate centralized location.

Here are some broad specifics to use as a basis for design documents:
move all things character and stream related to a new lexer package
  - would return a contiguous stream of 'Token' objects that contain 
only absolute character offsets
  - would answer questions about line number and be able to extract 
portions of the stream
  - would implement 'cursors' to save and restore machine state
  - operates at the character level
repackage the parserHelper package to be what it really is, the syntax 
package
   - would return a contiguous stream of Tag objects
   - would implement 'bookmarks' to save and restore machine state
   - operates at the string level
create a semantics package that contains the high level knowledge about HTML
   - this level would return a contiguous stream of (possibly nested) nodes
   - operates at the document type definition level
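
To make the lexer bullet points concrete, here is a rough sketch, not a design. LexerSketch, Token, saveCursor and restoreCursor are hypothetical names, and the tokenizing rule (a "<...>" run or a run of text) is a deliberate simplification. Tokens carry only absolute offsets, the lexer answers line-number questions, and a cursor can be saved and restored:

```java
class LexerSketch {
    // A token holds only absolute character offsets [start, end) into the page.
    static final class Token {
        final int start;
        final int end;

        Token(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    private final String page;
    private int cursor = 0;

    LexerSketch(String page) {
        this.page = page;
    }

    // 'Cursors' to save and restore machine state.
    int saveCursor() {
        return cursor;
    }

    void restoreCursor(int saved) {
        cursor = saved;
    }

    // Returns the next token, contiguous with the previous one, or null at end.
    Token next() {
        if (cursor >= page.length()) {
            return null;
        }
        int start = cursor;
        if (page.charAt(start) == '<') {
            int close = page.indexOf('>', start);
            cursor = close < 0 ? page.length() : close + 1;
        } else {
            int open = page.indexOf('<', start);
            cursor = open < 0 ? page.length() : open;
        }
        return new Token(start, cursor);
    }

    // Extracts the portion of the stream a token covers.
    String text(Token token) {
        return page.substring(token.start, token.end);
    }

    // 1-based line number of an absolute offset.
    int lineOf(int offset) {
        int line = 1;
        for (int i = 0; i < offset && i < page.length(); i++) {
            if (page.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    public static void main(String[] args) {
        LexerSketch lexer = new LexerSketch("<b>Name:</b>\n<i>John Doe</i>");
        for (Token t = lexer.next(); t != null; t = lexer.next()) {
            System.out.println(t.start + "-" + t.end
                    + " line " + lexer.lineOf(t.start)
                    + ": " + lexer.text(t).trim());
        }
    }
}
```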

</Restructuring>

<Error Handling>

Dirty HTML and the nature of parsing means error handling is integral to 
operation, so a review of error handling is in order with an eye to a 
comprehensive policy, the tenets of which might include:

1) The level (class) that discovers the error makes its best effort 
to fix it. This is already mostly in place. It just needs systematic 
thought and robust utility methods to handle pathological cases, e.g. 
end of file, no expected token, quotes in odd places, etc. Exceptions 
should be handled locally as much as possible. Anything not generated by 
parser code, i.e. not a ParserException, is handled gracefully by the 
code closest to the throw. This applies to IOException and other 
checked exceptions, but also to runtime exceptions and errors such as 
ArrayIndexOutOfBoundsException, OutOfMemoryError, 
IllegalArgumentException, etc.

2) When an error can't be fixed, assume that the wrong higher level 
(semantic or syntactic) assumptions have been made and back up to a 
point where an alternate interpretation is possible. This means being 
able to restore the state of the machine to a prior 'known good state'.

3) Errors should be couched in absolute character location terms so 
intelligent choices can be made about syntactic trees. If a supervisor 
routine is used, it can explore tree depths and stream consumption. The 
rule might be "choose the interpretation that extracts the most nodes" 
or "choose the interpretation that advances furthest into the file 
before discovering an error".
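
Tenets 2 and 3 can be sketched together: every attempt restarts from the 
same saved 'known good' offset, and a supervisor keeps the attempt that 
advances furthest, breaking ties by node count. All names here (Attempt, 
Interpretation, TagWalker, Supervisor) are hypothetical and only 
illustrate the idea; the "strict" and "lenient" interpretations are 
stand-ins for real alternate parses.

```java
import java.util.List;

/** Outcome of one attempted interpretation, in absolute character offsets. */
final class Attempt {
    final String name;
    final int nodes;     // nodes extracted before stopping
    final int stoppedAt; // offset where parsing ended or found an error

    Attempt(String name, int nodes, int stoppedAt) {
        this.name = name;
        this.nodes = nodes;
        this.stoppedAt = stoppedAt;
    }
}

/** One way of reading the input, starting at a given offset. */
interface Interpretation {
    Attempt attempt(String input, int from);
}

/** Naive walker: strict stops at an unterminated tag, lenient treats it as text. */
final class TagWalker implements Interpretation {
    private final boolean lenient;

    TagWalker(boolean lenient) {
        this.lenient = lenient;
    }

    public Attempt attempt(String input, int from) {
        int pos = from;
        int nodes = 0;
        while (pos < input.length()) {
            if (input.charAt(pos) == '<') {
                int close = input.indexOf('>', pos);
                if (close < 0) {
                    if (!lenient) {
                        break;            // error discovered at pos
                    }
                    pos = input.length(); // lenient: the remainder is text
                } else {
                    pos = close + 1;
                }
            } else {
                int open = input.indexOf('<', pos);
                pos = (open < 0) ? input.length() : open;
            }
            nodes++;
        }
        return new Attempt(lenient ? "lenient" : "strict", nodes, pos);
    }
}

final class Supervisor {
    /** Tries each option from the saved offset and keeps the furthest advance. */
    static Attempt choose(String input, int knownGood,
                          List<? extends Interpretation> options) {
        Attempt best = null;
        for (Interpretation option : options) {
            Attempt a = option.attempt(input, knownGood); // restart from saved state
            boolean further = best == null || a.stoppedAt > best.stoppedAt;
            boolean tieMoreNodes = best != null
                    && a.stoppedAt == best.stoppedAt && a.nodes > best.nodes;
            if (further || tieMoreNodes) {
                best = a;
            }
        }
        return best;
    }
}
```

Swapping the comparison in choose() is all it takes to try the other 
rule, "choose the interpretation that extracts the most nodes".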

</Error Handling>

<Testing>

The existing ad hoc test cases and examples of dirty HTML are good for 
development, but perhaps a more thorough suite of tests needs to be 
created for real world use.

Based on the strict HTML document type definition at 
http://www.w3.org/TR/html4/strict.dtd, a test case generator could be 
constructed that would read the dtd, create a tree of possible HTML 
syntax, and then based on that correct HTML and a few dozen permutation 
possibilities (e.g. inserting string content, dropping a node, injecting a 
bogus node, injecting a comment, altering a node by a character, 
truncating input, starting midway through a construct, etc.) throw tens 
of thousands of tests at the parser and record the test HTML and any 
unexpected failure mode, e.g. "<head><body>Hello world</bod" - failed to 
replace </bod with </body>.
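
As a rough illustration of the permutation step only, the sketch below 
takes an already-correct sequence of HTML pieces and produces variants by 
dropping a node, injecting a bogus node, injecting a comment, and 
truncating the input. It does not read the DTD, and HtmlMutator is a 
hypothetical name, not part of the parser.

```java
import java.util.ArrayList;
import java.util.List;

final class HtmlMutator {

    /** Returns mutated variants of a known-good sequence of HTML pieces. */
    static List<String> mutate(List<String> pieces) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < pieces.size(); i++) {
            String piece = pieces.get(i);
            out.add(join(pieces, i, null));                 // drop node i
            out.add(join(pieces, i, "<bogus>" + piece));    // inject a bogus node
            out.add(join(pieces, i, "<!-- x -->" + piece)); // inject a comment
        }
        String whole = join(pieces, -1, null);
        for (int cut = 1; cut < whole.length(); cut++) {
            out.add(whole.substring(0, cut));               // truncate the input
        }
        return out;
    }

    /** Concatenates pieces, substituting the replacement at index (null drops it). */
    private static String join(List<String> pieces, int index, String replacement) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pieces.size(); i++) {
            if (i != index) {
                sb.append(pieces.get(i));
            } else if (replacement != null) {
                sb.append(replacement);
            }
        }
        return sb.toString();
    }
}
```

Feeding it the pieces "<head>", "<body>", "Hello world", "</body>" yields, 
among others, the truncated case quoted above. Each variant would then be 
handed to the parser and recorded along with any unexpected failure.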

</Testing>

Hi Dhaval,
   There's a reason for this - the scanners that are
connected to sub-scanners should demonstrate the
coupling. So, when you do addScanner(new
FormScanner()) - the input scanner, select scanner,
etc. are also added automatically. The same thing
happens in the TableScanner (for the row and column
scanners).

   This addition cannot be done if the parser is not
passed in. 

   I agree with you though about backward
compatibility. We should have at least documented it
in the change log, or kept the dummy c'tor in there.
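
For illustration only, here is roughly what I mean by both the coupling
and a retained dummy c'tor. Every name below (SketchParser,
SketchFormScanner, etc.) is a hypothetical stand-in and not the actual
parser code or its signatures.

```java
import java.util.ArrayList;
import java.util.List;

/** Anything that can be registered with the parser. */
interface SketchScanner {
    String name();
}

/** Hypothetical stand-in for the parser's scanner registry. */
final class SketchParser {
    final List<String> scannerNames = new ArrayList<>();

    void addScanner(SketchScanner scanner) {
        scannerNames.add(scanner.name());
    }
}

final class SketchInputScanner implements SketchScanner {
    public String name() {
        return "INPUT";
    }
}

final class SketchSelectScanner implements SketchScanner {
    public String name() {
        return "SELECT";
    }
}

final class SketchFormScanner implements SketchScanner {

    /** Kept so that existing callers still compile; registers no sub-scanners. */
    @Deprecated
    SketchFormScanner() {
    }

    /** Registers the sub-scanners a form depends on with the given parser. */
    SketchFormScanner(SketchParser parser) {
        parser.addScanner(new SketchInputScanner());
        parser.addScanner(new SketchSelectScanner());
    }

    public String name() {
        return "FORM";
    }
}
```

With the parser passed in, one addScanner call registers INPUT, SELECT
and FORM; with the dummy c'tor only FORM is registered, so old code
keeps compiling but has to add the sub-scanners itself.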

Regards,
Somik
--- dha...@or... wrote:
> Hi,
> 
> The latest version of the Parser is not backward compatible with the one
> released previously.
> 
> FormScanner does not have a no-args constructor any longer. This is
> causing my code to break; I have to go back to my code once again and
> fix it. Can we at least maintain some kind of backward compatibility
> here? Code not working between versions released within a week of each
> other is really nerve-racking.
> 
> Regards,
> 
> Dhaval Udani
> Senior Analyst
> M-Line, QPEG
> OrbiTech Solutions Ltd.
> +91-22-28290019 Extn. 1457

Hi Somik,

I was checking out the code of the form scanner and I saw that it contains
a list of all the INPUT tags and all the TEXTAREA tags. We also need to
add the list of SELECT tags here.

Thanks for catching this. Done.
[Udani, Dhaval H.] 
I don't see this in the latest code yet.

Dhaval

Hi,

The latest version of the Parser is not backward compatible with the one 
released previously. 

FormScanner does not have a no-args constructor any longer. This is causing my 
code to break; I have to go back to my code once again and fix it. Can we at 
least maintain some kind of backward compatibility here? Code not working 
between versions released within a week of each other is really nerve-racking.

Regards,

Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457

Hi,

Is there a setLabel method in LabelTag to go with the getLabel method? I need 
to be able to change the label of a Label tag. 

How would I achieve this?

Dhaval

wow!
htmlparser is in the top 25 most active projects this week:
    https://sourceforge.net/top/mostactive.php?type=week
at position 24 with a 99.780% percentile
(which I think means it's more active than 99.78% of the projects on 
sourceforge).

Keep up the good work!
Kill them bugs!

Some things that need to be addressed:

1. LabelScanner does not have a no-args constructor like the other
scanners. Needs to be added.
2. SelectTag should use a NodeList of OptionTags rather than a List to
keep it in sync with the rest of the code.

I'll do both of these tomorrow and send them to Somik.

Somik,
I hope you can include both of these in the next release.

   -----Original Message-----
   From: somik [mailto:so...@ya...]
   Sent: Monday, April 14, 2003 5:38 AM
   To: htmlparser-announce; htmlparser-developer; htmlparser-user
   Cc: somik
   Subject: [Htmlparser-developer] Integration Release 1.3-20030413 is out

   Hi Folks,
       This week's release contains:

   Integration Build 1.3 - 20030413
   --------------------------------
   [1] reimplement StringBean as NodeVisitor, testStringBeanListener now
       succeeds
   [2] Implemented feature request 702541 (Tags created by
       CompositeTagScanner now have startLine and endLine information in
       their TagData)
   [3] Modified ScriptScanner to allow for subclassing and fixed minor bug
   [4] Re-architected CompositeTagScanner
   [5] Fixed Tag scanning bugs, OOM exceptions connected to #4

   Thanks to Derrick Oswald for his work on the StringBean
   and Marc Novakowski for his work on adding line number support.
   In this release, the CompositeTagScanner has been totally redesigned
   - using principles of Evolutionary Design. The new code is
   dramatically simpler and easier to understand. I tried ED on
   assertXmlEquals last week and had the same results. The Out of Memory
   bugs have been fixed - it will be good to have some feedback.

   For those who might be interested in ED, before taking each step, I
   have taken a snapshot and put it in CVS. Many thanks to Josh
   Kerievsky for teaching me how to do ED.

   Cheers,
   Somik


Hi Folks,
    This week's release contains:

Integration Build 1.3 - 20030413
--------------------------------
[1] reimplement StringBean as NodeVisitor, testStringBeanListener now
    succeeds
[2] Implemented feature request 702541 (Tags created by
    CompositeTagScanner now have startLine and endLine information in
    their TagData)
[3] Modified ScriptScanner to allow for subclassing and fixed minor bug
[4] Re-architected CompositeTagScanner
[5] Fixed Tag scanning bugs, OOM exceptions connected to #4

Thanks to Derrick Oswald for his work on the StringBean and Marc
Novakowski for his work on adding line number support.
In this release, the CompositeTagScanner has been totally redesigned -
using principles of Evolutionary Design. The new code is dramatically
simpler and easier to understand. I tried ED on assertXmlEquals last
week and had the same results. The Out of Memory bugs have been fixed -
it will be good to have some feedback.

For those who might be interested in ED, before taking each step, I have
taken a snapshot and put it in CVS. Many thanks to Josh Kerievsky for
teaching me how to do ED.

Cheers,
Somik


htmlparser-developer — The developer mailing list of the htmlparser project