htmlparser-developer Mailing List for HTML Parser (Page 15)
Brought to you by:
derrickoswald
|
From: Somik R. <so...@ya...> - 2003-04-27 19:16:55
|
Hi Derrick,
I've made a few code changes (removed nextHTMLNode) and updated the contributors page.
Just to let you know that you need to update the docs directory.
Regards,
Somik |
|
From: Mr L. MA <law...@ya...> - 2003-04-27 03:50:00
|
StringBean handles the "<LI>" tag quite all right. I am still studying the code because it is quite different from the previous StringExtractor. I found two pages for which it gives the wrong result:
1. It doesn't output the form content of http://www.cnnfn.com/best/bpretire/bpretire_form.html
2. When processing the following page it throws java.lang.OutOfMemoryError: http://www.cnnfn.com/commentary/
Thanks
Ling Ma |
|
From: Somik R. <so...@ya...> - 2003-04-26 05:25:53
|
Done. Sorry I didn't do it earlier.
Regards,
Somik
----- Original Message -----
From: "Derrick Oswald" <Der...@ro...>
To: <htm...@li...>
Sent: Friday, April 25, 2003 3:56 PM
Subject: Re: [Htmlparser-developer] future directions
> Somik,
>
> For the release this weekend, which Johan Naudts is keen to get, you'll
> need to change my Project Role to Project Manager in the Members section
> on the Admin page, so I can access the File Release System.
>
> I'll try the WikiGrabber tonight.
>
> Derrick
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer |
|
From: Derrick O. <Der...@ro...> - 2003-04-26 00:16:26
|
I've uploaded an interesting (well, interesting to those who really need
to get out more) report on test coverage:
http://htmlparser.sourceforge.net/HTMLParser_Coverage.html
Although large (nearly 2MB), the report shows line by line test case
code coverage when running org.htmlparser.tests.AllTests.
It indicates that the tests cover 85% of the executable lines in the
source tree (excluding the org.htmlparser.tests packages).
I think this is very good, since a lot of the unexecuted code lies
within catch blocks and convenience methods.
Derrick
|
|
From: Derrick O. <Der...@ro...> - 2003-04-25 22:44:30
|
Somik,
For the release this weekend, which Johan Naudts is keen to get, you'll need to change my Project Role to Project Manager in the Members section on the Admin page, so I can access the File Release System.
I'll try the WikiGrabber tonight.
Derrick |
|
From: Somik R. <so...@ya...> - 2003-04-24 15:08:12
|
Hi Derrick,
Sorry for not replying earlier, I have been out of town, and the last
few days have been really busy.
> I didn't mean to imply that the existing testcases be discarded. No,
> they are excellent.
> But do they have good code coverage? Do they represent a significant
> portion of valid HTML constructs? Can they be said to probe the invalid
> HTML space in a meaningful way? The only way to do this is via
> automation. Perhaps the fit framework can be used as the front end for
> it, but nobody is going to hand-craft thousands of tests.
I agree with you. The FIT framework can reduce the coding burden if it ends
up as a frontend for most of our acceptance tests.
I need a break, yes, and I do think that you can lead the project far beyond
my ideas. I am not attached to the work already done, and in fact, don't
want to leave you with some of the poor design that I did a long time back.
So, I will focus on cleaning up the core areas, which will be useful in the
long run.
It will be great if you try making the releases from this week- so we can
make a smooth transition. Before running the script, I usually run the
WikiGrabber (first delete the docs directory) which brings in the latest
docs from our Wiki.
Regards,
Somik
|
|
From: <dha...@or...> - 2003-04-24 14:43:50
|
Hi,
I want to parse a LABEL tag and replace the data between the start and end tags. I am able to obtain the data using the getLabel() method. However, there is no corresponding setLabel() method for replacing it, so I used the setText() method of the Tag class. I then printed my tag using the toHtml() method, but I received the previous text, not the text I had set.
On closer inspection of the Tag class, I realized that the newly set value is not considered in toHtml(), and there are no test cases for setText() in TagTest. I have logged this as a bug (#726913) and attached a test condition as well.
The following is required:
1. toHtml() must reflect the changed text.
2. LabelTag should have a setLabel() method to match getLabel(), which internally calls setText() of the Tag class.
Dhaval
-----Original Message-----
From: Udani, Dhaval H.
Sent: Thursday, April 17, 2003 4:15 PM
To: htmlparser-developer
Cc: Udani, Dhaval H.
Subject: [Htmlparser-developer] Label Tag
Hi,
Is there a setLabel method in LabelTag to go with the getLabel method? I need functionality to change the label of a Label tag. How would I achieve this?
Dhaval |
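The two requirements in this report can be illustrated with a minimal sketch. It does not use the real org.htmlparser classes: SketchLabelTag and LabelTagSketch are hypothetical stand-ins, and the only assumption carried over from the message is the intended behavior (setLabel() delegates to setText(), and toHtml() renders the current text).

```java
import java.util.Objects;

final class LabelTagSketch {
    /** Hypothetical stand-in for a label tag; NOT the real org.htmlparser LabelTag. */
    static final class SketchLabelTag {
        private final String startTag; // e.g. "<LABEL>"
        private final String endTag;   // e.g. "</LABEL>"
        private String text;           // content between the start and end tags

        SketchLabelTag(String startTag, String text, String endTag) {
            this.startTag = startTag;
            this.text = text;
            this.endTag = endTag;
        }

        /** Replaces the content between the start and end tags. */
        void setText(String newText) {
            this.text = Objects.requireNonNull(newText);
        }

        String getLabel() {
            return text;
        }

        /** Setter matching getLabel(); it simply delegates to setText(). */
        void setLabel(String label) {
            setText(label);
        }

        /** Renders from the current field values, so a changed text is reflected. */
        String toHtml() {
            return startTag + text + endTag;
        }
    }

    public static void main(String[] args) {
        SketchLabelTag tag = new SketchLabelTag("<LABEL>", "Old", "</LABEL>");
        tag.setLabel("New");
        System.out.println(tag.toHtml()); // prints <LABEL>New</LABEL>
    }
}
```

A regression test for the reported bug would follow the same shape: set the text, call toHtml(), and compare against the expected markup.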
|
From: <dha...@or...> - 2003-04-21 10:18:15
|
Hi,
A few things to point out:
1. FormScanner contains an unused variable textAreaVector.
2. FormTag contains a NodeList instance variable for INPUT tags and TEXTAREA
tags. However none for SELECT tags. It needs to be added.
Dhaval
-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Sunday, April 20, 2003 8:36 AM
To: htmlparser-announce; htmlparser-developer; htmlparser-user
Cc: somik
Subject: [Htmlparser-developer] Integration Release 1.3-20030420 is out
|
|
From: <dha...@or...> - 2003-04-21 07:32:50
|
Yeah, documentation in the change log would have been nice. I would at least have changed my code before rerunning it. Somik, I agree with you regarding its need, and I am sorry about sounding irritated in the last mail, but it's kind of frustrating, you'd agree.
-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Thursday, April 17, 2003 11:45 PM
To: htmlparser-developer
Cc: somik
Subject: Re: [Htmlparser-developer] Form Scanner
Hi Dhaval,
There's a reason for this - the scanners that are connected to sub-scanners should demonstrate the coupling. So, when you do addScanner(new FormScanner()) - the input scanner, select scanner, etc. are also added automatically. The same thing happens in the TableScanner (for row and column scanner). This addition cannot be done if the parser is not passed in.
I agree with you though about backward compatibility. We should have at least documented it in the change log, or kept the dummy c'tor in there.
Regards,
Somik
--- dha...@or... wrote:
> Hi,
>
> The latest version of the Parser is not backward compatible with the one
> released previously.
>
> FormScanner does not have a no-args constructor any longer. This is causing
> my code to break; I have to once again go back to the code and fix it. Can we
> at least maintain some kind of backward compatibility here? Code not working
> between versions within a week's difference is really nerve-racking.
>
> Regards,
>
> Dhaval Udani
> Senior Analyst
> M-Line, QPEG
> OrbiTech Solutions Ltd.
> +91-22-28290019 Extn. 1457 |
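The "keep the dummy c'tor" option mentioned in this thread can be sketched as follows. This is only an illustration under stated assumptions: SketchParser and SketchFormScanner are hypothetical classes, not the real htmlparser API, and the sub-scanner names are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

final class ConstructorCompatSketch {
    /** Hypothetical parser that records which scanners were registered. */
    static final class SketchParser {
        final List<String> scanners = new ArrayList<>();

        void addScanner(String name) {
            scanners.add(name);
        }
    }

    /** Hypothetical scanner that registers its sub-scanners when given a parser. */
    static final class SketchFormScanner {
        /** New-style constructor: couples the sub-scanners to the parser. */
        SketchFormScanner(SketchParser parser) {
            if (parser != null) {
                parser.addScanner("INPUT");
                parser.addScanner("SELECT");
                parser.addScanner("TEXTAREA");
            }
        }

        /** Old no-arg constructor, kept so that existing callers still compile. */
        @Deprecated
        SketchFormScanner() {
            this(null); // no parser available, so no sub-scanners are registered
        }
    }

    public static void main(String[] args) {
        SketchParser parser = new SketchParser();
        new SketchFormScanner(parser);
        System.out.println(parser.scanners); // prints [INPUT, SELECT, TEXTAREA]
        new SketchFormScanner();             // legacy call still compiles
    }
}
```

The trade-off is that the legacy constructor silently skips the sub-scanner registration, which is why documenting the change in the change log matters either way.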
|
From: Derrick O. <Der...@ro...> - 2003-04-20 19:24:16
|
Somik,
I guess I'm flattered that you think highly of my contributions, but I stand on the shoulders of giants.
You're probably feeling burnt out after two years and want to get your life back.
I suggest this. If no one else comes forward, grant me the Project Manager role and forget about it for the summer (if you can). I'll try to be the "Somik" by doing the builds, answering email, performing triage and fixing bugs, contributing to discussions, and so on, but it's unlikely that the quality will be as good. I'm not naive enough to think I'm the one with the vision to move it forward. If it's a rest you want, I can only promise a short respite.
We'll both consider it again in September, and if you still feel burnt out, you can drop your Project Manager role altogether, in favour of me or somebody else. Or, if I haven't completely misread you, you can pick it up again after your sabbatical.
p.s. I didn't mean to imply that the existing testcases be discarded. No, they are excellent. But do they have good code coverage? Do they represent a significant portion of valid HTML constructs? Can they be said to probe the invalid HTML space in a meaningful way? The only way to do this is via automation. Perhaps the fit framework can be used as the front end for it, but nobody is going to hand-craft thousands of tests.
Derrick |
|
From: Kaarle K. <kaa...@ik...> - 2003-04-20 17:38:40
|
At 08:13 20.4.2003 -0700, you wrote:
>Hi Kaarle,
> Happy Easter!
I put this in an HTML page and checked with IE 6.0. It shows a picture, and the balloon with the text shows "great pic" with the newline intact, i.e. it does not strip the new line.
Kaarle
---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844 |
|
From: Somik R. <so...@ya...> - 2003-04-20 15:12:16
|
Hi Kaarle,
Happy Easter!
> As I run the JUnit I get only error in test
> org.htmlparser.tests.parserHelperTests.AttributeParserTest
>
> The results dir contains AttributeParser.html
>
> with a source like this:
> ============
> img src="somepic.jpg" alt="great
> pic" width="10" height=20
> =============
> With a new line in the middle of an attribute value. I don't think that's
> allowed in html??
I thought that the parser should be able to figure out when to strip the new
line.. You could always test this in IE and see if it is picked up (I think
it is).
> I guess you have somewhere some instructions on fit that I should know to
> look at this test?
fit.c2.com.
To make life easy, I've added a class that allows you to run fit tests
inside junit - check the resources package - fit.jar, expand it to go to
runner.FitRunner. And simply run that as a testcase. It will run the tests
in the specs directory and put the results in the results directory. It can
flag errors at the file level after which you have to open the results to
get detailed info.
Regards,
Somik
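The attribute case discussed here (a line break inside a quoted value) can be reproduced with a small stand-alone sketch. It is not the project's AttributeParser: the regular expression, the class name AttributeNewlineSketch, and the collapseLineBreaks switch are all assumptions made for illustration. The switch exists because the thread leaves open whether the line break should be stripped or kept.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeNewlineSketch {
    // name = "double quoted" | 'single quoted' | unquoted
    // The negated character classes also match line breaks, so a quoted
    // value may span several lines.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([A-Za-z][A-Za-z0-9_-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Extracts name/value pairs; optionally replaces line breaks in values with a space. */
    static Map<String, String> parseAttributes(String tagContents, boolean collapseLineBreaks) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(tagContents);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            if (collapseLineBreaks) {
                value = value.replaceAll("\\s*\\R\\s*", " ");
            }
            attributes.put(m.group(1).toLowerCase(), value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        String tag = "img src=\"somepic.jpg\" alt=\"great\npic\" width=\"10\" height=20";
        // prints {src=somepic.jpg, alt=great pic, width=10, height=20}
        System.out.println(parseAttributes(tag, true));
    }
}
```

Passing false for collapseLineBreaks keeps the value as "great", a line break, then "pic".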
|
|
From: Kaarle K. <kaa...@ik...> - 2003-04-20 06:58:23
|
At 20:07 19.4.2003 -0700, you wrote:
>Hi Kaarle,
> That is actually a different bug - take a look at the test again (look
>at results.html).
Happy Easter to you all!
As I run the JUnit tests, the only error I get is in the test
org.htmlparser.tests.parserHelperTests.AttributeParserTest
The results dir contains AttributeParser.html
with a source like this:
============
img src="somepic.jpg" alt="great
pic" width="10" height=20
=============
With a new line in the middle of an attribute value. I don't think that's allowed in html??
I guess you have somewhere some instructions on fit that I should know to look at this test?
regards
Kaarle
---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844 |
|
From: Somik R. <so...@ya...> - 2003-04-20 03:20:46
|
Hi Derrick,
I am glad you brought this up.
> The htmlparser project is quite successful, with many, many users (over
> 15,000 downloads) and now 17 developers.
We actually ended up in pos 18 when I last checked. All of the credit goes
to all the developers (and a significant part to you) and the user community
who have given us feedback..
I agree with most of your ideas. I shall therefore only write about
where I don't agree.
<Testing>
>
> The existing ad hoc test cases and examples of dirty HTML are good for
> development, but perhaps a more thorough suite of tests needs to be
> created for real world use.
>
> Based on the strict HTML document type definition at
> http://www.w3.org/TR/html4/strict.dtd, a test case generator could be
> constructed that would read the dtd, create a tree of possible HTML
> syntax, and then based on that correct HTML and a few dozen permutation
> possibilities (i.e. insert string content, dropping a node, injecting a
> bogus node, injecting a comment, altering a node by a character,
> truncating input, starting midway through a construct etc.) throw tens
> of thousands of tests at the parser and record the test HTML and any
> unexpected failure mode, i.e. "<head><body>Hello world</bod" - failed to
> replace </bod with </body>.
>
</Testing>
The testcases in the parser are anything but ad-hoc. I have specifically
avoided going down the path of the DTD. Lots of parsers exist which follow
the DTD to the letter and fail miserably on real-world html. It is not
difficult to construct a "correct" htmlparser using a grammar and javacc.
What differentiates us is that we have a massive testcase database from
real-world html which took almost 3 years to accumulate. It is this alone
that gives us the confidence to totally change the design at any given time
and still know that we're doing it right. This is what allowed me to
redesign the CompositeTagScanner, and not break anything new. We are at a
position where we can go ahead and redesign whatever we wish.
With this release, I have tried to incorporate the Fit framework
(http://fit.c2.com) which makes writing tests a breeze. Now, to write new
tests, you will only use a html editor like Netscape Composer or M$ Word and
you can run them! I have made some tests for the AttributeParser (and I
already found a bug).
I think we can think of making release 1.3 very soon - and probably decide
to seal it from now. My recommended to-do list for 1.3 is :
[1] Redesign StringParser
[2] Redesign AttributeParser
[3] Redesign NodeReader
These three look particularly scary right now.
On to project management issues, I am planning to step down from the role of
the project lead. Your contribution has been substantial and has added so
much value to the parser. You also have a solid vision to take this project
to v1.4 - I really like your ideas about semantic capabilities and have been
thinking on the same lines. I hope you will volunteer for this
responsibility. There is not much extra work involved - the build scripts
have been there for a while - all you have to do is make releases whenever
you feel (one week is usually good). I will of course be there to help in
whatever way I can.
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2003-04-20 03:06:02
|
Hi Kaarle,
That is actually a different bug - take a look at the test again (look
at results.html).
Regards,
Somik
----- Original Message -----
From: "Kaarle Kaila" <kaa...@ik...>
To: <htm...@li...>
Sent: Saturday, April 19, 2003 1:56 PM
Subject: [Htmlparser-developer] bug [ 722917 ] AttributeParser not parsing correctly
> I have been assigned for the [ 722917 ] AttributeParser not parsing correctly
> error in htmlparser. I have answered my opinion already a few times that
> why should
> htmlparser need to be able to parse uncompiled asp or jsp pages.
>
> If not I would like to know on what kind of html-page can this text be found?
>
> <a href=\"<%=Application(\"sURL\")%>/literature/index.htm>
>
> Can you answer this to me before I begin to look into this?
>
> Kaarle
>
> ---------------------------------------------
> Kaarle Kaila
> http://www.iki.fi/kaila
> mailto:kaa...@ik...
> tel: +358 50 3725844
>
|
|
From: Somik R. <so...@ya...> - 2003-04-20 03:04:37
|
Hi Folks,
This week's release is out. From the change log:
Integration Build 1.3 - 20030420
--------------------------------
[1] Fixed bug #722046 StringExtractor.extractStrings misses most of the text,
    change to use a StringBean to dig into tables.
[2] add checking in Translate to eliminate bug #722835
    StringIndexOutOfBoundsException exception
[3] added line-break condition in assertXmlEquals
[4] added fit testing framework
[5] added parent association for each node
[6] added digupStringNode() and findPositionOf(Node) to CompositeTag
[7] Fixed bug 723835 in LinkExtractor
We have some powerful searching capability with this release.
From any node, you can find the parent composite tag, and navigate thru the
entire html structure. This is useful in scenarios like :
Search for data that lies close to a certain piece of text.
e.g. ... <table>
<tr>
<td>
<b>Name:</b><i>John Doe</i>
</td>
</tr>
</table>
We can extract John Doe, by using our knowledge of its expected position.
If we assume that the contents are inside a table tag, here's what a program
could look like:
parser.registerScanners();
Node[] nodes = parser.extractAllNodesThatAre(TableTag.class);
// Let's assume our data is in the second table
TableTag table = (TableTag) nodes[1];
// Find the string nodes that contain "Name"
StringNode[] stringNodes = table.digupStringNode("Name");
// We assume that the first node that matched is the one we want, and
// navigate to its parent (findPositionOf() was added to CompositeTag)
CompositeTag parentOfName = (CompositeTag) stringNodes[0].getParent();
// From the parent, find the position of "Name"
int posOfName = parentOfName.findPositionOf(stringNodes[0]);
// It's easy now to navigate to John Doe, as we know it is 3 positions away
Node expectedName = parentOfName.childAt(posOfName + 3);
This can be useful for writing tests for your pages or extracting position
based info - new possibilities open up for semantic searches.
Regards,
Somik
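For readers without the 1.3 integration build at hand, the position-based lookup described above can be tried on a toy tree. This is a sketch under assumptions: SketchNode, SketchComposite and their methods are hypothetical stand-ins that only mimic the names used in the announcement, and tags and text are both modeled as plain leaf nodes.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionNavigationSketch {
    /** Hypothetical leaf node (a tag or a piece of text) with a link to its parent. */
    static class SketchNode {
        final String text;
        SketchComposite parent;

        SketchNode(String text) {
            this.text = text;
        }
    }

    /** Hypothetical composite node that owns an ordered list of children. */
    static final class SketchComposite extends SketchNode {
        private final List<SketchNode> children = new ArrayList<>();

        SketchComposite(String text) {
            super(text);
        }

        SketchComposite add(SketchNode child) {
            child.parent = this; // parent association for each node
            children.add(child);
            return this;
        }

        /** Returns the index of a direct child, or -1 if it is not one. */
        int findPositionOf(SketchNode child) {
            for (int i = 0; i < children.size(); i++) {
                if (children.get(i) == child) {
                    return i;
                }
            }
            return -1;
        }

        SketchNode childAt(int index) {
            return children.get(index);
        }

        /** Depth-first search for leaf nodes whose text contains the search string. */
        List<SketchNode> digUp(String search) {
            List<SketchNode> found = new ArrayList<>();
            for (SketchNode child : children) {
                if (child instanceof SketchComposite) {
                    found.addAll(((SketchComposite) child).digUp(search));
                } else if (child.text.contains(search)) {
                    found.add(child);
                }
            }
            return found;
        }
    }

    /** Builds the table from the announcement: <b>Name:</b><i>John Doe</i> inside one cell. */
    static SketchComposite sampleTable() {
        SketchComposite td = new SketchComposite("td")
            .add(new SketchNode("<b>")).add(new SketchNode("Name:")).add(new SketchNode("</b>"))
            .add(new SketchNode("<i>")).add(new SketchNode("John Doe")).add(new SketchNode("</i>"));
        return new SketchComposite("table").add(new SketchComposite("tr").add(td));
    }

    /** Finds the first node containing the label and returns the sibling 'offset' positions after it. */
    static String valueAfterLabel(SketchComposite root, String label, int offset) {
        SketchNode labelNode = root.digUp(label).get(0);
        SketchComposite parent = labelNode.parent;
        int position = parent.findPositionOf(labelNode);
        return parent.childAt(position + offset).text;
    }

    public static void main(String[] args) {
        System.out.println(valueAfterLabel(sampleTable(), "Name", 3)); // prints John Doe
    }
}
```

The offset of 3 depends on how the tree models the b and i tags; if they were composite nodes holding their own text, the positions would differ.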
|
|
From: Kaarle K. <kaa...@ik...> - 2003-04-19 20:59:21
|
I have been assigned the bug [ 722917 ] "AttributeParser not parsing correctly" in htmlparser. I have already given my opinion a few times: why should htmlparser need to be able to parse uncompiled asp or jsp pages?
If it need not, I would like to know on what kind of HTML page this text can be found:
<a href=\"<%=Application(\"sURL\")%>/literature/index.htm>
Can you answer this for me before I begin to look into this?
Kaarle
---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844 |
|
From: Derrick O. <Der...@ro...> - 2003-04-18 15:49:26
|
The htmlparser project is quite successful, with many, many users (over 15,000 downloads) and now 17 developers. It's time to consider where it goes from here. Here are some thoughts.
<Restructuring>
To initiate discussion, I propose a restructuring. One model of language processing identifies three levels, so I'll briefly define these in terms htmlparser people can relate to:
lexical level
 - low level, identifies character encoding, whitespace, tokens, lines
 - currently implemented in Parser, NodeReader and Tag
syntactic level
 - mid level, identifies tags, nesting, attributes, text
 - this is the raison d'être of the htmlparser, currently implemented in scanners, especially TagScanner, AttributeParser, CompositeTagScannerHelper, StringParser and TagParser
semantic level
 - high level, identifies meaning
 - currently not implemented, but this is where all the 'action' is, regarding page layout, script, JSP, tables and other actual content
So far, htmlparser has not been in the semantic business, but I would think that the best parsing can be done when higher level knowledge is brought to bear. This borders on AI and applies when missing end tags and malformed tags or attributes are encountered.
It seems logical to create three packages to contain the three levels, extract the corresponding functionality from the many places where it currently lives, and put it into the appropriate centralized location.
Here are some broad specifics to use as a basis for design documents:
move all things character and stream related to a new lexer package
 - would return a contiguous stream of 'Token' objects that contain only absolute character offsets
 - would answer questions about line number and be able to extract portions of the stream
 - would implement 'cursors' to save and restore machine state
 - operates at the character level
repackage the parserHelper package to be what it really is, the syntax package
 - would return a contiguous stream of Tag objects
 - would implement 'bookmarks' to save and restore machine state
 - operates at the string level
create a semantics package that contains the high level knowledge about HTML
 - this level would return a contiguous stream of (possibly nested) nodes
 - operates at the document type definition level
</Restructuring>
<Error Handling>
Dirty HTML and the nature of parsing mean that error handling is integral to operation, so a review of error handling is in order with an eye to a comprehensive policy, the tenets of which might include:
1) The level (class) that discovers the error makes its best effort to fix it. This is already mostly in place. It just needs systematic thought and robust utility methods to handle pathological cases, e.g. end of file, no expected token, quotes in odd places, etc. Exceptions should be handled locally as much as possible. Anything not generated by parser code, i.e. not a ParserException, is handled gracefully by the code closest to the throw. This applies to IOExceptions and other explicit exceptions, but also to runtime exceptions and errors such as ArrayIndexOutOfBoundsException, OutOfMemoryError, IllegalArgumentException, etc.
2) For errors that can't be fixed, assume that the wrong higher level (semantic or syntactic) assumptions have been made and back up to a point where an alternate interpretation is possible. This means being able to restore the state of the machine to a prior 'known good state'.
3) Errors should be couched in absolute character location terms so intelligent choices can be made about syntactic trees. If a supervisor routine is used, it can explore tree depths and stream consumption. The rule might be "choose the interpretation that extracts the most nodes" or "choose the interpretation that advances furthest into the file before discovering an error".
</Error Handling>
<Testing>
The existing ad hoc test cases and examples of dirty HTML are good for development, but perhaps a more thorough suite of tests needs to be created for real world use. Based on the strict HTML document type definition at http://www.w3.org/TR/html4/strict.dtd, a test case generator could be constructed that would read the DTD and create a tree of possible HTML syntax. Then, starting from that correct HTML and a few dozen permutation possibilities (e.g. inserting string content, dropping a node, injecting a bogus node, injecting a comment, altering a node by a character, truncating input, starting midway through a construct, etc.), it would throw tens of thousands of tests at the parser and record the test HTML and any unexpected failure mode, e.g. "<head><body>Hello world</bod" - failed to replace </bod with </body>.
</Testing> |
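To make the lexer-level proposal above more concrete, here is a minimal sketch of what "tokens that contain only absolute character offsets" and "cursors to save and restore machine state" might look like. Everything in it (the class LexerSketch, its Token type, and the methods cursor, restore, next, text and lineOf) is hypothetical and written for illustration; none of it is existing htmlparser code, and the tokenizing rule (anything between '<' and '>' is a tag, everything else is text) is a deliberate simplification.

```java
/** Hypothetical sketch of the proposed lexer level; not part of htmlparser. */
class LexerSketch {
    /** A token stores only absolute character offsets into the page: [start, end). */
    static final class Token {
        final int start;
        final int end;
        final boolean tag; // true for "<...>", false for the text between tags

        Token(int start, int end, boolean tag) {
            this.start = start;
            this.end = end;
            this.tag = tag;
        }
    }

    private final String page;
    private int position;

    LexerSketch(String page) {
        this.page = page;
    }

    /** A 'cursor' is a saved machine state; in this sketch the state is just an offset. */
    int cursor() {
        return position;
    }

    /** Restores a previously saved state so an alternate interpretation can be tried. */
    void restore(int cursor) {
        position = cursor;
    }

    /** Returns the next token, or null at end of input. */
    Token next() {
        if (position >= page.length()) {
            return null;
        }
        int start = position;
        if (page.charAt(start) == '<') {
            int close = page.indexOf('>', start);
            // Best effort: an unterminated tag simply runs to the end of the input.
            position = (close < 0) ? page.length() : close + 1;
            return new Token(start, position, true);
        }
        int open = page.indexOf('<', start);
        position = (open < 0) ? page.length() : open;
        return new Token(start, position, false);
    }

    /** Extracts the portion of the stream covered by a token. */
    String text(Token token) {
        return page.substring(token.start, token.end);
    }

    /** Answers the line-number question for an absolute offset (1-based). */
    int lineOf(int offset) {
        int line = 1;
        for (int i = 0; i < offset && i < page.length(); i++) {
            if (page.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }
}
```

A caller would save cursor() before attempting an interpretation and call restore() with that value if the attempt fails, which is one way to get back to the 'known good state' described under error handling. Because tokens carry absolute offsets, errors can be reported in absolute character terms and converted to line numbers only when needed.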
|
From: Somik R. <so...@ya...> - 2003-04-17 18:15:15
|
Hi Dhaval,
There's a reason for this - the scanners that are connected to sub-scanners should demonstrate the coupling. So, when you do addScanner(new FormScanner()) - the input scanner, select scanner, etc. are also added automatically. The same thing happens in the TableScanner (for the row and column scanners). This addition cannot be done if the parser is not passed in.
I agree with you though about backward compatibility. We should have at least documented it in the change log, or kept the dummy c'tor in there.
Regards,
Somik
--- dha...@or... wrote:
> Hi,
>
> The latest version of the Parser is not backward compatible with the one
> released previously.
>
> FormScanner does not have a no-args constructor any longer. This is causing my
> code to break. I have to once again go back to the code and fix it. Can we at least
> maintain some kind of backward compatibility here? Code not working between
> versions within a week's difference is really nerve-racking.
>
> Regards,
>
> Dhaval Udani
> Senior Analyst
> M-Line, QPEG
> OrbiTech Solutions Ltd.
> +91-22-28290019 Extn. 1457 |
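The coupling described in this reply, where registering one scanner also registers its sub-scanners and therefore needs a reference to the parser, can be illustrated with a small self-contained sketch. The classes below (ToyParser, ToyScanner, ToyFormScanner) are hypothetical stand-ins written for this illustration and are not the htmlparser classes; the sketch assumes the form scanner receives the parser in its constructor, as the reply implies, and uses modern Java collections for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-ins for illustration; these are NOT the htmlparser classes. */
class ScannerCouplingSketch {
    /** A scanner is identified here only by the tag name it handles. */
    static class ToyScanner {
        final String tagName;

        ToyScanner(String tagName) {
            this.tagName = tagName;
        }
    }

    /** Toy parser that keeps a registry of scanners keyed by tag name. */
    static final class ToyParser {
        private final Map<String, ToyScanner> scanners = new LinkedHashMap<>();

        void addScanner(ToyScanner scanner) {
            scanners.put(scanner.tagName, scanner);
        }

        boolean handles(String tagName) {
            return scanners.containsKey(tagName);
        }
    }

    /**
     * The form scanner needs the parser so that it can register the
     * scanners for the tags that occur inside a form. Without the parser
     * argument there would be nowhere to register them.
     */
    static final class ToyFormScanner extends ToyScanner {
        ToyFormScanner(ToyParser parser) {
            super("FORM");
            parser.addScanner(new ToyScanner("INPUT"));
            parser.addScanner(new ToyScanner("SELECT"));
            parser.addScanner(new ToyScanner("TEXTAREA"));
        }
    }
}
```

With this shape, a single call such as parser.addScanner(new ToyFormScanner(parser)) leaves the parser able to handle FORM, INPUT, SELECT and TEXTAREA. The trade-off raised in the thread is visible too: a no-args constructor could be kept for backward compatibility, but it would have no parser on which to register the sub-scanners.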
|
From: <dha...@or...> - 2003-04-17 13:28:06
|
Hi Somik,
I was checking out the code of form scanner and I saw that it contained a list of all the INPUT tags and all the TEXTAREA tags. In addition, we need to add the list of SELECT tags here as well.
Thanks for catching this. Done.
[Udani, Dhaval H.] I don't see this in the latest code yet.
Dhaval |
|
From: <dha...@or...> - 2003-04-17 13:20:18
|
Hi,
The latest version of the Parser is not backward compatible with the one released previously.
FormScanner does not have a no-args constructor any longer. This is causing my code to break. I have to once again go back to the code and fix it. Can we at least maintain some kind of backward compatibility here? Code not working between versions within a week's difference is really nerve-racking.
Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457 |
|
From: <dha...@or...> - 2003-04-17 10:46:04
|
Hi,
Is there a setLabel method in LabelTag to go with the getLabel method? I need to be able to change the label of a Label tag. How would I achieve this?
Dhaval |
|
From: Derrick O. <Der...@ro...> - 2003-04-17 00:10:35
|
wow!
htmlparser is in the top 25 most active projects this week:
https://sourceforge.net/top/mostactive.php?type=week
at position 24 with a 99.780% percentile
(which I think means it's more active than 99.78% of the projects on
sourceforge).
Keep up the good work!
Kill them bugs!
|
|
From: <dha...@or...> - 2003-04-16 12:27:24
|
Some things that need to be addressed:
1. LabelScanner does not have a no-args constructor like the other scanners. Needs to be added.
2. SelectTag should use a NodeList of OptionTags rather than a List to keep it in sync with the rest of the code.
I'll do both of these tomorrow and send them to Somik.
Somik, I hope you can include both of these in the next release.
-----Original Message-----
From: somik [mailto:so...@ya...]
Sent: Monday, April 14, 2003 5:38 AM
To: htmlparser-announce; htmlparser-developer; htmlparser-user
Cc: somik
Subject: [Htmlparser-developer] Integration Release 1.3-20030413 is out
Hi Folks,
This week's release contains :
Integration Build 1.3 - 20030413
--------------------------------
[1] reimplement StringBean as NodeVisitor, testStringBeanListener now succeeds
[2] Implemented feature request 702541 (Tags created by CompositeTagScanner now have startLine and endLine information in their TagData)
[3] Modified ScriptScanner to allow for subclassing and fixed minor bug
[4] Re-architected CompositeTagScanner
[5] Fixed Tag scanning bugs, OOM exceptions connected to #4
Thanks to Derrick Oswald for his work on the StringBean and Marc Novakowski for his work on adding line number support.
In this release, the CompositeTagScanner has been totally redesigned - using principles of Evolutionary Design. The new code is dramatically simpler and easier to understand. I tried ED on assertXmlEquals last week and had the same results. The Out of Memory bugs have been fixed - it will be good to have some feedback.
For those who might be interested in ED, before taking each step, I have taken a snapshot and put it in CVS. Many thanks to Josh Kerievsky for teaching me how to do ED.
Cheers,
Somik |
|
From: Somik R. <so...@ya...> - 2003-04-14 00:06:47
|
Hi Folks,
This week's release contains :
Integration Build 1.3 - 20030413
--------------------------------
[1] reimplement StringBean as NodeVisitor, testStringBeanListener now succeeds
[2] Implemented feature request 702541 (Tags created by CompositeTagScanner now have startLine and endLine information in their TagData)
[3] Modified ScriptScanner to allow for subclassing and fixed minor bug
[4] Re-architected CompositeTagScanner
[5] Fixed Tag scanning bugs, OOM exceptions connected to #4
Thanks to Derrick Oswald for his work on the StringBean and Marc Novakowski for his work on adding line number support.
In this release, the CompositeTagScanner has been totally redesigned - using principles of Evolutionary Design. The new code is dramatically simpler and easier to understand. I tried ED on assertXmlEquals last week and had the same results. The Out of Memory bugs have been fixed - it will be good to have some feedback.
For those who might be interested in ED, before taking each step, I have taken a snapshot and put it in CVS. Many thanks to Josh Kerievsky for teaching me how to do ED.
Cheers,
Somik |