htmlparser-developer Mailing List for HTML Parser (Page 4)
Brought to you by: derrickoswald
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2001 |     |     |     |     |     |     |     |     |     | 4   | 1   | 4   |
| 2002 | 12  |     | 7   | 27  | 14  | 16  | 27  | 74  | 1   | 23  | 12  | 119 |
| 2003 | 31  | 23  | 28  | 59  | 119 | 10  | 3   | 17  | 8   | 38  | 6   | 1   |
| 2004 | 4   | 4   | 1   | 2   |     | 7   | 6   | 1   |     |     |     |     |
| 2005 |     | 1   |     | 8   |     |     |     | 2   | 10  | 4   | 15  |     |
| 2006 |     | 1   |     | 4   | 11  |     |     |     | 2   |     |     |     |
| 2007 | 3   | 2   |     | 2   |     |     | 1   |     |     |     |     |     |
| 2008 |     | 1   |     |     |     |     |     |     | 5   | 1   |     |     |
| 2009 |     | 1   |     | 2   |     | 4   |     | 1   |     |     |     | 2   |
| 2010 | 1   |     |     | 8   |     |     |     |     | 6   |     | 1   |     |
| 2011 |     |     |     |     | 3   |     |     |     |     |     |     |     |
| 2012 |     |     |     |     | 1   |     |     |     |     |     |     |     |
| 2014 |     |     |     |     | 1   |     |     |     |     |     |     |     |
| 2015 |     |     |     | 1   |     | 1   |     |     |     |     | 2   | 1   |
| 2016 |     |     |     |     |     |     | 2   |     |     |     | 2   | 2   |
From: Axel <ax...@gm...> - 2005-11-02 22:07:16
On 11/1/05, Ian Macfarlane <ian...@gm...> wrote:
> I was thinking it might be worthwhile adding a method to Text/TextNode along the lines of:
>
> boolean isWhiteSpace()
>
> which would return whether the TextNode consists solely of white-space characters (or is the empty String).
>
> Now this could simply be done using String.trim().equals(""), however that wouldn't account for:
>
> - the non-breaking space character (#160)
> - the numeric character reference &#160; (including the variants Firefox/IE also accept)
> - the named entity &nbsp; (including the variants Firefox/IE also accept)
>
> So my question is, do you think this method should treat those as spaces and remove/ignore them as well when determining whether the TextNode is white space? Or should it only trim normal whitespace (space, tab, carriage returns, etc.)?

I think isWhiteSpace() should return true if every character in the TextNode (with entities converted to Unicode characters) is true for Character#isWhitespace(). IMO the TextNode shouldn't be trimmed automatically; only a dedicated method should allow that.

--
Axel Kramer
http://www.plog4u.org - Wikipedia Eclipse Plugin
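A minimal sketch of the semantics Axel describes, as a hypothetical helper rather than existing parser API: the node's decoded text counts as white space only if every character passes Character.isWhitespace(). Note that Character.isWhitespace('\u00A0') is false in Java, so a decoded non-breaking space would still need special-casing if it is to be treated as white space.

```java
// Hypothetical sketch of the proposed check; "text" is assumed to be the
// TextNode's content with any entities already translated to characters.
public static boolean isWhiteSpace (String text)
{
    for (int i = 0; i < text.length (); i++)
        if (!Character.isWhitespace (text.charAt (i)))
            return (false);

    return (true); // the empty string is treated as white space
}
```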
From: Derrick O. <Der...@Ro...> - 2005-11-02 12:25:36
There aren't any design documents as far as I'm aware; the project has largely grown by accretion. What's in the JavaDocs is what there is. Somik Raha or Joshua Kerievsky may have other documents related to the design.

yangang li wrote:
> Dear Derrick:
> I am a graduate student and I am very interested in joining the HTML Parser project. I have done some research about it. Now I would like some materials about it, especially on the design.
> Yours, Yangang Li
From: yangang li <lhe...@ya...> - 2005-11-02 07:24:01
Dear Derrick:

I am a graduate student and I am very interested in joining the HTML Parser project. I have done some research about it. Now I would like some materials about it, especially on the design.

Yours, Yangang Li
From: Ian M. <ian...@gm...> - 2005-11-01 16:04:49
There doesn't seem to be any way of traversing from one top-level node (e.g. doctype, html, #text, etc.) to any of its siblings once you no longer have the original NodeList obtained from the Parser class. This means, for example, that you currently cannot implement a get-previous/next-sibling method on these nodes. A few quick ideas about how to do this (none of which are particularly ideal):

- Create a node named something like DocumentNode (we have HTMLPage, but that doesn't really seem to do what we want) whose children are the NodeList obtained from the Parser class, and make the Parser class create one of these instead of the plain NodeList it returns now.

- Have a method named something like getDocument on each tag, which simply returns a reference to the initial NodeList obtained from the Parser class.

Any other ideas are welcome.

Ian
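To make the limitation concrete, here is a small sketch (the class and method names are made up for illustration, not existing parser API) of how sibling lookup works while the original top-level NodeList is still in hand; the DocumentNode idea would make that list reachable from any of its members instead.

```java
import org.htmlparser.Node;
import org.htmlparser.util.NodeList;

// Illustrative helper only: given the parser's top-level NodeList and one of
// its members, find the next sibling by scanning the list.
public class Siblings
{
    public static Node nextSibling (NodeList topLevel, Node node)
    {
        for (int i = 0; i < topLevel.size () - 1; i++)
            if (topLevel.elementAt (i) == node)
                return (topLevel.elementAt (i + 1));

        return (null); // last node, or not a top-level node
    }
}
```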
From: Ian M. <ian...@gm...> - 2005-11-01 10:56:44
I was thinking it might be worthwhile adding a method to Text/TextNode along the lines of:

boolean isWhiteSpace()

which would return whether the TextNode consists solely of white-space characters (or is the empty String).

Now this could simply be done using String.trim().equals(""), however that wouldn't account for:

- the non-breaking space character (#160)
- the numeric character reference &#160; (including the variants Firefox/IE also accept)
- the named entity &nbsp; (including the variants Firefox/IE also accept)

So my question is, do you think this method should treat those as spaces and remove/ignore them as well when determining whether the TextNode is white space? Or should it only trim normal whitespace (space, tab, carriage returns, etc.)?

Thanks for your advice

Ian Macfarlane
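A small illustration of the gap Ian points out, assuming the entity has already been decoded to a character: String.trim() only strips characters up to '\u0020', so a decoded non-breaking space survives trimming and the text does not compare equal to the empty string.

```java
public class TrimGap
{
    public static void main (String[] args)
    {
        String ordinary = " \t\r\n";   // ordinary white space
        String nonBreaking = "\u00a0"; // a decoded &nbsp; / &#160;

        System.out.println (ordinary.trim ().equals (""));    // true
        System.out.println (nonBreaking.trim ().equals ("")); // false
    }
}
```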
From: Ian M. <ian...@gm...> - 2005-11-01 10:15:56
This has been checked into CVS.

On 10/27/05, Ian Macfarlane <ian...@gm...> wrote:
> Forwarded again as SourceForge blocked the zip attachments on the previous mail.
>
> ---------- Forwarded message ----------
> From: Ian Macfarlane <ian...@gm...>
> Date: Oct 27, 2005 1:35 PM
> Subject: More patches for HTMLParser
> To: Der...@ro...
> Cc: htm...@li...
>
> Dear Derrick,
>
> I've modified the custom definitions I had to write for p tags and the definition-list-related tags (dl, dd, dt).
>
> Notes:
>
> - ParagraphTag: there are a lot of tags listed as closing the P tag, but these all appear to be correct. I have tested it using the attached splitting_test.html file (a sample testing div splitting p) and Firefox's DOM inspector, and went through every tag in the HTML 4 specification; all the listed tag names caused the P tag to close in Firefox.
>
> - Definition lists: I did not simply add these as entries to the Bullet and BulletList classes, because testing (a few examples are shown in the attached list_test.html) showed that they behave differently from normal lists (for example, a dt closes a dd and vice versa, but an li does not close either of these, nor do they close an li).
>
> A question about PrototypicalNodeFactory: is there a reason that the div, span, body, head and html tag registrations in registerTags() are at the end rather than in alphabetical order like the other tags?
>
> Kind regards,
>
> Ian Macfarlane
>
> PS: I've tried the following command to add a file to CVS:
>
> cvs -d :ext:ian...@cv...:/cvsroot/htmlparser add src/org/htmlparser/tags/ParagraphTag.java
>
> And I get this response:
>
> cvs add: in directory .:
> cvs [add aborted]: there is no version here; do 'cvs checkout' first
>
> I'm new to using CVS, so I'm not entirely sure what I'm doing wrong here. Do you have any suggestions?
From: Ian M. <ian...@gm...> - 2005-11-01 10:14:02
This has been checked into CVS.

On 10/28/05, Ian Macfarlane <ian...@gm...> wrote:
> It looks like headings should definitely also extend CompositeTag. I think that's most of the essential ones now (although we still need THEAD, TBODY and TFOOT at some point).
>
> I've attached the file, HeadingTag.java, and also all the other tag changes to date, to make it easier to keep track of all the changes made so far.
>
> PrototypicalNodeFactory has also been updated with the new tag definitions (heading, paragraph and the definition-list tags).
>
> I promise I'll try and get CVS upload worked out soon ;)
>
> Best wishes
>
> Ian Macfarlane
From: Ian M. <ian...@gm...> - 2005-11-01 10:13:40
This has been checked into CVS.

On 10/27/05, Ian Macfarlane <ian...@gm...> wrote:
> Here are some updates to the various table-related tags, making them close on finding an open or close TBODY, THEAD or TFOOT.
>
> Sorry, I'm still trying to work out CVS write access.
>
> Ian
From: Ian M. <ian...@gm...> - 2005-10-28 16:33:55
It looks like headings should definitely also extend CompositeTag. I think that's most of the essential ones now (although we still need THEAD, TBODY and TFOOT at some point).

I've attached the file, HeadingTag.java, and also all the other tag changes to date, to make it easier to keep track of all the changes made so far.

PrototypicalNodeFactory has also been updated with the new tag definitions (heading, paragraph and the definition-list tags).

I promise I'll try and get CVS upload worked out soon ;)

Best wishes

Ian Macfarlane
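A minimal sketch of the shape such a class usually takes in this parser, as context for the discussion; the member names and the ender list below are assumptions for illustration, not Ian's attached HeadingTag.java.

```java
import org.htmlparser.tags.CompositeTag;

// Hypothetical sketch of a heading tag that extends CompositeTag,
// in the style of the parser's other composite tags.
public class HeadingTag extends CompositeTag
{
    private static final String[] mIds = {"H1", "H2", "H3", "H4", "H5", "H6"};

    // start tags that implicitly end an open heading (illustrative subset)
    private static final String[] mEnders = {"H1", "H2", "H3", "H4", "H5", "H6"};

    public String[] getIds ()
    {
        return (mIds);
    }

    public String[] getEnders ()
    {
        return (mEnders);
    }
}
```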
From: Ian M. <ian...@gm...> - 2005-10-27 15:34:37
Here are some updates to the various table-related tags, making them close on finding an open or close TBODY, THEAD or TFOOT.

Sorry, I'm still trying to work out CVS write access.

Ian
From: Ian M. <ian...@gm...> - 2005-10-27 14:54:33
Feature request 1291620

http://sourceforge.net/tracker/index.php?func=detail&aid=1291620&group_id=24399&atid=381402

has been fulfilled by patch 1338534:

http://sourceforge.net/tracker/index.php?func=detail&aid=1338534&group_id=24399&atid=381401

Please mark it as closed.

Ian
From: Ian M. <ian...@gm...> - 2005-10-27 14:02:08
Forwarded again as SourceForge blocked the zip attachments on the previous mail.

---------- Forwarded message ----------
From: Ian Macfarlane <ian...@gm...>
Date: Oct 27, 2005 1:35 PM
Subject: More patches for HTMLParser
To: Der...@ro...
Cc: htm...@li...

Dear Derrick,

I've modified the custom definitions I had to write for p tags and the definition-list-related tags (dl, dd, dt).

Notes:

- ParagraphTag: there are a lot of tags listed as closing the P tag, but these all appear to be correct. I have tested it using the attached splitting_test.html file (a sample testing div splitting p) and Firefox's DOM inspector, and went through every tag in the HTML 4 specification; all the listed tag names caused the P tag to close in Firefox.

- Definition lists: I did not simply add these as entries to the Bullet and BulletList classes, because testing (a few examples are shown in the attached list_test.html) showed that they behave differently from normal lists (for example, a dt closes a dd and vice versa, but an li does not close either of these, nor do they close an li).

A question about PrototypicalNodeFactory: is there a reason that the div, span, body, head and html tag registrations in registerTags() are at the end rather than in alphabetical order like the other tags?

Kind regards,

Ian Macfarlane

PS: I've tried the following command to add a file to CVS:

cvs -d :ext:ian...@cv...:/cvsroot/htmlparser add src/org/htmlparser/tags/ParagraphTag.java

And I get this response:

cvs add: in directory .:
cvs [add aborted]: there is no version here; do 'cvs checkout' first

I'm new to using CVS, so I'm not entirely sure what I'm doing wrong here. Do you have any suggestions?
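As an illustration of the approach Ian describes, a hypothetical composite tag whose ender lists close an open P when certain other tags appear; the handful of enders shown here is only an illustrative subset assumed for the sketch, not the contents of the attached ParagraphTag.java.

```java
import org.htmlparser.tags.CompositeTag;

// Hypothetical sketch only; not the attached ParagraphTag.java.
public class ParagraphTag extends CompositeTag
{
    private static final String[] mIds = {"P"};

    // illustrative subset of start tags that implicitly close an open P
    private static final String[] mEnders = {"P", "DIV", "TABLE", "UL", "OL", "DL", "BLOCKQUOTE"};

    // end tags that also close an open P
    private static final String[] mEndTagEnders = {"DIV", "BODY", "HTML"};

    public String[] getIds ()
    {
        return (mIds);
    }

    public String[] getEnders ()
    {
        return (mEnders);
    }

    public String[] getEndTagEnders ()
    {
        return (mEndTagEnders);
    }
}
```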
From: Derrick O. <Der...@Ro...> - 2005-09-29 11:24:46
|
Sorry, I fired from the hip in a hurry and didn't even see the attachment. I'll give it a better look when I get some time. Matthew Buckett wrote: > Derrick Oswald wrote: > >> It's zero based, unlike the usual text editor counting. > > > Yeah, but I'm passing in the position: > > Page.getLine(int position) > Get the text line the position of the cursor lies on. > > So if I parse "line0\nline1\nline2\n". > then call page.getLine(8) I should get back "line1\n" but I get > "line2\n"; > > row(8) correctly gives back 1 (zero based line number). But > mIndex.elementAt(1) returns the end of row 1 (position 12) then the > line is incremented and mIndex.elementAt(2) returns the end of row 2 > (position 18). This is then passed to getText which returns the text > for the last row. > > Try the tests without the patch and they fail. Are you saying my tests > should fail? > >> Matthew Buckett wrote: >> >>> Page.getLine always seems to return the previous line. Attached are >>> some tests that show this. It seems that the documentation on >>> PageIndex says it should be the index the the first character of the >>> line but it is actually set as being the position of the newline. >>> >>> I've attached a fix to Page.getLine() that makes it work but I don't >>> know if the correct fix change PageIndex so that the index of the >>> start of the line is put in it instead. >>> >>> ------------------------------------------------------------------------ >>> >>> >>> Index: Page.java >>> =================================================================== >>> RCS file: >>> /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v >>> retrieving revision 1.51 >>> diff -u -r1.51 Page.java >>> --- Page.java 20 Jun 2005 01:56:32 -0000 1.51 >>> +++ Page.java 28 Sep 2005 16:16:14 -0000 >>> @@ -1106,12 +1106,12 @@ >>> size = mIndex.size (); >>> if (line < size) >>> { >>> - start = mIndex.elementAt (line); >>> - line++; >>> - if (line <= size) >>> - end = mIndex.elementAt (line); >>> + end = mIndex.elementAt (line); >>> + line--; >>> + if (line >= 0) >>> + start = mIndex.elementAt (line); >>> else >>> - end = mSource.offset (); >>> + start = 0; >>> } >>> else // current line >>> { >>> >>> >>> ------------------------------------------------------------------------ >>> >>> >>> /* >>> ====================================================================== >>> The Bodington System Software License, Version 1.0 >> > > Sorry Eclipse was still configured for the wrong project... 
> > >>> package org.htmlparser.tests; >>> >>> import junit.framework.TestCase; >>> >>> import org.htmlparser.Node; >>> import org.htmlparser.Parser; >>> import org.htmlparser.filters.TagNameFilter; >>> import org.htmlparser.util.NodeList; >>> import org.htmlparser.util.ParserException; >>> >>> public class LineTests extends TestCase >>> { >>> public void testGetLine1() throws ParserException { >>> Parser parser = getParser(); >>> NodeList list = parser.parse(new TagNameFilter("h1")); >>> Node node = list.elementAt(0); >>> assertEquals("<h1>Line 1</h1>\n", node.getPage().getLine( >>> node.getStartPosition())); >>> } >>> public void testGetLine2() throws ParserException { >>> Parser parser = getParser(); >>> NodeList list = parser.parse(new TagNameFilter("h2")); >>> Node node = list.elementAt(0); >>> assertEquals("<h2>Line 2</h2>\n", node.getPage().getLine( >>> node.getStartPosition())); >>> } >>> public void testGetLine3() throws ParserException { >>> Parser parser = getParser(); >>> NodeList list = parser.parse(new TagNameFilter("h3")); >>> Node node = list.elementAt(0); >>> assertEquals("<h3>Line 3</h3>\n", node.getPage().getLine( >>> node.getStartPosition())); >>> } >>> public Parser getParser() >>> { >>> Parser parser = new Parser(); >>> try >>> { >>> parser.setInputHTML( >>> "<h1>Line 1</h1>\n"+ >>> "<h2>Line 2</h2>\n"+ >>> "<h3>Line 3</h3>\n" >>> ); >>> } >>> catch (ParserException e) >>> { >>> fail("Failed to parse"); >>> } >>> return parser; >>> } >>> } >>> >>> >> >> >> >> ------------------------------------------------------- >> This SF.Net email is sponsored by: >> Power Architecture Resource Center: Free content, downloads, >> discussions, >> and more. http://solutions.newsforge.com/ibmarch.tmpl >> _______________________________________________ >> Htmlparser-developer mailing list >> Htm...@li... >> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer >> >> > > |
From: Matthew B. <mat...@co...> - 2005-09-29 09:29:50
|
Derrick Oswald wrote: > It's zero based, unlike the usual text editor counting. Yeah, but I'm passing in the position: Page.getLine(int position) Get the text line the position of the cursor lies on. So if I parse "line0\nline1\nline2\n". then call page.getLine(8) I should get back "line1\n" but I get "line2\n"; row(8) correctly gives back 1 (zero based line number). But mIndex.elementAt(1) returns the end of row 1 (position 12) then the line is incremented and mIndex.elementAt(2) returns the end of row 2 (position 18). This is then passed to getText which returns the text for the last row. Try the tests without the patch and they fail. Are you saying my tests should fail? > Matthew Buckett wrote: > >> Page.getLine always seems to return the previous line. Attached are >> some tests that show this. It seems that the documentation on >> PageIndex says it should be the index the the first character of the >> line but it is actually set as being the position of the newline. >> >> I've attached a fix to Page.getLine() that makes it work but I don't >> know if the correct fix change PageIndex so that the index of the >> start of the line is put in it instead. >> >> ------------------------------------------------------------------------ >> >> Index: Page.java >> =================================================================== >> RCS file: >> /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v >> retrieving revision 1.51 >> diff -u -r1.51 Page.java >> --- Page.java 20 Jun 2005 01:56:32 -0000 1.51 >> +++ Page.java 28 Sep 2005 16:16:14 -0000 >> @@ -1106,12 +1106,12 @@ >> size = mIndex.size (); >> if (line < size) >> { >> - start = mIndex.elementAt (line); >> - line++; >> - if (line <= size) >> - end = mIndex.elementAt (line); >> + end = mIndex.elementAt (line); >> + line--; >> + if (line >= 0) >> + start = mIndex.elementAt (line); >> else >> - end = mSource.offset (); >> + start = 0; >> } >> else // current line >> { >> >> >> ------------------------------------------------------------------------ >> >> /* ====================================================================== >> The Bodington System Software License, Version 1.0 Sorry Eclipse was still configured for the wrong project... 
>> package org.htmlparser.tests; >> >> import junit.framework.TestCase; >> >> import org.htmlparser.Node; >> import org.htmlparser.Parser; >> import org.htmlparser.filters.TagNameFilter; >> import org.htmlparser.util.NodeList; >> import org.htmlparser.util.ParserException; >> >> public class LineTests extends TestCase >> { >> public void testGetLine1() throws ParserException { >> Parser parser = getParser(); >> NodeList list = parser.parse(new TagNameFilter("h1")); >> Node node = list.elementAt(0); >> assertEquals("<h1>Line 1</h1>\n", node.getPage().getLine( >> node.getStartPosition())); >> } >> public void testGetLine2() throws ParserException { >> Parser parser = getParser(); >> NodeList list = parser.parse(new TagNameFilter("h2")); >> Node node = list.elementAt(0); >> assertEquals("<h2>Line 2</h2>\n", node.getPage().getLine( >> node.getStartPosition())); >> } >> public void testGetLine3() throws ParserException { >> Parser parser = getParser(); >> NodeList list = parser.parse(new TagNameFilter("h3")); >> Node node = list.elementAt(0); >> assertEquals("<h3>Line 3</h3>\n", node.getPage().getLine( >> node.getStartPosition())); >> } >> public Parser getParser() >> { >> Parser parser = new Parser(); >> try >> { >> parser.setInputHTML( >> "<h1>Line 1</h1>\n"+ >> "<h2>Line 2</h2>\n"+ >> "<h3>Line 3</h3>\n" >> ); >> } >> catch (ParserException e) >> { >> fail("Failed to parse"); >> } >> return parser; >> } >> } >> >> > > > > ------------------------------------------------------- > This SF.Net email is sponsored by: > Power Architecture Resource Center: Free content, downloads, discussions, > and more. http://solutions.newsforge.com/ibmarch.tmpl > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > -- +--Matthew Buckett-----------------------------------------+ | VLE Developer, Learning Technologies Group | | Tel: +44 (0) 1865 283660 http://www.oucs.ox.ac.uk/ | +------------Computing Services, University of Oxford------+ |
From: Derrick O. <Der...@Ro...> - 2005-09-28 22:26:24
|
It's zero based, unlike the usual text editor counting. Matthew Buckett wrote: > Page.getLine always seems to return the previous line. Attached are > some tests that show this. It seems that the documentation on > PageIndex says it should be the index the the first character of the > line but it is actually set as being the position of the newline. > > I've attached a fix to Page.getLine() that makes it work but I don't > know if the correct fix change PageIndex so that the index of the > start of the line is put in it instead. > >------------------------------------------------------------------------ > >Index: Page.java >=================================================================== >RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v >retrieving revision 1.51 >diff -u -r1.51 Page.java >--- Page.java 20 Jun 2005 01:56:32 -0000 1.51 >+++ Page.java 28 Sep 2005 16:16:14 -0000 >@@ -1106,12 +1106,12 @@ > size = mIndex.size (); > if (line < size) > { >- start = mIndex.elementAt (line); >- line++; >- if (line <= size) >- end = mIndex.elementAt (line); >+ end = mIndex.elementAt (line); >+ line--; >+ if (line >= 0) >+ start = mIndex.elementAt (line); > else >- end = mSource.offset (); >+ start = 0; > } > else // current line > { > > >------------------------------------------------------------------------ > >/* ====================================================================== >The Bodington System Software License, Version 1.0 > >Copyright (c) 2001 The University of Leeds. All rights reserved. > >Redistribution and use in source and binary forms, with or without >modification, are permitted provided that the following conditions are >met: > >1. Redistributions of source code must retain the above copyright notice, >this list of conditions and the following disclaimer. > >2. Redistributions in binary form must reproduce the above copyright >notice, this list of conditions and the following disclaimer in the >documentation and/or other materials provided with the distribution. > >3. The end-user documentation included with the redistribution, if any, >must include the following acknowledgement: "This product includes >software developed by the University of Leeds >(http://www.bodington.org/)." Alternately, this acknowledgement may >appear in the software itself, if and wherever such third-party >acknowledgements normally appear. > >4. The names "Bodington", "Nathan Bodington", "Bodington System", >"Bodington Open Source Project", and "The University of Leeds" must not be >used to endorse or promote products derived from this software without >prior written permission. For written permission, please contact >d.g...@le.... > >5. The name "Bodington" may not appear in the name of products derived >from this software without prior written permission of the University of >Leeds. > >THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED >WARRANTIES, INCLUDING, BUT NOT LIMITED TO, TITLE, THE IMPLIED WARRANTIES >OF QUALITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO >EVENT SHALL THE UNIVERSITY OF LEEDS OR ITS CONTRIBUTORS BE LIABLE FOR >ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL >DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE >GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) >HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, >STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN >ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE >POSSIBILITY OF SUCH DAMAGE. >========================================================= > >This software was originally created by the University of Leeds and may contain voluntary >contributions from others. For more information on the Bodington Open Source Project, please >see http://bodington.org/ > >====================================================================== */ > >package org.htmlparser.tests; > >import junit.framework.TestCase; > >import org.htmlparser.Node; >import org.htmlparser.Parser; >import org.htmlparser.filters.TagNameFilter; >import org.htmlparser.util.NodeList; >import org.htmlparser.util.ParserException; > >public class LineTests extends TestCase >{ > public void testGetLine1() throws ParserException { > Parser parser = getParser(); > NodeList list = parser.parse(new TagNameFilter("h1")); > Node node = list.elementAt(0); > assertEquals("<h1>Line 1</h1>\n", node.getPage().getLine( > node.getStartPosition())); > } > > public void testGetLine2() throws ParserException { > Parser parser = getParser(); > NodeList list = parser.parse(new TagNameFilter("h2")); > Node node = list.elementAt(0); > assertEquals("<h2>Line 2</h2>\n", node.getPage().getLine( > node.getStartPosition())); > } > > public void testGetLine3() throws ParserException { > Parser parser = getParser(); > NodeList list = parser.parse(new TagNameFilter("h3")); > Node node = list.elementAt(0); > assertEquals("<h3>Line 3</h3>\n", node.getPage().getLine( > node.getStartPosition())); > } > > public Parser getParser() > { > Parser parser = new Parser(); > try > { > parser.setInputHTML( > "<h1>Line 1</h1>\n"+ > "<h2>Line 2</h2>\n"+ > "<h3>Line 3</h3>\n" > ); > } > catch (ParserException e) > { > fail("Failed to parse"); > } > return parser; > } >} > > |
From: Matthew B. <mat...@co...> - 2005-09-28 16:25:31
Page.getLine always seems to return the previous line. Attached are some tests that show this. The documentation on PageIndex says it should hold the index of the first character of each line, but it is actually set to the position of the newline.

I've attached a fix to Page.getLine() that makes it work, but I don't know whether the correct fix is instead to change PageIndex so that the index of the start of the line is stored in it.

--
Matthew Buckett, VLE Developer, Learning Technologies Group
Tel: +44 (0) 1865 283660  http://www.oucs.ox.ac.uk/
Computing Services, University of Oxford
From: Matthew B. <mat...@co...> - 2005-09-14 09:06:06
Derrick Oswald wrote:
> To answer your second question first, it's a legacy thing: trying to keep the base classes compatible with Java 1.x and avoiding the new Java Collections Framework. This can probably be revisited, since the goal of backward compatibility has less emphasis these days.

OK. Even if NodeList didn't implement java.util.List, having similar methods would make the learning curve smaller for most Java programmers.

On a slightly related note, is there a reason for using an array for the Nodes in NodeList but a Vector for the attributes in TagNode? Switching to a Vector in NodeList would make it easy to expose more flexible methods. Was an array originally chosen for performance reasons?

> Removing nodes from an underlying collection while an iterator is active on it is fraught with peril.

Nothing like living dangerously ;-)

> It might work in some cases (and I'm a little surprised it worked for you), but I think the better approach is to throw all the nodes to be deleted in a 'garbage bin' and remove them all later.

OK, I'll probably change to this approach. I was just going to use a filter, but then I either end up running multiple filters over the same tree or repeating the same tests on the nodes the filter returns to work out what I should do with them; neither seemed very sensible. Using a visitor I at least traverse the tree only once and can perform the alterations.

> Yes, the NodeList could use a remove(Node) call. You could add the patch to the Patches tracker (http://sourceforge.net/tracker/?group_id=24399&atid=381401) or the Request For Enhancement tracker (http://sourceforge.net/tracker/?group_id=24399&atid=381402), but it's probably good enough in the mail list here.

OK.

--
Matthew Buckett, VLE Developer, Learning Technologies Group
Tel: +44 (0) 1865 283660  http://www.oucs.ox.ac.uk/
Computing Services, University of Oxford
From: Derrick O. <Der...@Ro...> - 2005-09-13 22:54:34
To answer your second question first, it's a legacy thing: trying to keep the base classes compatible with Java 1.x and avoiding the new Java Collections Framework. This can probably be revisited, since the goal of backward compatibility has less emphasis these days.

Removing nodes from an underlying collection while an iterator is active on it is fraught with peril. It might work in some cases (and I'm a little surprised it worked for you), but I think the better approach is to throw all the nodes to be deleted in a 'garbage bin' and remove them all later.

Yes, the NodeList could use a remove(Node) call. You could add the patch to the Patches tracker (http://sourceforge.net/tracker/?group_id=24399&atid=381401) or the Request For Enhancement tracker (http://sourceforge.net/tracker/?group_id=24399&atid=381402), but it's probably good enough in the mail list here.

Matthew Buckett wrote:
> First, can I say thanks for htmlparser; it's really useful.
>
> I'm trying to remove a Tag that I am visiting (using a NodeVisitor) from its parent:
>
>     public void visitTag( Tag tag )
>     {
>         ....
>         NodeList children = tag.getParent().getChildren();
>         for (int child = 0; child < children.size(); child++)
>         {
>             if (tag.equals(children.elementAt(child)))
>             {
>                 children.remove(child);
>                 break;
>             }
>         }
>
> and would rather be able to do:
>
>     public void visitTag( Tag tag )
>     {
>         ....
>         NodeList children = tag.getParent().getChildren();
>         children.remove(tag);
>
> Comments?
>
> I was also wondering if there was a reason why NodeList doesn't implement java.util.List, as most Java programmers are already familiar with its semantics.
>
> I would have attached a patch, but I don't seem to be able to do a CVS diff against the SourceForge anonymous CVS at the moment (timeouts) :-(
>
> -- Added code --
>
>     /**
>      * Check to see if the NodeList contains the supplied Node.
>      * @param node The node to look for.
>      * @return True if the Node is in this NodeList.
>      */
>     public boolean contains(Node node) {
>         return indexOf(node) != -1;
>     }
>
>     /**
>      * Finds the index of the supplied Node.
>      * @param node The node to look for.
>      * @return The index of the node in the list or -1 if it isn't found.
>      */
>     public int indexOf(Node node) {
>         for (int i = 0; i < size; i++) {
>             if (nodeData.equals(node))
>                 return i;
>         }
>         return -1;
>     }
>
>     /**
>      * Remove the supplied Node from the list.
>      * @param node The node to remove.
>      * @return True if the node was found and removed from the list.
>      */
>     public boolean remove(Node node) {
>         int index = indexOf(node);
>         if (index != -1) {
>             remove(index);
>             return true;
>         }
>         return false;
>     }
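A rough sketch of the 'garbage bin' approach Derrick suggests, using only visitor and list calls that exist in the parser; the visitor class, the URL and the removal criterion (dropping FONT tags) are made up for illustration. Nodes are collected during the visit and detached from their parents only after the traversal finishes.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

// Illustrative sketch of deferring removals until after the traversal.
public class GarbageBinVisitor extends NodeVisitor
{
    private NodeList mGarbage = new NodeList ();

    public void visitTag (Tag tag)
    {
        if ("FONT".equalsIgnoreCase (tag.getTagName ())) // example criterion
            mGarbage.add (tag);                          // defer the removal
    }

    // call once the visit is finished
    public void emptyGarbage ()
    {
        for (int i = 0; i < mGarbage.size (); i++)
        {
            Node node = mGarbage.elementAt (i);
            Node parent = node.getParent ();
            if (null != parent)
            {
                NodeList children = parent.getChildren ();
                for (int j = 0; j < children.size (); j++)
                    if (node == children.elementAt (j))
                    {
                        children.remove (j); // remove by index; remove(Node) is the proposed addition
                        break;
                    }
            }
        }
    }

    public static void main (String[] args) throws ParserException
    {
        Parser parser = new Parser ("http://htmlparser.sourceforge.net"); // placeholder URL
        GarbageBinVisitor visitor = new GarbageBinVisitor ();
        parser.visitAllNodesWith (visitor);
        visitor.emptyGarbage ();
    }
}
```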
From: Matthew B. <mat...@co...> - 2005-09-13 13:30:34
Matthew Buckett wrote:

Sorry, I missed this:

> /**
>  * Finds the index of the supplied Node.
>  * @param node The node to look for.
>  * @return The index of the node in the list or -1 if it isn't found.
>  */
> public int indexOf(Node node) {
>     for (int i = 0; i < size; i++) {
>         if (nodeData.equals(node))

should have been:

    if (nodeData[i].equals(node))

--
Matthew Buckett, VLE Developer, Learning Technologies Group
Tel: +44 (0) 1865 283660  http://www.oucs.ox.ac.uk/
Computing Services, University of Oxford
From: Matthew B. <mat...@ou...> - 2005-09-13 13:25:05
First, can I say thanks for htmlparser; it's really useful.

I'm trying to remove a Tag that I am visiting (using a NodeVisitor) from its parent:

    public void visitTag( Tag tag )
    {
        ....
        NodeList children = tag.getParent().getChildren();
        for (int child = 0; child < children.size(); child++)
        {
            if (tag.equals(children.elementAt(child)))
            {
                children.remove(child);
                break;
            }
        }

and would rather be able to do:

    public void visitTag( Tag tag )
    {
        ....
        NodeList children = tag.getParent().getChildren();
        children.remove(tag);

Comments?

I was also wondering if there was a reason why NodeList doesn't implement java.util.List, as most Java programmers are already familiar with its semantics.

I would have attached a patch, but I don't seem to be able to do a CVS diff against the SourceForge anonymous CVS at the moment (timeouts) :-(

-- Added code --

    /**
     * Check to see if the NodeList contains the supplied Node.
     * @param node The node to look for.
     * @return True if the Node is in this NodeList.
     */
    public boolean contains(Node node) {
        return indexOf(node) != -1;
    }

    /**
     * Finds the index of the supplied Node.
     * @param node The node to look for.
     * @return The index of the node in the list or -1 if it isn't found.
     */
    public int indexOf(Node node) {
        for (int i = 0; i < size; i++) {
            if (nodeData.equals(node))
                return i;
        }
        return -1;
    }

    /**
     * Remove the supplied Node from the list.
     * @param node The node to remove.
     * @return True if the node was found and removed from the list.
     */
    public boolean remove(Node node) {
        int index = indexOf(node);
        if (index != -1) {
            remove(index);
            return true;
        }
        return false;
    }

--
Matthew Buckett, VLE Developer, Learning Technologies Group
Tel: +44 (0) 1865 283660  http://www.oucs.ox.ac.uk/
Computing Services, University of Oxford
From: Derrick O. <Der...@Ro...> - 2005-09-02 11:59:28
It was an oversight; there probably needs to be an explicit set/get of a recursion flag on the bean. The reason for there being a recursion flag on NodeList is to gain some control of the process: otherwise, just asking the nodes in the list to do their own filtering would automatically give recursive behaviour, and that may not be what is desired when processing the list.

Martin Hudson wrote:
> Having a great time with the tool. I ran into the following behaviour and wanted insight into the design decision and/or an alternative to my fix.
>
> I had previously been manually creating a parser, creating a filter, copying into a NodeList and then processing that NodeList with a new filter: basically a two-step filter. The first pass used relative context to grab chunks of HTML, while the second did a quick filter on the resulting elements to pull from that reduced list.
>
> I was refactoring the code to use the FilterBean class, as this seemed to offer an opportunity to simplify the code and handle the two filters in series automagically. The unexpected result (for me :-) ) was that the behaviour was not identical. It turns out that the filter bean explicitly does not recurse on the subsequent filter applications.
>
> From FilterBean, line 166:
>
> ret = ret.extractAllNodesThatMatch (getFilters ()[i], false);
>
> As a result the second filter can't find <A>s that are within <SPAN>s, for instance. My short-term hack was to set the recursion flag to true.
>
> Finally my question! Why is non-recursion the intended behaviour, given that it behaves differently from manually applying subsequent filters? Is my fix OK, or will it break some intended behaviour elsewhere?
>
> Martin N. Hudson
> devIS - Development InfoStructure
From: Martin H. <Mh...@de...> - 2005-09-01 14:22:34
Having a great time with the tool. I ran into the following behaviour and wanted insight into the design decision and/or an alternative to my fix.

I had previously been manually creating a parser, creating a filter, copying into a NodeList and then processing that NodeList with a new filter: basically a two-step filter. The first pass used relative context to grab chunks of HTML, while the second did a quick filter on the resulting elements to pull from that reduced list.

I was refactoring the code to use the FilterBean class, as this seemed to offer an opportunity to simplify the code and handle the two filters in series automagically. The unexpected result (for me :-) ) was that the behaviour was not identical. It turns out that the filter bean explicitly does not recurse on the subsequent filter applications.

From FilterBean, line 166:

ret = ret.extractAllNodesThatMatch (getFilters ()[i], false);

As a result the second filter can't find <A>s that are within <SPAN>s, for instance. My short-term hack was to set the recursion flag to true.

Finally my question! Why is non-recursion the intended behaviour, given that it behaves differently from manually applying subsequent filters? Is my fix OK, or will it break some intended behaviour elsewhere?

Martin N. Hudson
devIS - Development InfoStructure
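For comparison, a sketch of the manual two-step filtering Martin describes, with the second pass applied recursively via NodeList.extractAllNodesThatMatch(filter, true); the URL and the link filter are placeholders, and this shows the behaviour he expected rather than FilterBean's own implementation.

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TwoStepFilter
{
    public static void main (String[] args) throws ParserException
    {
        Parser parser = new Parser ("http://example.com/page.html"); // placeholder URL

        // first pass: grab the chunks of interest
        NodeList chunks = parser.parse (new HasAttributeFilter ("class", "blogbody"));

        // second pass: filter within those chunks, recursing into children
        // so that e.g. <A> elements nested inside <SPAN>s are still found
        NodeList links = chunks.extractAllNodesThatMatch (new TagNameFilter ("a"), true);

        System.out.println (links.size () + " links found");
    }
}
```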
From: Derrick O. <Der...@Ro...> - 2005-08-26 02:45:15
No, I think you've got the zen exactly right. The incomplete coverage is historical; the set has grown organically, with the tags people wanted most coming first. The downside of too many tags is that parsing gets more rigid: the closing tags are expected, and the penalty for their absence may be a bit harsh. You're welcome to contribute the classes back, but there may be a bit of push-back from users of the current set. Maybe not. There are probably ways to add the new tags without breaking existing applications.

Martin Hudson wrote:
> First, I am grateful for all the work that has been done to produce this project.
>
> Second, I noticed that the tag classes defined are not a complete representation of the HTML tags available. I may have missed a fundamental usage approach, but noticed, for example, that there is no 'H2'-specific tag. The filters appear to find the opening tag as "H2", but it is not associated with the end tag in any way, nor is the text that appears between them easily extractable without resorting to serial processing of the list. So, am I missing something in the approach, or is the tag list incomplete?
>
> Third, due perhaps to my ignorance about the overall 'zen' of the parser, I created a clone of the Span.java class, edited it to create an H2Tag.java class, and then registered it as a compound tag. It works beautifully. This worked for many other tags. So, if I am not completely missing the point, should I contribute these back to increase the tag coverage?
>
> Martin N. Hudson
> devIS - Development InfoStructure
From: Martin H. <Mh...@de...> - 2005-08-25 23:01:33
First, I am grateful for all the work that has been done to produce this project.

Second, I noticed that the tag classes defined are not a complete representation of the HTML tags available. I may have missed a fundamental usage approach, but noticed, for example, that there is no 'H2'-specific tag. The filters appear to find the opening tag as "H2", but it is not associated with the end tag in any way, nor is the text that appears between them easily extractable without resorting to serial processing of the list. So, am I missing something in the approach, or is the tag list incomplete?

Third, due perhaps to my ignorance about the overall 'zen' of the parser, I created a clone of the Span.java class, edited it to create an H2Tag.java class, and then registered it as a compound tag. It works beautifully. This worked for many other tags. So, if I am not completely missing the point, should I contribute these back to increase the tag coverage?

Martin N. Hudson
devIS - Development InfoStructure
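A sketch of the registration step Martin describes: a stand-in H2Tag (his actual class, cloned from Span, is not shown here) is registered with a PrototypicalNodeFactory, which is installed on the parser before parsing so the new compound tag is used. The URL is a placeholder.

```java
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class RegisterH2
{
    // minimal stand-in for the H2Tag class Martin describes cloning from Span
    public static class H2Tag extends CompositeTag
    {
        private static final String[] mIds = {"H2"};

        public String[] getIds ()
        {
            return (mIds);
        }
    }

    public static void main (String[] args) throws ParserException
    {
        // register the custom compound tag with the node factory
        PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
        factory.registerTag (new H2Tag ());

        Parser parser = new Parser ("http://example.com/page.html"); // placeholder URL
        parser.setNodeFactory (factory);

        NodeList nodes = parser.parse (null); // all nodes; filter as needed
        System.out.println (nodes.size () + " top-level nodes");
    }
}
```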
From: Derrick O. <Der...@Ro...> - 2005-04-21 22:12:08
I think you want to keepAllNodesThatMatch recursively (i.e. use the form with the boolean recursive argument):

nodelist.keepAllNodesThatMatch (new NotFilter (new NodeClassFilter (RemarkNode.class)), true);

That way, all the remarks that are children of the nodes in the node list (and their children, etc.) are also filtered out.

dualspacekimo wrote:
> I am confused with NodeList's methods, especially the NotFilter.
>
> If I want to filter out RemarkNodes, is it possible to achieve that with the following code? I have written the code, but it does not skip the comments in the HTML.
>
> NodeList nodelist = parser.extractAllNodesThatMatch(new HasAttributeFilter("class", "blogbody"));
>
> nodelist.keepAllNodesThatMatch(new NotFilter(new NodeClassFilter(RemarkNode.class)));
>
> SimpleNodeIterator Noderator = nodelist.elements();
> while (Noderator.hasMoreNodes()) {
>     ... // print the content
> } // end while