htmlparser-developer Mailing List for HTML Parser (Page 4)
Brought to you by: derrickoswald
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2001 |     |     |     |     |     |     |     |     |     | 4   | 1   | 4   |
| 2002 | 12  |     | 7   | 27  | 14  | 16  | 27  | 74  | 1   | 23  | 12  | 119 |
| 2003 | 31  | 23  | 28  | 59  | 119 | 10  | 3   | 17  | 8   | 38  | 6   | 1   |
| 2004 | 4   | 4   | 1   | 2   |     | 7   | 6   | 1   |     |     |     |     |
| 2005 |     | 1   |     | 8   |     |     |     | 2   | 10  | 4   | 15  |     |
| 2006 |     | 1   |     | 4   | 11  |     |     |     | 2   |     |     |     |
| 2007 | 3   | 2   |     | 2   |     |     | 1   |     |     |     |     |     |
| 2008 |     | 1   |     |     |     |     |     |     | 5   | 1   |     |     |
| 2009 |     | 1   |     | 2   |     | 4   |     | 1   |     |     |     | 2   |
| 2010 | 1   |     |     | 8   |     |     |     |     | 6   |     | 1   |     |
| 2011 |     |     |     |     | 3   |     |     |     |     |     |     |     |
| 2012 |     |     |     |     | 1   |     |     |     |     |     |     |     |
| 2014 |     |     |     |     | 1   |     |     |     |     |     |     |     |
| 2015 |     |     |     | 1   |     | 1   |     |     |     |     | 2   | 1   |
| 2016 |     |     |     |     |     |     | 2   |     |     |     | 2   | 2   |
From: Axel <ax...@gm...> - 2005-11-02 22:07:16
On 11/1/05, Ian Macfarlane <ian...@gm...> wrote:
> I was thinking it might be worthwhile adding a method to Text/TextNode along the lines of:
>
> boolean isWhiteSpace()
>
> which would return whether the TextNode consists solely of white-space characters (or is the empty String).
>
> Now this could simply be done using String.trim().equals(""), however that wouldn't account for:
>
> - the non-breaking space character (#160)
> - the numeric character reference &#160; (including the variants Firefox/IE also accept)
> - the named entity &nbsp; (including the variants Firefox/IE also accept)
>
> So my question is, do you think this method should treat those as spaces and remove/ignore them as well when determining whether the TextNode is white space? Or should it only trim normal whitespace (space, tab, carriage returns, etc.)?

I think isWhiteSpace() should return true if every character in the TextNode (with entities converted to Unicode characters) is true for Character#isWhitespace(). IMO the TextNode shouldn't be trimmed automatically; only a dedicated method should allow that.

--
Axel Kramer
http://www.plog4u.org - Wikipedia Eclipse Plugin
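A minimal sketch of the semantics Axel describes, as a hypothetical helper rather than existing parser API: the node's decoded text counts as white space only if every character passes Character.isWhitespace(). Note that Character.isWhitespace('\u00A0') is false in Java, so a decoded non-breaking space would still need special-casing if it is to be treated as white space.

```java
// Hypothetical sketch of the proposed check; "text" is assumed to be the
// TextNode's content with any entities already translated to characters.
public static boolean isWhiteSpace (String text)
{
    for (int i = 0; i < text.length (); i++)
        if (!Character.isWhitespace (text.charAt (i)))
            return (false);

    return (true); // the empty string is treated as white space
}
```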
From: Derrick O. <Der...@Ro...> - 2005-11-02 12:25:36
There aren't any design documents as far as I'm aware; the project has largely grown by accretion. What's in the JavaDocs is what there is. Somik Raha or Joshua Kerievsky may have other documents related to the design.

yangang li wrote:
> Dear Derrick:
> I am a graduate student and I am very interested in joining the HTML Parser project. I have done some research about it. Now I would like some materials about it, especially on the design.
> Yours, Yangang Li
From: yangang li <lhe...@ya...> - 2005-11-02 07:24:01
Dear Derrick:

I am a graduate student and I am very interested in joining the HTML Parser project. I have done some research about it. Now I would like some materials about it, especially on the design.

Yours, Yangang Li
From: Ian M. <ian...@gm...> - 2005-11-01 16:04:49
There doesn't seem to be any way of traversing from one top-level node (e.g. doctype, html, #text, etc.) to any of its siblings once you no longer have the original NodeList obtained from the Parser class. This means, for example, that you currently cannot implement a get-previous/next-sibling method on these nodes. A few quick ideas about how to do this (none of which are particularly ideal):

- Create a node named something like DocumentNode (we have HTMLPage, but that doesn't really seem to do what we want) whose children are the NodeList obtained from the Parser class, and make the Parser class create one of these instead of the plain NodeList it returns now.

- Have a method named something like getDocument on each tag, which simply returns a reference to the initial NodeList obtained from the Parser class.

Any other ideas are welcome.

Ian
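To make the limitation concrete, here is a small sketch (the class and method names are made up for illustration, not existing parser API) of how sibling lookup works while the original top-level NodeList is still in hand; the DocumentNode idea would make that list reachable from any of its members instead.

```java
import org.htmlparser.Node;
import org.htmlparser.util.NodeList;

// Illustrative helper only: given the parser's top-level NodeList and one of
// its members, find the next sibling by scanning the list.
public class Siblings
{
    public static Node nextSibling (NodeList topLevel, Node node)
    {
        for (int i = 0; i < topLevel.size () - 1; i++)
            if (topLevel.elementAt (i) == node)
                return (topLevel.elementAt (i + 1));

        return (null); // last node, or not a top-level node
    }
}
```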
From: Ian M. <ian...@gm...> - 2005-11-01 10:56:44
I was thinking it might be worthwhile adding a method to Text/TextNode along the lines of:

boolean isWhiteSpace()

which would return whether the TextNode consists solely of white-space characters (or is the empty String).

Now this could simply be done using String.trim().equals(""), however that wouldn't account for:

- the non-breaking space character (#160)
- the numeric character reference &#160; (including the variants Firefox/IE also accept)
- the named entity &nbsp; (including the variants Firefox/IE also accept)

So my question is, do you think this method should treat those as spaces and remove/ignore them as well when determining whether the TextNode is white space? Or should it only trim normal whitespace (space, tab, carriage returns, etc.)?

Thanks for your advice

Ian Macfarlane
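A small illustration of the gap Ian points out, assuming the entity has already been decoded to a character: String.trim() only strips characters up to '\u0020', so a decoded non-breaking space survives trimming and the text does not compare equal to the empty string.

```java
public class TrimGap
{
    public static void main (String[] args)
    {
        String ordinary = " \t\r\n";   // ordinary white space
        String nonBreaking = "\u00a0"; // a decoded &nbsp; / &#160;

        System.out.println (ordinary.trim ().equals (""));    // true
        System.out.println (nonBreaking.trim ().equals ("")); // false
    }
}
```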
From: Ian M. <ian...@gm...> - 2005-11-01 10:15:56
This has been checked into CVS.

On 10/27/05, Ian Macfarlane <ian...@gm...> wrote:
> Forwarded again as SourceForge blocked the zip attachments on the previous mail.
>
> ---------- Forwarded message ----------
> From: Ian Macfarlane <ian...@gm...>
> Date: Oct 27, 2005 1:35 PM
> Subject: More patches for HTMLParser
> To: Der...@ro...
> Cc: htm...@li...
>
> Dear Derrick,
>
> I've modified the custom definitions I had to write for p tags and the definition-list-related tags (dl, dd, dt).
>
> Notes:
>
> - ParagraphTag: there are a lot of tags listed as closing the P tag, but these all appear to be correct. I have tested it using the attached splitting_test.html file (a sample testing div splitting p) and Firefox's DOM inspector, and went through every tag in the HTML 4 specification; all the listed tag names caused the P tag to close in Firefox.
>
> - Definition lists: I did not simply add these as entries to the Bullet and BulletList classes, because testing (a few examples are shown in the attached list_test.html) showed that they behave differently from normal lists (for example, a dt closes a dd and vice versa, but an li does not close either of these, nor do they close an li).
>
> A question about PrototypicalNodeFactory: is there a reason that the div, span, body, head and html tag registrations in registerTags() are at the end rather than in alphabetical order like the other tags?
>
> Kind regards,
>
> Ian Macfarlane
>
> PS: I've tried the following command to add a file to CVS:
>
> cvs -d :ext:ian...@cv...:/cvsroot/htmlparser add src/org/htmlparser/tags/ParagraphTag.java
>
> And I get this response:
>
> cvs add: in directory .:
> cvs [add aborted]: there is no version here; do 'cvs checkout' first
>
> I'm new to using CVS, so I'm not entirely sure what I'm doing wrong here. Do you have any suggestions?
From: Ian M. <ian...@gm...> - 2005-11-01 10:14:02
This has been checked into CVS.

On 10/28/05, Ian Macfarlane <ian...@gm...> wrote:
> It looks like headings should definitely also extend CompositeTag. I think that's most of the essential ones now (although we still need THEAD, TBODY and TFOOT at some point).
>
> I've attached the file, HeadingTag.java, and also all the other tag changes to date, to make it easier to keep track of all the changes made so far.
>
> PrototypicalNodeFactory has also been updated with the new tag definitions (heading, paragraph and the definition-list tags).
>
> I promise I'll try and get CVS upload worked out soon ;)
>
> Best wishes
>
> Ian Macfarlane
From: Ian M. <ian...@gm...> - 2005-11-01 10:13:40
This has been checked into CVS.

On 10/27/05, Ian Macfarlane <ian...@gm...> wrote:
> Here are some updates to the various table-related tags, making them close on finding an open or close TBODY, THEAD or TFOOT.
>
> Sorry, I'm still trying to work out CVS write access.
>
> Ian
From: Ian M. <ian...@gm...> - 2005-10-28 16:33:55
It looks like headings should definitely also extend CompositeTag. I think that's most of the essential ones now (although we still need THEAD, TBODY and TFOOT at some point).

I've attached the file, HeadingTag.java, and also all the other tag changes to date, to make it easier to keep track of all the changes made so far.

PrototypicalNodeFactory has also been updated with the new tag definitions (heading, paragraph and the definition-list tags).

I promise I'll try and get CVS upload worked out soon ;)

Best wishes

Ian Macfarlane
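A minimal sketch of the shape such a class usually takes in this parser, as context for the discussion; the member names and the ender list below are assumptions for illustration, not Ian's attached HeadingTag.java.

```java
import org.htmlparser.tags.CompositeTag;

// Hypothetical sketch of a heading tag that extends CompositeTag,
// in the style of the parser's other composite tags.
public class HeadingTag extends CompositeTag
{
    private static final String[] mIds = {"H1", "H2", "H3", "H4", "H5", "H6"};

    // start tags that implicitly end an open heading (illustrative subset)
    private static final String[] mEnders = {"H1", "H2", "H3", "H4", "H5", "H6"};

    public String[] getIds ()
    {
        return (mIds);
    }

    public String[] getEnders ()
    {
        return (mEnders);
    }
}
```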
From: Ian M. <ian...@gm...> - 2005-10-27 15:34:37
Here are some updates to the various table-related tags, making them close on finding an open or close TBODY, THEAD or TFOOT.

Sorry, I'm still trying to work out CVS write access.

Ian
From: Ian M. <ian...@gm...> - 2005-10-27 14:54:33
Feature request 1291620

http://sourceforge.net/tracker/index.php?func=detail&aid=1291620&group_id=24399&atid=381402

has been fulfilled by patch 1338534:

http://sourceforge.net/tracker/index.php?func=detail&aid=1338534&group_id=24399&atid=381401

Please mark it as closed.

Ian
From: Ian M. <ian...@gm...> - 2005-10-27 14:02:08
Forwarded again as SourceForge blocked the zip attachments on the previous mail.

---------- Forwarded message ----------
From: Ian Macfarlane <ian...@gm...>
Date: Oct 27, 2005 1:35 PM
Subject: More patches for HTMLParser
To: Der...@ro...
Cc: htm...@li...

Dear Derrick,

I've modified the custom definitions I had to write for p tags and the definition-list-related tags (dl, dd, dt).

Notes:

- ParagraphTag: there are a lot of tags listed as closing the P tag, but these all appear to be correct. I have tested it using the attached splitting_test.html file (a sample testing div splitting p) and Firefox's DOM inspector, and went through every tag in the HTML 4 specification; all the listed tag names caused the P tag to close in Firefox.

- Definition lists: I did not simply add these as entries to the Bullet and BulletList classes, because testing (a few examples are shown in the attached list_test.html) showed that they behave differently from normal lists (for example, a dt closes a dd and vice versa, but an li does not close either of these, nor do they close an li).

A question about PrototypicalNodeFactory: is there a reason that the div, span, body, head and html tag registrations in registerTags() are at the end rather than in alphabetical order like the other tags?

Kind regards,

Ian Macfarlane

PS: I've tried the following command to add a file to CVS:

cvs -d :ext:ian...@cv...:/cvsroot/htmlparser add src/org/htmlparser/tags/ParagraphTag.java

And I get this response:

cvs add: in directory .:
cvs [add aborted]: there is no version here; do 'cvs checkout' first

I'm new to using CVS, so I'm not entirely sure what I'm doing wrong here. Do you have any suggestions?
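As an illustration of the approach Ian describes, a hypothetical composite tag whose ender lists close an open P when certain other tags appear; the handful of enders shown here is only an illustrative subset assumed for the sketch, not the contents of the attached ParagraphTag.java.

```java
import org.htmlparser.tags.CompositeTag;

// Hypothetical sketch only; not the attached ParagraphTag.java.
public class ParagraphTag extends CompositeTag
{
    private static final String[] mIds = {"P"};

    // illustrative subset of start tags that implicitly close an open P
    private static final String[] mEnders = {"P", "DIV", "TABLE", "UL", "OL", "DL", "BLOCKQUOTE"};

    // end tags that also close an open P
    private static final String[] mEndTagEnders = {"DIV", "BODY", "HTML"};

    public String[] getIds ()
    {
        return (mIds);
    }

    public String[] getEnders ()
    {
        return (mEnders);
    }

    public String[] getEndTagEnders ()
    {
        return (mEndTagEnders);
    }
}
```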
From: Derrick O. <Der...@Ro...> - 2005-09-29 11:24:46
|
Sorry, I fired from the hip in a hurry and didn't even see the attachment. I'll give it a better look when I get some time. Matthew Buckett wrote: > Derrick Oswald wrote: > >> It's zero based, unlike the usual text editor counting. > > > Yeah, but I'm passing in the position: > > Page.getLine(int position) > Get the text line the position of the cursor lies on. > > So if I parse "line0\nline1\nline2\n". > then call page.getLine(8) I should get back "line1\n" but I get > "line2\n"; > > row(8) correctly gives back 1 (zero based line number). But > mIndex.elementAt(1) returns the end of row 1 (position 12) then the > line is incremented and mIndex.elementAt(2) returns the end of row 2 > (position 18). This is then passed to getText which returns the text > for the last row. > > Try the tests without the patch and they fail. Are you saying my tests > should fail? > >> Matthew Buckett wrote: >> >>> Page.getLine always seems to return the previous line. Attached are >>> some tests that show this. It seems that the documentation on >>> PageIndex says it should be the index the the first character of the >>> line but it is actually set as being the position of the newline. >>> >>> I've attached a fix to Page.getLine() that makes it work but I don't >>> know if the correct fix change PageIndex so that the index of the >>> start of the line is put in it instead. >>> >>> ------------------------------------------------------------------------ >>> >>> >>> Index: Page.java >>> =================================================================== >>> RCS file: >>> /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v >>> retrieving revision 1.51 >>> diff -u -r1.51 Page.java >>> --- Page.java 20 Jun 2005 01:56:32 -0000 1.51 >>> +++ Page.java 28 Sep 2005 16:16:14 -0000 >>> @@ -1106,12 +1106,12 @@ >>> size = mIndex.size (); >>> if (line < size) >>> { >>> - start = mIndex.elementAt (line); >>> - line++; >>> - if (line <= size) >>> - end = mIndex.elementAt (line); >>> + end = mIndex.elementAt (line); >>> + line--; >>> + if (line >= 0) >>> + start = mIndex.elementAt (line); >>> else >>> - end = mSource.offset (); >>> + start = 0; >>> } >>> else // current line >>> { >>> >>> >>> ------------------------------------------------------------------------ >>> >>> >>> /* >>> ====================================================================== >>> The Bodington System Software License, Version 1.0 >> > > Sorry Eclipse was still configured for the wrong project... 
> > >>> package org.htmlparser.tests; >>> >>> import junit.framework.TestCase; >>> >>> import org.htmlparser.Node; >>> import org.htmlparser.Parser; >>> import org.htmlparser.filters.TagNameFilter; >>> import org.htmlparser.util.NodeList; >>> import org.htmlparser.util.ParserException; >>> >>> public class LineTests extends TestCase >>> { >>> public void testGetLine1() throws ParserException { >>> Parser parser = getParser(); >>> NodeList list = parser.parse(new TagNameFilter("h1")); >>> Node node = list.elementAt(0); >>> assertEquals("<h1>Line 1</h1>\n", node.getPage().getLine( >>> node.getStartPosition())); >>> } >>> public void testGetLine2() throws ParserException { >>> Parser parser = getParser(); >>> NodeList list = parser.parse(new TagNameFilter("h2")); >>> Node node = list.elementAt(0); >>> assertEquals("<h2>Line 2</h2>\n", node.getPage().getLine( >>> node.getStartPosition())); >>> } >>> public void testGetLine3() throws ParserException { >>> Parser parser = getParser(); >>> NodeList list = parser.parse(new TagNameFilter("h3")); >>> Node node = list.elementAt(0); >>> assertEquals("<h3>Line 3</h3>\n", node.getPage().getLine( >>> node.getStartPosition())); >>> } >>> public Parser getParser() >>> { >>> Parser parser = new Parser(); >>> try >>> { >>> parser.setInputHTML( >>> "<h1>Line 1</h1>\n"+ >>> "<h2>Line 2</h2>\n"+ >>> "<h3>Line 3</h3>\n" >>> ); >>> } >>> catch (ParserException e) >>> { >>> fail("Failed to parse"); >>> } >>> return parser; >>> } >>> } >>> >>> >> >> >> >> ------------------------------------------------------- >> This SF.Net email is sponsored by: >> Power Architecture Resource Center: Free content, downloads, >> discussions, >> and more. http://solutions.newsforge.com/ibmarch.tmpl >> _______________________________________________ >> Htmlparser-developer mailing list >> Htm...@li... >> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer >> >> > > |
From: Matthew B. <mat...@co...> - 2005-09-29 09:29:50
|
Derrick Oswald wrote: > It's zero based, unlike the usual text editor counting. Yeah, but I'm passing in the position: Page.getLine(int position) Get the text line the position of the cursor lies on. So if I parse "line0\nline1\nline2\n". then call page.getLine(8) I should get back "line1\n" but I get "line2\n"; row(8) correctly gives back 1 (zero based line number). But mIndex.elementAt(1) returns the end of row 1 (position 12) then the line is incremented and mIndex.elementAt(2) returns the end of row 2 (position 18). This is then passed to getText which returns the text for the last row. Try the tests without the patch and they fail. Are you saying my tests should fail? > Matthew Buckett wrote: > >> Page.getLine always seems to return the previous line. Attached are >> some tests that show this. It seems that the documentation on >> PageIndex says it should be the index the the first character of the >> line but it is actually set as being the position of the newline. >> >> I've attached a fix to Page.getLine() that makes it work but I don't >> know if the correct fix change PageIndex so that the index of the >> start of the line is put in it instead. >> >> ------------------------------------------------------------------------ >> >> Index: Page.java >> =================================================================== >> RCS file: >> /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v >> retrieving revision 1.51 >> diff -u -r1.51 Page.java >> --- Page.java 20 Jun 2005 01:56:32 -0000 1.51 >> +++ Page.java 28 Sep 2005 16:16:14 -0000 >> @@ -1106,12 +1106,12 @@ >> size = mIndex.size (); >> if (line < size) >> { >> - start = mIndex.elementAt (line); >> - line++; >> - if (line <= size) >> - end = mIndex.elementAt (line); >> + end = mIndex.elementAt (line); >> + line--; >> + if (line >= 0) >> + start = mIndex.elementAt (line); >> else >> - end = mSource.offset (); >> + start = 0; >> } >> else // current line >> { >> >> >> ------------------------------------------------------------------------ >> >> /* ====================================================================== >> The Bodington System Software License, Version 1.0 Sorry Eclipse was still configured for the wrong project... 
>> package org.htmlparser.tests; >> >> import junit.framework.TestCase; >> >> import org.htmlparser.Node; >> import org.htmlparser.Parser; >> import org.htmlparser.filters.TagNameFilter; >> import org.htmlparser.util.NodeList; >> import org.htmlparser.util.ParserException; >> >> public class LineTests extends TestCase >> { >> public void testGetLine1() throws ParserException { >> Parser parser = getParser(); >> NodeList list = parser.parse(new TagNameFilter("h1")); >> Node node = list.elementAt(0); >> assertEquals("<h1>Line 1</h1>\n", node.getPage().getLine( >> node.getStartPosition())); >> } >> public void testGetLine2() throws ParserException { >> Parser parser = getParser(); >> NodeList list = parser.parse(new TagNameFilter("h2")); >> Node node = list.elementAt(0); >> assertEquals("<h2>Line 2</h2>\n", node.getPage().getLine( >> node.getStartPosition())); >> } >> public void testGetLine3() throws ParserException { >> Parser parser = getParser(); >> NodeList list = parser.parse(new TagNameFilter("h3")); >> Node node = list.elementAt(0); >> assertEquals("<h3>Line 3</h3>\n", node.getPage().getLine( >> node.getStartPosition())); >> } >> public Parser getParser() >> { >> Parser parser = new Parser(); >> try >> { >> parser.setInputHTML( >> "<h1>Line 1</h1>\n"+ >> "<h2>Line 2</h2>\n"+ >> "<h3>Line 3</h3>\n" >> ); >> } >> catch (ParserException e) >> { >> fail("Failed to parse"); >> } >> return parser; >> } >> } >> >> > > > > ------------------------------------------------------- > This SF.Net email is sponsored by: > Power Architecture Resource Center: Free content, downloads, discussions, > and more. http://solutions.newsforge.com/ibmarch.tmpl > _______________________________________________ > Htmlparser-developer mailing list > Htm...@li... > https://lists.sourceforge.net/lists/listinfo/htmlparser-developer > > -- +--Matthew Buckett-----------------------------------------+ | VLE Developer, Learning Technologies Group | | Tel: +44 (0) 1865 283660 http://www.oucs.ox.ac.uk/ | +------------Computing Services, University of Oxford------+ |
From: Derrick O. <Der...@Ro...> - 2005-09-28 22:26:24
|
It's zero based, unlike the usual text editor counting. Matthew Buckett wrote: > Page.getLine always seems to return the previous line. Attached are > some tests that show this. It seems that the documentation on > PageIndex says it should be the index the the first character of the > line but it is actually set as being the position of the newline. > > I've attached a fix to Page.getLine() that makes it work but I don't > know if the correct fix change PageIndex so that the index of the > start of the line is put in it instead. > >------------------------------------------------------------------------ > >Index: Page.java >=================================================================== >RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v >retrieving revision 1.51 >diff -u -r1.51 Page.java >--- Page.java 20 Jun 2005 01:56:32 -0000 1.51 >+++ Page.java 28 Sep 2005 16:16:14 -0000 >@@ -1106,12 +1106,12 @@ > size = mIndex.size (); > if (line < size) > { >- start = mIndex.elementAt (line); >- line++; >- if (line <= size) >- end = mIndex.elementAt (line); >+ end = mIndex.elementAt (line); >+ line--; >+ if (line >= 0) >+ start = mIndex.elementAt (line); > else >- end = mSource.offset (); >+ start = 0; > } > else // current line > { > > >------------------------------------------------------------------------ > >/* ====================================================================== >The Bodington System Software License, Version 1.0 > >Copyright (c) 2001 The University of Leeds. All rights reserved. > >Redistribution and use in source and binary forms, with or without >modification, are permitted provided that the following conditions are >met: > >1. Redistributions of source code must retain the above copyright notice, >this list of conditions and the following disclaimer. > >2. Redistributions in binary form must reproduce the above copyright >notice, this list of conditions and the following disclaimer in the >documentation and/or other materials provided with the distribution. > >3. The end-user documentation included with the redistribution, if any, >must include the following acknowledgement: "This product includes >software developed by the University of Leeds >(http://www.bodington.org/)." Alternately, this acknowledgement may >appear in the software itself, if and wherever such third-party >acknowledgements normally appear. > >4. The names "Bodington", "Nathan Bodington", "Bodington System", >"Bodington Open Source Project", and "The University of Leeds" must not be >used to endorse or promote products derived from this software without >prior written permission. For written permission, please contact >d.g...@le.... > >5. The name "Bodington" may not appear in the name of products derived >from this software without prior written permission of the University of >Leeds. > >THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED >WARRANTIES, INCLUDING, BUT NOT LIMITED TO, TITLE, THE IMPLIED WARRANTIES >OF QUALITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO >EVENT SHALL THE UNIVERSITY OF LEEDS OR ITS CONTRIBUTORS BE LIABLE FOR >ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL >DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE >GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) >HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, >STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN >ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE >POSSIBILITY OF SUCH DAMAGE. >========================================================= > >This software was originally created by the University of Leeds and may contain voluntary >contributions from others. For more information on the Bodington Open Source Project, please >see http://bodington.org/ > >====================================================================== */ > >package org.htmlparser.tests; > >import junit.framework.TestCase; > >import org.htmlparser.Node; >import org.htmlparser.Parser; >import org.htmlparser.filters.TagNameFilter; >import org.htmlparser.util.NodeList; >import org.htmlparser.util.ParserException; > >public class LineTests extends TestCase >{ > public void testGetLine1() throws ParserException { > Parser parser = getParser(); > NodeList list = parser.parse(new TagNameFilter("h1")); > Node node = list.elementAt(0); > assertEquals("<h1>Line 1</h1>\n", node.getPage().getLine( > node.getStartPosition())); > } > > public void testGetLine2() throws ParserException { > Parser parser = getParser(); > NodeList list = parser.parse(new TagNameFilter("h2")); > Node node = list.elementAt(0); > assertEquals("<h2>Line 2</h2>\n", node.getPage().getLine( > node.getStartPosition())); > } > > public void testGetLine3() throws ParserException { > Parser parser = getParser(); > NodeList list = parser.parse(new TagNameFilter("h3")); > Node node = list.elementAt(0); > assertEquals("<h3>Line 3</h3>\n", node.getPage().getLine( > node.getStartPosition())); > } > > public Parser getParser() > { > Parser parser = new Parser(); > try > { > parser.setInputHTML( > "<h1>Line 1</h1>\n"+ > "<h2>Line 2</h2>\n"+ > "<h3>Line 3</h3>\n" > ); > } > catch (ParserException e) > { > fail("Failed to parse"); > } > return parser; > } >} > > |
From: Matthew B. <mat...@co...> - 2005-09-28 16:25:31
Page.getLine always seems to return the previous line. Attached are some tests that show this. The documentation on PageIndex says it should hold the index of the first character of each line, but it is actually set to the position of the newline.

I've attached a fix to Page.getLine() that makes it work, but I don't know whether the correct fix is instead to change PageIndex so that the index of the start of the line is stored in it.

--
Matthew Buckett, VLE Developer, Learning Technologies Group
Tel: +44 (0) 1865 283660  http://www.oucs.ox.ac.uk/
Computing Services, University of Oxford
From: Matthew B. <mat...@co...> - 2005-09-14 09:06:06
Derrick Oswald wrote:
> To answer your second question first, it's a legacy thing: trying to keep the base classes compatible with Java 1.x and avoiding the new Java Collections Framework. This can probably be revisited, since the goal of backward compatibility has less emphasis these days.

OK. Even if NodeList didn't implement java.util.List, having similar methods would make the learning curve smaller for most Java programmers.

On a slightly related note, is there a reason for using an array for the Nodes in NodeList but a Vector for the attributes in TagNode? Switching to a Vector in NodeList would make it easy to expose more flexible methods. Was an array originally chosen for performance reasons?

> Removing nodes from an underlying collection while an iterator is active on it is fraught with peril.

Nothing like living dangerously ;-)

> It might work in some cases (and I'm a little surprised it worked for you), but I think the better approach is to throw all the nodes to be deleted in a 'garbage bin' and remove them all later.

OK, I'll probably change to this approach. I was just going to use a filter, but then I either end up running multiple filters over the same tree or repeating the same tests on the nodes the filter returns to work out what I should do with them; neither seemed very sensible. Using a visitor I at least traverse the tree only once and can perform the alterations.

> Yes, the NodeList could use a remove(Node) call. You could add the patch to the Patches tracker (http://sourceforge.net/tracker/?group_id=24399&atid=381401) or the Request For Enhancement tracker (http://sourceforge.net/tracker/?group_id=24399&atid=381402), but it's probably good enough in the mail list here.

OK.

--
Matthew Buckett, VLE Developer, Learning Technologies Group
Tel: +44 (0) 1865 283660  http://www.oucs.ox.ac.uk/
Computing Services, University of Oxford
From: Derrick O. <Der...@Ro...> - 2005-09-13 22:54:34
To answer your second question first, it's a legacy thing: trying to keep the base classes compatible with Java 1.x and avoiding the new Java Collections Framework. This can probably be revisited, since the goal of backward compatibility has less emphasis these days.

Removing nodes from an underlying collection while an iterator is active on it is fraught with peril. It might work in some cases (and I'm a little surprised it worked for you), but I think the better approach is to throw all the nodes to be deleted in a 'garbage bin' and remove them all later.

Yes, the NodeList could use a remove(Node) call. You could add the patch to the Patches tracker (http://sourceforge.net/tracker/?group_id=24399&atid=381401) or the Request For Enhancement tracker (http://sourceforge.net/tracker/?group_id=24399&atid=381402), but it's probably good enough in the mail list here.

Matthew Buckett wrote:
> First, can I say thanks for htmlparser; it's really useful.
>
> I'm trying to remove a Tag that I am visiting (using a NodeVisitor) from its parent:
>
>     public void visitTag( Tag tag )
>     {
>         ....
>         NodeList children = tag.getParent().getChildren();
>         for (int child = 0; child < children.size(); child++)
>         {
>             if (tag.equals(children.elementAt(child)))
>             {
>                 children.remove(child);
>                 break;
>             }
>         }
>
> and would rather be able to do:
>
>     public void visitTag( Tag tag )
>     {
>         ....
>         NodeList children = tag.getParent().getChildren();
>         children.remove(tag);
>
> Comments?
>
> I was also wondering if there was a reason why NodeList doesn't implement java.util.List, as most Java programmers are already familiar with its semantics.
>
> I would have attached a patch, but I don't seem to be able to do a CVS diff against the SourceForge anonymous CVS at the moment (timeouts) :-(
>
> -- Added code --
>
>     /**
>      * Check to see if the NodeList contains the supplied Node.
>      * @param node The node to look for.
>      * @return True if the Node is in this NodeList.
>      */
>     public boolean contains(Node node) {
>         return indexOf(node) != -1;
>     }
>
>     /**
>      * Finds the index of the supplied Node.
>      * @param node The node to look for.
>      * @return The index of the node in the list or -1 if it isn't found.
>      */
>     public int indexOf(Node node) {
>         for (int i = 0; i < size; i++) {
>             if (nodeData.equals(node))
>                 return i;
>         }
>         return -1;
>     }
>
>     /**
>      * Remove the supplied Node from the list.
>      * @param node The node to remove.
>      * @return True if the node was found and removed from the list.
>      */
>     public boolean remove(Node node) {
>         int index = indexOf(node);
>         if (index != -1) {
>             remove(index);
>             return true;
>         }
>         return false;
>     }
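A rough sketch of the 'garbage bin' approach Derrick suggests, using only visitor and list calls that exist in the parser; the visitor class, the URL and the removal criterion (dropping FONT tags) are made up for illustration. Nodes are collected during the visit and detached from their parents only after the traversal finishes.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

// Illustrative sketch of deferring removals until after the traversal.
public class GarbageBinVisitor extends NodeVisitor
{
    private NodeList mGarbage = new NodeList ();

    public void visitTag (Tag tag)
    {
        if ("FONT".equalsIgnoreCase (tag.getTagName ())) // example criterion
            mGarbage.add (tag);                          // defer the removal
    }

    // call once the visit is finished
    public void emptyGarbage ()
    {
        for (int i = 0; i < mGarbage.size (); i++)
        {
            Node node = mGarbage.elementAt (i);
            Node parent = node.getParent ();
            if (null != parent)
            {
                NodeList children = parent.getChildren ();
                for (int j = 0; j < children.size (); j++)
                    if (node == children.elementAt (j))
                    {
                        children.remove (j); // remove by index; remove(Node) is the proposed addition
                        break;
                    }
            }
        }
    }

    public static void main (String[] args) throws ParserException
    {
        Parser parser = new Parser ("http://htmlparser.sourceforge.net"); // placeholder URL
        GarbageBinVisitor visitor = new GarbageBinVisitor ();
        parser.visitAllNodesWith (visitor);
        visitor.emptyGarbage ();
    }
}
```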
From: Matthew B. <mat...@co...> - 2005-09-13 13:30:34
Matthew Buckett wrote:

Sorry, I missed this:

> /**
>  * Finds the index of the supplied Node.
>  * @param node The node to look for.
>  * @return The index of the node in the list or -1 if it isn't found.
>  */
> public int indexOf(Node node) {
>     for (int i = 0; i < size; i++) {
>         if (nodeData.equals(node))

should have been:

    if (nodeData[i].equals(node))

--
Matthew Buckett, VLE Developer, Learning Technologies Group
Tel: +44 (0) 1865 283660  http://www.oucs.ox.ac.uk/
Computing Services, University of Oxford
From: Matthew B. <mat...@ou...> - 2005-09-13 13:25:05
First, can I say thanks for htmlparser; it's really useful.

I'm trying to remove a Tag that I am visiting (using a NodeVisitor) from its parent:

    public void visitTag( Tag tag )
    {
        ....
        NodeList children = tag.getParent().getChildren();
        for (int child = 0; child < children.size(); child++)
        {
            if (tag.equals(children.elementAt(child)))
            {
                children.remove(child);
                break;
            }
        }

and would rather be able to do:

    public void visitTag( Tag tag )
    {
        ....
        NodeList children = tag.getParent().getChildren();
        children.remove(tag);

Comments?

I was also wondering if there was a reason why NodeList doesn't implement java.util.List, as most Java programmers are already familiar with its semantics.

I would have attached a patch, but I don't seem to be able to do a CVS diff against the SourceForge anonymous CVS at the moment (timeouts) :-(

-- Added code --

    /**
     * Check to see if the NodeList contains the supplied Node.
     * @param node The node to look for.
     * @return True if the Node is in this NodeList.
     */
    public boolean contains(Node node) {
        return indexOf(node) != -1;
    }

    /**
     * Finds the index of the supplied Node.
     * @param node The node to look for.
     * @return The index of the node in the list or -1 if it isn't found.
     */
    public int indexOf(Node node) {
        for (int i = 0; i < size; i++) {
            if (nodeData.equals(node))
                return i;
        }
        return -1;
    }

    /**
     * Remove the supplied Node from the list.
     * @param node The node to remove.
     * @return True if the node was found and removed from the list.
     */
    public boolean remove(Node node) {
        int index = indexOf(node);
        if (index != -1) {
            remove(index);
            return true;
        }
        return false;
    }

--
Matthew Buckett, VLE Developer, Learning Technologies Group
Tel: +44 (0) 1865 283660  http://www.oucs.ox.ac.uk/
Computing Services, University of Oxford
From: Derrick O. <Der...@Ro...> - 2005-09-02 11:59:28
It was an oversight; there probably needs to be an explicit set/get of a recursion flag on the bean. The reason for there being a recursion flag on NodeList is to gain some control of the process: otherwise, just asking the nodes in the list to do their own filtering would automatically give recursive behaviour, and that may not be what is desired when processing the list.

Martin Hudson wrote:
> Having a great time with the tool. I ran into the following behaviour and wanted insight into the design decision and/or an alternative to my fix.
>
> I had previously been manually creating a parser, creating a filter, copying into a NodeList and then processing that NodeList with a new filter: basically a two-step filter. The first pass used relative context to grab chunks of HTML, while the second did a quick filter on the resulting elements to pull from that reduced list.
>
> I was refactoring the code to use the FilterBean class, as this seemed to offer an opportunity to simplify the code and handle the two filters in series automagically. The unexpected result (for me :-) ) was that the behaviour was not identical. It turns out that the filter bean explicitly does not recurse on the subsequent filter applications.
>
> From FilterBean, line 166:
>
> ret = ret.extractAllNodesThatMatch (getFilters ()[i], false);
>
> As a result the second filter can't find <A>s that are within <SPAN>s, for instance. My short-term hack was to set the recursion flag to true.
>
> Finally my question! Why is non-recursion the intended behaviour, given that it behaves differently from manually applying subsequent filters? Is my fix OK, or will it break some intended behaviour elsewhere?
>
> Martin N. Hudson
> devIS - Development InfoStructure
From: Martin H. <Mh...@de...> - 2005-09-01 14:22:34
Having a great time with the tool. I ran into the following behaviour and wanted insight into the design decision and/or an alternative to my fix.

I had previously been manually creating a parser, creating a filter, copying into a NodeList and then processing that NodeList with a new filter: basically a two-step filter. The first pass used relative context to grab chunks of HTML, while the second did a quick filter on the resulting elements to pull from that reduced list.

I was refactoring the code to use the FilterBean class, as this seemed to offer an opportunity to simplify the code and handle the two filters in series automagically. The unexpected result (for me :-) ) was that the behaviour was not identical. It turns out that the filter bean explicitly does not recurse on the subsequent filter applications.

From FilterBean, line 166:

ret = ret.extractAllNodesThatMatch (getFilters ()[i], false);

As a result the second filter can't find <A>s that are within <SPAN>s, for instance. My short-term hack was to set the recursion flag to true.

Finally my question! Why is non-recursion the intended behaviour, given that it behaves differently from manually applying subsequent filters? Is my fix OK, or will it break some intended behaviour elsewhere?

Martin N. Hudson
devIS - Development InfoStructure
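For comparison, a sketch of the manual two-step filtering Martin describes, with the second pass applied recursively via NodeList.extractAllNodesThatMatch(filter, true); the URL and the link filter are placeholders, and this shows the behaviour he expected rather than FilterBean's own implementation.

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TwoStepFilter
{
    public static void main (String[] args) throws ParserException
    {
        Parser parser = new Parser ("http://example.com/page.html"); // placeholder URL

        // first pass: grab the chunks of interest
        NodeList chunks = parser.parse (new HasAttributeFilter ("class", "blogbody"));

        // second pass: filter within those chunks, recursing into children
        // so that e.g. <A> elements nested inside <SPAN>s are still found
        NodeList links = chunks.extractAllNodesThatMatch (new TagNameFilter ("a"), true);

        System.out.println (links.size () + " links found");
    }
}
```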
From: Derrick O. <Der...@Ro...> - 2005-08-26 02:45:15
No, I think you've got the zen exactly right. The incomplete coverage is historical; the set has grown organically, with the tags people wanted most coming first. The downside of too many tags is that parsing gets more rigid: the closing tags are expected, and the penalty for their absence may be a bit harsh. You're welcome to contribute the classes back, but there may be a bit of push-back from users of the current set. Maybe not. There are probably ways to add the new tags without breaking existing applications.

Martin Hudson wrote:
> First, I am grateful for all the work that has been done to produce this project.
>
> Second, I noticed that the tag classes defined are not a complete representation of the HTML tags available. I may have missed a fundamental usage approach, but noticed, for example, that there is no 'H2'-specific tag. The filters appear to find the opening tag as "H2", but it is not associated with the end tag in any way, nor is the text that appears between them easily extractable without resorting to serial processing of the list. So, am I missing something in the approach, or is the tag list incomplete?
>
> Third, due perhaps to my ignorance about the overall 'zen' of the parser, I created a clone of the Span.java class, edited it to create an H2Tag.java class, and then registered it as a compound tag. It works beautifully. This worked for many other tags. So, if I am not completely missing the point, should I contribute these back to increase the tag coverage?
>
> Martin N. Hudson
> devIS - Development InfoStructure
From: Martin H. <Mh...@de...> - 2005-08-25 23:01:33
First, I am grateful for all the work that has been done to produce this project.

Second, I noticed that the tag classes defined are not a complete representation of the HTML tags available. I may have missed a fundamental usage approach, but noticed, for example, that there is no 'H2'-specific tag. The filters appear to find the opening tag as "H2", but it is not associated with the end tag in any way, nor is the text that appears between them easily extractable without resorting to serial processing of the list. So, am I missing something in the approach, or is the tag list incomplete?

Third, due perhaps to my ignorance about the overall 'zen' of the parser, I created a clone of the Span.java class, edited it to create an H2Tag.java class, and then registered it as a compound tag. It works beautifully. This worked for many other tags. So, if I am not completely missing the point, should I contribute these back to increase the tag coverage?

Martin N. Hudson
devIS - Development InfoStructure
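A sketch of the registration step Martin describes: a stand-in H2Tag (his actual class, cloned from Span, is not shown here) is registered with a PrototypicalNodeFactory, which is installed on the parser before parsing so the new compound tag is used. The URL is a placeholder.

```java
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class RegisterH2
{
    // minimal stand-in for the H2Tag class Martin describes cloning from Span
    public static class H2Tag extends CompositeTag
    {
        private static final String[] mIds = {"H2"};

        public String[] getIds ()
        {
            return (mIds);
        }
    }

    public static void main (String[] args) throws ParserException
    {
        // register the custom compound tag with the node factory
        PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
        factory.registerTag (new H2Tag ());

        Parser parser = new Parser ("http://example.com/page.html"); // placeholder URL
        parser.setNodeFactory (factory);

        NodeList nodes = parser.parse (null); // all nodes; filter as needed
        System.out.println (nodes.size () + " top-level nodes");
    }
}
```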
From: Derrick O. <Der...@Ro...> - 2005-04-21 22:12:08
I think you want to keepAllNodesThatMatch recursively (i.e. use the form with the boolean recursive argument):

nodelist.keepAllNodesThatMatch (new NotFilter (new NodeClassFilter (RemarkNode.class)), true);

That way, all the remarks that are children of the nodes in the node list (and their children, etc.) are also filtered out.

dualspacekimo wrote:
> I am confused with NodeList's methods, especially the NotFilter.
>
> If I want to filter out RemarkNodes, is it possible to achieve that with the following code? I have written the code, but it does not skip the comments in the HTML.
>
> NodeList nodelist = parser.extractAllNodesThatMatch(new HasAttributeFilter("class", "blogbody"));
>
> nodelist.keepAllNodesThatMatch(new NotFilter(new NodeClassFilter(RemarkNode.class)));
>
> SimpleNodeIterator Noderator = nodelist.elements();
> while (Noderator.hasMoreNodes()) {
>     ... // print the content
> } // end while