htmlparser-cvs Mailing List for HTML Parser (Page 28)
Brought to you by:
derrickoswald
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications
In directory sc8-pr-cvs1:/tmp/cvs-serv1466/src/org/htmlparser/parserapplications

Modified Files:
    LinkExtractor.java MailRipper.java Robot.java StringExtractor.java package.html
Log Message:
Update version headers to 1.4-20031207 and update changelog.

Index: LinkExtractor.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/LinkExtractor.java,v
retrieving revision 1.48
retrieving revision 1.49
diff -C2 -d -r1.48 -r1.49
*** LinkExtractor.java 7 Dec 2003 23:41:40 -0000 1.48
--- LinkExtractor.java 8 Dec 2003 01:31:52 -0000 1.49
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: MailRipper.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/MailRipper.java,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** MailRipper.java 7 Dec 2003 23:41:40 -0000 1.49
--- MailRipper.java 8 Dec 2003 01:31:52 -0000 1.50
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: Robot.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/Robot.java,v
retrieving revision 1.51
retrieving revision 1.52
diff -C2 -d -r1.51 -r1.52
*** Robot.java 7 Dec 2003 23:41:40 -0000 1.51
--- Robot.java 8 Dec 2003 01:31:52 -0000 1.52
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: StringExtractor.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/StringExtractor.java,v
retrieving revision 1.44
retrieving revision 1.45
diff -C2 -d -r1.44 -r1.45
*** StringExtractor.java 9 Nov 2003 17:07:09 -0000 1.44
--- StringExtractor.java 8 Dec 2003 01:31:52 -0000 1.45
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/package.html,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** package.html 9 Nov 2003 17:07:09 -0000 1.17
--- package.html 8 Dec 2003 01:31:52 -0000 1.18
***************
*** 5,9 ****
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha
--- 5,9 ----
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes
In directory sc8-pr-cvs1:/tmp/cvs-serv1466/src/org/htmlparser/lexer/nodes

Modified Files:
    Attribute.java PageAttribute.java RemarkNode.java StringNode.java TagNode.java package.html
Log Message:
Update version headers to 1.4-20031207 and update changelog.

Index: Attribute.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/Attribute.java,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** Attribute.java 7 Dec 2003 23:41:40 -0000 1.15
--- Attribute.java 8 Dec 2003 01:31:51 -0000 1.16
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: PageAttribute.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/PageAttribute.java,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** PageAttribute.java 9 Nov 2003 17:07:09 -0000 1.4
--- PageAttribute.java 8 Dec 2003 01:31:51 -0000 1.5
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: RemarkNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/RemarkNode.java,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** RemarkNode.java 7 Dec 2003 23:41:40 -0000 1.14
--- RemarkNode.java 8 Dec 2003 01:31:51 -0000 1.15
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: StringNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/StringNode.java,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** StringNode.java 7 Dec 2003 23:41:40 -0000 1.15
--- StringNode.java 8 Dec 2003 01:31:51 -0000 1.16
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: TagNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/TagNode.java,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** TagNode.java 7 Dec 2003 23:41:40 -0000 1.25
--- TagNode.java 8 Dec 2003 01:31:52 -0000 1.26
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/package.html,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** package.html 9 Nov 2003 17:07:09 -0000 1.8
--- package.html 8 Dec 2003 01:31:52 -0000 1.9
***************
*** 7,11 ****
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha
--- 7,11 ----
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer
In directory sc8-pr-cvs1:/tmp/cvs-serv1466/src/org/htmlparser/lexer

Modified Files:
    Cursor.java Lexer.java Page.java PageIndex.java Source.java Stream.java package.html
Log Message:
Update version headers to 1.4-20031207 and update changelog.

Index: Cursor.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Cursor.java,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** Cursor.java 9 Nov 2003 17:07:08 -0000 1.14
--- Cursor.java 8 Dec 2003 01:31:51 -0000 1.15
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: Lexer.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Lexer.java,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** Lexer.java 7 Dec 2003 23:41:40 -0000 1.21
--- Lexer.java 8 Dec 2003 01:31:51 -0000 1.22
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: Page.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** Page.java 7 Dec 2003 23:41:40 -0000 1.27
--- Page.java 8 Dec 2003 01:31:51 -0000 1.28
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: PageIndex.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/PageIndex.java,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** PageIndex.java 9 Nov 2003 17:07:09 -0000 1.14
--- PageIndex.java 8 Dec 2003 01:31:51 -0000 1.15
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: Source.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Source.java,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** Source.java 9 Nov 2003 17:07:09 -0000 1.13
--- Source.java 8 Dec 2003 01:31:51 -0000 1.14
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: Stream.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Stream.java,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** Stream.java 9 Nov 2003 17:07:09 -0000 1.9
--- Stream.java 8 Dec 2003 01:31:51 -0000 1.10
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/package.html,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** package.html 9 Nov 2003 17:07:09 -0000 1.10
--- package.html 8 Dec 2003 01:31:51 -0000 1.11
***************
*** 7,11 ****
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha
--- 7,11 ----
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha

From: <der...@us...> - 2003-12-08 01:32:24

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters
In directory sc8-pr-cvs1:/tmp/cvs-serv1466/src/org/htmlparser/filters

Modified Files:
    package.html
Log Message:
Update version headers to 1.4-20031207 and update changelog.

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/package.html,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** package.html 9 Nov 2003 17:07:08 -0000 1.2
--- package.html 8 Dec 2003 01:31:51 -0000 1.3
***************
*** 7,11 ****
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha
--- 7,11 ----
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans
In directory sc8-pr-cvs1:/tmp/cvs-serv1466/src/org/htmlparser/beans

Modified Files:
    BeanyBaby.java HTMLLinkBean.java HTMLTextBean.java LinkBean.java StringBean.java package.html
Log Message:
Update version headers to 1.4-20031207 and update changelog.

Index: BeanyBaby.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/BeanyBaby.java,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** BeanyBaby.java 9 Nov 2003 17:07:08 -0000 1.19
--- BeanyBaby.java 8 Dec 2003 01:31:51 -0000 1.20
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: HTMLLinkBean.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/HTMLLinkBean.java,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** HTMLLinkBean.java 9 Nov 2003 17:07:08 -0000 1.19
--- HTMLLinkBean.java 8 Dec 2003 01:31:51 -0000 1.20
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: HTMLTextBean.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/HTMLTextBean.java,v
retrieving revision 1.20
retrieving revision 1.21
diff -C2 -d -r1.20 -r1.21
*** HTMLTextBean.java 9 Nov 2003 17:07:08 -0000 1.20
--- HTMLTextBean.java 8 Dec 2003 01:31:51 -0000 1.21
***************
*** 1,3 ****
! /// HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! /// HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: LinkBean.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/LinkBean.java,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** LinkBean.java 7 Dec 2003 23:41:39 -0000 1.24
--- LinkBean.java 8 Dec 2003 01:31:51 -0000 1.25
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: StringBean.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/StringBean.java,v
retrieving revision 1.33
retrieving revision 1.34
diff -C2 -d -r1.33 -r1.34
*** StringBean.java 7 Dec 2003 23:41:39 -0000 1.33
--- StringBean.java 8 Dec 2003 01:31:51 -0000 1.34
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/package.html,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** package.html 9 Nov 2003 17:07:08 -0000 1.17
--- package.html 8 Dec 2003 01:31:51 -0000 1.18
***************
*** 6,10 ****
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha
--- 6,10 ----
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser
In directory sc8-pr-cvs1:/tmp/cvs-serv1466/src/org/htmlparser

Modified Files:
    AbstractNode.java Node.java Parser.java RemarkNode.java StringNode.java StringNodeFactory.java package.html
Log Message:
Update version headers to 1.4-20031207 and update changelog.

Index: AbstractNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/AbstractNode.java,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** AbstractNode.java 9 Nov 2003 17:07:08 -0000 1.21
--- AbstractNode.java 8 Dec 2003 01:31:50 -0000 1.22
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: Node.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Node.java,v
retrieving revision 1.45
retrieving revision 1.46
diff -C2 -d -r1.45 -r1.46
*** Node.java 9 Nov 2003 17:07:08 -0000 1.45
--- Node.java 8 Dec 2003 01:31:51 -0000 1.46
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: Parser.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v
retrieving revision 1.76
retrieving revision 1.77
diff -C2 -d -r1.76 -r1.77
*** Parser.java 7 Dec 2003 23:41:39 -0000 1.76
--- Parser.java 8 Dec 2003 01:31:51 -0000 1.77
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
***************
*** 96,100 ****
  */
  public final static String
! VERSION_DATE = "Nov 09, 2003"
  ;
--- 96,100 ----
  */
  public final static String
! VERSION_DATE = "Dec 07, 2003"
  ;

Index: RemarkNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/RemarkNode.java,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** RemarkNode.java 7 Dec 2003 23:41:39 -0000 1.38
--- RemarkNode.java 8 Dec 2003 01:31:51 -0000 1.39
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: StringNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/StringNode.java,v
retrieving revision 1.46
retrieving revision 1.47
diff -C2 -d -r1.46 -r1.47
*** StringNode.java 7 Dec 2003 23:41:39 -0000 1.46
--- StringNode.java 8 Dec 2003 01:31:51 -0000 1.47
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: StringNodeFactory.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/StringNodeFactory.java,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** StringNodeFactory.java 7 Dec 2003 23:41:39 -0000 1.8
--- StringNodeFactory.java 8 Dec 2003 01:31:51 -0000 1.9
***************
*** 1,3 ****
! // HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //
--- 1,3 ----
! // HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  // Copyright (C) Dec 31, 2000 Somik Raha
  //

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/package.html,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** package.html 9 Nov 2003 17:07:08 -0000 1.18
--- package.html 8 Dec 2003 01:31:51 -0000 1.19
***************
*** 6,10 ****
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031109 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha
--- 6,10 ----
  @(#)package.html 1.60 98/01/27
! HTMLParser Library v1_4_20031207 - A java-based parser for HTML
  Copyright (C) Dec 31, 2000 Somik Raha

From: <der...@us...> - 2003-12-08 01:32:23

Update of /cvsroot/htmlparser/htmlparser/docs
In directory sc8-pr-cvs1:/tmp/cvs-serv1466/docs

Modified Files:
    changes.txt release.txt
Log Message:
Update version headers to 1.4-20031207 and update changelog.

Index: changes.txt
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/docs/changes.txt,v
retrieving revision 1.192
retrieving revision 1.193
diff -C2 -d -r1.192 -r1.193
*** changes.txt 9 Nov 2003 17:07:07 -0000 1.192
--- changes.txt 8 Dec 2003 01:31:49 -0000 1.193
***************
*** 13,16 ****
--- 13,133 ----
  *******************************************************************************

+ Integration Build 1.4 - 20031207
+ --------------------------------
+
+ 2003-12-07 18:41 derrickoswald
+
+ * src/org/htmlparser/: Parser.java, PrototypicalNodeFactory.java,
+ RemarkNode.java, StringNode.java, StringNodeFactory.java,
+ beans/LinkBean.java, beans/StringBean.java, lexer/Lexer.java,
+ lexer/Page.java, lexer/nodes/Attribute.java,
+ lexer/nodes/RemarkNode.java, lexer/nodes/StringNode.java,
+ lexer/nodes/TagNode.java, parserapplications/LinkExtractor.java,
+ parserapplications/MailRipper.java, parserapplications/Robot.java,
+ scanners/AppletScanner.java, scanners/BaseHrefScanner.java,
+ scanners/BodyScanner.java, scanners/BulletListScanner.java,
+ scanners/BulletScanner.java, scanners/DivScanner.java,
+ scanners/DoctypeScanner.java, scanners/FormScanner.java,
+ scanners/FrameScanner.java, scanners/FrameSetScanner.java,
+ scanners/HeadScanner.java, scanners/HtmlScanner.java,
+ scanners/ImageScanner.java, scanners/InputTagScanner.java,
+ scanners/LabelScanner.java, scanners/LinkScanner.java,
+ scanners/MetaTagScanner.java, scanners/OptionTagScanner.java,
+ scanners/ScriptScanner.java, scanners/SelectTagScanner.java,
+ scanners/SpanScanner.java, scanners/StyleScanner.java,
+ scanners/TableColumnScanner.java, scanners/TableRowScanner.java,
+ scanners/TableScanner.java, scanners/TextareaTagScanner.java,
+ scanners/TitleScanner.java, tags/CompositeTag.java,
+ tags/FormTag.java, tags/ImageTag.java, tags/InputTag.java,
+ tags/LabelTag.java, tags/LinkTag.java, tags/MetaTag.java,
+ tags/SelectTag.java, tags/TableColumn.java, tags/TableRow.java,
+ tags/TextareaTag.java, tests/FunctionalTests.java,
+ tests/InstanceofPerformanceTest.java,
+ tests/LineNumberAssignedByNodeReaderTest.java,
+ tests/ParserTest.java, tests/ParserTestCase.java,
+ tests/PerformanceTest.java, tests/filterTests/FilterTest.java,
+ tests/lexerTests/AttributeTests.java,
+ tests/lexerTests/TagTests.java,
+ tests/nodeDecoratorTests/DecodingNodeTest.java,
+ tests/nodeDecoratorTests/EscapeCharacterRemovingNodeTest.java,
+ tests/nodeDecoratorTests/NonBreakingSpaceConvertingNodeTest.java,
+ tests/parserHelperTests/RemarkNodeParserTest.java,
+ tests/parserHelperTests/StringParserTest.java,
+ tests/scannersTests/AllTests.java,
+ tests/scannersTests/AppletScannerTest.java,
+ tests/scannersTests/BaseHREFScannerTest.java,
+ tests/scannersTests/BodyScannerTest.java,
+ tests/scannersTests/BulletListScannerTest.java,
+ tests/scannersTests/BulletScannerTest.java,
+ tests/scannersTests/CompositeTagScannerTest.java,
+ tests/scannersTests/DivScannerTest.java,
+ tests/scannersTests/FormScannerTest.java,
+ tests/scannersTests/FrameScannerTest.java,
+ tests/scannersTests/FrameSetScannerTest.java,
+ tests/scannersTests/HeadScannerTest.java,
+ tests/scannersTests/HtmlTest.java,
+ tests/scannersTests/ImageScannerTest.java,
+ tests/scannersTests/InputTagScannerTest.java,
+ tests/scannersTests/JspScannerTest.java,
+ tests/scannersTests/LabelScannerTest.java,
+ tests/scannersTests/LinkScannerTest.java,
+ tests/scannersTests/MetaTagScannerTest.java,
+ tests/scannersTests/OptionTagScannerTest.java,
+ tests/scannersTests/ScriptScannerTest.java,
+ tests/scannersTests/SelectTagScannerTest.java,
+ tests/scannersTests/SpanScannerTest.java,
+ tests/scannersTests/StyleScannerTest.java,
+ tests/scannersTests/TableScannerTest.java,
+ tests/scannersTests/TextareaTagScannerTest.java,
+ tests/scannersTests/TitleScannerTest.java,
+ tests/scannersTests/XmlEndTagScanningTest.java,
+ tests/tagTests/AllTests.java, tests/tagTests/AppletTagTest.java,
+ tests/tagTests/BaseHrefTagTest.java,
+ tests/tagTests/BodyTagTest.java,
+ tests/tagTests/BulletListTagTest.java,
+ tests/tagTests/BulletTagTest.java,
+ tests/tagTests/CompositeTagTest.java,
+ tests/tagTests/DivTagTest.java, tests/tagTests/DoctypeTagTest.java,
+ tests/tagTests/EndTagTest.java, tests/tagTests/FormTagTest.java,
+ tests/tagTests/FrameSetTagTest.java,
+ tests/tagTests/FrameTagTest.java, tests/tagTests/HeadTagTest.java,
+ tests/tagTests/HtmlTagTest.java, tests/tagTests/ImageTagTest.java,
+ tests/tagTests/InputTagTest.java, tests/tagTests/JspTagTest.java,
+ tests/tagTests/LabelTagTest.java, tests/tagTests/LinkTagTest.java,
+ tests/tagTests/MetaTagTest.java,
+ tests/tagTests/ObjectCollectionTest.java,
+ tests/tagTests/OptionTagTest.java,
+ tests/tagTests/ScriptTagTest.java,
+ tests/tagTests/SelectTagTest.java, tests/tagTests/SpanTagTest.java,
+ tests/tagTests/StyleTagTest.java, tests/tagTests/TableTagTest.java,
+ tests/tagTests/TagTest.java, tests/tagTests/TextareaTagTest.java,
+ tests/tagTests/TitleTagTest.java, tests/utilTests/BeanTest.java,
+ tests/utilTests/HTMLLinkProcessorTest.java,
+ tests/visitorsTests/HtmlPageTest.java,
+ tests/visitorsTests/LinkFindingVisitorTest.java,
+ tests/visitorsTests/TextExtractingVisitorTest.java,
+ util/Generate.java, util/ParserUtils.java, util/Translate.java,
+ visitors/HtmlPage.java, visitors/NodeVisitor.java,
+ visitors/UrlModifyingVisitor.java:
+
+ Remove most of the scanners.
+ The only scanners left are ones that really do something different (script and jsp).
+ Instead of registering a scanner to enable returning a specific tag you now add a
+ tag to the a PrototypicalNodeFactory. All known tags are 'registered' by default
+ in a new Parser which is similar to having called the old 'registerDOMScanners()',
+ so tags are fully nested. This is different behaviour, and specifically,
+ you will need to recurse into returned nodes to get at what you want.
+ I've tried to adjust the applications accordingly, but worked examples are still scarce.
+ If you want to return only some of the derived tags while keeping most as generic tags,
+ there are various constructors and manipulators on the factory. See the javadocs
+ and examples in the tests package.
+ Nearly all the old scanner tests are folded into the tag tests.
+
+ toString() has been revamped.
+ This means that the default Parser mainline now returns an indented listing of tags,
+ making it easy to see the structure of a page. The downside is the text of the page
+ had to have newlines, tabs etc. turned into escape sequences. But if you were really
+ interested in content you would be using toHtml() or toPlainTextString().
+
  Integration Build 1.4 - 20031109
  --------------------------------

Index: release.txt
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/docs/release.txt,v
retrieving revision 1.51
retrieving revision 1.52
diff -C2 -d -r1.51 -r1.52
*** release.txt 9 Nov 2003 17:07:07 -0000 1.51
--- release.txt 8 Dec 2003 01:31:50 -0000 1.52
***************
*** 1,3 ****
! HTMLParser Version 1.4 (Integration Build Nov 09, 2003)
  *********************************************
--- 1,3 ----
! HTMLParser Version 1.4 (Integration Build Dec 07, 2003)
  *********************************************

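The changelog entry above describes two behavioural changes: specific tag classes now come from a factory of registered prototypes, and parsed output is fully nested, so callers have to recurse into child nodes. What follows is a minimal, self-contained Java sketch of that idea only. Every type in it (`SketchNode`, `SketchImage`, `SketchFactory`) is hypothetical and is not part of org.htmlparser; for the real calls, see the javadocs and the tests package as the entry suggests. The `dump` method loosely mirrors the revamped toString() listing, under the assumption that only newlines, tabs and carriage returns need escaping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustration only: none of these types belong to org.htmlparser.
class NestedTagSketch {

    // Hypothetical generic node: a tag name, some text, and nested children.
    static class SketchNode {
        final String name;
        final String text;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name, String text) {
            this.name = name;
            this.text = text;
        }

        SketchNode add(SketchNode child) {
            children.add(child);
            return this;
        }

        String kind() {
            return "generic";
        }
    }

    // Hypothetical derived node, standing in for a specific tag class.
    static class SketchImage extends SketchNode {
        SketchImage(String text) {
            super("img", text);
        }

        @Override
        String kind() {
            return "image";
        }
    }

    // Hypothetical factory: registered names yield derived nodes, the rest stay generic.
    static class SketchFactory {
        private final Map<String, Function<String, SketchNode>> prototypes = new HashMap<>();

        void register(String name, Function<String, SketchNode> maker) {
            prototypes.put(name, maker);
        }

        SketchNode create(String name, String text) {
            Function<String, SketchNode> maker = prototypes.get(name);
            return maker != null ? maker.apply(text) : new SketchNode(name, text);
        }
    }

    // Recurse into children, because a derived node may sit at any depth.
    static void collect(SketchNode node, String kind, List<SketchNode> out) {
        if (node.kind().equals(kind)) {
            out.add(node);
        }
        for (SketchNode child : node.children) {
            collect(child, kind, out);
        }
    }

    // Turn line breaks and tabs into escape sequences so each node fits on one line.
    static String escape(String s) {
        return s.replace("\n", "\\n").replace("\t", "\\t").replace("\r", "\\r");
    }

    // Append an indented, one-line-per-node listing of the tree.
    static void dump(SketchNode node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.name);
        if (!node.text.isEmpty()) {
            out.append(": ").append(escape(node.text));
        }
        out.append('\n');
        for (SketchNode child : node.children) {
            dump(child, depth + 1, out);
        }
    }

    // Build a small nested tree through the factory.
    static SketchNode sample(SketchFactory f) {
        return f.create("html", "")
            .add(f.create("body", "")
                .add(f.create("p", "line one\n\tline two"))
                .add(f.create("div", "").add(f.create("img", "logo.png"))));
    }

    public static void main(String[] args) {
        SketchFactory factory = new SketchFactory();
        factory.register("img", SketchImage::new);
        SketchNode root = sample(factory);

        List<SketchNode> images = new ArrayList<>();
        collect(root, "image", images);
        System.out.println("image nodes found: " + images.size());

        StringBuilder listing = new StringBuilder();
        dump(root, 0, listing);
        System.out.print(listing);
    }
}
```

Registering only `img` leaves every other tag generic, which is this sketch's analogue of handing the factory a subset of derived tags. The image node sits three levels down, so a loop over the top-level nodes alone would miss it; that is the practical meaning of "you will need to recurse into returned nodes".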
From: <der...@us...> - 2003-12-07 23:42:15

Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests

Modified Files:
    FunctionalTests.java InstanceofPerformanceTest.java LineNumberAssignedByNodeReaderTest.java ParserTest.java ParserTestCase.java PerformanceTest.java
Log Message:
Remove most of the scanners.
The only scanners left are ones that really do something different (script and jsp).
Instead of registering a scanner to enable returning a specific tag you now add a
tag to a PrototypicalNodeFactory. All known tags are 'registered' by default
in a new Parser which is similar to having called the old 'registerDOMScanners()',
so tags are fully nested. This is different behaviour, and specifically,
you will need to recurse into returned nodes to get at what you want.
I've tried to adjust the applications accordingly, but worked examples are still scarce.
If you want to return only some of the derived tags while keeping most as generic tags,
there are various constructors and manipulators on the factory. See the javadocs
and examples in the tests package.
Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped.
This means that the default Parser mainline now returns an indented listing of tags,
making it easy to see the structure of a page. The downside is the text of the page
had to have newlines, tabs etc. turned into escape sequences. But if you were really
interested in content you would be using toHtml() or toPlainTextString().

Index: FunctionalTests.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/FunctionalTests.java,v
retrieving revision 1.50
retrieving revision 1.51
diff -C2 -d -r1.50 -r1.51
*** FunctionalTests.java 9 Nov 2003 17:07:13 -0000 1.50
--- FunctionalTests.java 7 Dec 2003 23:41:41 -0000 1.51
***************
*** 42,46 ****
  import org.htmlparser.Node;
  import org.htmlparser.Parser;
! import org.htmlparser.scanners.ImageScanner;
  import org.htmlparser.tags.ImageTag;
  import org.htmlparser.util.DefaultParserFeedback;
--- 42,46 ----
  import org.htmlparser.Node;
  import org.htmlparser.Parser;
! import org.htmlparser.PrototypicalNodeFactory;
  import org.htmlparser.tags.ImageTag;
  import org.htmlparser.util.DefaultParserFeedback;
***************
*** 89,93 ****
  public int countImageTagsWithHTMLParser() throws ParserException {
  Parser parser = new Parser("http://education.yahoo.com/",new DefaultParserFeedback());
! parser.addScanner(new ImageScanner("-i"));
  setParser (parser);
  int parserImgTagCount = 0;
--- 89,93 ----
  public int countImageTagsWithHTMLParser() throws ParserException {
  Parser parser = new Parser("http://education.yahoo.com/",new DefaultParserFeedback());
! parser.setNodeFactory (new PrototypicalNodeFactory (new ImageTag ()));
  setParser (parser);
  int parserImgTagCount = 0;

Index: InstanceofPerformanceTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/InstanceofPerformanceTest.java,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** InstanceofPerformanceTest.java 9 Nov 2003 17:07:13 -0000 1.17
--- InstanceofPerformanceTest.java 7 Dec 2003 23:41:41 -0000 1.18
***************
*** 35,43 ****
  import org.htmlparser.Parser;
  import org.htmlparser.tags.FormTag;
- import org.htmlparser.tests.scannersTests.FormScannerTest;
  import org.htmlparser.util.NodeIterator;
  import org.htmlparser.util.SimpleNodeIterator;

  public class InstanceofPerformanceTest {
  FormTag formTag;
  Vector formChildren;
--- 35,59 ----
  import org.htmlparser.Parser;
  import org.htmlparser.tags.FormTag;
  import org.htmlparser.util.NodeIterator;
  import org.htmlparser.util.SimpleNodeIterator;

  public class InstanceofPerformanceTest {
+
+ public static final String FORM_HTML =
+ "<FORM METHOD=\""+FormTag.POST+"\" ACTION=\"do_login.php\" NAME=\"login_form\" onSubmit=\"return CheckData()\">\n"+
+ "<TR><TD ALIGN=\"center\"> </TD></TR>\n"+
+ "<TR><TD ALIGN=\"center\"><FONT face=\"Arial, verdana\" size=2><b>User Name</b></font></TD></TR>\n"+
+ "<TR><TD ALIGN=\"center\"><INPUT TYPE=\"text\" NAME=\"name\" SIZE=\"20\"></TD></TR>\n"+
+ "<TR><TD ALIGN=\"center\"><FONT face=\"Arial, verdana\" size=2><b>Password</b></font></TD></TR>\n"+
+ "<TR><TD ALIGN=\"center\"><INPUT TYPE=\"password\" NAME=\"passwd\" SIZE=\"20\"></TD></TR>\n"+
+ "<TR><TD ALIGN=\"center\"> </TD></TR>\n"+
+ "<TR><TD ALIGN=\"center\"><INPUT TYPE=\"submit\" NAME=\"submit\" VALUE=\"Login\"></TD></TR>\n"+
+ "<TR><TD ALIGN=\"center\"> </TD></TR>\n"+
+ "<TEXTAREA name=\"Description\" rows=\"15\" cols=\"55\" wrap=\"virtual\" class=\"composef\" tabindex=\"5\">Contents of TextArea</TEXTAREA>\n"+
+ // "<TEXTAREA name=\"AnotherDescription\" rows=\"15\" cols=\"55\" wrap=\"virtual\" class=\"composef\" tabindex=\"5\">\n"+
+ "<INPUT TYPE=\"hidden\" NAME=\"password\" SIZE=\"20\">\n"+
+ "<INPUT TYPE=\"submit\">\n"+
+ "</FORM>";
+
  FormTag formTag;
  Vector formChildren;
***************
*** 45,51 ****
  Parser parser =
  Parser.createParser(
! FormScannerTest.FORM_HTML
  );
- parser.registerScanners();
  NodeIterator e = parser.elements();
  Node node = e.nextNode();
--- 61,66 ----
  Parser parser =
  Parser.createParser(
! FORM_HTML
  );
  NodeIterator e = parser.elements();
  Node node = e.nextNode();

Index: LineNumberAssignedByNodeReaderTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/LineNumberAssignedByNodeReaderTest.java,v
retrieving revision 1.28
retrieving revision 1.29
diff -C2 -d -r1.28 -r1.29
*** LineNumberAssignedByNodeReaderTest.java 9 Nov 2003 17:07:13 -0000 1.28
--- LineNumberAssignedByNodeReaderTest.java 7 Dec 2003 23:41:41 -0000 1.29
***************
*** 35,41 ****
  import junit.framework.TestSuite;
!
import org.htmlparser.tests.scannersTests.CompositeTagScannerTest.CustomScanner; import org.htmlparser.tests.scannersTests.CompositeTagScannerTest.CustomTag; import org.htmlparser.util.ParserException; /** * @author Somik Raha --- 35,42 ---- import junit.framework.TestSuite; ! import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.tests.scannersTests.CompositeTagScannerTest.CustomTag; import org.htmlparser.util.ParserException; + /** * @author Somik Raha *************** *** 145,149 **** private void testLineNumber(String xml, int numNodes, int useNode, int expectedStartLine, int expectedEndLine) throws ParserException { createParser(xml); ! parser.addScanner(new CustomScanner()); parseAndAssertNodeCount(numNodes); assertType("custom node",CustomTag.class,node[useNode]); --- 146,150 ---- private void testLineNumber(String xml, int numNodes, int useNode, int expectedStartLine, int expectedEndLine) throws ParserException { createParser(xml); ! parser.setNodeFactory (new PrototypicalNodeFactory (new CustomTag ())); parseAndAssertNodeCount(numNodes); assertType("custom node",CustomTag.class,node[useNode]); Index: ParserTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/ParserTest.java,v retrieving revision 1.49 retrieving revision 1.50 diff -C2 -d -r1.49 -r1.50 *** ParserTest.java 9 Nov 2003 17:07:14 -0000 1.49 --- ParserTest.java 7 Dec 2003 23:41:41 -0000 1.50 *************** *** 40,43 **** --- 40,44 ---- import org.htmlparser.Node; import org.htmlparser.Parser; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.StringNode; import org.htmlparser.filters.NodeClassFilter; *************** *** 45,53 **** import org.htmlparser.lexer.Lexer; import org.htmlparser.lexer.Page; - import org.htmlparser.scanners.FormScanner; import org.htmlparser.scanners.TagScanner; import org.htmlparser.tags.BodyTag; import org.htmlparser.tags.ImageTag; import 
org.htmlparser.tags.LinkTag; import org.htmlparser.tags.Tag; import org.htmlparser.util.DefaultParserFeedback; --- 46,54 ---- import org.htmlparser.lexer.Lexer; import org.htmlparser.lexer.Page; import org.htmlparser.scanners.TagScanner; import org.htmlparser.tags.BodyTag; import org.htmlparser.tags.ImageTag; import org.htmlparser.tags.LinkTag; + import org.htmlparser.tags.MetaTag; import org.htmlparser.tags.Tag; import org.htmlparser.util.DefaultParserFeedback; *************** *** 300,303 **** --- 301,305 ---- out.close (); parser = new Parser (connection); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); } catch (Exception e) *************** *** 352,355 **** --- 354,358 ---- out.close (); parser = new Parser (file.getAbsolutePath (), new DefaultParserFeedback(DefaultParserFeedback.QUIET)); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); nodes = new AbstractNode[30]; i = 0; *************** *** 404,408 **** { parser = new Parser("http://www.sony.co.jp", Parser.noFeedback); - parser.registerScanners (); assertEquals("Character set by default is ISO-8859-1", "ISO-8859-1", parser.getEncoding ()); enumeration = parser.elements(); --- 407,410 ---- *************** *** 432,435 **** --- 434,438 ---- parser = new Parser(url); + parser.setNodeFactory (new PrototypicalNodeFactory (new MetaTag ())); i = 0; nodes = new AbstractNode[30]; *************** *** 454,458 **** parser = new Parser(url); - parser.registerScanners (); for (NodeIterator e = parser.elements();e.hasMoreNodes();) e.nextNode(); --- 457,460 ---- *************** *** 475,479 **** parser = new Parser(url); - parser.registerScanners (); for (NodeIterator e = parser.elements();e.hasMoreNodes();) e.nextNode(); --- 477,480 ---- *************** *** 544,548 **** page.setConnection (connection); parser = new Parser (new Lexer (page)); - parser.registerScanners (); // must be the default assertTrue ("Wrong encoding", parser.getEncoding ().equals ("ISO-8859-1")); --- 545,548 ---- *************** 
*** 575,578 **** --- 575,579 ---- parser = new Parser(url); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); Node node [] = new AbstractNode[30]; int i = 0; *************** *** 636,640 **** "<p><font size=-2>©2002 Google</font><font size=-2> - Searching 3,083,324,652 web pages</font></center></body></html>\n" ); - parser.registerScanners(); NodeList collectionList = new NodeList(); NodeClassFilter filter = new NodeClassFilter (LinkTag.class); --- 637,640 ---- *************** *** 690,694 **** "</body>\n"+ "</html>"); - parser.registerScanners(); NodeList collectionList = new NodeList(); TagNameFilter filter = new TagNameFilter ("IMG"); --- 690,693 ---- *************** *** 703,717 **** } - public void testRemoveScanner() throws Exception { - createParser( - "" - ); - parser.registerScanners(); - parser.removeScanner(new FormScanner("",parser)); - Map scanners = parser.getScanners(); - TagScanner scanner = (TagScanner)scanners.get("FORM"); - assertNull("shouldnt have found scanner",scanner); - } - /** * See bug #728241 OutOfMemory error/ Infinite loop --- 702,705 ---- *************** *** 748,751 **** --- 736,740 ---- + "</table>\n" + "</body></html>"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); int i = 0; for (NodeIterator e = parser.elements();e.hasMoreNodes();) Index: ParserTestCase.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/ParserTestCase.java,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** ParserTestCase.java 9 Nov 2003 17:07:14 -0000 1.40 --- ParserTestCase.java 7 Dec 2003 23:41:41 -0000 1.41 *************** *** 67,71 **** protected void parse(String response) throws ParserException { createParser(response,10000); - parser.registerScanners(); parseNodes(); } --- 67,70 ---- Index: PerformanceTest.java =================================================================== RCS file: 
/cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/PerformanceTest.java,v retrieving revision 1.44 retrieving revision 1.45 diff -C2 -d -r1.44 -r1.45 *** PerformanceTest.java 9 Nov 2003 17:07:14 -0000 1.44 --- PerformanceTest.java 7 Dec 2003 23:41:41 -0000 1.45 *************** *** 89,93 **** // Create the parser object parser = new Parser(file,new DefaultParserFeedback()); - parser.registerScanners(); Node node; long start=System.currentTimeMillis(); --- 89,92 ---- |
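The log message above describes replacing per-tag scanners with prototype tags held in a node factory. The following is a minimal sketch of that idea under a simplified model. ProtoTag, ProtoImageTag and ProtoFactory are hypothetical stand-ins written for illustration; they are not the library's Tag, ImageTag or PrototypicalNodeFactory classes, and the library's actual lookup and copying logic may well differ.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical generic tag; stands in for a plain, undifferentiated tag. */
class ProtoTag implements Cloneable {
    String name = "";

    /** Tag names this prototype should be used for; none for the generic tag. */
    String[] getIds() {
        return new String[0];
    }

    /** Copy the prototype so each parsed tag gets its own instance. */
    ProtoTag copy() {
        try {
            return (ProtoTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new IllegalStateException(e);
        }
    }
}

/** Hypothetical derived tag, analogous in spirit to an image tag. */
class ProtoImageTag extends ProtoTag {
    @Override
    String[] getIds() {
        return new String[] {"IMG"};
    }
}

/** Hypothetical factory: maps upper-cased tag names to prototype instances. */
class ProtoFactory {
    private final Map<String, ProtoTag> prototypes = new HashMap<>();

    /** Register a prototype under every name it answers to. */
    void register(ProtoTag prototype) {
        for (String id : prototype.getIds()) {
            prototypes.put(id.toUpperCase(Locale.ROOT), prototype);
        }
    }

    /** Return a copy of the matching prototype, or a generic tag if none is registered. */
    ProtoTag createTag(String name) {
        ProtoTag prototype = prototypes.get(name.toUpperCase(Locale.ROOT));
        ProtoTag tag = (prototype == null) ? new ProtoTag() : prototype.copy();
        tag.name = name;
        return tag;
    }
}

class ProtoFactoryDemo {
    public static void main(String[] args) {
        ProtoFactory factory = new ProtoFactory();
        factory.register(new ProtoImageTag());
        // A registered name yields the derived type; anything else stays generic.
        System.out.println(factory.createTag("img").getClass().getSimpleName());
        System.out.println(factory.createTag("p").getClass().getSimpleName());
    }
}
```

In this model, registering only one prototype mirrors the tests above that pass a single tag to the factory: that tag comes back as its derived type and every other tag stays generic.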
From: <der...@us...> - 2003-12-07 23:42:15
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests/lexerTests

Modified Files:
    AttributeTests.java TagTests.java
Log Message:
Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs, etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
Index: AttributeTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/AttributeTests.java,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** AttributeTests.java 9 Nov 2003 17:07:14 -0000 1.6 --- AttributeTests.java 7 Dec 2003 23:41:41 -0000 1.7 *************** *** 35,38 **** --- 35,39 ---- import org.htmlparser.Parser; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.lexer.nodes.Attribute; import org.htmlparser.lexer.nodes.PageAttribute; *************** *** 68,71 **** --- 69,73 ---- html = "<" + tagContents + ">"; createParser (html); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); try { Index: TagTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/TagTests.java,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** TagTests.java 9 Nov 2003 17:07:14 -0000 1.6 --- TagTests.java 7 Dec 2003 23:41:41 -0000 1.7 *************** *** 33,36 **** --- 33,37 ---- import org.htmlparser.Node; import org.htmlparser.Parser; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.tags.LinkTag; import org.htmlparser.tags.MetaTag; *************** *** 92,95 **** --- 93,97 ---- createParser(testHtml); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertType("should be Tag",Tag.class,node[0]); *************** *** 107,110 **** --- 109,113 ---- String html = "<custom/>"; createParser(html); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertType("should be Tag",Tag.class,node[0]); *************** *** 121,124 **** --- 124,128 ---- public void testTagWithCloseTagSymbolInAttribute() throws ParserException { createParser("<tag att=\"a>b\">"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); 
parseAndAssertNodeCount(1); assertType("should be Tag",Tag.class,node[0]); *************** *** 129,132 **** --- 133,137 ---- public void testTagWithOpenTagSymbolInAttribute() throws ParserException { createParser("<tag att=\"a<b\">"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertType("should be Tag",Tag.class,node[0]); *************** *** 138,141 **** --- 143,147 ---- String html = "<tag att=\'a<b\'>"; createParser(html); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertType("should be Tag",Tag.class,node[0]); *************** *** 154,158 **** String html = "<meta name=\"foo\" content=\"foo<bar>\">"; createParser(html); - parser.registerScanners (); parseAndAssertNodeCount (1); assertType ("should be MetaTag", MetaTag.class, node[0]); --- 160,163 ---- *************** *** 169,173 **** String html = "<meta name=\"foo\" content=\"foo<bar\">"; createParser(html); - parser.registerScanners (); parseAndAssertNodeCount (1); assertType ("should be MetaTag", MetaTag.class, node[0]); --- 174,177 ---- *************** *** 184,188 **** String html = "<meta name=\"foo\" content=\"foobar>\">"; createParser(html); - parser.registerScanners (); parseAndAssertNodeCount (1); assertType ("should be MetaTag", MetaTag.class, node[0]); --- 188,191 ---- *************** *** 199,203 **** String html = "<meta name=\"foo\" content=\"foo\nbar>\">"; createParser(html); - parser.registerScanners (); parseAndAssertNodeCount (1); assertType ("should be MetaTag", MetaTag.class, node[0]); --- 202,205 ---- *************** *** 220,224 **** String html = "<meta name=\"foo\" content=\"<foo>\nbar\">"; createParser(html); - parser.registerScanners (); parseAndAssertNodeCount (1); assertType ("should be MetaTag", MetaTag.class, node[0]); --- 222,225 ---- *************** *** 241,245 **** String html = "<meta name=\"foo\" content=\"foo>\nbar\">"; createParser(html); - parser.registerScanners (); 
parseAndAssertNodeCount (1); assertType ("should be MetaTag", MetaTag.class, node[0]); --- 242,245 ---- *************** *** 262,266 **** String html = "<meta name=\"foo\" content=\"<foo\nbar\""; createParser(html); - parser.registerScanners (); parseAndAssertNodeCount (1); assertType ("should be MetaTag", MetaTag.class, node[0]); --- 262,265 ---- *************** *** 284,287 **** --- 283,287 ---- { createParser("<html></html>"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); String testHtml1 = "<a HREF=\"/cgi-bin/view_search?query_text=postdate>20020701&txt_clr=White&bg_clr=Red&url=http://localhost/Testing/Report1.html\">20020702 Report 1</A>" + TEST_HTML; *************** *** 361,367 **** this.id = id; this.max = max; ! this.parser = ! Parser.createParser(testHtml); ! parser.registerScanners(); } --- 361,365 ---- this.id = id; this.max = max; ! this.parser = Parser.createParser(testHtml); } *************** *** 411,414 **** --- 409,413 ---- String html = "<input disabled>"; createParser(html); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount (1); assertType ("should be Tag", Tag.class, node[0]); *************** *** 424,427 **** --- 423,427 ---- String html = "<input disabled=>"; createParser(html); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount (1); assertType ("should be Tag", Tag.class, node[0]); |
From: <der...@us...> - 2003-12-07 23:42:14
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/filterTests
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests/filterTests

Modified Files:
    FilterTest.java
Log Message:
Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs, etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
Index: FilterTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/filterTests/FilterTest.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** FilterTest.java 8 Nov 2003 21:30:58 -0000 1.1 --- FilterTest.java 7 Dec 2003 23:41:41 -0000 1.2 *************** *** 70,74 **** html = "<html>" + guts + "</html>"; createParser (html); - parser.registerDomScanners (); list = parser.extractAllNodesThatMatch (new NodeClassFilter (BodyTag.class)); assertEquals ("only one element", 1, list.size ()); --- 70,73 ---- *************** *** 93,97 **** html = "<html>" + guts + "</html>"; createParser (html); - parser.registerDomScanners (); list = parser.extractAllNodesThatMatch (new TagNameFilter ("booty")); assertEquals ("only one element", 1, list.size ()); --- 92,95 ---- *************** *** 112,116 **** html = "<html>" + guts + "</html>"; createParser (html); - parser.registerDomScanners (); list = parser.extractAllNodesThatMatch (new StringFilter ("Time")); assertEquals ("only one element", 1, list.size ()); --- 110,113 ---- *************** *** 134,138 **** html = "<html>" + guts + "</html>"; createParser (html); - parser.registerDomScanners (); list = parser.extractAllNodesThatMatch (new HasChildFilter (new TagNameFilter ("b"))); assertEquals ("only one element", 1, list.size ()); --- 131,134 ---- *************** *** 157,161 **** html = "<html>" + guts + "</html>"; createParser (html); - parser.registerDomScanners (); list = parser.extractAllNodesThatMatch (new HasAttributeFilter ("id")); assertEquals ("only one element", 1, list.size ()); --- 153,156 ---- *************** *** 177,181 **** html = "<html>" + guts + "</html>"; createParser (html); - parser.registerDomScanners (); list = parser.extractAllNodesThatMatch ( new AndFilter ( --- 172,175 ---- *************** *** 203,207 **** html = "<html>" + guts + "</html>"; createParser (html); - 
parser.registerDomScanners (); list = parser.extractAllNodesThatMatch ( new OrFilter ( --- 197,200 ---- *************** *** 232,236 **** html = "<html>" + guts + "</html>"; createParser (html); - parser.registerDomScanners (); list = parser.extractAllNodesThatMatch ( new AndFilter ( --- 225,228 ---- |
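Because tags are now fully nested by default, a caller who wants a particular tag has to look below the top-level nodes. A minimal sketch of such a recursive search follows. NestedNode and NestedSearch are hypothetical types invented for this example; they do not show how the library's extractAllNodesThatMatch or its filter classes are implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical tree node: a name plus an ordered list of children. */
class NestedNode {
    final String name;
    final List<NestedNode> children = new ArrayList<>();

    NestedNode(String name) {
        this.name = name;
    }

    /** Append a child and return this node so calls can be chained. */
    NestedNode add(NestedNode child) {
        children.add(child);
        return this;
    }
}

/** Hypothetical helper that walks the whole tree, not just the top level. */
class NestedSearch {
    /** Collect every node, at any depth, that the filter accepts. */
    static List<NestedNode> findAll(NestedNode root, Predicate<NestedNode> filter) {
        List<NestedNode> matches = new ArrayList<>();
        collect(root, filter, matches);
        return matches;
    }

    /** Depth-first, pre-order traversal: test the node, then recurse into children. */
    private static void collect(NestedNode node, Predicate<NestedNode> filter, List<NestedNode> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (NestedNode child : node.children) {
            collect(child, filter, out);
        }
    }

    public static void main(String[] args) {
        NestedNode html = new NestedNode("HTML")
            .add(new NestedNode("BODY")
                .add(new NestedNode("P").add(new NestedNode("B")))
                .add(new NestedNode("B")));
        // Both B nodes are found even though neither is a top-level node.
        System.out.println(NestedSearch.findAll(html, n -> n.name.equals("B")).size());
    }
}
```

Iterating only over the root's direct children would find no B node in this tree, which is the behaviour change the log message warns about.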
From: <der...@us...> - 2003-12-07 23:42:14
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tags

Modified Files:
    CompositeTag.java FormTag.java ImageTag.java InputTag.java
    LabelTag.java LinkTag.java MetaTag.java SelectTag.java
    TableColumn.java TableRow.java TextareaTag.java
Log Message:
Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs, etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
Index: CompositeTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v retrieving revision 1.66 retrieving revision 1.67 diff -C2 -d -r1.66 -r1.67 *** CompositeTag.java 9 Nov 2003 17:07:11 -0000 1.66 --- CompositeTag.java 7 Dec 2003 23:41:41 -0000 1.67 *************** *** 434,436 **** --- 434,491 ---- return stringNode; } + + public String toString () + { + StringBuffer ret; + + ret = new StringBuffer (1024); + toString (0, ret); + + return (ret.toString ()); + } + + /** + * Return the text contained in this tag. + * @return The complete contents of the tag (within the angle brackets). + */ + public String getText () + { + String ret; + + ret = super.toHtml (); + ret = ret.substring (1, ret.length () - 1); + + return (ret); + } + + public void toString (int level, StringBuffer buffer) + { + Node node; + + for (int i = 0; i < level; i++) + buffer.append (" "); + buffer.append (super.toString ()); + buffer.append (System.getProperty ("line.separator")); + for (SimpleNodeIterator e = children (); e.hasMoreNodes ();) + { + node = e.nextNode (); + if (node instanceof CompositeTag) + ((CompositeTag)node).toString (level + 1, buffer); + else + { + for (int i = 0; i <= level; i++) + buffer.append (" "); + buffer.append (node); + buffer.append (System.getProperty ("line.separator")); + } + } + // eliminate virtual tags + // if (!(getEndTag ().getStartPosition () == getEndTag ().getEndPosition ())) + { + for (int i = 0; i <= level; i++) + buffer.append (" "); + buffer.append (getEndTag ().toString ()); + buffer.append (System.getProperty ("line.separator")); + } + } } Index: FormTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FormTag.java,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** FormTag.java 9 Nov 2003 17:07:11 -0000 1.41 --- 
FormTag.java 7 Dec 2003 23:41:41 -0000 1.42 *************** *** 56,60 **** * The set of end tag names that indicate the end of this tag. */ ! private static final String[] mEndTagEnders = new String[] {"HTML", "BODY"}; /** --- 56,60 ---- * The set of end tag names that indicate the end of this tag. */ ! private static final String[] mEndTagEnders = new String[] {"HTML", "BODY", "TABLE"}; /** Index: ImageTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/ImageTag.java,v retrieving revision 1.36 retrieving revision 1.37 diff -C2 -d -r1.36 -r1.37 *** ImageTag.java 9 Nov 2003 17:07:11 -0000 1.36 --- ImageTag.java 7 Dec 2003 23:41:41 -0000 1.37 *************** *** 188,196 **** } - public String toString() - { - return "IMAGE TAG : Image at " + getImageURL () +"; begins at : "+getStartPosition ()+"; ends at : "+getEndPosition (); - } - public void setImageURL (String url) { --- 188,191 ---- Index: InputTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/InputTag.java,v retrieving revision 1.31 retrieving revision 1.32 diff -C2 -d -r1.31 -r1.32 *** InputTag.java 9 Nov 2003 17:07:11 -0000 1.31 --- InputTag.java 7 Dec 2003 23:41:41 -0000 1.32 *************** *** 56,62 **** return (mIds); } - - public String toString() { - return (ParserUtils.toString(this)); - } } --- 56,58 ---- Index: LabelTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/LabelTag.java,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** LabelTag.java 9 Nov 2003 17:07:11 -0000 1.32 --- LabelTag.java 7 Dec 2003 23:41:41 -0000 1.33 *************** *** 43,47 **** /** ! * Create a new lavel tag. */ public LabelTag () --- 43,47 ---- /** ! * Create a new label tag. 
*/ public LabelTag () Index: LinkTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/LinkTag.java,v retrieving revision 1.44 retrieving revision 1.45 diff -C2 -d -r1.44 -r1.45 *** LinkTag.java 9 Nov 2003 17:07:11 -0000 1.44 --- LinkTag.java 7 Dec 2003 23:41:41 -0000 1.45 *************** *** 31,35 **** import org.htmlparser.Node; - import org.htmlparser.scanners.LinkScanner; import org.htmlparser.util.ParserUtils; import org.htmlparser.util.SimpleNodeIterator; --- 31,34 ---- Index: MetaTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/MetaTag.java,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** MetaTag.java 9 Nov 2003 17:07:11 -0000 1.32 --- MetaTag.java 7 Dec 2003 23:41:41 -0000 1.33 *************** *** 118,129 **** } } - - public String toString() - { - return "META TAG\n"+ - "--------\n"+ - "Http-Equiv : "+getHttpEquiv()+"\n"+ - "Name : "+ getMetaTagName() +"\n"+ - "Contents : "+getMetaContent()+"\n"; - } } --- 118,120 ---- Index: SelectTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/SelectTag.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** SelectTag.java 9 Nov 2003 17:07:11 -0000 1.33 --- SelectTag.java 7 Dec 2003 23:41:41 -0000 1.34 *************** *** 99,122 **** return (ret); } - - public String toString() - { - StringBuffer lString; - NodeList children; - Node node; - - lString = new StringBuffer(ParserUtils.toString(this)); - children = getChildren (); - for(int i=0;i<children.size(); i++) - { - node = children.elementAt(i); - if (node instanceof OptionTag) - { - OptionTag optionTag = (OptionTag)node; - lString.append(optionTag.toString()).append("\n"); - } - } - - return lString.toString(); - } } --- 99,101 ---- 
Index: TableColumn.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableColumn.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** TableColumn.java 9 Nov 2003 17:07:11 -0000 1.33 --- TableColumn.java 7 Dec 2003 23:41:41 -0000 1.34 *************** *** 40,43 **** --- 40,53 ---- /** + * The set of tag names that indicate the end of this tag. + */ + private static final String[] mEnders = new String[] {"TD", "TR"}; + + /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"TR", "TABLE"}; + + /** * Create a new table column tag. */ *************** *** 62,65 **** --- 72,84 ---- { return (mIds); + } + + /** + * Return the set of end tag names that cause this tag to finish. + * @return The names of following end tags that stop further scanning. + */ + public String[] getEndTagEnders () + { + return (mEndTagEnders); } } Index: TableRow.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableRow.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** TableRow.java 9 Nov 2003 17:07:11 -0000 1.35 --- TableRow.java 7 Dec 2003 23:41:41 -0000 1.36 *************** *** 42,45 **** --- 42,50 ---- /** + * The set of end tag names that indicate the end of this tag. + */ + private static final String[] mEndTagEnders = new String[] {"TABLE"}; + + /** * Create a new table row tag. */ *************** *** 64,67 **** --- 69,81 ---- { return (mIds); + } + + /** + * Return the set of end tag names that cause this tag to finish. + * @return The names of following end tags that stop further scanning. 
+ */ + public String[] getEndTagEnders () + { + return (mEndTagEnders); } Index: TextareaTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TextareaTag.java,v retrieving revision 1.30 retrieving revision 1.31 diff -C2 -d -r1.30 -r1.31 *** TextareaTag.java 9 Nov 2003 17:07:11 -0000 1.30 --- TextareaTag.java 7 Dec 2003 23:41:41 -0000 1.31 *************** *** 88,98 **** return toPlainTextString(); } - - public String toString() - { - StringBuffer buff = new StringBuffer(ParserUtils.toString(this)); - buff.append("VALUE : ").append(getValue()).append("\n"); - - return buff.toString(); - } } --- 88,90 ---- |
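The revamped toString() in CompositeTag.java above builds an indented listing by passing a nesting level down the tree, and the log message notes that newlines and tabs in page text are turned into escape sequences so each node stays on one line. A minimal sketch of both ideas together follows. ListingNode is a hypothetical class made up for this example, the two-space indent and the exact set of escaped characters are assumptions, and unlike the committed code it does not print end tags.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node used only to illustrate an indented, one-line-per-node listing. */
class ListingNode {
    final String text;
    final List<ListingNode> children = new ArrayList<>();

    ListingNode(String text) {
        this.text = text;
    }

    /** Append a child and return this node so calls can be chained. */
    ListingNode add(ListingNode child) {
        children.add(child);
        return this;
    }

    /** Replace control characters with visible escapes; backslash goes first to avoid double escaping. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t");
    }

    /** Append this node at the given depth, then each child one level deeper. */
    void toString(int level, StringBuilder buffer) {
        for (int i = 0; i < level; i++) {
            buffer.append("  ");
        }
        buffer.append(escape(text)).append('\n');
        for (ListingNode child : children) {
            child.toString(level + 1, buffer);
        }
    }

    @Override
    public String toString() {
        StringBuilder buffer = new StringBuilder();
        toString(0, buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        ListingNode root = new ListingNode("<p>")
            .add(new ListingNode("line one\n\tline two"));
        System.out.print(root);
    }
}
```

Escaping is what keeps the listing readable: without it, a text node containing a newline would break the one-node-per-line layout and make the indentation misleading.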
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/scanners

Modified Files:
    ScriptScanner.java
Removed Files:
    AppletScanner.java BaseHrefScanner.java BodyScanner.java
    BulletListScanner.java BulletScanner.java DivScanner.java
    DoctypeScanner.java FormScanner.java FrameScanner.java
    FrameSetScanner.java HeadScanner.java HtmlScanner.java
    ImageScanner.java InputTagScanner.java LabelScanner.java
    LinkScanner.java MetaTagScanner.java OptionTagScanner.java
    SelectTagScanner.java SpanScanner.java StyleScanner.java
    TableColumnScanner.java TableRowScanner.java TableScanner.java
    TextareaTagScanner.java TitleScanner.java
Log Message:
Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs, etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
Index: ScriptScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v retrieving revision 1.51 retrieving revision 1.52 diff -C2 -d -r1.51 -r1.52 *** ScriptScanner.java 9 Nov 2003 17:07:10 -0000 1.51 --- ScriptScanner.java 7 Dec 2003 23:41:40 -0000 1.52 *************** *** 30,35 **** --- 30,37 ---- import java.util.Vector; + import org.htmlparser.Node; import org.htmlparser.Parser; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.RemarkNode; import org.htmlparser.StringNode; *************** *** 47,51 **** * It gathers all interior nodes into one undifferentiated string node. */ ! public class ScriptScanner extends CompositeTagScanner { private static final String SCRIPT_END_TAG = "</SCRIPT>"; private static final String MATCH_NAME [] = {"SCRIPT"}; --- 49,56 ---- * It gathers all interior nodes into one undifferentiated string node. */ ! public class ScriptScanner ! extends ! CompositeTagScanner ! { private static final String SCRIPT_END_TAG = "</SCRIPT>"; private static final String MATCH_NAME [] = {"SCRIPT"}; *************** *** 101,105 **** end = null; factory = lexer.getNodeFactory (); ! lexer.setNodeFactory (new Parser ()); // no scanners on a new Parser right? try { --- 106,110 ---- end = null; factory = lexer.getNodeFactory (); ! 
lexer.setNodeFactory (new PrototypicalNodeFactory (true)); try { --- AppletScanner.java DELETED --- --- BaseHrefScanner.java DELETED --- --- BodyScanner.java DELETED --- --- BulletListScanner.java DELETED --- --- BulletScanner.java DELETED --- --- DivScanner.java DELETED --- --- DoctypeScanner.java DELETED --- --- FormScanner.java DELETED --- --- FrameScanner.java DELETED --- --- FrameSetScanner.java DELETED --- --- HeadScanner.java DELETED --- --- HtmlScanner.java DELETED --- --- ImageScanner.java DELETED --- --- InputTagScanner.java DELETED --- --- LabelScanner.java DELETED --- --- LinkScanner.java DELETED --- --- MetaTagScanner.java DELETED --- --- OptionTagScanner.java DELETED --- --- SelectTagScanner.java DELETED --- --- SpanScanner.java DELETED --- --- StyleScanner.java DELETED --- --- TableColumnScanner.java DELETED --- --- TableRowScanner.java DELETED --- --- TableScanner.java DELETED --- --- TextareaTagScanner.java DELETED --- --- TitleScanner.java DELETED --- |
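The log message above describes tags being handed out from a registry of prototypes rather than built by scanners. As a rough, self-contained sketch of that idea only (the class PrototypeFactorySketch and its nested Tag and LinkTag types are hypothetical stand-ins, not the library's classes, and the real factory also sets page, positions and attributes on the clone), the lookup-and-clone step might look something like this:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, simplified stand-in for a prototype-based tag factory. */
public class PrototypeFactorySketch
{
    /** Generic tag; subclasses say which tag names (ids) they answer to. */
    public static class Tag implements Cloneable
    {
        public String name = "";

        public String[] getIds ()
        {
            return (new String[0]);
        }

        public Tag clone ()
        {
            try
            {
                // Object.clone() keeps the runtime class, so a LinkTag stays a LinkTag
                return ((Tag)super.clone ());
            }
            catch (CloneNotSupportedException e)
            {
                throw new AssertionError (e);
            }
        }
    }

    /** A derived tag registered under the name "A". */
    public static class LinkTag extends Tag
    {
        public String[] getIds ()
        {
            return (new String[] {"A"});
        }
    }

    /** The registry of prototypes, keyed by upper case tag name. */
    private final Map<String, Tag> prototypes = new HashMap<> ();

    /** Register a prototype under every id it reports. */
    public void registerTag (Tag tag)
    {
        for (String id : tag.getIds ())
            prototypes.put (id, tag);
    }

    /** Clone the registered prototype, or fall back to a generic tag. */
    public Tag createTag (String rawName)
    {
        String id = rawName.toUpperCase (Locale.ROOT);
        if (id.endsWith ("/")) // treat <a/> like <a>
            id = id.substring (0, id.length () - 1);
        // end tags (names starting with "/") never get a derived type
        Tag prototype = id.startsWith ("/") ? null : prototypes.get (id);
        Tag ret = (null != prototype) ? prototype.clone () : new Tag ();
        ret.name = id;
        return (ret);
    }

    public static void main (String[] args)
    {
        PrototypeFactorySketch factory = new PrototypeFactorySketch ();
        factory.registerTag (new LinkTag ());
        System.out.println (factory.createTag ("a").getClass ().getSimpleName ());
        System.out.println (factory.createTag ("div").getClass ().getSimpleName ());
    }
}
```

An empty registry, which appears to be what ScriptScanner asks for with new PrototypicalNodeFactory (true), would hand back only generic tags in this sketch.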
From: <der...@us...> - 2003-12-07 23:42:13
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications In directory sc8-pr-cvs1:/tmp/cvs-serv16537/parserapplications Modified Files: LinkExtractor.java MailRipper.java Robot.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString(). 
Index: LinkExtractor.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/LinkExtractor.java,v retrieving revision 1.47 retrieving revision 1.48 diff -C2 -d -r1.47 -r1.48 *** LinkExtractor.java 9 Nov 2003 17:07:09 -0000 1.47 --- LinkExtractor.java 7 Dec 2003 23:41:40 -0000 1.48 *************** *** 45,49 **** try { this.parser = new Parser(location); // Create the parser object - parser.registerScanners(); // Register standard scanners (Very Important) } catch (ParserException e) { --- 45,48 ---- Index: MailRipper.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/MailRipper.java,v retrieving revision 1.48 retrieving revision 1.49 diff -C2 -d -r1.48 -r1.49 *** MailRipper.java 9 Nov 2003 17:07:09 -0000 1.48 --- MailRipper.java 7 Dec 2003 23:41:40 -0000 1.49 *************** *** 33,40 **** --- 33,44 ---- import org.htmlparser.Node; + import org.htmlparser.NodeFilter; import org.htmlparser.Parser; + import org.htmlparser.filters.AndFilter; + import org.htmlparser.filters.NodeClassFilter; import org.htmlparser.tags.LinkTag; import org.htmlparser.util.DefaultParserFeedback; import org.htmlparser.util.NodeIterator; + import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; *************** *** 44,113 **** * Pass a web site (or html file on your local disk) as an argument. */ ! public class MailRipper { ! private org.htmlparser.Parser parser; /** * MailRipper c'tor takes the url to be ripped * @param resourceLocation url to be ripped */ ! public MailRipper(String resourceLocation) { ! try { ! parser = new Parser(resourceLocation,new DefaultParserFeedback()); ! parser.registerScanners(); } ! catch (ParserException e) { ! System.err.println("Could not create parser object"); ! 
e.printStackTrace(); } } - public static void main(String[] args) { - System.out.println("Mail Ripper v" + Parser.getVersion ()); - if (args.length<1 || args[0].equals("-help")) - { - System.out.println(); - System.out.println("Syntax : java -classpath htmlparser.jar org.htmlparser.parserapplications.MailRipper <resourceLocn/website>"); - System.out.println(); - System.out.println(" <resourceLocn> the name of the file to be parsed (with complete path "); - System.out.println(" if not in current directory)"); - System.out.println(" -help This screen"); - System.out.println(); - System.out.println("HTML Parser home page : http://htmlparser.sourceforge.net"); - System.out.println(); - System.out.println("Example : java -classpath htmlparser.jar com.kizna.parserapplications.MailRipper http://htmlparser.sourceforge.net"); - System.out.println(); - System.out.println("If you have any doubts, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page instead of mailing any of the contributors directly. You will be surprised with the quality of open source support. "); - System.exit(-1); - } - String resourceLocation = "http://htmlparser.sourceforge.net"; - if (args.length!=0) resourceLocation = args[0]; ! MailRipper ripper = new MailRipper(resourceLocation); ! System.out.println("Ripping Site "+resourceLocation); ! try { ! for (Enumeration e=ripper.rip();e.hasMoreElements();) { ! LinkTag tag = (LinkTag)e.nextElement(); ! System.out.println("Ripped mail address : "+tag.getLink()); ! } ! } ! catch (ParserException e) { ! e.printStackTrace(); ! } } /** * Rip all mail addresses from the given url, and return an enumeration of such mail addresses. ! * @return Enumeration of mail addresses (a vector of LinkTag) */ ! public Enumeration rip() throws ParserException { ! Node node; ! Vector mailAddresses = new Vector(); ! for (NodeIterator e = parser.elements();e.hasMoreNodes();) ! { ! node = e.nextNode(); ! if (node instanceof LinkTag) ! { ! 
LinkTag linkTag = (LinkTag)node; ! if (linkTag.isMailLink()) mailAddresses.addElement(linkTag); ! } ! } ! return mailAddresses.elements(); } } --- 48,134 ---- * Pass a web site (or html file on your local disk) as an argument. */ ! public class MailRipper ! { ! private Parser parser; ! /** * MailRipper c'tor takes the url to be ripped * @param resourceLocation url to be ripped */ ! public MailRipper (String resourceLocation) ! { ! try ! { ! parser = new Parser (resourceLocation,new DefaultParserFeedback ()); } ! catch (ParserException e) ! { ! System.err.println ("Could not create parser object"); ! e.printStackTrace (); } } ! public static void main (String[] args) ! { ! System.out.println ("Mail Ripper v" + Parser.getVersion ()); ! if (args.length<1 || args[0].equals ("-help")) ! { ! System.out.println (); ! System.out.println ("Syntax : java -classpath htmlparser.jar org.htmlparser.parserapplications.MailRipper <resourceLocn/website>"); ! System.out.println (); ! System.out.println (" <resourceLocn> the name of the file to be parsed (with complete path "); ! System.out.println (" if not in current directory)"); ! System.out.println (" -help This screen"); ! System.out.println (); ! System.out.println ("HTML Parser home page : http://htmlparser.sourceforge.net"); ! System.out.println (); ! System.out.println ("Example : java -classpath htmlparser.jar com.kizna.parserapplications.MailRipper http://htmlparser.sourceforge.net"); ! System.out.println (); ! System.out.println ("If you have any doubts, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page instead of mailing any of the contributors directly. You will be surprised with the quality of open source support. "); ! System.exit (-1); ! } ! String resourceLocation = "http://htmlparser.sourceforge.net"; ! if (args.length!=0) resourceLocation = args[0]; ! ! MailRipper ripper = new MailRipper (resourceLocation); ! System.out.println ("Ripping Site "+resourceLocation); ! try ! { ! 
NodeList list; ! ! list = ripper.rip (); ! for (NodeIterator iterator = list.elements (); iterator.hasMoreNodes (); ) ! { ! LinkTag mail = (LinkTag)iterator.nextNode (); ! System.out.println (mail.getLink ()); ! } ! } ! catch (ParserException e) ! { ! e.printStackTrace (); ! } } + /** * Rip all mail addresses from the given url, and return an enumeration of such mail addresses. ! * @return A node list of mail addresses (LinkTag type). */ ! public NodeList rip() throws ParserException ! { ! NodeList ret; ! ! ret = parser.extractAllNodesThatMatch ( ! new AndFilter ( ! new NodeClassFilter (LinkTag.class), ! new NodeFilter () ! { ! public boolean accept (Node node) ! { ! return (((LinkTag)node).isMailLink ()); ! } ! } ! )); ! ! return (ret); } } Index: Robot.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/Robot.java,v retrieving revision 1.50 retrieving revision 1.51 diff -C2 -d -r1.50 -r1.51 *** Robot.java 9 Nov 2003 17:07:09 -0000 1.50 --- Robot.java 7 Dec 2003 23:41:40 -0000 1.51 *************** *** 46,50 **** try { parser = new Parser(resourceLocation,new DefaultParserFeedback()); - parser.registerScanners(); } catch (ParserException e) { --- 46,49 ---- *************** *** 89,93 **** { Parser newParser = new Parser(linkTag.getLink(),new DefaultParserFeedback()); - newParser.registerScanners(); System.out.print("Crawling to "+linkTag.getLink()); crawl(newParser,crawlDepth-1); --- 88,91 ---- |
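Since tags now come back fully nested, the reworked MailRipper above leans on a filter that is applied to every node in the tree rather than to the top level only. A minimal, self-contained sketch of that pattern follows; NestedFilterSketch, its Node class and the mailto: test are hypothetical simplifications, not the library's NodeFilter, AndFilter or LinkTag.isMailLink() code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: recurse into nested nodes and keep those accepted by a filter. */
public class NestedFilterSketch
{
    /** Simplified node: a tag name, an optional link and any number of children. */
    public static class Node
    {
        public final String name;
        public final String link; // null when the node carries no link
        public final List<Node> children = new ArrayList<> ();

        public Node (String name, String link)
        {
            this.name = name;
            this.link = link;
        }

        public Node add (Node child)
        {
            children.add (child);
            return (this);
        }
    }

    /** Collect every node, at any depth, that the filter accepts. */
    public static List<Node> extractAll (Node root, Predicate<Node> filter)
    {
        List<Node> ret = new ArrayList<> ();
        collect (root, filter, ret);
        return (ret);
    }

    private static void collect (Node node, Predicate<Node> filter, List<Node> out)
    {
        if (filter.test (node))
            out.add (node);
        for (Node child : node.children)
            collect (child, filter, out); // recurse: links usually sit deep in the tree
    }

    /** Two filters combined with 'and'; the second only runs if the first accepts. */
    public static List<Node> mailLinks (Node root)
    {
        Predicate<Node> isLink = n -> "A".equals (n.name);
        Predicate<Node> isMail = n -> null != n.link && n.link.startsWith ("mailto:");
        return (extractAll (root, isLink.and (isMail)));
    }

    public static void main (String[] args)
    {
        Node page = new Node ("HTML", null).add (
            new Node ("BODY", null).add (
                new Node ("DIV", null)
                    .add (new Node ("A", "mailto:someone@example.com"))
                    .add (new Node ("A", "http://htmlparser.sourceforge.net"))));
        for (Node mail : mailLinks (page))
            System.out.println (mail.link);
    }
}
```

The order of the two filters matters in the same way it presumably does in rip(): the class check runs first, so the second filter only ever sees link nodes.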
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser In directory sc8-pr-cvs1:/tmp/cvs-serv16537 Modified Files: Parser.java RemarkNode.java StringNode.java StringNodeFactory.java Added Files: PrototypicalNodeFactory.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString(). 
--- NEW FILE: PrototypicalNodeFactory.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/PrototypicalNodeFactory.java,v $ // $Author: derrickoswald $ // $Date: 2003/12/07 23:41:39 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser; import java.io.Serializable; import java.util.Hashtable; import java.util.Map; import java.util.Vector; import org.htmlparser.lexer.Page; import org.htmlparser.lexer.nodes.Attribute; import org.htmlparser.lexer.nodes.NodeFactory; import org.htmlparser.nodeDecorators.DecodingNode; import org.htmlparser.nodeDecorators.EscapeCharacterRemovingNode; import org.htmlparser.nodeDecorators.NonBreakingSpaceConvertingNode; //import org.htmlparser.tags.Tag; import org.htmlparser.tags.*; // import everything for now import org.htmlparser.util.ParserException; /** * A node factory based on the prototype pattern. * This factory uses the prototype pattern to generate new Tag nodes. * Prototype tags, in the form of undifferentiated tags are held in a hash * table. 
On a */ public class PrototypicalNodeFactory implements Serializable, NodeFactory { /** * The list of tags to return at the top level. * The list is keyed by tag name. */ protected Map mBlastocyst; /** * Create a new factory with all known tags registered. */ public PrototypicalNodeFactory () { this (false); } /** * Create a new factory, optionally with no registered tags. * @param empty If <code>true</code>, no tags are registered, * otherwise all known tags are registered. */ public PrototypicalNodeFactory (boolean empty) { clear (); if (!empty) registerTags (); } /** * Create a new factory with the given tag as the only one registered. */ public PrototypicalNodeFactory (Tag tag) { this (true); registerTag (tag); } /** * Create a new factory with the given tags registered. */ public PrototypicalNodeFactory (Tag[] tags) { this (true); for (int i = 0; i < tags.length; i++) registerTag (tags[i]); } /** * Adds a tag to the registry. * @param id The name under which to register the tag. * @param tag The tag to be returned from a createTag(id) call. * @return The tag previously registered with that id, * or <code>null</code> if none. */ public Tag put (String id, Tag tag) { return ((Tag)mBlastocyst.put (id, tag)); } /** * Gets a tag from the registry. * @param id The name under which the tag is registered. * @return The tag registered with that id, * or <code>null</code> if none. */ public Tag get (String id) { return ((Tag)mBlastocyst.get (id)); } /** * Remove a tag from the registry. * @param id The name under which the tag is registered. * @return The tag that was registered with that id, * or <code>null</code> if none. */ public Tag remove (String id) { return ((Tag)mBlastocyst.remove (id)); } /** * Clean out the registry. 
*/ public void clear () { mBlastocyst = new Hashtable (); } public void registerTag (Tag tag) { String ids[]; ids = tag.getIds (); for (int i = 0; i < ids.length; i++) put (ids[i], tag); } public void unregisterTag (Tag tag) { String ids[]; ids = tag.getIds (); for (int i = 0; i < ids.length; i++) remove (ids[i]); } public PrototypicalNodeFactory registerTags () { registerTag (new AppletTag ()); registerTag (new BaseHrefTag ()); registerTag (new Bullet ()); registerTag (new BulletList ()); registerTag (new DoctypeTag ()); registerTag (new FormTag ()); registerTag (new FrameSetTag ()); registerTag (new FrameTag ()); registerTag (new ImageTag ()); registerTag (new InputTag ()); registerTag (new JspTag ()); registerTag (new LabelTag ()); registerTag (new LinkTag ()); registerTag (new MetaTag ()); registerTag (new OptionTag ()); registerTag (new ScriptTag ()); registerTag (new SelectTag ()); registerTag (new StyleTag ()); registerTag (new TableColumn ()); registerTag (new TableRow ()); registerTag (new TableTag ()); registerTag (new TextareaTag ()); registerTag (new TitleTag ()); registerTag (new Div ()); registerTag (new Span ()); registerTag (new BodyTag ()); registerTag (new HeadTag ()); registerTag (new Html ()); return (this); } // // NodeFactory interface // /** * Create a new string node. * @param page The page the node is on. * @param start The beginning position of the string. * @param end The ending position of the string. */ public Node createStringNode (Page page, int start, int end) { Node ret; ret = new StringNode (page, start, end); return (ret); } /** * Create a new remark node. * @param page The page the node is on. * @param start The beginning position of the remark. * @param end The ending position of the remark. */ public Node createRemarkNode (Page page, int start, int end) { return (new RemarkNode (page, start, end)); } /** * Create a new tag node. 
* Note that the attributes vector contains at least one element, * which is the tag name (standalone attribute) at position zero. * This can be used to decide which type of node to create, or * gate other processing that may be appropriate. * @param page The page the node is on. * @param start The beginning position of the tag. * @param end The ending position of the tag. * @param attributes The attributes contained in this tag. */ public Node createTagNode (Page page, int start, int end, Vector attributes) throws ParserException { Attribute attribute; String id; Tag prototype; Tag ret; ret = null; if (0 != attributes.size ()) { attribute = (Attribute)attributes.elementAt (0); id = attribute.getName (); if (null != id) { try { id = id.toUpperCase (); if (!id.startsWith ("/")) { if (id.endsWith ("/")) id = id.substring (0, id.length () - 1); prototype = (Tag)mBlastocyst.get (id); if (null != prototype) { ret = (Tag)prototype.clone (); ret.setPage (page); ret.setStartPosition (start); ret.setEndPosition (end); ret.setAttributesEx (attributes); } } } catch (CloneNotSupportedException cnse) { // default to creating a new one } } } if (null == ret) ret = new Tag (page, start, end, attributes); return (ret); } } Index: Parser.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v retrieving revision 1.75 retrieving revision 1.76 diff -C2 -d -r1.75 -r1.76 *** Parser.java 9 Nov 2003 17:07:08 -0000 1.75 --- Parser.java 7 Dec 2003 23:41:39 -0000 1.76 *************** *** 49,75 **** import org.htmlparser.nodeDecorators.EscapeCharacterRemovingNode; import org.htmlparser.nodeDecorators.NonBreakingSpaceConvertingNode; ! import org.htmlparser.scanners.AppletScanner; ! import org.htmlparser.scanners.BaseHrefScanner; ! import org.htmlparser.scanners.BodyScanner; ! import org.htmlparser.scanners.BulletListScanner; ! import org.htmlparser.scanners.CompositeTagScanner; ! 
import org.htmlparser.scanners.DivScanner; ! import org.htmlparser.scanners.DoctypeScanner; ! import org.htmlparser.scanners.FormScanner; ! import org.htmlparser.scanners.FrameSetScanner; ! import org.htmlparser.scanners.HeadScanner; ! import org.htmlparser.scanners.HtmlScanner; ! import org.htmlparser.scanners.ImageScanner; ! import org.htmlparser.scanners.JspScanner; ! import org.htmlparser.scanners.LinkScanner; ! import org.htmlparser.scanners.MetaTagScanner; ! import org.htmlparser.scanners.ScriptScanner; ! import org.htmlparser.scanners.StyleScanner; ! import org.htmlparser.scanners.TableScanner; ! import org.htmlparser.scanners.TagScanner; ! import org.htmlparser.scanners.TitleScanner; ! import org.htmlparser.tags.ImageTag; ! import org.htmlparser.tags.LinkTag; ! import org.htmlparser.tags.Tag; import org.htmlparser.util.DefaultParserFeedback; import org.htmlparser.util.IteratorImpl; --- 49,53 ---- import org.htmlparser.nodeDecorators.EscapeCharacterRemovingNode; import org.htmlparser.nodeDecorators.NonBreakingSpaceConvertingNode; ! import org.htmlparser.tags.Tag; // temporarily import org.htmlparser.util.DefaultParserFeedback; import org.htmlparser.util.IteratorImpl; *************** *** 87,143 **** * Typical usage of the parser is as follows : <BR> * [1] Create a parser object - passing the URL and a feedback object to the parser<BR> ! * [2] Register the common scanners. See {@link #registerScanners()} <BR> ! * You wouldnt do this if you want to configure a custom lightweight parser. In that case, ! * you would add the scanners of your choice using {@link #addScanner(TagScanner)}<BR> ! * [3] Enumerate through the elements from the parser object <BR> ! * It is important to note that the parsing occurs when you enumerate, ON DEMAND. This is a thread-safe way, ! * and you only get the control back after a particular element is parsed and returned. ! * ! * <BR> ! * Below is some sample code to parse Yahoo.com and print all the tags. ! * <pre> ! 
* Parser parser = new Parser("http://www.yahoo.com",new DefaultHTMLParserFeedback()); ! * // In this example, we are registering all the common scanners ! * parser.registerScanners(); ! * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { ! * Node node = i.nextNode(); ! * node.print(); ! * } ! * </pre> Below is some sample code to parse Yahoo.com and print only the text ! * information. This scanning will run faster, as there are no scanners ! * registered here. ! * <pre> ! * Parser parser = new Parser("http://www.yahoo.com",new DefaultHTMLParserFeedback()); ! * // In this example, none of the scanners need to be registered ! * // as a string node is not a tag to be scanned for. ! * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { ! * Node node = i.nextNode(); ! * if (node instanceof StringNode) { ! * StringNode stringNode = ! * (StringNode)node; ! * System.out.println(stringNode.getText()); ! * } ! * } ! * </pre> ! * The above snippet will print out only the text contents in the html document.<br> ! * Here's another snippet that will only print out the link urls in a document. ! * This is an example of adding a link scanner. ! * <pre> ! * Parser parser = new Parser("http://www.yahoo.com",new DefaultHTMLParserFeedback()); ! * parser.addScanner(new LinkScanner("-l")); ! * for (NodeIterator i = parser.elements();i.hasMoreNodes();) { ! * Node node = i.nextNode(); ! * if (node instanceof LinkTag) { ! * LinkTag linkTag = (LinkTag)node; ! * System.out.println(linkTag.getLink()); ! * } ! * } ! * </pre> * @see Parser#elements() */ public class Parser implements ! Serializable, ! NodeFactory { // Please don't change the formatting of the version variables below. --- 65,77 ---- * Typical usage of the parser is as follows : <BR> * [1] Create a parser object - passing the URL and a feedback object to the parser<BR> ! * [2] Enumerate through the elements from the parser object <BR> ! 
* It is important to note that the parsing occurs when you enumerate, ON DEMAND. ! * This is a thread-safe way, and you only get the control back after a ! * particular element is parsed and returned, which could be the entire body. * @see Parser#elements() */ public class Parser implements ! Serializable { // Please don't change the formatting of the version variables below. *************** *** 175,187 **** /** - * This object is used by the StringParser to create new StringNodes at runtime, based on - * use configurations of the factory - */ - private StringNodeFactory stringNodeFactory; - - /** * Feedback object. */ ! protected ParserFeedback feedback; /** --- 109,115 ---- /** * Feedback object. */ ! protected ParserFeedback mFeedback; /** *************** *** 191,210 **** /** - * The list of scanners to apply at the top level. - */ - protected Map mScanners; - - /** - * The list of tags to return at the top level. - * The list is keyed by tag name. - */ - protected Map mBlastocyst; - - /** - * The current scanner when recursing into a tag. - */ - protected TagScanner mScanner; - - /** * Variable to store lineSeparator. * This is setup to read <code>line.separator</code> from the System property. --- 119,122 ---- *************** *** 273,279 **** public Parser () { ! setFeedback (null); ! setScanners (null); ! setLexer (new Lexer (new Page (""))); } --- 185,189 ---- public Parser () { ! this (new Lexer (new Page ("")), noFeedback); } *************** *** 300,305 **** { setFeedback (fb); ! setScanners (null); setLexer (lexer); } --- 210,217 ---- { setFeedback (fb); ! if (null == lexer) ! throw new IllegalArgumentException ("lexer cannot be null"); setLexer (lexer); + setNodeFactory (new PrototypicalNodeFactory ()); } *************** *** 314,320 **** ParserException { ! setFeedback (fb); ! setScanners (null); ! setConnection (connection); } --- 226,230 ---- ParserException { ! 
this (new Lexer (connection), fb); } *************** *** 383,389 **** * Set the connection for this parser. * This method creates a new <code>Lexer</code> reading from the connection. ! * It does not adjust the <code>mScanners</code> list ! * or <code>feedback</code> object. Trying to ! * set the connection to null is a noop. * @param connection A fully conditioned connection. The connect() * method will be called so it need not be connected yet. --- 293,297 ---- * Set the connection for this parser. * This method creates a new <code>Lexer</code> reading from the connection. ! * Trying to set the connection to null is a noop. * @param connection A fully conditioned connection. The connect() * method will be called so it need not be connected yet. *************** *** 391,394 **** --- 299,303 ---- * HTTP header is not supported, or an i/o exception occurs creating the * lexer. + * @see #setLexer */ public void setConnection (URLConnection connection) *************** *** 414,420 **** * Set the URL for this parser. * This method creates a new Lexer reading from the given URL. ! * It does not adjust the <code>mScanners</code> list ! * or <code>feedback</code> object. Trying to set the url to null or an ! * empty string is a noop. * @see #setConnection(URLConnection) */ --- 323,327 ---- * Set the URL for this parser. * This method creates a new Lexer reading from the given URL. ! * Trying to set the url to null or an empty string is a noop. * @see #setConnection(URLConnection) */ *************** *** 460,465 **** /** * Set the lexer for this parser. ! * TIt does not adjust the <code>mScanners</code> list ! * or <code>feedback</code> object. * Trying to set the lexer to <code>null</code> is a noop. * @param lexer The lexer object to use. --- 367,373 ---- /** * Set the lexer for this parser. ! * The current NodeFactory is set on the given lexer, since the lexer ! * contains the node factory object. ! * It does not adjust the <code>feedback</code> object. 
* Trying to set the lexer to <code>null</code> is a noop. * @param lexer The lexer object to use. *************** *** 467,474 **** public void setLexer (Lexer lexer) { if (null != lexer) ! { mLexer = lexer; - mLexer.setNodeFactory (this); } } --- 375,388 ---- public void setLexer (Lexer lexer) { + NodeFactory factory; + if (null != lexer) ! { // move a node factory that's been set to the new lexer ! factory = null; ! if (null != getLexer ()) ! factory = getLexer ().getNodeFactory (); ! if (null != factory) ! lexer.setNodeFactory (factory); mLexer = lexer; } } *************** *** 484,520 **** /** ! * Get the number of scanners registered currently in the parser. ! * @return int number of scanners registered. ! */ ! public int getNumScanners() ! { ! return mScanners.size(); ! } ! ! /** ! * This method is to be used to change the set of scanners in the current parser. ! * @param newScanners List of scanner objects to be used during the parsing process. */ ! public void setScanners (Map newScanners) { ! Iterator iterator; ! TagScanner scanner; ! ! flushScanners (); ! if (null != newScanners) ! for (iterator = newScanners.entrySet ().iterator (); iterator.hasNext (); ) ! { ! scanner = (TagScanner)iterator.next (); ! addScanner (scanner); ! } } /** ! * Get the list of scanners registered currently in the parser ! * @return List of scanners currently registered in the parser */ ! public Map getScanners() { ! return mScanners; } --- 398,418 ---- /** ! * Get the current node factory. ! * @return The parser's node factory. */ ! public NodeFactory getNodeFactory () { ! return (getLexer ().getNodeFactory ()); } /** ! * Get the current node factory. ! * @return The parser's node factory. */ ! public void setNodeFactory (NodeFactory factory) { ! if (null == factory) ! throw new IllegalArgumentException ("node factory cannot be null"); ! getLexer ().setNodeFactory (factory); } *************** *** 523,529 **** * @param fb The new feedback object to use. */ ! 
public void setFeedback(ParserFeedback fb) { ! feedback = (null == fb) ? noFeedback : fb; } --- 421,427 ---- * @param fb The new feedback object to use. */ ! public void setFeedback (ParserFeedback fb) { ! mFeedback = (null == fb) ? noFeedback : fb; } *************** *** 532,537 **** * @return HTMLParserFeedback */ ! public ParserFeedback getFeedback() { ! return feedback; } --- 430,436 ---- * @return HTMLParserFeedback */ ! public ParserFeedback getFeedback() ! { ! return (mFeedback); } *************** *** 549,590 **** /** - * Add a new Tag Scanner. - * In typical situations where you require a no-frills parser, use the registerScanners() method to add the most - * common parsers. But when you wish to either compose a parser with only certain scanners registered, use this method. - * It is advantageous to register only the scanners you want, in order to achieve faster parsing speed. This method - * would also be of use when you have developed custom scanners, and need to register them into the parser. - * @param scanner TagScanner object (or derivative) to be added to the list of registered scanners. - */ - public void addScanner(TagScanner scanner) - { - String ids[]; - Tag tag; - - ids = scanner.getID(); - for (int i = 0; i < ids.length; i++) - { - mScanners.put (ids[i], scanner); - // for now, the only way to create a tag is to ask the scanner... - try - { - if (scanner instanceof CompositeTagScanner) - { - tag = ((CompositeTagScanner)scanner).createTag (null, 0, 0, null, null, null, null); - mBlastocyst.put (ids[i], tag); - } - else - { - tag = scanner.createTag (null, 0, 0, null, null, null); - mBlastocyst.put (ids[i], tag); - } - } - catch (Exception e) - { - e.printStackTrace (); - } - } - } - - /** * Returns an iterator (enumeration) to the html nodes. 
Each node can be a tag/endtag/ * string/link/image<br> --- 448,451 ---- *************** *** 593,597 **** * <pre> * Parser parser = new Parser("http://www.yahoo.com"); - * parser.registerScanners(); * for (NodeIterator i = parser.elements();i.hasMoreElements();) { * Node node = i.nextHTMLNode(); --- 454,457 ---- *************** *** 605,608 **** --- 465,469 ---- * if (node instanceof ...) { * // Downcast, and process + * // recursively (nodes within nodes) * } * } *************** *** 612,646 **** public NodeIterator elements () throws ParserException { ! return (new IteratorImpl (getLexer (), feedback)); ! } ! ! /** ! * Flush the current scanners registered. ! * The registered scanners list becomes empty with this call. ! */ ! public void flushScanners() ! { ! mScanners = new Hashtable (); ! mBlastocyst = new Hashtable (); ! } ! ! /** ! * Return the scanner registered in the parser having the ! * given id ! * @param id The id of the requested scanner ! * @return TagScanner The Tag Scanner ! */ ! public TagScanner getScanner (String id) ! { ! Tag tag; ! TagScanner ret; ! ! ret = null; ! ! tag = (Tag)mBlastocyst.get (id); ! if (null != tag) ! ret = (TagScanner)tag.getThisScanner (); ! ! return (ret); } --- 473,477 ---- public NodeIterator elements () throws ParserException { ! return (new IteratorImpl (getLexer (), getFeedback ())); } *************** *** 672,762 **** /** - * This method should be invoked in order to register some common scanners. 
- * The scanners that get added are : <br> - * LinkScanner (filter key "-l")<br> - * ImageScanner (filter key "-i")<br> - * ScriptScanner (filter key "-s") <br> - * StyleScanner (filter key "-t") <br> - * JspScanner (filter key "-j") <br> - * AppletScanner (filter key "-a") <br> - * MetaTagScanner (filter key "-m") <br> - * TitleScanner (filter key "-t") <br> - * DoctypeScanner (filter key "-d") <br> - * FormScanner (filter key "-f") <br> - * FrameSetScanner(filter key "-r") <br> - * BulletListScanner(filter key "-bulletList") <br> - * DivScanner(filter key "-div") <br> - * TableScanner(filter key "") <br> - * <br> - * Call this method after creating the Parser object. e.g. <BR> - * <pre> - * Parser parser = new Parser("http://www.yahoo.com"); - * parser.registerScanners(); - * </pre> - */ - public void registerScanners() { - if (mScanners.size()>0) - { - System.err.println("registerScanners() should be called first, when no other scanner has been registered."); - System.err.println("Other scanners already exist, hence this method call won't have any effect"); - return; - } - addScanner(new LinkScanner(LinkTag.LINK_TAG_FILTER)); - addScanner(new ImageScanner(ImageTag.IMAGE_TAG_FILTER)); - addScanner(new ScriptScanner("-s")); - addScanner(new StyleScanner("-t")); - addScanner(new JspScanner("-j")); - addScanner(new AppletScanner("-a")); - addScanner(new MetaTagScanner("-m")); - addScanner(new TitleScanner("-T")); - addScanner(new DoctypeScanner("-d")); - addScanner(new FormScanner("-f",this)); - addScanner(new FrameSetScanner("-r")); - addScanner(new BaseHrefScanner("-b")); - addScanner(new BulletListScanner("-bulletList",this)); - // addScanner(new SpanScanner("-p")); - addScanner(new DivScanner("-div")); - addScanner(new TableScanner(this)); - } - - /** - * Make a call to registerDomScanners(), instead of registerScanners(), - * when you are interested in retrieving a Dom representation of the html - * page. 
Upon parsing, you will receive an Html object - which will contain - * children, one of which would be the body. This is still evolving, and in - * future releases, you might see consolidation of Html - to provide you - * with methods to access the body and the head. - */ - public void registerDomScanners() { - registerScanners(); - addScanner(new HtmlScanner()); - addScanner(new BodyScanner()); - addScanner(new HeadScanner()); - } - - /** - * Removes a specified scanner object. You can create - * an anonymous object as a parameter. This method - * will use the scanner's key and remove it from the - * registry of scanners. - * e.g. - * <pre> - * removeScanner(new FormScanner("")); - * </pre> - * @param scanner TagScanner object to be removed from the list of registered scanners - */ - public void removeScanner(TagScanner scanner) - { - String[] ids; - - ids = scanner.getID (); - for (int i = 0; i < ids.length; i++) - { - mScanners.remove (ids[i]); - mBlastocyst.remove (ids[i]); - } - } - - /** * Opens a connection using the given url. * @param url The url to open. 
--- 503,506 ---- *************** *** 874,878 **** { Parser parser = new Parser (args[0]); - parser.registerScanners (); System.out.println ("Parsing " + parser.getURL ()); NodeFilter filter; --- 618,621 ---- *************** *** 959,968 **** } - public static Parser createLinkRecognizingParser(String inputHTML) { - Parser parser = createParser(inputHTML); - parser.addScanner(new LinkScanner(LinkTag.LINK_TAG_FILTER)); - return parser; - } - /** * @return String lineSeparator that will be used in toHTML() --- 702,705 ---- *************** *** 970,1080 **** public static String getLineSeparator() { return lineSeparator; - } - - public StringNodeFactory getStringNodeFactory() { - if (stringNodeFactory == null) - stringNodeFactory = new StringNodeFactory(); - return stringNodeFactory; - } - - public void setStringNodeFactory(StringNodeFactory stringNodeFactory) { - this.stringNodeFactory = stringNodeFactory; - } - - // - // NodeFactory interface - // - - /** - * Create a new string node. - * @param page The page the node is on. - * @param start The beginning position of the string. - * @param end The ending positiong of the string. - */ - public Node createStringNode (Page page, int start, int end) - { - Node ret; - - ret = new StringNode (page, start, end); - if (null != stringNodeFactory) - { - if (stringNodeFactory.shouldDecodeNodes ()) - ret = new DecodingNode (ret); - if (stringNodeFactory.shouldRemoveEscapeCharacters ()) - ret = new EscapeCharacterRemovingNode (ret); - if (stringNodeFactory.shouldConvertNonBreakingSpace ()) - ret = new NonBreakingSpaceConvertingNode (ret); - } - - return (ret); - } - - /** - * Create a new remark node. - * @param page The page the node is on. - * @param start The beginning position of the remark. - * @param end The ending positiong of the remark. - */ - public Node createRemarkNode (Page page, int start, int end) - { - return (new RemarkNode (page, start, end)); - } - - /** - * Create a new tag node. 
- * Note that the attributes vector contains at least one element, - * which is the tag name (standalone attribute) at position zero. - * This can be used to decide which type of node to create, or - * gate other processing that may be appropriate. - * @param page The page the node is on. - * @param start The beginning position of the tag. - * @param end The ending positiong of the tag. - * @param attributes The attributes contained in this tag. - */ - public Node createTagNode (Page page, int start, int end, Vector attributes) - throws - ParserException - { - Attribute attribute; - String id; - Tag prototype; - Tag ret; - - ret = null; - - if (0 != attributes.size ()) - { - attribute = (Attribute)attributes.elementAt (0); - id = attribute.getName (); - if (null != id) - { - try - { - id = id.toUpperCase (); - if (!id.startsWith ("/")) - { - if (id.endsWith ("/")) - id = id.substring (0, id.length () - 1); - prototype = (Tag)mBlastocyst.get (id); - if (null != prototype) - { - ret = (Tag)prototype.clone (); - ret.setPage (page); - ret.setStartPosition (start); - ret.setEndPosition (end); - ret.setAttributesEx (attributes); - } - } - } - catch (CloneNotSupportedException cnse) - { - // default to creating a new one - } - } - } - if (null == ret) - ret = new Tag (page, start, end, attributes); - - return (ret); } } --- 707,710 ---- Index: RemarkNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/RemarkNode.java,v retrieving revision 1.37 retrieving revision 1.38 diff -C2 -d -r1.37 -r1.38 *** RemarkNode.java 9 Nov 2003 17:07:08 -0000 1.37 --- RemarkNode.java 7 Dec 2003 23:41:39 -0000 1.38 *************** *** 41,64 **** org.htmlparser.lexer.nodes.RemarkNode { - public final static String REMARK_NODE_FILTER="-r"; - - // /** - // * Tag contents will have the contents of the comment tag. 
- // */ - // String tagContents; - // - // /** - // * The HTMLRemarkTag is constructed by providing the beginning posn, ending posn - // * and the tag contents. - // * @param nodeBegin beginning position of the tag - // * @param nodeEnd ending position of the tag - // * @param tagContents contents of the remark tag - // */ - // public RemarkNode(int nodeBegin, int nodeEnd, String tagContents) - // { - // super(nodeBegin,nodeEnd); - // this.tagContents = tagContents; - // } - /** * Constructor takes in the text string, beginning and ending posns. --- 41,44 ---- *************** *** 73,95 **** /** - * Print the contents of the remark tag. - */ - public String toString() - { - StringBuffer ret; - - ret = new StringBuffer (1024); - ret.append ("Comment Tag : "); - ret.append (getText ()); - ret.append ("; begins at : "); - ret.append (getStartPosition ()); - ret.append ("; ends at : "); - ret.append (getEndPosition ()); - ret.append ("\n"); - - return (ret.toString ()); - } - - /** * Remark visiting code. * @param visitor The <code>NodeVisitor</code> object to invoke --- 53,56 ---- *************** *** 100,103 **** ((NodeVisitor)visitor).visitRemarkNode (this); } - } --- 61,63 ---- Index: StringNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/StringNode.java,v retrieving revision 1.45 retrieving revision 1.46 diff -C2 -d -r1.45 -r1.46 *** StringNode.java 9 Nov 2003 17:07:08 -0000 1.45 --- StringNode.java 7 Dec 2003 23:41:39 -0000 1.46 *************** *** 41,62 **** org.htmlparser.lexer.nodes.StringNode { - public static final String STRING_FILTER="-string"; - - // /** - // * The text of the string. - // */ - // protected StringBuffer textBuffer; - // - /** - * Constructor takes in the text string, beginning and ending posns. 
- * @param text The contents of the string line - * @param textBegin The beginning position of the string - * @param textEnd The ending positiong of the string - */ - public StringNode (StringBuffer text, int textBegin,int textEnd) - { - super(new Page (text.toString ()), textBegin,textEnd); - } - /** * Constructor takes in the text string, beginning and ending posns. --- 41,44 ---- *************** *** 68,86 **** { super (page, start, end); - } - - public String toString() - { - StringBuffer ret; - - ret = new StringBuffer (1024); - ret.append ("Text = "); - ret.append (getText ()); - ret.append ("; begins at : "); - ret.append (getStartPosition ()); - ret.append ("; ends at : "); - ret.append (getEndPosition ()); - - return (ret.toString ()); } --- 50,53 ---- Index: StringNodeFactory.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/StringNodeFactory.java,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** StringNodeFactory.java 9 Nov 2003 17:07:08 -0000 1.7 --- StringNodeFactory.java 7 Dec 2003 23:41:39 -0000 1.8 *************** *** 30,33 **** --- 30,34 ---- import java.io.Serializable; + import org.htmlparser.lexer.Page; import org.htmlparser.nodeDecorators.DecodingNode; *************** *** 35,45 **** import org.htmlparser.nodeDecorators.NonBreakingSpaceConvertingNode; ! public class StringNodeFactory implements Serializable { ! /** * Flag to tell the parser to decode strings returned by StringNode's toPlainTextString. * Decoding occurs via the method, org.htmlparser.util.Translate.decode() */ ! private boolean shouldDecodeNodes = false; --- 36,50 ---- import org.htmlparser.nodeDecorators.NonBreakingSpaceConvertingNode; ! public class StringNodeFactory ! extends ! PrototypicalNodeFactory ! implements ! Serializable ! { /** * Flag to tell the parser to decode strings returned by StringNode's toPlainTextString. 
* Decoding occurs via the method, org.htmlparser.util.Translate.decode() */ ! protected boolean mDecode; *************** *** 48,52 **** * Escape character removal occurs via the method, org.htmlparser.util.ParserUtils.removeEscapeCharacters() */ ! private boolean shouldRemoveEscapeCharacters = false; /** --- 53,57 ---- * Escape character removal occurs via the method, org.htmlparser.util.ParserUtils.removeEscapeCharacters() */ ! protected boolean mRemoveEscapes; /** *************** *** 54,98 **** * (i.e. \u00a0) to a space (" "). If true, this will happen inside StringNode's toPlainTextString. */ ! private boolean shouldConvertNonBreakingSpace = false; ! public Node createStringNode( ! StringBuffer textBuffer, ! int textBegin, ! int textEnd) { ! Node newNode = new StringNode(textBuffer, textBegin, textEnd); ! if (shouldDecodeNodes()) ! newNode = new DecodingNode(newNode); ! if (shouldRemoveEscapeCharacters()) ! newNode = new EscapeCharacterRemovingNode(newNode); ! if (shouldConvertNonBreakingSpace()) ! newNode = new NonBreakingSpaceConvertingNode(newNode); ! return newNode; } /** ! * Tells the parser to decode nodes using org.htmlparser.util.Translate.decode() */ ! public void setNodeDecoding(boolean shouldDecodeNodes) { ! this.shouldDecodeNodes = shouldDecodeNodes; ! } ! public boolean shouldDecodeNodes() { ! return shouldDecodeNodes; } ! public void setEscapeCharacterRemoval(boolean shouldRemoveEscapeCharacters) { ! this.shouldRemoveEscapeCharacters = shouldRemoveEscapeCharacters; } ! public boolean shouldRemoveEscapeCharacters() { ! return shouldRemoveEscapeCharacters; } ! public void setNonBreakSpaceConversion(boolean shouldConvertNonBreakSpace) { ! this.shouldConvertNonBreakingSpace = shouldConvertNonBreakSpace; } ! public boolean shouldConvertNonBreakingSpace() { ! return shouldConvertNonBreakingSpace; } } --- 59,148 ---- * (i.e. \u00a0) to a space (" "). If true, this will happen inside StringNode's toPlainTextString. */ ! 
protected boolean mConvertNonBreakingSpaces; ! ! public StringNodeFactory () ! { ! mDecode = false; ! mRemoveEscapes = false; ! mConvertNonBreakingSpaces = false; ! } ! // ! // NodeFactory interface override ! // ! ! /** ! * Create a new string node. ! * @param page The page the node is on. ! * @param start The beginning position of the string. ! * @param end The ending position of the string. ! */ ! public Node createStringNode (Page page, int start, int end) ! { ! Node ret; ! ! ret = super.createStringNode (page, start, end); ! if (getDecode ()) ! ret = new DecodingNode (ret); ! if (getRemoveEscapes ()) ! ret = new EscapeCharacterRemovingNode (ret); ! if (getConvertNonBreakingSpaces ()) ! ret = new NonBreakingSpaceConvertingNode (ret); ! ! return (ret); } /** ! * Set the decoding state. ! * @param decode If <code>true</code>, string nodes decode text using {@link org.htmlparser.util.Translate#decode}. */ ! public void setDecode (boolean decode) ! { ! mDecode = decode; ! } ! /** ! * Get the decoding state. ! * @return <code>true</code> if string nodes decode text. ! */ ! public boolean getDecode () ! { ! return (mDecode); } ! /** ! * Set the escape removing state. ! * @param remove If <code>true</code>, string nodes remove escape characters. ! */ ! public void setRemoveEscapes (boolean remove) ! { ! mRemoveEscapes = remove; } ! /** ! * Get the escape removing state. ! * @return The removing state. ! */ ! public boolean getRemoveEscapes () ! { ! return (mRemoveEscapes); } ! /** ! * Set the non-breaking space replacing state. ! * @param convert If <code>true</code>, string nodes replace non-breaking space (\u00a0) characters with spaces. ! */ ! public void setConvertNonBreakingSpaces (boolean convert) ! { ! mConvertNonBreakingSpaces = convert; } ! /** ! * Get the non-breaking space replacing state. ! * @return The replacing state. ! */ ! public boolean getConvertNonBreakingSpaces () ! { ! return (mConvertNonBreakingSpaces); } } |
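The createStringNode() override in the StringNodeFactory diff above wraps the base node in one decorator per enabled flag, in a fixed order (decode, then escape removal, then non-breaking space conversion). The following is only a rough standalone sketch of that layering. DecoratorSketch, Text and the three helper methods are hypothetical names, the decoder is a toy that handles a single entity, and "escape characters" is assumed to mean tab, newline and carriage return.

```java
// Standalone sketch: hypothetical names throughout, not the htmlparser classes.
final class DecoratorSketch
{
    /** Minimal stand-in for a node that can report its plain text. */
    interface Text
    {
        String plain ();
    }

    // the three flags, all off by default as in the diff's constructor
    boolean decode;
    boolean removeEscapes;
    boolean convertNonBreakingSpaces;

    /** Toy decoder: handles only &amp;, unlike a real entity translator. */
    static Text decoding (Text inner)
    {
        return () -> inner.plain ().replace ("&amp;", "&");
    }

    /** Assumption: "escape characters" are tab, newline and carriage return. */
    static Text escapeRemoving (Text inner)
    {
        return () -> inner.plain ().replaceAll ("[\\t\\n\\r]", "");
    }

    /** Replace each non-breaking space with an ordinary space. */
    static Text nonBreakingSpaceConverting (Text inner)
    {
        return () -> inner.plain ().replace ('\u00a0', ' ');
    }

    /** Start from the raw text and add one wrapper per enabled flag. */
    Text create (String raw)
    {
        Text ret = () -> raw;
        if (decode)
            ret = decoding (ret);
        if (removeEscapes)
            ret = escapeRemoving (ret);
        if (convertNonBreakingSpaces)
            ret = nonBreakingSpaceConverting (ret);
        return (ret);
    }

    public static void main (String[] args)
    {
        DecoratorSketch factory = new DecoratorSketch ();
        factory.decode = true;
        factory.removeEscapes = true;
        factory.convertNonBreakingSpaces = true;
        System.out.println (factory.create ("a&amp;b\n\u00a0c").plain ());
    }
}
```

Because each wrapper only sees the output of the one inside it, the order of the if statements decides the order of the transformations. Switching a flag off simply leaves that layer out.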
From: <der...@us...> - 2003-12-07 23:42:13
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes In directory sc8-pr-cvs1:/tmp/cvs-serv16537/lexer/nodes Modified Files: Attribute.java RemarkNode.java StringNode.java TagNode.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
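The log message above describes two changes for callers: specific tags are now produced by copying registered prototypes instead of by scanners, and results come back nested, so the caller has to recurse to reach inner tags. A minimal standalone sketch of both ideas follows. None of the names in it (PrototypeSketch, Tag, Link, Factory, collect) come from the library. They are hypothetical stand-ins, and the name handling only loosely mirrors the removed createTagNode() code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Standalone sketch: every name here is hypothetical, not the htmlparser API.
final class PrototypeSketch
{
    /** Generic tag: a name plus nested children. */
    static class Tag
    {
        String name = "";
        final List<Tag> children = new ArrayList<Tag> ();

        /** Prototype hook: return a fresh instance of the same concrete class. */
        Tag copy () { return (new Tag ()); }
    }

    /** A "derived" tag, standing in for something like a link tag. */
    static class Link extends Tag
    {
        @Override
        Tag copy () { return (new Link ()); }
    }

    /** Maps upper-cased tag names to the prototype that is copied for them. */
    static class Factory
    {
        private final Map<String, Tag> prototypes = new HashMap<String, Tag> ();

        void register (String id, Tag prototype)
        {
            prototypes.put (id.toUpperCase (Locale.ENGLISH), prototype);
        }

        /** Copy the registered prototype, or fall back to a generic tag. */
        Tag create (String rawName)
        {
            String id = rawName.toUpperCase (Locale.ENGLISH);
            if (id.endsWith ("/"))
                id = id.substring (0, id.length () - 1); // "BR/" from <br/>
            // end tags such as "/A" never get a prototype
            Tag prototype = id.startsWith ("/") ? null : prototypes.get (id);
            Tag ret = (null == prototype) ? new Tag () : prototype.copy ();
            ret.name = id;
            return (ret);
        }
    }

    /** Depth-first search: nested tags are only reachable through their parents. */
    static void collect (Tag node, Class<? extends Tag> type, List<Tag> found)
    {
        if (type.isInstance (node))
            found.add (node);
        for (Tag child : node.children)
            collect (child, type, found);
    }

    /** Build DIV > (A, B > A) and count the links found by recursion. */
    static int demo ()
    {
        Factory factory = new Factory ();
        factory.register ("a", new Link ());
        Tag div = factory.create ("div");
        Tag bold = factory.create ("b");
        div.children.add (factory.create ("a"));
        bold.children.add (factory.create ("A"));
        div.children.add (bold);
        List<Tag> links = new ArrayList<Tag> ();
        collect (div, Link.class, links);
        return (links.size ());
    }

    public static void main (String[] args)
    {
        System.out.println ("links found: " + demo ());
    }
}
```

Looking only at the top-level node (the DIV) would find no links at all. Both are found only because collect() descends into the children, which is the behaviour change the log message warns about.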
Index: Attribute.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/Attribute.java,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** Attribute.java 9 Nov 2003 17:07:09 -0000 1.14 --- Attribute.java 7 Dec 2003 23:41:40 -0000 1.15 *************** *** 33,36 **** --- 33,37 ---- package org.htmlparser.lexer.nodes; + import java.io.Serializable; import org.htmlparser.lexer.Page; import org.htmlparser.util.Translate; *************** *** 198,201 **** --- 199,204 ---- */ public class Attribute + implements + Serializable { /** Index: RemarkNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/RemarkNode.java,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** RemarkNode.java 9 Nov 2003 17:07:09 -0000 1.13 --- RemarkNode.java 7 Dec 2003 23:41:40 -0000 1.14 *************** *** 34,37 **** --- 34,38 ---- import org.htmlparser.lexer.Page; import org.htmlparser.util.NodeList; + import org.htmlparser.util.ParserException; /** *************** *** 81,84 **** --- 82,86 ---- return (mPage.getText (getStartPosition (), getEndPosition ())); } + /** * Print the contents of the remark tag. *************** *** 86,95 **** public String toString() { Cursor start; Cursor end; ! start = new Cursor (getPage (), getStartPosition ()); ! end = new Cursor (getPage (), getEndPosition ()); ! return ("Rem (" + start.toString () + "," + end.toString () + "): " + getText ()); } --- 88,140 ---- public String toString() { + int startpos; + int endpos; Cursor start; Cursor end; + char c; + StringBuffer ret; ! startpos = getStartPosition (); ! endpos = getEndPosition (); ! ret = new StringBuffer (endpos - startpos + 20); ! start = new Cursor (getPage (), startpos); ! end = new Cursor (getPage (), endpos); ! ret.append ("Rem ("); ! ret.append (start); ! 
ret.append (","); ! ret.append (end); ! ret.append ("): "); ! while (start.getPosition () < endpos) ! { ! try ! { ! c = mPage.getCharacter (start); ! switch (c) ! { ! case '\t': ! ret.append ("\\t"); ! break; ! case '\n': ! ret.append ("\\n"); ! break; ! case '\r': ! ret.append ("\\r"); ! break; ! default: ! ret.append (c); ! } ! } ! catch (ParserException pe) ! { ! // not really expected, but we're only doing toString, so ignore ! } ! if (77 <= ret.length ()) ! { ! ret.append ("..."); ! break; ! } ! } ! ! return (ret.toString ()); } Index: StringNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/StringNode.java,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** StringNode.java 9 Nov 2003 17:07:09 -0000 1.14 --- StringNode.java 7 Dec 2003 23:41:40 -0000 1.15 *************** *** 55,59 **** /** ! * Returns the text of the string line */ public String getText () --- 55,59 ---- /** ! * Returns the text of the string line. */ public String getText () *************** *** 93,104 **** } public String toString () { Cursor start; Cursor end; ! start = new Cursor (getPage (), getStartPosition ()); ! end = new Cursor (getPage (), getEndPosition ()); ! return ("Txt (" + start.toString () + "," + end.toString () + "): " + getText ()); } --- 93,154 ---- } + /** + * Express this string node as a printable string. + * This is suitable for display in a debugger or output to a printout. + * Control characters are replaced by their equivalent escape + * sequence and the contents are truncated to 80 characters. + * @return A string representation of the string node. + */ public String toString () { + int startpos; + int endpos; Cursor start; Cursor end; + char c; + StringBuffer ret; ! startpos = getStartPosition (); ! endpos = getEndPosition (); ! ret = new StringBuffer (endpos - startpos + 20); ! start = new Cursor (getPage (), startpos); !
end = new Cursor (getPage (), endpos); ! ret.append ("Txt ("); ! ret.append (start); ! ret.append (","); ! ret.append (end); ! ret.append ("): "); ! while (start.getPosition () < endpos) ! { ! try ! { ! c = mPage.getCharacter (start); ! switch (c) ! { ! case '\t': ! ret.append ("\\t"); ! break; ! case '\n': ! ret.append ("\\n"); ! break; ! case '\r': ! ret.append ("\\r"); ! break; ! default: ! ret.append (c); ! } ! } ! catch (ParserException pe) ! { ! // not really expected, but we're only doing toString, so ignore ! } ! if (77 <= ret.length ()) ! { ! ret.append ("..."); ! break; ! } ! } ! ! return (ret.toString ()); } Index: TagNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/TagNode.java,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** TagNode.java 9 Nov 2003 17:07:09 -0000 1.24 --- TagNode.java 7 Dec 2003 23:41:40 -0000 1.25 *************** *** 637,644 **** --- 637,648 ---- public String toString () { + String text; String type; Cursor start; Cursor end; + StringBuffer ret; + text = getText (); + ret = new StringBuffer (20 + text.length ()); if (isEndTag ()) type = "End"; *************** *** 647,651 **** start = new Cursor (getPage (), getStartPosition ()); end = new Cursor (getPage (), getEndPosition ()); ! return (type + " (" + start.toString () + "," + end.toString () + "): " + getText ()); } --- 651,670 ---- start = new Cursor (getPage (), getStartPosition ()); end = new Cursor (getPage (), getEndPosition ()); ! ret.append (type); ! ret.append (" ("); ! ret.append (start); ! ret.append (","); ! ret.append (end); ! ret.append ("): "); ! if (80 < ret.length () + text.length ()) ! { ! text = text.substring (0, 77 - ret.length ()); ! ret.append (text); ! ret.append ("..."); ! } ! else ! ret.append (text); ! ! return (ret.toString ()); } |
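The revamped toString() methods in the RemarkNode and StringNode diffs above share one technique: write a short prefix, copy the text while replacing tab, newline and carriage return with visible escape sequences, and stop with "..." once the buffer reaches 77 characters. The sketch below isolates that loop over a plain String so it can be run without the library. ToStringSketch, describe and LIMIT are hypothetical names, and the Page/Cursor machinery of the real code is deliberately left out.

```java
// Standalone sketch: no htmlparser classes are used here.
final class ToStringSketch
{
    /** Hypothetical constant mirroring the "77 <= ret.length ()" check in the diff. */
    static final int LIMIT = 77;

    /**
     * Build a one-line description: the prefix followed by the text with
     * tab, newline and carriage return shown as escape sequences,
     * cut off with "..." once the buffer reaches LIMIT characters.
     */
    static String describe (String prefix, String text)
    {
        StringBuilder ret = new StringBuilder (prefix.length () + text.length () + 20);
        ret.append (prefix);
        for (int i = 0; i < text.length (); i++)
        {
            char c = text.charAt (i);
            switch (c)
            {
                case '\t':
                    ret.append ("\\t");
                    break;
                case '\n':
                    ret.append ("\\n");
                    break;
                case '\r':
                    ret.append ("\\r");
                    break;
                default:
                    ret.append (c);
            }
            // same cut-off rule as the diff: check after every character
            if (LIMIT <= ret.length ())
            {
                ret.append ("...");
                break;
            }
        }
        return (ret.toString ());
    }

    public static void main (String[] args)
    {
        System.out.println (describe ("Txt (0,19): ", "line one\n\tline two"));
    }
}
```

Since an escape sequence adds two characters at once, the result can end up one character longer than 80. The same is true of the loop in the diff, so the "80 characters" in the javadoc is best read as approximate.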
From: <der...@us...> - 2003-12-07 23:42:13
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans In directory sc8-pr-cvs1:/tmp/cvs-serv16537/beans Modified Files: LinkBean.java StringBean.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
Index: LinkBean.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/LinkBean.java,v retrieving revision 1.23 retrieving revision 1.24 diff -C2 -d -r1.23 -r1.24 *** LinkBean.java 9 Nov 2003 17:07:08 -0000 1.23 --- LinkBean.java 7 Dec 2003 23:41:39 -0000 1.24 *************** *** 95,99 **** parser = new Parser (url); - parser.registerScanners (); ObjectFindingVisitor visitor = new ObjectFindingVisitor(LinkTag.class); parser.visitAllNodesWith(visitor); --- 95,98 ---- Index: StringBean.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/StringBean.java,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** StringBean.java 9 Nov 2003 17:07:08 -0000 1.32 --- StringBean.java 7 Dec 2003 23:41:39 -0000 1.33 *************** *** 260,265 **** String ret; - mParser.flushScanners (); - mParser.registerScanners (); mIsPre = false; mIsScript = false; --- 260,263 ---- *************** *** 297,302 **** try { - mParser.flushScanners (); - mParser.registerScanners (); mIsPre = false; mIsScript = false; --- 295,298 ---- |
From: <der...@us...> - 2003-12-07 23:42:13
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer In directory sc8-pr-cvs1:/tmp/cvs-serv16537/lexer Modified Files: Lexer.java Page.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString(). Index: Lexer.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Lexer.java,v retrieving revision 1.20 retrieving revision 1.21 diff -C2 -d -r1.20 -r1.21 *** Lexer.java 9 Nov 2003 17:07:08 -0000 1.20 --- Lexer.java 7 Dec 2003 23:41:40 -0000 1.21 *************** *** 187,192 **** /** ! * Get the current node factory. ! * @return The lexer's cursor position. */ public void setNodeFactory (NodeFactory factory) --- 187,192 ---- /** ! * Set the current node factory. ! * @param factory The node factory to be used by the lexer.
*/ public void setNodeFactory (NodeFactory factory) Index: Page.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v retrieving revision 1.26 retrieving revision 1.27 diff -C2 -d -r1.26 -r1.27 *** Page.java 9 Nov 2003 17:07:08 -0000 1.26 --- Page.java 7 Dec 2003 23:41:40 -0000 1.27 *************** *** 669,672 **** --- 669,673 ---- ParserException { + String encoding; InputStream stream; char[] buffer; *************** *** 674,678 **** char[] new_chars; ! if (!getEncoding ().equals (character_set)) { stream = getSource ().getStream (); --- 675,680 ---- char[] new_chars; ! encoding = getEncoding (); ! if (!encoding.equals (character_set)) { stream = getSource ().getStream (); *************** *** 694,698 **** + " != old: " + buffer[i] ! + ") for encoding at offset " + offset); } --- 696,704 ---- + " != old: " + buffer[i] ! + ") for encoding change from " ! + encoding ! + " to " ! + character_set ! + " at offset " + offset); } |
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests/tagTests Modified Files: AllTests.java AppletTagTest.java BaseHrefTagTest.java BodyTagTest.java CompositeTagTest.java DoctypeTagTest.java EndTagTest.java FormTagTest.java FrameSetTagTest.java FrameTagTest.java ImageTagTest.java InputTagTest.java JspTagTest.java LinkTagTest.java MetaTagTest.java ObjectCollectionTest.java OptionTagTest.java ScriptTagTest.java SelectTagTest.java StyleTagTest.java TagTest.java TextareaTagTest.java TitleTagTest.java Added Files: BulletListTagTest.java BulletTagTest.java DivTagTest.java HeadTagTest.java HtmlTagTest.java LabelTagTest.java SpanTagTest.java TableTagTest.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser which is similar to having called the old 'registerDomScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
--- NEW FILE: BulletListTagTest.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/BulletListTagTest.java,v $ // $Author: derrickoswald $ // $Date: 2003/12/07 23:41:43 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.tests.tagTests; import org.htmlparser.Node; import org.htmlparser.tests.ParserTestCase; import org.htmlparser.StringNode; import org.htmlparser.tags.Bullet; import org.htmlparser.tags.BulletList; import org.htmlparser.tags.CompositeTag; import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; public class BulletListTagTest extends ParserTestCase { static { System.setProperty ("org.htmlparser.tests.tagTests.BulletListTagTest", "BulletListTagTest"); } public BulletListTagTest (String name) { super(name); } public void testScan() throws ParserException { createParser( "<ul TYPE=DISC>" + "<ul TYPE=\"DISC\"><li>Energy supply\n"+ " (Campbell) <A HREF=\"/hansard/37th3rd/h20307p.htm#1646\">1646</A>\n"+ " (MacPhail) <A HREF=\"/hansard/37th3rd/h20307p.htm#1646\">1646</A>\n"+ "</ul><A 
NAME=\"calpinecorp\"></A><B>Calpine Corp.</B>\n"+ "<ul TYPE=\"DISC\"><li>Power plant projects\n"+ " (Neufeld) <A HREF=\"/hansard/37th3rd/h20314p.htm#1985\">1985</A>\n"+ "</ul>" + "</ul>" ); parseAndAssertNodeCount(1); NodeList nestedBulletLists = ((CompositeTag)node[0]).searchFor( BulletList.class ); assertEquals( "bullets in first list", 2, nestedBulletLists.size() ); BulletList firstList = (BulletList)nestedBulletLists.elementAt(0); Bullet firstBullet = (Bullet)firstList.childAt(0); Node firstNodeInFirstBullet = firstBullet.childAt(0); assertType( "first child in bullet", StringNode.class, firstNodeInFirstBullet ); assertStringEquals( "expected text", "Energy supply\n" + " (Campbell) ", firstNodeInFirstBullet.toPlainTextString() ); } public void testMissingendtag () throws ParserException { createParser ("<li>item 1<li>item 2"); parseAndAssertNodeCount (2); assertStringEquals ("item 1 not correct", "item 1", ((Bullet)node[0]).childAt (0).toHtml ()); assertStringEquals ("item 2 not correct", "item 2", ((Bullet)node[1]).childAt (0).toHtml ()); } } --- NEW FILE: BulletTagTest.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2003 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/BulletTagTest.java,v $ // $Author: derrickoswald $ // $Date: 2003/12/07 23:41:43 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. 
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.tests.tagTests;

import org.htmlparser.Node;
import org.htmlparser.tests.ParserTestCase;
import org.htmlparser.tags.Bullet;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.ParserException;

public class BulletTagTest extends ParserTestCase
{
    static
    {
        System.setProperty ("org.htmlparser.tests.tagTests.BulletTagTest", "BulletTagTest");
    }

    public BulletTagTest (String name)
    {
        super(name);
    }

    public void testBulletFound() throws Exception
    {
        createParser(
            "<LI><A HREF=\"collapseHierarchy.html\">Collapse Hierarchy</A>\n"+
            "</LI>"
        );
        parseAndAssertNodeCount(1);
        assertType("should be a bullet",Bullet.class,node[0]);
    }

    public void testOutOfMemoryBug() throws ParserException
    {
        createParser(
            "<html>" +
            "<head>" +
            "<title>Foo</title>" +
            "</head>" +
            "<body>" +
            " <ul>" +
            " <li>" +
            " <a href=\"http://foo.com/c.html\">bibliographies on:" +
            " <ul>" +
            " <li>chironomidae</li>" +
            " </ul>" +
            " </a>" +
            " </li>" +
            " </ul>" +
            "" +
            "</body>" +
            "</html>"
        );
        for (NodeIterator i = parser.elements();i.hasMoreNodes();)
            i.nextNode();
    }

    public void testNonEndedBullets() throws ParserException
    {
        createParser(
            "<li>forest practices legislation penalties for non-compliance\n"+
            " (Kwan) <A HREF=\"/hansard/37th3rd/h21107a.htm#4384\">4384-5</A>\n"+
            "<li>passenger rail service\n"+
            " (MacPhail) <A HREF=\"/hansard/37th3rd/h21021p.htm#3904\">3904</A>\n"+
            "<li>referendum on principles for treaty negotiations\n"+
            " (MacPhail) <A HREF=\"/hansard/37th3rd/h20313p.htm#1894\">1894</A>\n"+
            "<li>transportation infrastructure projects\n"+
            " (MacPhail) <A HREF=\"/hansard/37th3rd/h21022a.htm#3945\">3945-7</A>\n"+
            "<li>tuition fee freeze"
        );
        parseAndAssertNodeCount(5);
        for (int i=0;i<nodeCount;i++)
        {
            assertType("node "+i,Bullet.class,node[i]);
        }
    }
}

--- NEW FILE: DivTagTest.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/DivTagTest.java,v $
// $Author: derrickoswald $
// $Date: 2003/12/07 23:41:43 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.tests.tagTests;

import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.Div;
import org.htmlparser.tags.InputTag;
import org.htmlparser.tags.TableTag;
import org.htmlparser.tags.Tag;
import org.htmlparser.tests.ParserTestCase;
import org.htmlparser.util.ParserException;

public class DivTagTest extends ParserTestCase
{
    static
    {
        System.setProperty ("org.htmlparser.tests.tagTests.DivTagTest", "DivTagTest");
    }

    public DivTagTest (String name)
    {
        super(name);
    }

    public void testScan() throws ParserException
    {
        createParser("<table><div align=\"left\">some text</div></table>");
        parseAndAssertNodeCount(1);
        assertType("node should be table",TableTag.class,node[0]);
        TableTag tableTag = (TableTag)node[0];
        Div div = (Div)tableTag.searchFor(Div.class).toNodeArray()[0];
        assertEquals("div contents","some text",div.toPlainTextString());
    }

    /**
     * Test case for bug #735193 Explicit tag type recognition for CompositTags not working.
     */
    public void testInputInDiv() throws ParserException
    {
        createParser("<div><INPUT type=\"text\" name=\"X\">Hello</INPUT></div>");
        parser.setNodeFactory (
            new PrototypicalNodeFactory (
                new Tag[]
                {
                    new Div (),
                    new InputTag (),
                }));
        parseAndAssertNodeCount(1);
        assertType("node should be div",Div.class,node[0]);
        Div div = (Div)node[0];
        assertType("child not input",InputTag.class,div.getChild (0));
    }
}

--- NEW FILE: HeadTagTest.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/HeadTagTest.java,v $
// $Author: derrickoswald $
// $Date: 2003/12/07 23:41:43 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.tests.tagTests;

import org.htmlparser.tags.HeadTag;
import org.htmlparser.tags.Html;
import org.htmlparser.tests.ParserTestCase;
import org.htmlparser.util.ParserException;

public class HeadTagTest extends ParserTestCase
{
    static
    {
        System.setProperty ("org.htmlparser.tests.tagTests.HeadTagTest", "HeadTagTest");
    }

    public HeadTagTest (String name)
    {
        super(name);
    }

    public void testSimpleHead() throws ParserException
    {
        createParser("<HTML><HEAD></HEAD></HTML>");
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof Html);
        Html htmlTag = (Html)node[0];
        assertTrue(htmlTag.getChild(0) instanceof HeadTag);
    }

    public void testSimpleHeadWithoutEndTag() throws ParserException
    {
        createParser("<HTML><HEAD></HTML>");
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof Html);
        Html htmlTag = (Html)node[0];
        assertTrue(htmlTag.getChild(0) instanceof HeadTag);
        HeadTag headTag = (HeadTag)htmlTag.getChild(0);
        assertEquals("toHtml()","<HEAD></HEAD>",headTag.toHtml());
        assertEquals("toHtml()","<HTML><HEAD></HEAD></HTML>",htmlTag.toHtml());
    }

    public void testSimpleHeadWithBody() throws ParserException
    {
        createParser("<HTML><HEAD><BODY></HTML>");
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof Html);
        Html htmlTag = (Html)node[0];
        assertTrue(htmlTag.getChild(0) instanceof HeadTag);
        //assertTrue(htmlTag.getChild(1) instanceof BodyTag);
        HeadTag headTag = (HeadTag)htmlTag.getChild(0);
        assertEquals("toHtml()","<HEAD></HEAD>",headTag.toHtml());
        assertEquals("toHtml()","<HTML><HEAD></HEAD><BODY></BODY></HTML>",htmlTag.toHtml());
    }
}

--- NEW FILE: HtmlTagTest.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/HtmlTagTest.java,v $
// $Author: derrickoswald $
// $Date: 2003/12/07 23:41:43 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.tests.tagTests;

import org.htmlparser.Node;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.Html;
import org.htmlparser.tags.Tag;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.tests.ParserTestCase;
import org.htmlparser.util.NodeList;

public class HtmlTagTest extends ParserTestCase
{
    static
    {
        System.setProperty ("org.htmlparser.tests.tagTests.HtmlTagTest", "HtmlTagTest");
    }

    public HtmlTagTest (String name)
    {
        super(name);
    }

    public void testScan() throws Exception
    {
        createParser(
            "<html>" +
            " <head>" +
            " <title>Some Title</title>" +
            " </head>" +
            " <body>" +
            " Some data" +
            " </body>" +
            "</html>");
        parser.setNodeFactory (
            new PrototypicalNodeFactory (
                new Tag[]
                {
                    new TitleTag (),
                    new Html (),
                }));
        parseAndAssertNodeCount(1);
        assertType("html tag",Html.class,node[0]);
        Html html = (Html)node[0];
        NodeList nodeList = new NodeList();
        NodeClassFilter filter = new NodeClassFilter (TitleTag.class);
        html.collectInto(nodeList, filter);
        assertEquals("nodelist size",1,nodeList.size());
        Node node = nodeList.elementAt(0);
        assertType("expected title tag",TitleTag.class,node);
        TitleTag titleTag = (TitleTag)node;
        assertStringEquals("title","Some Title",titleTag.getTitle());
    }
}

--- NEW FILE: LabelTagTest.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/LabelTagTest.java,v $
// $Author: derrickoswald $
// $Date: 2003/12/07 23:41:43 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.tests.tagTests;

import java.util.Hashtable;

import org.htmlparser.tags.LabelTag;
import org.htmlparser.tests.ParserTestCase;
import org.htmlparser.util.ParserException;

public class LabelTagTest extends ParserTestCase
{
    static
    {
        System.setProperty ("org.htmlparser.tests.tagTests.LabelTagTest", "LabelTagTest");
    }

    public LabelTagTest (String name)
    {
        super(name);
    }

    public void testSimpleLabels() throws ParserException
    {
        String html = "<label>This is a label tag</label>";
        createParser(html);
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof LabelTag);
        // check the title node
        LabelTag labelTag = (LabelTag) node[0];
        assertEquals("Label","This is a label tag",labelTag.getChildrenHTML());
        assertEquals("Label","This is a label tag",labelTag.getLabel());
        assertStringEquals("Label", html, labelTag.toHtml());
    }

    public void testLabelWithJspTag() throws ParserException
    {
        String label = "<label><%=labelValue%></label>";
        createParser(label);
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof LabelTag);
        // check the title node
        LabelTag labelTag = (LabelTag) node[0];
        assertStringEquals("Label",label,labelTag.toHtml());
    }

    public void testLabelWithOtherTags() throws ParserException
    {
        String html = "<label><span>Span within label</span></label>";
        createParser(html);
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof LabelTag);
        // check the title node
        LabelTag labelTag = (LabelTag) node[0];
        assertEquals("Label value","Span within label",labelTag.getLabel());
        assertStringEquals("Label", html, labelTag.toHtml());
    }

    public void testLabelWithManyCompositeTags() throws ParserException
    {
        String guts = "<span>Jane <b> Doe </b> Smith</span>";
        String html = "<label>" + guts + "</label>";
        createParser(html);
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof LabelTag);
        LabelTag labelTag = (LabelTag) node[0];
        assertEquals("Label value",guts,labelTag.getChildrenHTML());
        assertEquals("Label value","Jane Doe Smith",labelTag.getLabel());
        assertStringEquals("Label",html,labelTag.toHtml());
    }

    public void testLabelsID() throws ParserException
    {
        String html = "<label>John Doe</label>";
        createParser(html);
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof LabelTag);
        LabelTag labelTag = (LabelTag) node[0];
        assertStringEquals("Label", html, labelTag.toHtml());
        Hashtable attr = labelTag.getAttributes();
        assertNull("ID",attr.get("id"));
    }

    public void testNestedLabels() throws ParserException
    {
        String label1 = "<label id=\"attr1\">";
        String label2 = "<label>Jane Doe";
        createParser(label1 + label2);
        parseAndAssertNodeCount(2);
        assertTrue(node[0] instanceof LabelTag);
        assertTrue(node[1] instanceof LabelTag);
        LabelTag labelTag = (LabelTag) node[0];
        assertStringEquals("Label", label1 + "</label>", labelTag.toHtml());
        labelTag = (LabelTag) node[1];
        assertStringEquals("Label", label2 + "</label>",labelTag.toHtml());
        Hashtable attr = labelTag.getAttributes();
        assertNull("ID",attr.get("id"));
    }

    public void testNestedLabels2() throws ParserException
    {
        String label1 = "<LABEL value=\"Google Search\">Google</LABEL>";
        String label2 = "<LABEL value=\"AltaVista Search\">AltaVista";
        String label3 = "<LABEL value=\"Lycos Search\"></LABEL>";
        String label4 = "<LABEL>Yahoo!</LABEL>";
        String label5 = "<LABEL>\nHotmail</LABEL>";
        String label6 = "<LABEL value=\"ICQ Messenger\">";
        String label7 = "<LABEL>Mailcity\n</LABEL>";
        String label8 = "<LABEL>\nIndiatimes\n</LABEL>";
        String label9 = "<LABEL>\nRediff\n</LABEL>";
        String label10 = "<LABEL>Cricinfo";
        String label11 = "<LABEL value=\"Microsoft Passport\">";
        String label12 = "<LABEL value=\"AOL\"><SPAN>AOL</SPAN></LABEL>";
        String label13 = "<LABEL value=\"Time Warner\">Time <B>Warner <SPAN>AOL </SPAN>Inc.</B>";
        String testHTML = label1 + label2 + label3 + label4 + label5 + label6
            + label7 + label8 + label9 + label10 + label11 + label12 + label13;
        createParser(testHTML);
        parseAndAssertNodeCount(13);
        LabelTag LabelTag;
        LabelTag = (LabelTag) node[0];
        assertStringEquals("HTML String", label1, LabelTag.toHtml());
        LabelTag = (LabelTag) node[1];
        assertStringEquals("HTML String", label2 + "</LABEL>", LabelTag.toHtml());
        LabelTag = (LabelTag) node[2];
        assertStringEquals("HTML String", label3, LabelTag.toHtml());
        LabelTag = (LabelTag) node[3];
        assertStringEquals("HTML String", label4, LabelTag.toHtml());
        LabelTag = (LabelTag) node[4];
        assertStringEquals("HTML String", label5, LabelTag.toHtml());
        LabelTag = (LabelTag) node[5];
        assertStringEquals("HTML String", label6 + "</LABEL>",LabelTag.toHtml());
        LabelTag = (LabelTag) node[6];
        assertStringEquals("HTML String", label7, LabelTag.toHtml());
        LabelTag = (LabelTag) node[7];
        assertStringEquals("HTML String", label8, LabelTag.toHtml());
        LabelTag = (LabelTag) node[8];
        assertStringEquals("HTML String", label9, LabelTag.toHtml());
        LabelTag = (LabelTag) node[9];
        assertStringEquals("HTML String", label10 + "</LABEL>",LabelTag.toHtml());
        LabelTag = (LabelTag) node[10];
        assertStringEquals("HTML String", label11 + "</LABEL>",LabelTag.toHtml());
        LabelTag = (LabelTag) node[11];
        assertStringEquals("HTML String", label12, LabelTag.toHtml());
        LabelTag = (LabelTag) node[12];
        assertStringEquals("HTML String", label13 + "</LABEL>",LabelTag.toHtml());
    }
}

--- NEW FILE: SpanTagTest.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/SpanTagTest.java,v $
// $Author: derrickoswald $
// $Date: 2003/12/07 23:41:43 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.tests.tagTests;

import org.htmlparser.Node;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.tags.Span;
import org.htmlparser.tags.TableColumn;
import org.htmlparser.tags.Tag;
import org.htmlparser.tests.ParserTestCase;

public class SpanTagTest extends ParserTestCase
{
    static
    {
        System.setProperty ("org.htmlparser.tests.tagTests.SpanTagTest", "SpanTagTest");
    }

    private static final String HTML_WITH_SPAN =
        "<TD BORDER=\"0.0\" VALIGN=\"Top\" COLSPAN=\"4\" WIDTH=\"33.33%\">" +
        " <DIV>" +
        " <SPAN>Flavor: small(90 to 120 minutes)<BR /></SPAN>" +
        " <SPAN>The short version of our Refactoring Challenge gives participants a general feel for the smells in the code base and includes time for participants to find and implement important refactorings.
<BR /></SPAN>" +
        " </DIV>" +
        "</TD>";

    public SpanTagTest (String name)
    {
        super(name);
    }

    public void testScan() throws Exception
    {
        createParser(
            HTML_WITH_SPAN
        );
        parser.setNodeFactory (
            new PrototypicalNodeFactory (
                new Tag[]
                {
                    new TableColumn (),
                    new Span (),
                }));
        parseAndAssertNodeCount(1);
        assertType("node",TableColumn.class,node[0]);
        TableColumn col = (TableColumn)node[0];
        Node spans [] = col.searchFor(Span.class).toNodeArray();
        assertEquals("number of spans found",2,spans.length);
        assertStringEquals(
            "span 1",
            "Flavor: small(90 to 120 minutes)",
            spans[0].toPlainTextString()
        );
        assertStringEquals(
            "span 2",
            "The short version of our Refactoring Challenge gives participants a general feel for the smells in the code base and includes time for participants to find and implement important refactorings.
",
            spans[1].toPlainTextString()
        );
    }
}

--- NEW FILE: TableTagTest.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2003 Derrick Oswald
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/TableTagTest.java,v $
// $Author: derrickoswald $
// $Date: 2003/12/07 23:41:43 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.tests.tagTests;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.Html;
import org.htmlparser.tags.TableColumn;
import org.htmlparser.tags.TableRow;
import org.htmlparser.tags.TableTag;
import org.htmlparser.tests.ParserTestCase;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.ParserException;

public class TableTagTest extends ParserTestCase
{
    static
    {
        System.setProperty ("org.htmlparser.tests.tagTests.TableTagTest", "TableTagTest");
    }

    public TableTagTest (String name)
    {
        super(name);
    }

    private String createHtmlWithTable()
    {
        return
            "<table width=\"100.0%\" align=\"Center\" cellpadding=\"5.0\" cellspacing=\"0.0\" border=\"0.0\">"+
            " <tr>" +
            " <td border=\"0.0\" valign=\"Top\" colspan=\"5\">" +
            " <img
src=\"file:/c:/data/dev/eclipse_workspace/ShoppingServerTests/resources/pictures/fishbowl.jpg\" width=\"446.0\" height=\"335.0\" />" +
            " </td>" +
            " <td border=\"0.0\" align=\"Justify\" valign=\"Top\" colspan=\"7\">" +
            " <span>The best way to improve your refactoring skills is to practice cleaning up poorly-designed code. And we've got just the thing: code we custom-designed to reek of over 90% of the code smells identified in the refactoring literature. This poorly designed code functions correctly, which you can verify by running a full suite of tests against it. Your challenge is to identify the smells in this code, determining which refactoring(s) can help you clean up the smells and implement the refactorings to arrive at a well-designed new version of the code that continues to pass its unit tests. This exercise takes place using our popular class fishbowl. There is a lot to learn from this challenge, so we recommend that you spend as much time on it as possible.
<br /></span>" +
            " </td>" +
            " </tr>" +
            "</table>";
    }

    public void testScan() throws ParserException
    {
        createParser(createHtmlWithTable());
        parseAndAssertNodeCount(1);
        assertTrue(node[0] instanceof TableTag);
        TableTag tableTag = (TableTag)node[0];
        assertEquals("rows",1,tableTag.getRowCount());
        TableRow row = tableTag.getRow(0);
        assertEquals("columns in row 1",2,row.getColumnCount());
        assertEquals("table width","100.0%",tableTag.getAttribute("WIDTH"));
    }

    public void testErroneousTables() throws ParserException
    {
        createParser(
            "<HTML>\n"+
            "<table border>\n"+
            "<tr>\n"+
            "<td>Head1</td>\n"+
            "<td>Val1</td>\n"+
            "</tr>\n"+
            "<tr>\n"+
            "<td>Head2</td>\n"+
            "<td>Val2</td>\n"+
            "</tr>\n"+
            "<tr>\n"+
            "<td>\n"+
            "<table border>\n"+
            "<tr>\n"+
            "<td>table2 Head1</td>\n"+
            "<td>table2 Val1</td>\n"+
            "</tr>\n"+
            "</table>\n"+
            "</td>\n"+
            "</tr>\n"+
            "</BODY>\n"+
            "</HTML>"
        );
        parseAndAssertNodeCount(1);
        assertType("only tag should be a HTML tag", Html.class,node[0]);
        Html html = (Html)node[0];
        assertEquals("html tag should have 4 children", 4, html.getChildCount ());
        assertType("second tag",TableTag.class,html.getChild (1));
        TableTag table = (TableTag)html.getChild (1);
        assertEquals("rows",3,table.getRowCount());
        TableRow tr = table.getRow(2);
        assertEquals("columns",1,tr.getColumnCount());
        TableColumn td = tr.getColumns()[0];
        Node node = td.childAt(1);
        assertType("node",TableTag.class,node);
        TableTag table2 = (TableTag)node;
        assertEquals("second table row count",1,table2.getRowCount());
        tr = table2.getRow(0);
        assertEquals("second table col count",2,tr.getColumnCount());
    }

    /**
     * Test many unclosed tags (causes heavy recursion).
     * See feature request #729259 Increase maximum recursion depth.
     * Only perform this test if it's version 1.4 or higher.
     */
    public void testRecursionDepth () throws ParserException
    {
        Parser parser;
        String url = "http://htmlparser.sourceforge.net/test/badtable2.html";

        parser = new Parser (url);
        for (NodeIterator e = parser.elements();e.hasMoreNodes();)
            e.nextNode();
        // Note: The test will throw a StackOverFlowException,
        // so we are successful if we get to here...
        assertTrue ("Crash", true);
    }

    /**
     * See bug #742254 Nested <TR> &<TD> tags should not be allowed
     */
    public void testUnClosed1 () throws ParserException
    {
        createParser ("<TABLE><TR><TR></TR></TABLE>");
        parseAndAssertNodeCount (1);
        String s = node[0].toHtml ();
        assertEquals ("Unclosed","<TABLE><TR></TR><TR></TR></TABLE>",s);
    }

    /**
     * See bug #742254 Nested <TR> &<TD> tags should not be allowed
     */
    public void testUnClosed2 () throws ParserException
    {
        createParser ("<TABLE><TR><TD><TD></TD></TR></TABLE>");
        parseAndAssertNodeCount (1);
        String s = node[0].toHtml ();
        assertEquals ("Unclosed","<TABLE><TR><TD></TD><TD></TD></TR></TABLE>",s);
    }

    /**
     * See bug #742254 Nested <TR> &<TD> tags should not be allowed
     */
    public void testUnClosed3 () throws ParserException
    {
        createParser ("<TABLE><TR><TD>blah blah</TD><TR><TD>blah blah</TD></TR></TABLE>");
        parseAndAssertNodeCount (1);
        String s = node[0].toHtml ();
        assertEquals ("Unclosed","<TABLE><TR><TD>blah blah</TD></TR><TR><TD>blah blah</TD></TR></TABLE>",s);
    }

    /**
     * See bug #750117 StackOverFlow while Node-Iteration
     * Not reproducible.
     */
    public void testOverFlow () throws ParserException
    {
        parser =
            new Parser(
                "http://www.sec.gov/Archives/edgar/data/30554/000089322002000287/w57038e10-k.htm"
            );
        Node node;
        for (NodeIterator e = parser.elements(); e.hasMoreNodes(); )
            node = e.nextNode();
    }
}

Index: AllTests.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/AllTests.java,v
retrieving revision 1.47
retrieving revision 1.48
diff -C2 -d -r1.47 -r1.48
*** AllTests.java 9 Nov 2003 17:07:15 -0000 1.47
--- AllTests.java 7 Dec 2003 23:41:42 -0000 1.48
***************
*** 46,71 ****
      public static TestSuite suite() {
          TestSuite suite = new TestSuite("Tag Tests");
!         suite.addTestSuite(JspTagTest.class);
!         suite.addTestSuite(ScriptTagTest.class);
!         suite.addTestSuite(ImageTagTest.class);
!         suite.addTestSuite(LinkTagTest.class);
!         suite.addTestSuite(TagTest.class);
!         suite.addTestSuite(TitleTagTest.class);
          suite.addTestSuite(DoctypeTagTest.class);
          suite.addTestSuite(EndTagTest.class);
!         suite.addTestSuite(MetaTagTest.class);
!         suite.addTestSuite(StyleTagTest.class);
!         suite.addTestSuite(AppletTagTest.class);
!         suite.addTestSuite(FrameTagTest.class);
          suite.addTestSuite(FrameSetTagTest.class);
          suite.addTestSuite(InputTagTest.class);
          suite.addTestSuite(OptionTagTest.class);
          suite.addTestSuite(SelectTagTest.class);
          suite.addTestSuite(TextareaTagTest.class);
!         suite.addTestSuite(FormTagTest.class);
!         suite.addTestSuite(BaseHrefTagTest.class);
!         suite.addTestSuite(ObjectCollectionTest.class);
!         suite.addTestSuite(BodyTagTest.class);
!         suite.addTestSuite(CompositeTagTest.class);
          return suite;
      }
--- 46,79 ----
      public static TestSuite suite() {
          TestSuite suite = new TestSuite("Tag Tests");
!         suite.addTestSuite(AppletTagTest.class);
!         suite.addTestSuite(BaseHrefTagTest.class);
!         suite.addTestSuite(BodyTagTest.class);
!         suite.addTestSuite(BulletListTagTest.class);
!         suite.addTestSuite(BulletTagTest.class);
!         suite.addTestSuite(CompositeTagTest.class);
!         suite.addTestSuite(DivTagTest.class);
          suite.addTestSuite(DoctypeTagTest.class);
          suite.addTestSuite(EndTagTest.class);
!         suite.addTestSuite(FormTagTest.class);
          suite.addTestSuite(FrameSetTagTest.class);
+         suite.addTestSuite(FrameTagTest.class);
+         suite.addTestSuite(HeadTagTest.class);
+         suite.addTestSuite(HtmlTagTest.class);
+         suite.addTestSuite(ImageTagTest.class);
          suite.addTestSuite(InputTagTest.class);
+         suite.addTestSuite(JspTagTest.class);
+         suite.addTestSuite(LabelTagTest.class);
+         suite.addTestSuite(LinkTagTest.class);
+         suite.addTestSuite(MetaTagTest.class);
+         suite.addTestSuite(ObjectCollectionTest.class);
          suite.addTestSuite(OptionTagTest.class);
+         suite.addTestSuite(ScriptTagTest.class);
          suite.addTestSuite(SelectTagTest.class);
+         suite.addTestSuite(SpanTagTest.class);
+         suite.addTestSuite(StyleTagTest.class);
+         suite.addTestSuite(TableTagTest.class);
+         suite.addTestSuite(TagTest.class);
          suite.addTestSuite(TextareaTagTest.class);
!         suite.addTestSuite(TitleTagTest.class);
          return suite;
      }

Index: AppletTagTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/AppletTagTest.java,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** AppletTagTest.java 9 Nov 2003 17:07:15 -0000 1.34
--- AppletTagTest.java 7 Dec 2003 23:41:43 -0000 1.35
***************
*** 29,32 ****
--- 29,33 ----
  package org.htmlparser.tests.tagTests;

+ import java.util.Enumeration;
  import java.util.Hashtable;

***************
*** 59,63 ****
              "</HTML>";
          createParser(testHTML);
-         parser.registerScanners();
          parseAndAssertNodeCount(3);
          assertTrue("Node should be an applet tag",node[0] instanceof AppletTag);
--- 60,63 ----
***************
*** 73,76 ****
--- 73,108 ----
      }

+     public void testScan() throws ParserException
+     {
+         String [][]paramsData = {{"Param1","Value1"},{"Name","Somik"},{"Age","23"}};
+         Hashtable paramsMap = new Hashtable();
+         String testHTML = new String("<APPLET CODE=Myclass.class ARCHIVE=test.jar CODEBASE=www.kizna.com>\n");
+         for (int i = 0;i<paramsData.length;i++)
+         {
+             testHTML+="<PARAM NAME=\""+paramsData[i][0]+"\" VALUE=\""+paramsData[i][1]+"\">\n";
+             paramsMap.put(paramsData[i][0],paramsData[i][1]);
+         }
+         testHTML+=
+             "</APPLET></HTML>";
+         createParser(testHTML);
+         parseAndAssertNodeCount(2);
+         assertTrue("Node should be an applet tag",node[0] instanceof AppletTag);
+         // Check the data in the applet tag
+         AppletTag appletTag = (AppletTag)node[0];
+         assertEquals("Class Name","Myclass.class",appletTag.getAppletClass());
+         assertEquals("Archive","test.jar",appletTag.getArchive());
+         assertEquals("Codebase","www.kizna.com",appletTag.getCodeBase());
+         // Check the params data
+         int cnt = 0;
+         for (Enumeration e = appletTag.getParameterNames();e.hasMoreElements();)
+         {
+             String paramName = (String)e.nextElement();
+             String paramValue = appletTag.getParameter(paramName);
+             assertEquals("Param "+cnt+" value",paramsMap.get(paramName),paramValue);
+             cnt++;
+         }
+         assertEquals("Number of params",new Integer(paramsData.length),new Integer(cnt));
+     }
+
      public void testChangeCodebase() throws ParserException {
          String [][]paramsData = {{"Param1","Value1"},{"Name","Somik"},{"Age","23"}};
***************
*** 86,90 ****
              "</HTML>";
          createParser(testHTML);
-         parser.registerScanners();
          parseAndAssertNodeCount(3);
          assertTrue("Node should be an applet tag",node[0] instanceof AppletTag);
--- 118,121 ----
***************
*** 113,117 ****
              "</APPLET>";
          createParser(testHTML + "\n</HTML>");
-         parser.registerScanners();
          parseAndAssertNodeCount(3);
          assertTrue("Node should be an applet tag",node[0] instanceof AppletTag);
--- 144,147 ----
***************
*** 137,141 ****
              "</APPLET>";
          createParser(testHTML + "\n</HTML>");
-         parser.registerScanners();
          parseAndAssertNodeCount(3);
          assertTrue("Node should be an applet tag",node[0] instanceof AppletTag);
--- 167,170 ----
***************
*** 162,166 ****
              "</HTML>";
          createParser(testHTML);
-         parser.registerScanners();
          parseAndAssertNodeCount(3);
          assertTrue("Node should be an applet tag",node[0] instanceof AppletTag);
--- 191,194 ----

Index: BaseHrefTagTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/BaseHrefTagTest.java,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** BaseHrefTagTest.java 9 Nov 2003 17:07:15 -0000 1.34
--- BaseHrefTagTest.java 7 Dec 2003 23:41:43 -0000 1.35
***************
*** 30,35 ****
--- 30,41 ----

  import java.util.Vector;
+
+ import org.htmlparser.PrototypicalNodeFactory;
  import org.htmlparser.tags.BaseHrefTag;
+ import org.htmlparser.tags.LinkTag;
+ import org.htmlparser.tags.Tag;
+ import org.htmlparser.tags.TitleTag;
  import org.htmlparser.tests.ParserTestCase;
+ import org.htmlparser.util.LinkProcessor;
  import org.htmlparser.util.ParserException;

***************
*** 51,59 ****
      }

      public void testNotHREFBaseTag() throws ParserException
      {
          String html = "<base target=\"_top\">";
          createParser(html);
-         parser.registerScanners();
          parseAndAssertNodeCount(1);
          assertTrue("Should be a base tag but was "+node[0].getClass().getName(),node[0] instanceof BaseHrefTag);
--- 57,89 ----
      }

+     public void testRemoveLastSlash() {
+         String url1 = "http://www.yahoo.com/";
+         String url2 = "http://www.google.com";
+         String modifiedUrl1 = LinkProcessor.removeLastSlash(url1);
+         String modifiedUrl2 = LinkProcessor.removeLastSlash(url2);
+         assertEquals("Url1","http://www.yahoo.com",modifiedUrl1);
+         assertEquals("Url2","http://www.google.com",modifiedUrl2);
+     }
+
+     public void testScan() throws ParserException{
+         createParser("<html><head><TITLE>test page</TITLE><BASE HREF=\"http://www.abc.com/\"><a href=\"home.cfm\">Home</a>...</html>","http://www.google.com/test/index.html");
+         parser.setNodeFactory (
+             new PrototypicalNodeFactory (
+                 new Tag[]
+                 {
+                     new TitleTag (),
+                     new LinkTag (),
+                     new BaseHrefTag (),
+                 }));
+         parseAndAssertNodeCount(7);
+         assertTrue("Base href tag should be the 4th tag", node[3] instanceof BaseHrefTag);
+         BaseHrefTag baseRefTag = (BaseHrefTag)node[3];
+         assertEquals("Base HREF Url","http://www.abc.com",baseRefTag.getBaseUrl());
+     }
+
      public void testNotHREFBaseTag() throws ParserException
      {
          String html = "<base target=\"_top\">";
          createParser(html);
          parseAndAssertNodeCount(1);
          assertTrue("Should be a base tag but was "+node[0].getClass().getName(),node[0] instanceof BaseHrefTag);

Index: BodyTagTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/BodyTagTest.java,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** BodyTagTest.java 9 Nov 2003 17:07:15 -0000 1.17
--- BodyTagTest.java 7 Dec 2003 23:41:43 -0000 1.18
***************
*** 33,38 ****

  import org.htmlparser.Node;
! import org.htmlparser.scanners.BodyScanner;
  import org.htmlparser.tags.BodyTag;
  import org.htmlparser.tests.ParserTestCase;
  import org.htmlparser.util.NodeIterator;
--- 33,41 ----

  import org.htmlparser.Node;
! import org.htmlparser.PrototypicalNodeFactory;
  import org.htmlparser.tags.BodyTag;
+ import org.htmlparser.tags.Html;
+ import org.htmlparser.tags.Tag;
+ import org.htmlparser.tags.TitleTag;
  import org.htmlparser.tests.ParserTestCase;
  import org.htmlparser.util.NodeIterator;
***************
*** 56,64 ****
          super.setUp();
          createParser("<html><head><title>body tag test</title></head>" + html + "</html>");
!         parser.registerScanners();
!         parser.addScanner(new BodyScanner("-b"));
!         parseAndAssertNodeCount(6);
!         assertTrue(node[4] instanceof BodyTag);
!         bodyTag = (BodyTag) node[4];
      }

--- 59,68 ----
          super.setUp();
          createParser("<html><head><title>body tag test</title></head>" + html + "</html>");
!         parseAndAssertNodeCount(1);
!         assertTrue("Only node should be an HTML node",node[0] instanceof Html);
!         Html html = (Html)node[0];
!         assertTrue("HTML node should have two children",2 == html.getChildCount ());
!         assertTrue("Second node should be an BODY tag",html.getChild(1) instanceof BodyTag);
!         bodyTag = (BodyTag)html.getChild(1);
      }

***************
*** 85,89 ****
      {
          createParser("<body style=\"margin-top:4px; margin-left:20px;\" title=\"body\">");
!         parser.addScanner (new BodyScanner ("-b"));
          iterator = parser.elements ();
          node = null;
--- 89,93 ----
      {
          createParser("<body style=\"margin-top:4px; margin-left:20px;\" title=\"body\">");
!         parser.setNodeFactory (new PrototypicalNodeFactory (new BodyTag ()));
          iterator = parser.elements ();
          node = null;
***************
*** 107,110 ****
--- 111,170 ----
              fail ("exception thrown " + pe.getMessage ());
          }
+     }
+
+     public void testSimpleBody() throws ParserException {
+         createParser("<html><head><title>Test 1</title></head><body>This is a body tag</body></html>");
+         parser.setNodeFactory (
+             new PrototypicalNodeFactory (
+                 new Tag[]
+                 {
+                     new BodyTag (),
+                     new TitleTag (),
+                 }));
+         parseAndAssertNodeCount(6);
+         assertTrue(node[4] instanceof BodyTag);
+         // check the body node
+         BodyTag bodyTag = (BodyTag) node[4];
+         assertEquals("Body","This is a body tag",bodyTag.getBody());
+         assertEquals("Body","<body>This is a body tag</body>",bodyTag.toHtml());
+     }
+
+     public void testBodywithJsp() throws ParserException {
+         String body = "<body><%=BodyValue%></body>";
+         createParser("<html><head><title>Test 1</title></head>" + body + "</html>");
+         parser.setNodeFactory (new PrototypicalNodeFactory (new BodyTag ()));
+         parseAndAssertNodeCount(8);
+         assertTrue(node[6] instanceof BodyTag);
+         // check the body node
+         BodyTag bodyTag = (BodyTag) node[6];
+         assertStringEquals("Body",body,bodyTag.toHtml());
+     }
+
+     public void testBodyMixed() throws ParserException {
+         String body = "<body>before jsp<%=BodyValue%>after jsp</body>";
+         createParser("<html><head><title>Test 1</title></head>" + body + "</html>");
+         parser.setNodeFactory (
+             new
PrototypicalNodeFactory ( + new Tag[] + { + new BodyTag (), + new TitleTag (), + })); + parseAndAssertNodeCount(6); + assertTrue(node[4] instanceof BodyTag); + // check the body node + BodyTag bodyTag = (BodyTag) node[4]; + assertEquals("Body",body,bodyTag.toHtml()); + } + + public void testBodyEnding() throws ParserException { + String body = "<body>before jsp<%=BodyValue%>after jsp"; + createParser("<html>" + body + "</html>"); + parser.setNodeFactory (new PrototypicalNodeFactory (new BodyTag ())); + parseAndAssertNodeCount(3); + assertTrue(node[1] instanceof BodyTag); + // check the body node + BodyTag bodyTag = (BodyTag) node[1]; + assertEquals("Body",body + "</body>",bodyTag.toHtml()); } Index: CompositeTagTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/CompositeTagTest.java,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -d -r1.13 -r1.14 *** CompositeTagTest.java 9 Nov 2003 17:07:16 -0000 1.13 --- CompositeTagTest.java 7 Dec 2003 23:41:43 -0000 1.14 *************** *** 29,36 **** package org.htmlparser.tests.tagTests; ! import org.htmlparser.*; ! import org.htmlparser.tags.*; ! import org.htmlparser.tests.*; ! import org.htmlparser.util.*; --- 29,40 ---- package org.htmlparser.tests.tagTests; ! import org.htmlparser.Node; ! import org.htmlparser.StringNode; ! import org.htmlparser.tags.CompositeTag; ! import org.htmlparser.tags.TableColumn; ! import org.htmlparser.tags.TableRow; ! import org.htmlparser.tags.TableTag; ! import org.htmlparser.tests.ParserTestCase; ! 
import org.htmlparser.util.ParserException; *************** *** 58,62 **** "</table>" ); - parser.registerScanners(); parseAndAssertNodeCount(1); TableTag tableTag = (TableTag)node[0]; --- 62,65 ---- *************** *** 90,94 **** "</table>" ); - parser.registerScanners(); parseAndAssertNodeCount(1); TableTag tableTag = (TableTag)node[0]; --- 93,96 ---- Index: DoctypeTagTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/DoctypeTagTest.java,v retrieving revision 1.32 retrieving revision 1.33 diff -C2 -d -r1.32 -r1.33 *** DoctypeTagTest.java 9 Nov 2003 17:07:16 -0000 1.32 --- DoctypeTagTest.java 7 Dec 2003 23:41:43 -0000 1.33 *************** *** 56,63 **** "</HTML>\n"); createParser(testHTML); ! parser.registerScanners(); ! parseAndAssertNodeCount(16); ! // The node should be an DoctypeTag ! assertTrue("Node should be a DoctypeTag",node[0] instanceof DoctypeTag); DoctypeTag docTypeTag = (DoctypeTag)node[0]; assertStringEquals("toHTML()","<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0//EN\">",docTypeTag.toHtml()); --- 56,62 ---- "</HTML>\n"); createParser(testHTML); ! parseAndAssertNodeCount(4); ! // The first node should be an DoctypeTag ! 
assertTrue("First node should be a DoctypeTag",node[0] instanceof DoctypeTag); DoctypeTag docTypeTag = (DoctypeTag)node[0]; assertStringEquals("toHTML()","<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0//EN\">",docTypeTag.toHtml()); Index: EndTagTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/EndTagTest.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** EndTagTest.java 9 Nov 2003 17:07:16 -0000 1.35 --- EndTagTest.java 7 Dec 2003 23:41:43 -0000 1.36 *************** *** 30,33 **** --- 30,34 ---- package org.htmlparser.tests.tagTests; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.tags.Tag; import org.htmlparser.tests.ParserTestCase; *************** *** 47,54 **** public void testToHTML() throws ParserException { createParser("<HTML></HTML>"); ! // Register the image scanner ! parser.registerScanners(); parseAndAssertNodeCount(2); ! // The node should be an HTMLLinkTag assertTrue("Node should be a Tag",node[1] instanceof Tag); Tag endTag = (Tag)node[1]; --- 48,54 ---- public void testToHTML() throws ParserException { createParser("<HTML></HTML>"); ! parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(2); ! 
// The node should be a tag assertTrue("Node should be a Tag",node[1] instanceof Tag); Tag endTag = (Tag)node[1]; *************** *** 61,64 **** --- 61,65 ---- "<SCRIPT>document.write(d+\".com\")</SCRIPT><BR>"; createParser(testHtml); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); int pos = testHtml.indexOf("</SCRIPT>"); parseAndAssertNodeCount(4); Index: FormTagTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/FormTagTest.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** FormTagTest.java 9 Nov 2003 17:07:16 -0000 1.38 --- FormTagTest.java 7 Dec 2003 23:41:43 -0000 1.39 *************** *** 29,40 **** package org.htmlparser.tests.tagTests; import org.htmlparser.Node; import org.htmlparser.StringNode; ! import org.htmlparser.scanners.FormScanner; import org.htmlparser.tags.FormTag; import org.htmlparser.tags.InputTag; import org.htmlparser.tags.Tag; import org.htmlparser.tests.ParserTestCase; ! import org.htmlparser.tests.scannersTests.FormScannerTest; import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; --- 29,52 ---- package org.htmlparser.tests.tagTests; + import org.htmlparser.AbstractNode; import org.htmlparser.Node; + import org.htmlparser.Parser; import org.htmlparser.StringNode; ! import org.htmlparser.PrototypicalNodeFactory; ! import org.htmlparser.RemarkNode; ! import org.htmlparser.filters.NodeClassFilter; ! import org.htmlparser.tags.BodyTag; import org.htmlparser.tags.FormTag; + import org.htmlparser.tags.HeadTag; + import org.htmlparser.tags.Html; import org.htmlparser.tags.InputTag; + import org.htmlparser.tags.LinkTag; + import org.htmlparser.tags.OptionTag; + import org.htmlparser.tags.SelectTag; + import org.htmlparser.tags.TableTag; import org.htmlparser.tags.Tag; + import org.htmlparser.tags.TextareaTag; import org.htmlparser.tests.ParserTestCase; ! 
import org.htmlparser.util.NodeIterator; import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; *************** *** 47,58 **** } public FormTagTest(String name) { super(name); } ! public void testSetFormLocation() throws ParserException{ ! createParser(FormScannerTest.FORM_HTML); ! parser.registerScanners(); parseAndAssertNodeCount(1); assertTrue("Node 0 should be Form Tag",node[0] instanceof FormTag); --- 59,338 ---- } + public static final String FORM_HTML = + "<FORM METHOD=\""+FormTag.POST+"\" ACTION=\"do_login.php\" NAME=\"login_form\" onSubmit=\"return CheckData()\">\n"+ + "<TR><TD ALIGN=\"center\"> </TD></TR>\n"+ + "<TR><TD ALIGN=\"center\"><FONT face=\"Arial, verdana\" size=2><b>User Name</b></font></TD></TR>\n"+ + "<TR><TD ALIGN=\"center\"><INPUT TYPE=\"text\" NAME=\"name\" SIZE=\"20\"></TD></TR>\n"+ + "<TR><TD ALIGN=\"center\"><FONT face=\"Arial, verdana\" size=2><b>Password</b></font></TD></TR>\n"+ + "<TR><TD ALIGN=\"center\"><INPUT TYPE=\"password\" NAME=\"passwd\" SIZE=\"20\"></TD></TR>\n"+ + "<TR><TD ALIGN=\"center\"> </TD></TR>\n"+ + "<TR><TD ALIGN=\"center\"><INPUT TYPE=\"submit\" NAME=\"submit\" VALUE=\"Login\"></TD></TR>\n"+ + "<TR><TD ALIGN=\"center\"> </TD></TR>\n"+ + "<TEXTAREA name=\"Description\" rows=\"15\" cols=\"55\" wrap=\"virtual\" class=\"composef\" tabindex=\"5\">Contents of TextArea</TEXTAREA>\n"+ + // "<TEXTAREA name=\"AnotherDescription\" rows=\"15\" cols=\"55\" wrap=\"virtual\" class=\"composef\" tabindex=\"5\">\n"+ + "<INPUT TYPE=\"hidden\" NAME=\"password\" SIZE=\"20\">\n"+ + "<INPUT TYPE=\"submit\">\n"+ + "</FORM>"; + public FormTagTest(String name) { super(name); } ! public void assertTypeNameSize(String description,String type,String name,String size,InputTag inputTag) ! { ! assertEquals(description+" type",type,inputTag.getAttribute("TYPE")); ! assertEquals(description+" name",name,inputTag.getAttribute("NAME")); ! assertEquals(description+" size",size,inputTag.getAttribute("SIZE")); ! } ! 
public void assertTypeNameValue(String description,String type,String name,String value,InputTag inputTag) ! { ! assertEquals(description+" type",type,inputTag.getAttribute("TYPE")); ! assertEquals(description+" name",name,inputTag.getAttribute("NAME")); ! assertEquals(description+" value",value,inputTag.getAttribute("VALUE")); ! } ! ! public void testScan() throws ParserException ! { ! createParser(FORM_HTML,"http://www.google.com/test/index.html"); ! parseAndAssertNodeCount(1); ! assertTrue("Node 0 should be Form Tag",node[0] instanceof FormTag); ! FormTag formTag = (FormTag)node[0]; ! assertStringEquals("Method",FormTag.POST,formTag.getFormMethod()); ! assertStringEquals("Location","http://www.google.com/test/do_login.php",formTag.getFormLocation()); ! assertStringEquals("Name","login_form",formTag.getFormName()); ! InputTag nameTag = formTag.getInputTag("name"); ! InputTag passwdTag = formTag.getInputTag("passwd"); ! InputTag submitTag = formTag.getInputTag("submit"); ! InputTag dummyTag = formTag.getInputTag("dummy"); ! assertNotNull("Input Name Tag should not be null",nameTag); ! assertNotNull("Input Password Tag should not be null",passwdTag); ! assertNotNull("Input Submit Tag should not be null",submitTag); ! assertNull("Input dummy tag should be null",dummyTag); ! ! assertTypeNameSize("Input Name Tag","text","name","20",nameTag); ! assertTypeNameSize("Input Password Tag","password","passwd","20",passwdTag); ! assertTypeNameValue("Input Submit Tag","submit","submit","Login",submitTag); ! ! TextareaTag textAreaTag = formTag.getTextAreaTag("Description"); ! assertNotNull("Text Area Tag should have been found",textAreaTag); ! assertEquals("Text Area Tag Contents","Contents of TextArea",textAreaTag.getValue()); ! assertNull("Should have been null",formTag.getTextAreaTag("junk")); ! ! ... [truncated message content] |
From: <der...@us...> - 2003-12-07 23:41:48
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util In directory sc8-pr-cvs1:/tmp/cvs-serv16537/util Modified Files: Generate.java ParserUtils.java Translate.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
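Since the log message notes that worked examples of recursing into nested nodes are scarce, here is a minimal sketch of the idea. It deliberately does not use the org.htmlparser classes: SketchNode, tag(), text(), collect() and gatherText() are hypothetical stand-ins invented for illustration, so treat it as an outline of the traversal rather than the library's API. The shape loosely follows the gather() method added to Generate.java in this commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types; these are NOT the org.htmlparser classes.
class NestedNodeSketch {

    /** Minimal node: either a tag with children or a text leaf. */
    static final class SketchNode {
        final String tagName;            // null for a text leaf
        final String text;               // null for a tag
        final List<SketchNode> children; // empty for a text leaf

        SketchNode(String tagName, String text, List<SketchNode> children) {
            this.tagName = tagName;
            this.text = text;
            this.children = children;
        }
    }

    /** Build a tag node holding the given children. */
    static SketchNode tag(String name, SketchNode... children) {
        return new SketchNode(name, null, new ArrayList<>(Arrays.asList(children)));
    }

    /** Build a text leaf. */
    static SketchNode text(String value) {
        return new SketchNode(null, value, new ArrayList<>());
    }

    /** Depth-first search: with fully nested tags, the top-level list is not enough. */
    static void collect(SketchNode node, String wanted, List<SketchNode> found) {
        if (wanted.equals(node.tagName)) {
            found.add(node);
        }
        for (SketchNode child : node.children) {
            collect(child, wanted, found);
        }
    }

    /** Concatenate the text of a node and all of its descendants. */
    static String gatherText(SketchNode node) {
        if (node.text != null) {
            return node.text;
        }
        StringBuilder buffer = new StringBuilder();
        for (SketchNode child : node.children) {
            buffer.append(gatherText(child));
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // The parser now hands back one HTML node; links live several levels down.
        SketchNode page = tag("HTML",
            tag("HEAD", tag("TITLE", text("test page"))),
            tag("BODY",
                tag("A", text("Home")),
                tag("P", tag("A", text("About")))));

        List<SketchNode> links = new ArrayList<>();
        collect(page, "A", links);
        for (SketchNode link : links) {
            System.out.println(gatherText(link)); // prints "Home" then "About"
        }
    }
}
```

With the real library the same pattern would presumably walk each returned node's children in the way gather() does in the Generate.java diff below; check the javadocs for the exact calls.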
Index: Generate.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/Generate.java,v retrieving revision 1.45 retrieving revision 1.46 diff -C2 -d -r1.45 -r1.46 *** Generate.java 9 Nov 2003 17:07:17 -0000 1.45 --- Generate.java 7 Dec 2003 23:41:43 -0000 1.46 *************** *** 77,81 **** { parser = new Parser ("http://www.w3.org/TR/REC-html40/sgml/entities.html"); - parser.registerScanners (); } --- 77,80 ---- *************** *** 150,153 **** --- 149,195 ---- } + public void gather (Node node, StringBuffer buffer) + { + NodeList children; + + if (node instanceof StringNode) + { + // Node is a plain string + // Cast it to an HTMLStringNode + StringNode stringNode = (StringNode)node; + // Retrieve the data from the object + buffer.append (stringNode.getText ()); + } + else if (node instanceof LinkTag) + { + // Node is a link + // Cast it to an HTMLLinkTag + LinkTag linkNode = (LinkTag)node; + // Retrieve the data from the object and print it + buffer.append (linkNode.getLinkText ()); + } + else if (node instanceof Tag) + { + String name = ((Tag)node).getTagName (); + if (name.equals ("BR") || name.equals ("P")) + buffer.append (nl); + else + { + children = ((Tag)node).getChildren (); + if (null != children) + for (int i = 0; i < children.size (); i++) + gather (children.elementAt (i), buffer); + } + } + else if (node instanceof RemarkNode) + { + } + else + { + System.out.println (); + System.out.println(node.toString()); + } + } + /** * Pull out text elements from the HTML. *************** *** 165,199 **** { node = e.nextNode (); ! ! if (node instanceof StringNode) ! { ! // Node is a plain string ! // Cast it to an HTMLStringNode ! StringNode stringNode = (StringNode)node; ! // Retrieve the data from the object ! buffer.append (stringNode.getText ()); ! } ! else if (node instanceof LinkTag) ! { ! // Node is a link ! // Cast it to an HTMLLinkTag ! LinkTag linkNode = (LinkTag)node; ! 
// Retrieve the data from the object and print it ! buffer.append (linkNode.getLinkText ()); ! } ! else if (node instanceof Tag) ! { ! String contents = ((Tag)node).getText (); ! if (contents.equals ("BR") || contents.equals ("P")) ! buffer.append (nl); ! } ! else if (node instanceof RemarkNode) ! { ! } ! else ! { ! System.out.println (); ! System.out.println(node.toString()); ! } } --- 207,211 ---- { node = e.nextNode (); ! gather (node, buffer); } *************** *** 431,436 **** { Generate filter = new Generate (); ! System.out.println ("import java.util.Hashtable;"); System.out.println ("import java.util.Iterator;"); System.out.println (); System.out.println ("/**"); --- 443,483 ---- { Generate filter = new Generate (); ! System.out.println ("// HTMLParser Library v1_4_20031109 - A java-based parser for HTML"); ! System.out.println ("// Copyright (C) Dec 31, 2000 Somik Raha"); ! System.out.println ("//"); ! System.out.println ("// This library is free software; you can redistribute it and/or"); ! System.out.println ("// modify it under the terms of the GNU Lesser General Public"); ! System.out.println ("// License as published by the Free Software Foundation; either"); ! System.out.println ("// version 2.1 of the License, or (at your option) any later version."); ! System.out.println ("//"); ! System.out.println ("// This library is distributed in the hope that it will be useful,"); ! System.out.println ("// but WITHOUT ANY WARRANTY; without even the implied warranty of"); ! System.out.println ("// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU"); ! System.out.println ("// Lesser General Public License for more details."); ! System.out.println ("//"); ! System.out.println ("// You should have received a copy of the GNU Lesser General Public"); ! System.out.println ("// License along with this library; if not, write to the Free Software"); ! System.out.println ("// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA"); ! 
System.out.println ("//"); ! System.out.println ("// For any questions or suggestions, you can write to me at :"); ! System.out.println ("// Email :so...@in..."); ! System.out.println ("//"); ! System.out.println ("// Postal Address :"); ! System.out.println ("// Somik Raha"); ! System.out.println ("// Extreme Programmer & Coach"); ! System.out.println ("// Industrial Logic Corporation"); ! System.out.println ("// 2583 Cedar Street, Berkeley,"); ! System.out.println ("// CA 94708, USA"); ! System.out.println ("// Website : http://www.industriallogic.com"); ! System.out.println ("//"); ! System.out.println ("// This class was contributed by"); ! System.out.println ("// Derrick Oswald"); ! System.out.println ("//"); ! System.out.println (); ! System.out.println ("package org.htmlparser.util;"); ! System.out.println (); ! System.out.println ("import java.util.HashMap;"); System.out.println ("import java.util.Iterator;"); + System.out.println ("import java.util.Map;"); System.out.println (); System.out.println ("/**"); *************** *** 451,458 **** System.out.println (" * <p><code>String</code>-><code>Character</code>"); System.out.println (" */"); ! System.out.println (" protected static Hashtable mRefChar;"); System.out.println (" static"); System.out.println (" {"); ! System.out.println (" mRefChar = new Hashtable (1000);"); System.out.println (); filter.parse (); --- 498,505 ---- System.out.println (" * <p><code>String</code>-><code>Character</code>"); System.out.println (" */"); ! System.out.println (" protected static Map mRefChar;"); System.out.println (" static"); System.out.println (" {"); ! System.out.println (" mRefChar = new HashMap (1000);"); System.out.println (); filter.parse (); *************** *** 463,471 **** System.out.println (" * <p><code>Character</code>-><code>String</code>"); System.out.println (" */"); ! System.out.println (" protected static Hashtable mCharRef;"); System.out.println (" static"); System.out.println (" {"); ! 
System.out.println (" mCharRef = new Hashtable (mRefChar.size ());"); ! System.out.println (); System.out.println (" Iterator iterator = mRefChar.keySet ().iterator ();"); System.out.println (" while (iterator.hasNext ())"); --- 510,517 ---- System.out.println (" * <p><code>Character</code>-><code>String</code>"); System.out.println (" */"); ! System.out.println (" protected static Map mCharRef;"); System.out.println (" static"); System.out.println (" {"); ! System.out.println (" mCharRef = new HashMap (mRefChar.size ());"); System.out.println (" Iterator iterator = mRefChar.keySet ().iterator ();"); System.out.println (" while (iterator.hasNext ())"); *************** *** 496,523 **** System.out.println (" public static char convertToChar (String string)"); System.out.println (" {"); - System.out.println (" int length;"); System.out.println (" Character item;"); System.out.println (" char ret;"); System.out.println (); System.out.println (" ret = 0;"); System.out.println (); ! System.out.println (" length = string.length ();"); ! System.out.println (" if (0 < length)"); System.out.println (" {"); System.out.println (" if ('&' == string.charAt (0))"); System.out.println (" {"); ! System.out.println (" string = string.substring (1);"); ! System.out.println (" length--;"); ! System.out.println (" }"); ! System.out.println (" if (0 < length)"); ! System.out.println (" {"); ! System.out.println (" if (';' == string.charAt (length - 1))"); ! System.out.println (" string = string.substring (0, --length);"); ! System.out.println (" if (0 < length)"); System.out.println (" {"); ! System.out.println (" if ('#' == string.charAt (0))"); System.out.println (" try"); System.out.println (" {"); ! 
System.out.println (" ret = (char)Integer.parseInt (string.substring (1));"); System.out.println (" }"); System.out.println (" catch (NumberFormatException nfe)"); --- 542,568 ---- System.out.println (" public static char convertToChar (String string)"); System.out.println (" {"); System.out.println (" Character item;"); + System.out.println (" int start;"); + System.out.println (" int end;"); System.out.println (" char ret;"); System.out.println (); System.out.println (" ret = 0;"); System.out.println (); ! System.out.println (" start = 0;"); ! System.out.println (" end = string.length ();"); ! System.out.println (" if (0 < end)"); System.out.println (" {"); System.out.println (" if ('&' == string.charAt (0))"); + System.out.println (" start++;"); + System.out.println (" if (0 < end)"); System.out.println (" {"); ! System.out.println (" if (';' == string.charAt (end - 1))"); ! System.out.println (" --end;"); ! System.out.println (" if (0 < end)"); System.out.println (" {"); ! System.out.println (" if ('#' == string.charAt (start))"); System.out.println (" try"); System.out.println (" {"); ! System.out.println (" ret = (char)Integer.parseInt (string.substring (start + 1, end));"); System.out.println (" }"); System.out.println (" catch (NumberFormatException nfe)"); *************** *** 527,531 **** System.out.println (" else"); System.out.println (" {"); ! System.out.println (" item = (Character)refChar.get (string);"); System.out.println (" if (null != item)"); System.out.println (" ret = item.charValue ();"); --- 572,576 ---- System.out.println (" else"); System.out.println (" {"); ! 
System.out.println (" item = (Character)mRefChar.get (string.substring (start,end));"); System.out.println (" if (null != item)"); System.out.println (" ret = item.charValue ();"); *************** *** 536,539 **** --- 581,588 ---- System.out.println (); System.out.println (" return (ret);"); + System.out.println (" }"); + System.out.println (); + System.out.println (" public static String decode (StringBuffer stringBuffer) {"); + System.out.println (" return decode(stringBuffer.toString());"); System.out.println (" }"); System.out.println (); Index: ParserUtils.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/ParserUtils.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** ParserUtils.java 9 Nov 2003 17:07:17 -0000 1.35 --- ParserUtils.java 7 Dec 2003 23:41:43 -0000 1.36 *************** *** 39,75 **** import org.htmlparser.tags.Tag; ! public class ParserUtils { ! ! public static String toString(Tag tag) { ! String tagName = tag.getRawTagName (); ! Hashtable attrs = tag.getAttributes(); ! ! StringBuffer lString = new StringBuffer(tagName); ! lString.append(" TAG\n"); ! lString.append("--------\n"); ! ! for (Enumeration e = attrs.keys(); e.hasMoreElements();) { ! String key = (String) e.nextElement(); ! String value = (String) attrs.get(key); ! if (!key.equalsIgnoreCase(SpecialHashtable.TAGNAME) && value.length() > 0) ! lString.append(key).append(" : ").append(value).append("\n"); ! } ! ! return lString.toString(); ! } ! ! public static Map adjustScanners(Parser parser) { ! Map tempScanners = new Hashtable(); ! tempScanners = parser.getScanners(); ! // Remove all existing scanners ! parser.flushScanners(); ! return tempScanners; ! } ! ! public static void restoreScanners(Parser parser, Map tempScanners) { ! // Flush the scanners ! parser.setScanners(tempScanners); ! } ! 
public static String removeChars(String s, char occur) { StringBuffer newString = new StringBuffer(); --- 39,44 ---- import org.htmlparser.tags.Tag; ! public class ParserUtils ! { public static String removeChars(String s, char occur) { StringBuffer newString = new StringBuffer(); *************** *** 88,97 **** inputString = ParserUtils.removeChars(inputString, '\t'); return inputString; - } - - public static String removeLeadingBlanks(String plainText) { - while (plainText.indexOf(' ') == 0) - plainText = plainText.substring(1); - return plainText; } --- 57,60 ---- Index: Translate.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/Translate.java,v retrieving revision 1.39 retrieving revision 1.40 diff -C2 -d -r1.39 -r1.40 *** Translate.java 9 Nov 2003 17:07:17 -0000 1.39 --- Translate.java 7 Dec 2003 23:41:43 -0000 1.40 *************** *** 54,63 **** * <p><code>String</code>-><code>Character</code> */ ! protected static Map refChar; static { ! refChar = new HashMap(1000); ! // Portions © International Organization for Standardization 1986 // Permission to copy in any form is granted for use with // conforming SGML systems and applications as defined in --- 54,63 ---- * <p><code>String</code>-><code>Character</code> */ ! protected static Map mRefChar; static { ! mRefChar = new HashMap (1000); ! // Portions \u00a9 International Organization for Standardization 1986 // Permission to copy in any form is granted for use with // conforming SGML systems and applications as defined in *************** *** 67,166 **** // "-//W3C//ENTITIES Latin 1//EN//HTML"> // %HTMLlat1; ! refChar.put ("nbsp", new Character ('\u00a0')); // no-break space = non-breaking space, U+00A0 ISOnum ! refChar.put ("iexcl", new Character ('\u00a1')); // inverted exclamation mark, U+00A1 ISOnum ! refChar.put ("cent", new Character ('\u00a2')); // cent sign, U+00A2 ISOnum ! 
refChar.put ("pound", new Character ('\u00a3')); // pound sign, U+00A3 ISOnum ! refChar.put ("curren", new Character ('\u00a4')); // currency sign, U+00A4 ISOnum ! refChar.put ("yen", new Character ('\u00a5')); // yen sign = yuan sign, U+00A5 ISOnum ! refChar.put ("brvbar", new Character ('\u00a6')); // broken bar = broken vertical bar, U+00A6 ISOnum ! refChar.put ("sect", new Character ('\u00a7')); // section sign, U+00A7 ISOnum ! refChar.put ("uml", new Character ('\u00a8')); // diaeresis = spacing diaeresis, U+00A8 ISOdia ! refChar.put ("copy", new Character ('\u00a9')); // copyright sign, U+00A9 ISOnum ! refChar.put ("ordf", new Character ('\u00aa')); // feminine ordinal indicator, U+00AA ISOnum ! refChar.put ("laquo", new Character ('\u00ab')); // left-pointing double angle quotation mark = left pointing guillemet, U+00AB ISOnum ! refChar.put ("not", new Character ('\u00ac')); // not sign, U+00AC ISOnum ! refChar.put ("shy", new Character ('\u00ad')); // soft hyphen = discretionary hyphen, U+00AD ISOnum ! refChar.put ("reg", new Character ('\u00ae')); // registered sign = registered trade mark sign, U+00AE ISOnum ! refChar.put ("macr", new Character ('\u00af')); // macron = spacing macron = overline = APL overbar, U+00AF ISOdia ! refChar.put ("deg", new Character ('\u00b0')); // degree sign, U+00B0 ISOnum ! refChar.put ("plusmn", new Character ('\u00b1')); // plus-minus sign = plus-or-minus sign, U+00B1 ISOnum ! refChar.put ("sup2", new Character ('\u00b2')); // superscript two = superscript digit two = squared, U+00B2 ISOnum ! refChar.put ("sup3", new Character ('\u00b3')); // superscript three = superscript digit three = cubed, U+00B3 ISOnum ! refChar.put ("acute", new Character ('\u00b4')); // acute accent = spacing acute, U+00B4 ISOdia ! refChar.put ("micro", new Character ('\u00b5')); // micro sign, U+00B5 ISOnum ! refChar.put ("para", new Character ('\u00b6')); // pilcrow sign = paragraph sign, U+00B6 ISOnum ! 
refChar.put ("middot", new Character ('\u00b7')); // middle dot = Georgian comma = Greek middle dot, U+00B7 ISOnum ! refChar.put ("cedil", new Character ('\u00b8')); // cedilla = spacing cedilla, U+00B8 ISOdia ! refChar.put ("sup1", new Character ('\u00b9')); // superscript one = superscript digit one, U+00B9 ISOnum ! refChar.put ("ordm", new Character ('\u00ba')); // masculine ordinal indicator, U+00BA ISOnum ! refChar.put ("raquo", new Character ('\u00bb')); // right-pointing double angle quotation mark = right pointing guillemet, U+00BB ISOnum ! refChar.put ("frac14", new Character ('\u00bc')); // vulgar fraction one quarter = fraction one quarter, U+00BC ISOnum ! refChar.put ("frac12", new Character ('\u00bd')); // vulgar fraction one half = fraction one half, U+00BD ISOnum ! refChar.put ("frac34", new Character ('\u00be')); // vulgar fraction three quarters = fraction three quarters, U+00BE ISOnum ! refChar.put ("iquest", new Character ('\u00bf')); // inverted question mark = turned question mark, U+00BF ISOnum ! refChar.put ("Agrave", new Character ('\u00c0')); // latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1 ! refChar.put ("Aacute", new Character ('\u00c1')); // latin capital letter A with acute, U+00C1 ISOlat1 ! refChar.put ("Acirc", new Character ('\u00c2')); // latin capital letter A with circumflex, U+00C2 ISOlat1 ! refChar.put ("Atilde", new Character ('\u00c3')); // latin capital letter A with tilde, U+00C3 ISOlat1 ! refChar.put ("Auml", new Character ('\u00c4')); // latin capital letter A with diaeresis, U+00C4 ISOlat1 ! refChar.put ("Aring", new Character ('\u00c5')); // latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1 ! refChar.put ("AElig", new Character ('\u00c6')); // latin capital letter AE = latin capital ligature AE, U+00C6 ISOlat1 ! refChar.put ("Ccedil", new Character ('\u00c7')); // latin capital letter C with cedilla, U+00C7 ISOlat1 ! 
refChar.put ("Egrave", new Character ('\u00c8')); // latin capital letter E with grave, U+00C8 ISOlat1 ! refChar.put ("Eacute", new Character ('\u00c9')); // latin capital letter E with acute, U+00C9 ISOlat1 ! refChar.put ("Ecirc", new Character ('\u00ca')); // latin capital letter E with circumflex, U+00CA ISOlat1 ! refChar.put ("Euml", new Character ('\u00cb')); // latin capital letter E with diaeresis, U+00CB ISOlat1 ! refChar.put ("Igrave", new Character ('\u00cc')); // latin capital letter I with grave, U+00CC ISOlat1 ! refChar.put ("Iacute", new Character ('\u00cd')); // latin capital letter I with acute, U+00CD ISOlat1 ! refChar.put ("Icirc", new Character ('\u00ce')); // latin capital letter I with circumflex, U+00CE ISOlat1 ! refChar.put ("Iuml", new Character ('\u00cf')); // latin capital letter I with diaeresis, U+00CF ISOlat1 ! refChar.put ("ETH", new Character ('\u00d0')); // latin capital letter ETH, U+00D0 ISOlat1 ! refChar.put ("Ntilde", new Character ('\u00d1')); // latin capital letter N with tilde, U+00D1 ISOlat1 ! refChar.put ("Ograve", new Character ('\u00d2')); // latin capital letter O with grave, U+00D2 ISOlat1 ! refChar.put ("Oacute", new Character ('\u00d3')); // latin capital letter O with acute, U+00D3 ISOlat1 ! refChar.put ("Ocirc", new Character ('\u00d4')); // latin capital letter O with circumflex, U+00D4 ISOlat1 ! refChar.put ("Otilde", new Character ('\u00d5')); // latin capital letter O with tilde, U+00D5 ISOlat1 ! refChar.put ("Ouml", new Character ('\u00d6')); // latin capital letter O with diaeresis, U+00D6 ISOlat1 ! refChar.put ("times", new Character ('\u00d7')); // multiplication sign, U+00D7 ISOnum ! refChar.put ("Oslash", new Character ('\u00d8')); // latin capital letter O with stroke = latin capital letter O slash, U+00D8 ISOlat1 ! refChar.put ("Ugrave", new Character ('\u00d9')); // latin capital letter U with grave, U+00D9 ISOlat1 ! 
refChar.put ("Uacute", new Character ('\u00da')); // latin capital letter U with acute, U+00DA ISOlat1 ! refChar.put ("Ucirc", new Character ('\u00db')); // latin capital letter U with circumflex, U+00DB ISOlat1 ! refChar.put ("Uuml", new Character ('\u00dc')); // latin capital letter U with diaeresis, U+00DC ISOlat1 ! refChar.put ("Yacute", new Character ('\u00dd')); // latin capital letter Y with acute, U+00DD ISOlat1 ! refChar.put ("THORN", new Character ('\u00de')); // latin capital letter THORN, U+00DE ISOlat1 ! refChar.put ("szlig", new Character ('\u00df')); // latin small letter sharp s = ess-zed, U+00DF ISOlat1 ! refChar.put ("agrave", new Character ('\u00e0')); // latin small letter a with grave = latin small letter a grave, U+00E0 ISOlat1 ! refChar.put ("aacute", new Character ('\u00e1')); // latin small letter a with acute, U+00E1 ISOlat1 ! refChar.put ("acirc", new Character ('\u00e2')); // latin small letter a with circumflex, U+00E2 ISOlat1 ! refChar.put ("atilde", new Character ('\u00e3')); // latin small letter a with tilde, U+00E3 ISOlat1 ! refChar.put ("auml", new Character ('\u00e4')); // latin small letter a with diaeresis, U+00E4 ISOlat1 ! refChar.put ("aring", new Character ('\u00e5')); // latin small letter a with ring above = latin small letter a ring, U+00E5 ISOlat1 ! refChar.put ("aelig", new Character ('\u00e6')); // latin small letter ae = latin small ligature ae, U+00E6 ISOlat1 ! refChar.put ("ccedil", new Character ('\u00e7')); // latin small letter c with cedilla, U+00E7 ISOlat1 ! refChar.put ("egrave", new Character ('\u00e8')); // latin small letter e with grave, U+00E8 ISOlat1 ! refChar.put ("eacute", new Character ('\u00e9')); // latin small letter e with acute, U+00E9 ISOlat1 ! refChar.put ("ecirc", new Character ('\u00ea')); // latin small letter e with circumflex, U+00EA ISOlat1 ! refChar.put ("euml", new Character ('\u00eb')); // latin small letter e with diaeresis, U+00EB ISOlat1 ! 
refChar.put ("igrave", new Character ('\u00ec')); // latin small letter i with grave, U+00EC ISOlat1 ! refChar.put ("iacute", new Character ('\u00ed')); // latin small letter i with acute, U+00ED ISOlat1 ! refChar.put ("icirc", new Character ('\u00ee')); // latin small letter i with circumflex, U+00EE ISOlat1 ! refChar.put ("iuml", new Character ('\u00ef')); // latin small letter i with diaeresis, U+00EF ISOlat1 ! refChar.put ("eth", new Character ('\u00f0')); // latin small letter eth, U+00F0 ISOlat1 ! refChar.put ("ntilde", new Character ('\u00f1')); // latin small letter n with tilde, U+00F1 ISOlat1 ! refChar.put ("ograve", new Character ('\u00f2')); // latin small letter o with grave, U+00F2 ISOlat1 ! refChar.put ("oacute", new Character ('\u00f3')); // latin small letter o with acute, U+00F3 ISOlat1 ! refChar.put ("ocirc", new Character ('\u00f4')); // latin small letter o with circumflex, U+00F4 ISOlat1 ! refChar.put ("otilde", new Character ('\u00f5')); // latin small letter o with tilde, U+00F5 ISOlat1 ! refChar.put ("ouml", new Character ('\u00f6')); // latin small letter o with diaeresis, U+00F6 ISOlat1 ! refChar.put ("divide", new Character ('\u00f7')); // division sign, U+00F7 ISOnum ! refChar.put ("oslash", new Character ('\u00f8')); // latin small letter o with stroke, = latin small letter o slash, U+00F8 ISOlat1 ! refChar.put ("ugrave", new Character ('\u00f9')); // latin small letter u with grave, U+00F9 ISOlat1 ! refChar.put ("uacute", new Character ('\u00fa')); // latin small letter u with acute, U+00FA ISOlat1 ! refChar.put ("ucirc", new Character ('\u00fb')); // latin small letter u with circumflex, U+00FB ISOlat1 ! refChar.put ("uuml", new Character ('\u00fc')); // latin small letter u with diaeresis, U+00FC ISOlat1 ! refChar.put ("yacute", new Character ('\u00fd')); // latin small letter y with acute, U+00FD ISOlat1 ! refChar.put ("thorn", new Character ('\u00fe')); // latin small letter thorn, U+00FE ISOlat1 ! 
refChar.put ("yuml", new Character ('\u00ff')); // latin small letter y with diaeresis, U+00FF ISOlat1 // Mathematical, Greek and Symbolic characters for HTML // Character entity set. Typical invocation: --- 67,166 ---- // "-//W3C//ENTITIES Latin 1//EN//HTML"> // %HTMLlat1; ! mRefChar.put ("nbsp", new Character ('\u00a0')); // no-break space = non-breaking space, U+00A0 ISOnum ! mRefChar.put ("iexcl", new Character ('\u00a1')); // inverted exclamation mark, U+00A1 ISOnum ! mRefChar.put ("cent", new Character ('\u00a2')); // cent sign, U+00A2 ISOnum ! mRefChar.put ("pound", new Character ('\u00a3')); // pound sign, U+00A3 ISOnum ! mRefChar.put ("curren", new Character ('\u00a4')); // currency sign, U+00A4 ISOnum ! mRefChar.put ("yen", new Character ('\u00a5')); // yen sign = yuan sign, U+00A5 ISOnum ! mRefChar.put ("brvbar", new Character ('\u00a6')); // broken bar = broken vertical bar, U+00A6 ISOnum ! mRefChar.put ("sect", new Character ('\u00a7')); // section sign, U+00A7 ISOnum ! mRefChar.put ("uml", new Character ('\u00a8')); // diaeresis = spacing diaeresis, U+00A8 ISOdia ! mRefChar.put ("copy", new Character ('\u00a9')); // copyright sign, U+00A9 ISOnum ! mRefChar.put ("ordf", new Character ('\u00aa')); // feminine ordinal indicator, U+00AA ISOnum ! mRefChar.put ("laquo", new Character ('\u00ab')); // left-pointing double angle quotation mark = left pointing guillemet, U+00AB ISOnum ! mRefChar.put ("not", new Character ('\u00ac')); // not sign, U+00AC ISOnum ! mRefChar.put ("shy", new Character ('\u00ad')); // soft hyphen = discretionary hyphen, U+00AD ISOnum ! mRefChar.put ("reg", new Character ('\u00ae')); // registered sign = registered trade mark sign, U+00AE ISOnum ! mRefChar.put ("macr", new Character ('\u00af')); // macron = spacing macron = overline = APL overbar, U+00AF ISOdia ! mRefChar.put ("deg", new Character ('\u00b0')); // degree sign, U+00B0 ISOnum ! 
mRefChar.put ("plusmn", new Character ('\u00b1')); // plus-minus sign = plus-or-minus sign, U+00B1 ISOnum ! mRefChar.put ("sup2", new Character ('\u00b2')); // superscript two = superscript digit two = squared, U+00B2 ISOnum ! mRefChar.put ("sup3", new Character ('\u00b3')); // superscript three = superscript digit three = cubed, U+00B3 ISOnum ! mRefChar.put ("acute", new Character ('\u00b4')); // acute accent = spacing acute, U+00B4 ISOdia ! mRefChar.put ("micro", new Character ('\u00b5')); // micro sign, U+00B5 ISOnum ! mRefChar.put ("para", new Character ('\u00b6')); // pilcrow sign = paragraph sign, U+00B6 ISOnum ! mRefChar.put ("middot", new Character ('\u00b7')); // middle dot = Georgian comma = Greek middle dot, U+00B7 ISOnum ! mRefChar.put ("cedil", new Character ('\u00b8')); // cedilla = spacing cedilla, U+00B8 ISOdia ! mRefChar.put ("sup1", new Character ('\u00b9')); // superscript one = superscript digit one, U+00B9 ISOnum ! mRefChar.put ("ordm", new Character ('\u00ba')); // masculine ordinal indicator, U+00BA ISOnum ! mRefChar.put ("raquo", new Character ('\u00bb')); // right-pointing double angle quotation mark = right pointing guillemet, U+00BB ISOnum ! mRefChar.put ("frac14", new Character ('\u00bc')); // vulgar fraction one quarter = fraction one quarter, U+00BC ISOnum ! mRefChar.put ("frac12", new Character ('\u00bd')); // vulgar fraction one half = fraction one half, U+00BD ISOnum ! mRefChar.put ("frac34", new Character ('\u00be')); // vulgar fraction three quarters = fraction three quarters, U+00BE ISOnum ! mRefChar.put ("iquest", new Character ('\u00bf')); // inverted question mark = turned question mark, U+00BF ISOnum ! mRefChar.put ("Agrave", new Character ('\u00c0')); // latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1 ! mRefChar.put ("Aacute", new Character ('\u00c1')); // latin capital letter A with acute, U+00C1 ISOlat1 ! 
mRefChar.put ("Acirc", new Character ('\u00c2')); // latin capital letter A with circumflex, U+00C2 ISOlat1 ! mRefChar.put ("Atilde", new Character ('\u00c3')); // latin capital letter A with tilde, U+00C3 ISOlat1 ! mRefChar.put ("Auml", new Character ('\u00c4')); // latin capital letter A with diaeresis, U+00C4 ISOlat1 ! mRefChar.put ("Aring", new Character ('\u00c5')); // latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1 ! mRefChar.put ("AElig", new Character ('\u00c6')); // latin capital letter AE = latin capital ligature AE, U+00C6 ISOlat1 ! mRefChar.put ("Ccedil", new Character ('\u00c7')); // latin capital letter C with cedilla, U+00C7 ISOlat1 ! mRefChar.put ("Egrave", new Character ('\u00c8')); // latin capital letter E with grave, U+00C8 ISOlat1 ! mRefChar.put ("Eacute", new Character ('\u00c9')); // latin capital letter E with acute, U+00C9 ISOlat1 ! mRefChar.put ("Ecirc", new Character ('\u00ca')); // latin capital letter E with circumflex, U+00CA ISOlat1 ! mRefChar.put ("Euml", new Character ('\u00cb')); // latin capital letter E with diaeresis, U+00CB ISOlat1 ! mRefChar.put ("Igrave", new Character ('\u00cc')); // latin capital letter I with grave, U+00CC ISOlat1 ! mRefChar.put ("Iacute", new Character ('\u00cd')); // latin capital letter I with acute, U+00CD ISOlat1 ! mRefChar.put ("Icirc", new Character ('\u00ce')); // latin capital letter I with circumflex, U+00CE ISOlat1 ! mRefChar.put ("Iuml", new Character ('\u00cf')); // latin capital letter I with diaeresis, U+00CF ISOlat1 ! mRefChar.put ("ETH", new Character ('\u00d0')); // latin capital letter ETH, U+00D0 ISOlat1 ! mRefChar.put ("Ntilde", new Character ('\u00d1')); // latin capital letter N with tilde, U+00D1 ISOlat1 ! mRefChar.put ("Ograve", new Character ('\u00d2')); // latin capital letter O with grave, U+00D2 ISOlat1 ! mRefChar.put ("Oacute", new Character ('\u00d3')); // latin capital letter O with acute, U+00D3 ISOlat1 ! 
mRefChar.put ("Ocirc", new Character ('\u00d4')); // latin capital letter O with circumflex, U+00D4 ISOlat1 ! mRefChar.put ("Otilde", new Character ('\u00d5')); // latin capital letter O with tilde, U+00D5 ISOlat1 ! mRefChar.put ("Ouml", new Character ('\u00d6')); // latin capital letter O with diaeresis, U+00D6 ISOlat1 ! mRefChar.put ("times", new Character ('\u00d7')); // multiplication sign, U+00D7 ISOnum ! mRefChar.put ("Oslash", new Character ('\u00d8')); // latin capital letter O with stroke = latin capital letter O slash, U+00D8 ISOlat1 ! mRefChar.put ("Ugrave", new Character ('\u00d9')); // latin capital letter U with grave, U+00D9 ISOlat1 ! mRefChar.put ("Uacute", new Character ('\u00da')); // latin capital letter U with acute, U+00DA ISOlat1 ! mRefChar.put ("Ucirc", new Character ('\u00db')); // latin capital letter U with circumflex, U+00DB ISOlat1 ! mRefChar.put ("Uuml", new Character ('\u00dc')); // latin capital letter U with diaeresis, U+00DC ISOlat1 ! mRefChar.put ("Yacute", new Character ('\u00dd')); // latin capital letter Y with acute, U+00DD ISOlat1 ! mRefChar.put ("THORN", new Character ('\u00de')); // latin capital letter THORN, U+00DE ISOlat1 ! mRefChar.put ("szlig", new Character ('\u00df')); // latin small letter sharp s = ess-zed, U+00DF ISOlat1 ! mRefChar.put ("agrave", new Character ('\u00e0')); // latin small letter a with grave = latin small letter a grave, U+00E0 ISOlat1 ! mRefChar.put ("aacute", new Character ('\u00e1')); // latin small letter a with acute, U+00E1 ISOlat1 ! mRefChar.put ("acirc", new Character ('\u00e2')); // latin small letter a with circumflex, U+00E2 ISOlat1 ! mRefChar.put ("atilde", new Character ('\u00e3')); // latin small letter a with tilde, U+00E3 ISOlat1 ! mRefChar.put ("auml", new Character ('\u00e4')); // latin small letter a with diaeresis, U+00E4 ISOlat1 ! mRefChar.put ("aring", new Character ('\u00e5')); // latin small letter a with ring above = latin small letter a ring, U+00E5 ISOlat1 ! 
mRefChar.put ("aelig", new Character ('\u00e6')); // latin small letter ae = latin small ligature ae, U+00E6 ISOlat1 ! mRefChar.put ("ccedil", new Character ('\u00e7')); // latin small letter c with cedilla, U+00E7 ISOlat1 ! mRefChar.put ("egrave", new Character ('\u00e8')); // latin small letter e with grave, U+00E8 ISOlat1 ! mRefChar.put ("eacute", new Character ('\u00e9')); // latin small letter e with acute, U+00E9 ISOlat1 ! mRefChar.put ("ecirc", new Character ('\u00ea')); // latin small letter e with circumflex, U+00EA ISOlat1 ! mRefChar.put ("euml", new Character ('\u00eb')); // latin small letter e with diaeresis, U+00EB ISOlat1 ! mRefChar.put ("igrave", new Character ('\u00ec')); // latin small letter i with grave, U+00EC ISOlat1 ! mRefChar.put ("iacute", new Character ('\u00ed')); // latin small letter i with acute, U+00ED ISOlat1 ! mRefChar.put ("icirc", new Character ('\u00ee')); // latin small letter i with circumflex, U+00EE ISOlat1 ! mRefChar.put ("iuml", new Character ('\u00ef')); // latin small letter i with diaeresis, U+00EF ISOlat1 ! mRefChar.put ("eth", new Character ('\u00f0')); // latin small letter eth, U+00F0 ISOlat1 ! mRefChar.put ("ntilde", new Character ('\u00f1')); // latin small letter n with tilde, U+00F1 ISOlat1 ! mRefChar.put ("ograve", new Character ('\u00f2')); // latin small letter o with grave, U+00F2 ISOlat1 ! mRefChar.put ("oacute", new Character ('\u00f3')); // latin small letter o with acute, U+00F3 ISOlat1 ! mRefChar.put ("ocirc", new Character ('\u00f4')); // latin small letter o with circumflex, U+00F4 ISOlat1 ! mRefChar.put ("otilde", new Character ('\u00f5')); // latin small letter o with tilde, U+00F5 ISOlat1 ! mRefChar.put ("ouml", new Character ('\u00f6')); // latin small letter o with diaeresis, U+00F6 ISOlat1 ! mRefChar.put ("divide", new Character ('\u00f7')); // division sign, U+00F7 ISOnum ! 
mRefChar.put ("oslash", new Character ('\u00f8')); // latin small letter o with stroke, = latin small letter o slash, U+00F8 ISOlat1 ! mRefChar.put ("ugrave", new Character ('\u00f9')); // latin small letter u with grave, U+00F9 ISOlat1 ! mRefChar.put ("uacute", new Character ('\u00fa')); // latin small letter u with acute, U+00FA ISOlat1 ! mRefChar.put ("ucirc", new Character ('\u00fb')); // latin small letter u with circumflex, U+00FB ISOlat1 ! mRefChar.put ("uuml", new Character ('\u00fc')); // latin small letter u with diaeresis, U+00FC ISOlat1 ! mRefChar.put ("yacute", new Character ('\u00fd')); // latin small letter y with acute, U+00FD ISOlat1 ! mRefChar.put ("thorn", new Character ('\u00fe')); // latin small letter thorn, U+00FE ISOlat1 ! mRefChar.put ("yuml", new Character ('\u00ff')); // latin small letter y with diaeresis, U+00FF ISOlat1 // Mathematical, Greek and Symbolic characters for HTML // Character entity set. Typical invocation: *************** *** 168,172 **** // "-//W3C//ENTITIES Symbols//EN//HTML"> // %HTMLsymbol; ! // Portions � International Organization for Standardization 1986: // Permission to copy in any form is granted for use with // conforming SGML systems and applications as defined in --- 168,172 ---- // "-//W3C//ENTITIES Symbols//EN//HTML"> // %HTMLsymbol; ! // Portions \u00a9 International Organization for Standardization 1986: // Permission to copy in any form is granted for use with // conforming SGML systems and applications as defined in *************** *** 179,340 **** // character set. Names are ISO 10646 names. // Latin Extended-B ! refChar.put ("fnof", new Character ('\u0192')); // latin small f with hook = function = florin, U+0192 ISOtech // Greek ! refChar.put ("Alpha", new Character ('\u0391')); // greek capital letter alpha, U+0391 ! refChar.put ("Beta", new Character ('\u0392')); // greek capital letter beta, U+0392 ! refChar.put ("Gamma", new Character ('\u0393')); // greek capital letter gamma, U+0393 ISOgrk3 ! 
refChar.put ("Delta", new Character ('\u0394')); // greek capital letter delta, U+0394 ISOgrk3 ! refChar.put ("Epsilon", new Character ('\u0395')); // greek capital letter epsilon, U+0395 ! refChar.put ("Zeta", new Character ('\u0396')); // greek capital letter zeta, U+0396 ! refChar.put ("Eta", new Character ('\u0397')); // greek capital letter eta, U+0397 ! refChar.put ("Theta", new Character ('\u0398')); // greek capital letter theta, U+0398 ISOgrk3 ! refChar.put ("Iota", new Character ('\u0399')); // greek capital letter iota, U+0399 ! refChar.put ("Kappa", new Character ('\u039a')); // greek capital letter kappa, U+039A ! refChar.put ("Lambda", new Character ('\u039b')); // greek capital letter lambda, U+039B ISOgrk3 ! refChar.put ("Mu", new Character ('\u039c')); // greek capital letter mu, U+039C ! refChar.put ("Nu", new Character ('\u039d')); // greek capital letter nu, U+039D ! refChar.put ("Xi", new Character ('\u039e')); // greek capital letter xi, U+039E ISOgrk3 ! refChar.put ("Omicron", new Character ('\u039f')); // greek capital letter omicron, U+039F ! refChar.put ("Pi", new Character ('\u03a0')); // greek capital letter pi, U+03A0 ISOgrk3 ! refChar.put ("Rho", new Character ('\u03a1')); // greek capital letter rho, U+03A1 // there is no Sigmaf, and no U+03A2 character either ! refChar.put ("Sigma", new Character ('\u03a3')); // greek capital letter sigma, U+03A3 ISOgrk3 ! refChar.put ("Tau", new Character ('\u03a4')); // greek capital letter tau, U+03A4 ! refChar.put ("Upsilon", new Character ('\u03a5')); // greek capital letter upsilon, U+03A5 ISOgrk3 ! refChar.put ("Phi", new Character ('\u03a6')); // greek capital letter phi, U+03A6 ISOgrk3 ! refChar.put ("Chi", new Character ('\u03a7')); // greek capital letter chi, U+03A7 ! refChar.put ("Psi", new Character ('\u03a8')); // greek capital letter psi, U+03A8 ISOgrk3 ! refChar.put ("Omega", new Character ('\u03a9')); // greek capital letter omega, U+03A9 ISOgrk3 ! 
refChar.put ("alpha", new Character ('\u03b1')); // greek small letter alpha, U+03B1 ISOgrk3 ! refChar.put ("beta", new Character ('\u03b2')); // greek small letter beta, U+03B2 ISOgrk3 ! refChar.put ("gamma", new Character ('\u03b3')); // greek small letter gamma, U+03B3 ISOgrk3 ! refChar.put ("delta", new Character ('\u03b4')); // greek small letter delta, U+03B4 ISOgrk3 ! refChar.put ("epsilon", new Character ('\u03b5')); // greek small letter epsilon, U+03B5 ISOgrk3 ! refChar.put ("zeta", new Character ('\u03b6')); // greek small letter zeta, U+03B6 ISOgrk3 ! refChar.put ("eta", new Character ('\u03b7')); // greek small letter eta, U+03B7 ISOgrk3 ! refChar.put ("theta", new Character ('\u03b8')); // greek small letter theta, U+03B8 ISOgrk3 ! refChar.put ("iota", new Character ('\u03b9')); // greek small letter iota, U+03B9 ISOgrk3 ! refChar.put ("kappa", new Character ('\u03ba')); // greek small letter kappa, U+03BA ISOgrk3 ! refChar.put ("lambda", new Character ('\u03bb')); // greek small letter lambda, U+03BB ISOgrk3 ! refChar.put ("mu", new Character ('\u03bc')); // greek small letter mu, U+03BC ISOgrk3 ! refChar.put ("nu", new Character ('\u03bd')); // greek small letter nu, U+03BD ISOgrk3 ! refChar.put ("xi", new Character ('\u03be')); // greek small letter xi, U+03BE ISOgrk3 ! refChar.put ("omicron", new Character ('\u03bf')); // greek small letter omicron, U+03BF NEW ! refChar.put ("pi", new Character ('\u03c0')); // greek small letter pi, U+03C0 ISOgrk3 ! refChar.put ("rho", new Character ('\u03c1')); // greek small letter rho, U+03C1 ISOgrk3 ! refChar.put ("sigmaf", new Character ('\u03c2')); // greek small letter final sigma, U+03C2 ISOgrk3 ! refChar.put ("sigma", new Character ('\u03c3')); // greek small letter sigma, U+03C3 ISOgrk3 ! refChar.put ("tau", new Character ('\u03c4')); // greek small letter tau, U+03C4 ISOgrk3 ! refChar.put ("upsilon", new Character ('\u03c5')); // greek small letter upsilon, U+03C5 ISOgrk3 ! 
refChar.put ("phi", new Character ('\u03c6')); // greek small letter phi, U+03C6 ISOgrk3 ! refChar.put ("chi", new Character ('\u03c7')); // greek small letter chi, U+03C7 ISOgrk3 ! refChar.put ("psi", new Character ('\u03c8')); // greek small letter psi, U+03C8 ISOgrk3 ! refChar.put ("omega", new Character ('\u03c9')); // greek small letter omega, U+03C9 ISOgrk3 ! refChar.put ("thetasym", new Character ('\u03d1')); // greek small letter theta symbol, U+03D1 NEW ! refChar.put ("upsih", new Character ('\u03d2')); // greek upsilon with hook symbol, U+03D2 NEW ! refChar.put ("piv", new Character ('\u03d6')); // greek pi symbol, U+03D6 ISOgrk3 // General Punctuation ! refChar.put ("bull", new Character ('\u2022')); // bullet = black small circle, U+2022 ISOpub // bullet is NOT the same as bullet operator, U+2219 ! refChar.put ("hellip", new Character ('\u2026')); // horizontal ellipsis = three dot leader, U+2026 ISOpub ! refChar.put ("prime", new Character ('\u2032')); // prime = minutes = feet, U+2032 ISOtech ! refChar.put ("Prime", new Character ('\u2033')); // double prime = seconds = inches, U+2033 ISOtech ! refChar.put ("oline", new Character ('\u203e')); // overline = spacing overscore, U+203E NEW ! refChar.put ("frasl", new Character ('\u2044')); // fraction slash, U+2044 NEW // Letterlike Symbols ! refChar.put ("weierp", new Character ('\u2118')); // script capital P = power set = Weierstrass p, U+2118 ISOamso ! refChar.put ("image", new Character ('\u2111')); // blackletter capital I = imaginary part, U+2111 ISOamso ! refChar.put ("real", new Character ('\u211c')); // blackletter capital R = real part symbol, U+211C ISOamso ! refChar.put ("trade", new Character ('\u2122')); // trade mark sign, U+2122 ISOnum ! refChar.put ("alefsym", new Character ('\u2135')); // alef symbol = first transfinite cardinal, U+2135 NEW // alef symbol is NOT the same as hebrew letter alef, // U+05D0 although the same glyph could be used to depict both characters // Arrows ! 
refChar.put ("larr", new Character ('\u2190')); // leftwards arrow, U+2190 ISOnum ! refChar.put ("uarr", new Character ('\u2191')); // upwards arrow, U+2191 ISOnum ! refChar.put ("rarr", new Character ('\u2192')); // rightwards arrow, U+2192 ISOnum ! refChar.put ("darr", new Character ('\u2193')); // downwards arrow, U+2193 ISOnum ! refChar.put ("harr", new Character ('\u2194')); // left right arrow, U+2194 ISOamsa ! refChar.put ("crarr", new Character ('\u21b5')); // downwards arrow with corner leftwards = carriage return, U+21B5 NEW ! refChar.put ("lArr", new Character ('\u21d0')); // leftwards double arrow, U+21D0 ISOtech // ISO 10646 does not say that lArr is the same as the 'is implied by' arrow // but also does not have any other character for that function. So ? lArr can // be used for 'is implied by' as ISOtech suggests ! refChar.put ("uArr", new Character ('\u21d1')); // upwards double arrow, U+21D1 ISOamsa ! refChar.put ("rArr", new Character ('\u21d2')); // rightwards double arrow, U+21D2 ISOtech ! // ISO 10646 does not say this is the 'implies' character but does not have // another character with this function so ? // rArr can be used for 'implies' as ISOtech suggests ! refChar.put ("dArr", new Character ('\u21d3')); // downwards double arrow, U+21D3 ISOamsa ! refChar.put ("hArr", new Character ('\u21d4')); // left right double arrow, U+21D4 ISOamsa // Mathematical Operators ! refChar.put ("forall", new Character ('\u2200')); // for all, U+2200 ISOtech ! refChar.put ("part", new Character ('\u2202')); // partial differential, U+2202 ISOtech ! refChar.put ("exist", new Character ('\u2203')); // there exists, U+2203 ISOtech ! refChar.put ("empty", new Character ('\u2205')); // empty set = null set = diameter, U+2205 ISOamso ! refChar.put ("nabla", new Character ('\u2207')); // nabla = backward difference, U+2207 ISOtech ! refChar.put ("isin", new Character ('\u2208')); // element of, U+2208 ISOtech ! 
refChar.put ("notin", new Character ('\u2209')); // not an element of, U+2209 ISOtech ! refChar.put ("ni", new Character ('\u220b')); // contains as member, U+220B ISOtech // should there be a more memorable name than 'ni'? ! refChar.put ("prod", new Character ('\u220f')); // n-ary product = product sign, U+220F ISOamsb // prod is NOT the same character as U+03A0 'greek capital letter pi' though // the same glyph might be used for both ! refChar.put ("sum", new Character ('\u2211')); // n-ary sumation, U+2211 ISOamsb // sum is NOT the same character as U+03A3 'greek capital letter sigma' // though the same glyph might be used for both ! refChar.put ("minus", new Character ('\u2212')); // minus sign, U+2212 ISOtech ! refChar.put ("lowast", new Character ('\u2217')); // asterisk operator, U+2217 ISOtech ! refChar.put ("radic", new Character ('\u221a')); // square root = radical sign, U+221A ISOtech ! refChar.put ("prop", new Character ('\u221d')); // proportional to, U+221D ISOtech ! refChar.put ("infin", new Character ('\u221e')); // infinity, U+221E ISOtech ! refChar.put ("ang", new Character ('\u2220')); // angle, U+2220 ISOamso ! refChar.put ("and", new Character ('\u2227')); // logical and = wedge, U+2227 ISOtech ! refChar.put ("or", new Character ('\u2228')); // logical or = vee, U+2228 ISOtech ! refChar.put ("cap", new Character ('\u2229')); // intersection = cap, U+2229 ISOtech ! refChar.put ("cup", new Character ('\u222a')); // union = cup, U+222A ISOtech ! refChar.put ("int", new Character ('\u222b')); // integral, U+222B ISOtech ! refChar.put ("there4", new Character ('\u2234')); // therefore, U+2234 ISOtech ! refChar.put ("sim", new Character ('\u223c')); // tilde operator = varies with = similar to, U+223C ISOtech // tilde operator is NOT the same character as the tilde, U+007E, // although the same glyph might be used to represent both ! refChar.put ("cong", new Character ('\u2245')); // approximately equal to, U+2245 ISOtech ! 
refChar.put ("asymp", new Character ('\u2248')); // almost equal to = asymptotic to, U+2248 ISOamsr ! refChar.put ("ne", new Character ('\u2260')); // not equal to, U+2260 ISOtech ! refChar.put ("equiv", new Character ('\u2261')); // identical to, U+2261 ISOtech ! refChar.put ("le", new Character ('\u2264')); // less-than or equal to, U+2264 ISOtech ! refChar.put ("ge", new Character ('\u2265')); // greater-than or equal to, U+2265 ISOtech ! refChar.put ("sub", new Character ('\u2282')); // subset of, U+2282 ISOtech ! refChar.put ("sup", new Character ('\u2283')); // superset of, U+2283 ISOtech ! // note that nsup, 'not a superset of, U+2283' is not covered by the Symbol // font encoding and is not included. Should it be, for symmetry? // It is in ISOamsn ! refChar.put ("nsub", new Character ('\u2284')); // not a subset of, U+2284 ISOamsn ! refChar.put ("sube", new Character ('\u2286')); // subset of or equal to, U+2286 ISOtech ! refChar.put ("supe", new Character ('\u2287')); // superset of or equal to, U+2287 ISOtech ! refChar.put ("oplus", new Character ('\u2295')); // circled plus = direct sum, U+2295 ISOamsb ! refChar.put ("otimes", new Character ('\u2297')); // circled times = vector product, U+2297 ISOamsb ! refChar.put ("perp", new Character ('\u22a5')); // up tack = orthogonal to = perpendicular, U+22A5 ISOtech ! refChar.put ("sdot", new Character ('\u22c5')); // dot operator, U+22C5 ISOamsb // dot operator is NOT the same character as U+00B7 middle dot // Miscellaneous Technical ! refChar.put ("lceil", new Character ('\u2308')); // left ceiling = apl upstile, U+2308 ISOamsc ! refChar.put ("rceil", new Character ('\u2309')); // right ceiling, U+2309 ISOamsc ! refChar.put ("lfloor", new Character ('\u230a')); // left floor = apl downstile, U+230A ISOamsc ! refChar.put ("rfloor", new Character ('\u230b')); // right floor, U+230B ISOamsc ! refChar.put ("lang", new Character ('\u2329')); // left-pointing angle bracket = bra, U+2329 ISOtech ! 
// lang is NOT the same character as U+003C 'less than' // or U+2039 'single left-pointing angle quotation mark' ! refChar.put ("rang", new Character ('\u232a')); // right-pointing angle bracket = ket, U+232A ISOtech ! // rang is NOT the same character as U+003E 'greater than' // or U+203A 'single right-pointing angle quotation mark' // Geometric Shapes ! refChar.put ("loz", new Character ('\u25ca')); // lozenge, U+25CA ISOpub // Miscellaneous Symbols ! refChar.put ("spades", new Character ('\u2660')); // black spade suit, U+2660 ISOpub // black here seems to mean filled as opposed to hollow ! refChar.put ("clubs", new Character ('\u2663')); // black club suit = shamrock, U+2663 ISOpub ! refChar.put ("hearts", new Character ('\u2665')); // black heart suit = valentine, U+2665 ISOpub ! refChar.put ("diams", new Character ('\u2666')); // black diamond suit, U+2666 ISOpub // Special characters for HTML // Character entity set. Typical invocation: --- 179,340 ---- // character set. Names are ISO 10646 names. // Latin Extended-B ! mRefChar.put ("fnof", new Character ('\u0192')); // latin small f with hook = function = florin, U+0192 ISOtech // Greek ! mRefChar.put ("Alpha", new Character ('\u0391')); // greek capital letter alpha, U+0391 ! mRefChar.put ("Beta", new Character ('\u0392')); // greek capital letter beta, U+0392 ! mRefChar.put ("Gamma", new Character ('\u0393')); // greek capital letter gamma, U+0393 ISOgrk3 ! mRefChar.put ("Delta", new Character ('\u0394')); // greek capital letter delta, U+0394 ISOgrk3 ! mRefChar.put ("Epsilon", new Character ('\u0395')); // greek capital letter epsilon, U+0395 ! mRefChar.put ("Zeta", new Character ('\u0396')); // greek capital letter zeta, U+0396 ! mRefChar.put ("Eta", new Character ('\u0397')); // greek capital letter eta, U+0397 ! mRefChar.put ("Theta", new Character ('\u0398')); // greek capital letter theta, U+0398 ISOgrk3 ! mRefChar.put ("Iota", new Character ('\u0399')); // greek capital letter iota, U+0399 ! 
mRefChar.put ("Kappa", new Character ('\u039a')); // greek capital letter kappa, U+039A ! mRefChar.put ("Lambda", new Character ('\u039b')); // greek capital letter lambda, U+039B ISOgrk3 ! mRefChar.put ("Mu", new Character ('\u039c')); // greek capital letter mu, U+039C ! mRefChar.put ("Nu", new Character ('\u039d')); // greek capital letter nu, U+039D ! mRefChar.put ("Xi", new Character ('\u039e')); // greek capital letter xi, U+039E ISOgrk3 ! mRefChar.put ("Omicron", new Character ('\u039f')); // greek capital letter omicron, U+039F ! mRefChar.put ("Pi", new Character ('\u03a0')); // greek capital letter pi, U+03A0 ISOgrk3 ! mRefChar.put ("Rho", new Character ('\u03a1')); // greek capital letter rho, U+03A1 // there is no Sigmaf, and no U+03A2 character either ! mRefChar.put ("Sigma", new Character ('\u03a3')); // greek capital letter sigma, U+03A3 ISOgrk3 ! mRefChar.put ("Tau", new Character ('\u03a4')); // greek capital letter tau, U+03A4 ! mRefChar.put ("Upsilon", new Character ('\u03a5')); // greek capital letter upsilon, U+03A5 ISOgrk3 ! mRefChar.put ("Phi", new Character ('\u03a6')); // greek capital letter phi, U+03A6 ISOgrk3 ! mRefChar.put ("Chi", new Character ('\u03a7')); // greek capital letter chi, U+03A7 ! mRefChar.put ("Psi", new Character ('\u03a8')); // greek capital letter psi, U+03A8 ISOgrk3 ! mRefChar.put ("Omega", new Character ('\u03a9')); // greek capital letter omega, U+03A9 ISOgrk3 ! mRefChar.put ("alpha", new Character ('\u03b1')); // greek small letter alpha, U+03B1 ISOgrk3 ! mRefChar.put ("beta", new Character ('\u03b2')); // greek small letter beta, U+03B2 ISOgrk3 ! mRefChar.put ("gamma", new Character ('\u03b3')); // greek small letter gamma, U+03B3 ISOgrk3 ! mRefChar.put ("delta", new Character ('\u03b4')); // greek small letter delta, U+03B4 ISOgrk3 ! mRefChar.put ("epsilon", new Character ('\u03b5')); // greek small letter epsilon, U+03B5 ISOgrk3 ! 
mRefChar.put ("zeta", new Character ('\u03b6')); // greek small letter zeta, U+03B6 ISOgrk3 ! mRefChar.put ("eta", new Character ('\u03b7')); // greek small letter eta, U+03B7 ISOgrk3 ! mRefChar.put ("theta", new Character ('\u03b8')); // greek small letter theta, U+03B8 ISOgrk3 ! mRefChar.put ("iota", new Character ('\u03b9')); // greek small letter iota, U+03B9 ISOgrk3 ! mRefChar.put ("kappa", new Character ('\u03ba')); // greek small letter kappa, U+03BA ISOgrk3 ! mRefChar.put ("lambda", new Character ('\u03bb')); // greek small letter lambda, U+03... [truncated message content] |
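The table in the diff above maps entity names to single characters. As a rough, self-contained sketch of how such a table can be consulted when decoding text: the class name `EntityTableSketch`, the `decode` method and the handful of entries chosen are assumptions made for illustration, not the library's own translation code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a name-to-character reference table;
// not the library's actual translation class.
class EntityTableSketch
{
    private static final Map<String, Character> REF_CHAR = new HashMap<>();

    static
    {
        // A few entries copied from the table in the diff.
        REF_CHAR.put("fnof", '\u0192');   // latin small f with hook
        REF_CHAR.put("Alpha", '\u0391');  // greek capital letter alpha
        REF_CHAR.put("Omega", '\u03a9');  // greek capital letter omega
        REF_CHAR.put("alpha", '\u03b1');  // greek small letter alpha
        REF_CHAR.put("beta", '\u03b2');   // greek small letter beta
        REF_CHAR.put("spades", '\u2660'); // black spade suit
    }

    /** Replace each known "&name;" reference; leave unknown ones untouched. */
    static String decode(String text)
    {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length())
        {
            char c = text.charAt(i);
            // Look for a terminating semicolon only when we are at an ampersand.
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            Character mapped = (semi > i + 1)
                ? REF_CHAR.get(text.substring(i + 1, semi))
                : null;
            if (mapped != null)
            {
                out.append(mapped.charValue());
                i = semi + 1; // skip past the whole reference
            }
            else
            {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(decode("&Alpha; to &Omega;, &unknown; stays as is"));
    }
}
```

Lookups are case-sensitive, which matters here because the table holds both `Alpha` and `alpha`.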
From: <der...@us...> - 2003-12-07 23:41:47
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/visitors

Modified Files:
HtmlPage.java NodeVisitor.java UrlModifyingVisitor.java

Log Message:
Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp).

Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour; specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().

Index: HtmlPage.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors/HtmlPage.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** HtmlPage.java 9 Nov 2003 17:07:17 -0000 1.38 --- HtmlPage.java 7 Dec 2003 23:41:43 -0000 1.39 *************** *** 33,37 **** import org.htmlparser.RemarkNode; import org.htmlparser.StringNode; !
import org.htmlparser.scanners.TableScanner; import org.htmlparser.tags.TableTag; import org.htmlparser.tags.Tag; --- 33,37 ---- import org.htmlparser.RemarkNode; import org.htmlparser.StringNode; ! import org.htmlparser.tags.BodyTag; import org.htmlparser.tags.TableTag; import org.htmlparser.tags.Tag; *************** *** 43,55 **** private NodeList nodesInBody; private NodeList tables; - private boolean bodyTagBegin; public HtmlPage(Parser parser) { ! super(false); ! parser.registerScanners(); ! parser.addScanner(new TableScanner(parser)); nodesInBody = new NodeList(); tables = new NodeList(); - bodyTagBegin = false; } --- 43,52 ---- private NodeList nodesInBody; private NodeList tables; public HtmlPage(Parser parser) { ! super(true); ! title = ""; nodesInBody = new NodeList(); tables = new NodeList(); } *************** *** 64,104 **** public void visitTag(Tag tag) { ! addTagToBodyIfApplicable(tag); ! ! if (isTable(tag)) { tables.add(tag); ! } ! else { ! if (isBodyTag(tag)) ! bodyTagBegin = true; ! } } ! public void visitEndTag(Tag tag) { ! if (isBodyTag(tag)) ! bodyTagBegin = false; ! addTagToBodyIfApplicable(tag); ! } ! ! private boolean isTable(Tag tag) { ! return tag instanceof TableTag; ! } ! ! private void addTagToBodyIfApplicable(Node node) { ! if (bodyTagBegin) ! nodesInBody.add(node); ! } ! ! public void visitRemarkNode(RemarkNode remarkNode) { ! addTagToBodyIfApplicable(remarkNode); ! } ! ! public void visitStringNode(StringNode stringNode) { ! addTagToBodyIfApplicable(stringNode); } ! private boolean isBodyTag(Tag tag) { ! return tag.getTagName().equals("BODY"); } --- 61,78 ---- public void visitTag(Tag tag) { ! if (isTable(tag)) tables.add(tag); ! else if (isBodyTag(tag)) ! nodesInBody = tag.getChildren (); } ! private boolean isTable(Tag tag) { ! return (tag instanceof TableTag); } ! private boolean isBodyTag(Tag tag) ! { ! return (tag instanceof BodyTag); } *************** *** 107,122 **** } ! 
public TableTag [] getTables() { TableTag [] tableArr = new TableTag[tables.size()]; ! for (int i=0;i<tables.size();i++) ! tableArr[i] = (TableTag)tables.elementAt(i); return tableArr; } ! ! ! public void visitTitleTag(TitleTag titleTag) { title = titleTag.getTitle(); } - } --- 81,94 ---- } ! public TableTag [] getTables() ! { TableTag [] tableArr = new TableTag[tables.size()]; ! tables.copyToNodeArray (tableArr); return tableArr; } ! public void visitTitleTag(TitleTag titleTag) ! { title = titleTag.getTitle(); } } Index: NodeVisitor.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors/NodeVisitor.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** NodeVisitor.java 9 Nov 2003 17:07:18 -0000 1.33 --- NodeVisitor.java 7 Dec 2003 23:41:43 -0000 1.34 *************** *** 67,71 **** * { * Parser parser = new Parser ("http://cbc.ca"); - * parser.registerScanners (); * Visitor visitor = new Visitor (); * parser.visitAllNodesWith (visitor); --- 67,70 ---- Index: UrlModifyingVisitor.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors/UrlModifyingVisitor.java,v retrieving revision 1.39 retrieving revision 1.40 diff -C2 -d -r1.39 -r1.40 *** UrlModifyingVisitor.java 9 Nov 2003 17:07:18 -0000 1.39 --- UrlModifyingVisitor.java 7 Dec 2003 23:41:43 -0000 1.40 *************** *** 34,39 **** import org.htmlparser.Parser; import org.htmlparser.StringNode; - import org.htmlparser.scanners.ImageScanner; - import org.htmlparser.scanners.LinkScanner; import org.htmlparser.tags.CompositeTag; import org.htmlparser.tags.ImageTag; --- 34,37 ---- *************** *** 49,54 **** super(true,true); this.parser = parser; - parser.addScanner(new LinkScanner()); - parser.addScanner(new ImageScanner(ImageTag.IMAGE_TAG_FILTER)); this.linkPrefix =linkPrefix; modifiedResult = new 
StringBuffer(); --- 47,50 ---- *************** *** 82,89 **** parent = tag.getParent (); if (null == parent) modifiedResult.append(tag.toHtml()); else ! modifiedResult.append(parent.toHtml()); } --- 78,89 ---- parent = tag.getParent (); + // process only those nodes not processed by a parent if (null == parent) + // an orphan end tag modifiedResult.append(tag.toHtml()); else ! if (null == parent.getParent ()) ! // a top level tag with no parents ! modifiedResult.append(parent.toHtml()); } |
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/visitorsTests
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests/visitorsTests

Modified Files:
HtmlPageTest.java LinkFindingVisitorTest.java TextExtractingVisitorTest.java

Log Message:
Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp).

Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour; specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
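To make the "add a tag to a factory" idea in the log message concrete, here is a minimal, self-contained sketch of a prototype-based factory. Every class below (`GenericTag`, `TableTagSketch`, `Factory`) is a hypothetical stand-in written for this illustration; it is not the PrototypicalNodeFactory API, whose constructors and methods should be checked in the javadocs.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a prototype-based tag factory; these classes are
// illustrative stand-ins, not the htmlparser API.
class PrototypeFactorySketch
{
    /** Generic tag; specialised tags subclass it and list the names they handle. */
    static class GenericTag implements Cloneable
    {
        String name = "";

        String[] getIds() { return new String[0]; }

        /** Copy this prototype so each parsed tag gets its own instance. */
        GenericTag copy()
        {
            try { return (GenericTag) super.clone(); }
            catch (CloneNotSupportedException e) { throw new AssertionError(e); }
        }
    }

    /** A specialised tag that claims the TABLE name. */
    static class TableTagSketch extends GenericTag
    {
        @Override String[] getIds() { return new String[] {"TABLE"}; }
    }

    static class Factory
    {
        private final Map<String, GenericTag> prototypes = new HashMap<>();

        Factory(GenericTag... tags)
        {
            for (GenericTag tag : tags) register(tag);
        }

        /** Register a prototype under every name it declares. */
        void register(GenericTag prototype)
        {
            for (String id : prototype.getIds())
                prototypes.put(id.toUpperCase(Locale.ROOT), prototype);
        }

        /** Clone the registered prototype, or fall back to a generic tag. */
        GenericTag create(String tagName)
        {
            String key = tagName.toUpperCase(Locale.ROOT);
            GenericTag prototype = prototypes.get(key);
            GenericTag tag = (prototype != null) ? prototype.copy() : new GenericTag();
            tag.name = key;
            return tag;
        }
    }

    public static void main(String[] args)
    {
        Factory factory = new Factory(new TableTagSketch());
        System.out.println(factory.create("table").getClass().getSimpleName());
        System.out.println(factory.create("div").getClass().getSimpleName());
    }
}
```

Registering only some prototypes, as in `main`, corresponds to the case the log message describes where a few derived tags are returned and everything else stays generic.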
Index: HtmlPageTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/visitorsTests/HtmlPageTest.java,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** HtmlPageTest.java 9 Nov 2003 17:07:17 -0000 1.15 --- HtmlPageTest.java 7 Dec 2003 23:41:43 -0000 1.16 *************** *** 55,58 **** --- 55,67 ---- "</html>"; + private static final String guts = + "Welcome to HTMLParser" + + "<table>" + + "<tr>" + + "<td>cell 1</td>" + + "<td>cell 2</td>" + + "</tr>" + + "</table>"; + private static final String PAGE_WITH_TABLE = "<html>" + *************** *** 61,71 **** "</head>" + "<body>" + ! "Welcome to HTMLParser" + ! "<table>" + ! "<tr>" + ! "<td>cell 1</td>" + ! "<td>cell 2</td>" + ! "</tr>" + ! "</table>" + "</body>" + "</html>"; --- 70,74 ---- "</head>" + "<body>" + ! guts + "</body>" + "</html>"; *************** *** 107,121 **** NodeList bodyNodes = page.getBody(); assertEquals("number of nodes in body",2,bodyNodes.size()); ! assertXmlEquals( ! "body html", ! "Welcome to HTMLParser" + ! "<table>" + ! "<tr>" + ! " <td>cell 1</td>" + ! " <td>cell 2</td>" + ! "</tr>" + ! "</table>", ! bodyNodes.asHtml() ! ); TableTag tables [] = page.getTables(); assertEquals("number of tables",1,tables.length); --- 110,114 ---- NodeList bodyNodes = page.getBody(); assertEquals("number of nodes in body",2,bodyNodes.size()); ! 
assertXmlEquals("body html", guts, bodyNodes.asHtml()); TableTag tables [] = page.getTables(); assertEquals("number of tables",1,tables.length); Index: LinkFindingVisitorTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/visitorsTests/LinkFindingVisitorTest.java,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** LinkFindingVisitorTest.java 9 Nov 2003 17:07:17 -0000 1.12 --- LinkFindingVisitorTest.java 7 Dec 2003 23:41:43 -0000 1.13 *************** *** 48,52 **** public void testLinkFoundCorrectly() throws Exception { createParser(html); - parser.registerScanners(); LinkFindingVisitor visitor = new LinkFindingVisitor("Industrial Logic"); parser.visitAllNodesWith(visitor); --- 48,51 ---- Index: TextExtractingVisitorTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/visitorsTests/TextExtractingVisitorTest.java,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** TextExtractingVisitorTest.java 9 Nov 2003 17:07:17 -0000 1.12 --- TextExtractingVisitorTest.java 7 Dec 2003 23:41:43 -0000 1.13 *************** *** 56,60 **** public void testSimpleVisitWithRegisteredScanners() throws Exception { createParser("<HTML><HEAD><TITLE>Hello World</TITLE></HEAD></HTML>"); - parser.registerScanners(); TextExtractingVisitor visitor = new TextExtractingVisitor(); parser.visitAllNodesWith(visitor); --- 56,59 ---- |
From: <der...@us...> - 2003-12-07 23:41:46
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests/utilTests

Modified Files:
BeanTest.java HTMLLinkProcessorTest.java

Log Message:
Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp).

Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour; specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
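Because tags now come back fully nested, code that used to scan a flat list has to walk children as well. The following self-contained sketch shows that recursion on a toy tree; the `Node` class and the `findAll` helper are hypothetical names invented here and only mimic the shape of the problem, not the library's node types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a toy node tree standing in for nested parser output.
class NestedNodesSketch
{
    static final class Node
    {
        final String name;
        final List<Node> children;

        Node(String name, Node... children)
        {
            this.name = name;
            this.children = new ArrayList<>(Arrays.asList(children));
        }
    }

    /** Collect every node with the wanted name, at any depth, in document order. */
    static List<Node> findAll(List<Node> roots, String wanted)
    {
        List<Node> found = new ArrayList<>();
        for (Node root : roots)
            collect(root, wanted, found);
        return found;
    }

    private static void collect(Node node, String wanted, List<Node> found)
    {
        if (node.name.equals(wanted))
            found.add(node);
        // Recurse: the node we want may be a child rather than a top-level node.
        for (Node child : node.children)
            collect(child, wanted, found);
    }

    public static void main(String[] args)
    {
        // Only one top-level node is returned; the script sits inside the body.
        List<Node> roots = Arrays.asList(
            new Node("BODY", new Node("SCRIPT"), new Node("TABLE", new Node("SCRIPT"))));
        System.out.println("top-level nodes: " + roots.size());
        System.out.println("scripts found: " + findAll(roots, "SCRIPT").size());
    }
}
```

The flat-versus-nested difference is the same one visible in the test changes in this commit, where a script tag that used to be its own top-level node is now reached as a child of the body tag.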
Index: BeanTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests/BeanTest.java,v retrieving revision 1.46 retrieving revision 1.47 diff -C2 -d -r1.46 -r1.47 *** BeanTest.java 9 Nov 2003 17:07:16 -0000 1.46 --- BeanTest.java 7 Dec 2003 23:41:43 -0000 1.47 *************** *** 214,218 **** parser = new Parser ("http://htmlparser.sourceforge.net/test/example.html"); - parser.registerScanners (); enumeration = parser.elements (); vector = new Vector (50); --- 214,217 ---- Index: HTMLLinkProcessorTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests/HTMLLinkProcessorTest.java,v retrieving revision 1.49 retrieving revision 1.50 diff -C2 -d -r1.49 -r1.50 *** HTMLLinkProcessorTest.java 9 Nov 2003 17:07:16 -0000 1.49 --- HTMLLinkProcessorTest.java 7 Dec 2003 23:41:43 -0000 1.50 *************** *** 72,76 **** public void testLinkWithNoSlashes() throws Exception { createParser("<A HREF=\".foo.txt\">Foo</A>","http://www.oygevalt.com"); - parser.registerScanners(); parseAndAssertNodeCount(1); assertTrue(node[0] instanceof LinkTag); --- 72,75 ---- |
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests
In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests/scannersTests

Modified Files:
AllTests.java CompositeTagScannerTest.java JspScannerTest.java ScriptScannerTest.java XmlEndTagScanningTest.java

Removed Files:
AppletScannerTest.java BaseHREFScannerTest.java BodyScannerTest.java BulletListScannerTest.java BulletScannerTest.java DivScannerTest.java FormScannerTest.java FrameScannerTest.java FrameSetScannerTest.java HeadScannerTest.java HtmlTest.java ImageScannerTest.java InputTagScannerTest.java LabelScannerTest.java LinkScannerTest.java MetaTagScannerTest.java OptionTagScannerTest.java SelectTagScannerTest.java SpanScannerTest.java StyleScannerTest.java TableScannerTest.java TextareaTagScannerTest.java TitleScannerTest.java

Log Message:
Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp).

Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour; specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests.

toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs etc. turned into escape sequences.
But if you were really interested in content you would be using toHtml() or toPlainTextString(). Index: AllTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/AllTests.java,v retrieving revision 1.52 retrieving revision 1.53 diff -C2 -d -r1.52 -r1.53 *** AllTests.java 9 Nov 2003 17:07:15 -0000 1.52 --- AllTests.java 7 Dec 2003 23:41:42 -0000 1.53 *************** *** 67,96 **** TestSuite suite = new TestSuite("Scanner Tests"); suite.addTestSuite(TagScannerTest.class); - suite.addTestSuite(AppletScannerTest.class); suite.addTestSuite(ScriptScannerTest.class); - suite.addTestSuite(ImageScannerTest.class); - suite.addTestSuite(LinkScannerTest.class); - suite.addTestSuite(StyleScannerTest.class); - suite.addTestSuite(MetaTagScannerTest.class); - suite.addTestSuite(TitleScannerTest.class); - suite.addTestSuite(FormScannerTest.class); - suite.addTestSuite(FrameScannerTest.class); - suite.addTestSuite(FrameSetScannerTest.class); - suite.addTestSuite(InputTagScannerTest.class); - suite.addTestSuite(OptionTagScannerTest.class); - suite.addTestSuite(SelectTagScannerTest.class); - suite.addTestSuite(TextareaTagScannerTest.class); - suite.addTestSuite(BaseHREFScannerTest.class); suite.addTestSuite(JspScannerTest.class); - suite.addTestSuite(TableScannerTest.class); - suite.addTestSuite(SpanScannerTest.class); - suite.addTestSuite(DivScannerTest.class); - suite.addTestSuite(LabelScannerTest.class); - suite.addTestSuite(BodyScannerTest.class); suite.addTestSuite(CompositeTagScannerTest.class); - suite.addTestSuite(HeadScannerTest.class); - suite.addTestSuite(BulletListScannerTest.class); - suite.addTestSuite(BulletScannerTest.class); - suite.addTestSuite(HtmlTest.class); suite.addTestSuite(XmlEndTagScanningTest.class); return suite; --- 67,73 ---- Index: CompositeTagScannerTest.java =================================================================== RCS file: 
/cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/CompositeTagScannerTest.java,v retrieving revision 1.53 retrieving revision 1.54 diff -C2 -d -r1.53 -r1.54 *** CompositeTagScannerTest.java 9 Nov 2003 17:07:15 -0000 1.53 --- CompositeTagScannerTest.java 7 Dec 2003 23:41:42 -0000 1.54 *************** *** 32,35 **** --- 32,36 ---- import org.htmlparser.AbstractNode; import org.htmlparser.Node; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.StringNode; import org.htmlparser.lexer.Page; *************** *** 72,76 **** private CustomTag parseCustomTag(int expectedNodeCount) throws ParserException { ! parser.addScanner(new CustomScanner()); parseAndAssertNodeCount(expectedNodeCount); assertType("node",CustomTag.class,node[0]); --- 73,77 ---- private CustomTag parseCustomTag(int expectedNodeCount) throws ParserException { ! parser.setNodeFactory (new PrototypicalNodeFactory (new CustomTag ())); parseAndAssertNodeCount(expectedNodeCount); assertType("node",CustomTag.class,node[0]); *************** *** 150,155 **** "</Custom>" ); ! parser.addScanner(new AnotherScanner()); ! CustomTag customTag = parseCustomTag(1); assertEquals("child count",1,customTag.getChildCount()); assertFalse("custom tag should not be xml end tag",customTag.isEmptyXmlTag()); --- 151,164 ---- "</Custom>" ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] ! { ! new CustomTag (), ! new AnotherTag (true), ! })); ! parseAndAssertNodeCount(1); ! assertType("node",CustomTag.class,node[0]); ! CustomTag customTag = (CustomTag)node[0]; assertEquals("child count",1,customTag.getChildCount()); assertFalse("custom tag should not be xml end tag",customTag.isEmptyXmlTag()); *************** *** 175,179 **** "<Custom/>" ); ! parser.addScanner(new CustomScanner()); parseAndAssertNodeCount(2); assertType("tag 1",CustomTag.class,node[0]); --- 184,188 ---- "<Custom/>" ); ! 
parser.setNodeFactory (new PrototypicalNodeFactory (new CustomTag ())); parseAndAssertNodeCount(2); assertType("tag 1",CustomTag.class,node[0]); *************** *** 189,194 **** "<Custom/>" ); ! parser.addScanner(new CustomScanner()); ! parser.addScanner(new AnotherScanner()); parseAndAssertNodeCount(2); assertType("first node",CustomTag.class,node[0]); --- 198,207 ---- "<Custom/>" ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] { ! new CustomTag (), ! new AnotherTag (false), ! })); parseAndAssertNodeCount(2); assertType("first node",CustomTag.class,node[0]); *************** *** 211,216 **** "<Custom/>" ); ! parser.addScanner(new CustomScanner()); ! parser.addScanner(new AnotherScanner()); parseAndAssertNodeCount(2); assertType("first node",CustomTag.class,node[0]); --- 224,233 ---- "<Custom/>" ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] { ! new CustomTag (), ! new AnotherTag (false), ! })); parseAndAssertNodeCount(2); assertType("first node",CustomTag.class,node[0]); *************** *** 240,245 **** "<Custom/>" ); ! parser.addScanner(new CustomScanner()); ! parser.addScanner(new AnotherScanner()); parseAndAssertNodeCount(2); assertType("first node",CustomTag.class,node[0]); --- 257,266 ---- "<Custom/>" ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] { ! new CustomTag (), ! new AnotherTag (false), ! })); parseAndAssertNodeCount(2); assertType("first node",CustomTag.class,node[0]); *************** *** 301,305 **** String tag2 = "<custom></endtag>"; createParser(tag1 + tag2); ! parser.addScanner(new CustomScanner(false)); parseAndAssertNodeCount(2); CustomTag customTag = (CustomTag)node[0]; --- 322,326 ---- String tag2 = "<custom></endtag>"; createParser(tag1 + tag2); ! parser.setNodeFactory (new PrototypicalNodeFactory (new CustomTag (false))); parseAndAssertNodeCount(2); CustomTag customTag = (CustomTag)node[0]; *************** *** 323,328 **** custom ); ! 
parser.addScanner(new AnotherScanner()); ! parser.addScanner(new CustomScanner()); parseAndAssertNodeCount(2); AnotherTag anotherTag = (AnotherTag)node[0]; --- 344,353 ---- custom ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] { ! new CustomTag (), ! new AnotherTag (false), ! })); parseAndAssertNodeCount(2); AnotherTag anotherTag = (AnotherTag)node[0]; *************** *** 346,351 **** "</custom>" ); ! parser.addScanner(new AnotherScanner(true)); ! CustomTag customTag = parseCustomTag(1); assertEquals("child count",1,customTag.getChildCount()); assertFalse("custom tag should be xml end tag",customTag.isEmptyXmlTag()); --- 371,384 ---- "</custom>" ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] ! { ! new CustomTag (), ! new AnotherTag (true), ! })); ! parseAndAssertNodeCount(1); ! assertType("node",CustomTag.class,node[0]); ! CustomTag customTag = (CustomTag)node[0]; assertEquals("child count",1,customTag.getChildCount()); assertFalse("custom tag should be xml end tag",customTag.isEmptyXmlTag()); *************** *** 368,373 **** "</custom>" ); ! parser.addScanner(new AnotherScanner(true)); ! CustomTag customTag = parseCustomTag(2); assertEquals("child count",1,customTag.getChildCount()); assertFalse("custom tag should not be xml end tag",customTag.isEmptyXmlTag()); --- 401,415 ---- "</custom>" ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] ! { ! new CustomTag (), ! new AnotherTag (true), ! })); ! parseAndAssertNodeCount(2); ! assertType("node",CustomTag.class,node[0]); ! CustomTag customTag = (CustomTag)node[0]; ! assertEquals("child count",1,customTag.getChildCount()); assertFalse("custom tag should not be xml end tag",customTag.isEmptyXmlTag()); *************** *** 399,404 **** "</custom>" ); ! parser.addScanner(new AnotherScanner(true)); ! 
CustomTag customTag = parseCustomTag(1); assertEquals("child count",1,customTag.getChildCount()); assertFalse("custom tag should not be xml end tag",customTag.isEmptyXmlTag()); --- 441,454 ---- "</custom>" ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] ! { ! new CustomTag (), ! new AnotherTag (true), ! })); ! parseAndAssertNodeCount(1); ! assertType("node",CustomTag.class,node[0]); ! CustomTag customTag = (CustomTag)node[0]; assertEquals("child count",1,customTag.getChildCount()); assertFalse("custom tag should not be xml end tag",customTag.isEmptyXmlTag()); *************** *** 418,423 **** String tag3 = "</custom>"; createParser(tag1 + tag2 + tag3); ! parser.addScanner(new CustomScanner(false)); ! parser.addScanner(new AnotherScanner()); parseAndAssertNodeCount(3); --- 468,477 ---- String tag3 = "</custom>"; createParser(tag1 + tag2 + tag3); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] { ! new CustomTag (false), ! new AnotherTag (false), ! })); parseAndAssertNodeCount(3); *************** *** 450,455 **** String tag3 = "</custom>"; createParser(tag1 + tag2 + tag3); ! parser.addScanner(new CustomScanner(false)); ! parser.addScanner(new AnotherScanner()); parseAndAssertNodeCount(3); --- 504,513 ---- String tag3 = "</custom>"; createParser(tag1 + tag2 + tag3); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] { ! new CustomTag (false), ! new AnotherTag (false), ! })); parseAndAssertNodeCount(3); *************** *** 499,514 **** createParser("<Custom/>","http://www.yahoo.com"); ! parser.addScanner(new CustomScanner() ! // { ! // public Tag createTag(Page page, int start, int end, Vector attributes, Tag startTag, Tag endTag, NodeList children) throws ParserException ! // { ! // if (null != page) ! // url = page.getUrl (); ! // else ! // url = null; ! // return (super.createTag (page, start, end, attributes, startTag, endTag, children)); ! // } ! // } ! 
); parseAndAssertNodeCount(1); assertStringEquals("url","http://www.yahoo.com",((AbstractNode)node[0]).getPage ().getUrl ()); --- 557,561 ---- createParser("<Custom/>","http://www.yahoo.com"); ! parser.setNodeFactory (new PrototypicalNodeFactory (new CustomTag ())); parseAndAssertNodeCount(1); assertStringEquals("url","http://www.yahoo.com",((AbstractNode)node[0]).getPage ().getUrl ()); *************** *** 526,531 **** "</custom>" ); ! parser.addScanner(new CustomScanner()); ! parser.addScanner(new AnotherScanner(false)); parseAndAssertNodeCount(1); assertType("root node",CustomTag.class, node[0]); --- 573,582 ---- "</custom>" ); ! parser.setNodeFactory ( ! new PrototypicalNodeFactory ( ! new Tag[] { ! new CustomTag (), ! new AnotherTag (false), ! })); parseAndAssertNodeCount(1); assertType("root node",CustomTag.class, node[0]); *************** *** 550,554 **** "</custom>" ); ! parser.addScanner(new CustomScanner(false)); parseAndAssertNodeCount(3); for (int i=0;i<nodeCount;i++) { --- 601,605 ---- "</custom>" ); ! parser.setNodeFactory (new PrototypicalNodeFactory (new CustomTag (false))); parseAndAssertNodeCount(3); for (int i=0;i<nodeCount;i++) { *************** *** 637,640 **** --- 688,696 ---- protected String[] mEnders; + /** + * The default scanner for custom tags. + */ + protected final static CustomScanner mDefaultScanner = new CustomScanner (); + public CustomTag () { *************** *** 648,651 **** --- 704,708 ---- else mEnders = mIds; + setThisScanner (mDefaultScanner); } *************** *** 667,670 **** --- 724,729 ---- return (mEnders); } + + } *************** *** 686,689 **** --- 745,753 ---- private final String[] mEndTagEnders; + /** + * The default scanner for custom tags. 
+ */ + protected final static AnotherScanner mDefaultScanner = new AnotherScanner (); + public AnotherTag (boolean acceptCustomTagsButDontAcceptCustomEndTags) { *************** *** 698,701 **** --- 762,766 ---- mEndTagEnders = new String[] {"CUSTOM"}; } + setThisScanner (mDefaultScanner); } Index: JspScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/JspScannerTest.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** JspScannerTest.java 9 Nov 2003 17:07:15 -0000 1.33 --- JspScannerTest.java 7 Dec 2003 23:41:42 -0000 1.34 *************** *** 30,33 **** --- 30,34 ---- import org.htmlparser.Parser; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.scanners.JspScanner; import org.htmlparser.tags.JspTag; *************** *** 58,63 **** "</h1>"); ! // Register the Jsp Scanner ! parser.addScanner(new JspScanner("-j")); parseAndAssertNodeCount(5); // The first node should be an JspTag --- 59,63 ---- "</h1>"); ! parser.setNodeFactory (new PrototypicalNodeFactory (new JspTag ())); parseAndAssertNodeCount(5); // The first node should be an JspTag *************** *** 89,94 **** "%>"); Parser.setLineSeparator("\r\n"); ! // Register the Jsp Scanner ! parser.addScanner(new JspScanner("-j")); parseAndAssertNodeCount(1); } --- 89,93 ---- "%>"); Parser.setLineSeparator("\r\n"); ! 
parser.setNodeFactory (new PrototypicalNodeFactory (new JspTag ())); parseAndAssertNodeCount(1); } Index: ScriptScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/ScriptScannerTest.java,v retrieving revision 1.47 retrieving revision 1.48 diff -C2 -d -r1.47 -r1.48 *** ScriptScannerTest.java 9 Nov 2003 17:07:15 -0000 1.47 --- ScriptScannerTest.java 7 Dec 2003 23:41:42 -0000 1.48 *************** *** 34,37 **** --- 34,38 ---- import org.htmlparser.Parser; import org.htmlparser.scanners.ScriptScanner; + import org.htmlparser.tags.BodyTag; import org.htmlparser.tags.ScriptTag; import org.htmlparser.tests.ParserTestCase; *************** *** 53,58 **** String testHtml = "<SCRIPT>document.write(d+\".com\")</SCRIPT>"; createParser(testHtml,"http://www.google.com/test/index.html"); - // Register the script scanner - parser.addScanner(new ScriptScanner("-s")); parseAndAssertNodeCount(1); assertTrue("Node should be a script tag",node[0] instanceof ScriptTag); --- 54,57 ---- *************** *** 74,79 **** String src = "../js/DetermineBrowser.js"; createParser("<SCRIPT LANGUAGE=\"JavaScript\" SRC=\"" + src + "\"></SCRIPT>","http://www.google.com/test/index.html"); - // Register the image scanner - parser.addScanner(new ScriptScanner("-s")); parseAndAssertNodeCount(1); assertTrue("Node should be a script tag",node[0] instanceof ScriptTag); --- 73,76 ---- *************** *** 113,125 **** createParser(testHTML1,"http://www.google.com/test/index.html"); Parser.setLineSeparator("\r\n"); ! // Register the image scanner ! parser.addScanner(new ScriptScanner("-s")); ! ! parseAndAssertNodeCount(2); ! ! assertTrue("Node should be a script tag",node[1] ! instanceof ScriptTag); ! // Check the data in the applet tag ! 
ScriptTag scriptTag = (ScriptTag)node[1]; String s = scriptTag.getScriptCode(); assertStringEquals("Expected Script Code",testHTML2,s); --- 110,120 ---- createParser(testHTML1,"http://www.google.com/test/index.html"); Parser.setLineSeparator("\r\n"); ! parseAndAssertNodeCount(1); ! assertTrue("Node should be a body tag", node[0] instanceof BodyTag); ! BodyTag body = (BodyTag)node[0]; ! assertTrue("Node should have one child", 1 == body.getChildCount ()); ! assertTrue("Child should be a script tag", body.getChild (0) instanceof ScriptTag); ! // Check the data in the script tag ! ScriptTag scriptTag = (ScriptTag)body.getChild (0); String s = scriptTag.getScriptCode(); assertStringEquals("Expected Script Code",testHTML2,s); *************** *** 135,142 **** createParser(testHTML1,"http://www.hardwareextreme.com/"); - // Register the image scanner - parser.registerScanners(); - //parser.addScanner(new HTMLScriptScanner("-s")); - parseAndAssertNodeCount(2); assertTrue("Node should be a script tag",node[0] --- 130,133 ---- *************** *** 156,161 **** createParser("<SCRIPT Language=\"JavaScript\">"+expectedCode+ "</SCRIPT>","http://www.hardwareextreme.com/"); - // Register the image scanner - parser.registerScanners(); parseAndAssertNodeCount(1); assertTrue("Node should be a script tag",node[0] --- 147,150 ---- *************** *** 180,185 **** "</SCRIPT>"; createParser(testHtml); - - parser.addScanner(new ScriptScanner("-s")); parseAndAssertNodeCount(1); ScriptTag scriptTag = (ScriptTag)node[0]; --- 169,172 ---- *************** *** 219,223 **** "</html>" ); - parser.registerScanners(); Node scriptNodes [] = parser.extractAllNodesThatAre(ScriptTag.class); --- 206,209 ---- *************** *** 250,254 **** "</SCRIPT>" ); - parser.registerScanners(); parseAndAssertNodeCount(1); assertType("script",ScriptTag.class,node[0]); --- 236,239 ---- *************** *** 269,273 **** "</SCRIPT>" ); - parser.registerScanners(); parseAndAssertNodeCount(1); 
assertType("script",ScriptTag.class,node[0]); --- 254,257 ---- *************** *** 485,489 **** "</script>" ); - parser.registerScanners(); parseAndAssertNodeCount(1); --- 469,472 ---- *************** *** 509,513 **** String scriptContents = "alert()\r\nalert()"; createParser("<script>" + scriptContents + "</script>"); - parser.registerScanners(); parseAndAssertNodeCount(1); assertType("script",ScriptTag.class,node[0]); --- 492,495 ---- *************** *** 526,530 **** public void testScanNoEndTag() throws ParserException { createParser("<script>"); - parser.addScanner(new ScriptScanner("-s")); parseAndAssertNodeCount(1); } --- 508,511 ---- *************** *** 537,541 **** String html = "<SCRIPT language=\"JavaScript\">document.write('</SCRIPT>');</SCRIPT>"; createParser(html); - parser.addScanner(new ScriptScanner("-s")); parseAndAssertNodeCount(1); assertStringEquals ("Parse error", html, node[0].toHtml ()); --- 518,521 ---- *************** *** 547,551 **** String javascript = "\n// This is javascript with <li> tag in the comment\n"; createParser("<script>"+ javascript + "</script>"); - parser.registerScanners(); parseAndAssertNodeCount(1); assertTrue("Node should be a script tag",node[0] instanceof ScriptTag); --- 527,530 ---- *************** *** 561,565 **** "that spans multiple lines;\"\n"; createParser("<script>"+ javascript + "</script>"); - parser.registerScanners(); parseAndAssertNodeCount(1); assertTrue("Node should be a script tag",node[0] instanceof ScriptTag); --- 540,543 ---- *************** *** 573,577 **** String javascript = "\nAnything inside the script tag should be unchanged, even <li> and other html tags\n"; createParser("<script>"+ javascript + "</script>"); - parser.registerScanners(); parseAndAssertNodeCount(1); assertTrue("Node should be a script tag",node[0] instanceof ScriptTag); --- 551,554 ---- Index: XmlEndTagScanningTest.java =================================================================== RCS file: 
/cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/XmlEndTagScanningTest.java,v retrieving revision 1.33 retrieving revision 1.34 diff -C2 -d -r1.33 -r1.34 *** XmlEndTagScanningTest.java 9 Nov 2003 17:07:15 -0000 1.33 --- XmlEndTagScanningTest.java 7 Dec 2003 23:41:42 -0000 1.34 *************** *** 46,50 **** public void testSingleTagParsing() throws ParserException { createParser("<div style=\"page-break-before: always; \" />"); - parser.registerScanners(); parseAndAssertNodeCount(1); assertType("div tag",Div.class,node[0]); --- 46,49 ---- --- AppletScannerTest.java DELETED --- --- BaseHREFScannerTest.java DELETED --- --- BodyScannerTest.java DELETED --- --- BulletListScannerTest.java DELETED --- --- BulletScannerTest.java DELETED --- --- DivScannerTest.java DELETED --- --- FormScannerTest.java DELETED --- --- FrameScannerTest.java DELETED --- --- FrameSetScannerTest.java DELETED --- --- HeadScannerTest.java DELETED --- --- HtmlTest.java DELETED --- --- ImageScannerTest.java DELETED --- --- InputTagScannerTest.java DELETED --- --- LabelScannerTest.java DELETED --- --- LinkScannerTest.java DELETED --- --- MetaTagScannerTest.java DELETED --- --- OptionTagScannerTest.java DELETED --- --- SelectTagScannerTest.java DELETED --- --- SpanScannerTest.java DELETED --- --- StyleScannerTest.java DELETED --- --- TableScannerTest.java DELETED --- --- TextareaTagScannerTest.java DELETED --- --- TitleScannerTest.java DELETED --- |
From: <der...@us...> - 2003-12-07 23:41:45
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/parserHelperTests In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests/parserHelperTests Modified Files: RemarkNodeParserTest.java StringParserTest.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
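The log message above says that tags now come back fully nested, so callers have to recurse into the returned nodes to reach the tag they want. The following is only a minimal sketch of that traversal pattern. It deliberately does not use the htmlparser classes: SketchNode, findAll and collect are hypothetical stand-ins invented for illustration, and the real library's node and tag types may be shaped differently.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedNodeSketch {

    /** Hypothetical stand-in for a parsed node; NOT the htmlparser Node class. */
    static class SketchNode {
        final String name;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name, SketchNode... kids) {
            this.name = name;
            for (SketchNode kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Returns every node in the tree rooted at root whose name matches. */
    static List<SketchNode> findAll(SketchNode root, String name) {
        List<SketchNode> found = new ArrayList<>();
        collect(root, name, found);
        return found;
    }

    /** Depth-first walk: test the node itself, then recurse into its children. */
    private static void collect(SketchNode node, String name, List<SketchNode> found) {
        if (node.name.equals(name)) {
            found.add(node);
        }
        for (SketchNode child : node.children) {
            collect(child, name, found);
        }
    }

    public static void main(String[] args) {
        // Loosely modelled on the nesting the updated tests assert:
        // a META tag is no longer top-level but sits under HTML > HEAD.
        SketchNode html = new SketchNode("HTML",
                new SketchNode("HEAD",
                        new SketchNode("TITLE"),
                        new SketchNode("META")),
                new SketchNode("BODY"));

        System.out.println("META nodes found: " + findAll(html, "META").size());
    }
}
```

With a flat node list, a caller could index straight to the tag it wanted; with nested results the same lookup becomes a walk like the one above, or a chain of child accesses when the structure is known in advance.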
Index: RemarkNodeParserTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/parserHelperTests/RemarkNodeParserTest.java,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** RemarkNodeParserTest.java 9 Nov 2003 17:07:15 -0000 1.40 --- RemarkNodeParserTest.java 7 Dec 2003 23:41:41 -0000 1.41 *************** *** 31,34 **** --- 31,35 ---- import org.htmlparser.Parser; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.RemarkNode; import org.htmlparser.StringNode; *************** *** 75,78 **** --- 76,80 ---- "<TEST>\n"+ "</TEST>\n"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); Parser.setLineSeparator("\r\n"); parseAndAssertNodeCount(15); *************** *** 98,101 **** --- 100,104 ---- "<TEST>\n"+ "</TEST>\n"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); Parser.setLineSeparator("\r\n"); parseAndAssertNodeCount(15); *************** *** 122,125 **** --- 125,129 ---- "<TEST>\n"+ "</TEST>\n"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); Parser.setLineSeparator("\r\n"); parseAndAssertNodeCount(15); *************** *** 157,160 **** --- 161,165 ---- "\n"+ "-->"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); Parser.setLineSeparator("\r\n"); parseAndAssertNodeCount(1); *************** *** 172,175 **** --- 177,181 ---- public void testRemarkNodeWithNothing() throws ParserException { createParser("<!-->"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertTrue("Node should be a RemarkNode",node[0] instanceof RemarkNode); *************** *** 189,192 **** --- 195,199 ---- "<A>\n"+ "bcd -->"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); Parser.setLineSeparator("\n"); parseAndAssertNodeCount(1); *************** *** 210,213 **** --- 217,221 ---- "-\n"+ "ssd -->"); + parser.setNodeFactory (new 
PrototypicalNodeFactory (true)); Parser.setLineSeparator("\n"); parseAndAssertNodeCount(1); *************** *** 227,230 **** --- 235,239 ---- public void testDashesInComment() throws ParserException{ createParser("<!-- -- -->"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertTrue("Node should be a HTMLRemarkNode but was "+node[0],node[0] instanceof RemarkNode); *************** *** 273,276 **** --- 282,286 ---- + "</HTML>\n" ); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(18); assertTrue("Node should be a RemarkNode but was "+node[12],node[12] instanceof RemarkNode); *************** *** 296,299 **** --- 306,310 ---- + "</HTML>\n" ); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(18); assertTrue("Node should be a RemarkNode but was "+node[12],node[12] instanceof RemarkNode); *************** *** 319,322 **** --- 330,334 ---- + "</HTML>\n" ); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(18); assertTrue("Node should be a RemarkNode but was "+node[12],node[12] instanceof RemarkNode); *************** *** 369,372 **** --- 381,385 ---- + "</html>\n" ); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount (18); } Index: StringParserTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/parserHelperTests/StringParserTest.java,v retrieving revision 1.43 retrieving revision 1.44 diff -C2 -d -r1.43 -r1.44 *** StringParserTest.java 9 Nov 2003 17:07:15 -0000 1.43 --- StringParserTest.java 7 Dec 2003 23:41:42 -0000 1.44 *************** *** 29,36 **** package org.htmlparser.tests.parserHelperTests; import org.htmlparser.Parser; import org.htmlparser.RemarkNode; import org.htmlparser.StringNode; ! 
import org.htmlparser.scanners.LinkScanner; import org.htmlparser.tags.LinkTag; import org.htmlparser.tags.MetaTag; --- 29,39 ---- package org.htmlparser.tests.parserHelperTests; + import org.htmlparser.Parser; + import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.RemarkNode; import org.htmlparser.StringNode; ! import org.htmlparser.tags.HeadTag; ! import org.htmlparser.tags.Html; import org.htmlparser.tags.LinkTag; import org.htmlparser.tags.MetaTag; *************** *** 59,62 **** --- 62,66 ---- public void testStringNodeBug1() throws ParserException { createParser("<HTML><HEAD><TITLE>Google</TITLE>"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(5); // The fourth node should be a StringNode- with the text - Google *************** *** 80,84 **** "Acrobat Reader</A> installed on your computer."); Parser.setLineSeparator("\r\n"); - parser.addScanner(new LinkScanner("-l")); parseAndAssertNodeCount(3); // The first node should be a StringNode- with the text - view these documents, you must have --- 84,87 ---- *************** *** 104,108 **** public void testTagCharsInStringNode() throws ParserException { createParser("<a href=\"http://asgard.ch\">[> ASGARD <]</a>"); - parser.addScanner(new LinkScanner("-l")); parseAndAssertNodeCount(1); assertTrue("Node identified must be a link tag",node[0] instanceof LinkTag); --- 107,110 ---- *************** *** 114,117 **** --- 116,120 ---- public void testToPlainTextString() throws ParserException { createParser("<HTML><HEAD><TITLE>This is the Title</TITLE></HEAD><BODY>Hello World, this is the HTML Parser</BODY></HTML>"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(10); assertTrue("Fourth Node identified must be a string node",node[3] instanceof StringNode); *************** *** 125,128 **** --- 128,132 ---- public void testToHTML() throws ParserException { createParser("<HTML><HEAD><TITLE>This is the Title</TITLE></HEAD><BODY>Hello 
World, this is the HTML Parser</BODY></HTML>"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(10); assertTrue("Fourth Node identified must be a string node",node[3] instanceof StringNode); *************** *** 140,143 **** --- 144,148 ---- "<br>" ); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(4); assertTrue("Third Node identified must be a string node",node[2] instanceof StringNode); *************** *** 152,155 **** --- 157,161 ---- "Before Comment <!-- Comment --> After Comment" ); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(3); assertTrue("First node should be StringNode",node[0] instanceof StringNode); *************** *** 171,174 **** --- 177,181 ---- public void testLastLineWithOneChar() throws ParserException { createParser("a"); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertTrue("First node should be StringNode",node[0] instanceof StringNode); *************** *** 180,183 **** --- 187,191 ---- String text = "a\n\nb"; createParser(text); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertTrue("First node should be StringNode",node[0] instanceof StringNode); *************** *** 211,218 **** "</html>" ); ! parser.registerScanners(); ! parseAndAssertNodeCount(10); ! assertType("fourth node",MetaTag.class,node[4]); ! MetaTag metaTag = (MetaTag)node[4]; assertStringEquals( --- 219,231 ---- "</html>" ); ! parseAndAssertNodeCount(2); ! assertTrue(node[1] instanceof Html); ! Html htmlTag = (Html)node[1]; ! assertTrue("The HTML tag should have 3 nodes", 3 == htmlTag.getChildCount ()); ! assertTrue("The first child should be a HEAD tag",htmlTag.getChild(0) instanceof HeadTag); ! HeadTag headTag = (HeadTag)htmlTag.getChild(0); ! assertTrue("The HEAD tag should have 2 nodes", 2 == headTag.getChildCount ()); ! 
assertTrue("The second child should be a META tag",headTag.getChild(1) instanceof MetaTag); ! MetaTag metaTag = (MetaTag)headTag.getChild(1); assertStringEquals( *************** *** 226,229 **** --- 239,243 ---- String text = "Testing &\nRefactoring"; createParser(text); + parser.setNodeFactory (new PrototypicalNodeFactory (true)); parseAndAssertNodeCount(1); assertType("first node",StringNode.class,node[0]); |
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/nodeDecoratorTests In directory sc8-pr-cvs1:/tmp/cvs-serv16537/tests/nodeDecoratorTests Modified Files: DecodingNodeTest.java EscapeCharacterRemovingNodeTest.java NonBreakingSpaceConvertingNodeTest.java Log Message: Remove most of the scanners. The only scanners left are ones that really do something different (script and jsp). Instead of registering a scanner to enable returning a specific tag, you now add a tag to a PrototypicalNodeFactory. All known tags are 'registered' by default in a new Parser, which is similar to having called the old 'registerDOMScanners()', so tags are fully nested. This is different behaviour, and specifically, you will need to recurse into returned nodes to get at what you want. I've tried to adjust the applications accordingly, but worked examples are still scarce. If you want to return only some of the derived tags while keeping most as generic tags, there are various constructors and manipulators on the factory. See the javadocs and examples in the tests package. Nearly all the old scanner tests are folded into the tag tests. toString() has been revamped. This means that the default Parser mainline now returns an indented listing of tags, making it easy to see the structure of a page. The downside is that the text of the page had to have newlines, tabs etc. turned into escape sequences. But if you were really interested in content you would be using toHtml() or toPlainTextString().
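The log message above describes the revamped toString() as producing an indented listing in which newlines, tabs and similar characters in page text are turned into escape sequences. The sketch below illustrates that general idea only; it is not the library's implementation. SketchNode, escape and listing are hypothetical names, and the two-space indent and the exact set of escaped characters are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class IndentedListingSketch {

    /** Hypothetical stand-in for a parsed node; NOT the htmlparser Node class. */
    static class SketchNode {
        final String label;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String label, SketchNode... kids) {
            this.label = label;
            for (SketchNode kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Replaces newline, carriage return and tab with visible escape sequences. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds one line per node, indented two spaces per nesting level (assumed width). */
    static String listing(SketchNode node, int depth) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        // Escaping keeps each node on a single line so the indentation stays readable.
        out.append(escape(node.label)).append('\n');
        for (SketchNode child : node.children) {
            out.append(listing(child, depth + 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SketchNode body = new SketchNode("BODY",
                new SketchNode("SCRIPT",
                        new SketchNode("alert()\r\nalert()")));
        System.out.print(listing(body, 0));
    }
}
```

The escaping is what makes the one-line-per-node layout possible: raw line breaks inside text would otherwise break the indentation, which is the trade-off the log message mentions when it points readers to toHtml() or toPlainTextString() for actual content.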
Index: DecodingNodeTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/nodeDecoratorTests/DecodingNodeTest.java,v retrieving revision 1.16 retrieving revision 1.17 diff -C2 -d -r1.16 -r1.17 *** DecodingNodeTest.java 9 Nov 2003 17:07:14 -0000 1.16 --- DecodingNodeTest.java 7 Dec 2003 23:41:41 -0000 1.17 *************** *** 51,57 **** StringBuffer decodedContent = new StringBuffer(); StringNodeFactory stringNodeFactory = new StringNodeFactory(); ! stringNodeFactory.setNodeDecoding(true); createParser(STRING_TO_DECODE); ! parser.setStringNodeFactory(stringNodeFactory); NodeIterator nodes = parser.elements(); --- 51,57 ---- StringBuffer decodedContent = new StringBuffer(); StringNodeFactory stringNodeFactory = new StringNodeFactory(); ! stringNodeFactory.setDecode (true); createParser(STRING_TO_DECODE); ! parser.setNodeFactory(stringNodeFactory); NodeIterator nodes = parser.elements(); Index: EscapeCharacterRemovingNodeTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/nodeDecoratorTests/EscapeCharacterRemovingNodeTest.java,v retrieving revision 1.15 retrieving revision 1.16 diff -C2 -d -r1.15 -r1.16 *** EscapeCharacterRemovingNodeTest.java 9 Nov 2003 17:07:14 -0000 1.15 --- EscapeCharacterRemovingNodeTest.java 7 Dec 2003 23:41:41 -0000 1.16 *************** *** 52,58 **** StringNodeFactory stringNodeFactory = new StringNodeFactory(); ! stringNodeFactory.setEscapeCharacterRemoval(true); createParser(STRING_TO_DECODE); ! parser.setStringNodeFactory(stringNodeFactory); NodeIterator nodes = parser.elements(); --- 52,58 ---- StringNodeFactory stringNodeFactory = new StringNodeFactory(); ! stringNodeFactory.setRemoveEscapes (true); createParser(STRING_TO_DECODE); ! 
parser.setNodeFactory(stringNodeFactory); NodeIterator nodes = parser.elements(); *************** *** 100,108 **** StringNodeFactory stringNodeFactory = new StringNodeFactory(); ! stringNodeFactory.setNodeDecoding(true); ! stringNodeFactory.setEscapeCharacterRemoval(true); createParser(ENCODED_WORKSHOP_TITLE); ! parser.setStringNodeFactory(stringNodeFactory); NodeIterator nodes = parser.elements(); --- 100,108 ---- StringNodeFactory stringNodeFactory = new StringNodeFactory(); ! stringNodeFactory.setDecode (true); ! stringNodeFactory.setRemoveEscapes (true); createParser(ENCODED_WORKSHOP_TITLE); ! parser.setNodeFactory(stringNodeFactory); NodeIterator nodes = parser.elements(); Index: NonBreakingSpaceConvertingNodeTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/nodeDecoratorTests/NonBreakingSpaceConvertingNodeTest.java,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** NonBreakingSpaceConvertingNodeTest.java 9 Nov 2003 17:07:14 -0000 1.14 --- NonBreakingSpaceConvertingNodeTest.java 7 Dec 2003 23:41:41 -0000 1.15 *************** *** 51,57 **** StringNodeFactory stringNodeFactory = new StringNodeFactory(); ! stringNodeFactory.setNonBreakSpaceConversion(true); createParser(STRING_TO_DECODE); ! parser.setStringNodeFactory(stringNodeFactory); NodeIterator nodes = parser.elements(); --- 51,57 ---- StringNodeFactory stringNodeFactory = new StringNodeFactory(); ! stringNodeFactory.setConvertNonBreakingSpaces (true); createParser(STRING_TO_DECODE); ! parser.setNodeFactory(stringNodeFactory); NodeIterator nodes = parser.elements(); |