htmlparser-cvs Mailing List for HTML Parser (Page 22)
Brought to you by:
derrickoswald
From: <der...@us...> - 2004-02-09 02:12:55
Update of /cvsroot/htmlparser/htmlparser
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9169

Modified Files:
	build.xml
Log Message:
Rework character entity translation. See task 58599 enhance character reference translation. Decode now handles missing semi colons, encoding is more efficient, hexadecimal numeric character entity references are handled and both encoding and decoding make minimal use of substring(). Augmented the tests in CharacterTranslationTest significantly, and merged the Generate class into the tests. Added translate command scripts in bin, which read from stdin and write to stdout.

Index: build.xml
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/build.xml,v
retrieving revision 1.58
retrieving revision 1.59
diff -C2 -d -r1.58 -r1.59
*** build.xml 14 Jan 2004 02:53:46 -0000 1.58
--- build.xml 9 Feb 2004 02:09:43 -0000 1.59
***************
*** 301,304 ****
--- 301,305 ----
          <pathelement location="${junit.jar}"/>
          <pathelement location="${commons-logging.jar}"/>
+         <pathelement location="${java.home}/../lib/tools.jar"/>
        </classpath>
      </javac>
***************
*** 309,312 ****
--- 310,314 ----
          <pathelement location="${junit.jar}"/>
          <pathelement location="${commons-logging.jar}"/>
+         <pathelement location="${java.home}/../lib/tools.jar"/>
        </classpath>
        <arg value="-text"/>
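The log message above mentions decoding that tolerates a missing semicolon and handles hexadecimal numeric character references. The following is a minimal sketch of that idea only, not the code committed to the project: the class name NumericReferenceSketch and its decode method are hypothetical, and named entities such as &amp; are deliberately left out.

```java
// Hypothetical illustration; this is NOT the htmlparser translation class.
final class NumericReferenceSketch
{
    /**
     * Decode decimal (&#65;) and hexadecimal (&#x41;) character references.
     * The terminating semicolon is optional. Named entities are not handled.
     */
    static String decode (String text)
    {
        StringBuilder out = new StringBuilder (text.length ());
        int i = 0;
        while (i < text.length ())
        {
            char ch = text.charAt (i);
            if ('&' == ch && i + 2 < text.length () && '#' == text.charAt (i + 1))
            {
                int j = i + 2;
                int radix = 10;
                if ('x' == text.charAt (j) || 'X' == text.charAt (j))
                {
                    radix = 16;
                    j++;
                }
                int start = j;
                long code = 0;
                // accumulate digits in place, without creating substrings;
                // note Character.digit also accepts non-ASCII digits
                while (j < text.length ()
                    && -1 != Character.digit (text.charAt (j), radix)
                    && code <= Character.MAX_CODE_POINT)
                {
                    code = code * radix + Character.digit (text.charAt (j), radix);
                    j++;
                }
                if (j > start && code <= Character.MAX_CODE_POINT)
                {
                    out.appendCodePoint ((int)code);
                    if (j < text.length () && ';' == text.charAt (j))
                        j++; // consume the optional semicolon
                    i = j;
                    continue;
                }
            }
            // not a usable reference: copy the character through unchanged
            out.append (ch);
            i++;
        }
        return (out.toString ());
    }
}
```

In this sketch a reference without a semicolon ends at the first character that is not a digit in the chosen radix, so a following letter such as "b" would be absorbed by a hexadecimal reference.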
From: <der...@us...> - 2004-02-07 14:42:42
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15266/lexer Modified Files: Lexer.java Log Message: Fix bug #891058 Bug in lexer. Patch submitted by Gernot Fricke. This change causes attribute parsing to be more 'greedy' resulting in 'empty' attributes consuming the next attribute. This brings the lexer parsing more in line with other (browser) interpretations and simplifies it immensely. Index: Lexer.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Lexer.java,v retrieving revision 1.25 retrieving revision 1.26 diff -C2 -d -r1.25 -r1.26 *** Lexer.java 24 Jan 2004 17:13:43 -0000 1.25 --- Lexer.java 7 Feb 2004 12:53:09 -0000 1.26 *************** *** 531,534 **** --- 531,535 ---- * <li>state 4 - within single quoted attribute value</li> * <li>state 5 - within double quoted attribute value</li> + * <li>state 6 - whitespaces after attribute name could lead to state 2 (=)or state 0</li> * </ol> * <p> *************** *** 541,544 **** --- 542,546 ---- * The first slot is for attribute name (kind of like a standalone attribute). * @param cursor The position at which to start scanning. + * @return The parsed tag. */ protected Node parseTag (Cursor cursor) *************** *** 555,559 **** attributes = new Vector (); state = 0; ! bookmarks = new int[7]; bookmarks[0] = cursor.getPosition (); while (!done) --- 557,561 ---- attributes = new Vector (); state = 0; ! bookmarks = new int[8]; bookmarks[0] = cursor.getPosition (); while (!done) *************** *** 595,601 **** else if (Character.isWhitespace (ch)) { ! standalone (attributes, bookmarks); ! bookmarks[0] = bookmarks[2]; ! state = 0; } else if ('=' == ch) --- 597,604 ---- else if (Character.isWhitespace (ch)) { ! // whitespaces might be followed by next attribute or an equal sign ! // see Bug #891058 Bug in lexer. ! 
bookmarks[6] = bookmarks[2]; // setting the bookmark[0] is done in state 6 if applicable ! state = 6; } else if ('=' == ch) *************** *** 619,626 **** } else if (Character.isWhitespace (ch)) ! { ! empty (attributes, bookmarks); ! bookmarks[0] = bookmarks[3]; ! state = 0; } else --- 622,629 ---- } else if (Character.isWhitespace (ch)) ! { ! // collect white spaces after "=" into the assignment string; ! // do nothing ! // see Bug #891058 Bug in lexer. } else *************** *** 666,669 **** --- 669,708 ---- } break; + // patch for lexer state correction by + // Gernot Fricke + // See Bug # 891058 Bug in lexer. + case 6: // undecided for state 0 or 2 + // we have read white spaces after an attributte name + if (0 == ch) + { + // same as last else clause + standalone (attributes, bookmarks); + bookmarks[0]=bookmarks[6]; + cursor.retreat(); + state=0; + } + else if (Character.isWhitespace (ch)) + { + // proceed + } + else if ('=' == ch) // yepp. the white spaces belonged to the equal. + { + bookmarks[2] = bookmarks[6]; + bookmarks[3] = bookmarks[7]; + state=2; + } + else + { + // white spaces were not ended by equal + // meaning the attribute was a stand alone attribute + // now: create the stand alone attribute and rewind + // the cursor to the end of the white spaces + // and restart scanning as whitespace attribute. + standalone (attributes, bookmarks); + bookmarks[0]=bookmarks[6]; + cursor.retreat(); + state=0; + } + break; default: throw new IllegalStateException ("how the fuck did we get in state " + state); *************** *** 671,811 **** } - // OK, before constructing the node, fix up erroneous attributes - fixAttributes (attributes); - return (makeTag (cursor, attributes)); } /** - * Try to resolve bad attributes. - * Look for the following patterns and assume what they meant was the - * construct on the right: - * <p>Rule 1. 
- * <pre> - * att = -> att= - * </pre> - * An attribute named "=", converts a previous standalone attribute into - * an empty attribute. - * <p>Rule 2. - * <pre> - * att =value -> att=value - * </pre> - * An attribute name beginning with an equals sign, is the value of - * a previous standalone attribute. - * <p>Rule 3. - * <pre> - * att= "value" -> att="value" - * </pre> - * A quoted attribute name, is the value of a previous empty - * attribute. - * <p>Rule 4 and Rule 5. - * <pre> - * att="va"lue" -> att='va"lue' - * att='val'ue' -> att="val'ue" - * </pre> - * An attribute name ending in a quote is a second part of a - * similarly quoted value of a previous attribute. Note, this doesn't - * change the quote value but it should, or the contained quote should be - * removed. - * <p>Note: - * <pre> - * att = "value" -> att="value" - * </pre> - * A quoted attribute name, is the value of a previous standalone - * attribute separated by an attribute named "=" will be handled by - * sequential application of rule 1 and 3. 
- */ - protected void fixAttributes (Vector attributes) throws ParserException - { - PageAttribute attribute; - Cursor cursor; - char ch1; // name starting character - char ch2; // name ending character - PageAttribute prev1; // attribute prior to the current - PageAttribute prev2; // attribute prior but one to the current - char quote; - - cursor = new Cursor (getPage (), 0); - prev1 = null; - prev2 = null; - // leave the name alone & start with second attribute - for (int i = 2; i < attributes.size (); ) - { - attribute = (PageAttribute)attributes.elementAt (i); - if (!attribute.isWhitespace ()) - { - cursor.setPosition (attribute.getNameStartPosition ()); - ch1 = attribute.getPage ().getCharacter (cursor); - cursor.setPosition (attribute.getNameEndPosition () - 1); - ch2 = attribute.getPage ().getCharacter (cursor); - if ('=' == ch1) - { // possible rule 1 or 2 - // check for a previous standalone, both rules need it, also check prev1 as a sanity check - if (null != prev2 && prev2.isStandAlone () && prev1.isWhitespace ()) - { - if (1 == attribute.getNameEndPosition () - attribute.getNameStartPosition ()) - { // rule 1, an isolated equals sign - prev2.setValueStartPosition (attribute.getNameEndPosition ()); - attributes.removeElementAt (i); // current - attributes.removeElementAt (i - 1); // whitespace - prev1 = prev2; - prev2 = null; - i--; - continue; - } - else - { - // rule 2, name starts with equals - prev2.setValueStartPosition (attribute.getNameStartPosition () + 1); // past the equals sign - prev2.setValueEndPosition (attribute.getNameEndPosition ()); - attributes.removeElementAt (i); // current - attributes.removeElementAt (i - 1); // whitespace - prev1 = prev2; - prev2 = null; - i--; - continue; - } - } - } - else if ((('\'' == ch1) && ('\'' == ch2)) || (('"' == ch1) && ('"' == ch2))) - { // possible rule 3 - // check for a previous empty, also check prev1 as a sanity check - if (null != prev2 && prev2.isEmpty () && prev1.isWhitespace ()) - { // TODO 
check that name has more than one character - prev2.setValueStartPosition (attribute.getNameStartPosition () + 1); - prev2.setValueEndPosition (attribute.getNameEndPosition () - 1); - prev2.setQuote (ch1); - attributes.removeElementAt (i); // current - attributes.removeElementAt (i - 1); // whitespace - prev1 = prev2; - prev2 = null; - i--; - continue; - } - } - else if (('\'' == ch2) || ('"' == ch2)) - { // possible rule 4 or 5 - // check for a previous valued attribute - if (null != prev1 && prev1.isValued ()) - { // check for a terminating quote of the same type - cursor.setPosition (prev1.getValueEndPosition ()); - ch1 = prev1.getPage ().getCharacter (cursor); // crossing pages with cursor? - if (ch1 == ch2) - { - prev1.setValueEndPosition (attribute.getNameEndPosition () - 1); - attributes.removeElementAt (i); // current - continue; - } - } - } - } - // shift and go on to next attribute - prev2 = prev1; - prev1 = attribute; - i++; - } - } - - /** * Create a tag node based on the current cursor and the one provided. */ --- 710,717 ---- |
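The effect of the new state 6 can be illustrated with a much smaller scanner. This is a sketch under stated assumptions, not the Lexer code from the diff: the class GreedyAttributeSketch is hypothetical, it works on the text after the tag name, it stores results in a map with upper-cased names (mirroring what the tests look up) and it uses null for a standalone attribute. The state numbers follow the javadoc list in the diff.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of "greedy" attribute scanning; not Lexer.parseTag.
final class GreedyAttributeSketch
{
    static Map<String, String> parse (String text)
    {
        Map<String, String> result = new LinkedHashMap<String, String> ();
        StringBuilder name = new StringBuilder ();
        StringBuilder value = new StringBuilder ();
        int state = 0;
        // one extra pass with ch == 0 flushes whatever is pending at the end
        for (int i = 0; i <= text.length (); i++)
        {
            char ch = (i < text.length ()) ? text.charAt (i) : 0;
            boolean end = (0 == ch);
            boolean space = !end && Character.isWhitespace (ch);
            switch (state)
            {
                case 0: // between attributes
                    if (!end && !space) { name.append (ch); state = 1; }
                    break;
                case 1: // within an attribute name
                    if (end) { store (result, name, null); state = 0; }
                    else if (space) state = 6;
                    else if ('=' == ch) state = 2;
                    else name.append (ch);
                    break;
                case 2: // after the equals sign, whitespace is skipped
                    if (end) { store (result, name, ""); state = 0; }
                    else if (space) { /* proceed */ }
                    else if ('\'' == ch) state = 4;
                    else if ('"' == ch) state = 5;
                    else { value.append (ch); state = 3; }
                    break;
                case 3: // unquoted value, ended by whitespace
                    if (end || space) { store (result, name, flush (value)); state = 0; }
                    else value.append (ch);
                    break;
                case 4: // single quoted value
                case 5: // double quoted value
                    if (end || ch == ((4 == state) ? '\'' : '"'))
                    { store (result, name, flush (value)); state = 0; }
                    else value.append (ch);
                    break;
                case 6: // whitespace after a name: could lead to state 2 or state 0
                    if (space) { /* proceed */ }
                    else if ('=' == ch) state = 2; // the whitespace belonged to the equals
                    else
                    {   // standalone attribute; rescan this character in state 0
                        store (result, name, null);
                        if (!end) i--;
                        state = 0;
                    }
                    break;
                default:
                    throw new IllegalStateException ("unexpected state " + state);
            }
        }
        return (result);
    }

    private static String flush (StringBuilder value)
    {
        String ret = value.toString ();
        value.setLength (0);
        return (ret);
    }

    private static void store (Map<String, String> result, StringBuilder name, String value)
    {
        result.put (name.toString ().toUpperCase (Locale.ROOT), value);
        name.setLength (0);
    }
}
```

With this sketch, the input "att = other=fred" yields a single attribute ATT whose value is "other=fred", because the whitespace on both sides of the equals sign is absorbed and the unquoted value runs to the next whitespace.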
From: <der...@us...> - 2004-02-07 14:09:57
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15266/tests/lexerTests Modified Files: AttributeTests.java Log Message: Fix bug #891058 Bug in lexer. Patch submitted by Gernot Fricke. This change causes attribute parsing to be more 'greedy' resulting in 'empty' attributes consuming the next attribute. This brings the lexer parsing more in line with other (browser) interpretations and simplifies it immensely. Index: AttributeTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/AttributeTests.java,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** AttributeTests.java 14 Jan 2004 02:53:47 -0000 1.11 --- AttributeTests.java 7 Feb 2004 12:53:09 -0000 1.12 *************** *** 58,65 **** public void getParameterTableFor(String tagContents) { String html; NodeIterator iterator; Node node; - Tag tag; Vector attributes; --- 58,69 ---- public void getParameterTableFor(String tagContents) { + getParameterTableFor (tagContents, false); + } + + public void getParameterTableFor(String tagContents, boolean dump) + { String html; NodeIterator iterator; Node node; Vector attributes; *************** *** 75,81 **** tag = (Tag)node; attributes = tag.getAttributesEx (); ! // for (int i = 0; i < attributes.size (); i++) ! // System.out.print ("|" + attributes.elementAt (i)); ! // System.out.println ("|"); table = tag.getAttributes (); } --- 79,100 ---- tag = (Tag)node; attributes = tag.getAttributesEx (); ! if (dump) ! { ! for (int i = 0; i < attributes.size (); i++) ! { ! System.out.print ("Attribute #" + i); ! Attribute attribute = (Attribute)attributes.elementAt (i); ! if (null != attribute.getName ()) ! System.out.print (" Name: '" + attribute.getName () + "'"); ! if (null != attribute.getAssignment ()) ! 
System.out.print (" Assignment: '" + attribute.getAssignment () + "'"); ! if (0 != attribute.getQuote ()) ! System.out.print (" Quote: " + attribute.getQuote ()); ! if (null != attribute.getValue ()) ! System.out.print (" Value: '" + attribute.getValue () + "'"); ! System.out.println (); ! } ! System.out.println (); ! } table = tag.getAttributes (); } *************** *** 98,101 **** --- 117,121 ---- { Vector attributes; + Tag tag; String html; *************** *** 126,129 **** --- 146,150 ---- Attribute space; Vector attributes; + Tag tag; String html; *************** *** 180,183 **** --- 201,205 ---- { Vector attributes; + Tag tag; String html; *************** *** 208,211 **** --- 230,234 ---- Attribute space; Vector attributes; + Tag tag; String html; *************** *** 455,458 **** --- 478,482 ---- /** * Test Rule 1. + * See discussion in Bug#891058 Bug in lexer. regarding alternate interpretations. */ public void testRule1 () *************** *** 460,467 **** getParameterTableFor ("tag att = other=fred"); assertTrue ("Attribute missing", table.containsKey ("ATT")); ! assertEquals ("Attribute has wrong value", "", (String)table.get ("ATT")); assertTrue ("No attribute should be called equal sign", !table.containsKey ("=")); - assertTrue ("Attribute missing", table.containsKey ("OTHER")); - assertEquals ("Attribute has wrong value", "fred", (String)table.get ("OTHER")); } --- 484,489 ---- getParameterTableFor ("tag att = other=fred"); assertTrue ("Attribute missing", table.containsKey ("ATT")); ! assertEquals ("Attribute has wrong value", "other=fred", (String)table.get ("ATT")); assertTrue ("No attribute should be called equal sign", !table.containsKey ("=")); } *************** *** 494,497 **** --- 516,520 ---- /** * Test Rule 4. + * See discussion in Bug#891058 Bug in lexer. regarding alternate interpretations. 
*/ public void testRule4 () *************** *** 499,504 **** getParameterTableFor ("tag att=\"va\"lue\" other=fred"); assertTrue ("Attribute missing", table.containsKey ("ATT")); ! assertEquals ("Attribute has wrong value", "va\"lue", (String)table.get ("ATT")); assertTrue ("No attribute should be called va\"lue", !table.containsKey ("VA\"LUE")); assertTrue ("Attribute missing", table.containsKey ("OTHER")); assertEquals ("Attribute has wrong value", "fred", (String)table.get ("OTHER")); --- 522,529 ---- getParameterTableFor ("tag att=\"va\"lue\" other=fred"); assertTrue ("Attribute missing", table.containsKey ("ATT")); ! assertEquals ("Attribute has wrong value", "va", (String)table.get ("ATT")); assertTrue ("No attribute should be called va\"lue", !table.containsKey ("VA\"LUE")); + assertTrue ("Attribute missing", table.containsKey ("LUE\"")); + assertNull ("Attribute has wrong value", table.get ("LUE\"")); assertTrue ("Attribute missing", table.containsKey ("OTHER")); assertEquals ("Attribute has wrong value", "fred", (String)table.get ("OTHER")); *************** *** 507,510 **** --- 532,536 ---- /** * Test Rule 5. + * See discussion in Bug#891058 Bug in lexer. regarding alternate interpretations. */ public void testRule5 () *************** *** 512,517 **** getParameterTableFor ("tag att='va'lue' other=fred"); assertTrue ("Attribute missing", table.containsKey ("ATT")); ! assertEquals ("Attribute has wrong value", "va'lue", (String)table.get ("ATT")); assertTrue ("No attribute should be called va'lue", !table.containsKey ("VA'LUE")); assertTrue ("Attribute missing", table.containsKey ("OTHER")); assertEquals ("Attribute has wrong value", "fred", (String)table.get ("OTHER")); --- 538,545 ---- getParameterTableFor ("tag att='va'lue' other=fred"); assertTrue ("Attribute missing", table.containsKey ("ATT")); ! 
assertEquals ("Attribute has wrong value", "va", (String)table.get ("ATT")); assertTrue ("No attribute should be called va'lue", !table.containsKey ("VA'LUE")); + assertTrue ("Attribute missing", table.containsKey ("LUE'")); + assertNull ("Attribute has wrong value", table.get ("LUE'")); assertTrue ("Attribute missing", table.containsKey ("OTHER")); assertEquals ("Attribute has wrong value", "fred", (String)table.get ("OTHER")); |
From: <der...@pr...> - 2004-01-31 20:52:46
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28080

Modified Files:
	Page.java
Log Message:
Compare encoding names without case sensitivity.
From the HTML spec (http://www.w3.org/TR/html4/charset.html section 5.2.1):
    Names for character encodings are case-insensitive, so that for example "SHIFT_JIS", "Shift_JIS", and "shift_jis" are equivalent.
and from IANA (http://www.iana.org/assignments/character-sets):
    The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters.

Index: Page.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** Page.java 10 Jan 2004 15:23:33 -0000 1.32
--- Page.java 31 Jan 2004 20:51:01 -0000 1.33
***************
*** 684,688 ****
          encoding = getEncoding ();
!         if (!encoding.equals (character_set))
          {
              stream = getSource ().getStream ();
--- 684,688 ----
          encoding = getEncoding ();
!         if (!encoding.equalsIgnoreCase (character_set))
          {
              stream = getSource ().getStream ();
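As a small illustration of the comparison this commit switches to, the sketch below contrasts the case-insensitive name check with a lookup through java.nio.charset.Charset, which also resolves aliases. The class EncodingCompareSketch and its method names are hypothetical, and the alias-aware variant goes beyond what the commit does; it is shown only for comparison.

```java
import java.nio.charset.Charset;

// Hypothetical illustration; Page.java only uses equalsIgnoreCase.
final class EncodingCompareSketch
{
    /** The check used in the commit: the same name, ignoring letter case. */
    static boolean sameName (String a, String b)
    {
        return (a.equalsIgnoreCase (b));
    }

    /**
     * A stricter alternative (an assumption, not part of the commit):
     * resolve both names to Charset objects so aliases compare equal.
     * Charset.forName throws an unchecked exception for names that are
     * illegal or not supported by the running JVM.
     */
    static boolean sameCharset (String a, String b)
    {
        return (Charset.forName (a).equals (Charset.forName (b)));
    }
}
```

The name check never throws and works for encodings the JVM does not know, which may be why a plain string comparison is sufficient for deciding whether the stream needs to be re-read.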
From: <der...@pr...> - 2004-01-31 16:33:15
Update of /cvsroot/htmlparser/htmlparser/src/doc-files In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv17338 Modified Files: overview.html Removed Files: todo.html Log Message: Move ToDo list to SourceForge trackers and tasks. Index: overview.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/doc-files/overview.html,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** overview.html 16 Dec 2003 02:29:56 -0000 1.1 --- overview.html 31 Jan 2004 16:31:20 -0000 1.2 *************** *** 20,30 **** Parser project on Sourceforge</a> if you haven't already, and then follow the <A href="{@docRoot}/doc-files/building.html">build instructions</A>. - <h2>History</h2> - <p> - Originally started by Somik Raha, the HTML Parser has evolved with input from - numerous people, and through several revisions... <h2>Outstanding Issues.</h2> ! The <A href="{@docRoot}/doc-files/todo.html">ToDo list</A> lists things that ! can or should be done. <h2>Mailing Lists.</h2> If you want to be notified when new releases of HTML Parser are available, join the --- 20,69 ---- Parser project on Sourceforge</a> if you haven't already, and then follow the <A href="{@docRoot}/doc-files/building.html">build instructions</A>. <h2>Outstanding Issues.</h2> ! Bugs are by far, the highest priority issues. Various reports of bugs related to ! the HTML Parser is available from the <A ! HREF="http://sourceforge.net/tracker/?group_id=24399&atid=381399">Bug ! Tracker</A> on SourceForge. Issues related to incorrect behaviour of the ! current parser should be logged and tracked using this mechanism. Please use ! task lists and enhancement requests for issues that would not be considered ! bugs. ! <p> ! Several task lists are used to track the items that are not percieved as bugs, ! but are viewed by developers as things that need attention. The following list ! summarizes the purpose and target issues for each list. ! 
<ul> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21604&group_id=24399&func=browse"> ! Applications</A> - Work associated with the sample applications included with ! the HTML Parser download is tracked by this list. This would also include ! proposals for other example applications.</li> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21648&group_id=24399&func=browse"> ! Release</A> - Work to be done before a major release is tracked by this list. ! Items included here must be resolved before the major release is considered ! complete. This can include refactoring, code clean-up, out-of-the-box ! experience work, build process fixes, platform (JDK) issues, performance ! or scalability enhancements, memory usage issues and other 'quality' issues ! that are not associated with a specific bug.</li> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21601&group_id=24399&func=browse"> ! API</A> - Work needed to enhance or fix the parser API is tracked by this list. ! Standards compliance, additional classes, method signatures, changes to ! parameter types, refactoring, deprecation, new or enhanced constructors, and ! other programatic interface issues would fall into this category. This list ! should be limited to those changes that could impact the developer community ! that relies on existing behaviour from the parser.</li> ! <li><A HREF="http://sourceforge.net/pm/task.php?group_project_id=21602&group_id=24399&func=browse"> ! Documentation</A> - Work associated with documenting the parser and it's ! example code and sample applications is tracked by this list. Javadocs, the ! web site and Wiki, Sourceforge site maintenance, mailing lists, forums, ! project documentation and other developer visible reference material would all ! fall under this category.</li> ! </ul> ! <p> ! The <A HREF="http://sourceforge.net/tracker/?group_id=24399&atid=381402">Request ! 
For Enhancement</A> list contains items that are proposed for future versions ! of the parser. Users may add to this list what they feel are extensions beyond ! simple bug fixing. Some user entered bugs are also transferred to this list if ! the scope of the fix would be too significant a change for the current ! version, or involve API changes that need to be vetted against the current ! user community. <h2>Mailing Lists.</h2> If you want to be notified when new releases of HTML Parser are available, join the --- todo.html DELETED --- |
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23128/tags Modified Files: AppletTag.java CompositeTag.java FormTag.java SelectTag.java TableRow.java TableTag.java Log Message: Fix bug #882940 empty applet tag contents causes NullPointerException Also found and fixed other similar problems where getChildren() could return null. Then changed table row and column handling to handle rows and columns embedded within other tags. Index: AppletTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/AppletTag.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** AppletTag.java 2 Jan 2004 16:24:54 -0000 1.38 --- AppletTag.java 24 Jan 2004 23:57:52 -0000 1.39 *************** *** 94,114 **** ret = new Hashtable (); kids = getChildren (); ! for (int i = 0; i < kids.size (); i++) ! { ! node = children.elementAt(i); ! if (node instanceof Tag) { ! tag = (Tag)node; ! if (tag.getTagName().equals ("PARAM")) { ! paramName = tag.getAttribute ("NAME"); ! if (null != paramName && 0 != paramName.length ()) { ! paramValue = tag.getAttribute ("VALUE"); ! ret.put (paramName,paramValue); } } } - } return (ret); --- 94,115 ---- ret = new Hashtable (); kids = getChildren (); ! if (null != kids) ! for (int i = 0; i < kids.size (); i++) { ! node = children.elementAt(i); ! if (node instanceof Tag) { ! tag = (Tag)node; ! if (tag.getTagName().equals ("PARAM")) { ! paramName = tag.getAttribute ("NAME"); ! if (null != paramName && 0 != paramName.length ()) ! { ! paramValue = tag.getAttribute ("VALUE"); ! ret.put (paramName,paramValue); ! } } } } return (ret); *************** *** 195,223 **** kids = getChildren (); ! // erase appletParams from kids ! for (int i = 0; i < kids.size (); ) ! { ! node = kids.elementAt (i); ! if (node instanceof Tag) ! if (((Tag)node).getTagName ().equals ("PARAM")) ! { ! 
kids.remove (i); ! // remove whitespace too ! if (i < kids.size ()) { ! node = kids.elementAt (i); ! if (node instanceof StringNode) { ! string = (StringNode)node; ! if (0 == string.getText ().trim ().length ()) ! kids.remove (i); ! } } ! } else i++; ! else ! i++; ! } // add newAppletParams to kids --- 196,227 ---- kids = getChildren (); ! if (null == kids) ! kids = new NodeList (); ! else ! // erase appletParams from kids ! for (int i = 0; i < kids.size (); ) ! { ! node = kids.elementAt (i); ! if (node instanceof Tag) ! if (((Tag)node).getTagName ().equals ("PARAM")) { ! kids.remove (i); ! // remove whitespace too ! if (i < kids.size ()) { ! node = kids.elementAt (i); ! if (node instanceof StringNode) ! { ! string = (StringNode)node; ! if (0 == string.getText ().trim ().length ()) ! kids.remove (i); ! } ! } } ! else ! i++; else i++; ! } // add newAppletParams to kids Index: CompositeTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v retrieving revision 1.73 retrieving revision 1.74 diff -C2 -d -r1.73 -r1.74 *** CompositeTag.java 24 Jan 2004 17:41:32 -0000 1.73 --- CompositeTag.java 24 Jan 2004 23:57:52 -0000 1.74 *************** *** 84,88 **** public Node getChild (int index) { ! return (getChildren ().elementAt (index)); } --- 84,90 ---- public Node getChild (int index) { ! return ( ! (null == getChildren ()) ? null : ! getChildren ().elementAt (index)); } *************** *** 93,97 **** public Node [] getChildrenAsNodeArray () { ! return (getChildren ().toNodeArray ()); } --- 95,101 ---- public Node [] getChildrenAsNodeArray () { ! return ( ! (null == getChildren ()) ? new Node[0] : ! getChildren ().toNodeArray ()); } *************** *** 102,106 **** public void removeChild (int i) { ! getChildren ().remove (i); } --- 106,111 ---- public void removeChild (int i) { ! if (null != getChildren ()) ! 
getChildren ().remove (i); } *************** *** 112,116 **** public SimpleNodeIterator elements() { ! return (getChildren ().elements ()); } --- 117,123 ---- public SimpleNodeIterator elements() { ! return ( ! (null == getChildren ()) ? new NodeList ().elements () : ! getChildren ().elements ()); } *************** *** 212,222 **** * Note that this will not check for parent types, and will not * recurse through child tags ! * @param classType ! * @return NodeList */ ! public NodeList searchFor(Class classType) { ! return (getChildren ().searchFor (classType)); } /** * Searches for any node whose text representation contains the search --- 219,233 ---- * Note that this will not check for parent types, and will not * recurse through child tags ! * @param classType The class to search for. ! * @param recursive If true, recursively search through the children. ! * @return A list of children found. */ ! public NodeList searchFor (Class classType, boolean recursive) { ! return ( ! (null == getChildren ()) ? new NodeList () : ! getChildren ().searchFor (classType, recursive)); } + /** * Searches for any node whose text representation contains the search *************** *** 285,293 **** /** * Get child at given index ! * @param index ! * @return Node */ ! public Node childAt(int index) { ! return (getChildren ().elementAt (index)); } --- 296,307 ---- /** * Get child at given index ! * @param index The index into the child node list. ! * @return Node The child node at the given index or null if none. */ ! public Node childAt (int index) ! { ! return ( ! (null == getChildren ()) ? null : ! 
getChildren ().elementAt (index)); } Index: FormTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FormTag.java,v retrieving revision 1.46 retrieving revision 1.47 diff -C2 -d -r1.46 -r1.47 *** FormTag.java 14 Jan 2004 02:53:46 -0000 1.46 --- FormTag.java 24 Jan 2004 23:57:52 -0000 1.47 *************** *** 95,99 **** public NodeList getFormInputs() { ! return (getChildren().searchFor(InputTag.class, true)); } --- 95,99 ---- public NodeList getFormInputs() { ! return (searchFor (InputTag.class, true)); } *************** *** 104,108 **** public NodeList getFormTextareas() { ! return (getChildren().searchFor(TextareaTag.class, true)); } --- 104,108 ---- public NodeList getFormTextareas() { ! return (searchFor (TextareaTag.class, true)); } Index: SelectTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/SelectTag.java,v retrieving revision 1.37 retrieving revision 1.38 diff -C2 -d -r1.37 -r1.38 *** SelectTag.java 14 Jan 2004 02:53:46 -0000 1.37 --- SelectTag.java 24 Jan 2004 23:57:52 -0000 1.38 *************** *** 88,92 **** OptionTag[] ret; ! list = getChildren ().searchFor (OptionTag.class, true); ret = new OptionTag[list.size()]; list.copyToNodeArray (ret); --- 88,92 ---- OptionTag[] ret; ! 
list = searchFor (OptionTag.class, true); ret = new OptionTag[list.size()]; list.copyToNodeArray (ret); Index: TableRow.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableRow.java,v retrieving revision 1.39 retrieving revision 1.40 diff -C2 -d -r1.39 -r1.40 *** TableRow.java 24 Jan 2004 18:12:41 -0000 1.39 --- TableRow.java 24 Jan 2004 23:57:52 -0000 1.40 *************** *** 27,30 **** --- 27,37 ---- package org.htmlparser.tags; + import org.htmlparser.NodeFilter; + import org.htmlparser.filters.AndFilter; + import org.htmlparser.filters.IsEqualFilter; + import org.htmlparser.filters.NodeClassFilter; + import org.htmlparser.filters.HasParentFilter; + import org.htmlparser.filters.NotFilter; + import org.htmlparser.filters.OrFilter; import org.htmlparser.util.NodeList; *************** *** 79,121 **** /** * Get the number of columns in this row. */ public int getColumnCount () { ! return ( ! (null == getChildren ()) ? 0 : ! getChildren ().searchFor (TableColumn.class).size ()); } /** ! * Get the children (columns) of this row. */ ! public TableColumn [] getColumns () { ! NodeList list; ! TableColumn [] ret; ! if (null != getChildren ()) { ! list = getChildren ().searchFor (TableColumn.class); ! ret = new TableColumn[list.size ()]; ! list.copyToNodeArray (ret); } else ! ret = new TableColumn[0]; ! return (ret); } /** - * Checks if this table has a header - * @return <code>true</code> if there is a header tag. - */ - public boolean hasHeader () - { - return (0 != getHeaderCount ()); - } - - /** * Get the number of headers in this row. * @return The count of header tags in this row. --- 86,174 ---- /** + * Get the column tags within this row. 
+ */ + public TableColumn[] getColumns () + { + NodeList kids; + NodeClassFilter cls; + HasParentFilter recursion; + NodeFilter filter; + TableColumn[] ret; + + kids = getChildren (); + if (null != kids) + { + cls = new NodeClassFilter (TableRow.class); + recursion = new HasParentFilter (null); + filter = new OrFilter ( + new AndFilter ( + cls, + new IsEqualFilter (this)), + new AndFilter ( // recurse up the parent chain + new NotFilter (cls), // but not past the first row + recursion)); + recursion.mFilter = filter; + kids = kids.extractAllNodesThatMatch ( + // it's a column, and has this row as it's enclosing row + new AndFilter ( + new NodeClassFilter (TableColumn.class), + filter), true); + ret = new TableColumn[kids.size ()]; + kids.copyToNodeArray (ret); + } + else + ret = new TableColumn[0]; + + return (ret); + } + + /** * Get the number of columns in this row. */ public int getColumnCount () { ! return (getColumns ().length); } /** ! * Get the header of this table ! * @return Table header tags contained in this row. */ ! public TableHeader[] getHeaders () { ! NodeList kids; ! NodeClassFilter cls; ! HasParentFilter recursion; ! NodeFilter filter; ! TableHeader[] ret; ! kids = getChildren (); ! if (null != kids) { ! cls = new NodeClassFilter (TableRow.class); ! recursion = new HasParentFilter (null); ! filter = new OrFilter ( ! new AndFilter ( ! cls, ! new IsEqualFilter (this)), ! new AndFilter ( // recurse up the parent chain ! new NotFilter (cls), // but not past the first row ! recursion)); ! recursion.mFilter = filter; ! kids = kids.extractAllNodesThatMatch ( ! // it's a header, and has this row as it's enclosing row ! new AndFilter ( ! new NodeClassFilter (TableHeader.class), ! filter), true); ! ret = new TableHeader[kids.size ()]; ! kids.copyToNodeArray (ret); } else ! ret = new TableHeader[0]; ! return (ret); } /** * Get the number of headers in this row. * @return The count of header tags in this row. 
*************** *** 123,150 **** public int getHeaderCount () { ! return ( ! (null == getChildren ()) ? 0 : ! getChildren ().searchFor (TableHeader.class, false).size ()); } /** ! * Get the header of this table ! * @return Table header tags contained in this row. */ ! public TableHeader[] getHeader () { ! NodeList list; ! TableHeader [] ret; ! ! if (null != getChildren ()) ! { ! list = getChildren ().searchFor (TableHeader.class, false); ! ret = new TableHeader[list.size ()]; ! list.copyToNodeArray (ret); ! } ! else ! ret = new TableHeader[0]; ! ! return (ret); } } --- 176,189 ---- public int getHeaderCount () { ! return (getHeaders ().length); } /** ! * Checks if this table has a header ! * @return <code>true</code> if there is a header tag. */ ! public boolean hasHeader () { ! return (0 != getHeaderCount ()); } } Index: TableTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableTag.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** TableTag.java 2 Jan 2004 16:24:55 -0000 1.38 --- TableTag.java 24 Jan 2004 23:58:04 -0000 1.39 *************** *** 27,30 **** --- 27,39 ---- package org.htmlparser.tags; + import org.htmlparser.NodeFilter; + import org.htmlparser.filters.AndFilter; + import org.htmlparser.filters.IsEqualFilter; + import org.htmlparser.filters.NodeClassFilter; + import org.htmlparser.filters.HasParentFilter; + import org.htmlparser.filters.NotFilter; + import org.htmlparser.filters.OrFilter; + import org.htmlparser.util.NodeList; + /** * A table tag. *************** *** 68,76 **** /** * Get the number of rows in this table. */ ! public int getRowCount() { ! return (getChildren().searchFor(TableRow.class).size()); } --- 77,124 ---- /** + * Get the row tags within this table. + * @return The rows directly contained by this table. 
+ */ + public TableRow[] getRows () + { + NodeList kids; + NodeClassFilter cls; + HasParentFilter recursion; + NodeFilter filter; + TableRow[] ret; + + kids = getChildren (); + if (null != kids) + { + cls = new NodeClassFilter (TableTag.class); + recursion = new HasParentFilter (null); + filter = new OrFilter ( + new AndFilter ( + cls, + new IsEqualFilter (this)), + new AndFilter ( // recurse up the parent chain + new NotFilter (cls), // but not past the first table + recursion)); + recursion.mFilter = filter; + kids = kids.extractAllNodesThatMatch ( + // it's a row, and has this table as its enclosing table + new AndFilter ( + new NodeClassFilter (TableRow.class), + filter), true); + ret = new TableRow[kids.size ()]; + kids.copyToNodeArray (ret); + } + else + ret = new TableRow[0]; + + return (ret); + } + + /** * Get the number of rows in this table. */ ! public int getRowCount () { ! return (getRows ().length); } *************** *** 78,83 **** * Get the row at the given index. */ ! public TableRow getRow(int i) { ! return (TableRow)(getChildren().searchFor(TableRow.class)).elementAt(i); } --- 126,141 ---- * Get the row at the given index. */ ! public TableRow getRow (int i) ! { ! TableRow[] rows; ! TableRow ret; ! ! rows = getRows (); ! if (i < rows.length) ! ret = rows[i]; ! else ! ret = null; ! ! return (ret); } |
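Editorial note on the getColumns/getHeaders/getRows changes above: the composed filter appears intended to accept a node only when the nearest enclosing TableRow (or TableTag) found by walking up the parent chain is this one, so that rows and columns of nested tables are not counted. The sketch below illustrates that idea only. It does not use the htmlparser filter classes; the Node class, the tag names and the helpers enclosingTable, rowsOf and demo are all hypothetical stand-ins invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class EnclosingTableSketch
{
    /** Hypothetical minimal node: a tag name, a parent link and children. */
    static final class Node
    {
        final String name;
        Node parent;
        final List<Node> children = new ArrayList<> ();

        Node (String name)
        {
            this.name = name;
        }

        /** Attach a child, set its parent, and return this node for chaining. */
        Node add (Node child)
        {
            child.parent = this;
            children.add (child);
            return (this);
        }
    }

    /** Walk up the parent chain; return the first TABLE ancestor, or null. */
    static Node enclosingTable (Node node)
    {
        for (Node p = node.parent; null != p; p = p.parent)
            if (p.name.equals ("TABLE"))
                return (p);
        return (null);
    }

    /** Collect TR descendants whose nearest enclosing TABLE is the given one. */
    static List<Node> rowsOf (Node table)
    {
        List<Node> ret = new ArrayList<> ();
        collect (table, table, ret);
        return (ret);
    }

    private static void collect (Node table, Node current, List<Node> ret)
    {
        for (Node child : current.children)
        {
            if (child.name.equals ("TR") && enclosingTable (child) == table)
                ret.add (child);
            collect (table, child, ret); // search all descendants
        }
    }

    /** Outer table: one row inside a TBODY, one row holding a nested table. */
    static int[] demo ()
    {
        Node inner = new Node ("TABLE").add (new Node ("TR"));
        Node outer = new Node ("TABLE")
            .add (new Node ("TBODY").add (new Node ("TR")))
            .add (new Node ("TR").add (new Node ("TD").add (inner)));
        return (new int[] { rowsOf (outer).size (), rowsOf (inner).size () });
    }

    public static void main (String[] args)
    {
        int[] counts = demo ();
        System.out.println ("outer rows: " + counts[0] + ", inner rows: " + counts[1]);
    }
}
```

In this sketch the row of the nested table is reached during the recursive search but rejected for the outer table, because the upward walk stops at the inner TABLE first. The row wrapped in a TBODY is still accepted, since the walk passes through non-table ancestors.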
Update of /cvsroot/htmlparser/htmlparser/docs/docs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26789/docs/docs Modified Files: CustomTagExtraction.html EmailExtraction.html FactoryMethod.html ImageExtraction.html LinkExtraction.html PostOperation.html ReverseHtml.html SamplePrograms.html SearchingForData.html StringExtraction.html TemplateMethod.html WritingYourOwnScanners.html index.html Added Files: CustomTagLinks.html CustomVisitorLinks.html FilterLinks.html LexerLinks.html LinkBeanLinks.html VisitorLinks.html Log Message: Update version to 1.4-20040125 --- NEW FILE: CustomTagLinks.html --- <html><head><title>Custom Tag Links</title></head><body> <div class="wikitext"> <p><b>Using Custom Tags to Extract Links <p>The use of custom tags provides for altered behaviour during the parse: <pre> import org.htmlparser.Parser; import org.htmlparser.PrototypicalNodeFactory; import org.htmlparser.tags.LinkTag; import org.htmlparser.util.NodeIterator; import org.htmlparser.util.ParserException; class MyLinkTag extends LinkTag { public void doSemanticAction () throws ParserException { System.out.print ("\"" + getLinkText () + "\" => "); System.out.println (getLink ()); } } public class LinkDemo { public static void main (String[] args) throws ParserException { Parser parser = new Parser ("http://urlIWantToParse.com"); PrototypicalNodeFactory factory = new PrototypicalNodeFactory (); factory.registerTag (new MyLinkTag ()); parser.setNodeFactory (factory); for (NodeIterator e = parser.elements (); e.hasMoreNodes (); ) e.nextNode (); // just parsing the nodes executes doSemanticAction } } <div id="actionbar" class="toolbar"> <hr class="printer" noshade="noshade" /> <p class="editdate">Last edited on Tuesday, January 13, 2004 5:39:34 pm. 
<hr class="toolbar" noshade="noshade" /> </body></html> --- NEW FILE: CustomVisitorLinks.html --- <html><head><title>Custom Visitor Links</title></head><body> <div class="wikitext"> <p><b>Using a Custom Visitor to Extract Links <p>Creating a custom visitor is more powerful than just the processing of links demonstrated here: <pre> import org.htmlparser.Parser; import org.htmlparser.RemarkNode; import org.htmlparser.StringNode; import org.htmlparser.tags.LinkTag; import org.htmlparser.tags.Tag; import org.htmlparser.util.ParserException; import org.htmlparser.visitors.NodeVisitor; class MyCustomizedVisitor extends NodeVisitor { public MyCustomizedVisitor () { super (true); // recurse into children } public void visitTag (Tag tag) { // process tags here if (tag instanceof LinkTag) { LinkTag linkTag = (LinkTag)tag; System.out.print ("\"" + linkTag.getLinkText () + "\" => "); System.out.println (linkTag.getLink ()); } } public void visitStringNode (StringNode stringNode) { // process text in the page here } public void visitEndTag (Tag endTag) { // process end tags here, // checking for end tags can be useful when performing // more involved page processing } public void visitRemarkNode (RemarkNode remarkNode) { // process remark nodes here } } public class LinkDemo { public static void main (String[] args) throws ParserException { Parser parser = new Parser ("http://urlIWantToParse.com"); MyCustomizedVisitor visitor = new MyCustomizedVisitor (); parser.visitAllNodesWith (visitor); } } <div id="actionbar" class="toolbar"> <hr class="printer" noshade="noshade" /> <p class="editdate">Last edited on Wednesday, January 7, 2004 5:24:34 pm. 
<hr class="toolbar" noshade="noshade" /> </body></html> --- NEW FILE: FilterLinks.html --- <html><head><title>Filter Links</title></head><body> <div class="wikitext"> <p><b>Using a NodeFilter to Extract Links <p>The filter capability is much more powerful than the simple link extraction illustrated here: <pre> import org.htmlparser.NodeFilter; import org.htmlparser.Parser; import org.htmlparser.filters.NodeClassFilter; import org.htmlparser.tags.LinkTag; import org.htmlparser.util.NodeIterator; import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; public class LinkDemo { public static void main (String[] args) throws ParserException { Parser parser = new Parser ("http://urlIWantToParse.com"); NodeFilter filter = new NodeClassFilter (LinkTag.class); NodeList links = new NodeList (); for (NodeIterator e = parser.elements (); e.hasMoreNodes (); ) e.nextNode ().collectInto (links, filter); for (int i = 0; i < links.size (); i++) { LinkTag linkTag = (LinkTag)links.elementAt (i); System.out.print ("\"" + linkTag.getLinkText () + "\" => "); System.out.println (linkTag.getLink ()); } } } <p>In fact, this is so useful that there is a convenience method to apply a NodeClassFilter directly from the parser: <pre> import org.htmlparser.Parser; import org.htmlparser.util.ParserException; import org.htmlparser.Node; import org.htmlparser.tags.LinkTag; public class LinkDemo { public static void main (String[] args) throws ParserException { Parser parser = new Parser ("http://urlIWantToParse.com"); Node [] links = parser.extractAllNodesThatAre (LinkTag.class); for (int i = 0; i < links.length; i++) { LinkTag linkTag = (LinkTag)links[i]; System.out.print ("\"" + linkTag.getLinkText () + "\" => "); System.out.println (linkTag.getLink ()); } } } <div id="actionbar" class="toolbar"> <hr class="printer" noshade="noshade" /> <p class="editdate">Last edited on Wednesday, January 7, 2004 4:48:39 pm. 
<hr class="toolbar" noshade="noshade" /> </body></html> --- NEW FILE: LexerLinks.html --- <html><head><title>Lexer Links</title></head><body> <div class="wikitext"> <p><b>Using a Lexer to Extract Links <p>If you are after raw link text only, then you can use a Lexer to access the links: <pre> import java.io.IOException; import java.net.URL; import java.net.URLConnection; import org.htmlparser.Node; import org.htmlparser.lexer.Lexer; import org.htmlparser.lexer.nodes.TagNode; import org.htmlparser.util.ParserException; public class LinkDemo { public static void main (String[] args) throws ParserException, IOException { Node node; URL url = new URL ("http://urlIWantToParse.com"); URLConnection connection = url.openConnection (); Lexer lexer = new Lexer (connection); while (null != (node = lexer.nextNode ())) if (node instanceof TagNode) { TagNode tag = (TagNode)node; if (tag.getTagName ().equals ("A") && !tag.isEndTag ()) { String href = tag.getAttribute ("href"); if (null != href) System.out.println (href); } } } } <div id="actionbar" class="toolbar"> <hr class="printer" noshade="noshade" /> <p class="editdate">Last edited on Thursday, January 8, 2004 4:06:57 am. <hr class="toolbar" noshade="noshade" /> </body></html> --- NEW FILE: LinkBeanLinks.html --- <html><head><title>Link Bean Links</title></head><body> <div class="wikitext"> <p><b>Using a LinkBean to Extract Links <p>A LinkBean is a pretty easy way to get just the links: <pre> import java.net.URL; import org.htmlparser.beans.LinkBean; public class LinkDemo { public static void main (String[] args) { LinkBean lb = new LinkBean (); lb.setURL ("http://urlIWantToParse.com"); URL[] urls = lb.getLinks (); for (int i = 0; i < urls.length; i++) System.out.println (urls[i]); } } <div id="actionbar" class="toolbar"> <hr class="printer" noshade="noshade" /> <p class="editdate">Last edited on Wednesday, January 7, 2004 4:10:21 pm. 
<hr class="toolbar" noshade="noshade" /> </body></html> --- NEW FILE: VisitorLinks.html --- <html><head><title>Visitor Links</title></head><body> <div class="wikitext"> <p><b>Using an ObjectFindingVisitor to Extract Links <p>A visitor visits all links, and an ObjectFindingVisitor is designed to find one specific class of nodes, in this case LinkTag tags: <pre> import org.htmlparser.Node; import org.htmlparser.Parser; import org.htmlparser.tags.LinkTag; import org.htmlparser.util.ParserException; import org.htmlparser.visitors.ObjectFindingVisitor; public class LinkDemo { public static void main (String[] args) throws ParserException { Parser parser = new Parser ("http://urlIWantToParse.com"); ObjectFindingVisitor visitor = new ObjectFindingVisitor (LinkTag.class); parser.visitAllNodesWith (visitor); Node[] links = visitor.getTags (); for (int i = 0; i < links.length; i++) { LinkTag linkTag = (LinkTag)links[i]; System.out.print ("\"" + linkTag.getLinkText () + "\" => "); System.out.println (linkTag.getLink ()); } } } <div id="actionbar" class="toolbar"> <hr class="printer" noshade="noshade" /> <p class="editdate">Last edited on Wednesday, January 7, 2004 4:09:50 pm. <hr class="toolbar" noshade="noshade" /> </body></html> Index: CustomTagExtraction.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/CustomTagExtraction.html,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** CustomTagExtraction.html 9 Nov 2003 17:07:07 -0000 1.7 --- CustomTagExtraction.html 26 Jan 2004 01:02:09 -0000 1.8 *************** *** 6,25 **** <p><b>Custom Tag Extraction ! <p>Custom tag extraction is easy. Simply create an array of tag names that you want to extract from a page, and pass it in to <a href="TagFindingVisitor.html" class="wiki">TagFindingVisitor</a>, like so : <pre> ! Parser parser = new Parser(..); ! String [] tagsToBeFound = {"P","BR","MYTAG"}; ! 
TagFindingVisitor visitor = new TagFindingVisitor(tagsToBeFound); ! parser.visitAllNodesWith(visitor); ! // First tag specified in search ! Node [] allPTags = visitor.getTags(0); ! // Second tag specified in search ! Node [] allBRTags = visitor.getTags(1); ! // Third tag specified in search ! Node [] allMyTags = visitor.getTags(2); ! <p>--<a href="SomikRaha.html" class="wiki">SomikRaha</a> ! // Just a test of wiki --- 6,33 ---- <p><b>Custom Tag Extraction ! <p>Custom tag extraction is easy. Simply create an array of tag names that you want to extract from a page, and pass it in to a TagFindingVisitor, like so: <pre> ! import org.htmlparser.Node; ! import org.htmlparser.Parser; ! import org.htmlparser.util.ParserException; ! import org.htmlparser.visitors.TagFindingVisitor; ! public class CustomTagDemo ! { ! public static void main (String[] args) throws ParserException ! { ! Parser parser = new Parser ("http://urlIWantToParse.com"); ! String [] tagsToBeFound = {"P","BR","MYTAG"}; ! TagFindingVisitor visitor = new TagFindingVisitor (tagsToBeFound); ! parser.visitAllNodesWith (visitor); ! // First tag specified in search ! Node [] allPTags = visitor.getTags(0); ! // Second tag specified in search ! Node [] allBRTags = visitor.getTags(1); ! // Third tag specified in search ! Node [] allMyTags = visitor.getTags(2); ! } ! } *************** *** 29,33 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, April 2, 2003 1:38:24 pm. <hr class="toolbar" noshade="noshade" /> --- 37,41 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, January 7, 2004 6:22:39 pm. 
<hr class="toolbar" noshade="noshade" /> Index: EmailExtraction.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/EmailExtraction.html,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** EmailExtraction.html 9 Nov 2003 17:07:07 -0000 1.5 --- EmailExtraction.html 26 Jan 2004 01:02:09 -0000 1.6 *************** *** 6,24 **** <p><b>Email Extraction ! <p>This is very similar to link extraction. You have to extract links from a page and verify that they are email addresses. Link tags have a method - <i>isMailLink() <pre> ! Parser parser = new Parser(..); ! parser.registerScanners(); ! Node links [] = parser.extractAllNodesThatAre(LinkTag.class); ! for (int i=0;i<links.length;i++) { ! LinkTag linkTag = links[i]; ! if (linkTag[i].isMailLink()) { ! // Yes, its an email id ! System.out.println("Email address: "+linkTag.getLink()); ! } ! } ! <p>--<a href="SomikRaha.html" class="wiki">SomikRaha</a>, February 16, 2003 11:41 am --- 6,48 ---- <p><b>Email Extraction ! <p>This is very similar to <a href="LinkExtraction.html" class="named-wiki" title="LinkExtraction">link extraction</a>. You have to extract links from a page and verify that they are email addresses. Link tags have a method - <i>isMailLink() to check if the HREF starts with "mailto:". Using an inner class in the NodeFilter example: <pre> ! import org.htmlparser.Node; ! import org.htmlparser.NodeFilter; ! import org.htmlparser.Parser; ! import org.htmlparser.tags.LinkTag; ! import org.htmlparser.util.NodeIterator; ! import org.htmlparser.util.NodeList; ! import org.htmlparser.util.ParserException; ! public class EmailLinkDemo ! { ! public static void main (String[] args) throws ParserException ! { ! Parser parser = new Parser ("http://urlIWantToParse.com"); ! NodeFilter filter = new NodeFilter () ! { ! /** ! * Accept nodes that are mail links. ! * @param node The node to check. ! */ ! public boolean accept (Node node) ! 
{ ! return (LinkTag.class.isAssignableFrom (node.getClass ()) ! && ((LinkTag)node).isMailLink ()); ! } ! }; ! NodeList links = new NodeList (); ! for (NodeIterator e = parser.elements (); e.hasMoreNodes (); ) ! e.nextNode ().collectInto (links, filter); ! for (int i = 0; i < links.size (); i++) ! { ! LinkTag linkTag = (LinkTag)links.elementAt (i); ! System.out.print ("\"" + linkTag.getLinkText () + "\" => "); ! System.out.println (linkTag.getLink ()); ! } ! } ! } *************** *** 28,32 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Sunday, February 23, 2003 5:24:25 pm. <hr class="toolbar" noshade="noshade" /> --- 52,56 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, January 7, 2004 5:26:12 pm. <hr class="toolbar" noshade="noshade" /> Index: FactoryMethod.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/FactoryMethod.html,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** FactoryMethod.html 9 Nov 2003 17:07:07 -0000 1.6 --- FactoryMethod.html 26 Jan 2004 01:02:09 -0000 1.7 *************** *** 6,10 **** <p><b>Factory Method ! <p><i><a href="TagScanner.html" class="wiki">TagScanner</a> possess an FM for the creation of a tag. <pre> --- 6,10 ---- <p><b>Factory Method ! <p><i><span class="wikiunknown"><u>TagScanner possess an FM for the creation of a tag. <pre> Index: ImageExtraction.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/ImageExtraction.html,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** ImageExtraction.html 9 Nov 2003 17:07:07 -0000 1.6 --- ImageExtraction.html 26 Jan 2004 01:02:09 -0000 1.7 *************** *** 4,47 **** <div class="wikitext"> ! <p><b>Image Extractions ! ! <p>This is very similar to <a href="LinkExtraction.html" class="wiki">LinkExtraction</a>. ! ! <p>1. 
Use the <i><span class="wikiunknown"><u>ObjectFindingVisitor like so : ! ! <pre> ! Parser parser = new Parser("http://urlIWantToParse.com"); ! // Create a visitor, specify that you want to recurse through its children ! // Recursion is needed only if you register all scanners, and a link tag could be embedded ! // within a form tag. But if you register only the link scanner, you don't need recursion. ! ObjectFindingVisitor visitor = ! new ObjectFindingVisitor(ImageTag.class,true); ! ! parser.registerScanners(); ! ! // Instead of registering all scanners, ! // you could also do - parser.addScanner(new ImageScanner("")); ! parser.visitAllNodesWith(visitor); ! Node [] images = visitor.getTags(); ! for (int i=0;i<images.length;i++) { ! ImageTag imageTag = (ImageTag)images[i]; ! System.out.println(imageTag.getImageLocation()); ! } ! <p>2: Use <i>extractAllNodesThatAre() <pre> ! Parser parser = new Parser("http://urlIWantToParse.com"); ! parser.registerScanners(); ! // Instead of registering all scanners, ! // you could also do - parser.addScanner(new ImageScanner("")); ! ! Node [] images = parser.extractAllNodesThatAre(ImageTag.class); ! for (int i=0;i<images.length;i++) { ! ImageTag imageTag = (ImageTag)images[i]; ! System.out.println(imageTag.getImageLocation()); ! } ! <p>--<a href="SomikRaha.html" class="wiki">SomikRaha</a>, Sunday, February 16, 2003 2:02:18 pm. --- 4,30 ---- <div class="wikitext"> ! <p><b>Image Extraction ! <p>This is very similar to <a href="LinkExtraction.html" class="named-wiki" title="LinkExtraction">link extraction</a>. Instead of looking for LinkTag nodes you look for ImageTag nodes: <pre> ! import org.htmlparser.Parser; ! import org.htmlparser.util.ParserException; ! import org.htmlparser.Node; ! import org.htmlparser.tags.ImageTag; ! public class ImageDemo ! { ! public static void main (String[] args) throws ParserException ! { ! Parser parser = new Parser ("http://urlIWantToParse.com"); ! 
Node [] images = parser.extractAllNodesThatAre (ImageTag.class); ! for (int i = 0; i < images.length; i++) ! { ! ImageTag imageTag = (ImageTag)images[i]; ! System.out.println (imageTag.getImageURL ()); ! } ! } ! } *************** *** 51,55 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, June 25, 2003 9:11:46 am. <hr class="toolbar" noshade="noshade" /> --- 34,38 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, January 7, 2004 5:33:01 pm. <hr class="toolbar" noshade="noshade" /> Index: LinkExtraction.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/LinkExtraction.html,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** LinkExtraction.html 9 Nov 2003 17:07:07 -0000 1.5 --- LinkExtraction.html 26 Jan 2004 01:02:09 -0000 1.6 *************** *** 8,101 **** <p>There are many ways of extracting links. ! <p>1. Use the <span class="wikiunknown"><u>ObjectFindingVisitor to extract links, like so: ! ! <pre> ! Parser parser = new Parser("http://urlIWantToParse.com"); ! // Create a visitor, specify that you want to recurse through its children ! // Recursion is needed only if you register all scanners, and a link tag could be embedded ! // within a form tag. But if you register only the link scanner, you don't need recursion. ! ObjectFindingVisitor visitor = ! new ObjectFindingVisitor(LinkTag.class,true); ! ! parser.registerScanners(); ! ! // Instead of registering all scanners, ! // you could also do - parser.addScanner(new LinkScanner("")); ! parser.visitAllNodesWith(visitor); ! Node [] links = visitor.getTags(); ! for (int i=0;i<links.length;i++) { ! LinkTag linkTag = (LinkTag)links[i]; ! System.out.println(linkTag.getLink()); ! System.out.println(linkTag.getLinkText()); ! } ! ! <p>2. Use the parser utility method - extractAllNodesThatAre(). ! ! <pre> ! 
Parser parser = new Parser("http://urlIWantToParse.com"); ! parser.registerScanners(); ! Node [] links = parser.extractAllNodesThatAre(LinkTag.class); ! // Instead of registering all scanners, ! // you could also do - parser.addScanner(new LinkScanner("")); ! for (int i=0;i<links.length;i++) { ! LinkTag linkTag = (LinkTag)links[i]; ! System.out.println(linkTag.getLink()); ! System.out.println(linkTag.getLinkText()); ! } ! ! <p>3. It is possible that you are interested in extracting more than just links. In order to customize extraction, write your own visitor. Extend the Visitor class (in the package org.htmlparser.visitors - Parser v1.3 upwards) like so : ! ! <pre> ! public class MyCustomizedVisitor extends Visitor { ! public MyCustomizedVisitor(Parser parser) { ! super(true); /// Its usually a good idea to perform recursion ! // Add the scanners you want. ! // This decouples your application from having to know which scanners are required ! parser.addScanner(new LinkScanner("")); ! parser.addScanner(new ImageScanner("")); ! // or add all scanners with registerScanners() ! } ! ! public void visitTag(Tag tag) { ! // Collect any tags you want ! // You can also do type checking like so: ! if (tag instanceof MetaTag) { ! // This tag is a meta tag ! MetaTag metaTag = (MetaTag)tag; ! } ! } ! ! public void visitStringNode(StringNode stringNode) { ! // Collect text in the page here ! } ! ! public void visitLinkTag(LinkTag linkTag) { ! // Collect links here ! } ! public void visitImageTag(ImageTag imageTag) { ! // Collect images here ! } ! public void visitEndTag(EndTag endTag) { ! // Checking for end tags can be useful when performing more involved ! // searches in a page ! } ! public void visitRemarkNode(RemarkNode remarkNode) { ! // Collect remark nodes here ! } ! // Add getters to get the data you have collected.. ! } ! In your app.. ! Parser parser = new Parser(...); ! MyCustomizedVisitor visitor = new MyCustomizedVisitor(parser); ! 
parser.visitAllNodesWith(visitor); ! // You can now get the data from the visitor interface. - <p>--<a href="SomikRaha.html" class="wiki">SomikRaha</a> --- 8,25 ---- <p>There are many ways of extracting links. ! <ul> ! <li><a href="VisitorLinks.html" class="named-wiki" title="VisitorLinks">Use an ObjectFindingVisitor</a> ! <li><a href="CustomVisitorLinks.html" class="named-wiki" title="CustomVisitorLinks">Use a custom Visitor</a> ! <li><a href="LinkBeanLinks.html" class="named-wiki" title="LinkBeanLinks">Use a LinkBean</a> ! <li><a href="CustomTagLinks.html" class="named-wiki" title="CustomTagLinks">Use a custom Tag</a> ! <li><a href="FilterLinks.html" class="named-wiki" title="FilterLinks">Use a NodeFilter</a> + <li><a href="LexerLinks.html" class="named-wiki" title="LexerLinks">Use a low level Lexer</a> *************** *** 105,109 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Tuesday, September 2, 2003 1:59:15 pm. <hr class="toolbar" noshade="noshade" /> --- 29,33 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, January 7, 2004 5:22:23 pm. <hr class="toolbar" noshade="noshade" /> Index: PostOperation.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/PostOperation.html,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** PostOperation.html 9 Nov 2003 17:07:07 -0000 1.4 --- PostOperation.html 26 Jan 2004 01:02:09 -0000 1.5 *************** *** 31,35 **** // ... do parser operations ! 
<p><a href="images/Zip.java" class="namedurl"><span style="white-space: nowrap"><img src="/docs/themes/MacOSX/images/http.png" alt="http" class="linkicon" border="0" />Source</span> Code.</a><a href="images/Zip.java" class="namedurl"><span style="white-space: nowrap">Source</span> Code.</a> <a href="images/Zip.html" class="namedurl"><span style="white-space: nowrap"><img src="/docs/themes/MacOSX/images/http.png" alt="http" class="linkicon" border="0" />Pretty</span> Print Source Code</a><a href="images/Zip.html" class="namedurl"><span style="white-space: nowrap">Pretty</span> Print Source Code</a> <pre> --- 31,35 ---- // ... do parser operations ! <p><a href="images/Zip.java" class="namedurl"><span style="white-space: nowrap"><img src="/wiki/themes/MacOSX/images/http.png" alt="http" class="linkicon" border="0" />Source</span> Code.</a><a href="images/Zip.java" class="namedurl"><span style="white-space: nowrap">Source</span> Code.</a> <a href="images/Zip.html" class="namedurl"><span style="white-space: nowrap"><img src="/wiki/themes/MacOSX/images/http.png" alt="http" class="linkicon" border="0" />Pretty</span> Print Source Code</a><a href="images/Zip.html" class="namedurl"><span style="white-space: nowrap">Pretty</span> Print Source Code</a> <pre> Index: ReverseHtml.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/ReverseHtml.html,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** ReverseHtml.html 9 Nov 2003 17:07:07 -0000 1.5 --- ReverseHtml.html 26 Jan 2004 01:02:09 -0000 1.6 *************** *** 6,41 **** <p><b>Reverse Html Rendering ! <p>In order to get back the html representation of a web page, you may use toHTML() recursively. Here's one way to get it: <pre> ! Parser parser = new Parser(..); ! parser.registerScanners(); ! StringBuffer htmlBuffer = new StringBuffer(); ! for (NodeIterator i = parser.elements();i.hasMoreNodes();) { ! 
htmlBuffer.append(i.nextNode().toHTML()); ! } ! System.out.println("reverse html rendered after parse : "+htmlBuffer.toString()); ! <p>This usually goes through child nodes of composite tags (like links, forms, etc..) ! <p>Often, it might be desired to modify the html being reconstructed. In such a case, you must change the tag's attributes prior to calling toHTML(). ! <p>e.g. if the tag in question is a link tag, and you wish to modify the href, do this : <pre> ! linkTag.setAttribute("SRC",newUrlString); ! doSomethingWith(linkTag.toHTML()); ! <p><i>toHtml() is basically a reconstruction of the tag using its attributes (at the atomic level) and its children (at the macro/composite level). ! <p>You can also change the name of the tag by setting its <i>TAGNAME attribute, like so: <pre> ! tag.setAttribute(Tag.TAGNAME,newTagName); ! <p>This should enable you to perform any transformations on the html. ! Take a look at another way of modifying tags in <a href="WebRipper.html" class="wiki">WebRipper</a>. ! <p>--<a href="SomikRaha.html" class="wiki">SomikRaha</a> --- 6,62 ---- <p><b>Reverse Html Rendering ! <p>In order to get back the html representation of a web page, you may use toHtml() recursively. Here's one way to get it: <pre> ! import org.htmlparser.Parser; ! import org.htmlparser.util.NodeIterator; ! import org.htmlparser.util.ParserException; ! public class ToHtmlDemo ! { ! public static void main (String[] args) throws ParserException ! { ! Parser parser = new Parser ("http://urlIWantToParse.com"); ! StringBuffer html = new StringBuffer (4096); ! for (NodeIterator i = parser.elements();i.hasMoreNodes();) ! html.append (i.nextNode().toHtml ()); ! System.out.println (html); ! } ! } ! <p>Often, it might be desired to modify the html being reconstructed. In such a case, you must change the tag's attributes prior to calling toHtml(). ! For example, if the tag in question is a link tag, and you wish to modify the href, do this: ! <pre> ! 
linkTag.setLink ("http://newUrlString"); ! linkTag.toHtml (); ! ! <p>This is equivalent to: <pre> ! linkTag.setAttribute ("href", "http://newUrlString"); ! linkTag.toHtml (); ! <p>This latter would work on any tag, but few other tags have an HREF attribute according to the <a href="http://www.w3.org/TR/html4/" class="namedurl"><span style="white-space: nowrap">HTML</span> specification</a>. ! The <i>toHtml() method applies to all nodes, not just tags. For tags it is basically a reconstruction of the tag using its attributes (at the atomic level) and its children (at the macro/composite level). ! <p>You can also change the name of the tag like so: <pre> ! tag.setTagName (newTagName); ! <p>and there are numerous ways to add, remove or change the attributes of a tag. For example, to add or change the ID attribute to "EditArea" use: ! <pre> ! tag.setAttribute ("id", "EditArea", '"'); ! ! <p>Whole tags can be added and removed from the list of children held by each tag. For example, to add a <P> tag at the same level as another tag: ! ! <pre> ! newTag = new Tag (); ! newTag.setTagName ("P"); ! tag.getParent ().getChildren ().add (newTag); ! ! <p>Be careful, getChildren () may return null for an arbitrary tag. *************** *** 45,49 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Sunday, February 23, 2003 5:34:12 pm. <hr class="toolbar" noshade="noshade" /> --- 66,70 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, January 7, 2004 6:14:37 pm. 
<hr class="toolbar" noshade="noshade" /> Index: SamplePrograms.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/SamplePrograms.html,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** SamplePrograms.html 9 Nov 2003 17:07:07 -0000 1.6 --- SamplePrograms.html 26 Jan 2004 01:02:09 -0000 1.7 *************** *** 10,14 **** <li><a href="StringExtraction.html" class="wiki">StringExtraction</a> ! <li><a href="LinkExtraction.html" class="wiki">LinkExtraction</a> (includes example of customized parsing with HTMLVisitor) <li><a href="EmailExtraction.html" class="wiki">EmailExtraction</a> --- 10,14 ---- <li><a href="StringExtraction.html" class="wiki">StringExtraction</a> ! <li><a href="LinkExtraction.html" class="wiki">LinkExtraction</a> <li><a href="EmailExtraction.html" class="wiki">EmailExtraction</a> *************** *** 16,23 **** <li><a href="ImageExtraction.html" class="wiki">ImageExtraction</a> - <li><a href="WebCrawler.html" class="wiki">WebCrawler</a> - - <li><a href="WebRipper.html" class="wiki">WebRipper</a> - <li><a href="ReverseHtml.html" class="named-wiki" title="ReverseHtml">ReverseHtml rendering</a> --- 16,19 ---- *************** *** 26,29 **** --- 22,29 ---- <li><a href="JavaBeans.html" class="wiki">JavaBeans</a> + <li><a href="WebCrawler.html" class="wiki">WebCrawler</a> - ignore this, it's old + + <li><a href="WebRipper.html" class="wiki">WebRipper</a> - ignore this, it's old + *************** *** 33,37 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Thursday, April 24, 2003 4:45:21 am. <hr class="toolbar" noshade="noshade" /> --- 33,37 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Wednesday, January 7, 2004 6:12:30 pm. 
<hr class="toolbar" noshade="noshade" /> Index: SearchingForData.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/SearchingForData.html,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** SearchingForData.html 26 Oct 2003 19:46:17 -0000 1.3 --- SearchingForData.html 26 Jan 2004 01:02:09 -0000 1.4 *************** *** 29,33 **** <pre> - parser.registerScanners(); Node nodes [] = parser.extractAllNodesThatAre(TableTag.class); // Get the first table found --- 29,32 ---- *************** *** 73,77 **** <pre> - parser.registerScanners(); Node nodes [] = parser.extractAllNodesThatAre(TableTag.class); // Get the first table found --- 72,75 ---- *************** *** 103,107 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Saturday, April 19, 2003 10:38:30 pm. <hr class="toolbar" noshade="noshade" /> --- 101,105 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Thursday, January 8, 2004 4:15:12 am. <hr class="toolbar" noshade="noshade" /> Index: StringExtraction.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/StringExtraction.html,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** StringExtraction.html 9 Nov 2003 17:07:07 -0000 1.6 --- StringExtraction.html 26 Jan 2004 01:02:09 -0000 1.7 *************** *** 6,24 **** <p><b>String Extraction ! <p>To get all the text content from a web page, use the <a href="TextExtractingVisitor.html" class="wiki">TextExtractingVisitor</a>, like so : <pre> ! Parser parser = new Parser("http://pageIwantToParse.com"); ! TextExtractingVisitor visitor = new TextExtractingVisitor(); ! parser.visitAllNodesWith(visitor); ! System.out.println(visitor.getExtractedText()); ! <p>If you want to strip all escape characters, do: <pre> ! String cleanText = ! 
ParserUtils.removeEscapeCharacters( ! visitor.getExtractedText() ! ); --- 6,42 ---- <p><b>String Extraction ! <p>To get all the text content from a web page, use the TextExtractingVisitor, like so: <pre> ! import org.htmlparser.Parser; ! import org.htmlparser.util.ParserException; ! import org.htmlparser.visitors.TextExtractingVisitor; ! public class StringDemo ! { ! public static void main (String[] args) throws ParserException ! { ! Parser parser = new Parser ("http://pageIwantToParse.com"); ! TextExtractingVisitor visitor = new TextExtractingVisitor (); ! parser.visitAllNodesWith (visitor); ! System.out.println (visitor.getExtractedText()); ! } ! } ! <p>If you want a more browser like behaviour, use the StringBean like so: <pre> ! import org.htmlparser.beans.StringBean; ! public class StringDemo ! { ! public static void main (String[] args) ! { ! StringBean sb = new StringBean (); ! sb.setLinks (false); ! sb.setReplaceNonBreakingSpaces (true); ! sb.setCollapse (true); ! sb.setURL ("http://pageIwantToParse.com"); ! System.out.println (sb.getStrings ()); ! } ! } *************** *** 28,32 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Sunday, February 23, 2003 5:20:23 pm. <hr class="toolbar" noshade="noshade" /> --- 46,50 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Tuesday, January 6, 2004 6:36:18 pm. <hr class="toolbar" noshade="noshade" /> Index: TemplateMethod.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/TemplateMethod.html,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** TemplateMethod.html 9 Nov 2003 17:07:07 -0000 1.6 --- TemplateMethod.html 26 Jan 2004 01:02:09 -0000 1.7 *************** *** 6,10 **** <p><b>Template Method ! 
<p><i><a href="TagScanner.html" class="wiki">TagScanner</a> uses a template method to create a scanned node - it calls a matching tag scanner to do its job and produce a scanned node in a series of steps. <pre> --- 6,10 ---- <p><b>Template Method ! <p><i><span class="wikiunknown"><u>TagScanner uses a template method to create a scanned node - it calls a matching tag scanner to do its job and produce a scanned node in a series of steps. <pre> Index: WritingYourOwnScanners.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/WritingYourOwnScanners.html,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** WritingYourOwnScanners.html 9 Nov 2003 17:07:07 -0000 1.7 --- WritingYourOwnScanners.html 26 Jan 2004 01:02:09 -0000 1.8 *************** *** 5,14 **** <div class="wikitext"> <p><b>Writing Your Own Scanners ! ! <p>There are two types of scanners, depending on the type of tags that you wish to parse: <ul> ! <li><a href="TagScanner.html" class="wiki">TagScanner</a> - for parsing tags that have no child elements <li>CompositeTagScanner - for parsing tags with children --- 5,14 ---- <div class="wikitext"> <p><b>Writing Your Own Scanners ! <b>Warning: this is out of date and needs to be completely rewritten ! There are two types of scanners, depending on the type of tags that you wish to parse: <ul> ! <li>TagScanner - for parsing tags that have no child elements <li>CompositeTagScanner - for parsing tags with children *************** *** 29,33 **** <br /> <br /> ! 3. If a match was found, call the scan() method. For both <a href="TagScanner.html" class="wiki">TagScanner</a> and CompositeTagScanner, overriding this method is optional, and NOT recommended for standard cases. The default scan() methods will make a call to createTag. <br /> <br /> --- 29,33 ---- <br /> <br /> ! 3. If a match was found, call the scan() method. 
For both <span class="wikiunknown"><u>TagScanner and CompositeTagScanner, overriding this method is optional, and NOT recommended for standard cases. The default scan() methods will make a call to createTag. <br /> <br /> *************** *** 109,113 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Thursday, May 1, 2003 6:54:01 pm. <hr class="toolbar" noshade="noshade" /> --- 109,113 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Thursday, January 8, 2004 4:13:18 am. <hr class="toolbar" noshade="noshade" /> Index: index.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/docs/index.html,v retrieving revision 1.11 retrieving revision 1.12 diff -C2 -d -r1.11 -r1.12 *** index.html 2 Jan 2004 16:24:52 -0000 1.11 --- index.html 26 Jan 2004 01:02:09 -0000 1.12 *************** *** 6,9 **** --- 6,11 ---- <p><b>HTMLParser documentation + <p><a href="http://htmlparser.sourceforge.net/wiki/" class="namedurl"><span style="white-space: nowrap">This</span> page has moved to http://htmlparser.sourceforge.net/wiki</a> + <p>Welcome to the HTMLParser documentation page. You may visit *************** *** 13,17 **** <li><a href="SamplePrograms.html" class="wiki">SamplePrograms</a> - A quick tutorial on getting started with the parser ! <li><a href="WritingYourOwnScanners.html" class="wiki">WritingYourOwnScanners</a> - Learn how to write your own scanners to extend the capability of the parser <li><a href="SearchingForData.html" class="wiki">SearchingForData</a> - Learn how to perform powerful searches in html pages --- 15,19 ---- <li><a href="SamplePrograms.html" class="wiki">SamplePrograms</a> - A quick tutorial on getting started with the parser ! 
<li><a href="WritingYourOwnScanners.html" class="wiki">WritingYourOwnScanners</a> - ignore this, this is old <li><a href="SearchingForData.html" class="wiki">SearchingForData</a> - Learn how to perform powerful searches in html pages *************** *** 29,43 **** <li><a href="TestDrivenDevelopment.html" class="wiki">TestDrivenDevelopment</a> ! <li><a href="ParsingXml.html" class="wiki">ParsingXml</a> ! ! <li><a href="UnitTestingXsl.html" class="wiki">UnitTestingXsl</a> ! ! <li><a href="UnitTestingPdf.html" class="wiki">UnitTestingPdf</a> ! <li><a href="http://htmlparser.sourceforge.net/javadoc/" class="namedurl"><span style="white-space: nowrap">Javadocs</span> for Version 1.2</a> <li><a CLASS="namedurl" HREF="../javadoc/index.html"><span STYLE="white-space: nowrap">Javadocs</span></a> ! <li><a href="Benchmarks.html" class="named-wiki" title="Benchmarks">Benchmarks vs. JTidy</a> --- 31,41 ---- <li><a href="TestDrivenDevelopment.html" class="wiki">TestDrivenDevelopment</a> ! <li><a href="Benchmarks.html" class="named-wiki" title="Benchmarks">Benchmarks vs. JTidy</a> ! <li><a href="http://htmlparser.sourceforge.net/javadoc/" class="namedurl"><span style="white-space: nowrap">Javadocs</span></a> <li><a CLASS="namedurl" HREF="../javadoc/index.html"><span STYLE="white-space: nowrap">Javadocs</span></a> ! <li><a href="http://htmlparser.sourceforge.net/javadoc_1_2/" class="namedurl"><span style="white-space: nowrap">Javadocs</span> for Version 1.2</a> *************** *** 48,52 **** <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Tuesday, November 25, 2003 4:50:49 am. <hr class="toolbar" noshade="noshade" /> --- 46,50 ---- <hr class="printer" noshade="noshade" /> ! <p class="editdate">Last edited on Thursday, January 8, 2004 4:14:03 am. <hr class="toolbar" noshade="noshade" /> |
From: <der...@pr...> - 2004-01-27 16:20:24
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3286/lexer Modified Files: Lexer.java Log Message: Fix bug #880283 Character ">" erroneously inserted by Lexer. Some jsp tags are now handled in a separate jsp parse in the lexer. Jsp tags embedded as attributes are still not handled. Refer to bug #772700 Jsp Tags are not parsed correctly when in quoted attributes, which is now reversed (i.e. in quotes are OK, outside of quotes causes problems), but this points out a deficiency in the data structure holding tag contents (attribute lists) that doesn't provide for tags within attributes. Index: Lexer.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Lexer.java,v retrieving revision 1.24 retrieving revision 1.25 diff -C2 -d -r1.24 -r1.25 *** Lexer.java 2 Jan 2004 16:24:53 -0000 1.24 --- Lexer.java 24 Jan 2004 17:13:43 -0000 1.25 *************** *** 267,270 **** --- 267,275 ---- if (0 == ch) ret = makeString (probe); + else if ('%' == ch) + { + probe.retreat (); + ret = parseJsp (probe); + } else if ('/' == ch || '%' == ch || Character.isLetter (ch)) { *************** *** 974,977 **** --- 979,1128 ---- } + /** + * Parse a java server page node. + * Scan characters until "%>" is encountered, or the input stream is + * exhausted, in which case <code>null</code> is returned. + * @param cursor The position at which to start scanning. 
+ */ + protected Node parseJsp (Cursor cursor) + throws + ParserException + { + boolean done; + char ch; + int state; + Vector attributes; + int code; + Node ret; + + done = false; + state = 0; + code = 0; + attributes = new Vector (); + // <%xyz%> + // 012223d + // <%=xyz%> + // 0122223d + // <%@xyz%d + // 0122223d + while (!done) + { + ch = mPage.getCharacter (cursor); + switch (state) + { + case 0: // prior to the percent + switch (ch) + { + case '%': // <% + state = 1; + break; + // case 0: // <\0 + // case '>': // <> + default: + done = true; + break; + } + break; + case 1: // prior to the optional qualifier + switch (ch) + { + case 0: // <%\0 + case '>': // <%> + done = true; + break; + case '=': // <%= + case '@': // <%@ + code = cursor.getPosition (); + attributes.addElement (new PageAttribute (mPage, mCursor.getPosition () + 1, code, -1, -1, (char)0)); + state = 2; + break; + default: // <%x + code = cursor.getPosition () - 1; + attributes.addElement (new PageAttribute (mPage, mCursor.getPosition () + 1, code, -1, -1, (char)0)); + state = 2; + break; + } + break; + case 2: // prior to the closing percent + switch (ch) + { + case 0: // <%x\0 + case '>': // <%x> + done = true; + break; + case '\'': + case '"':// <%???" 
+ state = ch; + break; + case '%': // <%???% + state = 3; + break; + default: // <%???x + break; + } + break; + case 3: + switch (ch) + { + case 0: // <%x??%\0 + done = true; + break; + case '>': + state = 4; + done = true; + break; + default: // <%???%x + state = 2; + break; + } + break; + case '"': + switch (ch) + { + case 0: // <%x??"\0 + done = true; + break; + case '"': + state = 2; + break; + default: // <%???'??x + break; + } + break; + case '\'': + switch (ch) + { + case 0: // <%x??'\0 + done = true; + break; + case '\'': + state = 2; + break; + default: // <%???"??x + break; + } + break; + default: + throw new IllegalStateException ("how the fuck did we get in state " + state); + } + } + + if (4 == state) // normal exit + { + if (0 != code) + { + state = cursor.getPosition () - 2; // reuse state + attributes.addElement (new PageAttribute (mPage, code, state, -1, -1, (char)0)); + attributes.addElement (new PageAttribute (mPage, state, state + 1, -1, -1, (char)0)); + } + else + throw new IllegalStateException ("jsp with no code!"); + } + else + return (parseString (cursor, true)); // hmmm, true? + + return (makeTag (cursor, attributes)); + } + // // NodeFactory interface |
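For readers tracing the parseJsp state machine in the diff above, the following is a minimal standalone sketch of the same idea, not the Lexer's actual code. It scans from "<%" to the closing "%>" and treats quoted runs as opaque, so a "%>" inside quotes does not end the tag. The class name JspScanSketch and the method findJspEnd are hypothetical, and the sketch works on a plain String rather than the Page and Cursor classes. It is also simplified: unlike the committed code, it does not give up when it meets a bare ">" before the closing percent sign.

```java
/** Hypothetical illustration only; not part of org.htmlparser. */
class JspScanSketch
{
    /**
     * Find the end of a JSP tag that begins at start.
     * @param text The text to scan.
     * @param start The index where "<%" is expected.
     * @return The index just past the closing "%>", or -1 if the text does
     * not begin with "<%" at start or the input runs out first.
     */
    static int findJspEnd (String text, int start)
    {
        if (!text.startsWith ("<%", start))
            return (-1);

        char quote = 0;          // 0 outside quotes, else the opening quote character
        boolean percent = false; // true if the previous unquoted character was '%'
        for (int i = start + 2; i < text.length (); i++)
        {
            char ch = text.charAt (i);
            if (0 != quote)
            {
                // inside quotes: only the matching quote character matters
                if (ch == quote)
                    quote = 0;
            }
            else if (percent && '>' == ch)
                return (i + 1); // found the unquoted "%>"
            else if ('\'' == ch || '"' == ch)
            {
                quote = ch;
                percent = false;
            }
            else
                percent = ('%' == ch);
        }

        return (-1); // input exhausted before "%>"
    }

    public static void main (String[] args)
    {
        String jsp = "<%= request.getParameter (\"a%>b\") %> trailing text";
        int end = findJspEnd (jsp, 0);
        System.out.println (jsp.substring (0, end));
    }
}
```

Keeping the quote character itself as the state mirrors the trick in the committed code, where the state variable is set to the quote character that opened the string.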
From: <der...@pr...> - 2004-01-27 14:31:36
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23128/filters Modified Files: HasChildFilter.java Added Files: HasParentFilter.java IsEqualFilter.java Log Message: Fix bug #882940 empty applet tag contents causes NullPointerException Also found and fixed other similar problems where getChildren() could return null. Then changed table row and column handling to handle rows and columns embedded within other tags. --- NEW FILE: HasParentFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2004 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasParentFilter.java,v $ // $Author: derrickoswald $ // $Date: 2004/01/24 23:57:49 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; import org.htmlparser.tags.Tag; import org.htmlparser.util.NodeList; /** * This class accepts all tags that have a parent acceptable to the filter. */ public class HasParentFilter implements NodeFilter { /** * The filter to apply to children. 
*/ public NodeFilter mFilter; /** * Creates a new instance of HasParentFilter that accepts tags with parent acceptable to the filter. * @param filter The filter to apply to the parent. */ public HasParentFilter (NodeFilter filter) { mFilter = filter; } /** * Accept tags with parent acceptable to the filter. * @param node The node to check. */ public boolean accept (Node node) { Node parent; NodeList children; boolean ret; ret = false; parent = node.getParent (); if (null != parent) ret = mFilter.accept (parent); return (ret); } } --- NEW FILE: IsEqualFilter.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2004 Derrick Oswald // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/IsEqualFilter.java,v $ // $Author: derrickoswald $ // $Date: 2004/01/24 23:57:50 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.filters; import org.htmlparser.Node; import org.htmlparser.NodeFilter; /** * This class accepts only one specific node. */ public class IsEqualFilter implements NodeFilter { /** * The node to match. */ public Node mNode; /** * Creates a new instance of an IsEqualFilter that accepts only the node provided. 
* @param node The node to match. */ public IsEqualFilter (Node node) { mNode = node; } /** * Accept the node. * @param node The node to check. * @return <code>false</code> unless <code>node</code> is the one and only. */ public boolean accept (Node node) { return (mNode == node); } } Index: HasChildFilter.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasChildFilter.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** HasChildFilter.java 8 Nov 2003 21:30:58 -0000 1.1 --- HasChildFilter.java 24 Jan 2004 23:57:48 -0000 1.2 *************** *** 69,78 **** tag = (CompositeTag)node; children = tag.getChildren (); ! for (int i = 0; i < children.size (); i++) ! if (mFilter.accept (children.elementAt (i))) ! { ! ret = true; ! break; ! } } --- 69,79 ---- tag = (CompositeTag)node; children = tag.getChildren (); ! if (null != children) ! for (int i = 0; i < children.size (); i++) ! if (mFilter.accept (children.elementAt (i))) ! { ! ret = true; ! break; ! } } |
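The null check added to HasChildFilter, and the parent test in HasParentFilter, can be illustrated without the library. The sketch below is only an assumption-laden stand-in: SketchNode and SketchFilter are hypothetical types invented for this example, not org.htmlparser.Node or NodeFilter. It shows a child filter that tolerates a null child list and a parent filter that tolerates a missing parent.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; these are not the org.htmlparser types. */
class FilterSketch
{
    /** Stand-in node whose child list stays null until a child is added. */
    static class SketchNode
    {
        final String name;
        SketchNode parent;
        List<SketchNode> children;

        SketchNode (String name)
        {
            this.name = name;
        }

        void add (SketchNode child)
        {
            if (null == children)
                children = new ArrayList<> ();
            children.add (child);
            child.parent = this;
        }
    }

    /** Stand-in for a node filter. */
    interface SketchFilter
    {
        boolean accept (SketchNode node);
    }

    /** Accept nodes with the given name. */
    static SketchFilter named (String name)
    {
        return (node -> name.equals (node.name));
    }

    /** Accept nodes having at least one child acceptable to the filter. */
    static SketchFilter hasChild (SketchFilter filter)
    {
        return (node ->
        {
            if (null == node.children) // the guard the bug fix adds
                return (false);
            for (SketchNode child : node.children)
                if (filter.accept (child))
                    return (true);
            return (false);
        });
    }

    /** Accept nodes whose parent is acceptable to the filter. */
    static SketchFilter hasParent (SketchFilter filter)
    {
        return (node -> (null != node.parent) && filter.accept (node.parent));
    }

    public static void main (String[] args)
    {
        SketchNode row = new SketchNode ("TR");
        SketchNode cell = new SketchNode ("TD");
        row.add (cell);
        SketchNode applet = new SketchNode ("APPLET"); // no children at all

        System.out.println (hasChild (named ("TD")).accept (row));
        System.out.println (hasChild (named ("TD")).accept (applet));
        System.out.println (hasParent (named ("TR")).accept (cell));
    }
}
```

Without the null guard, the empty APPLET node would raise a NullPointerException when its child list is iterated, which is the failure bug #882940 describes.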
From: <der...@pr...> - 2004-01-27 14:29:07
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26191/tests Modified Files: FunctionalTests.java ParserTest.java Log Message: Fix bug #883664 toUpperCase on tag names and attributes depends on locale Added locale information to all relevant toUpperCase() calls, with an English locale for tag names and attribute names, or developers choice of locale for methods that do uppercase conversion as part of their algorithms. Index: FunctionalTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/FunctionalTests.java,v retrieving revision 1.54 retrieving revision 1.55 diff -C2 -d -r1.54 -r1.55 *** FunctionalTests.java 14 Jan 2004 02:53:47 -0000 1.54 --- FunctionalTests.java 25 Jan 2004 21:33:12 -0000 1.55 *************** *** 29,32 **** --- 29,33 ---- import java.io.BufferedReader; import java.io.IOException; + import java.util.Locale; import junit.framework.TestSuite; *************** *** 105,109 **** if (line!=null) { // Check the line for image tags ! String newline = line.toUpperCase(); int fromIndex = -1; do { --- 106,110 ---- if (line!=null) { // Check the line for image tags ! String newline = line.toUpperCase (Locale.ENGLISH); int fromIndex = -1; do { Index: ParserTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/ParserTest.java,v retrieving revision 1.55 retrieving revision 1.56 diff -C2 -d -r1.55 -r1.56 *** ParserTest.java 24 Jan 2004 17:15:43 -0000 1.55 --- ParserTest.java 25 Jan 2004 21:33:12 -0000 1.56 *************** *** 36,39 **** --- 36,40 ---- import java.net.URL; import java.net.URLConnection; + import java.util.Locale; import org.htmlparser.AbstractNode; *************** *** 845,848 **** --- 846,850 ---- } } + /** * Test reproducing a java.lang.StackOverflowError. 
*************** *** 859,861 **** --- 861,886 ---- assertTrue ("bad toString()", -1 != output.indexOf (guts)); } + + /** + * See bug #883664 toUpperCase on tag names and attributes depends on locale + */ + public void testDifferentLocale () throws Exception + { + String html; + Locale original; + + html = "<title>This is supposedly Turkish.</title>"; + original = Locale.getDefault (); + try + { + Locale.setDefault (new Locale ("tr")); // turkish + createParser (html); + parseAndAssertNodeCount (1); + assertStringEquals ("html", html, node[0].toHtml ()); + } + finally + { + Locale.setDefault (original); + } + } } |
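The locale dependence behind bug #883664 can be reproduced with the standard library alone. The sketch below is an illustration, not code from the project: the class LocaleUpperCaseSketch and the helper tagKey are hypothetical names. It shows why an uppercased tag name stops matching under a Turkish locale, and how pinning the conversion to Locale.ENGLISH, as the commit does, avoids the problem.

```java
import java.util.Locale;

/** Hypothetical illustration only; not part of org.htmlparser. */
class LocaleUpperCaseSketch
{
    /** Uppercase a tag or attribute name independently of the default locale. */
    static String tagKey (String name)
    {
        return (name.toUpperCase (Locale.ENGLISH));
    }

    public static void main (String[] args)
    {
        Locale turkish = Locale.forLanguageTag ("tr");

        // Under Turkish rules a lowercase 'i' maps to U+0130,
        // LATIN CAPITAL LETTER I WITH DOT ABOVE, not to 'I'.
        String upper = "title".toUpperCase (turkish);
        System.out.println ("TITLE".equals (upper));            // prints false
        System.out.println ("TITLE".equals (tagKey ("title"))); // prints true
    }
}
```

The no-argument toUpperCase () uses the default locale, so the first comparison is what a lookup keyed on "TITLE" would see on a machine configured for Turkish.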
From: <der...@pr...> - 2004-01-27 14:09:14
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26191/filters Modified Files: HasAttributeFilter.java StringFilter.java TagNameFilter.java Log Message: Fix bug #883664 toUpperCase on tag names and attributes depends on locale Added locale information to all relevant toUpperCase() calls, with an English locale for tag names and attribute names, or developers choice of locale for methods that do uppercase conversion as part of their algorithms. Index: HasAttributeFilter.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/HasAttributeFilter.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** HasAttributeFilter.java 10 Jan 2004 00:06:03 -0000 1.2 --- HasAttributeFilter.java 25 Jan 2004 21:32:57 -0000 1.3 *************** *** 27,30 **** --- 27,32 ---- package org.htmlparser.filters; + import java.util.Locale; + import org.htmlparser.Node; import org.htmlparser.NodeFilter; *************** *** 63,67 **** public HasAttributeFilter (String attribute, String value) { ! mAttribute = attribute.toUpperCase (); mValue = value; } --- 65,69 ---- public HasAttributeFilter (String attribute, String value) { ! 
mAttribute = attribute.toUpperCase (Locale.ENGLISH); mValue = value; } Index: StringFilter.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/StringFilter.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** StringFilter.java 8 Nov 2003 21:30:58 -0000 1.1 --- StringFilter.java 25 Jan 2004 21:32:58 -0000 1.2 *************** *** 27,30 **** --- 27,32 ---- package org.htmlparser.filters; + import java.util.Locale; + import org.htmlparser.Node; import org.htmlparser.NodeFilter; *************** *** 47,50 **** --- 49,57 ---- /** + * The locale to use converting to uppercase in the case insensitive searches. + */ + protected Locale mLocale; + + /** * Creates a new instance of StringFilter that accepts string nodes containing a certain string. * The comparison is case insensitive. *************** *** 64,72 **** public StringFilter (String pattern, boolean case_sensitive) { mCaseSensitive = case_sensitive; if (mCaseSensitive) mPattern = pattern; else ! mPattern = pattern.toUpperCase (); } --- 71,93 ---- public StringFilter (String pattern, boolean case_sensitive) { + this (pattern, case_sensitive, null); + } + + /** + * Creates a new instance of StringFilter that accepts string nodes containing a certain string. + * @param pattern The pattern to search for. + * @param case_sensitive If <code>true</code>, comparisons are performed + * respecting case. + */ + public StringFilter (String pattern, boolean case_sensitive, Locale locale) + { mCaseSensitive = case_sensitive; if (mCaseSensitive) mPattern = pattern; else ! { ! mLocale = (null == locale) ? Locale.ENGLISH : locale; ! mPattern = pattern.toUpperCase (mLocale); ! } } *************** *** 85,89 **** string = ((StringNode)node).getText (); if (!mCaseSensitive) ! 
string = string.toUpperCase (); ret = -1 != string.indexOf (mPattern); } --- 106,110 ---- string = ((StringNode)node).getText (); if (!mCaseSensitive) ! string = string.toUpperCase (mLocale); ret = -1 != string.indexOf (mPattern); } Index: TagNameFilter.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/TagNameFilter.java,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** TagNameFilter.java 8 Nov 2003 21:30:58 -0000 1.1 --- TagNameFilter.java 25 Jan 2004 21:32:58 -0000 1.2 *************** *** 27,30 **** --- 27,32 ---- package org.htmlparser.filters; + import java.util.Locale; + import org.htmlparser.Node; import org.htmlparser.NodeFilter; *************** *** 49,53 **** public TagNameFilter (String name) { ! mName = name.toUpperCase (); } --- 51,55 ---- public TagNameFilter (String name) { ! mName = name.toUpperCase (Locale.ENGLISH); } |
From: <der...@pr...> - 2004-01-27 13:46:16
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11005 Modified Files: PrototypicalNodeFactory.java Log Message: Add TableHeaderTag submitted by Pim Schrama. Robustify TableRow against null getChildren(). Index: PrototypicalNodeFactory.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/PrototypicalNodeFactory.java,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** PrototypicalNodeFactory.java 14 Jan 2004 02:53:46 -0000 1.3 --- PrototypicalNodeFactory.java 24 Jan 2004 18:13:29 -0000 1.4 *************** *** 171,174 **** --- 171,175 ---- registerTag (new StyleTag ()); registerTag (new TableColumn ()); + registerTag (new TableHeader ()); registerTag (new TableRow ()); registerTag (new TableTag ()); |
From: <der...@pr...> - 2004-01-27 13:42:50
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11005/tags Modified Files: TableRow.java Added Files: TableHeader.java Log Message: Add TableHeaderTag submitted by Pim Schrama. Robustify TableRow against null getChildren(). --- NEW FILE: TableHeader.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2004 Pim Schrama // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableHeader.java,v $ // $Author: derrickoswald $ // $Date: 2004/01/24 18:12:57 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.tags; /** * A table header tag. */ public class TableHeader extends CompositeTag { /** * The set of names handled by this tag. */ private static final String[] mIds = new String[] {"TH"}; /** * The set of tag names that indicate the end of this tag. */ private static final String[] mEnders = new String[] {"TH", "TR"}; /** * The set of end tag names that indicate the end of this tag. */ private static final String[] mEndTagEnders = new String[] {"TR", "TABLE"}; /** * Create a new table header tag. 
*/ public TableHeader () { } /** * Return the set of names handled by this tag. * @return The names to be matched that create tags of this type. */ public String[] getIds () { return (mIds); } /** * Return the set of tag names that cause this tag to finish. * @return The names of following tags that stop further scanning. */ public String[] getEnders () { return (mIds); } /** * Return the set of end tag names that cause this tag to finish. * @return The names of following end tags that stop further scanning. */ public String[] getEndTagEnders () { return (mEndTagEnders); } } Index: TableRow.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableRow.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** TableRow.java 2 Jan 2004 16:24:55 -0000 1.38 --- TableRow.java 24 Jan 2004 18:12:41 -0000 1.39 *************** *** 81,87 **** * Get the number of columns in this row. */ ! public int getColumnCount() { ! return (getChildren ().searchFor (TableColumn.class).size ()); } --- 81,89 ---- * Get the number of columns in this row. */ ! public int getColumnCount () { ! return ( ! (null == getChildren ()) ? 0 : ! getChildren ().searchFor (TableColumn.class).size ()); } *************** *** 89,101 **** * Get the children (columns) of this row. */ ! public TableColumn [] getColumns() { NodeList list; ! ! list = getChildren ().searchFor (TableColumn.class); ! TableColumn [] columns = new TableColumn[list.size()]; ! list.copyToNodeArray (columns); ! ! return (columns); } } --- 91,150 ---- * Get the children (columns) of this row. */ ! public TableColumn [] getColumns () { NodeList list; ! TableColumn [] ret; ! ! if (null != getChildren ()) ! { ! list = getChildren ().searchFor (TableColumn.class); ! ret = new TableColumn[list.size ()]; ! list.copyToNodeArray (ret); ! } ! else ! ret = new TableColumn[0]; ! ! return (ret); ! } ! ! /** ! 
* Checks if this table has a header ! * @return <code>true</code> if there is a header tag. ! */ ! public boolean hasHeader () ! { ! return (0 != getHeaderCount ()); ! } ! ! /** ! * Get the number of headers in this row. ! * @return The count of header tags in this row. ! */ ! public int getHeaderCount () ! { ! return ( ! (null == getChildren ()) ? 0 : ! getChildren ().searchFor (TableHeader.class, false).size ()); ! } ! ! /** ! * Get the header of this table ! * @return Table header tags contained in this row. ! */ ! public TableHeader[] getHeader () ! { ! NodeList list; ! TableHeader [] ret; ! ! if (null != getChildren ()) ! { ! list = getChildren ().searchFor (TableHeader.class, false); ! ret = new TableHeader[list.size ()]; ! list.copyToNodeArray (ret); ! } ! else ! ret = new TableHeader[0]; ! ! return (ret); } } |
From: <der...@pr...> - 2004-01-27 13:34:45
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26191/lexer/nodes Modified Files: TagNode.java Log Message: Fix bug #883664 toUpperCase on tag names and attributes depends on locale Added locale information to all relevant toUpperCase() calls, with an English locale for tag names and attribute names, or developers choice of locale for methods that do uppercase conversion as part of their algorithms. Index: TagNode.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/nodes/TagNode.java,v retrieving revision 1.28 retrieving revision 1.29 diff -C2 -d -r1.28 -r1.29 *** TagNode.java 14 Jan 2004 02:53:46 -0000 1.28 --- TagNode.java 25 Jan 2004 21:32:59 -0000 1.29 *************** *** 29,35 **** import java.util.Enumeration; import java.util.Hashtable; import java.util.Vector; - import org.htmlparser.AbstractNode; import org.htmlparser.lexer.Cursor; import org.htmlparser.lexer.Lexer; --- 29,36 ---- import java.util.Enumeration; import java.util.Hashtable; + import java.util.Locale; import java.util.Vector; + import org.htmlparser.AbstractNode; import org.htmlparser.lexer.Cursor; import org.htmlparser.lexer.Lexer; *************** *** 340,348 **** * <code>String</code> objects available from this <code>Hashtable</code>. * @return Returns a list of name/value pairs representing the attributes. ! * These are not in order, the keys (names) are capitalized and the values * are not quoted, even if they need to be. The table <em>will</em> return * <code>null</code> if there was no value for an attribute (no equals * sign or nothing to the right of the equals sign). A special entry with * a key of SpecialHashtable.TAGNAME ("$<TAGNAME>$") holds the tag name. */ public Hashtable getAttributes () --- 341,350 ---- * <code>String</code> objects available from this <code>Hashtable</code>. 
* @return Returns a list of name/value pairs representing the attributes. ! * These are not in order, the keys (names) are converted to uppercase and the values * are not quoted, even if they need to be. The table <em>will</em> return * <code>null</code> if there was no value for an attribute (no equals * sign or nothing to the right of the equals sign). A special entry with * a key of SpecialHashtable.TAGNAME ("$<TAGNAME>$") holds the tag name. + * The conversion to uppercase is performed with an ENGLISH locale. */ public Hashtable getAttributes () *************** *** 360,364 **** // special handling for the node name attribute = (Attribute)attributes.elementAt (0); ! ret.put (SpecialHashtable.TAGNAME, attribute.getName ().toUpperCase ()); // the rest for (int i = 1; i < attributes.size (); i++) --- 362,366 ---- // special handling for the node name attribute = (Attribute)attributes.elementAt (0); ! ret.put (SpecialHashtable.TAGNAME, attribute.getName ().toUpperCase (Locale.ENGLISH)); // the rest for (int i = 1; i < attributes.size (); i++) *************** *** 372,376 **** if (null == value) value = SpecialHashtable.NULLVALUE; ! ret.put (attribute.getName ().toUpperCase (), value); } } --- 374,378 ---- if (null == value) value = SpecialHashtable.NULLVALUE; ! ret.put (attribute.getName ().toUpperCase (Locale.ENGLISH), value); } } *************** *** 391,394 **** --- 393,397 ---- * To get at the original text of the tag name use * {@link #getRawTagName getRawTagName()}. + * The conversion to uppercase is performed with an ENGLISH locale. * </em> * @return The tag name. *************** *** 401,405 **** if (null != ret) { ! ret = ret.toUpperCase (); if (ret.startsWith ("/")) ret = ret.substring (1); --- 404,408 ---- if (null != ret) { ! ret = ret.toUpperCase (Locale.ENGLISH); if (ret.startsWith ("/")) ret = ret.substring (1); *************** *** 673,677 **** public boolean breaksFlow () { ! 
return (breakTags.containsKey (getTagName ().toUpperCase ())); } --- 676,680 ---- public boolean breaksFlow () { ! return (breakTags.containsKey (getTagName ())); } |
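A minimal sketch of the locale problem behind bug #883664, assuming a plain Java runtime and no htmlparser classes. The class name LocaleUpperCaseDemo and the helper tagKey are hypothetical and exist only for illustration. Under Turkish casing rules a lowercase 'i' uppercases to a dotted capital I (U+0130), so a tag name such as "title" would no longer match the key "TITLE" if the conversion followed the default locale.

```java
import java.util.Locale;

public class LocaleUpperCaseDemo {
    // Hypothetical helper mirroring the fix: names are uppercased with a
    // fixed English locale instead of the JVM's default locale.
    public static String tagKey(String rawName) {
        return rawName.toUpperCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String raw = "title";

        // English rules: 'i' becomes 'I', so the key matches "TITLE".
        System.out.println(tagKey(raw).equals("TITLE"));

        // Turkish rules: 'i' becomes U+0130, so the same comparison fails.
        System.out.println(raw.toUpperCase(turkish).equals("TITLE"));
    }
}
```

This is why the patch passes Locale.ENGLISH explicitly for tag and attribute names, whose uppercase form is used as a lookup key.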
From: <der...@pr...> - 2004-01-27 13:25:38
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26191/visitors Modified Files: LinkFindingVisitor.java StringFindingVisitor.java Log Message: Fix bug #883664 toUpperCase on tag names and attributes depends on locale Added locale information to all relevant toUpperCase() calls, with an English locale for tag names and attribute names, or developers choice of locale for methods that do uppercase conversion as part of their algorithms. Index: LinkFindingVisitor.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors/LinkFindingVisitor.java,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** LinkFindingVisitor.java 2 Jan 2004 16:24:58 -0000 1.34 --- LinkFindingVisitor.java 25 Jan 2004 21:33:14 -0000 1.35 *************** *** 27,55 **** package org.htmlparser.visitors; import org.htmlparser.tags.LinkTag; ! public class LinkFindingVisitor extends NodeVisitor { private String linkTextToFind; ! private boolean linkTagFound = false; ! private int count = 0; ! public LinkFindingVisitor(String linkTextToFind) { ! this.linkTextToFind = linkTextToFind.toUpperCase(); } ! public void visitLinkTag(LinkTag linkTag) { ! // System.out.println("Matching with "+linkTag.getLinkText()); ! if (linkTag.getLinkText().toUpperCase().indexOf(linkTextToFind)!=-1) { ! linkTagFound = true; count++; - } } ! public boolean linkTextFound() { ! return linkTagFound; } ! public int getCount() { ! return count; } --- 27,66 ---- package org.htmlparser.visitors; + import java.util.Locale; + import org.htmlparser.tags.LinkTag; ! public class LinkFindingVisitor extends NodeVisitor ! { private String linkTextToFind; ! private int count; ! private Locale locale; ! public LinkFindingVisitor (String linkTextToFind) ! { ! this (linkTextToFind, null); } ! public LinkFindingVisitor (String linkTextToFind, Locale locale) ! { ! 
count = 0; ! this.locale = (null == locale) ? Locale.ENGLISH : locale; ! this.linkTextToFind = linkTextToFind.toUpperCase (this.locale); ! } ! ! public void visitLinkTag(LinkTag linkTag) ! { ! if (-1 != linkTag.getLinkText ().toUpperCase (locale).indexOf (linkTextToFind)) count++; } ! public boolean linkTextFound() ! { ! return (0 != count); } ! public int getCount() ! { ! return (count); } Index: StringFindingVisitor.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/visitors/StringFindingVisitor.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** StringFindingVisitor.java 2 Jan 2004 16:24:58 -0000 1.38 --- StringFindingVisitor.java 25 Jan 2004 21:33:14 -0000 1.39 *************** *** 27,53 **** package org.htmlparser.visitors; import org.htmlparser.StringNode; ! public class StringFindingVisitor extends NodeVisitor { ! private boolean stringFound = false; private String stringToFind; private int foundCount; private boolean multipleSearchesWithinStrings; ! public StringFindingVisitor(String stringToFind) { ! this.stringToFind = stringToFind.toUpperCase(); foundCount = 0; multipleSearchesWithinStrings = false; } ! public void doMultipleSearchesWithinStrings() { multipleSearchesWithinStrings = true; } ! public void visitStringNode(StringNode stringNode) { ! String stringToBeSearched = stringNode.getText().toUpperCase(); if (!multipleSearchesWithinStrings && stringToBeSearched.indexOf(stringToFind) != -1) { - stringFound = true; foundCount++; } else if (multipleSearchesWithinStrings) { --- 27,64 ---- package org.htmlparser.visitors; + import java.util.Locale; + import org.htmlparser.StringNode; ! public class StringFindingVisitor extends NodeVisitor ! { private String stringToFind; private int foundCount; private boolean multipleSearchesWithinStrings; + private Locale locale; ! public StringFindingVisitor(String stringToFind) ! { ! 
this (stringToFind, null); ! } ! ! public StringFindingVisitor(String stringToFind, Locale locale) ! { ! this.locale = (null == locale) ? Locale.ENGLISH : locale; ! this.stringToFind = stringToFind.toUpperCase (this.locale); foundCount = 0; multipleSearchesWithinStrings = false; } ! public void doMultipleSearchesWithinStrings() ! { multipleSearchesWithinStrings = true; } ! public void visitStringNode(StringNode stringNode) ! { ! String stringToBeSearched = stringNode.getText().toUpperCase(locale); if (!multipleSearchesWithinStrings && stringToBeSearched.indexOf(stringToFind) != -1) { foundCount++; } else if (multipleSearchesWithinStrings) { *************** *** 61,69 **** } ! public boolean stringWasFound() { ! return stringFound; } ! public int stringFoundCount() { return foundCount; } --- 72,82 ---- } ! public boolean stringWasFound() ! { ! return (0 != stringFoundCount()); } ! public int stringFoundCount() ! { return foundCount; } |
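The constructor pattern introduced in both visitors can be sketched on its own, assuming nothing from the htmlparser library. LocaleAwareCounter and its method names are hypothetical stand-ins, not the visitor API. The single-argument constructor delegates with a null locale, null falls back to Locale.ENGLISH, and the "found" flag is derived from the count rather than stored separately.

```java
import java.util.Locale;

public class LocaleAwareCounter {
    private final String needle;
    private final Locale locale;
    private int count;

    public LocaleAwareCounter(String needle) {
        this(needle, null); // delegate; null selects the default below
    }

    public LocaleAwareCounter(String needle, Locale locale) {
        this.locale = (locale == null) ? Locale.ENGLISH : locale;
        // Uppercase the search text once, with the same locale used later.
        this.needle = needle.toUpperCase(this.locale);
    }

    // Stand-in for a visit callback: counts texts containing the needle.
    public void visit(String text) {
        if (text.toUpperCase(locale).indexOf(needle) != -1) {
            count++;
        }
    }

    public boolean found() {
        return count != 0; // derived from count, no separate boolean field
    }

    public int getCount() {
        return count;
    }
}
```

Using the same locale for both the search text and the visited text keeps the comparison consistent whichever locale the caller picks.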
From: <der...@pr...> - 2004-01-27 13:22:41
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26191

Modified Files:
	PrototypicalNodeFactory.java
Log Message:
Fix bug #883664 toUpperCase on tag names and attributes depends on locale
Added locale information to all relevant toUpperCase() calls, with an English locale
for tag names and attribute names, or developer's choice of locale for methods that
do uppercase conversion as part of their algorithms.

Index: PrototypicalNodeFactory.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/PrototypicalNodeFactory.java,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** PrototypicalNodeFactory.java 24 Jan 2004 18:13:29 -0000 1.4
--- PrototypicalNodeFactory.java 25 Jan 2004 21:32:56 -0000 1.5
***************
*** 29,34 ****
--- 29,36 ----
  import java.io.Serializable;
  import java.util.Hashtable;
+ import java.util.Locale;
  import java.util.Map;
  import java.util.Vector;
+
  import org.htmlparser.lexer.Page;
  import org.htmlparser.lexer.nodes.Attribute;
***************
*** 245,249 ****
          try
          {
!             id = id.toUpperCase ();
              if (!id.startsWith ("/"))
              {
--- 247,251 ----
          try
          {
!             id = id.toUpperCase (Locale.ENGLISH);
              if (!id.startsWith ("/"))
              {
|
From: <der...@pr...> - 2004-01-27 13:10:45
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26789/src/org/htmlparser

Modified Files:
	Parser.java
Log Message:
Update version to 1.4-20040125

Index: Parser.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v
retrieving revision 1.85
retrieving revision 1.86
diff -C2 -d -r1.85 -r1.86
*** Parser.java 26 Jan 2004 00:27:34 -0000 1.85
--- Parser.java 26 Jan 2004 01:02:10 -0000 1.86
***************
*** 88,92 ****
       */
      public final static String
!     VERSION_DATE = "Jan 19, 2004"
      ;
--- 88,92 ----
       */
      public final static String
!     VERSION_DATE = "Jan 25, 2004"
      ;
|
From: <der...@pr...> - 2004-01-27 11:10:01
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7058

Modified Files:
	CompositeTag.java
Log Message:
Fix StackOverflowError similar to the previous one.
Recursion in the collectInto() wasn't checking for an end tag the same as 'this'.
Scanned for other similar occurrences, and fixed it in visitor code too.

Index: CompositeTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v
retrieving revision 1.72
retrieving revision 1.73
diff -C2 -d -r1.72 -r1.73
*** CompositeTag.java 19 Jan 2004 22:44:59 -0000 1.72
--- CompositeTag.java 24 Jan 2004 17:41:32 -0000 1.73
***************
*** 243,254 ****
       * @return int
       */
!     public int findPositionOf(String text) {
          Node node;
!         int loc = 0;
!         for (SimpleNodeIterator e=children();e.hasMoreNodes();) {
!             node = e.nextNode();
!             if (node.toPlainTextString().toUpperCase().indexOf(text.toUpperCase())!=-1) {
                  return loc;
-             }
              loc++;
          }
--- 243,258 ----
       * @return int
       */
!     public int findPositionOf(String text)
!     {
          Node node;
!         int loc;
!
!         loc = 0;
!         text = text.toUpperCase ();
!         for (SimpleNodeIterator e = children (); e.hasMoreNodes (); )
!         {
!             node = e.nextNode ();
!             if (-1 != node.toPlainTextString ().toUpperCase ().indexOf (text))
                  return loc;
              loc++;
          }
***************
*** 326,330 ****
          for (SimpleNodeIterator e = children(); e.hasMoreNodes ();)
              e.nextNode ().collectInto (list, filter);
!         if (null != getEndTag ())
              getEndTag ().collectInto (list, filter);
      }
--- 330,334 ----
          for (SimpleNodeIterator e = children(); e.hasMoreNodes ();)
              e.nextNode ().collectInto (list, filter);
!         if ((null != getEndTag ()) && (this != getEndTag ())) // 2nd guard handles <tag/>
              getEndTag ().collectInto (list, filter);
      }
***************
*** 367,371 ****
              }
          }
!         if (null != getEndTag ())
              getEndTag ().accept (visitor);
      }
--- 371,375 ----
              }
          }
!         if ((null != getEndTag ()) && (this != getEndTag ())) // 2nd guard handles <tag/>
              getEndTag ().accept (visitor);
      }
***************
*** 462,465 ****
--- 466,483 ----
      }

+     /**
+      * Return the text between the start tag and the end tag.
+      * @return The contents of the CompositeTag.
+      */
+     public String getStringText ()
+     {
+         String ret;
+         int start = getEndPosition ();
+         int end = mEndTag.getStartPosition ();
+         ret = getPage ().getText (start, end);
+
+         return (ret);
+     }
+
      public void toString (int level, StringBuffer buffer)
      {
|
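The second guard in the patch can be illustrated in isolation. The sketch below assumes a hypothetical Tag class, not the library's CompositeTag, in which a self-closing tag such as <tag/> records itself as its own end tag. Without the `endTag != this` check, collectInto would call itself on the same object until the stack overflows.

```java
import java.util.ArrayList;
import java.util.List;

public class EndTagGuardDemo {
    /** Hypothetical stand-in for a composite tag; not the library's class. */
    static class Tag {
        final String name;
        final List<Tag> children = new ArrayList<>();
        Tag endTag; // for a self-closing tag this points back at the tag itself

        Tag(String name) {
            this.name = name;
        }

        void collectInto(List<String> out) {
            out.add(name);
            for (Tag child : children) {
                child.collectInto(out);
            }
            // The second condition is the fix: skip an end tag that is 'this'.
            if (endTag != null && endTag != this) {
                endTag.collectInto(out);
            }
        }
    }

    // Builds <div><a/></div> and collects the names in traversal order.
    public static List<String> collect() {
        Tag selfClosing = new Tag("a");
        selfClosing.endTag = selfClosing; // models <a ... />
        Tag div = new Tag("div");
        div.children.add(selfClosing);
        div.endTag = new Tag("/div");

        List<String> out = new ArrayList<>();
        div.collectInto(out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect()); // prints [div, a, /div]
    }
}
```

The same identity check applies to any traversal that follows the end-tag reference, which is why the patch repeats it in accept().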
From: <der...@pr...> - 2004-01-27 09:45:01
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26191/tags Modified Files: CompositeTag.java FrameSetTag.java ImageTag.java Log Message: Fix bug #883664 toUpperCase on tag names and attributes depends on locale Added locale information to all relevant toUpperCase() calls, with an English locale for tag names and attribute names, or developers choice of locale for methods that do uppercase conversion as part of their algorithms. Index: CompositeTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v retrieving revision 1.74 retrieving revision 1.75 diff -C2 -d -r1.74 -r1.75 *** CompositeTag.java 24 Jan 2004 23:57:52 -0000 1.74 --- CompositeTag.java 25 Jan 2004 21:33:11 -0000 1.75 *************** *** 27,30 **** --- 27,32 ---- package org.htmlparser.tags; + import java.util.Locale; + import org.htmlparser.Node; import org.htmlparser.NodeFilter; *************** *** 186,216 **** /** ! * Searches for any node whose text representation contains the search ! * string. Collects all such nodes in a NodeList. ! * e.g. if you wish to find any textareas in a form tag containing "hello ! * world", the code would be : * <code> ! * NodeList nodeList = formTag.searchFor("Hello World"); * </code> ! * @param searchString search criterion ! * @param caseSensitive specify whether this search should be case ! * sensitive ! * @return NodeList Collection of nodes whose string contents or ! * representation have the searchString in them */ ! public NodeList searchFor(String searchString, boolean caseSensitive) { ! NodeList foundList = new NodeList(); Node node; ! if (!caseSensitive) searchString = searchString.toUpperCase(); ! for (SimpleNodeIterator e = children();e.hasMoreNodes();) { ! node = e.nextNode(); ! String nodeTextString = node.toPlainTextString(); ! 
if (!caseSensitive) nodeTextString=nodeTextString.toUpperCase(); ! if (nodeTextString.indexOf(searchString)!=-1) { ! foundList.add(node); ! } } ! return foundList; } --- 188,266 ---- /** ! * Searches for all nodes whose text representation contains the search string. ! * Collects all nodes containing the search string into a NodeList. ! * This search is <b>case-insensitive</b> and the search string and the ! * node text are converted to uppercase using an English locale. ! * For example, if you wish to find any textareas in a form tag containing ! * "hello world", the code would be: * <code> ! * NodeList nodeList = formTag.searchFor("Hello World"); * </code> ! * @param searchString Search criterion. ! * @return A collection of nodes whose string contents or ! * representation have the <code>searchString</code> in them. */ + public NodeList searchFor (String searchString) + { + return (searchFor (searchString, false)); + } ! /** ! * Searches for all nodes whose text representation contains the search string. ! * Collects all nodes containing the search string into a NodeList. ! * For example, if you wish to find any textareas in a form tag containing ! * "hello world", the code would be: ! * <code> ! * NodeList nodeList = formTag.searchFor("Hello World"); ! * </code> ! * @param searchString Search criterion. ! * @param caseSensitive If <code>true</code> this search should be case ! * sensitive. Otherwise, the search string and the node text are converted ! * to uppercase using an English locale. ! * @return A collection of nodes whose string contents or ! * representation have the <code>searchString</code> in them. ! */ ! public NodeList searchFor (String searchString, boolean caseSensitive) ! { ! return (searchFor (searchString, caseSensitive, Locale.ENGLISH)); ! } ! ! /** ! * Searches for all nodes whose text representation contains the search string. ! * Collects all nodes containing the search string into a NodeList. ! 
* For example, if you wish to find any textareas in a form tag containing ! * "hello world", the code would be: ! * <code> ! * NodeList nodeList = formTag.searchFor("Hello World"); ! * </code> ! * @param searchString Search criterion. ! * @param caseSensitive If <code>true</code> this search should be case ! * sensitive. Otherwise, the search string and the node text are converted ! * to uppercase using the locale provided. ! * @parem locale The locale for uppercase conversion. ! * @return A collection of nodes whose string contents or ! * representation have the <code>searchString</code> in them. ! */ ! public NodeList searchFor (String searchString, boolean caseSensitive, Locale locale) ! { Node node; ! String text; ! NodeList ret; ! ! ret = new NodeList (); ! ! if (!caseSensitive) ! searchString = searchString.toUpperCase (locale); ! for (SimpleNodeIterator e = children (); e.hasMoreNodes (); ) ! { ! node = e.nextNode (); ! text = node.toPlainTextString (); ! if (!caseSensitive) ! text = text.toUpperCase (locale); ! if (-1 != text.indexOf (searchString)) ! ret.add (node); } ! ! return (ret); } *************** *** 231,258 **** /** ! * Searches for any node whose text representation contains the search ! * string. Collects all such nodes in a NodeList. ! * e.g. if you wish to find any textareas in a form tag containing "hello ! * world", the code would be : ! * <code> ! * NodeList nodeList = formTag.searchFor("Hello World"); ! * </code> ! * This search is <b>case-insensitive</b>. ! * @param searchString search criterion ! * @return NodeList Collection of nodes whose string contents or ! * representation have the searchString in them */ ! public NodeList searchFor(String searchString) { ! return searchFor(searchString, false); } /** ! * Returns the node number of the string node containing the ! * given text. This can be useful to index into the composite tag ! * and get other children. ! * @param text ! * @return int */ ! 
public int findPositionOf(String text) { Node node; --- 281,307 ---- /** ! * Returns the node number of the first node containing the given text. ! * This can be useful to index into the composite tag and get other children. ! * Text is compared without case sensitivity and conversion to uppercase ! * uses an English locale. ! * @param text The text to search for. ! * @return int The node index in the children list of the node containing ! * the text or -1 if not found. */ ! public int findPositionOf (String text) ! { ! return (findPositionOf (text, Locale.ENGLISH)); } /** ! * Returns the node number of the first node containing the given text. ! * This can be useful to index into the composite tag and get other children. ! * Text is compared without case sensitivity and conversion to uppercase ! * uses the supplied locale. ! * @param text The text to search for. ! * @return int The node index in the children list of the node containing ! * the text or -1 if not found. */ ! public int findPositionOf (String text, Locale locale) { Node node; *************** *** 260,268 **** loc = 0; ! text = text.toUpperCase (); for (SimpleNodeIterator e = children (); e.hasMoreNodes (); ) { node = e.nextNode (); ! if (-1 != node.toPlainTextString ().toUpperCase ().indexOf (text)) return loc; loc++; --- 309,317 ---- loc = 0; ! text = text.toUpperCase (locale); for (SimpleNodeIterator e = children (); e.hasMoreNodes (); ) { node = e.nextNode (); ! 
if (-1 != node.toPlainTextString ().toUpperCase (locale).indexOf (text)) return loc; loc++; Index: FrameSetTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FrameSetTag.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** FrameSetTag.java 2 Jan 2004 16:24:54 -0000 1.35 --- FrameSetTag.java 25 Jan 2004 21:33:12 -0000 1.36 *************** *** 27,30 **** --- 27,32 ---- package org.htmlparser.tags; + import java.util.Locale; + import org.htmlparser.Node; import org.htmlparser.util.NodeList; *************** *** 90,119 **** /** * Gets a frame by name. * @param name The name of the frame to retrieve. * @return The specified frame or <code>null</code> if it wasn't found. */ ! public FrameTag getFrame(String name) { - boolean found; Node node; ! FrameTag frameTag; ! found = false; ! name = name.toUpperCase (); ! frameTag = null; ! for (SimpleNodeIterator e=getFrames().elements();e.hasMoreNodes() && !found;) { node = e.nextNode(); if (node instanceof FrameTag) { ! frameTag = (FrameTag)node; ! if (frameTag.getFrameName().toUpperCase().equals(name)) ! found = true; } } ! if (found) ! return (frameTag); ! else ! return (null); } --- 92,133 ---- /** * Gets a frame by name. + * Names are checked without case sensitivity and conversion to uppercase + * is performed with an English locale. * @param name The name of the frame to retrieve. * @return The specified frame or <code>null</code> if it wasn't found. */ ! public FrameTag getFrame (String name) ! { ! return (getFrame (name, Locale.ENGLISH)); ! } ! ! /** ! * Gets a frame by name. ! * Names are checked without case sensitivity and conversion to uppercase ! * is performed with the locale provided. ! * @param name The name of the frame to retrieve. ! * @param locale The locale to use when converting to uppercase. ! * @return The specified frame or <code>null</code> if it wasn't found. ! */ ! 
public FrameTag getFrame (String name, Locale locale) { Node node; ! FrameTag ret; ! ! ret = null; ! name = name.toUpperCase (locale); ! for (SimpleNodeIterator e = getFrames ().elements (); e.hasMoreNodes () && (null == ret); ) { node = e.nextNode(); if (node instanceof FrameTag) { ! ret = (FrameTag)node; ! if (!ret.getFrameName ().toUpperCase (locale).equals (name)) ! ret = null; } } ! ! return (ret); } Index: ImageTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/ImageTag.java,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** ImageTag.java 14 Jan 2004 02:53:46 -0000 1.41 --- ImageTag.java 25 Jan 2004 21:33:12 -0000 1.42 *************** *** 27,31 **** --- 27,33 ---- package org.htmlparser.tags; + import java.util.Locale; import java.util.Vector; + import org.htmlparser.lexer.nodes.Attribute; import org.htmlparser.util.ParserUtils; *************** *** 107,111 **** if (null != string) { ! name = string.toUpperCase (); if (name.equals ("SRC")) { --- 109,113 ---- if (null != string) { ! name = string.toUpperCase (Locale.ENGLISH); if (name.equals ("SRC")) { |
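The contract documented for findPositionOf (index of the first child containing the text, compared without case sensitivity, or -1 if nothing matches) can be sketched over a plain list of strings. FindPositionDemo and its List<String> parameter are assumptions made for illustration; the real method walks the tag's child nodes.

```java
import java.util.List;
import java.util.Locale;

public class FindPositionDemo {
    // Convenience overload: defaults to an English locale, as in the patch.
    public static int findPositionOf(List<String> childTexts, String text) {
        return findPositionOf(childTexts, text, Locale.ENGLISH);
    }

    // Returns the index of the first entry containing 'text', ignoring case
    // under the given locale, or -1 when no entry matches.
    public static int findPositionOf(List<String> childTexts, String text, Locale locale) {
        String needle = text.toUpperCase(locale); // convert once, outside the loop
        int loc = 0;
        for (String child : childTexts) {
            if (child.toUpperCase(locale).indexOf(needle) != -1) {
                return loc;
            }
            loc++;
        }
        return -1;
    }
}
```

Converting the search text once before the loop matches the earlier cleanup of this method, which previously uppercased it on every iteration.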
From: <der...@pr...> - 2004-01-27 05:48:31
|
Update of /cvsroot/htmlparser/htmlparser/docs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26789/docs Modified Files: changes.txt release.txt Log Message: Update version to 1.4-20040125 Index: changes.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/changes.txt,v retrieving revision 1.195 retrieving revision 1.196 diff -C2 -d -r1.195 -r1.196 *** changes.txt 19 Jan 2004 23:13:05 -0000 1.195 --- changes.txt 26 Jan 2004 01:01:56 -0000 1.196 *************** *** 13,16 **** --- 13,89 ---- ******************************************************************************* + Integration Build 1.4 - 20040125 + -------------------------------- + + 2004-01-25 19:27 derrickoswald + + * src/org/htmlparser/Parser.java: + + Fix RFE #817723 enhancement: add user-agent + Add get/setDefaultRequestProperties() which is used when creating a new connection + to condition the connection prior to connecting. + Currently, the only request property is "User-Agent", which is set to "HTMLParser/1.4". + Sophisticated users may set other properties to tailor the parser behaviour. + + 2004-01-25 16:32 derrickoswald + + * src/org/htmlparser/: PrototypicalNodeFactory.java, + filters/HasAttributeFilter.java, filters/StringFilter.java, + filters/TagNameFilter.java, lexer/nodes/TagNode.java, + tags/CompositeTag.java, tags/FrameSetTag.java, tags/ImageTag.java, + tests/FunctionalTests.java, tests/ParserTest.java, + visitors/LinkFindingVisitor.java, + visitors/StringFindingVisitor.java: + + Fix bug #883664 toUpperCase on tag names and attributes depends on locale + Added locale information to all relevant toUpperCase() calls, with an English locale + for tag names and attribute names, or developers choice of locale for methods that + do uppercase conversion as part of their algorithms. 
+ + 2004-01-24 18:57 derrickoswald + + * src/org/htmlparser/: filters/HasChildFilter.java, + filters/HasParentFilter.java, filters/IsEqualFilter.java, + tags/AppletTag.java, tags/CompositeTag.java, tags/FormTag.java, + tags/SelectTag.java, tags/TableRow.java, tags/TableTag.java, + tests/tagTests/BulletListTagTest.java, + tests/tagTests/DivTagTest.java, tests/tagTests/SpanTagTest.java: + + Fix bug #882940 empty applet tag contents causes NullPointerException + Also found and fixed other similar problems where getChildren() could + return null. + Then changed table row and column handling to handle rows and + columns embedded within other tags. + + 2004-01-24 13:12 derrickoswald + + * src/org/htmlparser/: tags/TableRow.java, tags/TableHeader.java, + PrototypicalNodeFactory.java: + + Add TableHeaderTag submitted by Pim Schrama. + Robustify TableRow against null getChildren(). + + 2004-01-24 12:41 derrickoswald + + * src/org/htmlparser/tags/CompositeTag.java: + + Fix StackOverflowError similar to the previous one. + Recursion in the collectInto() wasn't checking for an end tag the same as 'this'. + Scanned for other similar occurances, and fixed it in visitor code too. + + 2004-01-24 12:13 derrickoswald + + * src/org/htmlparser/: lexer/Lexer.java, + tests/lexerTests/LexerTests.java, tests/tagTests/JspTagTest.java, + tests/ParserTest.java: + + Fix bug #880283 Character ">" erroneously inserted by Lexer. + Some jsp tags are now handled in a separate jsp parse in the lexer. + Jsp tags embedded as attributes are still not handled. + Refer to bug #772700 Jsp Tags are not parsed correctly when in quoted attributes, + which is now reversed (i.e. in quotes are OK, outside of quotes causes problems), + but this points out a deficiency in the data structure holding tag contents (attribute lists) + that doesn't provide for tags within attributes. 
+ Integration Build 1.4 - 20040119 -------------------------------- Index: release.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/release.txt,v retrieving revision 1.54 retrieving revision 1.55 diff -C2 -d -r1.54 -r1.55 *** release.txt 19 Jan 2004 23:14:17 -0000 1.54 --- release.txt 26 Jan 2004 01:02:09 -0000 1.55 *************** *** 1,3 **** ! HTMLParser Version 1.4 (Integration Build Jan 19, 2004) ********************************************* --- 1,3 ---- ! HTMLParser Version 1.4 (Integration Build Jan 25, 2004) ********************************************* |
From: <der...@pr...> - 2004-01-27 05:21:01
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3286/tests/lexerTests Modified Files: LexerTests.java Log Message: Fix bug #880283 Character ">" erroneously inserted by Lexer. Some jsp tags are now handled in a separate jsp parse in the lexer. Jsp tags embedded as attributes are still not handled. Refer to bug #772700 Jsp Tags are not parsed correctly when in quoted attributes, which is now reversed (i.e. in quotes are OK, outside of quotes causes problems), but this points out a deficiency in the data structure holding tag contents (attribute lists) that doesn't provide for tags within attributes. Index: LexerTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/LexerTests.java,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** LexerTests.java 14 Jan 2004 02:53:47 -0000 1.17 --- LexerTests.java 24 Jan 2004 17:14:20 -0000 1.18 *************** *** 734,737 **** --- 734,792 ---- } + /** + * Check for StackOverflow error. 
+ */ + public void testStackOverflow () + throws + ParserException + { + NodeIterator iterator; + Node node; + String html; + + html = "<a href = \"http://test.com\" />"; + createParser (html); + for (iterator = parser.elements (); iterator.hasMoreNodes (); ) + { + node = iterator.nextNode (); + String text = node.toHtml (); + assertStringEquals ("no overflow", html, text); + } + html = "<a href=\"http://test.com\"/>"; + createParser (html); + for (iterator = parser.elements (); iterator.hasMoreNodes (); ) + { + node = iterator.nextNode (); + String text = node.toHtml (); + assertStringEquals ("no overflow", html, text); + } + html = "<a href = \"http://test.com\"/>"; + createParser (html); + for (iterator = parser.elements (); iterator.hasMoreNodes (); ) + { + node = iterator.nextNode (); + String text = node.toHtml (); + assertStringEquals ("no overflow", html, text); + } + } + + /** + * See bug #880283 Character ">" erroneously inserted by Lexer + */ + public void testJsp () throws ParserException + { + String html; + Lexer lexer; + Node node; + + html = "<% out.urlEncode('abc') + \"<br>\" + out.urlEncode('xyz') %>"; + lexer = new Lexer (html); + node = lexer.nextNode (); + if (node == null) + fail ("too few nodes"); + else + assertStringEquals ("bad html", html, node.toHtml()); + assertNull ("too many nodes", lexer.nextNode ()); + } } |
From: <der...@pr...> - 2004-01-26 15:01:15
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv19868

Modified Files:
    Parser.java
Log Message:
Fix RFE #817723 enhancement: add user-agent
Add get/setDefaultRequestProperties() which is used when creating a new connection to condition the connection prior to connecting. Currently, the only request property is "User-Agent", which is set to "HTMLParser/1.4". Sophisticated users may set other properties to tailor the parser behaviour.

Index: Parser.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v
retrieving revision 1.84
retrieving revision 1.85
diff -C2 -d -r1.84 -r1.85
*** Parser.java 19 Jan 2004 23:14:18 -0000 1.84
--- Parser.java 26 Jan 2004 00:27:34 -0000 1.85
***************
*** 33,36 ****
--- 33,39 ----
  import java.net.URL;
  import java.net.URLConnection;
+ import java.util.HashMap;
+ import java.util.Iterator;
+ import java.util.Map;

  import org.htmlparser.filters.TagNameFilter;
***************
*** 98,101 ****
--- 101,114 ----

      /**
+      * Default Request header fields.
+      * So far this is just "User-Agent".
+      */
+     protected static Map mDefaultRequestProperties = new HashMap ();
+     static
+     {
+         mDefaultRequestProperties.put ("User-Agent", "HTMLParser/" + VERSION_NUMBER);
+     }
+
+     /**
       * Feedback object.
       */
***************
*** 161,164 ****
--- 174,241 ----
      }

+     /**
+      * Get the current default request header properties.
+      * A String-to-String map of header keys and values.
+      * These fields are set by the parser when creating a connection.
+      */
+     public static Map getDefaultRequestProperties ()
+     {
+         return (mDefaultRequestProperties);
+     }
+
+     /**
+      * Set the default request header properties.
+      * A String-to-String map of header keys and values.
+      * These fields are set by the parser when creating a connection.
+      * Some of these can be set directly on a <code>URLConnection</code>,
+      * i.e. If-Modified-Since is set with setIfModifiedSince(long),
+      * but since the parser transparently opens the connection on behalf
+      * of the developer, these properties are not available before the
+      * connection is fetched. Setting these request header fields affects all
+      * subsequent connections opened by the parser. For more direct control
+      * create a <code>URLConnection</code> and set it on the parser.<p>
+      * From <a href="http://www.ietf.org/rfc/rfc2616.txt">RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1</a>:
+      * <pre>
+      * 5.3 Request Header Fields
+      *
+      *    The request-header fields allow the client to pass additional
+      *    information about the request, and about the client itself, to the
+      *    server. These fields act as request modifiers, with semantics
+      *    equivalent to the parameters on a programming language method
+      *    invocation.
+      *
+      *        request-header = Accept                   ; Section 14.1
+      *                       | Accept-Charset           ; Section 14.2
+      *                       | Accept-Encoding          ; Section 14.3
+      *                       | Accept-Language          ; Section 14.4
+      *                       | Authorization            ; Section 14.8
+      *                       | Expect                   ; Section 14.20
+      *                       | From                     ; Section 14.22
+      *                       | Host                     ; Section 14.23
+      *                       | If-Match                 ; Section 14.24
+      *                       | If-Modified-Since        ; Section 14.25
+      *                       | If-None-Match            ; Section 14.26
+      *                       | If-Range                 ; Section 14.27
+      *                       | If-Unmodified-Since      ; Section 14.28
+      *                       | Max-Forwards             ; Section 14.31
+      *                       | Proxy-Authorization      ; Section 14.34
+      *                       | Range                    ; Section 14.35
+      *                       | Referer                  ; Section 14.36
+      *                       | TE                       ; Section 14.39
+      *                       | User-Agent               ; Section 14.43
+      *
+      *    Request-header field names can be extended reliably only in
+      *    combination with a change in the protocol version. However, new or
+      *    experimental header fields MAY be given the semantics of request-
+      *    header fields if all parties in the communication recognize them to
+      *    be request-header fields. Unrecognized header fields are treated as
+      *    entity-header fields.
+      * </pre>
+      */
+     public static void setDefaultRequestProperties (Map properties)
+     {
+         mDefaultRequestProperties = properties;
+     }
+
      //
      // Constructors
***************
*** 500,503 ****
--- 577,583 ----
          ParserException
      {
+         Map properties;
+         String key;
+         String value;
          URLConnection ret;

***************
*** 505,508 ****
--- 585,596 ----
          {
              ret = url.openConnection ();
+             properties = getDefaultRequestProperties ();
+             if (null != properties)
+                 for (Iterator iterator = properties.keySet ().iterator (); iterator.hasNext (); )
+                 {
+                     key = (String)iterator.next ();
+                     value = (String)properties.get (key);
+                     ret.setRequestProperty (key, value);
+                 }
          }
          catch (IOException ioe)
|
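For readers who want to try the idea outside the parser, here is a minimal sketch of applying a map of default request header fields to a connection before it connects. The class name RequestDefaults and its open() helper are hypothetical stand-ins, not part of HTMLParser, and the sketch uses generics and a for-each loop where the commit above uses raw types and an Iterator. Only standard java.net and java.util calls are used.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the parser's connection setup; not part of HTMLParser.
class RequestDefaults
{
    // Default request header fields, keyed by header name.
    static Map<String, String> defaults = new HashMap<String, String> ();
    static
    {
        defaults.put ("User-Agent", "HTMLParser/1.4");
    }

    // Create the connection object (no network traffic yet) and apply every default field.
    static URLConnection open (URL url) throws IOException
    {
        URLConnection ret = url.openConnection ();
        for (Map.Entry<String, String> entry : defaults.entrySet ())
            ret.setRequestProperty (entry.getKey (), entry.getValue ());
        return (ret);
    }

    public static void main (String[] args) throws IOException
    {
        // Extra fields added here affect every connection opened afterwards.
        defaults.put ("Accept-Language", "en");
        URLConnection connection = open (new URL ("http://example.com/"));
        System.out.println (connection.getRequestProperty ("User-Agent"));
    }
}
```

Request properties must be set before the connection is actually made, which is why the fields are applied immediately after openConnection() in both the commit and this sketch.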
From: <der...@pr...> - 2004-01-26 13:44:45
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3286/tests/tagTests

Modified Files:
    JspTagTest.java
Log Message:
Fix bug #880283 Character ">" erroneously inserted by Lexer.
Some jsp tags are now handled in a separate jsp parse in the lexer.
Jsp tags embedded as attributes are still not handled.
Refer to bug #772700 Jsp Tags are not parsed correctly when in quoted attributes, which is now reversed (i.e. in quotes are OK, outside of quotes causes problems), but this points out a deficiency in the data structure holding tag contents (attribute lists) that doesn't provide for tags within attributes.

Index: JspTagTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/JspTagTest.java,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** JspTagTest.java 14 Jan 2004 02:53:47 -0000 1.43
--- JspTagTest.java 24 Jan 2004 17:14:47 -0000 1.44
***************
*** 43,47 ****
      private static final boolean JSP_TESTS_ENABLED = false;

!     public JspTagTest(String name) {
          super(name);
      }
--- 43,48 ----
      private static final boolean JSP_TESTS_ENABLED = false;

!     public JspTagTest(String name)
!     {
          super(name);
      }
***************
*** 66,105 ****
      public void testJspTag() throws ParserException
      {
!         if (JSP_TESTS_ENABLED)
!         {
!             String contents = "jsp:useBean id=\"transfer\" scope=\"session\" class=\"com.bank.PageBean\"/";
!             String jsp = "<" + contents + ">";
!             String contents2 = "%\n"+
!                 " org.apache.struts.util.BeanUtils.populate(transfer, request);\n"+
!                 " if(request.getParameter(\"marker\") == null)\n"+
!                 " // initialize a pseudo-property\n"+
!                 " transfer.set(\"days\", java.util.Arrays.asList(\n"+
!                 " new String[] {\"1\", \"2\", \"3\", \"4\", \"31\"}));\n"+
!                 " else \n"+
!                 " if(transfer.validate(request))\n"+
!                 " %";
!             createParser(
!                 "<%@ taglib uri=\"/WEB-INF/struts.tld\" prefix=\"struts\" %>\n"+
!                 jsp + "\n" +
!                 "<" + contents2 + ">\n<jsp:forward page=\"transferConfirm.jsp\"/><%\n"+
!                 "%>");
!             Parser.setLineSeparator("\r\n");
!             parser.setNodeFactory (new PrototypicalNodeFactory (new JspTag ()));
!             parseAndAssertNodeCount(8);
!             // The first node should be an JspTag
!             assertTrue("Node 1 should be an JspTag",node[0] instanceof JspTag);
!             JspTag tag = (JspTag)node[0];
!             assertStringEquals("Contents of the tag","%@ taglib uri=\"/WEB-INF/struts.tld\" prefix=\"struts\" %",tag.getText());
!             // The second node should be a normal tag
!             assertTrue("Node 3 should be a normal Tag",node[2] instanceof Tag);
!             Tag htag = (Tag)node[2];
!             assertStringEquals("Contents of the tag",contents,htag.getText());
!             assertStringEquals("html",jsp,htag.toHtml());
!             // The third node should be an JspTag
!             assertTrue("Node 5 should be an JspTag",node[4] instanceof JspTag);
!             JspTag tag2 = (JspTag)node[4];
!             assertStringEquals("Contents of the tag",contents2,tag2.getText());
!         }
      }

--- 67,103 ----
      public void testJspTag() throws ParserException
      {
!         String contents = "jsp:useBean id=\"transfer\" scope=\"session\" class=\"com.bank.PageBean\"/";
!         String jsp = "<" + contents + ">";
!         String contents2 = "%\n"+
!             " org.apache.struts.util.BeanUtils.populate(transfer, request);\n"+
!             " if(request.getParameter(\"marker\") == null)\n"+
!             " // initialize a pseudo-property\n"+
!             " transfer.set(\"days\", java.util.Arrays.asList(\n"+
!             " new String[] {\"1\", \"2\", \"3\", \"4\", \"31\"}));\n"+
!             " else \n"+
!             " if(transfer.validate(request))\n"+
!             " %";
!         createParser(
!             "<%@ taglib uri=\"/WEB-INF/struts.tld\" prefix=\"struts\" %>\n"+
!             jsp + "\n" +
!             "<" + contents2 + ">\n<jsp:forward page=\"transferConfirm.jsp\"/><%\n"+
!             "%>");
!         Parser.setLineSeparator("\r\n");
!         parser.setNodeFactory (new PrototypicalNodeFactory (new JspTag ()));
!         parseAndAssertNodeCount(8);
!         // The first node should be a JspTag
!         assertTrue("Node 1 should be a JspTag",node[0] instanceof JspTag);
!         JspTag tag = (JspTag)node[0];
!         assertStringEquals("Contents of the tag","%@ taglib uri=\"/WEB-INF/struts.tld\" prefix=\"struts\" %",tag.getText());
!         // The second node should be a normal tag
!         assertTrue("Node 3 should be a normal Tag",node[2] instanceof Tag);
!         Tag htag = (Tag)node[2];
!         assertStringEquals("Contents of the tag",contents,htag.getText());
!         assertStringEquals("html",jsp,htag.toHtml());
!         // The third node should be an JspTag
!         assertTrue("Node 5 should be an JspTag",node[4] instanceof JspTag);
!         JspTag tag2 = (JspTag)node[4];
!         assertStringEquals("Contents of the tag",contents2,tag2.getText());
      }

***************
*** 121,169 ****
       * Creation date: (6/17/2001 4:01:06 PM)
       */
!     public void testToHTML() throws ParserException
      {
!         if (JSP_TESTS_ENABLED)
!         {
!             createParser(
!                 "<%@ taglib uri=\"/WEB-INF/struts.tld\" prefix=\"struts\" %>\n"+
!                 "<jsp:useBean id=\"transfer\" scope=\"session\" class=\"com.bank.PageBean\"/>\n"+
!                 "<%\n"+
!                 " org.apache.struts.util.BeanUtils.populate(transfer, request);\n"+
!                 " if(request.getParameter(\"marker\") == null)\n"+
!                 " // initialize a pseudo-property\n"+
!                 " transfer.set(\"days\", java.util.Arrays.asList(\n"+
!                 " new String[] {\"1\", \"2\", \"3\", \"4\", \"31\"}));\n"+
!                 " else \n"+
!                 " if(transfer.validate(request))\n"+
!                 " %><jsp:forward page=\"transferConfirm.jsp\"/><%\n"+
!                 "%>\n");
!             Parser.setLineSeparator("\r\n");
!             parser.setNodeFactory (new PrototypicalNodeFactory (new JspTag ()));
!             parseAndAssertNodeCount(8);
!             // The first node should be an JspTag
!             assertTrue("Node 1 should be an JspTag",node[0] instanceof JspTag);
!             JspTag tag = (JspTag)node[0];
!             assertEquals("Raw String of the first JSP tag","<%@ taglib uri=\"/WEB-INF/struts.tld\" prefix=\"struts\" %>",tag.toHtml());
!             // The third node should be an JspTag
!             assertTrue("Node 5 should be an JspTag",node[5] instanceof JspTag);
!             JspTag tag2 = (JspTag)node[8];
!             String expected = "<%\r\n"+
!                 " org.apache.struts.util.BeanUtils.populate(transfer, request);\r\n"+
!                 " if(request.getParameter(\"marker\") == null)\r\n"+
!                 " // initialize a pseudo-property\r\n"+
!                 " transfer.set(\"days\", java.util.Arrays.asList(\r\n"+
!                 " new String[] {\"1\", \"2\", \"3\", \"4\", \"31\"}));\r\n"+
!                 " else \r\n"+
!                 " if(transfer.validate(request))\r\n"+
!                 " %>";
!             assertEquals("Raw String of the second JSP tag",expected,tag2.toHtml());
!             assertTrue("Node 4 should be an HTMLJspTag",node[4] instanceof JspTag);
!             JspTag tag4 = (JspTag)node[4];
!             expected = "<%\r\n"+
!                 "%>";
!             assertEquals("Raw String of the fourth JSP tag",expected,tag4.toHtml());
!         }
      }

--- 119,158 ----
       * Creation date: (6/17/2001 4:01:06 PM)
       */
!     public void testToHtml () throws ParserException
      {
!         String guts = "\n"+
!             " org.apache.struts.util.BeanUtils.populate(transfer, request);\n"+
!             " if(request.getParameter(\"marker\") == null)\n"+
!             " // initialize a pseudo-property\n"+
!             " transfer.set(\"days\", java.util.Arrays.asList(\n"+
!             " new String[] {\"1\", \"2\", \"3\", \"4\", \"31\"}));\n"+
!             " else \n"+
!             " if(transfer.validate(request))\n"+
!             " ";
!         createParser(
!             "<%@ taglib uri=\"/WEB-INF/struts.tld\" prefix=\"struts\" %>\n"+
!             "<jsp:useBean id=\"transfer\" scope=\"session\" class=\"com.bank.PageBean\"/>\n"+
!             "<%" +
!             guts
!             + "%><jsp:forward page=\"transferConfirm.jsp\"/><%\n"+
!             "%>\n");
!         Parser.setLineSeparator("\r\n");
!         parser.setNodeFactory (new PrototypicalNodeFactory (new JspTag ()));
!         parseAndAssertNodeCount(8);
!         // The first node should be a JspTag
!         assertTrue("Node 1 should be a JspTag",node[0] instanceof JspTag);
!         JspTag tag = (JspTag)node[0];
!         assertEquals("Raw String of the first JSP tag","<%@ taglib uri=\"/WEB-INF/struts.tld\" prefix=\"struts\" %>",tag.toHtml());
!         // The fifth node should be a JspTag
!         assertTrue("Node 5 should be a JspTag",node[4] instanceof JspTag);
!         JspTag tag2 = (JspTag)node[4];
!         String expected = "<%" + guts + "%>";
!         assertEquals("Raw String of the second JSP tag",expected,tag2.toHtml());
!         assertTrue("Node 7 should be a JspTag",node[6] instanceof JspTag);
!         JspTag tag4 = (JspTag)node[6];
!         expected = "<%\n%>";
!         assertEquals("Raw String of the fourth JSP tag",expected,tag4.toHtml());
      }

***************
*** 183,216 ****
       * See bug #772700 Jsp Tags are not parsed correctly when in quoted attributes.
       */
!     public void testJspTagsInUnQuotedAttribes() throws ParserException {
!         // this test should pass when none of the attibutes are quoted
!         testJspTagsInAttributes("<img alt=<%=altText1%> src=<%=imgUrl1%> border=<%=borderToggle%>>");
!     }

      /**
       * See bug #772700 Jsp Tags are not parsed correctly when in quoted attributes.
       */
!     public void testJspTagsInQuotedAttribes() throws ParserException
!     {
!         // this test seems to mess up....
!         testJspTagsInAttributes("<img alt=\"<%=altText1%>\" src=\"<%=imgUrl1%>\" border=\"<%=borderToggle%>\">");
!     }

      private void testJspTagsInAttributes(String html) throws ParserException
      {
          if (JSP_TESTS_ENABLED)
          {
!             createParser(html);
!             parser.setNodeFactory (new PrototypicalNodeFactory (new JspTag ()));
!             parseAndAssertNodeCount(7);
!             assertTrue("Should be a Jsp tag but was "+node[1].getClass().getName(),node[1] instanceof JspTag);
!             assertTrue("Should be a Jsp tag but was "+node[3].getClass().getName(),node[3] instanceof JspTag);
!             assertTrue("Should be a Jsp tag but was "+node[5].getClass().getName(),node[5] instanceof JspTag);
!             assertTrue("Text Should be '<%=altText1%>'but was '" + node[1].toHtml() + "'" ,node[1].toHtml().equals("<%=altText1%>"));
!             assertTrue("Text Should be '<%=imgUrl1%>' but was '" + node[3].toHtml() + "'" ,node[3].toHtml().equals("<%=imgUrl1%>"));
!             assertTrue("Text Should be '<%=borderToggle%>' but was '" + node[5].toHtml() + "'" ,node[5].toHtml().equals("<%=borderToggle%>"));
          }
      }
  }
--- 172,209 ----
       * See bug #772700 Jsp Tags are not parsed correctly when in quoted attributes.
       */
!     public void testJspTagsInUnQuotedAttribes() throws ParserException
!     {
!         // this test should pass when none of the attibutes are quoted
!         if (JSP_TESTS_ENABLED)
!             testJspTagsInAttributes("<img alt=<%=altText1%> src=<%=imgUrl1%> border=<%=borderToggle%>>");
!     }

      /**
       * See bug #772700 Jsp Tags are not parsed correctly when in quoted attributes.
       */
!     public void testJspTagsInQuotedAttribes() throws ParserException
!     {
!         // this test seems to mess up....
!         testJspTagsInAttributes("<img alt=\"<%=altText1%>\" src=\"<%=imgUrl1%>\" border=\"<%=borderToggle%>\">");
!     }

      private void testJspTagsInAttributes(String html) throws ParserException
      {
+         createParser (html);
+         parser.setNodeFactory (new PrototypicalNodeFactory (new JspTag ()));
          if (JSP_TESTS_ENABLED)
          {
!             parseAndAssertNodeCount (7);
!             assertTrue ("Should be a Jsp tag but was " + node[1].getClass().getName(), node[1] instanceof JspTag);
!             assertTrue ("Should be a Jsp tag but was " + node[3].getClass().getName(), node[3] instanceof JspTag);
!             assertTrue ("Should be a Jsp tag but was " + node[5].getClass().getName(), node[5] instanceof JspTag);
!             assertTrue ("Text Should be '<%=altText1%>'but was '" + node[1].toHtml() + "'" , node[1].toHtml().equals("<%=altText1%>"));
!             assertTrue ("Text Should be '<%=imgUrl1%>' but was '" + node[3].toHtml() + "'" , node[3].toHtml().equals("<%=imgUrl1%>"));
!             assertTrue ("Text Should be '<%=borderToggle%>' but was '" + node[5].toHtml() + "'" , node[5].toHtml().equals("<%=borderToggle%>"));
          }
+         else
+             parseAndAssertNodeCount (1);
      }
  }
|
From: <der...@pr...> - 2004-01-26 12:42:00
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3286/tests

Modified Files:
    ParserTest.java
Log Message:
Fix bug #880283 Character ">" erroneously inserted by Lexer.
Some jsp tags are now handled in a separate jsp parse in the lexer.
Jsp tags embedded as attributes are still not handled.
Refer to bug #772700 Jsp Tags are not parsed correctly when in quoted attributes, which is now reversed (i.e. in quotes are OK, outside of quotes causes problems), but this points out a deficiency in the data structure holding tag contents (attribute lists) that doesn't provide for tags within attributes.

Index: ParserTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/ParserTest.java,v
retrieving revision 1.54
retrieving revision 1.55
diff -C2 -d -r1.54 -r1.55
*** ParserTest.java 14 Jan 2004 02:53:47 -0000 1.54
--- ParserTest.java 24 Jan 2004 17:15:43 -0000 1.55
***************
*** 845,847 ****
--- 845,861 ----
          }
      }
+     /**
+      * Test reproducing a java.lang.StackOverflowError.
+      */
+     public void testXMLTypeToString () throws Exception
+     {
+         String guts;
+         String output;
+
+         guts = "TD width=\"69\"/";
+         createParser ("<" + guts + ">");
+         parseAndAssertNodeCount (1);
+         output = node[0].toString (); // this was where StackOverflow was thrown
+         assertTrue ("bad toString()", -1 != output.indexOf (guts));
+     }
  }
|
From: <der...@pr...> - 2004-01-26 03:23:55
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23128/tests/tagTests

Modified Files:
    BulletListTagTest.java DivTagTest.java SpanTagTest.java
Log Message:
Fix bug #882940 empty applet tag contents causes NullPointerException
Also found and fixed other similar problems where getChildren() could return null.
Then changed table row and column handling to handle rows and columns embedded within other tags.

Index: BulletListTagTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/BulletListTagTest.java,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** BulletListTagTest.java 7 Dec 2003 23:41:43 -0000 1.1
--- BulletListTagTest.java 24 Jan 2004 23:58:06 -0000 1.2
***************
*** 64,68 ****
          NodeList nestedBulletLists =
              ((CompositeTag)node[0]).searchFor(
!                 BulletList.class
              );
          assertEquals(
--- 64,69 ----
          NodeList nestedBulletLists =
              ((CompositeTag)node[0]).searchFor(
!                 BulletList.class,
!                 true
              );
          assertEquals(

Index: DivTagTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/DivTagTest.java,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** DivTagTest.java 7 Dec 2003 23:41:43 -0000 1.1
--- DivTagTest.java 24 Jan 2004 23:58:07 -0000 1.2
***************
*** 52,56 ****
          assertType("node should be table",TableTag.class,node[0]);
          TableTag tableTag = (TableTag)node[0];
!         Div div = (Div)tableTag.searchFor(Div.class).toNodeArray()[0];
          assertEquals("div contents","some text",div.toPlainTextString());
      }
--- 52,56 ----
          assertType("node should be table",TableTag.class,node[0]);
          TableTag tableTag = (TableTag)node[0];
!         Div div = (Div)tableTag.searchFor(Div.class, true).toNodeArray()[0];
          assertEquals("div contents","some text",div.toPlainTextString());
      }

Index: SpanTagTest.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/tagTests/SpanTagTest.java,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** SpanTagTest.java 7 Dec 2003 23:41:43 -0000 1.1
--- SpanTagTest.java 24 Jan 2004 23:58:07 -0000 1.2
***************
*** 67,71 ****
          assertType("node",TableColumn.class,node[0]);
          TableColumn col = (TableColumn)node[0];
!         Node spans [] = col.searchFor(Span.class).toNodeArray();
          assertEquals("number of spans found",2,spans.length);
          assertStringEquals(
--- 67,71 ----
          assertType("node",TableColumn.class,node[0]);
          TableColumn col = (TableColumn)node[0];
!         Node spans [] = col.searchFor(Span.class, true).toNodeArray();
          assertEquals("number of spans found",2,spans.length);
          assertStringEquals(
|
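The tests above now pass a second argument of true to searchFor(). The diff does not say what that flag means; assuming it requests a recursive search through nested children (which would fit the log message about rows and columns embedded within other tags), the difference can be sketched with a toy tree. ToyNode and its methods are hypothetical and far simpler than HTMLParser's CompositeTag; this is an illustration of the idea, not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy node; HTMLParser's CompositeTag is richer than this.
class ToyNode
{
    final String name;
    final List<ToyNode> children = new ArrayList<ToyNode> ();

    ToyNode (String name)
    {
        this.name = name;
    }

    // Append a child and return this node so that trees can be built inline.
    ToyNode add (ToyNode child)
    {
        children.add (child);
        return (this);
    }

    // Collect nodes with the given name: direct children only, or all descendants if recursive.
    List<ToyNode> searchFor (String wanted, boolean recursive)
    {
        List<ToyNode> found = new ArrayList<ToyNode> ();
        for (ToyNode child : children)
        {
            if (child.name.equals (wanted))
                found.add (child);
            if (recursive)
                found.addAll (child.searchFor (wanted, true));
        }
        return (found);
    }

    public static void main (String[] args)
    {
        // table > tr > td > div
        ToyNode table = new ToyNode ("table").add (
            new ToyNode ("tr").add (
                new ToyNode ("td").add (new ToyNode ("div"))));
        System.out.println (table.searchFor ("div", false).size ()); // prints 0
        System.out.println (table.searchFor ("div", true).size ());  // prints 1
    }
}
```

Under that assumption, a non-recursive search from the table misses the div because it sits three levels down, which is the situation DivTagTest exercises.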