htmlparser-cvs Mailing List for HTML Parser (Page 8)
Brought to you by:
derrickoswald
From: Derrick O. <der...@us...> - 2005-04-10 23:20:59
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30655/htmlparser/src/org/htmlparser/scanners

Modified Files:
    CompositeTagScanner.java
Log Message:
Documentation revamp part one.
Deprecated node decorators.
Added doSemanticAction for Text and Comment nodes.
Added missing sitecapturer scripts.
Fixed DOS batch files to work when called from any location.

Index: CompositeTagScanner.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/CompositeTagScanner.java,v
retrieving revision 1.89
retrieving revision 1.90
diff -C2 -d -r1.89 -r1.90
*** CompositeTagScanner.java 31 Jul 2004 16:42:32 -0000 1.89
--- CompositeTagScanner.java 10 Apr 2005 23:20:44 -0000 1.90
***************
*** 233,237 ****
--- 233,240 ----
          }
          else
+         {
              addChild (ret, node);
+             node.doSemanticAction ();
+         }
      }
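
The hunk above makes the scanner call doSemanticAction () on a child node right after the node is added to its parent. A minimal, self-contained sketch of that call order follows. It assumes hypothetical stand-in classes (SketchPage, SketchNode, SketchBaseNode, SemanticActionSketch); none of them are org.htmlparser classes, and unlike the real method this doSemanticAction takes the page as an argument and does not throw ParserException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not org.htmlparser classes.
class SketchPage {
    String baseUrl = "";
}

class SketchNode {
    final String text;

    SketchNode(String text) { this.text = text; }

    // Default semantic action: do nothing.
    void doSemanticAction(SketchPage page) { }
}

class SketchBaseNode extends SketchNode {
    SketchBaseNode(String href) { super(href); }

    // A node whose "meaning" is to set the base URL used for the rest of the page.
    @Override
    void doSemanticAction(SketchPage page) { page.baseUrl = text.trim(); }
}

class SemanticActionSketch {
    // Add each node to the child list, then perform its semantic action.
    static List<SketchNode> scan(List<SketchNode> nodes, SketchPage page) {
        List<SketchNode> children = new ArrayList<>();
        for (SketchNode node : nodes) {
            children.add(node);
            node.doSemanticAction(page);
        }
        return children;
    }

    public static void main(String[] args) {
        SketchPage page = new SketchPage();
        List<SketchNode> nodes = new ArrayList<>();
        nodes.add(new SketchNode("hello"));
        nodes.add(new SketchBaseNode(" http://example.com/ "));
        List<SketchNode> children = scan(nodes, page);
        System.out.println(children.size() + " children, base=" + page.baseUrl);
    }
}
```

In this sketch the plain node's action is a no-op, so only the base-URL node changes the page state.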
From: Derrick O. <der...@us...> - 2005-04-10 23:20:59
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodes
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30655/htmlparser/src/org/htmlparser/nodes

Modified Files:
    AbstractNode.java RemarkNode.java TagNode.java TextNode.java package.html
Log Message:
Documentation revamp part one.
Deprecated node decorators.
Added doSemanticAction for Text and Comment nodes.
Added missing sitecapturer scripts.
Fixed DOS batch files to work when called from any location.

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodes/package.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** package.html 24 May 2004 16:18:37 -0000 1.1
--- package.html 10 Apr 2005 23:20:44 -0000 1.2
***************
*** 58,66 ****
  and child and parent references. Only the {@link org.htmlparser.nodes.TagNode TagNode} objects contain a list of {@link org.htmlparser.Attribute Attribute} objects.
! <p>
! The {@link org.htmlparser.lexer.Lexer Lexer} parses an HTML stream into a contiguous stream of these
! nodes. The {@link org.htmlparser.Parser Parser} returns specific {@link
! org.htmlparser.tags Tag} objects, which are subclasses of the {@link org.htmlparser.nodes.TagNode TagNode}
! class.
  <p>
  </BODY>
--- 58,67 ----
  and child and parent references. Only the {@link org.htmlparser.nodes.TagNode TagNode} objects contain a list of {@link org.htmlparser.Attribute Attribute} objects.
! <p>The {@link org.htmlparser.lexer.Lexer Lexer} parses an HTML stream into a
! contiguous stream of these nodes.</p>
! <p>The {@link org.htmlparser.Parser Parser} returns either these nodes or specific
! {@link org.htmlparser.tags Tag} objects (which are subclasses of TagNode)
! for tags with names that have been registered via
! {@link org.htmlparser.PrototypicalNodeFactory#registerTag registerTag()}.
  <p>
  </BODY>

Index: TextNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodes/TextNode.java,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** TextNode.java 17 Jul 2004 13:45:04 -0000 1.3
--- TextNode.java 10 Apr 2005 23:20:44 -0000 1.4
***************
*** 71,75 ****
      /**
!      * Returns the text of the string line.
       */
      public String getText ()
--- 71,77 ----
      /**
!      * Returns the text of the node.
!      * This is the same as {@link #toHtml} for this type of node.
!      * @return The contents of this text node.
       */
      public String getText ()
***************
*** 89,92 ****
--- 91,99 ----
      }
+     /**
+      * Returns the text of the node.
+      * This is the same as {@link #toHtml} for this type of node.
+      * @return The contents of this text node.
+      */
      public String toPlainTextString ()
      {
***************
*** 94,97 ****
--- 101,108 ----
      }
+     /**
+      * Returns the text of the node.
+      * @return The contents of this text node.
+      */
      public String toHtml ()
      {

Index: AbstractNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodes/AbstractNode.java,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** AbstractNode.java 17 Jul 2004 13:45:04 -0000 1.3
--- AbstractNode.java 10 Apr 2005 23:20:44 -0000 1.4
***************
*** 37,41 ****
  /**
!  * AbstractNode, which implements the Node interface, is the base class for all types of nodes, including tags, string elements, etc
   */
  public abstract class AbstractNode implements Node, Serializable
--- 37,44 ----
  /**
!  * The concrete base class for all types of nodes (tags, text remarks).
!  * This class provides basic functionality to hold the {@link Page}, the
!  * starting and ending position in the page, the parent and the list of
!  * {@link NodeList children}.
   */
  public abstract class AbstractNode implements Node, Serializable
***************
*** 95,130 ****
      /**
!      * Returns a string representation of the node. This is an important method, it allows a simple string transformation
!      * of a web page, regardless of a node.<br>
!      * Typical application code (for extracting only the text from a web page) would then be simplified to :<br>
       * <pre>
       * Node node;
!      * for (Enumeration e = parser.elements();e.hasMoreElements();) {
!      *     node = (Node)e.nextElement();
!      *     System.out.println(node.toPlainTextString()); // Or do whatever processing you wish with the plain text string
       * }
       * </pre>
       */
!     public abstract String toPlainTextString();
      /**
!      * This method will make it easier when using html parser to reproduce html pages (with or without modifications)
!      * Applications reproducing html can use this method on nodes which are to be used or transferred as they were
!      * recieved, with the original html
       */
!     public abstract String toHtml();
      /**
!      * Return the string representation of the node.
       * Subclasses must define this method, and this is typically to be used in the manner<br>
       * <pre>System.out.println(node)</pre>
!      * @return java.lang.String
       */
!     public abstract String toString();
      /**
       * Collect this node and its child nodes (if-applicable) into the collectionList parameter, provided the node
       * satisfies the filtering criteria.<P>
!      *
       * This mechanism allows powerful filtering code to be written very easily,
       * without bothering about collection of embedded tags separately.
--- 98,141 ----
      /**
!      * Returns a string representation of the node.
!      * It allows a simple string transformation
!      * of a web page, regardless of node type.<br>
!      * Typical application code (for extracting only the text from a web page)
!      * would then be simplified to:<br>
       * <pre>
       * Node node;
!      * for (Enumeration e = parser.elements (); e.hasMoreElements (); )
!      * {
!      *     node = (Node)e.nextElement();
!      *     System.out.println (node.toPlainTextString ());
!      *     // or do whatever processing you wish with the plain text string
       * }
       * </pre>
+      * @return The 'browser' content of this node.
       */
!     public abstract String toPlainTextString ();
      /**
!      * Return the HTML that generated this node.
!      * This method will make it easier when using html parser to reproduce html
!      * pages (with or without modifications).
!      * Applications reproducing html can use this method on nodes which are to
!      * be used or transferred as they were recieved, with the original html.
!      * @return The HTML code for this node.
       */
!     public abstract String toHtml ();
      /**
!      * Return a string representation of the node.
       * Subclasses must define this method, and this is typically to be used in the manner<br>
       * <pre>System.out.println(node)</pre>
!      * @return A textual representation of the node suitable for debugging
       */
!     public abstract String toString ();
      /**
       * Collect this node and its child nodes (if-applicable) into the collectionList parameter, provided the node
       * satisfies the filtering criteria.<P>
!      *
       * This mechanism allows powerful filtering code to be written very easily,
       * without bothering about collection of embedded tags separately.
***************
*** 134,138 ****
       * current node is a {@link org.htmlparser.tags.CompositeTag}, and going through its children.
       * So this method provides a convenient way to do this.<P>
!      *
       * Using collectInto(), programs get a lot shorter. Now, the code to
       * extract all links from a page would look like:
--- 145,149 ----
       * current node is a {@link org.htmlparser.tags.CompositeTag}, and going through its children.
       * So this method provides a convenient way to do this.<P>
!      *
       * Using collectInto(), programs get a lot shorter. Now, the code to
       * extract all links from a page would look like:
***************
*** 145,149 ****
       * Thus, collectionList will hold all the link nodes, irrespective of how
       * deep the links are embedded.<P>
!      *
       * Another way to accomplish the same objective is:
       * <pre>
--- 156,160 ----
       * Thus, collectionList will hold all the link nodes, irrespective of how
       * deep the links are embedded.<P>
!      *
       * Another way to accomplish the same objective is:
       * <pre>
***************
*** 155,158 ****
--- 166,171 ----
       * This is slightly less specific because the LinkTag class may be
       * registered for more than one node name, e.g. <LINK> tags too.
+      * @param list The node list to collect acceptable nodes into.
+      * @param filter The filter to determine which nodes are retained.
       */
      public void collectInto (NodeList list, NodeFilter filter)
***************
*** 163,184 ****
      /**
-      * Returns the beginning position of the tag.
-      * @deprecated Use {@link #getStartPosition}.
-      */
-     public int elementBegin()
-     {
-         return (getStartPosition ());
-     }
-
-     /**
-      * Returns the ending position fo the tag
-      * @deprecated Use {@link #getEndPosition}.
-      */
-     public int elementEnd()
-     {
-         return (getEndPosition ());
-     }
-
-     /**
       * Get the page this node came from.
       * @return The page that supplied this node.
--- 176,179 ----
***************
*** 234,245 ****
      }
-     public abstract void accept (NodeVisitor visitor);
-
      /**
!      * @deprecated - use toHtml() instead
       */
!     public final String toHTML() {
!         return toHtml();
!     }
      /**
--- 229,237 ----
      }
      /**
!      * Visit this node.
!      * @param visitor The visitor that is visiting this node.
       */
!     public abstract void accept (NodeVisitor visitor);
      /**
***************
*** 283,289 ****
      /**
!      * Returns the text of the string line
       */
!     public String getText() {
          return null;
      }
--- 275,283 ----
      /**
!      * Returns the text of the node.
!      * @return The text of this node. The default is <code>null</code>.
       */
!     public String getText ()
!     {
          return null;
      }
***************
*** 293,298 ****
       * @param text The new text for the node.
       */
!     public void setText(String text) {
!
      }
--- 287,292 ----
       * @param text The new text for the node.
       */
!     public void setText(String text)
!     {
      }
***************
*** 300,305 ****
       * Perform the meaning of this tag.
       * The default action is to do nothing.
       */
!     public void doSemanticAction () throws ParserException
      {
      }
--- 294,303 ----
       * Perform the meaning of this tag.
       * The default action is to do nothing.
+      * @exception ParserException <em>Not used.</em> Provides for subclasses
+      * that may want to indicate an exceptional condition.
       */
!     public void doSemanticAction ()
!         throws
!             ParserException
      {
      }

Index: TagNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodes/TagNode.java,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** TagNode.java 31 Jul 2004 16:42:35 -0000 1.5
--- TagNode.java 10 Apr 2005 23:20:44 -0000 1.6
***************
*** 319,328 ****
      }
!     /*
!      * Sets the attributes.
!      * @param attribs The attribute collection to set.
!      * Each element is an {@link Attribute Attribute}.
!      * The first attribute in the list must be the tag name (
!      * <code>isStandalone()</code> returns <code>true</code>).
       */
      public void setAttributeEx (Attribute attribute)
--- 319,326 ----
      }
!     /**
!      * Set an attribute.
!      * @param attribute The attribute to set.
!      * @see #setAttribute(Attribute)
       */
      public void setAttributeEx (Attribute attribute)
***************
*** 374,387 ****
      /**
-      * Eqivalent to <code>getAttribute (name)</code>.
-      * @param name Name of attribute.
-      * @deprecated use getAttribute instead
-      */
-     public String getParameter (String name)
-     {
-         return (getAttribute (name));
-     }
-
-     /**
       * Gets the attributes in the tag.
       * @return Returns the list of {@link Attribute Attributes} in the tag.
--- 372,375 ----
***************
*** 533,537 ****
          String ret;
-         //ret = mPage.getText (elementBegin () + 1, elementEnd () - 1);
          ret = toHtml ();
          ret = ret.substring (1, ret.length () - 1);
--- 521,524 ----
***************
*** 766,769 ****
--- 753,757 ----
       * Based on <code>isEndTag()</code>, calls either <code>visitTag()</code> or
       * <code>visitEndTag()</code>.
+      * @param visitor The visitor that is visiting this node.
       */
      public void accept (NodeVisitor visitor)

Index: RemarkNode.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/nodes/RemarkNode.java,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** RemarkNode.java 17 Jul 2004 13:45:04 -0000 1.3
--- RemarkNode.java 10 Apr 2005 23:20:44 -0000 1.4
***************
*** 109,118 ****
      }
!     public String toPlainTextString()
      {
          return (getText());
      }
!
!     public String toHtml()
      {
          StringBuffer buffer;
--- 109,126 ----
      }
!     /**
!      * Return the remark text.
!      * @return The HTML comment.
!      */
!     public String toPlainTextString ()
      {
          return (getText());
      }
!
!     /**
!      * Return The full HTML remark.
!      * @return The comment, i.e. {@.html <!-- this is a comment -->}.
!      */
!     public String toHtml ()
      {
          StringBuffer buffer;
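
The Javadoc in the AbstractNode diff above describes three operations: toPlainTextString () (the 'browser' text), toHtml () (the markup that produced the node) and collectInto (list, filter) (gather matching nodes however deeply they are nested). A rough, self-contained sketch of how these might behave on a small tree follows. MiniNode and CollectIntoSketch are hypothetical stand-ins, not the org.htmlparser classes, and a java.util.function.Predicate is assumed in place of the library's NodeFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a node tree; not the org.htmlparser API.
class MiniNode {
    final String name;   // tag name, or null for a text node
    final String text;   // content of a text node, otherwise null
    final List<MiniNode> children = new ArrayList<>();

    MiniNode(String name, String text) {
        this.name = name;
        this.text = text;
    }

    static MiniNode tag(String name, MiniNode... kids) {
        MiniNode node = new MiniNode(name, null);
        for (MiniNode kid : kids) node.children.add(kid);
        return node;
    }

    static MiniNode text(String text) {
        return new MiniNode(null, text);
    }

    // Add this node if the filter accepts it, then recurse into the children.
    void collectInto(List<MiniNode> list, Predicate<MiniNode> filter) {
        if (filter.test(this)) list.add(this);
        for (MiniNode child : children) child.collectInto(list, filter);
    }

    // Text only, roughly what a browser would display.
    String toPlainTextString() {
        if (name == null) return text;
        StringBuilder sb = new StringBuilder();
        for (MiniNode child : children) sb.append(child.toPlainTextString());
        return sb.toString();
    }

    // Start tag, children and end tag regenerated from the tree.
    String toHtml() {
        if (name == null) return text;
        StringBuilder sb = new StringBuilder("<" + name + ">");
        for (MiniNode child : children) sb.append(child.toHtml());
        return sb.append("</").append(name).append(">").toString();
    }
}

class CollectIntoSketch {
    public static void main(String[] args) {
        MiniNode root = MiniNode.tag("P",
            MiniNode.text("See "),
            MiniNode.tag("B", MiniNode.tag("A", MiniNode.text("here"))),
            MiniNode.tag("A", MiniNode.text("there")));

        // Collect every A node, including the one nested inside B.
        List<MiniNode> links = new ArrayList<>();
        root.collectInto(links, n -> "A".equals(n.name));

        System.out.println(links.size());
        System.out.println(root.toPlainTextString());
        System.out.println(root.toHtml());
    }
}
```

The filter is applied at every level of the tree, which is why the caller does not need to walk composite nodes by hand.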
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30655/htmlparser/src/org/htmlparser/tags

Modified Files:
    BaseHrefTag.java BodyTag.java CompositeTag.java DoctypeTag.java FormTag.java FrameSetTag.java FrameTag.java HeadTag.java ImageTag.java JspTag.java LabelTag.java LinkTag.java MetaTag.java OptionTag.java ScriptTag.java SelectTag.java TableRow.java TableTag.java TextareaTag.java TitleTag.java package.html
Log Message:
Documentation revamp part one.
Deprecated node decorators.
Added doSemanticAction for Text and Comment nodes.
Added missing sitecapturer scripts.
Fixed DOS batch files to work when called from any location.

Index: BaseHrefTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/BaseHrefTag.java,v
retrieving revision 1.39
retrieving revision 1.40
diff -C2 -d -r1.39 -r1.40
*** BaseHrefTag.java 2 Jul 2004 00:49:28 -0000 1.39
--- BaseHrefTag.java 10 Apr 2005 23:20:45 -0000 1.40
***************
*** 61,66 ****
      /**
!      * Get the value of the HREF attribute, if any.
!      * @return The HREF value, with the last slash removed, if any.
       */
      public String getBaseUrl()
--- 61,66 ----
      /**
!      * Get the value of the <code>HREF</code> attribute, if any.
!      * @return The <code>HREF</code> value, with the leading and trailing whitespace removed, if any.
       */
      public String getBaseUrl()
***************
*** 76,79 ****
--- 76,83 ----
      }
+     /**
+      * Set the value of the <code>HREF</code> attribute.
+      * @param base The new <code>HREF</code> value.
+      */
      public void setBaseUrl (String base)
      {
***************
*** 84,87 ****
--- 88,92 ----
       * Perform the meaning of this tag.
       * This sets the base URL to use for the rest of the page.
+      * @exception ParserException If setting the base URL fails.
       */
      public void doSemanticAction () throws ParserException
***************
*** 91,97 ****
          page = getPage ();
          if (null != page)
-         {
              page.setBaseUrl (getBaseUrl ());
-         }
      }
  }
--- 96,100 ----

Index: OptionTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/OptionTag.java,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** OptionTag.java 2 Jan 2004 16:24:55 -0000 1.36
--- OptionTag.java 10 Apr 2005 23:20:45 -0000 1.37
***************
*** 82,96 ****
      /**
!      * Get the value of the value attribute.
       */
!     public String getValue()
      {
!         return (getAttribute("VALUE"));
      }
      /**
       * Set the value of the value attribute.
       */
!     public void setValue(String value)
      {
          this.setAttribute("VALUE",value);
--- 82,99 ----
      /**
!      * Get the <code>VALUE</code> attribute, if any.
!      * @return The value of the <code>VALUE</code> attribute,
!      * or <code>null</code> if the attribute doesn't exist.
       */
!     public String getValue ()
      {
!         return (getAttribute ("VALUE"));
      }
      /**
       * Set the value of the value attribute.
+      * @param value The new value of the <code>VALUE</code> attribute.
       */
!     public void setValue (String value)
      {
          this.setAttribute("VALUE",value);
***************
*** 98,102 ****
      /**
!      * Get the text of this optin.
       */
      public String getOptionText()
--- 101,106 ----
      /**
!      * Get the text of this option.
!      * @return The textual contents of this <code>OPTION</code> tag.
       */
      public String getOptionText()
***************
*** 105,108 ****
--- 109,116 ----
      }
+     /**
+      * Return a string representation of this node suitable for debugging.
+      * @return The value and text of this tag in a string.
+      */
      public String toString()
      {

Index: ScriptTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/ScriptTag.java,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** ScriptTag.java 29 Feb 2004 01:38:36 -0000 1.37
--- ScriptTag.java 10 Apr 2005 23:20:45 -0000 1.38
***************
*** 78,86 ****
      /**
!      * Get the language attribute value.
       */
      public String getLanguage()
      {
!         return (getAttribute("LANGUAGE"));
      }
--- 78,87 ----
      /**
!      * Get the <code>LANGUAGE</code> attribute, if any.
!      * @return The scripting language.
       */
      public String getLanguage()
      {
!         return (getAttribute ("LANGUAGE"));
      }
***************
*** 113,121 ****
      /**
!      * Get the type attribute value.
       */
      public String getType()
      {
!         return (getAttribute("TYPE"));
      }
--- 114,123 ----
      /**
!      * Get the <code>TYPE</code> attribute, if any.
!      * @return The script mime type.
       */
      public String getType()
      {
!         return (getAttribute ("TYPE"));
      }
***************
*** 130,135 ****
      /**
!      * Set the type of the script tag.
!      * @param type The new type value.
       */
      public void setType (String type)
--- 132,137 ----
      /**
!      * Set the mime type of the script tag.
!      * @param type The new mime type.
       */
      public void setType (String type)
***************
*** 138,142 ****
      }
!     protected void putChildrenInto(StringBuffer sb)
      {
          Node node;
--- 140,148 ----
      }
!     /**
!      * Places the script contents into the provided buffer.
!      * @param sb The buffer to add the script to.
!      */
!     protected void putChildrenInto (StringBuffer sb)
      {
          Node node;
***************
*** 155,159 ****
      /**
!      * Print the contents of the script tag.
       */
      public String toString()
--- 161,166 ----
      /**
!      * Print the contents of the script tag suitable for debugging display.
!      * @return The script language or type and code as a string.
       */
      public String toString()

Index: SelectTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/SelectTag.java,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** SelectTag.java 24 Jan 2004 23:57:52 -0000 1.38
--- SelectTag.java 10 Apr 2005 23:20:45 -0000 1.39
***************
*** 83,86 ****
--- 83,90 ----
      }
+     /**
+      * Get the list of options in this <code>SELECT</code> tag.
+      * @return The {@.html <OPTION>} tags contained by this tag.
+      */
      public OptionTag [] getOptionTags ()
      {

Index: TableTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableTag.java,v
retrieving revision 1.40
retrieving revision 1.41
diff -C2 -d -r1.40 -r1.41
*** TableTag.java 13 Feb 2005 20:36:00 -0000 1.40
--- TableTag.java 10 Apr 2005 23:20:45 -0000 1.41
***************
*** 117,120 ****
--- 117,123 ----
      /**
       * Get the number of rows in this table.
+      * @return The number of rows in this table.
+      * <em>Note: this is a a simple count of the number of {@.html <TR>} tags and
+      * may be incorrect if the {@.html <TR>} tags span multiple rows.</em>
       */
      public int getRowCount ()
***************
*** 125,130 ****
      /**
       * Get the row at the given index.
       */
!     public TableRow getRow (int i)
      {
          TableRow[] rows;
--- 128,135 ----
      /**
       * Get the row at the given index.
+      * @param index The row number (zero based) to get.
+      * @return The row for the given index.
       */
!     public TableRow getRow (int index)
      {
          TableRow[] rows;
***************
*** 132,137 ****
          rows = getRows ();
!         if (i < rows.length)
!             ret = rows[i];
          else
              ret = null;
--- 137,142 ----
          rows = getRows ();
!         if (index < rows.length)
!             ret = rows[index];
          else
              ret = null;
***************
*** 140,143 ****
--- 145,152 ----
      }
+     /**
+      * Return a string suitable for debugging display.
+      * @return The table as HTML, sorry.
+      */
      public String toString()
      {

Index: FrameTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FrameTag.java,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** FrameTag.java 2 Jul 2004 00:49:28 -0000 1.37
--- FrameTag.java 10 Apr 2005 23:20:45 -0000 1.38
***************
*** 83,86 ****
--- 83,91 ----
      }
+     /**
+      * Get the <code>NAME</code> attribute, if any.
+      * @return The value of the <code>NAME</code> attribute,
+      * or <code>null</code> if the attribute doesn't exist.
+      */
      public String getFrameName()
      {
***************
*** 89,93 ****
      /**
!      * Print the contents of the FrameTag.
       */
      public String toString()
--- 94,99 ----
      /**
!      * Return a string representation of the contents of this <code>FRAME</code> tag suitable for debugging.
!      * @return A string with this tag's contents.
       */
      public String toString()

Index: LabelTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/LabelTag.java,v
retrieving revision 1.35
retrieving revision 1.36
diff -C2 -d -r1.35 -r1.36
*** LabelTag.java 2 Jan 2004 16:24:55 -0000 1.35
--- LabelTag.java 10 Apr 2005 23:20:45 -0000 1.36
***************
*** 62,65 ****
--- 62,69 ----
      }
+     /**
+      * Returns the text contained inside this label tag.
+      * @return The textual contents between the {@.html <LABEL></LABEL>} pair.
+      */
      public String getLabel()
      {
***************
*** 67,73 ****
      }
      public String toString()
      {
!         return "LABEL: "+getLabel();
      }
  }
--- 71,81 ----
      }
+     /**
+      * Returns a string representation of this label tag suitable for debugging.
+      * @return A string representing this label.
+      */
      public String toString()
      {
!         return "LABEL: "+ getLabel();
      }
  }

Index: CompositeTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/CompositeTag.java,v
retrieving revision 1.79
retrieving revision 1.80
diff -C2 -d -r1.79 -r1.80
*** CompositeTag.java 31 Jul 2004 16:42:34 -0000 1.79
--- CompositeTag.java 10 Apr 2005 23:20:45 -0000 1.80
***************
*** 59,62 ****
--- 59,65 ----
      protected final static CompositeTagScanner mDefaultCompositeScanner = new CompositeTagScanner ();
+     /**
+      * Create a composite tag.
+      */
      public CompositeTag ()
      {
***************
*** 125,128 ****
--- 128,135 ----
      }
+     /**
+      * Return the textual contents of this tag and it's children.
+      * @return The 'browser' text contents of this tag.
+      */
      public String toPlainTextString() {
          StringBuffer stringRepresentation = new StringBuffer();
***************
*** 133,136 ****
--- 140,147 ----
      }
+     /**
+      * Add the textual contents of the children of this node to the buffer.
+      * @param sb The buffer to append to.
+      */
      protected void putChildrenInto(StringBuffer sb)
      {
***************
*** 145,148 ****
--- 156,163 ----
      }
+     /**
+      * Add the textual contents of the end tag of this node to the buffer.
+      * @param sb The buffer to append to.
+      */
      protected void putEndTagInto(StringBuffer sb)
      {
***************
*** 152,155 ****
--- 167,175 ----
      }
+     /**
+      * Return this tag as HTML code.
+      * @return This tag and it's contents (children) and the end tag
+      * as HTML code.
+      */
      public String toHtml() {
          StringBuffer sb = new StringBuffer();
***************
*** 158,162 ****
          {
              putChildrenInto(sb);
!             if (null != getEndTag ()) // this test if for link tags that refuse to scan because there's no HREF attribute
                  putEndTagInto(sb);
          }
--- 178,182 ----
          {
              putChildrenInto(sb);
!             if (null != getEndTag ())
                  putEndTagInto(sb);
          }
***************
*** 290,293 ****
--- 310,314 ----
       * @return int The node index in the children list of the node containing
       * the text or -1 if not found.
+      * @see #findPositionOf (String, Locale)
       */
      public int findPositionOf (String text)
***************
*** 301,307 ****
       * Text is compared without case sensitivity and conversion to uppercase
       * uses the supplied locale.
-      * @param text The text to search for.
       * @return int The node index in the children list of the node containing
       * the text or -1 if not found.
       */
      public int findPositionOf (String text, Locale locale)
--- 322,329 ----
       * Text is compared without case sensitivity and conversion to uppercase
       * uses the supplied locale.
       * @return int The node index in the children list of the node containing
       * the text or -1 if not found.
+      * @param locale The locale to use in converting to uppercase.
+      * @param text The text to search for.
       */
      public int findPositionOf (String text, Locale locale)
***************
*** 358,365 ****
      /**
!      * Collect this node and its child nodes (if-applicable) into the collectionList parameter, provided the node
!      * satisfies the filtering criteria.<P>
!      *
!      * This mechanism allows powerful filtering code to be written very easily,
       * without bothering about collection of embedded tags separately.
       * e.g. when we try to get all the links on a page, it is not possible to
--- 380,386 ----
      /**
!      * Collect this node and its child nodes (if-applicable) into the list parameter,
!      * provided the node satisfies the filtering criteria.
!      * <p>This mechanism allows powerful filtering code to be written very easily,
       * without bothering about collection of embedded tags separately.
       * e.g. when we try to get all the links on a page, it is not possible to
***************
*** 367,392 ****
       * links embedded in them. We could get the links out by checking if the
       * current node is a {@link CompositeTag}, and going through its children.
!      * So this method provides a convenient way to do this.<P>
!      *
!      * Using collectInto(), programs get a lot shorter. Now, the code to
       * extract all links from a page would look like:
       * <pre>
!      * NodeList collectionList = new NodeList();
       * NodeFilter filter = new TagNameFilter ("A");
       * for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
!      *     e.nextNode().collectInto(collectionList, filter);
       * </pre>
!      * Thus, collectionList will hold all the link nodes, irrespective of how
!      * deep the links are embedded.<P>
!      *
!      * Another way to accomplish the same objective is:
       * <pre>
!      * NodeList collectionList = new NodeList();
       * NodeFilter filter = new TagClassFilter (LinkTag.class);
       * for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
!      *     e.nextNode().collectInto(collectionList, filter);
       * </pre>
       * This is slightly less specific because the LinkTag class may be
!      * registered for more than one node name, e.g. <LINK> tags too.
       */
      public void collectInto (NodeList list, NodeFilter filter)
--- 388,414 ----
       * links embedded in them. We could get the links out by checking if the
       * current node is a {@link CompositeTag}, and going through its children.
!      * So this method provides a convenient way to do this.</p>
!      * <p>Using collectInto(), programs get a lot shorter. Now, the code to
       * extract all links from a page would look like:
       * <pre>
!      * NodeList list = new NodeList();
       * NodeFilter filter = new TagNameFilter ("A");
       * for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
!      *     e.nextNode().collectInto(list, filter);
       * </pre>
!      * Thus, <code>list</code> will hold all the link nodes, irrespective of how
!      * deep the links are embedded.</p>
!      * <p>Another way to accomplish the same objective is:
       * <pre>
!      * NodeList list = new NodeList();
       * NodeFilter filter = new TagClassFilter (LinkTag.class);
       * for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
!      *     e.nextNode().collectInto(list, filter);
       * </pre>
       * This is slightly less specific because the LinkTag class may be
!      * registered for more than one node name, e.g. <LINK> tags too.</p>
!      * @param list The list to add nodes to.
!      * @param filter The filter to apply.
!      * @see org.htmlparser.filters
       */
      public void collectInto (NodeList list, NodeFilter filter)
***************
*** 399,402 ****
--- 421,428 ----
      }
+     /**
+      * Return the HTML code for the children of this tag.
+      * @return A string with the HTML code for the contents of this tag.
+      */
      public String getChildrenHTML() {
          StringBuffer buff = new StringBuffer();
***************
*** 441,444 ****
--- 467,474 ----
      }
+     /**
+      * Return the number of child nodes in this tag.
+      * @return The child node count.
+      */
      public int getChildCount()
      {
***************
*** 450,453 ****
--- 480,491 ----
      }
+     /**
+      * Get the end tag for this tag.
+      * For example, if the node is {@.html <LABEL>The label</LABLE>}, then
+      * this method would return the {@.html </LABLE>} end tag.
+      * @return The end tag for this node.
+      * <em>Note: If the start and end position of the end tag is the same,
+      * then the end tag was injected (it's a virtual end tag).</em>
+      */
      public Tag getEndTag()
      {
***************
*** 455,461 ****
      }
!     public void setEndTag (Tag end)
      {
!         mEndTag = end;
      }
--- 493,506 ----
      }
!     /**
!      * Set the end tag for this tag.
!      * @param tag The new end tag for this tag.
!      * Note: no checking is perfromed so you can generate bad HTML by setting
!      * the end tag with a name not equal to the name of the start tag,
!      * i.e. {@.html <LABEL>The label</TITLE>}
!      */
!     public void setEndTag (Tag tag)
      {
!         mEndTag = tag;
      }
***************
*** 464,468 ****
       * it. The text node will retain links to its parents, so
       * further navigation is possible.
!      * @param searchText
       * @return The list of text nodes (recursively) found.
       */
--- 509,513 ----
       * it. The text node will retain links to its parents, so
       * further navigation is possible.
!      * @param searchText The text to search for.
       * @return The list of text nodes (recursively) found.
       */
***************
*** 490,493 ****
--- 535,542 ----
      }
+     /**
+      * Return a string representation of the contents of this tag, it's children and it's end tag suitable for debugging.
+      * @return A textual representation of the tag.
+      */
      public String toString ()
      {
***************
*** 528,531 ****
--- 577,585 ----
      }
+     /**
+      * Return a string representation of the contents of this tag, it's children and it's end tag suitable for debugging.
+      * @param level The indentation level to use.
+      * @param buffer The buffer to append to.
+      */
      public void toString (int level, StringBuffer buffer)
      {

Index: DoctypeTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/DoctypeTag.java,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** DoctypeTag.java 2 Jul 2004 00:49:28 -0000 1.38
--- DoctypeTag.java 10 Apr 2005 23:20:45 -0000 1.39
***************
*** 58,62 ****
      /**
!      * Print the contents of the document declaration tag.
       */
      public String toString()
--- 58,63 ----
      /**
!      * Return a string representation of the contents of this <code>!DOCTYPE</code> tag suitable for debugging.
!      * @return The contents of the document declaration tag as a string.
       */
      public String toString()

Index: HeadTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/HeadTag.java,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** HeadTag.java 2 Jan 2004 16:24:54 -0000 1.21
--- HeadTag.java 10 Apr 2005 23:20:45 -0000 1.22
***************
*** 81,84 ****
--- 81,88 ----
      }
+     /**
+      * Returns a string representation of this <code>HEAD</code> tag suitable for debugging.
+      * @return A string representing this tag.
+      */
      public String toString()
      {

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/package.html,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** package.html 2 Jan 2004 16:24:55 -0000 1.19
--- package.html 10 Apr 2005 23:20:45 -0000 1.20
***************
*** 30,43 ****
  </head>
  <body bgcolor="white">
! The tags package contains tag types that are created mostly by the scanners.
! Developers should familiarize themselves with this package as well. Custom scanners would need to create custom tags (Factory + Template Method).
!
! <h2>Related Documentation</h2>
!
! For overviews, tutorials, examples, guides, and tool documentation, please see:
! <ul>
! <li><a href="http://htmlparser.sourceforge.net">HTML Parser Home Page</a>
! </ul>
!
  <!-- Put @see and @since tags down here. -->
--- 30,59 ----
  </head>
  <body bgcolor="white">
! The tags package contains specific tags.
! <p>This package has implementations of tags that have functionality beyond the
! capability of a generic tag. For example, the {@.html <META>} tag has methods
! to get the {@link org.htmlparser.tags.MetaTag#getMetaContent CONTENT} and
! {@link org.htmlparser.tags.MetaTag#getMetaTagName NAME}
! attributes (although this could be done with generic attribute manipulation)
! and an implementation of
! {@link org.htmlparser.tags.MetaTag#doSemanticAction doSemanticAction}
! that alters the lexer's encoding.</p>
! <p>The classes in this package have been added in an ad-hoc fashion, with the
! most useful ones having existed a long time, while some obvious ones are rather
! new. Please feel free to add your own, and register them with the
! {@link org.htmlparser.PrototypicalNodeFactory PrototypicalNodeFactory},
! and they will be treated like any other in-built tag. In fact tags do not need
! to reside in this package.</p>
! <p>If the tag can contain other nodes, i.e. {@.html <h1>My Heading</h1>}, then
! it should derive from (i.e. be a subclass of) {@link org.htmlparser.tags.CompositeTag}.
! In this way it will inherit the
! {@link org.htmlparser.scanners.CompositeTagScanner CompositeTagScanner}
! and nodes between the start and end tag will be gathered into the list of
! children. Most of the tags in this package derive from CompositeTag, and that
! why the nodes returned from the Parser are nested.</p>
! <p>If it is a simple tag, i.e. {@.html <br>}, then it should derive from
! {@link org.htmlparser.nodes.TagNode TagNode}. See for example
! {@link org.htmlparser.tags.MetaTag}
! or {@link org.htmlparser.tags.ImageTag}.</p>
  <!-- Put @see and @since tags down here. -->

Index: LinkTag.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/LinkTag.java,v
retrieving revision 1.53
retrieving revision 1.54
diff -C2 -d -r1.53 -r1.54
*** LinkTag.java 13 Feb 2005 22:45:48 -0000 1.53
--- LinkTag.java 10 Apr 2005 23:20:45 -0000 1.54
***************
*** 118,122 ****
      /**
!      * Returns the accesskey attribute value, if any.
       */
      public String getAccessKey()
--- 118,124 ----
      /**
!      * Get the <code>ACCESSKEY</code> attribute, if any.
!      * @return The value of the <code>ACCESSKEY</code> attribute,
!      * or <code>null</code> if the attribute doesn't exist.
       */
      public String getAccessKey()
***************
*** 130,133 ****
--- 132,136 ----
       * off the front (if those predicates return <code>true</code>) but not
       * for other protocols. Don't ask me why, it's a legacy thing.
+      * @return The URL for this <code>A</code> tag.
       */
      public String getLink()
***************
*** 158,162 ****
      /**
!      * Returns the text contained inside this link tag
       */
      public String getLinkText()
--- 161,166 ----
      /**
!      * Returns the text contained inside this link tag.
!      * @return The textual contents between the {@.html <A></A>} pair.
       */
      public String getLinkText()
***************
*** 259,263 ****
      /**
! 
* Print the contents of this Link Node */ public String toString() --- 263,268 ---- /** ! * Return the contents of this link node as a string suitable for debugging. ! * @return A string representation of this node. */ public String toString() *************** *** 287,290 **** --- 292,299 ---- } + /** + * Set the <code>HREF</code> attribute. + * @param link The new value of the <code>HREF</code> attribute. + */ public void setLink(String link) { Index: TableRow.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TableRow.java,v retrieving revision 1.41 retrieving revision 1.42 diff -C2 -d -r1.41 -r1.42 *** TableRow.java 13 Feb 2005 20:36:00 -0000 1.41 --- TableRow.java 10 Apr 2005 23:20:45 -0000 1.42 *************** *** 86,90 **** /** ! * Get the column tags within this row. */ public TableColumn[] getColumns () --- 86,91 ---- /** ! * Get the column tags within this <code>TR</code> (table row) tag. ! * @return The {@.html <TD>} tags contained by this tag. */ public TableColumn[] getColumns () *************** *** 125,128 **** --- 126,132 ---- /** * Get the number of columns in this row. + * @return The number of columns in this row. + * <em>Note: this is a a simple count of the number of {@.html <TD>} tags and + * may be incorrect if the {@.html <TD>} tags span multiple columns.</em> */ public int getColumnCount () Index: TextareaTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TextareaTag.java,v retrieving revision 1.34 retrieving revision 1.35 diff -C2 -d -r1.34 -r1.35 *** TextareaTag.java 14 Jan 2004 02:53:46 -0000 1.34 --- TextareaTag.java 10 Apr 2005 23:20:45 -0000 1.35 *************** *** 81,84 **** --- 81,88 ---- } + /** + * Return the plain text contents from this text area. + * @return The text of the children of this <code>TEXTAREA</code> tag. 
+ */ public String getValue() { return toPlainTextString(); Index: TitleTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/TitleTag.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** TitleTag.java 17 Jul 2004 13:45:04 -0000 1.35 --- TitleTag.java 10 Apr 2005 23:20:45 -0000 1.36 *************** *** 83,86 **** --- 83,87 ---- /** * Get the title text. + * @return The title. */ public String getTitle() *************** *** 89,92 **** --- 90,97 ---- } + /** + * Return a string representation of this tag for debugging. + * @return A string with the text of the title. + */ public String toString() { Index: JspTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/JspTag.java,v retrieving revision 1.40 retrieving revision 1.41 diff -C2 -d -r1.40 -r1.41 *** JspTag.java 2 Jul 2004 00:49:29 -0000 1.40 --- JspTag.java 10 Apr 2005 23:20:45 -0000 1.41 *************** *** 58,62 **** /** ! * Print the contents of the jsp tag. */ public String toString() --- 58,63 ---- /** ! * Returns a string representation of this jsp tag suitable for debugging. ! * @return A string representing this tag. */ public String toString() Index: FrameSetTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FrameSetTag.java,v retrieving revision 1.36 retrieving revision 1.37 diff -C2 -d -r1.36 -r1.37 *** FrameSetTag.java 25 Jan 2004 21:33:12 -0000 1.36 --- FrameSetTag.java 10 Apr 2005 23:20:45 -0000 1.37 *************** *** 74,78 **** /** ! * Print the contents of the FrameSetTag */ public String toString() --- 74,79 ---- /** ! * Return a string representation of the contents of this <code>FRAMESET</code> tag suitable for debugging. ! * @return A string with this tag's contents. 
*/ public String toString() Index: ImageTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/ImageTag.java,v retrieving revision 1.48 retrieving revision 1.49 diff -C2 -d -r1.48 -r1.49 *** ImageTag.java 17 Jul 2004 13:45:04 -0000 1.48 --- ImageTag.java 10 Apr 2005 23:20:45 -0000 1.49 *************** *** 81,84 **** --- 81,85 ---- * <IMG SRC = http://www.redgreen.com> - space both sides of equals sign * </pre> + * @return The relative URL for the image. */ public String extractImageLocn () *************** *** 181,185 **** /** ! * Returns the location of the image */ public String getImageURL() --- 182,187 ---- /** ! * Returns the location of the image. ! * @return The absolute URL for this image. */ public String getImageURL() *************** *** 192,195 **** --- 194,201 ---- } + /** + * Set the <code>SRC</code> attribute. + * @param url The new value of the <code>SRC</code> attribute. + */ public void setImageURL (String url) { Index: FormTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/FormTag.java,v retrieving revision 1.49 retrieving revision 1.50 diff -C2 -d -r1.49 -r1.50 *** FormTag.java 18 Jul 2004 21:31:20 -0000 1.49 --- FormTag.java 10 Apr 2005 23:20:45 -0000 1.50 *************** *** 36,41 **** public class FormTag extends CompositeTag { ! public static final String POST="POST"; ! public static final String GET="GET"; /** --- 36,50 ---- public class FormTag extends CompositeTag { ! /** ! * The {@value} method. ! * @see #getFormMethod ! */ ! public static final String POST = "POST"; ! ! /** ! * The {@value} method. ! * @see #getFormMethod ! */ ! public static final String GET = "GET"; /** *************** *** 183,187 **** /** * Find the textarea tag matching the given name ! 
* @param name Name of the textarea tag to be found within the form */ public TextareaTag getTextAreaTag(String name) --- 192,197 ---- /** * Find the textarea tag matching the given name ! * @param name Name of the textarea tag to be found within the form. ! * @return The <code>TEXTAREA</code> tag with the matching name. */ public TextareaTag getTextAreaTag(String name) *************** *** 203,206 **** --- 213,217 ---- /** + * Return a string representation of the contents of this <code>FORM</code> tag suitable for debugging. * @return A textual representation of the form tag. */ *************** *** 211,216 **** /** ! * Extract the location of the image, given the tag, and the url ! * of the html page in which this tag exists. */ public String extractFormLocn () --- 222,227 ---- /** ! * Extract the <code>ACTION</code> attribute as an absolute URL. ! * @return The URL the form is to be submitted to. */ public String extractFormLocn () Index: MetaTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/MetaTag.java,v retrieving revision 1.38 retrieving revision 1.39 diff -C2 -d -r1.38 -r1.39 *** MetaTag.java 6 Sep 2004 17:12:59 -0000 1.38 --- MetaTag.java 10 Apr 2005 23:20:45 -0000 1.39 *************** *** 60,63 **** --- 60,68 ---- } + /** + * Get the <code>HTTP-EQUIV</code> attribute, if any. + * @return The value of the <code>HTTP-EQUIV</code> attribute, + * or <code>null</code> if the attribute doesn't exist. + */ public String getHttpEquiv () { *************** *** 65,68 **** --- 70,78 ---- } + /** + * Get the <code>CONTENT</code> attribute, if any. + * @return The value of the <code>CONTENT</code> attribute, + * or <code>null</code> if the attribute doesn't exist. + */ public String getMetaContent () { *************** *** 70,73 **** --- 80,88 ---- } + /** + * Get the <code>NAME</code> attribute, if any. 
+ * @return The value of the <code>NAME</code> attribute, + * or <code>null</code> if the attribute doesn't exist. + */ public String getMetaTagName () { *************** *** 75,79 **** } ! public void setHttpEquiv(String httpEquiv) { Attribute equiv; --- 90,98 ---- } ! /** ! * Set the <code>HTTP-EQUIV</code> attribute. ! * @param httpEquiv The new value of the <code>HTTP-EQUIV</code> attribute. ! */ ! public void setHttpEquiv (String httpEquiv) { Attribute equiv; *************** *** 85,89 **** } ! public void setMetaTagContents(String metaTagContents) { Attribute content; --- 104,112 ---- } ! /** ! * Set the <code>CONTENT</code> attribute. ! * @param metaTagContents The new value of the <code>CONTENT</code> attribute. ! */ ! public void setMetaTagContents (String metaTagContents) { Attribute content; *************** *** 95,99 **** } ! public void setMetaTagName(String metaTagName) { Attribute name; --- 118,126 ---- } ! /** ! * Set the <code>NAME</code> attribute. ! * @param metaTagName The new value of the <code>NAME</code> attribute. ! */ ! public void setMetaTagName (String metaTagName) { Attribute name; *************** *** 106,112 **** /** * Check for a charset directive, and if found, set the charset for the page. */ ! public void doSemanticAction () throws ParserException { String httpEquiv; --- 133,143 ---- /** + * Perform the META tag semantic action. * Check for a charset directive, and if found, set the charset for the page. + * @exception ParserException If setting the encoding fails. */ ! public void doSemanticAction () ! throws ! 
ParserException { String httpEquiv; Index: BodyTag.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tags/BodyTag.java,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** BodyTag.java 2 Jan 2004 16:24:54 -0000 1.21 --- BodyTag.java 10 Apr 2005 23:20:45 -0000 1.22 *************** *** 77,80 **** --- 77,85 ---- } + /** + * Returns the textual contents of this <code>BODY</code> tag. + * Equivalent to <code>toPlainTextString()</code>. + * @return The 'browser' text in this tag. + */ public String getBody() { *************** *** 82,85 **** --- 87,94 ---- } + /** + * Return a string representation of this <code>BODY</code> tag suitable for debugging. + * @return A string representing this <code>BODY</code> tag. + */ public String toString() { |
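Editorial note on the MetaTag change above: the new comment for doSemanticAction says it checks for a charset directive and, if one is found, sets the charset for the page. The diff does not show the method body, so the following is only a minimal standalone sketch of what such a check might look like, written in plain Java with no HTML Parser classes. The class name CharsetDirectiveSketch, the method findCharset and the regular expression are assumptions made for illustration; they are not the library's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of org.htmlparser.
final class CharsetDirectiveSketch
{
    // Assumed pattern: "charset", optional spaces, "=", optional quote, then the name.
    private static final Pattern CHARSET = Pattern.compile (
        "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /**
     * Look for a charset directive in the attributes of a META tag.
     * @param httpEquiv The value of the HTTP-EQUIV attribute, may be null.
     * @param content The value of the CONTENT attribute, may be null.
     * @return The charset name, or null if there is no directive.
     */
    static String findCharset (String httpEquiv, String content)
    {
        if ((null == httpEquiv) || (null == content))
            return (null);
        // Only a Content-Type META tag is treated as carrying a charset here.
        if (!httpEquiv.equalsIgnoreCase ("Content-Type"))
            return (null);
        Matcher matcher = CHARSET.matcher (content);
        return (matcher.find () ? matcher.group (1) : null);
    }

    public static void main (String[] args)
    {
        // prints ISO-8859-1
        System.out.println (findCharset ("Content-Type", "text/html; charset=ISO-8859-1"));
        // prints null (a refresh directive has no charset)
        System.out.println (findCharset ("refresh", "5; url=http://example.com/"));
    }
}
```

In the library itself the result would presumably be handed to the page or lexer encoding rather than printed; that wiring is not shown in this message and is left out of the sketch.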
From: Derrick O. <der...@us...> - 2005-04-06 10:28:13
|
Update of /cvsroot/htmlparser/htmlparser/docs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24576/htmlparser/docs Modified Files: release.txt samples.html Log Message: End user experience issues: remove multiple wiki files in zip fix sample application links change readme.txt to use Windows line endings change copyright date Index: samples.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/samples.html,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** samples.html 10 Jan 2005 00:43:24 -0000 1.2 --- samples.html 6 Apr 2005 10:28:02 -0000 1.3 *************** *** 26,30 **** <td> <i>Parse a web page and print the tags in a simple loop.</i><br> ! <a href="../javadoc/org/htmlparser/Parser.html#main(java.lang.String[])" target="_parent">org.htmlparser.Parser.main(String[] args)</a> <pre> <code>bin/parser http://website_url [tag_name]</code> --- 26,30 ---- <td> <i>Parse a web page and print the tags in a simple loop.</i><br> ! <a href="javadoc/org/htmlparser/Parser.html#main(java.lang.String[])" target="_parent">org.htmlparser.Parser.main(String[] args)</a> <pre> <code>bin/parser http://website_url [tag_name]</code> *************** *** 36,40 **** <code>java -jar lib/htmlparser.jar http://website_url [tag_name]</code> </pre> ! </td> </tr> <tr> --- 36,64 ---- <code>java -jar lib/htmlparser.jar http://website_url [tag_name]</code> </pre> ! </td> ! </tr> ! <tr> ! <td valign="top"> ! <strong>Lexer</strong><br> ! </td> ! <td> ! <i>Print the low level nodes of a web page.</i><br> ! <a href="javadoc/org/htmlparser/lexer/Lexer.html" target="_parent">org.htmlparser.lexer.Lexer</a> ! <pre> ! <code>bin/lexer http://website_url</code> ! </pre> ! </td> ! </tr> ! <tr> ! <td valign="top"> ! <strong>Filter Builder</strong><br> ! </td> ! <td> ! <i>Interactively generate source code to extract web site contents.</i><br> ! 
<a href="javadoc/org/htmlparser/parserapplications/filterbuilder/FilterBuilder.html" target="_parent">org.htmlparser.parserapplications.filterbuilder.FilterBuilder</a> ! <pre> ! <code>bin/filterbuilder</code> ! </pre> ! </td> </tr> <tr> *************** *** 44,53 **** <td> <i>Extract links/mail addresses from a web page.</i><br> ! <a href="../javadoc/org/htmlparser/parserapplications/LinkExtractor.html" target="_parent">org.htmlparser.parserapplications.LinkExtractor</a> <pre> <code>bin/linkextractor http://website_url [-maillinks]</code> the optional -maillinks argument causes mailto: links to be printed </pre> ! </td> </tr> <tr> --- 68,77 ---- <td> <i>Extract links/mail addresses from a web page.</i><br> ! <a href="javadoc/org/htmlparser/parserapplications/LinkExtractor.html" target="_parent">org.htmlparser.parserapplications.LinkExtractor</a> <pre> <code>bin/linkextractor http://website_url [-maillinks]</code> the optional -maillinks argument causes mailto: links to be printed </pre> ! </td> </tr> <tr> *************** *** 57,66 **** <td> <i>Extract text from a web page.</i><br> ! <a href="../javadoc/org/htmlparser/parserapplications/StringExtractor.html" target="_parent">org.htmlparser.parserapplications.StringExtractor</a> <pre> <code>bin/stringextractor http://website_url [-links]</code> the optional -links argument causes hyperlinks to be shown within the text </pre> ! </td> </tr> <tr> --- 81,90 ---- <td> <i>Extract text from a web page.</i><br> ! <a href="javadoc/org/htmlparser/parserapplications/StringExtractor.html" target="_parent">org.htmlparser.parserapplications.StringExtractor</a> <pre> <code>bin/stringextractor http://website_url [-links]</code> the optional -links argument causes hyperlinks to be shown within the text </pre> ! </td> </tr> <tr> *************** *** 70,74 **** <td> <i>Save a web site locally.</i><br> ! 
<a href="../javadoc/org/htmlparser/parserapplications/SiteCapturer.html" target="_parent">org.htmlparser.parserapplications.SiteCapturer</a> <pre> <code>bin/sitecapturer http://source_website /target_directory/ [true|false]</code> --- 94,98 ---- <td> <i>Save a web site locally.</i><br> ! <a href="javadoc/org/htmlparser/parserapplications/SiteCapturer.html" target="_parent">org.htmlparser.parserapplications.SiteCapturer</a> <pre> <code>bin/sitecapturer http://source_website /target_directory/ [true|false]</code> *************** *** 76,80 **** audio and video are to be captured </pre> ! </td> </tr> <tr> --- 100,104 ---- audio and video are to be captured </pre> ! </td> </tr> <tr> *************** *** 84,92 **** <td> <i>View images behind thumbnails.</i><br> ! <a href="../javadoc/org/htmlparser/lexerapplications/thumbelina/package-summary.html" target="_parent">org.htmlparser.lexerapplications.thumbelina.Thumbelina</a> <pre> <code>bin/thumbelina [http://starting_website]</code> </pre> ! </td> </tr> <tr> --- 108,116 ---- <td> <i>View images behind thumbnails.</i><br> ! <a href="javadoc/org/htmlparser/lexerapplications/thumbelina/package-summary.html" target="_parent">org.htmlparser.lexerapplications.thumbelina.Thumbelina</a> <pre> <code>bin/thumbelina [http://starting_website]</code> </pre> ! </td> </tr> <tr> *************** *** 96,104 **** <td> <i>Parser Java Bean demo.</i><br> ! <a href="../javadoc/org/htmlparser/beans/BeanyBaby.html" target="_parent">org.htmlparser.beans.BeanyBaby</a> <pre> <code>bin/beanybaby [http://starting_website]</code> </pre> ! </td> </tr> </table> --- 120,140 ---- <td> <i>Parser Java Bean demo.</i><br> ! <a href="javadoc/org/htmlparser/beans/BeanyBaby.html" target="_parent">org.htmlparser.beans.BeanyBaby</a> <pre> <code>bin/beanybaby [http://starting_website]</code> </pre> ! </td> ! </tr> ! <tr> ! <td valign="top"> ! <strong>Translate</strong><br> ! </td> ! <td> ! 
<i>Numeric character reference and character entity reference to unicode codec.</i><br> ! <a href="javadoc/org/htmlparser/util/Translate.html" target="_parent">org.htmlparser.util.Translate</a> ! <pre> ! <code>bin/translate [-encode] <input_file >output_file</code> ! </pre> ! </td> </tr> </table> Index: release.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/release.txt,v retrieving revision 1.68 retrieving revision 1.69 diff -C2 -d -r1.68 -r1.69 *** release.txt 13 Mar 2005 15:36:10 -0000 1.68 --- release.txt 6 Apr 2005 10:28:01 -0000 1.69 *************** *** 23,27 **** (v) license.txt (GNU Lesser General Public License) ! (vi) this file, release.txt Changes since Version 1.4 --- 23,27 ---- (v) license.txt (GNU Lesser General Public License) ! (vi) this file, readme.txt Changes since Version 1.4 |
From: Derrick O. <der...@us...> - 2005-04-06 10:28:12
|
Update of /cvsroot/htmlparser/htmlparser In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv24576/htmlparser Modified Files: build.xml Log Message: End user experience issues: remove multiple wiki files in zip fix sample application links change readme.txt to use Windows line endings change copyright date Index: build.xml =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/build.xml,v retrieving revision 1.76 retrieving revision 1.77 diff -C2 -d -r1.76 -r1.77 *** build.xml 12 Mar 2005 20:27:45 -0000 1.76 --- build.xml 6 Apr 2005 10:27:59 -0000 1.77 *************** *** 430,438 **** <property name="javadoc.doctitle" value="HTML Parser ${versionNumber}"/> <property name="javadoc.header" value="<A HREF="http://htmlparser.sourceforge.net" target="_top">HTML Parser Home Page</A>"/> ! <property name="javadoc.footer" value="&copy; 2004 Derrick Oswald<div align="right">${TODAY_STRING}</div>"/> ! <property name="javadoc.bottom" value="HTML Parser is an open source library released under ! <A HREF="http://www.opensource.org/licenses/lgpl-license.html" target="_top">LGPL</A>.<BR> ! <div align="right"><A HREF="http://sourceforge.net/projects/htmlparser" target="_top"> ! <img src="http://sourceforge.net/sflogo.php?group_id=24399&type=1" width="88" height="31" border="0" alt="SourceForge.net"></A></div>"/> <javadoc packagenames="org.htmlparser.*" sourcepath="${src}" --- 430,438 ---- <property name="javadoc.doctitle" value="HTML Parser ${versionNumber}"/> <property name="javadoc.header" value="<A HREF="http://htmlparser.sourceforge.net" target="_top">HTML Parser Home Page</A>"/> ! <property name="javadoc.footer" value="&copy; 2005 Derrick Oswald<div align="right">${TODAY_STRING}</div>"/> ! <property name="javadoc.bottom" value="<table width='100%'><tr><td>HTML Parser is an open source library released under ! <a HREF="http://www.opensource.org/licenses/lgpl-license.html" target="_top">LGPL</a>.</td><td align='right'> ! 
<a HREF="http://sourceforge.net/projects/htmlparser" target="_top"> ! <img src="http://sourceforge.net/sflogo.php?group_id=24399&type=1" width="88" height="31" border="0" alt="SourceForge.net"></a></td></tr></table>"/> <javadoc packagenames="org.htmlparser.*" sourcepath="${src}" *************** *** 453,458 **** <taglet name="HtmlTaglet" path="${resources}:${classes}"/> <group title="Main Package" packages="org.htmlparser"/> ! <group title="Example Applications" packages="org.htmlparser.parserapplications,org.htmlparser.lexerapplications.tabby,org.htmlparser.lexerapplications.thumbelina,org.htmlparser.parserapplications.filterbuilder"/> ! <group title="Tags" packages="org.htmlparser.tags,org.htmlparser.tags.data"/> <group title="Lexer" packages="org.htmlparser.lexer"/> <group title="Scanners" packages="org.htmlparser.scanners"/> --- 453,458 ---- <taglet name="HtmlTaglet" path="${resources}:${classes}"/> <group title="Main Package" packages="org.htmlparser"/> ! <group title="Example Applications" packages="org.htmlparser.parserapplications,org.htmlparser.lexerapplications.tabby,org.htmlparser.lexerapplications.thumbelina,org.htmlparser.parserapplications.filterbuilder*"/> ! <group title="Nodes" packages="org.htmlparser.nodes,org.htmlparser.tags"/> <group title="Lexer" packages="org.htmlparser.lexer"/> <group title="Scanners" packages="org.htmlparser.scanners"/> *************** *** 461,465 **** <group title="Http" packages="org.htmlparser.http"/> <group title="Sax" packages="org.htmlparser.sax"/> ! <group title="Utility" packages="org.htmlparser.util,org.htmlparser.util.sort"/> <link href="http://java.sun.com/j2se/1.4.2/docs/api/"/> <link href="http://www.saxproject.org/apidoc/"/> --- 461,465 ---- <group title="Http" packages="org.htmlparser.http"/> <group title="Sax" packages="org.htmlparser.sax"/> ! 
<group title="Utility" packages="org.htmlparser.util*"/> <link href="http://java.sun.com/j2se/1.4.2/docs/api/"/> <link href="http://www.saxproject.org/apidoc/"/> *************** *** 472,476 **** </target> ! <!-- Prepare the sources zip, allowing folks to build the code --> <target name="sources" description="create the source zip"> <zip destfile="src.zip" defaultexcludes="no"> --- 472,476 ---- </target> ! <!-- Create the source zip. --> <target name="sources" description="create the source zip"> <zip destfile="src.zip" defaultexcludes="no"> *************** *** 483,501 **** <!-- Perform the htmlparser integration --> ! <target name="htmlparser" depends="release,sources" ! description="glom the release and source files into the distribution zip file"> <mkdir dir="${distribution}"/> <zip zipfile="${distribution}/htmlparser${versionTag}.zip"> <zipfileset dir="${bin}" prefix="htmlparser${versionQualifier}/${bin}" includes="*.bat"/> <zipfileset dir="${bin}" prefix="htmlparser${versionQualifier}/${bin}" includes="*" excludes="*.bat" filemode="755"/> ! <zipfileset dir="${docs}" prefix="htmlparser${versionQualifier}/${docs}" excludes="docs/**,samples/**"/> ! <zipfileset dir="${wiki}" prefix="htmlparser${versionQualifier}/${docs}/wiki"/> <zipfileset dir="${lib}" prefix="htmlparser${versionQualifier}/${lib}"/> <zipfileset dir="." prefix="htmlparser${versionQualifier}/" includes="src.zip"/> <!-- Copy the release notes as readme.txt in the base release directory --> ! <zipfileset dir="${docs}" includes="release.txt" fullpath="htmlparser${versionQualifier}/readme.txt"/> <!-- Copy the LGPL license.txt to the base release directory --> <zipfileset dir="${resources}" includes="license.txt" fullpath="htmlparser${versionQualifier}/license.txt"/> </zip> </target> --- 483,502 ---- <!-- Perform the htmlparser integration --> ! <target name="htmlparser" depends="init,release,sources" ! 
description="create distribution zip file"> <mkdir dir="${distribution}"/> + <fixcrlf srcDir="${docs}" destDir="${distribution}" includes="release.txt" eol="crlf"/> <zip zipfile="${distribution}/htmlparser${versionTag}.zip"> <zipfileset dir="${bin}" prefix="htmlparser${versionQualifier}/${bin}" includes="*.bat"/> <zipfileset dir="${bin}" prefix="htmlparser${versionQualifier}/${bin}" includes="*" excludes="*.bat" filemode="755"/> ! <zipfileset dir="${docs}" prefix="htmlparser${versionQualifier}/${docs}" excludes="samples/**"/> <zipfileset dir="${lib}" prefix="htmlparser${versionQualifier}/${lib}"/> <zipfileset dir="." prefix="htmlparser${versionQualifier}/" includes="src.zip"/> <!-- Copy the release notes as readme.txt in the base release directory --> ! <zipfileset dir="${distribution}" fullpath="htmlparser${versionQualifier}/readme.txt" includes="release.txt"/> <!-- Copy the LGPL license.txt to the base release directory --> <zipfileset dir="${resources}" includes="license.txt" fullpath="htmlparser${versionQualifier}/license.txt"/> </zip> + <delete file="${distribution}/release.txt"/> </target> |
From: Derrick O. <der...@us...> - 2005-04-06 10:20:35
|
Update of /cvsroot/htmlparser/htmlparser/docs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23052/docs Modified Files: contributors.html Log Message: Add link pattern filters submitted by John Derrick. Index: contributors.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/contributors.html,v retrieving revision 1.14 retrieving revision 1.15 diff -C2 -d -r1.14 -r1.15 *** contributors.html 14 Feb 2005 23:54:16 -0000 1.14 --- contributors.html 6 Apr 2005 10:20:21 -0000 1.15 *************** *** 396,401 **** </tr> </table> ! <p>Thanks to David Andersen, Manuel Polo, Enrico Triolo, Gernot Fricke, Nick Burch, ! Stephen Harrington, Domenico Lordi, Kamen, John Zook, Cheng Jun, Mazlan Mat, Rob Shields, Wolfgang Germund, Raj Sharma, Robert Kausch, Gordon Deudney, Serge Kruppa, Roger Kjensrud, and Manpreet Singh --- 396,401 ---- </tr> </table> ! <p>Thanks to John Derrick, David Andersen, Manuel Polo, Enrico Triolo, ! Gernot Fricke, Nick Burch, Stephen Harrington, Domenico Lordi, Kamen, John Zook, Cheng Jun, Mazlan Mat, Rob Shields, Wolfgang Germund, Raj Sharma, Robert Kausch, Gordon Deudney, Serge Kruppa, Roger Kjensrud, and Manpreet Singh |
From: Derrick O. <der...@us...> - 2005-04-06 10:20:35
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23052/src/org/htmlparser/filters

Added Files:
    LinkRegexFilter.java LinkStringFilter.java
Log Message:
Add link pattern filters submitted by John Derrick.

--- NEW FILE: LinkStringFilter.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2005 John Derrick
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/LinkStringFilter.java,v $
// $Author: derrickoswald $
// $Date: 2005/04/06 10:20:23 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.tags.LinkTag;

/**
 * This class accepts tags of class LinkTag that contain a link matching a given
 * pattern string. Use this filter to extract LinkTag nodes with URLs containing
 * the desired string.
 */
public class LinkStringFilter implements NodeFilter
{
    protected String mPattern;
    protected boolean mCaseSensitive;

    /**
     * Creates a new instance of LinkStringFilter that accepts LinkTag nodes containing
     * a URL that matches the supplied pattern.
     * The match is case insensitive.
     * @param pattern The pattern to match.
     */
    public LinkStringFilter (String pattern)
    {
        this (pattern, false);
    }

    /**
     * Creates a new instance of LinkStringFilter that accepts LinkTag nodes containing
     * a URL that matches the supplied pattern.
     * @param pattern The pattern to match.
     * @param caseSensitive Specifies case sensitivity for the matching process.
     */
    public LinkStringFilter (String pattern, boolean caseSensitive)
    {
        mPattern = pattern;
        mCaseSensitive = caseSensitive;
    }

    /**
     * Accept nodes that are assignable from the LinkTag class and have a URL that
     * matches the pattern supplied in the constructor.
     * @param node The node to check.
     * @return <code>true</code> if the node is a link with the pattern.
     */
    public boolean accept (Node node)
    {
        boolean ret;

        ret = false;
        if (LinkTag.class.isAssignableFrom (node.getClass ()))
        {
            String link = ((LinkTag)node).getLink ();
            if (mCaseSensitive)
            {
                if (link.indexOf (mPattern) > -1)
                    ret = true;
            }
            else
            {
                if (link.toUpperCase ().indexOf (mPattern.toUpperCase ()) > -1)
                    ret = true;
            }
        }

        return (ret);
    }
}

--- NEW FILE: LinkRegexFilter.java ---
// HTMLParser Library $Name: $ - A java-based parser for HTML
// http://sourceforge.org/projects/htmlparser
// Copyright (C) 2005 John Derrick
//
// Revision Control Information
//
// $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/filters/LinkRegexFilter.java,v $
// $Author: derrickoswald $
// $Date: 2005/04/06 10:20:23 $
// $Revision: 1.1 $
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// Lesser General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this library; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
//

package org.htmlparser.filters;

import java.util.regex.*;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.tags.LinkTag;

/**
 * This class accepts tags of class LinkTag that contain a link matching a given
 * regex pattern. Use this filter to extract LinkTag nodes with URLs that match
 * the desired regex pattern.
 */
public class LinkRegexFilter implements NodeFilter
{
    protected Pattern mRegex;

    /**
     * Creates a new instance of LinkRegexFilter that accepts LinkTag nodes containing
     * a URL that matches the supplied regex pattern.
     * The match is case sensitive.
     * @param regexPattern The pattern to match.
     */
    public LinkRegexFilter (String regexPattern) throws Exception
    {
        this (regexPattern, true);
    }

    /**
     * Creates a new instance of LinkRegexFilter that accepts LinkTag nodes containing
     * a URL that matches the supplied regex pattern.
     * @param regexPattern The regex pattern to match.
     * @param caseSensitive Specifies case sensitivity for the matching process.
     */
    public LinkRegexFilter (String regexPattern, boolean caseSensitive) throws Exception
    {
        if (caseSensitive)
            mRegex = Pattern.compile (regexPattern);
        else
            mRegex = Pattern.compile (regexPattern, Pattern.CASE_INSENSITIVE);
    }

    /**
     * Accept nodes that are assignable from the LinkTag class and have a URL that
     * matches the regex pattern supplied in the constructor.
     * @param node The node to check.
     * @return <code>true</code> if the node is a link with the pattern.
     */
    public boolean accept (Node node)
    {
        boolean ret;

        ret = false;
        if (LinkTag.class.isAssignableFrom (node.getClass ()))
        {
            String link = ((LinkTag)node).getLink ();
            Matcher matcher = mRegex.matcher (link);
            ret = matcher.find ();
        }

        return (ret);
    }
}
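For anyone comparing the two new filters, here is a minimal sketch of just the matching step each one applies to the string returned by getLink(). It uses only the JDK, so it runs without the htmlparser jar. The class name LinkMatchSketch, its method names and the example.org link are made up for illustration and are not part of the library. Note that in the listing above the single-argument LinkStringFilter constructor passes false for caseSensitive, while the single-argument LinkRegexFilter constructor passes true.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser: mirrors only the string tests
// performed inside the accept() methods of the two filters.
final class LinkMatchSketch
{
    /** Substring test in the manner of LinkStringFilter.accept(). */
    static boolean containsPattern (String link, String pattern, boolean caseSensitive)
    {
        if (caseSensitive)
            return link.indexOf (pattern) > -1;
        // case-insensitive comparison by upper-casing both sides
        return link.toUpperCase ().indexOf (pattern.toUpperCase ()) > -1;
    }

    /** Regex test in the manner of LinkRegexFilter.accept(); find() matches anywhere in the link. */
    static boolean findsRegex (String link, String regex, boolean caseSensitive)
    {
        Pattern compiled = caseSensitive
            ? Pattern.compile (regex)
            : Pattern.compile (regex, Pattern.CASE_INSENSITIVE);
        return compiled.matcher (link).find ();
    }

    public static void main (String[] args)
    {
        String link = "http://example.org/Docs/index.HTML"; // made-up link
        System.out.println (containsPattern (link, "docs", false));   // true
        System.out.println (containsPattern (link, "docs", true));    // false
        System.out.println (findsRegex (link, "\\.html$", false));    // true
        System.out.println (findsRegex (link, "\\.html$", true));     // false
    }
}
```

Because the regex filter calls find() rather than matches(), a pattern only has to match somewhere in the link; anchor it with ^ and $ if the whole URL must match.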
From: Derrick O. <der...@us...> - 2005-04-05 01:03:01
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28518/htmlparser/src/org/htmlparser Modified Files: NodeFilter.java Parser.java package.html Log Message: Update javadocs. Enable SiteCapturer to handle resource names containing spaces. Index: package.html =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/package.html,v retrieving revision 1.21 retrieving revision 1.22 diff -C2 -d -r1.21 -r1.22 *** package.html 14 Jun 2004 00:06:51 -0000 1.21 --- package.html 5 Apr 2005 00:48:12 -0000 1.22 *************** *** 33,40 **** the HTML Parser. <p>The {@link org.htmlparser.Parser} class is the main high level class that ! provides simplified access to the contents of an HTML page. The page can be ! specified as either a URLConnection or a String. In the case of a String, an ! attempt is made to open it as a URL, and if that fails it assumes it is a local ! disk file. A wide range of methods is available to customize the operation of the Parser, as well as access specific pieces of the page as --- 33,37 ---- the HTML Parser. <p>The {@link org.htmlparser.Parser} class is the main high level class that ! provides simplified access to the contents of an HTML page. A wide range of methods is available to customize the operation of the Parser, as well as access specific pieces of the page as *************** *** 48,56 **** is the {@link org.htmlparser.PrototypicalNodeFactory} which operates by holding example nodes and cloning them as needed to satisfy the ! requests for nodes by the Parser. The Lexer is it's own NodeFactory, returning ! new {@link org.htmlparser.nodes.TextNode}, {@link org.htmlparser.nodes.RemarkNode} and undifferentiated {@link org.htmlparser.nodes.TagNode Tagnodes} (see the ! 
{@link org.htmlparser.nodes nodes} package).</p> <p>The {@link org.htmlparser.NodeFilter} interface is used by the filtering code to determine if a node meets a certain criteria. Some generic examples of --- 45,55 ---- is the {@link org.htmlparser.PrototypicalNodeFactory} which operates by holding example nodes and cloning them as needed to satisfy the ! requests for nodes by the Parser. By default, a Lexer is it's own NodeFactory, ! returning new {@link org.htmlparser.nodes.TextNode}, {@link org.htmlparser.nodes.RemarkNode} and undifferentiated {@link org.htmlparser.nodes.TagNode Tagnodes} (see the ! {@link org.htmlparser.nodes nodes} package), but when the parser uses a lexer ! it replaces this behaviour with a PrototypicalNodeFactory to return a rich ! set of specific tags (see the {@link org.htmlparser.tags tags} package).</p> <p>The {@link org.htmlparser.NodeFilter} interface is used by the filtering code to determine if a node meets a certain criteria. Some generic examples of Index: Parser.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v retrieving revision 1.103 retrieving revision 1.104 diff -C2 -d -r1.103 -r1.104 *** Parser.java 13 Mar 2005 15:36:11 -0000 1.103 --- Parser.java 5 Apr 2005 00:48:10 -0000 1.104 *************** *** 46,59 **** /** ! * This is the class that the user will use, either to get an iterator into ! * the html page or to directly parse the page and print the results ! * <BR> ! * Typical usage of the parser is as follows : <BR> ! * [1] Create a parser object - passing the URL and a feedback object to the parser<BR> ! * [2] Enumerate through the elements from the parser object <BR> ! * It is important to note that the parsing occurs when you enumerate, ON DEMAND. ! * This is a thread-safe way, and you only get the control back after a ! * particular element is parsed and returned, which could be the entire body. ! 
* @see Parser#elements() */ public class Parser --- 46,108 ---- /** ! * The main parser class. ! * This is the primary class of the HTML Parser library. It provides ! * constructors that take a {@link #Parser(String) String}, ! * a {@link #Parser(URLConnection) URLConnection}, or a ! * {@link #Parser(Lexer) Lexer}. In the case of a String, an ! * attempt is made to open it as a URL, and if that fails it assumes it is a ! * local disk file. If you want to actually parse a String, use ! * {@link #setInputHTML setInputHTML()} after using the ! * {@link #Parser() no-args} constructor, or use {@link #createParser}. ! * <p>The Parser provides access to the contents of the ! * page, via a {@link #elements() NodeIterator}, a ! * {@link #parse(NodeFilter) NodeList} or a ! * {@link #visitAllNodesWith NodeVisitor}. ! * <p>Typical usage of the parser is: ! * <code> ! * <pre> ! * Parser parser = new Parser ("http://whatever"); ! * NodeList list = parser.parse (); ! * // do something with your list of nodes. ! * </pre> ! * </code></p> ! * <p>What types of nodes and what can be done with them is dependant on the ! * setup, but in general a node can be converted back to HTML and it's ! * children (enclosed nodes) and parent can be obtained, because nodes are ! * nested. See the {@link Node} interface.</p> ! * <p>For example, if the URL contains:<br> ! * <code> ! * {@.html ! * <html> ! * <head> ! * <title>Mondays -- What a bad idea.</title> ! * </head> ! * <body BGCOLOR="#FFFFFF"> ! * Most people have a pathological hatred of Mondays... ! * </body> ! * </html>} ! * </code><br> ! * and the example code above is used, the list contain only one element, the ! * {@.html <html>} node. This node is a {@link org.htmlparser.tags tag}, ! * which is an object of class ! * {@link org.htmlparser.tags.Html Html} if the default {@link NodeFactory} ! * (a {@link PrototypicalNodeFactory}) is used.</p> ! * <p>To get at further content, the children of the top ! * level nodes must be examined. 
When digging through a node list one must be ! * conscious of the possibility of whitespace between nodes, e.g. in the example ! * above: ! * <code> ! * <pre> ! * Node node = list.elementAt (0); ! * NodeList sublist = node.getChildren (); ! * System.out.println (sublist.size ()); ! * </pre> ! * </code> ! * would print out 5, not 2, because there are newlines after {@.html <html>}, ! * {@.html </head>} and {@.html </body>} that are children of the HTML node ! * besides the {@.html <head>} and {@.html <body>} nodes.</p> ! * <p>Because processing nodes is so common, two interfaces are provided to ! * ease this task, {@link org.htmlparser.filters filters} ! * and {@link org.htmlparser.visitors visitors}. */ public class Parser *************** *** 66,70 **** /** ! * The floating point version number. */ public final static double --- 115,119 ---- /** ! * The floating point version number ({@value}). */ public final static double *************** *** 73,77 **** /** ! * The type of version. */ public final static String --- 122,126 ---- /** ! * The type of version ({@value}). */ public final static String *************** *** 80,84 **** /** ! * The date of the version. */ public final static String --- 129,133 ---- /** ! * The date of the version ({@value}). */ public final static String *************** *** 87,91 **** /** ! * The display version. */ public final static String --- 136,140 ---- /** ! * The display version ({@value}). */ public final static String *************** *** 186,191 **** /** * Zero argument constructor. ! * The parser is in a safe but useless state. ! * Set the lexer or connection using setLexer() or setConnection(). * @see #setLexer(Lexer) * @see #setConnection(URLConnection) --- 235,241 ---- /** * Zero argument constructor. ! * The parser is in a safe but useless state parsing an empty string. ! * Set the lexer or connection using {@link #setLexer} ! * or {@link #setConnection}. 
* @see #setLexer(Lexer) * @see #setConnection(URLConnection) *************** *** 197,213 **** /** ! * This constructor enables the construction of test cases, with readers ! * associated with test string buffers. It can also be used with readers of the user's choice ! * streaming data into the parser.<p/> ! * <B>Important:</B> If you are using this constructor, and you would like to use the parser ! * to parse multiple times (multiple calls to parser.elements()), you must ensure the following:<br> ! * <ul> ! * <li>Before the first parse, you must mark the reader for a length that you anticipate (the size of the stream).</li> ! * <li>After the first parse, calls to elements() must be preceded by calls to : ! * <pre> ! * parser.getReader().reset(); ! * </pre> ! * </li> ! * </ul> * @param lexer The lexer to draw characters from. * @param fb The object to use when information, --- 247,253 ---- /** ! * Construct a parser using the provided lexer and feedback object. ! * This would be used to create a parser for special cases where the ! * normal creation of a lexer on a URLConnection needs to be customized. * @param lexer The lexer to draw characters from. * @param fb The object to use when information, *************** *** 226,232 **** --- 266,276 ---- /** * Constructor for custom HTTP access. + * This would be used to create a parser for a URLConnection that needs + * a special setup or negotiation conditioning beyond what is available + * from the {@link #getConnectionManager ConnectionManager}. * @param connection A fully conditioned connection. The connect() * method will be called so it need not be connected yet. * @param fb The object to use for message communication. + * @throws ParserException If the creation of the underlying Lexer cannot be performed. 
*/ public Parser (URLConnection connection, ParserFeedback fb) *************** *** 240,243 **** --- 284,288 ---- * Creates a Parser object with the location of the resource (URL or file) * You would typically create a DefaultHTMLParserFeedback object and pass it in. + * @see #Parser(URLConnection,ParserFeedback) * @param resourceLocn Either the URL or the filename (autodetects). * A standard HTTP GET is performed to read the content of the URL. *************** *** 245,249 **** * warning and error messages are produced. If <em>null</em> no feedback * is provided. ! * @see #Parser(URLConnection,ParserFeedback) */ public Parser (String resourceLocn, ParserFeedback feedback) throws ParserException --- 290,294 ---- * warning and error messages are produced. If <em>null</em> no feedback * is provided. ! * @throws ParserException If the URL is invalid. */ public Parser (String resourceLocn, ParserFeedback feedback) throws ParserException *************** *** 256,259 **** --- 301,305 ---- * A DefaultHTMLParserFeedback object is used for feedback. * @param resourceLocn Either the URL or the filename (autodetects). + * @throws ParserException If the resourceLocn argument does not resolve to a valid page or file. */ public Parser (String resourceLocn) throws ParserException *************** *** 263,279 **** /** ! * This constructor is present to enable users to plugin their own lexers. ! * A DefaultHTMLParserFeedback object is used for feedback. It can also be used with readers of the user's choice ! * streaming data into the parser.<p/> ! * <B>Important:</B> If you are using this constructor, and you would like to use the parser ! * to parse multiple times (multiple calls to parser.elements()), you must ensure the following:<br> ! * <ul> ! * <li>Before the first parse, you must mark the reader for a length that you anticipate (the size of the stream).</li> ! * <li>After the first parse, calls to elements() must be preceded by calls to : ! * <pre> ! 
* parser.getReader().reset(); ! * </pre> ! * </li> ! * @param lexer The source for HTML to be parsed. */ public Parser (Lexer lexer) --- 309,317 ---- /** ! * Construct a parser using the provided lexer. ! * A feedback object printing to {@link #stdout System.out} is used. ! * This would be used to create a parser for special cases where the ! * normal creation of a lexer on a URLConnection needs to be customized. ! * @param lexer The lexer to draw characters from. */ public Parser (Lexer lexer) *************** *** 283,291 **** /** ! * Constructor for non-standard access. ! * A DefaultHTMLParserFeedback object is used for feedback. * @param connection A fully conditioned connection. The connect() * method will be called so it need not be connected yet. ! * @see #Parser(URLConnection,ParserFeedback) */ public Parser (URLConnection connection) throws ParserException --- 321,333 ---- /** ! * Construct a parser using the provided URLConnection. ! * This would be used to create a parser for a URLConnection that needs ! * a special setup or negotiation conditioning beyond what is available ! * from the {@link #getConnectionManager ConnectionManager}. ! * A feedback object printing to {@link #stdout System.out} is used. ! * @see #Parser(URLConnection,ParserFeedback) * @param connection A fully conditioned connection. The connect() * method will be called so it need not be connected yet. ! * @throws ParserException If the creation of the underlying Lexer cannot be performed. */ public Parser (URLConnection connection) throws ParserException *************** *** 301,305 **** * Set the connection for this parser. * This method creates a new <code>Lexer</code> reading from the connection. - * Trying to set the connection to null is a noop. * @param connection A fully conditioned connection. The connect() * method will be called so it need not be connected yet. --- 343,346 ---- *************** *** 313,318 **** ParserException { ! if (null != connection) ! 
setLexer (new Lexer (connection)); } --- 354,360 ---- ParserException { ! if (null == connection) ! throw new IllegalArgumentException ("connection cannot be null"); ! setLexer (new Lexer (connection)); } *************** *** 320,324 **** * Return the current connection. * @return The connection either created by the parser or passed into this ! * parser via <code>setConnection</code>. * @see #setConnection(URLConnection) */ --- 362,366 ---- * Return the current connection. * @return The connection either created by the parser or passed into this ! * parser via {@link #setConnection}. * @see #setConnection(URLConnection) */ *************** *** 331,336 **** * Set the URL for this parser. * This method creates a new Lexer reading from the given URL. ! * Trying to set the url to null or an empty string is a noop. ! * @see #setConnection(URLConnection) */ public void setURL (String url) --- 373,380 ---- * Set the URL for this parser. * This method creates a new Lexer reading from the given URL. ! * Trying to set the url to null or an empty string is a no-op. ! * @param url The new URL for the parser. ! * @throws ParserException If the url is invalid or creation of the ! * underlying Lexer cannot be performed. */ public void setURL (String url) *************** *** 339,349 **** { if ((null != url) && !"".equals (url)) ! setConnection (Page.getConnectionManager ().openConnection (url)); } /** * Return the current URL being parsed. ! * @return The url passed into the constructor or the file name ! * passed to the constructor modified to be a URL. */ public String getURL () --- 383,395 ---- { if ((null != url) && !"".equals (url)) ! setConnection (getConnectionManager ().openConnection (url)); } /** * Return the current URL being parsed. ! * @return The current url. This is the URL for the current page. ! * A string passed into the constructor or set via setURL may be altered, ! * for example, a file name may be modified to be a URL. ! 
* @see Page#getUrl */ public String getURL () *************** *** 355,358 **** --- 401,408 ---- * Set the encoding for the page this parser is reading from. * @param encoding The new character set to use. + * @throws ParserException If the encoding change causes characters that + * have already been consumed to differ from the characters that would + * have been seen had the new encoding been in force. + * @see org.htmlparser.util.EncodingChangeException */ public void setEncoding (String encoding) *************** *** 367,370 **** --- 417,421 ---- * This item is set from the HTTP header but may be overridden by meta * tags in the head, so this may change after the head has been parsed. + * @return The encoding currently in force. */ public String getEncoding () *************** *** 375,383 **** /** * Set the lexer for this parser. ! * The current NodeFactory is set on the given lexer, since the lexer ! * contains the node factory object. * It does not adjust the <code>feedback</code> object. ! * Trying to set the lexer to <code>null</code> is a noop. * @param lexer The lexer object to use. */ public void setLexer (Lexer lexer) --- 426,435 ---- /** * Set the lexer for this parser. ! * The current NodeFactory is transferred to (set on) the given lexer, ! * since the lexer owns the node factory object. * It does not adjust the <code>feedback</code> object. ! * Trying to set the lexer to <code>null</code> is a no-op. * @param lexer The lexer object to use. + * @see #setNodeFactory */ public void setLexer (Lexer lexer) *************** *** 405,409 **** /** ! * Returns the reader associated with the parser * @return The current lexer. */ --- 457,461 ---- /** ! * Returns the lexer associated with the parser * @return The current lexer. */ *************** *** 415,419 **** /** * Get the current node factory. ! * @return The parser's node factory. */ public NodeFactory getNodeFactory () --- 467,471 ---- /** * Get the current node factory. ! 
* @return The current lexer's node factory. */ public NodeFactory getNodeFactory () *************** *** 424,428 **** /** * Set the current node factory. ! * @param factory The new node factory for the parser. */ public void setNodeFactory (NodeFactory factory) --- 476,480 ---- /** * Set the current node factory. ! * @param factory The new node factory for the current lexer. */ public void setNodeFactory (NodeFactory factory) *************** *** 435,439 **** /** * Sets the feedback object used in scanning. ! * @param fb The new feedback object to use. */ public void setFeedback (ParserFeedback fb) --- 487,492 ---- /** * Sets the feedback object used in scanning. ! * @param fb The new feedback object to use. If this is null a ! * {@link #noFeedback silent feedback object} is used. */ public void setFeedback (ParserFeedback fb) *************** *** 443,448 **** /** ! * Returns the feedback. ! * @return HTMLParserFeedback */ public ParserFeedback getFeedback() --- 496,501 ---- /** ! * Returns the current feedback object. ! * @return The feedback object currently being used. */ public ParserFeedback getFeedback() *************** *** 457,460 **** --- 510,515 ---- /** * Reset the parser to start from the beginning again. + * This assumes support for a reset from the underlying + * {@link org.htmlparser.lexer.Source} object. */ public void reset () *************** *** 464,488 **** /** ! * Returns an iterator (enumeration) to the html nodes. Each node can be a tag/endtag/ ! * string/link/image<br> ! * This is perhaps the most important method of this class. In typical situations, you will need to use ! * the parser like this : * <pre> ! * Parser parser = new Parser("http://www.yahoo.com"); ! * for (NodeIterator i = parser.elements();i.hasMoreElements();) { ! * Node node = i.nextHTMLNode(); ! * if (node instanceof StringNode) { ! * // Downcasting to StringNode ! * StringNode stringNode = (StringNode)node; ! * // Do whatever processing you want with the string node ! 
* System.out.println(stringNode.getText()); ! * } ! * // Check for the node or tag that you want ! * if (node instanceof ...) { ! * // Downcast, and process ! * // recursively (nodes within nodes) ! * } * } * </pre> */ public NodeIterator elements () throws ParserException --- 519,569 ---- /** ! * Returns an iterator (enumeration) over the html nodes. ! * {@link org.htmlparser.nodes Nodes} can be of three main types: ! * <ul> ! * <li>{@link org.htmlparser.nodes.TagNode TagNode}</li> ! * <li>{@link org.htmlparser.nodes.TextNode TextNode}</li> ! * <li>{@link org.htmlparser.nodes.RemarkNode RemarkNode}</li> ! * </ul> ! * In general, when parsing with an iterator or processing a NodeList, ! * you will need to use recursion. For example: ! * <code> * <pre> ! * void processMyNodes (Node node) ! * { ! * if (node instanceof TextNode) ! * { ! * // downcast to TextNode ! * TextNode text = (TextNode)node; ! * // do whatever processing you want with the text ! * System.out.println (text.getText ()); ! * } ! * if (node instanceof RemarkNode) ! * { ! * // downcast to RemarkNode ! * RemarkNode remark = (RemarkNode)node; ! * // do whatever processing you want with the comment ! * } ! * else if (node instanceof TagNode) ! * { ! * // downcast to TagNode ! * TagNode tag = (TagNode)node; ! * // do whatever processing you want with the tag itself ! * // ... ! * // process recursively (nodes within nodes) via getChildren() ! * NodeList list = tag.getChildren (); ! * if (null != list) ! * for (NodeIterator i = list.elements (); i.hasMoreElements (); ) ! * processMyNodes (i.nextNode ()); ! * } * } + * + * Parser parser = new Parser ("http://www.yahoo.com"); + * for (NodeIterator i = parser.elements (); i.hasMoreElements (); ) + * processMyNodes (i.nextNode ()); * </pre> + * </code> + * @throws ParserException If a parsing error occurs. + * @return An iterator over the top level nodes (usually {@.html <html>}). 
*/ public NodeIterator elements () throws ParserException *************** *** 493,499 **** /** * Parse the given resource, using the filter provided. - * @param filter The filter to apply to the parsed nodes. * @return The list of matching nodes (for a <code>null</code> * filter this is all the top level nodes). */ public NodeList parse (NodeFilter filter) throws ParserException --- 574,582 ---- /** * Parse the given resource, using the filter provided. * @return The list of matching nodes (for a <code>null</code> * filter this is all the top level nodes). + * @param filter The filter to apply to the parsed nodes, + * or <code>null</code> to retrieve all the top level nodes. + * @throws ParserException If a parsing error occurs. */ public NodeList parse (NodeFilter filter) throws ParserException *************** *** 516,520 **** } ! public void visitAllNodesWith(NodeVisitor visitor) throws ParserException { Node node; visitor.beginParsing(); --- 599,612 ---- } ! /** ! * Apply the given visitor to the current page. ! * The visitor is passed to the <code>accept()</code> method of each node ! * in the page in a depth first traversal. The visitor ! * <code>beginParsing()</code> method is called prior to processing the ! * page and <code>finishedParsing()</code> is called after the processing. ! * @param visitor The visitor to visit all nodes with. ! * @throws ParserException If a parse error occurs while traversing the page with the visitor. ! */ ! public void visitAllNodesWith (NodeVisitor visitor) throws ParserException { Node node; visitor.beginParsing(); *************** *** 529,532 **** --- 621,625 ---- * Initializes the parser with the given input HTML String. * @param inputHTML the input HTML that is to be parsed. + * @throws ParserException If a error occurs in setting up the underlying Lexer. */ public void setInputHTML (String inputHTML) *************** *** 543,546 **** --- 636,644 ---- * Extract all nodes matching the given filter. 
* @see Node#collectInto(NodeList, NodeFilter) + * @param filter The filter to be applied to the nodes. + * @throws ParserException If a parse error occurs. + * @return A list of nodes matching the filter criteria, + * i.e. for which the filter's accept method + * returned <code>true</code>. */ public NodeList extractAllNodesThatMatch (NodeFilter filter) throws ParserException *************** *** 558,564 **** /** * Convenience method to extract all nodes of a given class type. ! * @see Node#collectInto(NodeList, NodeFilter) */ ! public Node [] extractAllNodesThatAre (Class nodeType) throws ParserException { NodeList ret; --- 656,669 ---- /** * Convenience method to extract all nodes of a given class type. ! * Equivalent to <code>extractAllNodesThatMatch (new NodeClassFilter (nodeType))</code>. ! * @param nodeType The class of the nodes to collect. ! * @throws ParserException If a parse error occurs. ! * @return A list of nodes which have the class specified. ! * @deprecated Use extractAllNodesThatMatch (new NodeClassFilter (nodeType)). ! * @see #extractAllNodesThatAre */ ! public Node [] extractAllNodesThatAre (Class nodeType) ! throws ! ParserException { NodeList ret; *************** *** 575,602 **** /** * Called just prior to calling connect. ! * The connection has been conditioned with proxy, URL user/password, ! * and cookie information. It is still possible to adjust the ! * connection to alter the request method for example. * @param connection The connection which is about to be connected. ! * @exception This exception is thrown if the connection monitor ! * wants the ConnectionManager to bail out. */ public void preConnect (HttpURLConnection connection) ! throws ! ParserException ! { if (null != getFeedback ()) getFeedback ().info (ConnectionManager.getRequestHeader (connection)); ! } ! /** Called just after calling connect. ! * The response code and header fields can be examined. * @param connection The connection that was just connected. ! 
* @exception This exception is thrown if the connection monitor ! * wants the ConnectionManager to bail out. */ public void postConnect (HttpURLConnection connection) ! throws ! ParserException { if (null != getFeedback ()) --- 680,708 ---- /** * Called just prior to calling connect. ! * Part of the ConnectionMonitor interface, this implementation just ! * sends the request header to the feedback object if any. * @param connection The connection which is about to be connected. ! * @throws ParserException <em>Not used</em> ! * @see ConnectionMonitor#preConnect */ public void preConnect (HttpURLConnection connection) ! throws ! ParserException ! { if (null != getFeedback ()) getFeedback ().info (ConnectionManager.getRequestHeader (connection)); ! } ! /** ! * Called just after calling connect. ! * Part of the ConnectionMonitor interface, this implementation just ! * sends the response header to the feedback object if any. * @param connection The connection that was just connected. ! * @throws ParserException <em>Not used.</em> ! * @see ConnectionMonitor#postConnect */ public void postConnect (HttpURLConnection connection) ! throws ! ParserException { if (null != getFeedback ()) *************** *** 606,609 **** --- 712,717 ---- /** * The main program, which can be executed from the command line + * @param args A URL or file name to parse, and an optional tag name to be + * used as a filter. */ public static void main (String [] args) *************** *** 630,651 **** } else ! try ! { ! parser = new Parser (); ! if (1 < args.length) ! filter = new TagNameFilter (args[1]); ! else ! { // for a simple dump, use more verbose settings ! filter = null; ! parser.setFeedback (Parser.stdout); ! getConnectionManager ().setMonitor (parser); ! } ! parser.setURL (args[0]); ! System.out.println (parser.parse (filter)); ! } ! catch (ParserException e) ! { ! e.printStackTrace (); ! } } } --- 738,759 ---- } else ! try ! { ! parser = new Parser (); ! if (1 < args.length) ! 
filter = new TagNameFilter (args[1]); ! else ! { // for a simple dump, use more verbose settings ! filter = null; ! parser.setFeedback (Parser.stdout); ! getConnectionManager ().setMonitor (parser); ! } ! parser.setURL (args[0]); ! System.out.println (parser.parse (filter)); ! } ! catch (ParserException e) ! { ! e.printStackTrace (); ! } } } Index: NodeFilter.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/NodeFilter.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** NodeFilter.java 13 Feb 2005 20:36:01 -0000 1.2 --- NodeFilter.java 5 Apr 2005 00:48:10 -0000 1.3 *************** *** 44,47 **** --- 44,48 ---- * @return <code>true</code> if the node is to be kept, <code>false</code> * if it is to be discarded. + * @param node The node to test. */ boolean accept (Node node); |
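The new Parser class comment points out that the children of the top level html node include whitespace text, so the example page yields 5 children rather than 2, and the elements() comment recommends recursion for walking nested nodes. A rough, self-contained sketch of both points follows. SketchNode is a hypothetical stand-in written for this note, not the org.htmlparser.Node interface, and the tree is built by hand on the assumption that it matches what the parser would produce for the example page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parsed node; NOT the org.htmlparser.Node interface. */
final class SketchNode
{
    final String text;      // tag name for a tag, raw characters for text
    final boolean isTag;
    final List<SketchNode> children = new ArrayList<SketchNode> ();

    private SketchNode (String text, boolean isTag)
    {
        this.text = text;
        this.isTag = isTag;
    }

    static SketchNode tagNode (String name, SketchNode... kids)
    {
        SketchNode ret = new SketchNode (name, true);
        for (SketchNode kid : kids)
            ret.children.add (kid);
        return ret;
    }

    static SketchNode textNode (String characters)
    {
        return new SketchNode (characters, false);
    }

    /** Direct children that are tags, or text containing something other than whitespace. */
    static List<SketchNode> significantChildren (SketchNode node)
    {
        List<SketchNode> ret = new ArrayList<SketchNode> ();
        for (SketchNode child : node.children)
            if (child.isTag || !child.text.trim ().isEmpty ())
                ret.add (child);
        return ret;
    }

    /** Recursively count the tags in the subtree rooted at node (nodes within nodes). */
    static int countTags (SketchNode node)
    {
        int ret = node.isTag ? 1 : 0;
        for (SketchNode child : node.children)
            ret += countTags (child);
        return ret;
    }

    /** The "Mondays" page from the class comment, assembled by hand. */
    static SketchNode sample ()
    {
        SketchNode title = tagNode ("title", textNode ("Mondays -- What a bad idea."));
        SketchNode head = tagNode ("head", textNode ("\n"), title, textNode ("\n"));
        SketchNode body = tagNode ("body",
            textNode ("\nMost people have a pathological hatred of Mondays...\n"));
        return tagNode ("html",
            textNode ("\n"), head, textNode ("\n"), body, textNode ("\n"));
    }

    public static void main (String[] args)
    {
        SketchNode html = sample ();
        System.out.println (html.children.size ());               // 5: newlines are children too
        System.out.println (significantChildren (html).size ());  // 2: head and body
        System.out.println (countTags (html));                    // 4: html, head, title, body
    }
}
```

With the real library the same whitespace check would be applied to the text nodes found in the NodeList returned by getChildren(), or pushed into a filter or visitor as the class comment suggests.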
From: Derrick O. <der...@us...> - 2005-04-05 01:02:29
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28518/htmlparser/src/org/htmlparser/parserapplications Modified Files: SiteCapturer.java Log Message: Update javadocs. Enable SiteCapturer to handle resource names containing spaces. Index: SiteCapturer.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/parserapplications/SiteCapturer.java,v retrieving revision 1.7 retrieving revision 1.8 diff -C2 -d -r1.7 -r1.8 *** SiteCapturer.java 14 Feb 2005 23:49:24 -0000 1.7 --- SiteCapturer.java 5 Apr 2005 00:48:12 -0000 1.8 *************** *** 345,348 **** --- 345,395 ---- /** + * Unescape a URL to form a file name. + * Very crude. + * @param raw The escaped URI. + * @return The native URI. + */ + protected String decode (String raw) + { + int length; + int start; + int index; + int value; + StringBuffer ret; + + ret = new StringBuffer (raw.length ()); + + length = raw.length (); + start = 0; + while (-1 != (index = raw.indexOf ('%', start))) + { // append the part up to the % sign + ret.append (raw.substring (start, index)); + // there must be two hex digits after the percent sign + if (index + 2 < length) + { + try + { + value = Integer.parseInt (raw.substring (index + 1, index + 3), 16); + ret.append ((char)value); + start = index + 3; + } + catch (NumberFormatException nfe) + { + ret.append ('%'); + start = index + 1; + } + } + else + { // this case is actually illegal in a URI, but... + ret.append ('%'); + start = index + 1; + } + } + ret.append (raw.substring (start)); + + return (ret.toString ()); + } + + /** * Copy a resource (image) locally. * Removes one element from the 'to be copied' list and saves the *************** *** 352,355 **** --- 399,404 ---- { String link; + String raw; + String name; File file; File dir; *************** *** 365,369 **** if (getCaptureResources ()) { ! 
file = new File (getTarget (), makeLocalLink (link, "")); System.out.println ("copying " + link + " to " + file.getAbsolutePath ()); // ensure directory exists --- 414,420 ---- if (getCaptureResources ()) { ! raw = makeLocalLink (link, ""); ! name = decode (raw); ! file = new File (getTarget (), name); System.out.println ("copying " + link + " to " + file.getAbsolutePath ()); // ensure directory exists |
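For readers following the SiteCapturer change above, the percent-unescaping step can be tried in isolation. The following is a minimal standalone sketch of the same approach, not the committed code: the class name PercentDecodeSketch is hypothetical, and StringBuilder stands in for the StringBuffer used in the commit.

```java
// Standalone sketch of the crude %XX unescaping described in the commit above.
// PercentDecodeSketch is a hypothetical name used only for this illustration.
class PercentDecodeSketch {

    /** Replace each %XX escape with the character whose code is hex XX. */
    public static String decode(String raw) {
        StringBuilder ret = new StringBuilder(raw.length());
        int length = raw.length();
        int start = 0;
        int index;
        while (-1 != (index = raw.indexOf('%', start))) {
            // copy the part up to the percent sign
            ret.append(raw, start, index);
            // two characters must follow the percent sign
            if (index + 2 < length) {
                try {
                    int value = Integer.parseInt(raw.substring(index + 1, index + 3), 16);
                    ret.append((char) value);
                    start = index + 3;
                } catch (NumberFormatException nfe) {
                    // not hex digits: keep the percent sign literally
                    ret.append('%');
                    start = index + 1;
                }
            } else {
                // truncated escape: keep the percent sign literally
                ret.append('%');
                start = index + 1;
            }
        }
        // copy whatever follows the last escape
        ret.append(raw.substring(start));
        return ret.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("my%20file.gif"));
        System.out.println(decode("100%"));
        System.out.println(decode("a%zzb"));
    }
}
```

Each escape is turned into exactly one char, so a multi-byte UTF-8 sequence such as %C3%A9 comes out as two characters rather than one. Where that matters, java.net.URLDecoder.decode (String, String) from the standard library is an alternative, with the caveat that it also converts '+' to a space.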
From: Derrick O. <der...@us...> - 2005-03-13 15:36:26
|
Update of /cvsroot/htmlparser/htmlparser/docs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28181/docs Modified Files: changes.txt release.txt Log Message: Update version to 1.5-20050313. Index: release.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/release.txt,v retrieving revision 1.67 retrieving revision 1.68 diff -C2 -d -r1.67 -r1.68 *** release.txt 7 Mar 2005 02:57:34 -0000 1.67 --- release.txt 13 Mar 2005 15:36:10 -0000 1.68 *************** *** 1,3 **** ! HTMLParser Version 1.5 (Integration Build Mar 06, 2005) ********************************************* --- 1,3 ---- ! HTMLParser Version 1.5 (Integration Build Mar 13, 2005) ********************************************* *************** *** 5,11 **** ---------------------------- (i) jar files - lib directory ! HTML Parser jars: htmlparser.jar, lexer.jar and thumbelina.jar. ! Also thirdparty jar files checkstyle-all-3.1.jar, commons-logging.jar, ! fit.jar and junit.jar. (ii) source code - src.zip --- 5,11 ---- ---------------------------- (i) jar files - lib directory ! HTML Parser jars: htmlparser.jar, lexer.jar, thumbelina.jar and ! filterbuilder.jar. ! Also thirdparty jar files checkstyle-all-3.1.jar, fit.jar and junit.jar. (ii) source code - src.zip *************** *** 41,45 **** --- 41,54 ---- Updated the logo and included the LGPL license. Fixed the Windows batch files. + Added optional "classes" property to build.xml. This directory is where + class files are put. It defaults to src. + To use: + ant -Dclasses=classdir <target> + where classdir is/will-be a peer directory to src. Refactoring + Added static STRICT flag to ScriptScanner to revert to legacy handling of + broken ETAGO (</). If STRICT is true, scan according to HTML specification, + else if false, scan with quote smart state machine which heuristically + yields the correct parse. Obviated LinkProcessor and moved it's functionality to the Page class. 
Added Tag, Text and Remark interfaces and moved concrete node *************** *** 63,77 **** Enhancement Requests -------------------- ! 943593 LinkProcessor.extract(link,base) weird behaviour? ! 943197 Accept gzip / deflate content encodings ! 874000 Remove specialized tag signatures from NodeVisitor ! 1000063 FilterBean 1017249 HTML Client Doesn't Support Cookies but will follow redirect 1010586 Add support for password protected URL 1000739 Add support for proxy scenario Bug Fixes --------- 1153508 CVS sources do not compile 1104627 Parser Crash reading javascript 1061869 Crashing when trying to capture link to XLS document --- 72,90 ---- Enhancement Requests -------------------- ! 1160345 NodeList.visitAllNodesWith 1017249 HTML Client Doesn't Support Cookies but will follow redirect 1010586 Add support for password protected URL 1000739 Add support for proxy scenario + 1000063 FilterBean + 943593 LinkProcessor.extract(link,base) weird behaviour? + 943197 Accept gzip / deflate content encodings + 874000 Remove specialized tag signatures from NodeVisitor Bug Fixes --------- + 1161137 Non English Character web page + 1160010 NullPointerException in addCookies 1153508 CVS sources do not compile + 1121401 No Parsing with yahoo! 1104627 Parser Crash reading javascript 1061869 Crashing when trying to capture link to XLS document *************** *** 80,83 **** --- 93,97 ---- 1024045 StringBean crashes on an URL 1021925 StyleTag with missing linefeed prevents page from parsing + 1018884 'compile' ant task from build.xml messes up ./src directory 1005409 Input file not free by parser. 
998195 SiteCatpurer just crashed Index: changes.txt =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/docs/changes.txt,v retrieving revision 1.204 retrieving revision 1.205 diff -C2 -d -r1.204 -r1.205 *** changes.txt 7 Mar 2005 02:57:34 -0000 1.204 --- changes.txt 13 Mar 2005 15:36:08 -0000 1.205 *************** *** 16,19 **** --- 16,88 ---- ******************************************************************************* + Integration Build 1.5 - 20050313 + -------------------------------- + + 2005-03-13 09:51 derrickoswald + + * src/org/htmlparser/: lexer/Lexer.java, lexer/Page.java, + lexer/Source.java, lexerapplications/tabby/Tabby.java, + scanners/ScriptDecoder.java, tests/lexerTests/TagTests.java, + util/IteratorImpl.java: + + Bug #1121401 No Parsing with yahoo! + By default nio.charset.CharsetDecoder replaces characters it cannot + represent in the current encoding with zero, which was the value + returned by the page when the Stream reached EOF. + This changes the Page return value to (char)Source.EOF (-1) when + the end of stream is encountered. + + 2005-03-12 16:39 derrickoswald + + * src/org/htmlparser/beans/: BeanyBaby.java, LinkBean.java: + + Fix bean example, stop sharing connections. + + 2005-03-12 15:27 derrickoswald + + * build.xml, lib/commons-logging.jar: + + Bug #1018884 'compile' ant task from build.xml messes up ./src directory + Added optional "classes" property to build.xml. + This directory is where class files are put. It defaults to src. + To use: + build -Dclasses=classdir <target> + where classdir is a peer directory to src. + Removed unused commons-logging.jar while I was in there. + + 2005-03-12 12:53 derrickoswald + + * src/org/htmlparser/: lexer/Lexer.java, + scanners/ScriptScanner.java, + tests/scannersTests/ScriptScannerTest.java: + + Add STRICT flag to ScriptScanner to revert to legacy handling of broken ETAGO (</). 
+ If STRICT is true, scan according to HTML specification, else if false, scan with + quote smart state machine which heuristically yields the correct parse. + + 2005-03-12 08:39 derrickoswald + + * src/org/htmlparser/: + tests/visitorsTests/UrlModifyingVisitorTest.java, + util/NodeList.java: + + RFE #1160345 NodeList.visitAllNodesWith + Added visitAllNodesWith to the NodeList class. + + 2005-03-12 07:52 derrickoswald + + * src/org/htmlparser/: beans/StringBean.java, + tests/utilTests/AllTests.java, tests/utilTests/NonEnglishTest.java: + + Bug #1161137 Non English Character web page + Reinitialize the string buffer after encoding change exception processing. + + 2005-03-12 06:52 derrickoswald + + * src/org/htmlparser/http/ConnectionManager.java: + + Bug #1160010 NullPointerException in addCookies + Add test for null expiry date. + Integration Build 1.5 - 20050306 -------------------------------- |
From: Derrick O. <der...@us...> - 2005-03-13 15:36:25
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv28181/src/org/htmlparser

Modified Files:
	Parser.java
Log Message:
Update version to 1.5-20050313.

Index: Parser.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/Parser.java,v
retrieving revision 1.102
retrieving revision 1.103
diff -C2 -d -r1.102 -r1.103
*** Parser.java 7 Mar 2005 02:57:35 -0000 1.102
--- Parser.java 13 Mar 2005 15:36:11 -0000 1.103
***************
*** 83,87 ****
       */
      public final static String
!     VERSION_DATE = "Mar 06, 2005"
      ;
--- 83,87 ----
       */
      public final static String
!     VERSION_DATE = "Mar 13, 2005"
      ; |
From: Derrick O. <der...@us...> - 2005-03-13 14:52:29
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv16205/lexer Modified Files: Lexer.java Page.java Source.java Log Message: Bug #1121401 No Parsing with yahoo! By default nio.charset.CharsetDecoder replaces characters it cannot represent in the current encoding with zero, which was the value returned by the page when the Stream reached EOF. This changes the Page return value to (char)Source.EOF (-1) when the end of stream is encountered. Index: Source.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Source.java,v retrieving revision 1.17 retrieving revision 1.18 diff -C2 -d -r1.17 -r1.18 *** Source.java 13 Feb 2005 22:45:47 -0000 1.17 --- Source.java 13 Mar 2005 14:51:44 -0000 1.18 *************** *** 110,114 **** * @param off Offset at which to start storing characters * @param len Maximum number of characters to read ! * @return The number of characters read, or {@link #EOF} if the esource is * exhausted. * @exception IOException If an I/O error occurs. --- 110,114 ---- * @param off Offset at which to start storing characters * @param len Maximum number of characters to read ! * @return The number of characters read, or {@link #EOF} if the source is * exhausted. * @exception IOException If an I/O error occurs. *************** *** 121,125 **** * or the source is exhausted. * @param cbuf Destination buffer. ! * @return The number of characters read, or {@link #EOF} if the esource is * exhausted. * @exception IOException If an I/O error occurs. --- 121,125 ---- * or the source is exhausted. * @param cbuf Destination buffer. ! * @return The number of characters read, or {@link #EOF} if the source is * exhausted. * @exception IOException If an I/O error occurs. 
Index: Page.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v retrieving revision 1.47 retrieving revision 1.48 diff -C2 -d -r1.47 -r1.48 *** Page.java 7 Mar 2005 02:18:37 -0000 1.47 --- Page.java 13 Mar 2005 14:51:43 -0000 1.48 *************** *** 69,72 **** --- 69,78 ---- /** + * Character value when the page is exhausted. + * Has a value of {@value}. + */ + public static final char EOF = (char)Source.EOF; + + /** * The URL this page is coming from. * Cached value of <code>getConnection().toExternalForm()</code> or *************** *** 647,652 **** { i = mSource.read (); ! if (0 > i) ! ret = 0; else { --- 653,658 ---- { i = mSource.read (); ! if (Source.EOF == i) ! ret = EOF; else { *************** *** 687,691 **** { i = mSource.read (); ! if (-1 == i) { // do nothing --- 693,697 ---- { i = mSource.read (); ! if (Source.EOF == i) { // do nothing Index: Lexer.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Lexer.java,v retrieving revision 1.36 retrieving revision 1.37 diff -C2 -d -r1.36 -r1.37 *** Lexer.java 12 Mar 2005 17:53:08 -0000 1.36 --- Lexer.java 13 Mar 2005 14:51:43 -0000 1.37 *************** *** 261,270 **** switch (ch) { ! case 0: // end of input ret = null; break; case '<': ch = mPage.getCharacter (mCursor); ! if (0 == ch) ret = makeString (start, mCursor.getPosition ()); else if ('%' == ch) --- 261,270 ---- switch (ch) { ! case Page.EOF: ret = null; break; case '<': ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) ret = makeString (start, mCursor.getPosition ()); else if ('%' == ch) *************** *** 281,285 **** { ch = mPage.getCharacter (mCursor); ! if (0 == ch) ret = makeString (start, mCursor.getPosition ()); else --- 281,285 ---- { ch = mPage.getCharacter (mCursor); ! 
if (Page.EOF == ch) ret = makeString (start, mCursor.getPosition ()); else *************** *** 329,333 **** { ch = mPage.getCharacter (cursor); ! if (0 == ch) done = true; else --- 329,333 ---- { ch = mPage.getCharacter (cursor); ! if (Page.EOF == ch) done = true; else *************** *** 377,391 **** { ch = mPage.getCharacter (mCursor); ! if (0 == ch) done = true; else if (0x1b == ch) // escape { ch = mPage.getCharacter (mCursor); ! if (0 == ch) done = true; else if ('$' == ch) { ch = mPage.getCharacter (mCursor); ! if (0 == ch) done = true; else if ('B' == ch) --- 377,391 ---- { ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) done = true; else if (0x1b == ch) // escape { ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) done = true; else if ('$' == ch) { ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) done = true; else if ('B' == ch) *************** *** 406,410 **** { ch = mPage.getCharacter (mCursor); //try to consume escaped character ! if ( (ch != '\\') // escaped backslash && (ch != quote)) // escaped quote character // ( reflects ["] or ['] whichever opened the quotation) --- 406,411 ---- { ch = mPage.getCharacter (mCursor); //try to consume escaped character ! if ((Page.EOF != ch) ! && ('\\' != ch) // escaped backslash && (ch != quote)) // escaped quote character // ( reflects ["] or ['] whichever opened the quotation) *************** *** 418,422 **** // I can't handle single quotations. ch = mPage.getCharacter (mCursor); ! if (0 == ch) done = true; else if ('/' == ch) --- 419,423 ---- // I can't handle single quotations. ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) done = true; else if ('/' == ch) *************** *** 424,428 **** do ch = mPage.getCharacter (mCursor); ! while ((ch != 0) && (ch != '\n')); } else if ('*' == ch) --- 425,429 ---- do ch = mPage.getCharacter (mCursor); ! while ((Page.EOF != ch) && ('\n' != ch)); } else if ('*' == ch) *************** *** 432,441 **** do ch = mPage.getCharacter (mCursor); ! 
while ((ch != 0) && (ch != '*')); ch = mPage.getCharacter (mCursor); if (ch == '*') mCursor.retreat (); } ! while ((ch != 0) && (ch != '/')); } else --- 433,442 ---- do ch = mPage.getCharacter (mCursor); ! while ((Page.EOF != ch) && ('*' != ch)); ch = mPage.getCharacter (mCursor); if (ch == '*') mCursor.retreat (); } ! while ((Page.EOF != ch) && ('/' != ch)); } else *************** *** 445,449 **** { ch = mPage.getCharacter (mCursor); ! if (0 == ch) done = true; // the order of these tests might be optimized for speed: --- 446,450 ---- { ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) done = true; // the order of these tests might be optimized for speed: *************** *** 600,604 **** { case 0: // outside of any attribute ! if ((0 == ch) || ('>' == ch) || ('<' == ch)) { if ('<' == ch) --- 601,605 ---- { case 0: // outside of any attribute ! if ((Page.EOF == ch) || ('>' == ch) || ('<' == ch)) { if ('<' == ch) *************** *** 618,622 **** break; case 1: // within attribute name ! if ((0 == ch) || ('>' == ch) || ('<' == ch)) { if ('<' == ch) --- 619,623 ---- break; case 1: // within attribute name ! if ((Page.EOF == ch) || ('>' == ch) || ('<' == ch)) { if ('<' == ch) *************** *** 640,644 **** break; case 2: // equals hit ! if ((0 == ch) || ('>' == ch)) { empty (attributes, bookmarks); --- 641,645 ---- break; case 2: // equals hit ! if ((Page.EOF == ch) || ('>' == ch)) { empty (attributes, bookmarks); *************** *** 665,669 **** break; case 3: // within naked attribute value ! if ((0 == ch) || ('>' == ch)) { naked (attributes, bookmarks); --- 666,670 ---- break; case 3: // within naked attribute value ! if ((Page.EOF == ch) || ('>' == ch)) { naked (attributes, bookmarks); *************** *** 678,682 **** break; case 4: // within single quoted attribute value ! if (0 == ch) { single_quote (attributes, bookmarks); --- 679,683 ---- break; case 4: // within single quoted attribute value ! 
if (Page.EOF == ch) { single_quote (attributes, bookmarks); *************** *** 691,695 **** break; case 5: // within double quoted attribute value ! if (0 == ch) { double_quote (attributes, bookmarks); --- 692,696 ---- break; case 5: // within double quoted attribute value ! if (Page.EOF == ch) { double_quote (attributes, bookmarks); *************** *** 708,712 **** case 6: // undecided for state 0 or 2 // we have read white spaces after an attributte name ! if (0 == ch) { // same as last else clause --- 709,713 ---- case 6: // undecided for state 0 or 2 // we have read white spaces after an attributte name ! if (Page.EOF == ch) { // same as last else clause *************** *** 824,828 **** { ch = mPage.getCharacter (mCursor); ! if (0 == ch) done = true; else --- 825,829 ---- { ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) done = true; else *************** *** 842,846 **** // handle <!--> because netscape does ch = mPage.getCharacter (mCursor); ! if (0 == ch) done = true; else if ('>' == ch) --- 843,847 ---- // handle <!--> because netscape does ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) done = true; else if ('>' == ch) *************** *** 858,862 **** if ('-' == ch) state = 3; ! else if (0 == ch) return (parseString (start, quotesmart)); // no terminator break; --- 859,863 ---- if ('-' == ch) state = 3; ! else if (Page.EOF == ch) return (parseString (start, quotesmart)); // no terminator break; *************** *** 946,950 **** state = 1; break; ! // case 0: // <\0 // case '>': // <> default: --- 947,951 ---- state = 1; break; ! // case Page.EOF: // <\0 // case '>': // <> default: *************** *** 956,960 **** switch (ch) { ! case 0: // <%\0 case '>': // <%> done = true; --- 957,961 ---- switch (ch) { ! case Page.EOF: // <%\0 case '>': // <%> done = true; *************** *** 976,980 **** switch (ch) { ! case 0: // <%x\0 case '>': // <%x> done = true; --- 977,981 ---- switch (ch) { ! 
case Page.EOF: // <%x\0 case '>': // <%x> done = true; *************** *** 994,998 **** switch (ch) { ! case 0: // <%x??%\0 done = true; break; --- 995,999 ---- switch (ch) { ! case Page.EOF: // <%x??%\0 done = true; break; *************** *** 1009,1013 **** switch (ch) { ! case 0: // <%x??"\0 done = true; break; --- 1010,1014 ---- switch (ch) { ! case Page.EOF: // <%x??"\0 done = true; break; *************** *** 1022,1026 **** switch (ch) { ! case 0: // <%x??'\0 done = true; break; --- 1023,1027 ---- switch (ch) { ! case Page.EOF: // <%x??'\0 done = true; break; *************** *** 1110,1114 **** switch (ch) { ! case 0: // end of input done = true; break; --- 1111,1115 ---- switch (ch) { ! case Page.EOF: done = true; break; *************** *** 1132,1137 **** { ch = mPage.getCharacter (mCursor); // try to consume escaped character ! if (0 == ch) ! mCursor.retreat (); else if ( (ch != '\\') && (ch != quote)) mCursor.retreat (); // unconsume char if character was not an escapable char. --- 1133,1138 ---- { ch = mPage.getCharacter (mCursor); // try to consume escaped character ! if (Page.EOF == ch) ! done = true; else if ( (ch != '\\') && (ch != quote)) mCursor.retreat (); // unconsume char if character was not an escapable char. *************** *** 1144,1154 **** // handle multiline and double slash comments (with a quote) ch = mPage.getCharacter (mCursor); ! if (0 == ch) ! mCursor.retreat (); else if ('/' == ch) { do ch = mPage.getCharacter (mCursor); ! while ((ch != 0) && (ch != '\n')); } else if ('*' == ch) --- 1145,1155 ---- // handle multiline and double slash comments (with a quote) ch = mPage.getCharacter (mCursor); ! if (Page.EOF == ch) ! done = true; else if ('/' == ch) { do ch = mPage.getCharacter (mCursor); ! while ((Page.EOF != ch) && ('\n' != ch)); } else if ('*' == ch) *************** *** 1158,1167 **** do ch = mPage.getCharacter (mCursor); ! while ((ch != 0) && (ch != '*')); ch = mPage.getCharacter (mCursor); if (ch == '*') mCursor.retreat (); } ! 
while ((ch != 0) && (ch != '/')); } else --- 1159,1168 ---- do ch = mPage.getCharacter (mCursor); ! while ((Page.EOF != ch) && ('*' != ch)); ch = mPage.getCharacter (mCursor); if (ch == '*') mCursor.retreat (); } ! while ((Page.EOF != ch) && ('/' != ch)); } else *************** *** 1185,1189 **** switch (ch) { ! case 0: // end of input done = true; break; --- 1186,1190 ---- switch (ch) { ! case Page.EOF: done = true; break; *************** *** 1197,1201 **** break; case 2: // </ ! if (0 == ch) done = true; else if (Character.isLetter (ch)) --- 1198,1202 ---- break; case 2: // </ ! if (Page.EOF == ch) done = true; else if (Character.isLetter (ch)) |
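The reason for replacing the literal 0 with Page.EOF throughout Lexer.java can be shown without the parser at all. The sketch below is an assumption-laden illustration, not library code: the class EofSentinelSketch and its methods are hypothetical, and a plain String stands in for the Page. It contrasts the old end-of-input value (0) with the new one, (char)-1, which is '\uffff'.

```java
// Hypothetical illustration of the end-of-input sentinel change; the names
// here are not part of the htmlparser API.
class EofSentinelSketch {

    /** New sentinel: (char) -1, i.e. '\uffff'. */
    public static final char EOF = (char) -1;

    /** Old scheme: report end of input as 0. */
    static char readOld(String text, int pos) {
        return (pos < text.length()) ? text.charAt(pos) : 0;
    }

    /** New scheme: report end of input as EOF. */
    static char readNew(String text, int pos) {
        return (pos < text.length()) ? text.charAt(pos) : EOF;
    }

    /** Count characters until the old scheme signals end of input. */
    public static int countOld(String text) {
        int pos = 0;
        while (0 != readOld(text, pos))
            pos++;
        return pos;
    }

    /** Count characters until the new scheme signals end of input. */
    public static int countNew(String text) {
        int pos = 0;
        while (EOF != readNew(text, pos))
            pos++;
        return pos;
    }

    public static void main(String[] args) {
        // a NUL in the middle, as a decoder substituting zero might produce
        String text = "ab\0cd";
        System.out.println(countOld(text)); // stops at the NUL
        System.out.println(countNew(text)); // reads all five characters
    }
}
```

With the old value, a zero character in the middle of the data is indistinguishable from end of input, which matches the symptom reported in bug #1121401. The new value can still collide in principle, but U+FFFF is a Unicode noncharacter and should not occur in legitimate text.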
From: Derrick O. <der...@us...> - 2005-03-13 14:51:57
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv16205/util Modified Files: IteratorImpl.java Log Message: Bug #1121401 No Parsing with yahoo! By default nio.charset.CharsetDecoder replaces characters it cannot represent in the current encoding with zero, which was the value returned by the page when the Stream reached EOF. This changes the Page return value to (char)Source.EOF (-1) when the end of stream is encountered. Index: IteratorImpl.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/IteratorImpl.java,v retrieving revision 1.42 retrieving revision 1.43 diff -C2 -d -r1.42 -r1.43 *** IteratorImpl.java 31 Jul 2004 16:42:33 -0000 1.42 --- IteratorImpl.java 13 Mar 2005 14:51:46 -0000 1.43 *************** *** 31,34 **** --- 31,35 ---- import org.htmlparser.lexer.Cursor; import org.htmlparser.lexer.Lexer; + import org.htmlparser.lexer.Page; import org.htmlparser.scanners.Scanner; import org.htmlparser.util.NodeIterator; *************** *** 56,60 **** mCursor.setPosition (mLexer.getPosition ()); ! ret = 0 != mLexer.getPage ().getCharacter (mCursor); // more characters? return (ret); --- 57,61 ---- mCursor.setPosition (mLexer.getPosition ()); ! ret = Page.EOF != mLexer.getPage ().getCharacter (mCursor); // more characters? return (ret); |
From: Derrick O. <der...@us...> - 2005-03-13 14:51:56
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv16205/scanners Modified Files: ScriptDecoder.java Log Message: Bug #1121401 No Parsing with yahoo! By default nio.charset.CharsetDecoder replaces characters it cannot represent in the current encoding with zero, which was the value returned by the page when the Stream reached EOF. This changes the Page return value to (char)Source.EOF (-1) when the end of stream is encountered. Index: ScriptDecoder.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptDecoder.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** ScriptDecoder.java 17 Jul 2004 13:45:03 -0000 1.2 --- ScriptDecoder.java 13 Mar 2005 14:51:45 -0000 1.3 *************** *** 309,313 **** input_character = page.getCharacter (cursor); character = (char)input_character; ! if (0 == input_character) { if ( (STATE_INITIAL != state) --- 309,313 ---- input_character = page.getCharacter (cursor); character = (char)input_character; ! if (Page.EOF == input_character) { if ( (STATE_INITIAL != state) |
From: Derrick O. <der...@us...> - 2005-03-13 14:51:56
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv16205/tests/lexerTests Modified Files: TagTests.java Log Message: Bug #1121401 No Parsing with yahoo! By default nio.charset.CharsetDecoder replaces characters it cannot represent in the current encoding with zero, which was the value returned by the page when the Stream reached EOF. This changes the Page return value to (char)Source.EOF (-1) when the end of stream is encountered. Index: TagTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/lexerTests/TagTests.java,v retrieving revision 1.12 retrieving revision 1.13 diff -C2 -d -r1.12 -r1.13 *** TagTests.java 31 Jul 2004 16:42:31 -0000 1.12 --- TagTests.java 13 Mar 2005 14:51:46 -0000 1.13 *************** *** 43,80 **** private static final String TEST_HTML = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">" + ! "<!-- Server: sf-web2 -->" + ! "<html lang=\"en\">" + ! " <head><link rel=\"stylesheet\" type=\"text/css\" href=\"http://sourceforge.net/cssdef.php\">" + ! " <meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">" + ! " <TITLE>SourceForge.net: Modify: 711073 - HTMLTagParser not threadsafe as a static variable in Tag</TITLE>" + ! " <SCRIPT language=\"JavaScript\" type=\"text/javascript\">" + ! " <!--" + ! " function help_window(helpurl) {" + ! " HelpWin = window.open( 'http://sourceforge.net' + helpurl,'HelpWindow','scrollbars=yes,resizable=yes,toolbar=no,height=400,width=400');" + ! " }" + ! " // -->" + ! " </SCRIPT>" + ! " <link rel=\"SHORTCUT ICON\" href=\"/images/favicon.ico\">" + ! "<!-- This is temp javascript for the jump button. If we could actually have a jump script on the server side that would be ideal -->" + ! "<script language=\"JavaScript\" type=\"text/javascript\">" + ! "<!--" + ! " function jump(targ,selObj,restore){ //v3.0" + ! 
" if (selObj.options[selObj.selectedIndex].value) " + ! " eval(targ+\".location='\"+selObj.options[selObj.selectedIndex].value+\"'\");" + ! " if (restore) selObj.selectedIndex=0;" + ! " }" + ! " //-->" + ! "</script>" + ! "<a href=\"http://normallink.com/sometext.html\">" + ! "<style type=\"text/css\">" + ! "<!--" + ! "A:link { text-decoration:none }" + ! "A:visited { text-decoration:none }" + ! "A:active { text-decoration:none }" + ! "A:hover { text-decoration:underline; color:#0066FF; }" + ! "-->" + ! "</style>" + ! "</head>" + ! "<body bgcolor=\"#FFFFFF\" text=\"#000000\" leftmargin=\"0\" topmargin=\"0\" marginwidth=\"0\" marginheight=\"0\" link=\"#003399\" vlink=\"#003399\" alink=\"#003399\">"; private int testProgress; --- 43,80 ---- private static final String TEST_HTML = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">" + ! "<!-- Server: sf-web2 -->\n" + ! "<html lang=\"en\">\n" + ! " <head><link rel=\"stylesheet\" type=\"text/css\" href=\"http://sourceforge.net/cssdef.php\">\n" + ! " <meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">\n" + ! " <TITLE>SourceForge.net: Modify: 711073 - HTMLTagParser not threadsafe as a static variable in Tag</TITLE>\n" + ! " <SCRIPT language=\"JavaScript\" type=\"text/javascript\">\n" + ! " <!--\n" + ! " function help_window(helpurl) {\n" + ! " HelpWin = window.open( 'http://sourceforge.net' + helpurl,'HelpWindow','scrollbars=yes,resizable=yes,toolbar=no,height=400,width=400');\n" + ! " }\n" + ! " // -->\n" + ! " </SCRIPT>\n" + ! " <link rel=\"SHORTCUT ICON\" href=\"/images/favicon.ico\">\n" + ! "<!-- This is temp javascript for the jump button. If we could actually have a jump script on the server side that would be ideal -->\n" + ! "<script language=\"JavaScript\" type=\"text/javascript\">\n" + ! "<!--\n" + ! " function jump(targ,selObj,restore){ //v3.0\n" + ! " if (selObj.options[selObj.selectedIndex].value)\n" + ! 
" eval(targ+\".location='\"+selObj.options[selObj.selectedIndex].value+\"'\");\n" + ! " if (restore) selObj.selectedIndex=0;\n" + ! " }\n" + ! " //-->\n" + ! "</script>\n" + ! "<a href=\"http://normallink.com/sometext.html\">\n" + ! "<style type=\"text/css\">\n" + ! "<!--\n" + ! "A:link { text-decoration:none }\n" + ! "A:visited { text-decoration:none }\n" + ! "A:active { text-decoration:none }\n" + ! "A:hover { text-decoration:underline; color:#0066FF; }\n" + ! "-->\n" + ! "</style>\n" + ! "</head>\n" + ! "<body bgcolor=\"#FFFFFF\" text=\"#000000\" leftmargin=\"0\" topmargin=\"0\" marginwidth=\"0\" marginheight=\"0\" link=\"#003399\" vlink=\"#003399\" alink=\"#003399\">\n"; private int testProgress; *************** *** 309,314 **** } while (testProgress!=completionValue); ! for (int i=0;i<parsingThread.length;i++) { ! if (!parsingThread[i].passed()) { assertNotNull("Thread "+i+" link 1",parsingThread[i].getLink1()); assertNotNull("Thread "+i+" link 2",parsingThread[i].getLink2()); --- 309,316 ---- } while (testProgress!=completionValue); ! for (int i=0;i<parsingThread.length;i++) ! { ! if (!parsingThread[i].passed()) ! { assertNotNull("Thread "+i+" link 1",parsingThread[i].getLink1()); assertNotNull("Thread "+i+" link 2",parsingThread[i].getLink2()); |
From: Derrick O. <der...@us...> - 2005-03-13 14:51:54
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexerapplications/tabby In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv16205/lexerapplications/tabby Modified Files: Tabby.java Log Message: Bug #1121401 No Parsing with yahoo! By default nio.charset.CharsetDecoder replaces characters it cannot represent in the current encoding with zero, which was the value returned by the page when the Stream reached EOF. This changes the Page return value to (char)Source.EOF (-1) when the end of stream is encountered. Index: Tabby.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexerapplications/tabby/Tabby.java,v retrieving revision 1.2 retrieving revision 1.3 diff -C2 -d -r1.2 -r1.3 *** Tabby.java 31 Jul 2004 16:42:34 -0000 1.2 --- Tabby.java 13 Mar 2005 14:51:44 -0000 1.3 *************** *** 143,147 **** expected = 0; last = -1; ! while (0 != (ch = page.getCharacter (cursor))) { if (++expected != cursor.getPosition ()) --- 143,147 ---- expected = 0; last = -1; ! while (Page.EOF != (ch = page.getCharacter (cursor))) { if (++expected != cursor.getPosition ()) *************** *** 296,299 **** --- 296,307 ---- * * $Log$ + * Revision 1.3 2005/03/13 14:51:44 derrickoswald + * Bug #1121401 No Parsing with yahoo! + * By default nio.charset.CharsetDecoder replaces characters it cannot + * represent in the current encoding with zero, which was the value + * returned by the page when the Stream reached EOF. + * This changes the Page return value to (char)Source.EOF (-1) when + * the end of stream is encountered. + * * Revision 1.2 2004/07/31 16:42:34 derrickoswald * Remove unused variables and other fixes exposed by turning on compiler warnings. |
From: Derrick O. <der...@us...> - 2005-03-12 21:39:56
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20668 Modified Files: BeanyBaby.java LinkBean.java Log Message: Fix bean example, stop sharing connections. Index: LinkBean.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/LinkBean.java,v retrieving revision 1.29 retrieving revision 1.30 diff -C2 -d -r1.29 -r1.30 *** LinkBean.java 16 May 2004 17:59:57 -0000 1.29 --- LinkBean.java 12 Mar 2005 21:39:45 -0000 1.30 *************** *** 36,44 **** import org.htmlparser.Node; import org.htmlparser.Parser; import org.htmlparser.tags.LinkTag; import org.htmlparser.util.EncodingChangeException; import org.htmlparser.util.ParserException; - import org.htmlparser.visitors.ObjectFindingVisitor; /** --- 36,46 ---- import org.htmlparser.Node; + import org.htmlparser.NodeFilter; import org.htmlparser.Parser; + import org.htmlparser.filters.NodeClassFilter; import org.htmlparser.tags.LinkTag; import org.htmlparser.util.EncodingChangeException; + import org.htmlparser.util.NodeList; import org.htmlparser.util.ParserException; /** *************** *** 84,113 **** // ! protected URL[] extractLinks (String url) throws ParserException { ! Parser parser; ! ObjectFindingVisitor visitor; Vector vector; LinkTag link; URL[] ret; ! parser = new Parser (url); ! visitor = new ObjectFindingVisitor (LinkTag.class); try { ! parser.visitAllNodesWith (visitor); } catch (EncodingChangeException ece) { ! parser.reset (); ! visitor = new ObjectFindingVisitor (LinkTag.class); ! parser.visitAllNodesWith (visitor); } - Node [] nodes = visitor.getTags(); vector = new Vector(); ! for (int i = 0; i < nodes.length; i++) try { ! link = (LinkTag)nodes[i]; vector.add(new URL (link.getLink ())); } --- 86,113 ---- // ! protected URL[] extractLinks () throws ParserException { ! NodeFilter filter; ! NodeList list; Vector vector; LinkTag link; URL[] ret; ! 
mParser.reset (); ! filter = new NodeClassFilter (LinkTag.class); try { ! list = mParser.extractAllNodesThatMatch (filter); } catch (EncodingChangeException ece) { ! mParser.reset (); ! list = mParser.extractAllNodesThatMatch (filter); } vector = new Vector(); ! for (int i = 0; i < list.size (); i++) try { ! link = (LinkTag)list.elementAt (i); vector.add(new URL (link.getLink ())); } *************** *** 190,194 **** try { ! urls = extractLinks (getURL ()); if (!equivalent (mLinks, urls)) { --- 190,194 ---- try { ! urls = extractLinks (); if (!equivalent (mLinks, urls)) { *************** *** 213,217 **** try { ! mLinks = extractLinks (getURL ()); mPropertySupport.firePropertyChange (PROP_LINKS_PROPERTY, null, mLinks); } --- 213,217 ---- try { ! mLinks = extractLinks (); mPropertySupport.firePropertyChange (PROP_LINKS_PROPERTY, null, mLinks); } Index: BeanyBaby.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/beans/BeanyBaby.java,v retrieving revision 1.22 retrieving revision 1.23 diff -C2 -d -r1.22 -r1.23 *** BeanyBaby.java 13 Feb 2005 20:36:03 -0000 1.22 --- BeanyBaby.java 12 Mar 2005 21:39:45 -0000 1.23 *************** *** 85,89 **** * This method ties the two beans together on the same connection. * Whenever a property changes on one bean, make sure the URL properties ! * agree by setting the connection from one to the other. * @param event The event describing the event source * and the property that has changed. --- 85,89 ---- * This method ties the two beans together on the same connection. * Whenever a property changes on one bean, make sure the URL properties ! * agree by setting the URL from one to the other. * @param event The event describing the event source * and the property that has changed. *************** *** 98,107 **** { if (!mLinkBean.getURL ().equals (mStringBean.getURL ())) ! 
mStringBean.setConnection (mLinkBean.getConnection ()); } else if (source == mStringBean) { if (!mStringBean.getURL ().equals (mLinkBean.getURL ())) ! mLinkBean.setConnection (mStringBean.getConnection ()); // check for menu status changes name = event.getPropertyName (); --- 98,107 ---- { if (!mLinkBean.getURL ().equals (mStringBean.getURL ())) ! mStringBean.setURL (mLinkBean.getURL ()); } else if (source == mStringBean) { if (!mStringBean.getURL ().equals (mLinkBean.getURL ())) ! mLinkBean.setURL (mStringBean.getURL ()); // check for menu status changes name = event.getPropertyName (); *************** *** 369,373 **** BeanyBaby bb = new BeanyBaby (); bb.setVisible (true); ! bb.setURL ("http://www.slashdot.org"); } } --- 369,376 ---- BeanyBaby bb = new BeanyBaby (); bb.setVisible (true); ! if (0 >= args.length) ! bb.setURL ("http://www.slashdot.org"); ! else ! bb.setURL (args[0]); } } |
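The extractLinks change above resets the parser, runs the filter, and on an EncodingChangeException resets and runs it once more. The following is only a minimal, self-contained sketch of that reset-and-retry shape. The names RetryOnEncodingChangeSketch, EncodingChanged, Source and extractWithRetry are hypothetical stand-ins invented for illustration and are not part of the HTML Parser API.

```java
import java.util.List;

/** Sketch of the reset-and-retry pattern; all names here are hypothetical. */
final class RetryOnEncodingChangeSketch {

    /** Stand-in for an "encoding changed mid-stream" signal. */
    static final class EncodingChanged extends Exception {
        EncodingChanged(String message) {
            super(message);
        }
    }

    /** Stand-in for a parser that can be rewound and asked for links. */
    interface Source {
        void reset();
        List<String> extractLinks() throws EncodingChanged;
    }

    /**
     * Rewind the source, then extract. If the encoding changes during the
     * first pass, rewind again and retry exactly once; a second failure
     * propagates to the caller.
     */
    static List<String> extractWithRetry(Source source) throws EncodingChanged {
        source.reset();
        try {
            return source.extractLinks();
        } catch (EncodingChanged first) {
            source.reset();
            return source.extractLinks();
        }
    }
}
```

Retrying once is a reasonable assumption here because the second pass starts with the corrected encoding, so the same exception should not normally recur.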
From: Derrick O. <der...@us...> - 2005-03-12 20:27:57
|
Update of /cvsroot/htmlparser/htmlparser/lib In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv1521/lib Removed Files: commons-logging.jar Log Message: Bug #1018884 'compile' ant task from build.xml messes up ./src directory Added optional "classes" property to build.xml. This directory is where class files are put. It defaults to src. To use: build -Dclasses=classdir <target> where classdir is a peer directory to src. Removed unused commons-logging.jar while I was in there. --- commons-logging.jar DELETED --- |
From: Derrick O. <der...@us...> - 2005-03-12 20:27:56
|
Update of /cvsroot/htmlparser/htmlparser In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv1521 Modified Files: build.xml Log Message: Bug #1018884 'compile' ant task from build.xml messes up ./src directory Added optional "classes" property to build.xml. This directory is where class files are put. It defaults to src. To use: build -Dclasses=classdir <target> where classdir is a peer directory to src. Removed unused commons-logging.jar while I was in there. Index: build.xml =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/build.xml,v retrieving revision 1.75 retrieving revision 1.76 diff -C2 -d -r1.75 -r1.76 *** build.xml 13 Feb 2005 22:45:45 -0000 1.75 --- build.xml 12 Mar 2005 20:27:45 -0000 1.76 *************** *** 113,116 **** --- 113,117 ---- <property name="versionQualifier" value="${versionMajor}_${versionMinor}"/> <property name="src" value="src"/> + <property name="classes" value="${src}"/> <property name="docs" value="docs"/> <property name="wiki" value="${docs}/wiki"/> *************** *** 120,124 **** <property name="distribution" value="distribution"/> <property name="junit.jar" value="${lib}/junit.jar"/> - <property name="commons-logging.jar" value="${lib}/commons-logging.jar"/> <property name="sax2.jar" value="${lib}/sax2.jar"/> --- 121,124 ---- *************** *** 223,231 **** <target name="compile" description="compile all java files"> ! <javac srcdir="${src}" includes="org/htmlparser/**" excludes="org/htmlparser/tests/**" debug="on" classpath="src:${commons-logging.jar}" source="1.3"/> </target> <target name="compilelexer" description="compile lexer java files"> ! <javac srcdir="${src}" debug="on" classpath="src:${commons-logging.jar}" target="1.1" source="1.3"> <include name="org/htmlparser/lexer/*.java"/> <include name="org/htmlparser/nodes/*.java"/> --- 223,233 ---- <target name="compile" description="compile all java files"> ! <mkdir dir="${classes}"/> ! 
<javac srcdir="${src}" destdir="${classes}" includes="org/htmlparser/**" excludes="org/htmlparser/tests/**" debug="on" classpath="${classes}" source="1.3"/> </target> <target name="compilelexer" description="compile lexer java files"> ! <mkdir dir="${classes}"/> ! <javac srcdir="${src}" destdir="${classes}" debug="on" classpath="${classes}" target="1.1" source="1.3"> <include name="org/htmlparser/lexer/*.java"/> <include name="org/htmlparser/nodes/*.java"/> *************** *** 250,254 **** <target name="compileparser" depends="compilelexer" description="compile parser java files"> ! <javac srcdir="${src}" debug="on" classpath="src:${commons-logging.jar}:${sax2.jar}" source="1.3"> <include name="org/htmlparser/**/*.java"/> <exclude name="org/htmlparser/tests/**"/> --- 252,257 ---- <target name="compileparser" depends="compilelexer" description="compile parser java files"> ! <mkdir dir="${classes}"/> ! <javac srcdir="${src}" destdir="${classes}" debug="on" classpath="${classes}:${sax2.jar}" source="1.3"> <include name="org/htmlparser/**/*.java"/> <exclude name="org/htmlparser/tests/**"/> *************** *** 264,268 **** <mkdir dir="${lib}"/> <jar jarfile="${lib}/htmllexer.jar" ! basedir="${src}"> <include name="org/htmlparser/lexer/*.class"/> <include name="org/htmlparser/nodes/*.class"/> --- 267,271 ---- <mkdir dir="${lib}"/> <jar jarfile="${lib}/htmllexer.jar" ! basedir="${classes}"> <include name="org/htmlparser/lexer/*.class"/> <include name="org/htmlparser/nodes/*.class"/> *************** *** 297,304 **** <target name="jarparser" depends="compileparser" description="create htmlparser.jar"> <mkdir dir="${lib}"/> ! <jar jarfile="${lib}/htmlparser.jar" ! basedir="${src}" ! includes="**/*.class **/*.gif" ! excludes="org/htmlparser/tests/**/*.class"> <manifest> <attribute name="Main-Class" value="org.htmlparser.Parser"/> --- 300,311 ---- <target name="jarparser" depends="compileparser" description="create htmlparser.jar"> <mkdir dir="${lib}"/> !
<jar jarfile="${lib}/htmlparser.jar"> ! <fileset ! dir="${classes}" ! includes="**/*.class" ! excludes="org/htmlparser/tests/**/*.class"/> ! <fileset ! dir="${src}" ! includes="**/*.gif"/> <manifest> <attribute name="Main-Class" value="org.htmlparser.Parser"/> *************** *** 326,334 **** <!-- Create the lib directory --> <mkdir dir="${lib}"/> ! <javac compiler="javac1.4" srcdir="${src}" debug="on" classpath="src:${lib}/htmllexer.jar" source="1.3"> <include name="org/htmlparser/lexerapplications/thumbelina/**/*.java"/> </javac> <jar jarfile="${lib}/thumbelina.jar" ! basedir="${src}" defaultexcludes="no" update="true"> --- 333,342 ---- <!-- Create the lib directory --> <mkdir dir="${lib}"/> ! <mkdir dir="${classes}"/> ! <javac compiler="javac1.4" srcdir="${src}" destdir="${classes}" debug="on" classpath="${classes}:${lib}/htmllexer.jar" source="1.3"> <include name="org/htmlparser/lexerapplications/thumbelina/**/*.java"/> </javac> <jar jarfile="${lib}/thumbelina.jar" ! basedir="${classes}" defaultexcludes="no" update="true"> *************** *** 344,356 **** <!-- Create the lib directory --> <mkdir dir="${lib}"/> ! <javac compiler="javac1.4" srcdir="${src}" debug="on" classpath="src:${lib}/htmlparser.jar" source="1.3"> <include name="org/htmlparser/parserapplications/filterbuilder/**/*.java"/> </javac> <jar jarfile="${lib}/filterbuilder.jar" ! basedir="${src}" ! defaultexcludes="no" ! update="true"> ! <include name="org/htmlparser/parserapplications/filterbuilder/**/*.class"/> ! <include name="org/htmlparser/parserapplications/filterbuilder/**/*.gif"/> <manifest> <attribute name="Main-Class" value="org.htmlparser.parserapplications.filterbuilder.FilterBuilder"/> --- 352,371 ---- <!-- Create the lib directory --> <mkdir dir="${lib}"/> ! <mkdir dir="${classes}"/> ! 
<javac compiler="javac1.4" srcdir="${src}" destdir="${classes}" debug="on" classpath="${classes}:${lib}/htmlparser.jar" source="1.3"> <include name="org/htmlparser/parserapplications/filterbuilder/**/*.java"/> </javac> <jar jarfile="${lib}/filterbuilder.jar" ! update="true"> ! <fileset ! dir="${classes}" ! defaultexcludes="no"> ! <include name="org/htmlparser/parserapplications/filterbuilder/**/*.class"/> ! </fileset> ! <fileset ! dir="${src}" ! defaultexcludes="no"> ! <include name="org/htmlparser/parserapplications/filterbuilder/**/*.gif"/> ! </fileset> <manifest> <attribute name="Main-Class" value="org.htmlparser.parserapplications.filterbuilder.FilterBuilder"/> *************** *** 361,369 **** <!-- Run the unit tests --> <target name="test" depends="jar" description="run the JUnit tests"> ! <javac srcdir="${src}" includes="org/htmlparser/tests/**" debug="on" source="1.3"> <classpath> ! <pathelement location="src"/> <pathelement location="${junit.jar}"/> - <pathelement location="${commons-logging.jar}"/> <pathelement location="${sax2.jar}"/> <pathelement location="${java.home}/../lib/tools.jar"/> --- 376,384 ---- <!-- Run the unit tests --> <target name="test" depends="jar" description="run the JUnit tests"> ! <mkdir dir="${classes}"/> ! <javac srcdir="${src}" destdir="${classes}" includes="org/htmlparser/tests/**" debug="on" source="1.3"> <classpath> ! <pathelement location="${classes}"/> <pathelement location="${junit.jar}"/> <pathelement location="${sax2.jar}"/> <pathelement location="${java.home}/../lib/tools.jar"/> *************** *** 373,379 **** <classpath> <pathelement location="${lib}/htmlparser.jar"/> ! <pathelement location="${src}"/> <pathelement location="${junit.jar}"/> - <pathelement location="${commons-logging.jar}"/> <pathelement location="${sax2.jar}"/> <pathelement location="${java.home}/../lib/tools.jar"/> --- 388,393 ---- <classpath> <pathelement location="${lib}/htmlparser.jar"/> ! 
<pathelement location="${classes}"/> <pathelement location="${junit.jar}"/> <pathelement location="${sax2.jar}"/> <pathelement location="${java.home}/../lib/tools.jar"/> *************** *** 411,415 **** <!-- Create the javadoc for the project --> <target name="javadoc" depends="JDK1.4,JDK_Warning,init" description="create JavaDoc (API) documentation"> ! <javac srcdir="${resources}" includes="HtmlTaglet.java" classpath="${src}"/> <mkdir dir="${docs}/javadoc"/> <property name="javadoc.doctitle" value="HTML Parser ${versionNumber}"/> --- 425,430 ---- <!-- Create the javadoc for the project --> <target name="javadoc" depends="JDK1.4,JDK_Warning,init" description="create JavaDoc (API) documentation"> ! <mkdir dir="${classes}"/> ! <javac srcdir="${resources}" includes="HtmlTaglet.java" classpath="${classes}"/> <mkdir dir="${docs}/javadoc"/> <property name="javadoc.doctitle" value="HTML Parser ${versionNumber}"/> *************** *** 422,426 **** <javadoc packagenames="org.htmlparser.*" sourcepath="${src}" ! classpath="src:${commons-logging.jar}" defaultexcludes="yes" excludepackagenames="org.htmlparser.tests.*" --- 437,441 ---- <javadoc packagenames="org.htmlparser.*" sourcepath="${src}" ! classpath="${classes}" defaultexcludes="yes" excludepackagenames="org.htmlparser.tests.*" *************** *** 436,442 **** <bottom>${javadoc.bottom}</bottom> <footer>${javadoc.footer}</footer> ! <taglet name="HtmlTaglet" path="${resources}:${src}"/> <group title="Main Package" packages="org.htmlparser"/> ! <group title="Example Applications" packages="org.htmlparser.parserapplications,org.htmlparser.lexerapplications.tabby,org.htmlparser.lexerapplications.thumbelina"/> <group title="Tags" packages="org.htmlparser.tags,org.htmlparser.tags.data"/> <group title="Lexer" packages="org.htmlparser.lexer"/> --- 451,457 ---- <bottom>${javadoc.bottom}</bottom> <footer>${javadoc.footer}</footer> ! 
<taglet name="HtmlTaglet" path="${resources}:${classes}"/> <group title="Main Package" packages="org.htmlparser"/> ! <group title="Example Applications" packages="org.htmlparser.parserapplications,org.htmlparser.lexerapplications.tabby,org.htmlparser.lexerapplications.thumbelina,org.htmlparser.parserapplications.filterbuilder"/> <group title="Tags" packages="org.htmlparser.tags,org.htmlparser.tags.data"/> <group title="Lexer" packages="org.htmlparser.lexer"/> *************** *** 489,493 **** <delete> <fileset dir="." includes="src.zip"/> ! <fileset dir="${src}" includes="**/*.class"/> </delete> <delete dir="${docs}/javadoc/"/> --- 504,508 ---- <delete> <fileset dir="." includes="src.zip"/> ! <fileset dir="${classes}" includes="**/*.class"/> </delete> <delete dir="${docs}/javadoc/"/> |
From: Derrick O. <der...@us...> - 2005-03-12 17:53:21
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25217/tests/scannersTests Modified Files: ScriptScannerTest.java Log Message: Add STRICT flag to ScriptScanner to revert to legacy handling of broken ETAGO (</). If STRICT is true, scan according to HTML specification, else if false, scan with quote smart state machine which heuristically yields the correct parse. Index: ScriptScannerTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/scannersTests/ScriptScannerTest.java,v retrieving revision 1.58 retrieving revision 1.59 diff -C2 -d -r1.58 -r1.59 *** ScriptScannerTest.java 7 Mar 2005 02:18:47 -0000 1.58 --- ScriptScannerTest.java 12 Mar 2005 17:53:11 -0000 1.59 *************** *** 183,195 **** * string parser was not moving to the ignore state on encountering double * quotes (only single quotes were previously accepted). - * - * <pre> - * Bug #1104627 Parser Crash reading javascript - * Bug #1024045 StringBean crashes on an URL - * Bug #1021925 StyleTag with missing linefeed prevents page from parsing - * </pre> - * Altered test to correctly escape the ETAGO. - * See http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data - * * @throws Exception */ --- 183,186 ---- *************** *** 212,216 **** "document.write(\"}\"); " + "// parser thinks this is the end tag.\n" + ! "document.write(\"<\\/script>\");" + "</script>" + "<body>" + --- 203,207 ---- "document.write(\"}\"); " + "// parser thinks this is the end tag.\n" + ! "document.write(\"</script>\");" + "</script>" + "<body>" + *************** *** 235,239 **** "document.write(\"}\"); " + "// parser thinks this is the end tag.\n" + ! "document.write(\"<\\/script>\");", scriptTag.getScriptCode() ); --- 226,230 ---- "document.write(\"}\"); " + "// parser thinks this is the end tag.\n" + ! 
"document.write(\"</script>\");", scriptTag.getScriptCode() ); *************** *** 241,260 **** } - /** - * - * <pre> - * Bug #1104627 Parser Crash reading javascript - * Bug #1024045 StringBean crashes on an URL - * Bug #1021925 StyleTag with missing linefeed prevents page from parsing - * </pre> - * Altered test to correctly escape the ETAGO. - * See http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data - * - */ public void testScriptCodeExtraction() throws ParserException { createParser( "<SCRIPT language=JavaScript>" + "document.write(\"<a href=\"1.htm\"><img src=\"1.jpg\" " + ! "width=\"80\" height=\"20\" border=\"0\"><\\/a>\");" + "</SCRIPT>" ); --- 232,240 ---- } public void testScriptCodeExtraction() throws ParserException { createParser( "<SCRIPT language=JavaScript>" + "document.write(\"<a href=\"1.htm\"><img src=\"1.jpg\" " + ! "width=\"80\" height=\"20\" border=\"0\"></a>\");" + "</SCRIPT>" ); *************** *** 265,289 **** "script code", "document.write(\"<a href=\"1.htm\"><img src=\"1.jpg\" " + ! "width=\"80\" height=\"20\" border=\"0\"><\\/a>\");", scriptTag.getScriptCode() ); } - /** - * - * <pre> - * Bug #1104627 Parser Crash reading javascript - * Bug #1024045 StringBean crashes on an URL - * Bug #1021925 StyleTag with missing linefeed prevents page from parsing - * </pre> - * Altered test to correctly escape the ETAGO. - * See http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data - * - */ public void testScriptCodeExtractionWithMultipleQuotes() throws ParserException { createParser( "<SCRIPT language=JavaScript>" + "document.write(\"<a href=\\\"1.htm\\\"><img src=\\\"1.jpg\\\" " + ! "width=\\\"80\\\" height=\\\"20\\\" border=\\\"0\\\"><\\/a>\");" + "</SCRIPT>" ); --- 245,258 ---- "script code", "document.write(\"<a href=\"1.htm\"><img src=\"1.jpg\" " + ! 
"width=\"80\" height=\"20\" border=\"0\"></a>\");", scriptTag.getScriptCode() ); } public void testScriptCodeExtractionWithMultipleQuotes() throws ParserException { createParser( "<SCRIPT language=JavaScript>" + "document.write(\"<a href=\\\"1.htm\\\"><img src=\\\"1.jpg\\\" " + ! "width=\\\"80\\\" height=\\\"20\\\" border=\\\"0\\\"></a>\");" + "</SCRIPT>" ); *************** *** 294,313 **** "script code", "document.write(\"<a href=\\\"1.htm\\\"><img src=\\\"1.jpg\\\" " + ! "width=\\\"80\\\" height=\\\"20\\\" border=\\\"0\\\"><\\/a>\");", scriptTag.getScriptCode() ); } - /** - * - * <pre> - * Bug #1104627 Parser Crash reading javascript - * Bug #1024045 StringBean crashes on an URL - * Bug #1021925 StyleTag with missing linefeed prevents page from parsing - * </pre> - * Altered test to correctly escape the ETAGO. - * See http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data - * - */ public void testScriptWithinComments() throws Exception { createParser( --- 263,271 ---- "script code", "document.write(\"<a href=\\\"1.htm\\\"><img src=\\\"1.jpg\\\" " + ! "width=\\\"80\\\" height=\\\"20\\\" border=\\\"0\\\"></a>\");", scriptTag.getScriptCode() ); } public void testScriptWithinComments() throws Exception { createParser( *************** *** 350,354 **** "else{" + "\n" + ! "menuobj.document.write('<layer name=gui bgColor=#E6E6E6 width=165 onmouseover=\"clearhidemenu()\" onmouseout=\"hidemenu()\">'+which+'<\\/layer>')" + "\n" + "menuobj.document.close()" + --- 308,312 ---- "else{" + "\n" + ! 
"menuobj.document.write('<layer name=gui bgColor=#E6E6E6 width=165 onmouseover=\"clearhidemenu()\" onmouseout=\"hidemenu()\">'+which+'</layer>')" + "\n" + "menuobj.document.close()" + *************** *** 558,574 **** /** * See bug #741769 ScriptScanner doesn't handle quoted </script> tags - * - * <pre> - * Bug #1104627 Parser Crash reading javascript - * Bug #1024045 StringBean crashes on an URL - * Bug #1021925 StyleTag with missing linefeed prevents page from parsing - * </pre> - * Altered test to correctly escape the ETAGO. - * See http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data - * */ public void testScanQuotedEndTag() throws ParserException { ! String html = "<SCRIPT language=\"JavaScript\">document.write('<\\/SCRIPT>');</SCRIPT>"; createParser(html); parseAndAssertNodeCount(1); --- 516,523 ---- /** * See bug #741769 ScriptScanner doesn't handle quoted </script> tags */ public void testScanQuotedEndTag() throws ParserException { ! String html = "<SCRIPT language=\"JavaScript\">document.write('</SCRIPT>');</SCRIPT>"; createParser(html); parseAndAssertNodeCount(1); |
From: Derrick O. <der...@us...> - 2005-03-12 17:53:20
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25217/scanners Modified Files: ScriptScanner.java Log Message: Add STRICT flag to ScriptScanner to revert to legacy handling of broken ETAGO (</). If STRICT is true, scan according to HTML specification, else if false, scan with quote smart state machine which heuristically yields the correct parse. Index: ScriptScanner.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/scanners/ScriptScanner.java,v retrieving revision 1.62 retrieving revision 1.63 diff -C2 -d -r1.62 -r1.63 *** ScriptScanner.java 7 Mar 2005 02:18:46 -0000 1.62 --- ScriptScanner.java 12 Mar 2005 17:53:10 -0000 1.63 *************** *** 52,55 **** --- 52,80 ---- { /** + * Strict parsing of CDATA flag. + * If this flag is set true, the parsing of script is performed without + * regard to quotes. This means that erroneous script such as: + * <pre> + * document.write("</script>"); + * </pre> + * will be parsed in strict accordance with appendix + * <a href="http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data"> + * B.3.2 Specifying non-HTML data</a> of the + * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a> and + * hence will be split into two or more nodes. Correct javascript would + * escape the ETAGO: + * <pre> + * document.write("<\/script>"); + * </pre> + * If true, CDATA parsing will stop at the first ETAGO ("</") no matter + * whether it is quoted or not. If false, balanced quotes (either single or + * double) will shield an ETAGO. Because of the possibility of quotes within + * single or multiline comments, these are also parsed. In most cases, + * users prefer non-strict handling since there is so much broken script + * out in the wild. + */ + public static boolean STRICT = false; + + /** * Create a script scanner. */ *************** *** 87,91 **** } } !
content = lexer.parseCDATA (); position = lexer.getPosition (); node = lexer.nextNode (false); --- 112,116 ---- } } ! content = lexer.parseCDATA (!STRICT); position = lexer.getPosition (); node = lexer.nextNode (false); |
From: Derrick O. <der...@us...> - 2005-03-12 17:53:19
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25217/lexer Modified Files: Lexer.java Log Message: Add STRICT flag to ScriptScanner to revert to legacy handling of broken ETAGO (</). If STRICT is true, scan according to HTML specification, else if false, scan with quote smart state machine which heuristically yields the correct parse. Index: Lexer.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Lexer.java,v retrieving revision 1.35 retrieving revision 1.36 diff -C2 -d -r1.35 -r1.36 *** Lexer.java 7 Mar 2005 02:18:37 -0000 1.35 --- Lexer.java 12 Mar 2005 17:53:08 -0000 1.36 *************** *** 1058,1062 **** * According to appendix <a href="http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data"> * B.3.2 Specifying non-HTML data</a> of the ! * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a>: * <quote> * <b>Element content</b><br> --- 1058,1062 ---- * According to appendix <a href="http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data"> * B.3.2 Specifying non-HTML data</a> of the ! * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a>:<br> * <quote> * <b>Element content</b><br> *************** *** 1074,1080 **** --- 1074,1098 ---- ParserException { + return (parseCDATA (false)); + } + + /** + * Return CDATA as a text node. + * Slightly less rigid than {@link #parseCDATA()} this method provides for + * parsing CDATA that may contain quoted strings that have embedded + * ETAGO ("</") delimiters and skips single and multiline comments. + * @param quotesmart If <code>true</code> the strict definition of CDATA is + * extended to allow for single or double quoted ETAGO ("</") sequences. + * @return The <code>TextNode</code> of the CDATA or <code>null</code> if none. 
+ * @see #parseCDATA() + */ + public Node parseCDATA (boolean quotesmart) + throws + ParserException + { int start; int state; boolean done; + char quote; char ch; int end; *************** *** 1083,1086 **** --- 1101,1105 ---- state = 0; done = false; + quote = 0; while (!done) { *************** *** 1094,1099 **** done = true; break; case '<': ! state = 1; break; default: --- 1113,1180 ---- done = true; break; + case '\'': + if (quotesmart) + if (0 == quote) + quote = '\''; // enter quoted state + else if ('\'' == quote) + quote = 0; // exit quoted state + break; + case '"': + if (quotesmart) + if (0 == quote) + quote = '"'; // enter quoted state + else if ('"' == quote) + quote = 0; // exit quoted state + break; + case '\\': + if (quotesmart) + if (0 != quote) + { + ch = mPage.getCharacter (mCursor); // try to consume escaped character + if (0 == ch) + mCursor.retreat (); + else if ( (ch != '\\') && (ch != quote)) + mCursor.retreat (); // unconsume char if character was not an escapable char. + } + break; + case '/': + if (quotesmart) + if (0 == quote) + { + // handle multiline and double slash comments (with a quote) + ch = mPage.getCharacter (mCursor); + if (0 == ch) + mCursor.retreat (); + else if ('/' == ch) + { + do + ch = mPage.getCharacter (mCursor); + while ((ch != 0) && (ch != '\n')); + } + else if ('*' == ch) + { + do + { + do + ch = mPage.getCharacter (mCursor); + while ((ch != 0) && (ch != '*')); + ch = mPage.getCharacter (mCursor); + if (ch == '*') + mCursor.retreat (); + } + while ((ch != 0) && (ch != '/')); + } + else + mCursor.retreat (); + } + break; case '<': ! if (quotesmart) ! { ! if (0 == quote) ! state = 1; ! } ! else ! state = 1; break; default: |
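The quote-smart scan in the Lexer diff above can be hard to follow in flattened diff form. The following is a simplified, self-contained sketch of the same idea working on a plain String rather than on the Lexer's page and cursor. EtagoScanSketch and findEtago are hypothetical names, and the sketch assumes that any "</" ends the CDATA, whereas the real state machine continues with further states that are not shown in this diff.

```java
/** Simplified sketch of strict versus quote-smart ETAGO detection. */
final class EtagoScanSketch {

    /**
     * Return the index of the first "</" that terminates the script text,
     * or -1 if there is none. With quoteSmart false every "</" counts;
     * with quoteSmart true a "</" inside balanced single or double quotes
     * is shielded, and comments are skipped so that quote characters
     * inside them do not open a string.
     */
    static int findEtago(String text, boolean quoteSmart) {
        char quote = 0; // 0 means "not inside a quoted string"
        int n = text.length();
        int i = 0;
        while (i < n) {
            char ch = text.charAt(i);
            if (quoteSmart && (ch == '\'' || ch == '"')) {
                if (quote == 0) {
                    quote = ch;      // enter quoted state
                } else if (quote == ch) {
                    quote = 0;       // exit quoted state
                }
            } else if (quoteSmart && ch == '\\' && quote != 0) {
                // consume an escaped backslash or an escaped closing quote
                if (i + 1 < n) {
                    char next = text.charAt(i + 1);
                    if (next == '\\' || next == quote) {
                        i++;
                    }
                }
            } else if (quoteSmart && ch == '/' && quote == 0 && i + 1 < n) {
                char next = text.charAt(i + 1);
                if (next == '/') {
                    // double slash comment: skip to the end of the line
                    int newline = text.indexOf('\n', i + 2);
                    i = (newline < 0) ? n : newline;
                } else if (next == '*') {
                    // multiline comment: skip to the closing "*/"
                    int close = text.indexOf("*/", i + 2);
                    i = (close < 0) ? n : close + 1;
                }
            } else if (ch == '<' && quote == 0) {
                if (i + 1 < n && text.charAt(i + 1) == '/') {
                    return i;
                }
            }
            i++;
        }
        return -1;
    }
}
```

For the erroneous script quoted in the ScriptScanner comment, document.write("</script>"); followed by the real end tag, a strict scan stops at the quoted ETAGO while the quote-smart scan skips it and stops at the real one.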
From: Derrick O. <der...@us...> - 2005-03-12 13:39:56
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv21713/util Modified Files: NodeList.java Log Message: RFE #1160345 NodeList.visitAllNodesWith Added visitAllNodesWith to the NodeList class. Index: NodeList.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/util/NodeList.java,v retrieving revision 1.57 retrieving revision 1.58 diff -C2 -d -r1.57 -r1.58 *** NodeList.java 13 Feb 2005 20:36:03 -0000 1.57 --- NodeList.java 12 Mar 2005 13:39:47 -0000 1.58 *************** *** 33,36 **** --- 33,37 ---- import org.htmlparser.NodeFilter; import org.htmlparser.filters.NodeClassFilter; + import org.htmlparser.visitors.NodeVisitor; public class NodeList implements Serializable { *************** *** 301,303 **** --- 302,327 ---- return (extractAllNodesThatMatch (new NodeClassFilter (classType), recursive)); } + + /** + * Utility to apply a visitor to a node list. + * Provides for a visitor to modify the contents of a page and get the + * modified HTML as a string with code like this: + * <pre> + * Parser parser = new Parser ("http://whatever"); + * NodeList list = parser.parse (null); // no filter + * list.visitAllNodesWith (visitor); + * System.out.println (list.toHtml ()); + * </pre> + */ + public void visitAllNodesWith (NodeVisitor visitor) + throws + ParserException + { + Node node; + + visitor.beginParsing (); + for (int i = 0; i < size; i++) + nodeData[i].accept (visitor); + visitor.finishedParsing (); + } } |
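The new NodeList.visitAllNodesWith brackets a pass over the list with beginParsing and finishedParsing and lets each node accept the visitor. The following is only a minimal, self-contained sketch of that call order. SketchVisitor, SketchNode and SketchNodeList are hypothetical stand-ins for NodeVisitor, Node and NodeList, reduced to the few members needed here.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a visitor with begin and finish hooks. */
interface SketchVisitor {
    void beginParsing();
    void visit(SketchNode node);
    void finishedParsing();
}

/** Stand-in for a node whose text a visitor may modify in place. */
final class SketchNode {
    String text;

    SketchNode(String text) {
        this.text = text;
    }

    void accept(SketchVisitor visitor) {
        visitor.visit(this);
    }
}

/** Stand-in for a node list that can apply a visitor to its contents. */
final class SketchNodeList {
    private final List<SketchNode> nodes = new ArrayList<SketchNode>();

    void add(SketchNode node) {
        nodes.add(node);
    }

    /** Call beginParsing, visit every node in order, then finishedParsing. */
    void visitAllNodesWith(SketchVisitor visitor) {
        visitor.beginParsing();
        for (SketchNode node : nodes) {
            node.accept(visitor);
        }
        visitor.finishedParsing();
    }

    /** Concatenate the (possibly modified) node text. */
    String toHtml() {
        StringBuilder html = new StringBuilder();
        for (SketchNode node : nodes) {
            html.append(node.text);
        }
        return html.toString();
    }
}
```

Because the visitor mutates the nodes held by the list, calling toHtml afterwards returns the modified markup, which is the use case named in the log message.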
From: Derrick O. <der...@us...> - 2005-03-12 13:39:55
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/visitorsTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv21713/tests/visitorsTests Modified Files: UrlModifyingVisitorTest.java Log Message: RFE #1160345 NodeList.visitAllNodesWith Added visitAllNodesWith to the NodeList class. Index: UrlModifyingVisitorTest.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/visitorsTests/UrlModifyingVisitorTest.java,v retrieving revision 1.18 retrieving revision 1.19 diff -C2 -d -r1.18 -r1.19 *** UrlModifyingVisitorTest.java 31 Jul 2004 16:42:33 -0000 1.18 --- UrlModifyingVisitorTest.java 12 Mar 2005 13:39:46 -0000 1.19 *************** *** 28,32 **** --- 28,37 ---- import org.htmlparser.Parser; + import org.htmlparser.Tag; + import org.htmlparser.tags.ImageTag; + import org.htmlparser.tags.LinkTag; import org.htmlparser.tests.ParserTestCase; + import org.htmlparser.util.NodeList; + import org.htmlparser.visitors.NodeVisitor; import org.htmlparser.visitors.UrlModifyingVisitor; *************** *** 66,68 **** --- 71,101 ---- result); } + + /** + * Test a better method of modifying an HTML page. + */ + public void testPageModification () + throws + Exception + { + Parser parser = Parser.createParser (HTML_WITH_LINK, null); + NodeList list = parser.parse (null); // no filter + // make an inner class that does the same thing as the UrlModifyingVisitor + NodeVisitor visitor = new NodeVisitor () + { + String linkPrefix = "localhost://"; + public void visitTag (Tag tag) + { + if (tag instanceof LinkTag) + ((LinkTag)tag).setLink(linkPrefix + ((LinkTag)tag).getLink()); + else if (tag instanceof ImageTag) + ((ImageTag)tag).setImageURL(linkPrefix + ((ImageTag)tag).getImageURL()); + } + }; + list.visitAllNodesWith (visitor); + String result = list.toHtml (); + assertStringEquals("Expected HTML", + MODIFIED_HTML, + result); + } } |
From: Derrick O. <der...@us...> - 2005-03-12 12:52:35
|
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9677/tests/utilTests Modified Files: AllTests.java Added Files: NonEnglishTest.java Log Message: Bug #1161137 Non English Character web page Reinitialize the string buffer after encoding change exception processing. --- NEW FILE: NonEnglishTest.java --- // HTMLParser Library $Name: $ - A java-based parser for HTML // http://sourceforge.org/projects/htmlparser // Copyright (C) 2004 Somik Raha // // Revision Control Information // // $Source: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests/NonEnglishTest.java,v $ // $Author: derrickoswald $ // $Date: 2005/03/12 12:52:20 $ // $Revision: 1.1 $ // // This library is free software; you can redistribute it and/or // modify it under the terms of the GNU Lesser General Public // License as published by the Free Software Foundation; either // version 2.1 of the License, or (at your option) any later version. // // This library is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU // Lesser General Public License for more details. // // You should have received a copy of the GNU Lesser General Public // License along with this library; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // package org.htmlparser.tests.utilTests; import org.htmlparser.beans.StringBean; import org.htmlparser.tests.ParserTestCase; import org.htmlparser.util.ParserException; /** * Test case for bug #1161137 Non English Character web page. * Submitted by Michael (til...@us...) 
*/ public class NonEnglishTest extends ParserTestCase { static { System.setProperty ("org.htmlparser.tests.utilTests.NonEnglishTest", "NonEnglishTest"); } public NonEnglishTest (String name) { super(name); } public void testNonEnglishCharacters() throws ParserException { StringBean sb; sb = new StringBean (); sb.setURL ("http://www.kobe-np.co.jp/"); sb.getStrings (); sb.setURL ("http://book.asahi.com/"); // this used to throw an exception sb.getStrings (); } } Index: AllTests.java =================================================================== RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/tests/utilTests/AllTests.java,v retrieving revision 1.56 retrieving revision 1.57 diff -C2 -d -r1.56 -r1.57 *** AllTests.java 18 Jul 2004 21:31:21 -0000 1.56 --- AllTests.java 12 Mar 2005 12:52:20 -0000 1.57 *************** *** 64,67 **** --- 64,68 ---- suite.addTestSuite(HTMLParserUtilsTest.class); suite.addTestSuite(NodeListTest.class); + suite.addTestSuite(NonEnglishTest.class); suite.addTestSuite(SortTest.class); |