[Htmlparser-cvs] htmlparser/src/org/htmlparser/lexer Cursor.java,1.18,1.19 InputStreamSource.java,1.
From: Derrick O. <der...@us...> - 2005-04-12 11:28:08
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23432/htmlparser/src/org/htmlparser/lexer

Modified Files:
    Cursor.java InputStreamSource.java Lexer.java Page.java
    PageAttribute.java Source.java Stream.java StringSource.java
    package.html
Log Message:
Documentation revamp part two.

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/package.html,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** package.html 2 Jan 2004 16:24:53 -0000 1.12
--- package.html 12 Apr 2005 11:27:41 -0000 1.13
***************
*** 84,97 ****
  <p>
  The following are some design goals and 'invariants' within the package, if you
! are attempting to understand or modify it. Things that differ substantially from
! previous implementations are highlighted in <B>bold</B>.
  <DL>
  <DT>Contiguous Nodes
! <DD><B>Adjacent nodes have no characters between them.</B> The list of nodes forms
  an uninterrupted chain that, by start and end definitions,
  completely covers the characters that were read from the HTML source.
  <DT>Text Fidelity
! <DD>Besides complete coverage, the <B>nodes do not initially contain copies of
! the text</B>, but instead simply contain offsets into a single
  large buffer that contains the text read from the HTML source. Even within tags, the
  attributes list can contain whitespace, thus there is no lost whitespace or
--- 84,96 ----
  <p>
  The following are some design goals and 'invariants' within the package, if you
! are attempting to understand or modify it.
  <DL>
  <DT>Contiguous Nodes
! <DD>Adjacent nodes have no characters between them. The list of nodes forms
  an uninterrupted chain that, by start and end definitions,
  completely covers the characters that were read from the HTML source.
  <DT>Text Fidelity
! <DD>Besides complete coverage, the nodes do not initially contain copies of
! the text, but instead simply contain offsets into a single
  large buffer that contains the text read from the HTML source. Even within tags, the
  attributes list can contain whitespace, thus there is no lost whitespace or
***************
*** 99,129 ****
  preserved.
  <DT>Line Endings
! <DD><B>End of line characters are just whitespace.</B> There is no distinction made
  between end of line characters (or pairs of characters on Windows) and other whitespace.
  The text is not read in line by line so nodes (tags) can easily span multiple
  lines with no special processing. Line endings are not transformed between
  platforms, i.e. Unix line endings are not converted to Windows line
! endings by this level. Each node will has a starting and ending location,
  which the page can use to extract the text. To facilitate formatting error and log
  messages the page can turn these offsets into row and column numbers. In general
  ignore line breaks in the source if at all possible.
! <DT>One Parser, One Scan
! <DD>The Lexer has the following state machines corresponding
! (roughly) to the <B>four parsers it replaces</B> (StringParser, RemarkNodeParser,
! TagParser & AttributeParser):
  <LI>in text - parseString()</LI>
  <LI>in comment - parseRemark()</LI>
  <LI>in tag - parseTag()</LI>
! By integrating the four state machines into one, a single pass over the text is
! all that's needed for a low level parse of the HTML source. In previous
! implementations, the attributes were parsed on a second scan after the initial
! tag was extracted. (Actually, for error conditions, the lexer can back up a
! node to handle missing end tags etc.).
  <DT>Two Jars
  <DD>For elementary operations at the node level, a minimalist jar file containing
! <B>only the lexer and base tag classes</B> is split out from the larger <CODE>htmlparser.jar</CODE>.
  In this way, simple parsing and output is handled with a jar file that is under 45 kilobytes,
  but anything beyond peephole manipulation, i.e. closing tag detection
! and other semantic reasoning will need the full set of scanners, nodes and
  ancillary classes, which now stands at 210 kilobytes.
  </DL>
--- 98,126 ----
  preserved.
  <DT>Line Endings
! <DD>End of line characters are just whitespace. There is no distinction made
  between end of line characters (or pairs of characters on Windows) and other whitespace.
  The text is not read in line by line so nodes (tags) can easily span multiple
  lines with no special processing. Line endings are not transformed between
  platforms, i.e. Unix line endings are not converted to Windows line
! endings by this level. Each node has a starting and ending location,
  which the page can use to extract the text. To facilitate formatting error and log
  messages the page can turn these offsets into row and column numbers. In general
  ignore line breaks in the source if at all possible.
! <DT>State Machines
! <DD>The Lexer has the following state machines:
! <UL>
  <LI>in text - parseString()</LI>
  <LI>in comment - parseRemark()</LI>
  <LI>in tag - parseTag()</LI>
! <LI>in JSP tag - parseJsp()</LI>
! </UL>
! There is another state machine -- parseCDATA -- used by higher level code
! (script and style scanners), but this isn't actually used by the lexer.
  <DT>Two Jars
  <DD>For elementary operations at the node level, a minimalist jar file containing
! only the lexer and base tag classes is split out from the larger <CODE>htmlparser.jar</CODE>.
  In this way, simple parsing and output is handled with a jar file that is under 45 kilobytes,
  but anything beyond peephole manipulation, i.e. closing tag detection
! and other semantic reasoning, will need the full set of scanners, nodes and
  ancillary classes, which now stands at 210 kilobytes.
  </DL>

Index: StringSource.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/StringSource.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** StringSource.java 13 Feb 2005 22:45:47 -0000 1.2
--- StringSource.java 12 Apr 2005 11:27:41 -0000 1.3
***************
*** 111,114 ****
--- 111,115 ----
       * Does nothing.
       * It's supposed to close the source, but use destroy() instead.
+      * @exception IOException <em>not used</em>
       * @see #destroy
       */
***************
*** 206,209 ****
--- 207,212 ----
       */
      public void reset ()
+         throws
+             IllegalStateException
      {
          if (null == mString)
***************
*** 248,252 ****
       * @exception IOException If the source is closed.
       */
!     public long skip (long n) throws IOException
      {
          int length;
--- 251,258 ----
       * @exception IOException If the source is closed.
       */
!     public long skip (long n)
!         throws
!             IOException,
!             IllegalArgumentException
      {
          int length;
***************
*** 255,259 ****
          if (null == mString)
              throw new IOException ("source is closed");
!         if (n < 0)
              throw new IllegalArgumentException ("cannot skip backwards");
          else
--- 261,265 ----
          if (null == mString)
              throw new IOException ("source is closed");
!         if (0 > n)
              throw new IllegalArgumentException ("cannot skip backwards");
          else

Index: Stream.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Stream.java,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** Stream.java 2 Jan 2004 16:24:53 -0000 1.12
--- Stream.java 12 Apr 2005 11:27:41 -0000 1.13
***************
*** 37,42 ****
--- 37,56 ----
  public class Stream extends InputStream implements Runnable
  {
+     /**
+      * The number of calls to fill.
+      * Note: to be removed.
+      */
      public int fills = 0;
+
+     /**
+      * The number of reallocations.
+      * Note: to be removed.
+      */
      public int reallocations = 0;
+
+     /**
+      * The number of synchronous (blocking) fills.
+      * Note: to be removed.
+      */
      public int synchronous = 0;

Index: Lexer.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Lexer.java,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** Lexer.java 13 Mar 2005 14:51:43 -0000 1.37
--- Lexer.java 12 Apr 2005 11:27:41 -0000 1.38
***************
*** 47,53 ****
--- 47,55 ----
   * This class parses the HTML stream into nodes.
   * There are three major types of nodes (lexemes):
+  * <ul>
   * <li>Remark</li>
   * <li>Text</li>
   * <li>Tag</li>
+  * </ul>
   * Each time <code>nextNode()</code> is called, another node is returned until
   * the stream is exhausted, and <code>null</code> is returned.
***************
*** 113,118 ****
       * Creates a new instance of a Lexer.
       * @param connection The url to parse.
       */
!     public Lexer (URLConnection connection) throws ParserException
      {
          this (new Page (connection));
--- 115,123 ----
       * Creates a new instance of a Lexer.
       * @param connection The url to parse.
+      * @exception ParserException If an error occurs opening the connection.
       */
!     public Lexer (URLConnection connection)
!         throws
!             ParserException
      {
          this (new Page (connection));
***************
*** 192,195 ****
--- 197,204 ----
      }

+     /**
+      * Get the current cursor position.
+      * @return The current character offset into the source.
+      */
      public int getPosition ()
      {
***************
*** 197,200 ****
--- 206,213 ----
      }

+     /**
+      * Set the current cursor position.
+      * @param position The new character offset into the source.
+      */
      public void setPosition (int position)
      {
***************
*** 315,318 ****
--- 328,332 ----
       * Advance the cursor through a JIS escape sequence.
       * @param cursor A cursor positioned within the escape sequence.
+      * @exception ParserException If a problem occurs reading from the source.
       */
      protected void scanJIS (Cursor cursor)
***************
*** 363,366 ****
--- 377,382 ----
       * @param start The position at which to start scanning.
       * @param quotesmart If <code>true</code>, strings ignore quoted contents.
+      * @return The parsed node.
+      * @exception ParserException If a problem occurs reading from the source.
       */
      protected Node parseString (int start, boolean quotesmart)
***************
*** 468,471 ****
--- 484,491 ----
      /**
       * Create a string node based on the current cursor and the one provided.
+      * @param start The starting point of the node.
+      * @param end The ending point of the node.
+      * @exception ParserException If the nodefactory creation of the string node fails.
+      * @return The new Text node.
       */
      protected Node makeString (int start, int end)
***************
*** 578,581 ****
--- 598,602 ----
       * @param start The position at which to start scanning.
       * @return The parsed tag.
+      * @exception ParserException If a problem occurs reading from the source.
       */
      protected Node parseTag (int start)
***************
*** 750,753 ****
--- 771,779 ----
      /**
       * Create a tag node based on the current cursor and the one provided.
+      * @param start The starting point of the node.
+      * @param end The ending point of the node.
+      * @param attributes The attributes parsed from the tag.
+      * @exception ParserException If the nodefactory creation of the tag node fails.
+      * @return The new Tag node.
       */
      protected Node makeTag (int start, int end, Vector attributes)
***************
*** 811,814 ****
--- 837,842 ----
       * @param start The position at which to start scanning.
       * @param quotesmart If <code>true</code>, strings ignore quoted contents.
+      * @return The parsed node.
+      * @exception ParserException If a problem occurs reading from the source.
       */
      protected Node parseRemark (int start, boolean quotesmart)
***************
*** 888,891 ****
--- 916,923 ----
      /**
       * Create a remark node based on the current cursor and the one provided.
+      * @param start The starting point of the node.
+      * @param end The ending point of the node.
+      * @exception ParserException If the nodefactory creation of the remark node fails.
+      * @return The new Remark node.
       */
      protected Node makeRemark (int start, int end)
***************
*** 915,918 ****
--- 947,952 ----
       * exhausted, in which case <code>null</code> is returned.
       * @param start The position at which to start scanning.
+      * @return The parsed node.
+      * @exception ParserException If a problem occurs reading from the source.
       */
      protected Node parseJsp (int start)
***************
*** 1070,1073 ****
--- 1104,1108 ----
       * </quote>
       * @return The <code>TextNode</code> of the CDATA or <code>null</code> if none.
+      * @exception ParserException If a problem occurs reading from the source.
       */
      public Node parseCDATA ()
***************
*** 1087,1090 ****
--- 1122,1126 ----
       * @return The <code>TextNode</code> of the CDATA or <code>null</code> if none.
       * @see #parseCDATA()
+      * @exception ParserException If a problem occurs reading from the source.
       */
      public Node parseCDATA (boolean quotesmart)
***************
*** 1229,1232 ****
--- 1265,1269 ----
       * @param start The beginning position of the string.
       * @param end The ending positiong of the string.
+      * @return The created Text node.
       */
      public Text createStringNode (Page page, int start, int end)
***************
*** 1240,1243 ****
--- 1277,1281 ----
       * @param start The beginning position of the remark.
       * @param end The ending positiong of the remark.
+      * @return The created Remark node.
       */
      public Remark createRemarkNode (Page page, int start, int end)
***************
*** 1256,1259 ****
--- 1294,1298 ----
       * @param end The ending positiong of the tag.
       * @param attributes The attributes contained in this tag.
+      * @return The created Tag node.
       */
      public Tag createTagNode (Page page, int start, int end, Vector attributes)
***************
*** 1264,1272 ****
      /**
       * Mainline for command line operation
       */
      public static void main (String[] args)
          throws
              MalformedURLException,
-             IOException,
              ParserException
      {
--- 1303,1313 ----
      /**
       * Mainline for command line operation
+      * @param args [0] The URL to parse.
+      * @exception MalformedURLException If the provided URL cannot be resolved.
+      * @exception ParserException If the parse fails.
       */
      public static void main (String[] args)
          throws
              MalformedURLException,
              ParserException
      {

Index: InputStreamSource.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/InputStreamSource.java,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** InputStreamSource.java 6 Mar 2005 21:46:31 -0000 1.5
--- InputStreamSource.java 12 Apr 2005 11:27:41 -0000 1.6
***************
*** 357,360 ****
--- 357,361 ----
       * Does nothing.
       * It's supposed to close the source, but use destroy() instead.
+      * @exception IOException <em>not used</em>
       * @see #destroy
       */
***************
*** 447,450 ****
--- 448,453 ----
       */
      public void reset ()
+         throws
+             IllegalStateException
      {
          if (null == mStream)
***************
*** 504,508 ****
       * @exception IOException If an I/O error occurs.
       */
!     public long skip (long n) throws IOException
      {
          long ret;
--- 507,514 ----
       * @exception IOException If an I/O error occurs.
       */
!     public long skip (long n)
!         throws
!             IOException,
!             IllegalArgumentException
      {
          long ret;
***************
*** 510,521 ****
          if (null == mStream)
              throw new IOException ("source is closed");
!         if (mLevel - mOffset < n)
!             fill ((int)(n - (mLevel - mOffset))); // minimum to satisfy this request
!         if (mOffset >= mLevel)
!             ret = EOF;
          else
          {
!             ret = Math.min (mLevel - mOffset, n);
!             mOffset += ret;
          }

--- 516,532 ----
          if (null == mStream)
              throw new IOException ("source is closed");
!         if (0 > n)
!             throw new IllegalArgumentException ("cannot skip backwards");
          else
          {
!             if (mLevel - mOffset < n)
!                 fill ((int)(n - (mLevel - mOffset))); // minimum to satisfy this request
!             if (mOffset >= mLevel)
!                 ret = EOF;
!             else
!             {
!                 ret = Math.min (mLevel - mOffset, n);
!                 mOffset += ret;
!             }
          }


Index: Source.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Source.java,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** Source.java 13 Mar 2005 14:51:44 -0000 1.18
--- Source.java 12 Apr 2005 11:27:41 -0000 1.19
***************
*** 89,92 ****
--- 89,93 ----
       * Does nothing.
       * It's supposed to close the source, but use {@link #destroy} instead.
+      * @exception IOException <em>not used</em>
       * @see #destroy
       */
***************
*** 139,143 ****
       * Reset the source.
       * Repositions the read point to begin at zero.
-      * @exception IllegalStateException If the source has been closed.
       */
      public abstract void reset ();
--- 140,143 ----
***************
*** 168,172 ****
       * @param n The number of characters to skip.
       * @return The number of characters actually skipped
-      * @exception IllegalArgumentException If <code>n</code> is negative.
       * @exception IOException If an I/O error occurs.
       */
--- 168,171 ----

Index: Page.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Page.java,v
retrieving revision 1.48
retrieving revision 1.49
diff -C2 -d -r1.48 -r1.49
*** Page.java 13 Mar 2005 14:51:43 -0000 1.48
--- Page.java 12 Apr 2005 11:27:41 -0000 1.49
***************
*** 288,291 ****
--- 288,292 ----
       * @param name The name to look up. One of the aliases for a character set.
       * @param _default The name to return if the lookup fails.
+      * @return The character set name.
       */
      public static String findCharset (String name, String _default)
***************
*** 449,452 ****
--- 450,454 ----
      /**
       * Close the page by destroying the source of characters.
+      * @exception IOException If destroying the source encounters an error.
       */
      public void close () throws IOException
***************
*** 597,600 ****
--- 599,603 ----
      /**
       * Get the source this page is reading from.
+      * @return The current source.
       */
      public Source getSource ()
***************
*** 776,783 ****
      /**
       * Build a URL from the link and base provided.
       * @param link The (relative) URI.
       * @param base The base URL of the page, either from the <BASE> tag
       * or, if none, the URL the page is being fetched from.
!      * @return An absolute URL.
       */
      public URL constructUrl (String link, String base)
--- 779,787 ----
      /**
       * Build a URL from the link and base provided.
+      * @return An absolute URL.
       * @param link The (relative) URI.
       * @param base The base URL of the page, either from the <BASE> tag
       * or, if none, the URL the page is being fetched from.
!      * @exception MalformedURLException If creating the URL fails.
       */
      public URL constructUrl (String link, String base)
***************
*** 913,916 ****
--- 917,922 ----
       */
      public String getText (int start, int end)
+         throws
+             IllegalArgumentException
      {
          String ret;
***************
*** 945,948 ****
--- 951,956 ----
       */
      public void getText (StringBuffer buffer, int start, int end)
+         throws
+             IllegalArgumentException
      {
          int length;
***************
*** 1005,1008 ****
--- 1013,1018 ----
       */
      public void getText (char[] array, int offset, int start, int end)
+         throws
+             IllegalArgumentException
      {
          int length;

Index: PageAttribute.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/PageAttribute.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** PageAttribute.java 13 Feb 2005 22:45:47 -0000 1.2
--- PageAttribute.java 12 Apr 2005 11:27:41 -0000 1.3
***************
*** 146,152 ****
       * @param value The value of this attribute.
       * @exception IllegalArgumentException if the value contains other than
!      * whitespace. To set a real value use {@link Attribute#Attribute(String)}.
       */
      public PageAttribute (String value)
      {
          super (value);
--- 146,154 ----
       * @param value The value of this attribute.
       * @exception IllegalArgumentException if the value contains other than
!      * whitespace. To set a real value use {@link #PageAttribute(String,String)}.
       */
      public PageAttribute (String value)
+         throws
+             IllegalArgumentException
      {
          super (value);

Index: Cursor.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/Cursor.java,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** Cursor.java 31 Jul 2004 16:42:31 -0000 1.18
--- Cursor.java 12 Apr 2005 11:27:41 -0000 1.19
***************
*** 123,126 ****
--- 123,131 ----
      }

+     /**
+      * Return a string representation of this cursor
+      * @return A string of the form "n[r,c]", where n is the character position,
+      * r is the row (zero based) and c is the column (zero based) on the page.
+      */
      public String toString ()
      {
***************
*** 150,153 ****
--- 155,162 ----
      /**
       * Compare one reference to another.
+      * @param that The object to compare this to.
+      * @return A negative integer, zero, or a positive
+      * integer as this object is less than, equal to,
+      * or greater than that object.
       * @see org.htmlparser.util.sort.Ordered
       */
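
For anyone reading the skip() hunks above without the full source to hand, the contract they document can be sketched in isolation. This is only an illustration under stated assumptions, not the library's code: the class SkipContractSketch, its constructor and its destroy() are hypothetical stand-ins, the EOF value of -1 is assumed, and returning EOF at end of input follows the InputStreamSource hunk. Only the two guard checks and their messages are taken from the diff.

```java
import java.io.IOException;

/** Hypothetical stand-in for a character source; not part of org.htmlparser. */
public class SkipContractSketch
{
    /** End-of-input marker; assumed to be -1 here. */
    static final int EOF = -1;

    private String mString; // null once the source has been destroyed
    private int mOffset;    // current read position

    public SkipContractSketch (String text)
    {
        mString = text;
        mOffset = 0;
    }

    /** Release the underlying text; later calls to skip() then fail. */
    public void destroy ()
    {
        mString = null;
    }

    /**
     * Skip up to n characters.
     * @return The number of characters skipped, or EOF at end of input.
     * @exception IOException If the source is closed.
     * @exception IllegalArgumentException If n is negative.
     */
    public long skip (long n)
        throws
            IOException,
            IllegalArgumentException
    {
        long ret;

        if (null == mString)
            throw new IOException ("source is closed");
        if (0 > n)
            throw new IllegalArgumentException ("cannot skip backwards");
        int remaining = mString.length () - mOffset;
        if (remaining <= 0)
            ret = EOF;
        else
        {
            ret = Math.min (remaining, n);
            mOffset += (int)ret;
        }
        return (ret);
    }

    public static void main (String[] args) throws IOException
    {
        SkipContractSketch source = new SkipContractSketch ("<html>");
        System.out.println (source.skip (4));  // prints 4
        System.out.println (source.skip (10)); // prints 2, only two characters remain
        System.out.println (source.skip (1));  // prints -1, nothing left
    }
}
```

Since IllegalArgumentException is unchecked, listing it in the throws clause (as the commit does) changes nothing for callers at compile time; it only documents the behaviour.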