[Htmlparser-cvs] htmlparser/src/org/htmlparser/lexer PageIndex.java,1.10,1.11 package.html,1.7,1.8
From: <der...@us...> - 2003-10-26 17:59:30
Update of /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer
In directory sc8-pr-cvs1:/tmp/cvs-serv7966

Modified Files:
    PageIndex.java package.html
Log Message:
Doco update. Move the lexer from future tense to current.

Index: PageIndex.java
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/PageIndex.java,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** PageIndex.java 29 Sep 2003 00:00:39 -0000 1.10
--- PageIndex.java 26 Oct 2003 17:58:25 -0000 1.11
***************
*** 39,45 ****
  
  /**
!  * A sorted array of integers which are the positions of end of line characters.
!  * Maintains a list of integers which are (the positions of the first
!  * characters of each line.
   * To facilitate processing the first element should be maintained at position 0.
   * Facilities to add, remove, search and determine row and column are provided.
--- 39,43 ----
  
  /**
!  * A sorted array of integers, the positions of the first characters of each line.
   * To facilitate processing the first element should be maintained at position 0.
   * Facilities to add, remove, search and determine row and column are provided.

Index: package.html
===================================================================
RCS file: /cvsroot/htmlparser/htmlparser/src/org/htmlparser/lexer/package.html,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** package.html 22 Sep 2003 02:39:59 -0000 1.7
--- package.html 26 Oct 2003 17:58:25 -0000 1.8
***************
*** 39,44 ****
  </HEAD>
  <BODY>
! The lexer package will eventually be the base level I/O subsystem.
! <EM>It is currently under development.</EM>
  <P>The lexer package is responsible for reading characters from the HTML source
  and identifying the node lexemes. For example, the HTML code below would return
--- 39,43 ----
  </HEAD>
  <BODY>
! The lexer package is the base level I/O subsystem.
  <P>The lexer package is responsible for reading characters from the HTML source
  and identifying the node lexemes. For example, the HTML code below would return
***************
*** 98,110 ****
  <DD><B>Adjacent nodes have no characters between them.</B> The list of nodes
  forms an uninterrupted chain that, by start and end definitions, completely covers the
! characters that were read from the HTML source. Despite this, the nodes are not
! stored in a linked list, but rather an array to ease any editing tasks that may
! be performed.
  <DT>Text Fidelity
! <DD>Besides complete coverage, the <B>nodes do not contain copies of the text</B>,
! but instead simply contain offsets into a single large buffer that contains the
! text read from the HTML source. Even within tags, the attributes list can
! contain whitespace, thus there is no lost whitespace or text formatting
! either outside or within tags. Upper and lower case text is preserved.
  <DT>Line Endings
  <DD><B>End of line characters are just whitespace.</B> There is no distinction
--- 97,108 ----
  <DD><B>Adjacent nodes have no characters between them.</B> The list of nodes
  forms an uninterrupted chain that, by start and end definitions, completely covers the
! characters that were read from the HTML source.
  <DT>Text Fidelity
! <DD>Besides complete coverage, the <B>nodes do not initially contain copies of
! the text</B>, but instead simply contain offsets into a single large buffer
! that contains the text read from the HTML source. Even within tags, the
! attributes list can contain whitespace, thus there is no lost whitespace or
! text formatting either outside or within tags. Upper and lower case text is
! preserved.
  <DT>Line Endings
  <DD><B>End of line characters are just whitespace.</B> There is no distinction
***************
*** 127,138 ****
  all that's needed for a low level parse of the HTML source. In previous
  implementations, the attributes were parsed on a second scan after the initial
! tag was extracted.
  <DT>Two Jars
  <DD>For elementary operations at the node level, a minimalist jar file containing
  <B>only the lexer and base tag classes</B> is split out from the larger
  <CODE>htmlparser.jar</CODE>. In this way, simple parsing and output is handled with a jar file that is under
! 40 kilobytes, but anything beyond peephole manipulation, i.e. closing tag detection
  and other semantic reasoning will need the full set of scanners, nodes and ancillary
! classes, which now stands at 160 kilobytes.
  </DL>
  </BODY>
--- 125,137 ----
  all that's needed for a low level parse of the HTML source. In previous
  implementations, the attributes were parsed on a second scan after the initial
! tag was extracted. (Actually, for error conditions, the lexer can back up a
! node to handle missing end tags etc.).
  <DT>Two Jars
  <DD>For elementary operations at the node level, a minimalist jar file containing
  <B>only the lexer and base tag classes</B> is split out from the larger
  <CODE>htmlparser.jar</CODE>. In this way, simple parsing and output is handled with a jar file that is under
! 45 kilobytes, but anything beyond peephole manipulation, i.e. closing tag detection
  and other semantic reasoning will need the full set of scanners, nodes and ancillary
! classes, which now stands at 210 kilobytes.
  </DL>
  </BODY>
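
For readers unfamiliar with the idea in the revised PageIndex comment, here is a minimal sketch of a sorted array of line-start offsets, with element 0 kept at position 0 and row and column found by binary search. It is not the actual PageIndex code: the class name LineStartIndex and the methods add, row, column and of are hypothetical, and it assumes that only '\n' ends a line and that offsets are non-negative.

```java
import java.util.Arrays;

/** Hypothetical sketch; not the org.htmlparser.lexer.PageIndex API. */
final class LineStartIndex {
    // Sorted offsets of the first character of each line; element 0 is always 0.
    private int[] starts = {0};
    private int count = 1;

    /** Records the offset of a line's first character, keeping the array sorted. */
    void add(int offset) {
        int pos = Arrays.binarySearch(starts, 0, count, offset);
        if (pos >= 0) {
            return; // already recorded
        }
        int insertAt = -(pos + 1);
        if (count == starts.length) {
            starts = Arrays.copyOf(starts, count * 2);
        }
        // Shift the tail right by one slot to make room for the new offset.
        System.arraycopy(starts, insertAt, starts, insertAt + 1, count - insertAt);
        starts[insertAt] = offset;
        count++;
    }

    /** Zero-based row of the line containing the given (non-negative) offset. */
    int row(int offset) {
        int pos = Arrays.binarySearch(starts, 0, count, offset);
        // Exact hit: offset is a line start. Otherwise take the entry just before
        // the insertion point, i.e. the last line start that is <= offset.
        return pos >= 0 ? pos : -(pos + 1) - 1;
    }

    /** Zero-based column of the given offset within its line. */
    int column(int offset) {
        return offset - starts[row(offset)];
    }

    /** Builds an index for a string, assuming '\n' is the only line terminator. */
    static LineStartIndex of(String text) {
        LineStartIndex index = new LineStartIndex();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                index.add(i + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String text = "<HTML>\n<BODY>\nHello\n";
        LineStartIndex index = LineStartIndex.of(text);
        int offset = text.indexOf("Hello") + 1; // the 'e' in "Hello"
        System.out.println(index.row(offset) + ":" + index.column(offset)); // prints 2:1
    }
}
```

Keeping the first element at 0 means every non-negative offset has a line start at or before it, so the lookup needs no special case for the first line.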