[Htmlparser-cvs] htmlparser/src/doc-files building.html,NONE,1.1 overview.html,NONE,1.1 todo.html,NO
Brought to you by:
derrickoswald
From: <der...@us...> - 2003-12-16 02:29:59
Update of /cvsroot/htmlparser/htmlparser/src/doc-files
In directory sc8-pr-cvs1:/tmp/cvs-serv22177/src/doc-files

Added Files:
building.html overview.html todo.html

Log Message:
Javadoc changes and additions. Stylesheet, overview, build instructions and todo list. Added HTMLTaglet, an inline Javadoc taglet for embedding HTML into javadocs.

--- NEW FILE: building.html ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>How to Build the HTML Parser Libraries</TITLE>
<link REL ="stylesheet" TYPE="text/css" HREF="../stylesheet.css" TITLE="Style">
</HEAD>
<BODY bgcolor="white">
<H1>How to Build the HTML Parser libraries</H1>
<H2>JDK</H2>
Set up Java. I won't include instructions here, just a link to the <a href="http://java.sun.com/j2se">Sun j2se site</a>. I use version 1.4.1, and you need a JDK (Java Development Kit), not a JRE (Java Runtime Environment).<p>
Test your installation by typing the command:<p>
<code>javac</code><p>
This should display help on the Java compiler options.
<H2>Ant</H2>
Set up Ant, the Java-based build tool from the <a href="http://jakarta.apache.org/ant/index.html">Apache Jakarta project</a>. It is kind of like Make, but without Make's wrinkles. The build.xml file the HTML Parser uses relies on command tags available in Ant version 1.4.1 or higher. The version currently used on the build machine is 1.5.3. The current version of Ant is available <a href="http://archive.apache.org/dist/ant/ant-current-bin.zip">here</a>.<p>
Basically you unzip the file into a directory and add an ANT_HOME environment variable that points at it. Test your installation by typing the command:<p>
<code>ant -help</code><p>
This should display help on Ant options.
<H2>Third Party Libraries</H2>
Any needed third-party libraries are included in the lib directory.<p>
The unit test code relies on lib/junit.jar from the <a href="http://sourceforge.net/projects/junit">JUnit project</a>.
The version used on the build machine is 3.8.1 which you can get <a href="http://prdownloads.sourceforge.net/junit/junit3.8.1.zip?download">here</a>.
<H2>Sources</H2>
The distribution zip file contains a src.jar file. If you've unpacked the distribution this file should be in the top level directory you chose.<p>
Unjar this file with the command:<p>
<code>jar -xf src.jar</code><p>
There should now be a build.xml in the top level directory.
<H2>Building</H2>
The default ant target 'htmlparser' builds everything:<p>
<code>ant</code><p>
If you just want to build some of the parts see the help list:<p>
<pre><code>ant -projecthelp
Package         glom the release and source files into the distribution zip file
Release         prepare the release files
changelog       create the change log from CVS logs
checkstyle      check source code adheres to coding standards
clean           cleanup
compile         compile all java files
compilelexer    compile lexer java files
compileparser   compile parser java files
htmlparser      same as Package plus cleanup
init            initialize version properties
jar             create htmlparser.jar and htmllexer.jar
jarlexer        create htmllexer.jar
jarparser       create htmlparser.jar
javadoc         create JavaDoc (API) documentation
sources         create the source zip
test            run the JUnit tests
thumbelina      create thumbelina.jar
versionSource   update the version in all java files
</code></pre><p>
<H2>Developing</H2>
For development purposes you might want to get an Integrated Development Environment (IDE) such as <a href="http://www.netbeans.org/">NetBeans</a> or <a href="http://eclipse.org/">Eclipse</a>. Mount the org directory where the HTML Parser was installed along with the <code>junit.jar</code> file from the <code>lib</code> directory. "Build All" should work.
<H2>CVS</H2>
The most recent files are only available via CVS:
<pre>
server: cvs.htmlparser.sourceforge.net
repository: /cvsroot/htmlparser
</pre>
For read-only access use 'pserver' and anonymous access with no password.
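As a rough sketch of the read-only setup, the server and repository listed above can be combined into a CVSROOT using the standard CVS <code>:pserver:</code> connection-string syntax. The helper name <code>checkout_htmlparser</code> is hypothetical, and the module name <code>htmlparser</code> is an assumption rather than something the instructions state for anonymous access.

```shell
#!/bin/sh
# Anonymous, read-only CVSROOT assembled from the server and repository above.
# The :pserver: connection-string layout is standard CVS syntax, not project-specific.
CVSROOT=":pserver:anonymous@cvs.htmlparser.sourceforge.net:/cvsroot/htmlparser"
export CVSROOT

# Hypothetical helper: nothing contacts the server until it is called.
checkout_htmlparser() {
    # Press Enter at the password prompt, since anonymous access has no password.
    cvs login || return 1
    # Assumes the module is named 'htmlparser'; -z3 compresses the transfer.
    cvs -z3 checkout htmlparser
}
```

Sourcing this file and then calling <code>checkout_htmlparser</code> from an empty directory should leave a working copy in a new <code>htmlparser</code> subdirectory, provided a <code>cvs</code> client is installed and the server is reachable.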
For commit access you'll need to set up ssh (see <a href="http://sourceforge.net/docman/display_doc.php?docid=6841&group_id=1">an introduction to SSH on sourceforge</a> and <a href="http://sourceforge.net/docman/display_doc.php?docid=761&group_id=1">a guide on setting up ssh keys</a>).
<p>Short instructions from Karle Kaila:<p>
<pre>
I have installed SSH software from <a href="http://www.f-secure.com">www.f-secure.com</a>
I think it was something like
F-Secure SSH 5.2 for Win95/98/ME/NT4.0/2000/XP Client

It is a nice graphical SSH client both for terminal use and file transfer
and it also contains command-line ssh2 software that CVS needs.

To access CVS I first set it up with these commands

set CVS_RSH=ssh2
set CVSROOT=use...@cv...:/cvsroot/htmlparser

username = your sourceforge username

In an empty directory I then can give CVS commands such as

cvs checkout htmlparser

It asks for your password to sourceforge

This retrieves the latest file versions.

Check the CVS commands in some handbook you can find on the internet.
The manual I found is called Version Management with CVS by Per Cederqvist et al.
perhaps from http://www.cvshome.org

Derrick says:
I need
CVSROOT=:ext:use...@cv...:/cvsroot/htmlparser
CVS_RSH=ssh
</pre>
<H2>Other</H2>
Some of the build.xml targets (like changelog) rely on Perl to execute, and need a sourceforge login via ssh (secure shell). This is unlikely to be needed by the casual user.
</BODY>
</HTML>

--- NEW FILE: overview.html ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>HTML Parser Libraries Overview</TITLE>
</HEAD>
<BODY>
<h1>The HTML Parser Libraries.</h1>
These Java libraries provide access to the contents of local or remote HTML resources in a programmatic way.
<h2>Components</h2>
The HTML Parser distribution is composed of:
<li>a low level {@link org.htmlparser.lexer.Lexer lexer} that converts characters into tags</li>
<li>a high level {@link org.htmlparser.Parser parser} that provides a hierarchical document view</li>
<li>several example applications</li>
<p>
<h2>Building</h2>
To build the system you'll need to get the sources from the <a href="http://sourceforge.net/project/showfiles.php?group_id=24399&release_id=161563">HTML Parser project on Sourceforge</a> if you haven't already, and then follow the <A href="{@docRoot}/doc-files/building.html">build instructions</A>.
<h2>History</h2>
<p>
Originally started by Somik Raha, the HTML Parser has evolved with input from numerous people, and through several revisions...
<h2>Outstanding Issues.</h2>
The <A href="{@docRoot}/doc-files/todo.html">ToDo list</A> lists things that can or should be done.
<h2>Mailing Lists.</h2>
If you want to be notified when new releases of HTML Parser are available, join the <A HREF="http://lists.sourceforge.net/lists/listinfo/htmlparser-announce" target="_top">HTML Parser Announcement List</A>.<br>
If you have questions about the usage of the parser, join the <A HREF="http://lists.sourceforge.net/lists/listinfo/htmlparser-user" target="_top">HTML Parser User List</A>.<br>
If you want to join as a developer, please sign up on the <A HREF="http://lists.sourceforge.net/lists/listinfo/htmlparser-developer" target="_top">HTML Parser Developer List</A>
</BODY>
</HTML>

--- NEW FILE: todo.html ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>ToDo List for the HTML Parser Libraries</TITLE>
<link REL ="stylesheet" TYPE="text/css" HREF="../stylesheet.css" TITLE="Style">
</HEAD>
<BODY>
<ul>
<li>
It looks like there are enough bugs and requests to warrant another 1.3 point release with some patched files. I hate to work on a branch, but it may be the only way to get everyone off my back.
</li>
<li>
Implement the new filtering mechanism for NodeList.searchFor().
</li>
<li>
As of now, it's more likely that the javadocs are lying to you than providing any helpful advice. This needs to be reworked completely.
</li>
<li>
There are some changes needed in the lexer state machines to handle JSP constructs and also whitespace either side of attribute equals signs. Currently the latter is handled by a kludgy fixAttributes() method applied after a tag is parsed, but it would be better handled in the state machine initially. The former isn't handled at all, and would involve all nodes possibly having children (a remark or string node can have embedded JSP, e.g. <!-- this remark, created on <%@ date() %>, needs to be handled -->). So some design work needs to be done to analyze the state transitions and gating characters.
</li>
<li>
toHtml(boolean verbatim/fixed) - one of the design goals for the new Lexer subsystem was to be able to regurgitate the original HTML via the toHtml() method, so the original page is unmodified except for any explicit user edits, e.g. link URL edits. But the parser fixes broken HTML without asking, so you can't get back an unadulterated page from toHtml(). A lot of test cases assume fixed HTML. Either a parameter on toHtml() or another method would be needed to provide the choice of the original HTML or the fixed HTML. There's some initial work on eliminating the added virtual end tags commented out in TagNode, but it will also require a way to remember broken tags, like:
<pre>
<title>The Title</title</head><body>...
</pre>
</li>
<li>
Some GUI based parser application showing the HTML parse tree in one panel and the HTML text in another, with the tree node selected being highlighted in the text, or the text cursor setting the tree node selected, would be really good. A filter builder tool to graphically construct a program to extract a snippet from an HTML page would blow people away.
</li>
<li>
Rework all the applications for a better 'out of the box' experience for new and novice users. Fix all the scripts in /bin (for unix and windows) and add any others that don't exist already.
</li>
<li>
The tag-enders and end-tag-enders lists are only a partial solution to the HTML specification for block and inline tags. By marking each tag as a block or inline tag and ensuring block tags don't overlap, a better parsing job could be done, e.g.
<pre>
<FORM> .... <TABLE> ... </FORM></TABLE>
</pre>
would be rearranged as
<pre>
<FORM> .... <TABLE> ... </TABLE></FORM>
</pre>
This needs some design work.
</li>
<li>
The recursion that currently happens on the JVM stack can probably be done via a stack of open tags passed to the scanner. This would probably avoid the 'Stack overflow' exceptions observed on Windows and also allow for smarter tag closing (in conjunction with the end tag enders list).
</li>
<li>
Change all the headers to match the new format. The integration process needs to be revamped to use the $Name: CVS substitution (via 'get label'), so a checkin isn't required every integration.
</li>
<li>
The default is now the equivalent of the old 'RegisterDomTags', so the operation of the following mainlines needs to be revisited:
<ol>
<li> Generate </li>
<li> Parser </li>
<li> LinkBean </li>
<li> Robot </li>
<li> InstanceofPerformanceTest </li>
<li> StringBean </li>
<li> MailRipper </li>
<li> LinkExtractor </li>
<li> BeanyBaby </li>
</ol>
</li>
<li>
decode() can be optimized by introducing parameters for start and end in the convertToChar( String bigString, int startToLook, int endToLook) to eliminate the substring operations.
</li>
<li>
Use <A href="http://trove4j.sourceforge.net/javadocs/gnu/trove/TObjectIntHashMap.html"> TObjectIntHashMap</A> or use a sorted list similar to the newline index in PageIndex to avoid the HashMap and the 336 Character objects in Translate.
</li>
<li>
Modify StringBean so it can be driven by a visitor externally.
See <A href="http://sourceforge.net/mailarchive/forum.php?forum_id=2023&max_rows=25&style=flat&viewmonth=200311&viewday=12">StringBean.diff</A>. </li> </ul> </BODY> </HTML>