[Htmlparser-cvs] htmlparser/src/doc-files building.html,NONE,1.1 overview.html,NONE,1.1 todo.html,NO


Update of /cvsroot/htmlparser/htmlparser/src/doc-files
In directory sc8-pr-cvs1:/tmp/cvs-serv22177/src/doc-files

Added Files:
	building.html overview.html todo.html 
Log Message:
Javadoc changes and additions. Stylesheet, overview, build instructions and todo list.
Added HTMLTaglet, an inline Javadoc taglet for embedding HTML into javadocs.

--- NEW FILE: building.html ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
  <HEAD>
    <TITLE>How to Build the HTML Parser Libraries</TITLE>
    <link REL ="stylesheet" TYPE="text/css" HREF="../stylesheet.css" TITLE="Style">
  </HEAD>
  <BODY bgcolor="white">
  <H1>How to Build the HTML Parser libraries</H1>
  <H2>JDK</H2>
  Set up Java. I won't include instructions here, just a link to the
  <a href="http://java.sun.com/j2se">Sun J2SE site</a>. I use version 1.4.1, and you
  need a JDK (Java Development Kit), not a JRE (Java Runtime Environment).<p>
  Test your installation by typing the command:<p>
  <code>javac</code><p>
  This should display help on the java compiler options.
  <H2>Ant</H2>
  Set up Ant, the Java-based build tool from the 
  <a href="http://jakarta.apache.org/ant/index.html">Apache Jakarta project</a>.
  It is kind of like Make, but without Make's wrinkles. The build.xml file the
  HTML Parser uses relies on command tags available in
  Ant version 1.4.1 or higher. The version currently used on the build machine
  is 1.5.3. The current version of Ant is available 
  <a href="http://archive.apache.org/dist/ant/ant-current-bin.zip">here</a>.<p>
  Basically you unzip the file into a directory and add an ANT_HOME environment
  variable that points at it. Test your installation by typing the command:<p>
  <code>ant -help</code><p>
  This should display help on ant options.
  <H2>Third Party Libraries</H2>
  Any needed third-party libraries are included in the lib directory.<p>
  The unit test code relies on lib/junit.jar from the <a href="http://sourceforge.net/projects/junit">JUnit project</a>.
  The version used on the build machine is 3.8.1 which you can get 
  <a href="http://prdownloads.sourceforge.net/junit/junit3.8.1.zip?download">here</a>.
  <H2>Sources</H2>
  The distribution zip file contains a src.jar file. If you've unpacked the
  distribution this file should be in the top level directory you chose.<p>
  Unjar this file with the command:<p>
  <code>jar -xf src.jar</code><p>
  There should now be a build.xml in the top level directory.
  <H2>Building</H2>
  The default ant target 'htmlparser' builds everything:<p>
  <code>ant</code><p>
  If you just want to build some of the parts see the help list:<p>
  <pre><code>ant -projecthelp
 Package        glom the release and source files into the distribution zip file
 Release        prepare the release files
 changelog      create the change log from CVS logs
 checkstyle     check source code adheres to coding standards
 clean          cleanup
 compile        compile all java files
 compilelexer   compile lexer java files
 compileparser  compile parser java files
 htmlparser     same as Package plus cleanup
 init           initialize version properties
 jar            create htmlparser.jar and htmllexer.jar
 jarlexer       create htmllexer.jar
 jarparser      create htmlparser.jar
 javadoc        create JavaDoc (API) documentation
 sources        create the source zip
 test           run the JUnit tests
 thumbelina     create thumbelina.jar
 versionSource  update the version in all java files
  </code></pre><p>
  <H2>Developing</H2>
  For development purposes you might want to get an Integrated Development
  Environment (IDE) such as <a href="http://www.netbeans.org/">NetBeans</a> or <a
  href="http://eclipse.org/">Eclipse</a>.
  Mount the org directory where the HTML Parser was installed along with the
  <code>junit.jar</code> file from the <code>lib</code> directory. "Build All"
  should work.
  <H2>CVS</H2>
  The most recent files are only available via CVS:
  <pre>
  server: cvs.htmlparser.sourceforge.net
  repository: /cvsroot/htmlparser
  </pre>
  For read-only access use 'pserver' and anonymous access with no password.
  For commit access you'll need to set up ssh (see 
<a href="http://sourceforge.net/docman/display_doc.php?docid=6841&group_id=1">an
introduction to SSH on sourceforge</a> and 
<a href="http://sourceforge.net/docman/display_doc.php?docid=761&group_id=1">a
guide on setting up ssh keys</a>).
<p>Short instructions from Karle Kaila:<p>
<pre>
I have installed SSH software from <a
href="http://www.f-secure.com">www.f-secure.com</a>

I think it was something like F-Secure SSH 5.2 for Win95/98/ME/NT4.0/2000/XP Client

It is a nice graphical SSH client both for terminal use and file transfer
and it also contains command-line ssh2 software that CVS needs.

To access CVS I first set it up with these commands

set CVS_RSH=ssh2
set CVSROOT=use...@cv...:/cvsroot/htmlparser

username = your sourceforge username

In an empty directory I then can give CVS commands such as

cvs checkout htmlparser

It asks for your password to sourceforge

This retrieves the latest file versions.
Check the CVS commands in some handbook you can find on the internet.
The manual I found is called Version Management with CVS by Per Cederqvist et al.
perhaps from http://www.cvshome.org

Derrick says:
I need
CVSROOT=:ext:use...@cv...:/cvsroot/htmlparser
CVS_RSH=ssh
</pre>
  <H2>Other</H2>
  Some of the build.xml targets (like changelog) rely on Perl to execute, and
  need a sourceforge login via ssh (secure shell). This is unlikely to be needed
  by the casual user.
  </BODY>
</HTML>

--- NEW FILE: overview.html ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
  <HEAD>
    <TITLE>HTML Parser Libraries Overview</TITLE>
  </HEAD>
  <BODY>
  <h1>The HTML Parser Libraries.</h1>
  These Java libraries provide access to the contents of local or remote HTML
  resources in a programmatic way.
  <h2>Components</h2>
  The HTML Parser distribution is composed of:
  <ul>
  <li>a low level {@link org.htmlparser.lexer.Lexer lexer} that converts characters into tags</li>
  <li>a high level {@link org.htmlparser.Parser parser} that provides a hierarchical document view</li>
  <li>several example applications</li>
  </ul>
  <p>
  <h2>Building</h2>
  To build the system you'll need to get the sources from the
  <a href="http://sourceforge.net/project/showfiles.php?group_id=24399&release_id=161563">HTML
  Parser project on Sourceforge</a> if you haven't already, and then follow the
  <A href="{@docRoot}/doc-files/building.html">build instructions</A>.
  <h2>History</h2>
  <p>
  Originally started by Somik Raha, the HTML Parser has evolved with input from
  numerous people, and through several revisions...
  <h2>Outstanding Issues.</h2>
  The <A href="{@docRoot}/doc-files/todo.html">ToDo list</A> lists things that
  can or should be done.
  <h2>Mailing Lists.</h2>
  If you want to be notified when new releases of HTML Parser are available, join the
  <A HREF="http://lists.sourceforge.net/lists/listinfo/htmlparser-announce"
  target="_top">HTML Parser Announcement List</A>.<br>
  If you have questions about the usage of the parser, join the
  <A HREF="http://lists.sourceforge.net/lists/listinfo/htmlparser-user"
  target="_top">HTML Parser User List</A>.<br>
  If you want to join as a developer, please sign up on the
  <A HREF="http://lists.sourceforge.net/lists/listinfo/htmlparser-developer"
  target="_top">HTML Parser Developer List</A>
  </BODY>
</HTML>

--- NEW FILE: todo.html ---
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<HTML>
  <HEAD>
    <TITLE>ToDo List for the HTML Parser Libraries</TITLE>
    <link REL ="stylesheet" TYPE="text/css" HREF="../stylesheet.css" TITLE="Style">
  </HEAD>
  <BODY>
<ul>
<li>
It looks like there are enough bugs and requests to warrant another 1.3 point
release with some patched files.
I hate to work on a branch, but it may be the only way to get everyone off my
back.
</li>
<li>
Implement the new filtering mechanism for NodeList.searchFor ().
</li>
<li>
As of now, it's more likely that the javadocs are lying to you than providing
any helpful advice. This needs to be reworked completely.
</li>
<li>
There are some changes needed in the lexer state machines to handle JSP
constructs and also whitespace either side of attribute equals signs. Currently
the latter is handled by a kludgy fixAttributes() method applied after a tag is
parsed, but it would be better handled in the state machine initially. The
former isn't handled at all, and would involve all nodes possibly having
children (a remark or string node can have embedded JSP, e.g.
&lt;!-- this remark, created on &lt;%@ date() %&gt;, needs to be handled --&gt;).
So some design work needs to be done to analyze the state transitions and
gating characters.
</li>
<li>
toHtml(boolean verbatim/fixed) - one of the design goals for the new Lexer
subsystem was to be able to regurgitate the original HTML via the toHtml()
method, so the original page is unmodified except for any explicit user edits,
e.g. link URL edits. But the parser fixes broken HTML without asking, so you
can't get back an unadulterated page from toHtml(). A lot of test cases assume
fixed HTML. Either a parameter on toHtml() or another method would be needed to
provide the choice of the original HTML or the fixed HTML. There's some initial
work on eliminating the added virtual end tags commented out in TagNode, but it
will also require a way to remember broken tags,
like:
<pre>
&lt;title&gt;The Title&lt;/title&lt;/head&gt;&lt;body&gt;...
</pre>
</li>
<li>
Some GUI based parser application showing the HTML parse tree in one panel and
the HTML text in another, with the tree node selected being highlighted in the
text, or the text cursor setting the tree node selected, would be really good. A
filter builder tool to graphically construct a program to extract a snippet
from an HTML page would blow people away.
</li>
<li>
Rework all the applications for a better 'out of the box' experience for new
and novice users. Fix all the scripts in /bin (for unix and windows) and add
any others that don't exist already.
</li>
<li>
The tag-enders and end-tag-enders lists are only a partial solution to the HTML
specification for block and inline tags. By marking each tag as a block or
inline tag and ensuring block tags don't overlap, a better parsing job could be
done, e.g.
<pre>
   &lt;FORM&gt;    ....   &lt;TABLE&gt;   ...   &lt;/FORM&gt;&lt;/TABLE&gt;
</pre>
would be rearranged as
<pre>
   &lt;FORM&gt;    ....   &lt;TABLE&gt;   ...   &lt;/TABLE&gt;&lt;/FORM&gt;
</pre>
This needs some design work.
</li>
<li>
The recursion that currently happens on the JVM stack can probably be done via a
stack of open tags passed to the scanner. This would probably avoid the 'Stack
overflow' exceptions observed on Windows and also allow for smarter tag closing
(in conjunction with the end tag enders list).
</li>
<li>
Change all the headers to match the new format. The integration process needs to
be revamped to use the $Name$ CVS keyword substitution (via 'get label'), so a checkin
isn't required every integration.
</li>
<li>
The default is now the equivalent of the old 'RegisterDomTags', so the
operation of the following mainlines needs to be revisited:
<ol>
<li>
Generate
</li>
<li>
Parser
</li>
<li>
LinkBean
</li>
<li>
Robot
</li>
<li>
InstanceofPerformanceTest
</li>
<li>
StringBean
</li>
<li>
MailRipper
</li>
<li>
LinkExtractor
</li>
<li>
BeanyBaby
</li>
</ol>
</li>
<li>
decode() can be optimized by adding start and end parameters to
convertToChar, i.e. convertToChar(String bigString, int startToLook,
int endToLook), to eliminate the substring operations.
</li>
<li>
Use <A href="http://trove4j.sourceforge.net/javadocs/gnu/trove/TObjectIntHashMap.html">
TObjectIntHashMap</A> or use a sorted list similar to the newline index in PageIndex
to avoid the HashMap and the 336 Character objects in Translate.
</li>
<li>
Modify StringBean so it can be driven by a visitor externally.
See
<A
href="http://sourceforge.net/mailarchive/forum.php?forum_id=2023&max_rows=25&style=flat&viewmonth=200311&viewday=12">StringBean.diff</A>.
</li>
</ul>
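<p>
A minimal sketch of the "stack of open tags" idea from the list above, combined
with the FORM/TABLE rearrangement example. It is not the HTML Parser's scanner
code: the class name <code>OpenTagStackSketch</code>, the
<code>rebalance</code> method and the flat token-list input are all
hypothetical, chosen only to show how an explicit stack can replace recursion
on the JVM stack and close overlapping tags in the right order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; none of these names exist in the HTML Parser.
final class OpenTagStackSketch {

    // Assumption: a tag token looks like "<NAME>" or "</NAME>"; anything else is text.
    private static boolean isTag(String token) {
        return token.length() > 2 && token.startsWith("<") && token.endsWith(">");
    }

    /**
     * Walks the tokens once, keeping the currently open tag names on an
     * explicit stack instead of recursing once per nesting level.
     * Names are compared exactly (case-sensitive) to keep the sketch short.
     */
    static List<String> rebalance(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (isTag(token) && token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue; // stray end tag with no matching open tag: drop it
                }
                // Close everything opened after the matching tag first,
                // so tags never overlap in the output.
                while (!open.peek().equals(name)) {
                    out.add("</" + open.pop() + ">");
                }
                open.pop();
                out.add(token);
            } else {
                if (isTag(token)) {
                    open.push(token.substring(1, token.length() - 1));
                }
                out.add(token); // start tags and text pass through unchanged
            }
        }
        // Anything still open at end of input gets a virtual end tag.
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "<FORM>", "...", "<TABLE>", "...", "</FORM>", "</TABLE>");
        // Prints [<FORM>, ..., <TABLE>, ..., </TABLE>, </FORM>]
        System.out.println(rebalance(input));
    }
}
```

Because the nesting depth lives in a heap-allocated <code>ArrayDeque</code>,
very deeply nested input grows the deque rather than the call stack. A real
implementation would also need to consult the tag-enders lists and the
block/inline distinction, which this sketch ignores.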
  </BODY>
</HTML>
