Jericho HTML Parser / News: Recent posts

Jericho HTML Parser 3.3 released

Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML.

Version 3.3 includes important bug fixes and various enhancements.

Change log:
http://jericho.htmlparser.net/release.txt

Posted by Martin Jericho 2012-10-30

Jericho HTML Parser 3.2 released

Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML.

Version 3.2 includes important bug fixes and various enhancements including HTML5 support.

Change log:
http://jericho.htmlparser.net/release.txt

Posted by Martin Jericho 2011-03-05

Jericho HTML Parser 3.1 released

Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Version 3.1 includes important bug fixes and the following enhancements:

- A new stream-based parsing option using the StreamedSource class, which
  allows memory-efficient processing of large files using an event iterator.
  This is essentially a StAX alternative with the ability to process HTML
  and non-validating XML, as well as several other features not available
  in other streaming parsers. ...

Posted by Martin Jericho 2009-06-11

Jericho HTML Parser 3.0 released

Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Version 3.0 is a major new release with the following features:

- Requires runtime Java 5 or later

- Major API changes including:
  - change of package name to net.htmlparser.jericho
  - use of generics and enums
  - changes to some method names to maintain consistent naming conventions ...

Posted by Martin Jericho 2009-04-09

Jericho HTML Parser 3.0-beta1 (Java 5) released

A Java 5 based beta version of Jericho HTML Parser is available for download:
http://jericho.htmlparser.net/temp/jericho-html-3.0-beta1.zip

The new version makes use of Java 5 features such as generics and enums, and is not binary compatible with older versions.

CHANGES REQUIRED TO ALL EXISTING PROGRAMS: (will not work without modification)

- Package name has changed from au.id.jericho.lib.html to net.htmlparser.jericho. ...

Posted by Martin Jericho 2008-06-30

Jericho HTML Parser 2.6 released

Jericho HTML Parser is a powerful Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Version 2.6 includes important bug fixes and the following enhancements:

- Non-server tags are no longer recognised inside server tags

- Recognition of Microsoft downlevel-revealed conditional comments ...

Posted by Martin Jericho 2008-06-25

Jericho HTML Parser 2.5 released

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Version 2.5 includes important bug fixes, and introduces the following minor enhancements:

- Elements inside SCRIPT elements are ignored ...

Posted by Martin Jericho 2007-09-02

Jericho HTML Parser 2.4 released

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Version 2.4 includes important bug fixes, and introduces the following major features:

1. Licensed under Eclipse Public License (EPL) as well as LGPL. ...

Posted by Martin Jericho 2007-05-20

Jericho HTML Parser 2.3 has new Maven Group ID

Jericho HTML Parser 2.3 has been uploaded to the ibiblio Maven repository under the new group id "net.htmlparser.jericho".

The new group id matches the Java package name that will be used in the next major release.

The new POM file is located here:
http://www.ibiblio.org/maven2/net/htmlparser/jericho/jericho-html/2.3/jericho-html-2.3.pom

A relocation POM file has also been added under the old group id for the benefit of existing Maven users.
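
For projects built with Maven, a dependency declaration along the following lines should pick up the library under the new group id. This is only a sketch: the group id is the one quoted in this post, while the artifact id and version are read off the POM path above, and the surrounding `<dependency>` element is standard Maven POM syntax rather than anything specific to Jericho.

```xml
<!-- Sketch of a dependency entry for a project's pom.xml.
     Coordinates are taken from the POM path in this post:
     net/htmlparser/jericho/jericho-html/2.3/jericho-html-2.3.pom -->
<dependency>
  <groupId>net.htmlparser.jericho</groupId>
  <artifactId>jericho-html</artifactId>
  <version>2.3</version>
</dependency>
```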

Posted by Martin Jericho 2006-10-07

Jericho HTML Parser 2.3 released

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides useful HTML form utilities.

Version 2.3 includes important bug fixes as well as some minor improvements.

Changes since version 2.2:
- Bug Fixes:
  - [1510438] NullPointerException in Source.indent.
  - [1511480] Incorrect detection of non-html element with nested
    empty-element tag of same name.
  - [1547562] Fault in caching mechanism.
  - Source.fullSequentialParse() sometimes resulted in unregistered
    tags being returned in tag searches.
  - Invalid Empty-element tags whose name is in either of the sets
    HTMLElements.getEndTagOptionalElementNames() or
    HTMLElements.getEndTagRequiredElementNames() were rejected by the
    parser if the slash immediately follows the tag name.
  - StartTag.tidy() only included a slash before the closing delimiter
    of the tag if the tag name was in the set of
    HTMLElements.getEndTagForbiddenElementNames(). It now includes the
    slash for all tag names not in getEndTagOptionalElementNames().
- Source.fullSequentialParse() now clears the cache automatically
  instead of throwing an IllegalStateException if the cache is not
  empty.
- Changes to behaviour of Source.indent:
  - preserves indenting in SCRIPT elements, server elements,
    HTML comments and CDATA sections.
  - keeps SCRIPT elements, HTML comments, XML declarations,
    XML processing instructions and markup declarations inline.
- Minor documentation improvements.

Posted by Martin Jericho 2006-09-11

Jericho HTML Parser 2.2 released

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides useful HTML form utilities.

Version 2.2 includes important bug fixes, and introduces the following major features:

1. The Source.fullSequentialParse() method provides a much more efficient means of parsing the entire source document. ...

Posted by Martin Jericho 2006-06-19

Jericho HTML Parser in Maven repository

Jericho HTML Parser (jericho-html) has been published to the Maven2 repository under the group id "net.htmlparser".

The main benefit of this is to simplify the inclusion of the library in projects built using Maven. (http://maven.apache.org/)

The POM file is located here:
http://www.ibiblio.org/maven2/net/htmlparser/jericho-html/2.1/jericho-html-2.1.pom

Posted by Martin Jericho 2006-01-11

Jericho HTML Parser 2.1 released

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Version 2.1 adds two main features:

1. The Segment.extractText() method extracts all of the text from a segment of the source document, removing all markup and collapsing whitespace. This is a simple text extraction only and makes no attempt to render the markup. ...

Posted by Martin Jericho 2005-12-24

Jericho HTML Parser 2.0 released

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Version 2.0 is a complete rewrite of previous versions, including the core parser and caching mechanism.

The major new feature in 2.0, in addition to the HTML form manipulation features added in 1.5-dev, is the ability to easily define new tag types for recognition by the parser. Performance is also greatly improved and the documentation has been expanded significantly. ...

Posted by Martin Jericho 2005-11-11

Jericho HTML Parser 1.4 released

Jericho HTML Parser is a simple but powerful Java HTML parser library allowing analysis and manipulation of HTML documents.

Version 1.4 introduces classes for dealing with character entity references and numeric character references, relaxes rules for parsing attributes, and includes some minor documentation improvements.

Change Log:

- Added CharacterEntityReference and NumericCharacterReference classes
- Added CharOutputSegment class
- Attributes allow whitespace around '=' sign
- Added convenience method Element.getAttributes()
- Some documentation improvements

Posted by Martin Jericho 2004-09-02

Jericho HTML Parser 1.3 released

Jericho HTML Parser is a simple but powerful Java HTML parser library allowing analysis and manipulation of HTML documents. Version 1.3 introduces some minor features such as the ability to ignore specified sections of the document when parsing, and to parse attribute lists in any part of the document.
It also fixes a major bug related to the presence of comments in the document.

Change Log:

- Deprecated Source.getSourceTextLowerCase()
- Added ignoreWhenParsing methods to Source and Segment classes
  (See sample called JSPTest)
- Added parseAttributes methods to Source, Segment and StartTag classes
- Added BlankOutputSegment class
- Bug fixes

Posted by Martin Jericho 2004-07-25

Jericho HTML Parser 1.2 released

Jericho HTML Parser is a simple but powerful Java HTML parser library allowing analysis and manipulation of HTML documents. Version 1.2 introduces the recognition of common server-side tags such as ASP, JSP, PSP, PHP and Mason. Various other performance and usability improvements are also included.

Change Log:

- Deprecated public fields in Attribute class in favour of accessor methods
- Following methods return empty list instead of null if no result:
  (WARNING - This could possibly break existing programs)
    Segment.findAllStartTags(String name)
    Segment.findAllComments()
    Segment.findAllElements(String name)
    Segment.findAllElements()
- Added hashCode() method to Segment class
- Server tags such as ASP, JSP, PSP, PHP and Mason are now recognised
- Basic parser logging introduced (see Source.setLogWriter() method)
- Start tags with too many badly formed attributes rejected
  (reduces number of false positives when searching for start tags)
- Added public IOutputSegment.COMPARATOR field
- Improved caching ...
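
The warning attached to the find methods in the list above is easiest to see with a small example. The following sketch does not use the Jericho API: `findAllOld` and `findAllNew` are hypothetical stand-ins for a search method before and after the change, and the tag names are made up. It shows how a caller that treated "not null" as "something was found" silently changes behaviour, and one way to write the check so that it holds under both contracts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyListContract {
    // Hypothetical stand-in for the old behaviour: null when nothing matches.
    static List<String> findAllOld(List<String> tagNames, String name) {
        List<String> found = findAllNew(tagNames, name);
        return found.isEmpty() ? null : found;
    }

    // Hypothetical stand-in for the new behaviour: an empty list when nothing matches.
    static List<String> findAllNew(List<String> tagNames, String name) {
        List<String> found = new ArrayList<>();
        for (String tagName : tagNames) {
            if (tagName.equals(name)) {
                found.add(tagName);
            }
        }
        return found;
    }

    // Caller written against the old contract: "not null" meant "at least one match".
    static boolean legacyHasMatch(List<String> result) {
        return result != null;
    }

    // Caller that gives the same answer under both contracts.
    static boolean safeHasMatch(List<String> result) {
        return result != null && !result.isEmpty();
    }

    public static void main(String[] args) {
        List<String> tagNames = Arrays.asList("p", "div", "p");
        // No "table" entry exists, so a correct check should report false.
        System.out.println(legacyHasMatch(findAllOld(tagNames, "table"))); // false
        System.out.println(legacyHasMatch(findAllNew(tagNames, "table"))); // true: the legacy check is now wrong
        System.out.println(safeHasMatch(findAllNew(tagNames, "table")));   // false
    }
}
```

Existing code that only iterates over the returned list is unaffected by the new contract; only explicit null tests used as an "any results?" check need revisiting.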

Posted by Martin Jericho 2004-06-16

Jericho HTML Parser 1.1 released

Jericho HTML Parser is a simple but powerful Java HTML parser library allowing analysis and manipulation of HTML documents. Version 1.1 introduces recognition of elements with optional end tags and includes major performance enhancements.

Other changes include the addition of numerous methods associated with the recognition of HTML 4.01 elements, as well as a bug fix.

For more details see:
- release notes: http://sourceforge.net/project/shownotes.php?release_id=222020
- javadocs: http://jerichohtml.sourceforge.net/api/index.html

Posted by Martin Jericho 2004-03-07

Jericho HTML Parser 1.0 released

Jericho HTML Parser is a simple but powerful Java library for analysing and modifying HTML. It can analyse and modify parts of a document while ignoring any server-side code or markup and any invalid HTML, reproducing the rest of the document verbatim.

The library distinguishes itself from other HTML parsers by its three major features:

1. No parse tree of the entire document is ever generated. In this sense the toolkit is strictly speaking not a true parser. The document source text is searched only for the markup relevant to the current operation. This allows the toolkit to analyse and modify documents containing JSP, ASP, PHP, incorrect or badly formatted HTML, or any other server or client side code, script, macro or markup. Most other parsers can't handle content that they are not explicitly programmed to accept. ...

Posted by Martin Jericho 2004-02-07