Jericho HTML Parser is a simple but powerful Java library for analysing and manipulating parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
Version 2.4 includes important bug fixes, and introduces the following major features:
1. Licensed under Eclipse Public License (EPL) as well as LGPL.
2. Simple rendering of HTML markup into text.
3. Integrated logging with various logging frameworks.
4. Ability to parse HTML tags containing server tags without the need to explicitly exclude the server tags.
5. Automatic full sequential parse when needed.
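The "automatic full sequential parse when needed" idea in item 5 can be illustrated with a minimal sketch. It does not use the Jericho API: the class LazyTagIndex and all of its methods are hypothetical names invented for this example, and the tag scan is deliberately naive. It only shows the general pattern of a whole-document query triggering a one-time parse on demand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is not Jericho's Source class.
class LazyTagIndex {
    private final String text;
    private List<String> tagNames; // null until the full parse has run
    private int parseCount = 0;

    LazyTagIndex(String text) {
        this.text = text;
    }

    // Scans the whole text once; later calls return immediately.
    void fullParse() {
        if (tagNames != null) {
            return;
        }
        parseCount++;
        List<String> found = new ArrayList<>();
        int pos = 0;
        while ((pos = text.indexOf('<', pos)) != -1) {
            int end = pos + 1;
            while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
                end++;
            }
            // Only start tags have a name directly after '<'; "</" is skipped.
            if (end > pos + 1) {
                found.add(text.substring(pos + 1, end).toLowerCase(Locale.ROOT));
            }
            pos = end;
        }
        tagNames = found;
    }

    // A query that needs the whole document triggers the parse itself.
    List<String> allStartTagNames() {
        fullParse();
        return tagNames;
    }

    // Exposed so callers can check how often the scan actually ran.
    int getParseCount() {
        return parseCount;
    }
}
```

In this sketch the caller never has to remember to call fullParse() first, and calling it explicitly as well costs nothing because the scan runs at most once.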
Changes since version 2.3:
- Released under dual EPL/LGPL licence.
- Bug Fixes:
  - [1583814] Indent method outputs multiple </script> tags
  - [1576991] Bug in ConvertStyleSheets sample program
  - [1597587] various NPEs in findFormFields()
  - [1599700] Segment.findAllStartTags(attributeName...) infinite loop
  - Overlapping elements resulted in some elements being listed as a
    child of more than one parent element.
  - OutputDocument.writeTo(Writer) closed the writer.
- Server tags no longer interfere with parsing of start tag attributes.
- Added Renderer class and Segment.getRenderer() method.
- Added TextExtractor class and Segment.getTextExtractor() method.
- Deprecated Segment.extractText methods.
- Added SourceFormatter class and Source.getSourceFormatter() method.
- Deprecated Source.indent method.
- Added Logger interface along with the related LoggerProvider
  interface and BasicLoggerProvider and WriterLogger classes.
- Added Source.setLogger(Logger) and Source.getLogger() methods.
- Deprecated Source.setLogWriter(Writer) and Source.getLogWriter()
  methods.
- Added Source.findNextElement(int pos, String attributeName,
  String value, boolean valueCaseSensitive) method.
- Added Segment.findAllElements(String attributeName, String value,
  boolean valueCaseSensitive) method.
- Calling the ignoreWhenParsing methods on overlapping segments no
  longer results in an OverlappingOutputSegmentsException.
- Added CharacterReference.getEncodingFilterWriter(Writer) method.
- Added CharacterReference.encode(char) method.
- Added Source.getNewLine() method.
- Added static Config.NewLine parameter.
- All text output now uses Config.NewLine instead of hard-coded '\n'.
- Source.fullSequentialParse() method no longer parses the source again
  if it has already been called.
- Some methods that require the parsing of the entire source now call
  Source.fullSequentialParse() automatically.
- Some changes to the output of various getDebugInfo() methods.
- Added categorised class list in javadoc.
- Removed all methods/constants deprecated in 2.0.
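
Two of the changes above, the configurable Config.NewLine and the fix to OutputDocument.writeTo(Writer) closing the writer, follow a pattern that can be sketched in plain Java. The sketch below does not call Jericho: LineOutput, its newLine field, and CloseTrackingWriter are hypothetical names made up for illustration, and the default separator shown is an assumption rather than a statement about Jericho's default.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

// Hypothetical illustration only; these names are not part of Jericho.
class LineOutput {
    // Configurable line separator, standing in for the idea behind
    // Config.NewLine. Defaulting to the platform separator is an assumption.
    static String newLine = System.lineSeparator();

    // Writes each line followed by the configured separator.
    // The writer is flushed but deliberately left open for the caller.
    static void writeTo(Writer writer, List<String> lines) {
        try {
            for (String line : lines) {
                writer.write(line);
                writer.write(newLine);
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Helper that records whether close() was ever called on it.
class CloseTrackingWriter extends StringWriter {
    boolean closed = false;

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}
```

Leaving the writer open means the caller, who created it, decides when to close it and can keep writing to the same stream after the document has been output.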