
ShaniXmlParser

File Release Notes and Changelog

Release Name: ShaniXmlParser v1.4.14

Notes:
ShaniXmlParser v1.4.14 (19/04/2007)
----------------------

ShaniXmlParser is a non-validating xml/html DOM/SAX parser.

It can parse badly formed xml files: for example, files with inverted
tags or badly escaped &, < and >. It expands all entities (if a doctype
is present or auto doctype is set).

A css parser is included.

A dtd parser is included. Entities are decoded from the dtd, if any.
If there is no dtd and the attribute AUTO_DOCTYPE is set on the factory, entity
replacement falls back to the internal entity set
(equal to the xhtml 1.0 entity set).
The dtd parser parses entity, element, attlist and notation declarations. It
generates regexps to check the validity of the document.

The dom parser switches directly to html mode if either of the following conditions is met:
- The root node is <html>
- A w3c HTML/XHTML DTD is linked with the document
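
The two conditions above can be illustrated with plain DOM calls. This is only a sketch of the rule as stated, not the parser's actual code: the helper names looksLikeHtml and newDoc are hypothetical, and the case-insensitive root comparison and the public identifier prefixes are assumptions.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;

final class HtmlModeSketch {
    /** Hypothetical check mirroring the two conditions listed above. */
    static boolean looksLikeHtml(Document doc) {
        Element root = doc.getDocumentElement();
        // Condition 1: the root node is <html> (case-insensitive match is an assumption).
        if (root != null && "html".equalsIgnoreCase(root.getNodeName())) {
            return true;
        }
        // Condition 2: a W3C HTML/XHTML DTD is linked (these prefixes are an assumption).
        DocumentType doctype = doc.getDoctype();
        String publicId = doctype == null ? null : doctype.getPublicId();
        return publicId != null
                && (publicId.startsWith("-//W3C//DTD HTML")
                        || publicId.startsWith("-//W3C//DTD XHTML"));
    }

    /** Builds an in-memory document so that no external DTD has to be fetched. */
    static Document newDoc(String rootName, String publicId, String systemId) {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().getDOMImplementation();
            DocumentType doctype = publicId == null
                    ? null
                    : impl.createDocumentType(rootName, publicId, systemId);
            return impl.createDocument(null, rootName, doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtml(newDoc("html", null, null)));
        System.out.println(looksLikeHtml(newDoc("note", null, null)));
    }
}
```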

The parser passes the DOM validation suites as follows: DOM 1 (100%), DOM 2 (100%), DOM 3 (90%).

To use it:
// The following line is not needed if you use the jar that includes the services files in META-INF
System.setProperty(
	"javax.xml.parsers.DocumentBuilderFactory",
	"org.allcolor.xml.parser.CDocumentBuilderFactory"
);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// If you want auto doctype, do the following:
// factory.setAttribute("org.allcolor.doctype.auto","");
DocumentBuilder build = factory.newDocumentBuilder();
build.parse(...);
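
For reference, a complete minimal program along these lines might look as follows. It is a sketch under assumptions: the class and method names are made up for illustration, and it relies only on the standard JAXP API, so it also runs with the JDK's default parser when the ShaniXmlParser jar is not on the classpath. The attribute name comes from the notes above; other implementations typically reject it, which is why the call is guarded.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DomParseSketch {
    /** Parses the given xml text and returns the name of its root element. */
    static String rootName(String xml) {
        try {
            // With the ShaniXmlParser jar (and its META-INF services) on the
            // classpath, newInstance() should pick up its factory; otherwise
            // the JDK default implementation is used.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            try {
                // Auto doctype, as described in the notes above.
                factory.setAttribute("org.allcolor.doctype.auto", "");
            } catch (IllegalArgumentException ignored) {
                // The active implementation does not know this attribute.
            }
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("<greeting lang=\"en\">hello</greeting>"));
    }
}
```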

To enable XInclude support, call factory.setXIncludeAware(true) before creating the
DocumentBuilder.
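
A small end-to-end sketch of XInclude with the standard JAXP API is shown here. The file names and the class name are hypothetical, and enabling namespace awareness is an assumption that holds for the JDK's default parser (XInclude elements are recognised by their namespace); ShaniXmlParser may not require it.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XIncludeSketch {
    /** Writes two small files, parses the outer one, and counts <part> elements. */
    static int countIncludedParts() {
        try {
            Path dir = Files.createTempDirectory("xinclude-sketch");
            Files.write(dir.resolve("part.xml"),
                    "<part>included text</part>".getBytes(StandardCharsets.UTF_8));
            Path main = dir.resolve("main.xml");
            Files.write(main,
                    ("<doc xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                            + "<xi:include href=\"part.xml\"/></doc>")
                            .getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // assumption: needed to recognise xi:include
            factory.setXIncludeAware(true);  // must be set before newDocumentBuilder()
            DocumentBuilder builder = factory.newDocumentBuilder();

            // Parsing from a File gives the parser a base URI for the relative href.
            Document doc = builder.parse(main.toFile());
            return doc.getElementsByTagName("part").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XInclude sketch failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("included <part> elements: " + countIncludedParts());
    }
}
```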

A SAX parser is also provided (XInclude is on by default).
To use it:
System.setProperty(
	"org.xml.sax.driver",
	"org.allcolor.xml.parser.CShaniSaxParser"
);
XMLReader xr = XMLReaderFactory.createXMLReader();
// MySAXApp is your own handler class (for example, a subclass of DefaultHandler)
MySAXApp handler = new MySAXApp();
xr.setContentHandler(handler);
xr.setErrorHandler(handler);
xr.parse(...);

Or with jaxp:

// The following line is not needed if you use the jar that includes the services files in META-INF
System.setProperty(
	"javax.xml.parsers.SAXParserFactory",
	"org.allcolor.xml.parser.CSaxParserFactory"
);
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
parser.parse(...);
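
As an illustration of the jaxp route, here is a minimal sketch with a concrete handler. The handler class ElementCounter and the helper countElements are hypothetical names; only standard JAXP and SAX classes are used, so the code runs with whichever SAX implementation the factory lookup finds.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxCountSketch {
    /** Handler that counts every startElement event it receives. */
    static final class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    /** Parses the given xml text and returns the number of elements seen. */
    static int countElements(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            ElementCounter handler = new ElementCounter();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<a><b/><c>text</c></a>"));
    }
}
```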

Enjoy,

do not hesitate to contact me and/or to file bug/feature reports.
Quentin Anciaux
<quentin.anciaux@advalvas.be>

This software is distributed under the terms of the GNU LGPL. See the license.txt file.


Changes:
ShaniXmlParser v1.4.14
---------------------

19/04/2007 v1.4.14
- The html parser no longer removes non-html tags in an html document.

16/04/2007 v1.4.13
- Faster parsing of documents without namespace.

15/04/2007 v1.4.12
- code cleanup.
- faster parsing.

14/04/2007 v1.4.11
- code cleanup.
- fixed an infinite loop in the css parser.

10/04/2007 v1.4.10
- internal version

07/04/2007 v1.4.9
- Correctly parse empty tags with no attributes and no space between the tag name and the '/'.

03/04/2007 v1.4.8
- Corrected the duplication of a character following ']' in a CDATA section.

11/08/2006 v1.4.7
- Corrected the method isValidChild for html documents in the Dom handler.
- use lastIndexOf('/') instead of lastIndexOf('/',0), which is bogus, in the getHTMLCss method on the Dom document.

07/08/2006 v1.4.6
- Speed improvement.

03/08/2006 v1.4.5
- again compatible with 1.4+ vm.
- Speed improvement.

30/07/2006 v1.4.4
- rewritten parse method.
- Speed improvement.
- Use 1.5 classes, will run only on 1.5+ vm

25/07/2006 v1.4.3
- attributes parsing rewritten.
- incorrect handling of overloaded namespace is corrected.
- Speed improvement.

21/07/2006 v1.4.2
- Corrected incorrect empty tag parsing.
- Speed improvement.

19/07/2006 v1.4.1
- Memory usage improvement.
- Speed improvement.

08/07/2006 v1.4
- Memory usage improvement.
- Removed document SoftReference cache.
- 649/722 succeeded tests on DOM 3 Core Test Validation suite.
- 282/282 succeeded tests on DOM 2 Core Test Validation suite.
- 527/527 succeeded tests on DOM 1 Core Test Validation suite.

05/07/2006 v1.3.8
- Huge memory usage improvement.
- Faster css styler.

06/05/2006 v1.3.7
- Corrected bug 1480913 : attributes parsing is incorrect if entry contains \n and \r end of line.

24/03/2006 v1.3.6
- Corrected an infinite loop in the dtd parser which appears on some special dtd.
- Corrected the getReader method which did not fill its buffer entirely and failed to determine the charset correctly for html files.

08/03/2006 v1.3.5
- Corrected a bug where hexadecimal entity references did not have a text node as child.
- Corrected bad toString of internal entity &quot;.

01/03/2006 v1.3.4
- Corrected a bug where numerical entity references did not have a text node as child.

24/02/2006 v1.3.3
- Corrected bug [1435119] Illustration entities are parsed while they shouldn't be.

22/01/2006 v1.3.2
- implemented a softreference cache for DTD and Document.
- Parser is 4 times faster on DOM validation suite than v1.3.1
- Corrected a ClassCastException occurring while trying to import/adopt a node from another implementation.
- Full support for java cloneable on any Node.

19/01/2006 v1.3.1
- A bug affecting the '%' dtd entities replacement has been corrected.
- Corrected bug [1409776] Parser Loops
- A bug affecting the parsing of an attribute of one character length has been corrected.
- A new attribute on the DocumentBuilderFactory (org.allcolor.doctype.auto) makes it possible to default the doctype to xhtml transitional 1.0 (entities set) if no doctype is found in the document.

16/01/2006 v1.3
- 90% of DOM 3 Test Validation suite is passed. (644/722 succeeded tests)

13/01/2006 v1.2.14
- 71% of DOM 3 Test Validation suite is passed. (515/722 succeeded tests)

07/01/2006 v1.2.13
- corrected some minor bugs in the dtd parsing.
- 50% of DOM 3 Test Validation suite is passed.
- xml schema are parsed (but not interpreted for the moment) (removed in > 1.3)
- corrected a problem which prevented the serialization of the document.

01/01/2006 v1.2.12
- Happy new year
- All the 282 tests of the DOM 2 Test Validation Suite (dom2-core-tests-20040405.jar) are now passed.

28/12/2005 v1.2.11
- getElementsByTagName was sometimes returning self in the list.
- Pass all the 527 tests of the DOM 1 Test Validation Suite (dom1-core-tests-20040405.jar)

29/11/2005 v1.2.10
- Correction of the parseNotation method which failed to extract the publicid from a Notation.

09/10/2005 v1.2.9
- Correction in the css selector parser.
- Correction in the handling of dtd containing a mix of relative and absolute uri.

02/10/2005 v1.2.8
- Corrected the method getStyle of the CStyler class which was ignoring the pseudoElement given.
- getNotationName on CEntity was returning the value of getNodeName instead of null or the value of NDATA in the entity. (Thanks to Jarle H. Næss)
- added parsing of dtd NOTATION element.

19/09/2005 v1.2.7
- Corrected a NullPointerException when doing toString on a cloned Attr node.
- Corrected the misbehavior of the method getElementsByTagNameNS.
- Corrected a bug in CSSStyleRule parse method.
- Implemented the method getStyleSheet on the Document node.
- Corrected the method removeComment in the CSSParser because it could throw an OOM.
- Corrected the method toString of CHtmlElement which did not escape attribute values.

10/09/2005 v1.2.6
- A NullPointerException was thrown when parsing an InputSource with only systemId set.

02/09/2005 v1.2.5
- fallback support added for XInclude in the DOM and SAX parser.
- corrected a bug in the cloneNode method which didn't copy entity reference nodes, which threw a NullPointerException.
- My daughter is one year old since the 31/08 ;)

28/08/2005 v1.2.4
- Primary support for XInclude in the SAX parser (no fallback, no xpointer).
- Entities which refer to an xml file are now parsed and included in the DOM tree and SAX events.
- Corrected a bug in the DTDParser which appears when a local DTD references distant entities.

22/08/2005 v1.2.3
- Primary support for XInclude in the DOM parser (no fallback, no xpointer)
- The DocType node was incorrectly returning null in the getInternalSubset. (Thanks to Jarle H. Næss)
- The xml PI was incorrectly in the DOM tree. (Thanks to Jarle H. Næss)
- When setExpandEntity is false, EntityReference nodes will have no children text nodes anymore. (Thanks to Jarle H. Næss)
- '\n\r' in text nodes are converted as '\n'

20/08/2005 v1.2.2
- Corrected a bug where the entities in an internal DTD were not parsed.
- Support of setExpandEntityReferences, setCoalescing, setIgnoringComments, setIgnoringElementContentWhitespace of the DocumentBuilderFactory.
- EntityReference nodes are now correctly in the DOM tree.
- Corrected a bug where the decoding of the entities in text nodes was done twice.

07/08/2005 v1.2.1
- corrected a bug in the SAX parser where the character event was sometimes called multiple times.

03/08/2005 v1.2
- corrected a bug in CXmlParser in getAttributes method. (Thanks to Tom Fennelly)
- Added two parser features on by default (Thanks to Tom Fennelly) :
  - http://www.allcolor.org/xml/decodeentities/ : Entity Reference decoding was on by default. This feature allows it to be turned off.
  - http://www.allcolor.org/xml/removedoublequotes/ : CXmlParser was performing a replace on all double quote characters in attribute values. This feature allows it to be turned off.
- The EntityResolver and the DTDHandler of the SAX parser are called if set.
- The methods startDTD, endDTD, notationDecl, startEntity, endEntity of SAX DefaultHandler2 are now called.
- The CSS parser is linked with the HTML parser (you can cast in styled document and styled element).
- The DTD if not xhtml/html was not always parsed, corrected this big problem.

29/03/2005 v1.1
- improved speed by using specialised tokenizer.

28/03/2005 v1.0
- improved speed of the SAX parser (runs as fast as crimson)
- improved speed of the DOM parser (runs 2 times faster than crimson)
- corrected bug in attributes namespace handling by the SAX parser.
- The xhtml transitional and frameset DTD are now included in serialized form and not downloaded anymore from the web. This improves startup time a lot.

09/03/2005 v0.5.0
- This release adds major improvement in parsing speed.

05/03/2005 v0.4.6
- corrected a bug in the entity decoding module when no doctype and entry finished by '%'
- faster parsing of the dom and the sax parser

22/02/2005 v0.4.5
- corrected a bug in getAttributeNodeNS
- corrected wrong behavior of DocumentFragment Node.
- compile the mono version with new IKVM.GNU.Classpath.dll (new character encoder/decoder)

06/01/2005 v0.4.4
- Happy new year!
- Corrected Namespace behavior which was incorrect in the DOM and SAX parser.
- Entities are read from the DTD (if any)

06/12/2004 v0.4.3
- Thank you, great St. Nicholas ;)
- Corrected a serious bug in the html parser (elements disappearing).

04/12/2004 v0.4.2
- corrected a bug in the DTD parser replace entity method. (end '%' and '&' were incorrectly removed)

28/11/2004 v0.4.1
- the dtd parser is finally linked with the xml parser.
- entities are read from the dtd if one is found.
- 2 example apps are given (also compiled for mono).

26/11/2004 v0.4
- corrected a lot of bugs :)
- remove gnu.regexp (was too limited, and jdk 1.4 has very good support for it)
- refactored the code.
- a base class XmlParser does the parse and sends events to a registered handler. As such there is now also a SAX interface.
- provides the xml api (org.w3c.dom.* & org.xml.sax.*)
- also the parser is compiled with ikvm for mono/.net
- a css parser is included.
- a dtd parser is included.
- the jaxp api is also included.

02/10/2004 v0.3.1
- corrected a bug in the checkAhead method.
- Parse directly the stream via a ReaderTokenizer.
- The parsing is faster.
- Use gnu.regexp package for the dtd parser.