
Arabica XML and HTML Toolkit for C++

Project News for Arabica XML and HTML Toolkit for C++

  • Arabica October 2008 Release

    The "Probably long overdue release" bringing a big chunk of new functionality.

    Source tar.bz2
    http://downloads.sourceforge.net/arabica/arabica-2008-october.tar.bz2

    Source tar.gz
    http://downloads.sourceforge.net/arabica/arabica-2008-october.tar.gz

    Source zip
    http://downloads.sourceforge.net/arabica/arabica-2008-october.zip

    Exciting New Stuff

    The exciting new stuff is Taggle, a port of John Cowan's rather super TagSoup package.

    TagSoup, if you're not familiar with it, is

    a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.

    Obviously, if you have a SAX parser you can apply all your standard XML techniques - not only SAX filters, but building a DOM, applying XPaths, or XSLT transformations as well.

    Cowan describes what TagSoup does as

    TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.

    The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:

    This is <B>bold, <I>bold italic, </b>italic, </i>normal text

    gets correctly rewritten as:

    This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.

    Looks straightforward, doesn't it? Well, that's a simple example and it's still a tricky and awkward result in practice. Cowan's patience in pursuing this, and what looks like a rather elegant solution, are to be applauded. Porting his code to C++ was quick and painless, and Taggle is a useful addition to Arabica. Thanks, John.
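
    To give a flavour of the restart behaviour in the bold/italic example above, here is a much-reduced sketch. It is emphatically not TagSoup's or Taggle's algorithm, which is schema-driven and far more thorough. It assumes bare <name> and </name> tags only, and restart_overlapping_tags is a made-up name for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Toy illustration only: no attributes, comments, or entities are handled.
// The function name and approach are hypothetical, not part of Arabica.
std::string restart_overlapping_tags(const std::string& html) {
  std::vector<std::string> open;  // currently open elements, outermost first
  std::string out;
  std::size_t pos = 0;
  while (pos < html.size()) {
    if (html[pos] != '<') {  // ordinary text is copied straight through
      out += html[pos++];
      continue;
    }
    const std::size_t end = html.find('>', pos);
    if (end == std::string::npos) {  // stray '<': treat the rest as text
      out += html.substr(pos);
      break;
    }
    std::string name = html.substr(pos + 1, end - pos - 1);
    pos = end + 1;
    const bool closing = !name.empty() && name[0] == '/';
    if (closing) name.erase(0, 1);
    std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
      return static_cast<char>(std::tolower(c));
    });
    if (!closing) {
      open.push_back(name);
      out += "<" + name + ">";
      continue;
    }
    const auto match = std::find(open.rbegin(), open.rend(), name);
    if (match == open.rend()) continue;  // unmatched close tag: drop it
    // Close everything opened after the matching element, innermost first.
    std::vector<std::string> restart;
    while (open.back() != name) {
      assert(!open.empty());  // a match was found, so the stack cannot run dry
      out += "</" + open.back() + ">";
      restart.push_back(open.back());
      open.pop_back();
    }
    out += "</" + name + ">";
    open.pop_back();
    // Reopen the elements that were cut short, in their original order.
    for (auto it = restart.rbegin(); it != restart.rend(); ++it) {
      out += "<" + *it + ">";
      open.push_back(*it);
    }
  }
  // Close anything still open at the end of the input.
  for (auto it = open.rbegin(); it != open.rend(); ++it) out += "</" + *it + ">";
  return out;
}
```

    Fed the overlapping input from the example, this sketch produces the same nesting as the rewritten line: the <i> is closed before </b> and reopened straight after it.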

    Arabica Taggle chews through HTML, providing the same SAX XMLReader interface as the XML parser, and can be used in exactly the same way. HTML source can be fed through SAX filter stacks, used to build DOM trees, queried with XPath, or transformed using XSLT.

    Changes and Bug Fixes

    There are, of course, many other fixes and changes. Most are relatively minor, and if you haven't been bitten by them you won't notice. The most significant changes are in Arabica's XSLT engine, Mangle. While still not feature complete and under development, it takes, in this release, a fairly big step forward.

    SAX

    * Fixed AttributesImpl.getIndex. Thanks to Isak Johnsson for that, and what on earth was I thinking to me
    * Return attribute type as "CDATA" not the empty string
    * After all this time, realised I had too many template parameters on XMLReaderInterface. It only needs the string_type and string_adaptor. Any additional parameters are only of interest to the implementing parser class

    DOM

    * Output DocumentFragment properly
    * Output <elem/> for empty elements
    * Slipped a TextCoalescer filter into the DOM builder, so that consecutive bits of text get applied to a single Text or CDATA node, rather than as a series of nodes. (A series of nodes is perfectly legal, it's just slightly unexpected. Even to me, and I work with DOMs a lot :)
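
    The text-coalescing idea in the last item can be sketched roughly as follows. This uses a made-up, much-reduced event model (Event, coalesce_text) rather than Arabica's SAX filter interfaces, so treat it as an illustration of the technique and not the TextCoalescer code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, much-reduced event model; not Arabica's SAX interfaces.
struct Event {
  enum Kind { StartElement, EndElement, Characters } kind;
  std::string data;  // element name, or character data
};

// Merge each run of consecutive Characters events into a single event.
std::vector<Event> coalesce_text(const std::vector<Event>& events) {
  std::vector<Event> merged;
  for (const Event& e : events) {
    if (e.kind == Event::Characters && !merged.empty() &&
        merged.back().kind == Event::Characters) {
      merged.back().data += e.data;  // extend the pending text run
    } else {
      merged.push_back(e);
    }
  }
  return merged;
}

// Usage: three character callbacks inside <p> collapse into one.
void coalesce_example() {
  const std::vector<Event> in = {{Event::StartElement, "p"},
                                 {Event::Characters, "Hello, "},
                                 {Event::Characters, "wor"},
                                 {Event::Characters, "ld"},
                                 {Event::EndElement, "p"}};
  const std::vector<Event> out = coalesce_text(in);
  assert(out.size() == 3);
  assert(out[1].data == "Hello, world");
}
```

    A DOM builder sitting downstream of something like this sees one block of text per run, and so creates one node.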

    XPath

    * Some time ago, it was gently suggested to me that XPathValuePtr and XPathExpressionPtr both exposed implementation details and provided an interface that was inconsistent with the DOM classes, because you accessed the member functions via -> rather than . At the time, I was just pleased to have got the XPath stuff done and wasn't really fussed, so I left it. Since then though, it's niggled and niggled away at the back of my mind and now I've done something about it. XPathValuePtr has become XPathValue and XPathExpressionPtr has become XPathExpression, with the member functions accessed through the . operator. The XPathValuePtr and XPathExpressionPtr name and -> member access are retained for the meantime, so that existing code won't be broken. Existing code using XPathValuePtr will still work, but new stuff should use XPathValue
    * Correctly implemented Namespace Nodes. The XPath data model requires that namespace nodes are associated with an element, and sort ahead of attribute nodes in document order. Until now, Arabica's namespace node had no parent, or owner document and so was failing these requirements
    * The default namespace is included when constructing namespace nodes
    * Amazingly, the XPath prefix:* didn't compile. I had no test for it, and had overlooked it. Now I do, and it does
    * Unbound namespace prefixes throw an exception
    * Corrected text() test to match CDATA nodes as well as text nodes
    * XPaths are now evaluated as if the DOM had been normalised, even if it hasn't. That is, consecutive text nodes are treated as a single node
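
    For the curious, one way to move from -> access to . access while keeping old code compiling is sketched below. The names (ValueImpl, Value, ValuePtr) are stand-ins invented for this example, and this is an assumption about the general technique, not a copy of how XPathValue is actually written.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins, not Arabica's classes.
class ValueImpl {
 public:
  explicit ValueImpl(std::string s) : text_(std::move(s)) {}
  const std::string& asString() const { return text_; }

 private:
  std::string text_;
};

// Handle with value semantics: callers write v.asString().
class Value {
 public:
  explicit Value(std::string s)
      : impl_(std::make_shared<ValueImpl>(std::move(s))) {}
  const std::string& asString() const { return impl_->asString(); }
  // Kept so that old call sites written as v->asString() still compile.
  const Value* operator->() const { return this; }

 private:
  std::shared_ptr<const ValueImpl> impl_;  // implementation detail stays hidden
};

// Old name retained as an alias so existing code is not broken.
typedef Value ValuePtr;

// Both spellings reach the same member function.
void check_both_styles() {
  Value current("42");
  ValuePtr legacy("42");
  assert(current.asString() == legacy->asString());
}
```

    The shared pointer is still there, but it is no longer part of the interface that callers see.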

    XSLT

    * Params are not passed on through an xsl:apply-imports call
    * Template names are now QNames
    * Template modes are now QNames
    * In XPath node() matches any node of any type. In an XSLT match pattern, node() matches everything except attributes and the document root node. Fixed.
    * Fixed variable scoping in xsl:for-each, xsl:if, and xsl:choose
    * Escape naughty text when outputting processing instructions and comments (eg ---)
    * Use std::stable_sort instead of std::sort. When xsl:sort specifies a numerical sort, but you've got some string data in there we need to maintain the relative positions of that string data. This is the first time I can recall actually using std::stable_sort. I will mark it down in my big book of programming accomplishments.
    * Fixed local-name for namespace nodes
    * xsl:message can contain another xsl:message - now handled properly
    * Empty comments output correctly
    * Ensure xsl:choose has at least one xsl:when
    * Make sure any xsl:template mode attribute is not empty
    * Verify xsl:sort attribute values
    * xsl:call-template now throws if it can't find a matching template
    * Duplicate variable and parameter names are rejected
    * Disallowed current() in match patterns
    * Verify xsl:for-each selects a node-set
    * Disallow pcdata ahead of an xsl:param
    * xsl:stylesheet now allows top-level elements when they are in a foreign namespace
    * Implemented position(), last() and positional predicates in match patterns
    * Throw error if transform is run with no input
    * Verify QNames at transform compile time
    * Detect circular variable references
    * Reject variables and parameters which have both a select attribute and text content
    * Top level variables and parameters handled according to import precedence
    * Fixed internal QName resolution - unprefixed names are not in the default namespace
    * Fixed xsl:element unprefixed names - when no namespace URI is supplied, they are in the default namespace
    * Don't suppress output of element namespace prefixes or attributes which are in the XSL namespace
    * Ensure @xmlns|@xmlns:* selects no nodes
    * Direct information messages to std::cerr, not std::cout
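
    The std::stable_sort item above deserves a small illustration. The sketch below assumes sort keys arrive as strings, that anything which doesn't parse as a number counts as "not a number", and that those keys sort ahead of the real numbers. The helper names (to_number, numeric_less, numeric_sort) are invented for the example and are not Mangle's.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: true only if the whole string parses as a number.
bool to_number(const std::string& s, double& value) {
  if (s.empty()) return false;
  char* end = nullptr;
  value = std::strtod(s.c_str(), &end);
  return end == s.c_str() + s.size() && !std::isnan(value);
}

// Non-numeric keys order before numeric ones and are all equivalent to each
// other, so only a stable sort keeps them in their original relative order.
bool numeric_less(const std::string& lhs, const std::string& rhs) {
  double l = 0, r = 0;
  const bool l_ok = to_number(lhs, l);
  const bool r_ok = to_number(rhs, r);
  if (l_ok && r_ok) return l < r;
  return !l_ok && r_ok;
}

std::vector<std::string> numeric_sort(std::vector<std::string> keys) {
  std::stable_sort(keys.begin(), keys.end(), numeric_less);
  assert(std::is_sorted(keys.begin(), keys.end(), numeric_less));
  return keys;
}
```

    Sorting "10", "pear", "2", "apple", "1" this way gives "pear", "apple", "1", "2", "10". With std::sort the two strings would still come first, but nothing would guarantee that "pear" stays ahead of "apple".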

    Build and installation

    * Fix for problem installing headers on FreeBSD, where install doesn't understand -D
    * Changes to help out-of-tree builds
    * Added build files for Visual Studio 2008
    * Added configure tests for std::mbstate_t and/or mbstate_t. Some platforms don't have it (VxWorks, for example)
    * Visual Studio 2005 and 2003 project files are now munged from the Visual Studio 2008 files. (Don't try this at home, folks)

    Other bits and bobs

    * Fix for base URIs with leading ../
    * Convert \ to / for relative paths as well as absolute Windows paths.

    2008-10-19 19:40:35 UTC by jez_higgins

  • Arabica October 2007 Release

    This release fixes a build problem with older versions of GCC.

    2007-10-02 12:30:43 UTC by jez_higgins

  • Arabica September 2007 Release 2

    This is a re-release of the September 2007 release which fixes a couple of build issues, affecting some
    platform/parser combinations.

    The September 2007 release notes were:

    The "certainly-break-your-build-but-it'll-be-easily-sorted-out" release.

    This is the first Arabica release ever that knowingly breaks existing code,
    but the changes required are all straightforward and shouldn't take more
    than a few minutes to recover from.

    The changes are

    a) All Arabica header files now have a .hpp extension. Existing references
    to something.h will need to be updated (or mitigated by adding a forwarding
    header).

    b) The SAX namespace has been moved within the Arabica namespace. References
    to SAX::something will need to be changed to Arabica::SAX::something, or
    mitigated by a using declaration.

    c) The DOM namespace and associated namespaces, like SimpleDOM, have been moved
    within the Arabica namespace. References to DOM::something will need to be
    changed to Arabica::DOM::something, or mitigated by a using declaration.

    d) SAX classes named basic_something have been renamed something. Related typedefs
    along the lines of typedef basic_something<string> something; have been removed.
    References to SAX::something will need to be changed to SAX::something<std::string>,
    or mitigated by adding your own typedef.

    e) All SAX and DOM classes now take both string and string adaptor template
    parameters. This change should be transparent and require no changes.

    f) Some header files in the Utils/ subdirectory have been moved:
    Utils/uri.hpp -> io/uri.hpp
    Utils/socket_stream.hpp -> io/socket_stream.hpp
    Utils/convert_adaptor.hpp -> io/convert_adaptor.hpp
    Utils/convertstream.hpp -> io/convertstream.hpp
    Utils/*codecvt.hpp -> convert/*codecvt.hpp
    Utils/normalize_whitespace.hpp -> text/normalize_whitespace.hpp
    XML/UnicodeCharacters.hpp -> text/UnicodeCharacters.hpp
    Utils/StringAdaptor.hpp -> Arabica/StringAdaptor.hpp
    DOM/Utils/Stream.hpp -> DOM/io/Stream.hpp

    There are some namespace changes along with these physical changes. Any class in
    Arabica::Utils has been moved into Arabica::io or Arabica::convert.
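
    As a rough picture of the mitigations mentioned in (b) and (d), here is a self-contained sketch. The Arabica::SAX::something template below is a placeholder standing in for a real Arabica class (the real ones also take a string adaptor parameter), and my_something is an invented name. A namespace alias is used here; a using directive for the Arabica namespace would have a similar effect.

```cpp
#include <cassert>
#include <string>

// Stand-in for the new layout; this is NOT a real Arabica header.
namespace Arabica {
namespace SAX {
template <class string_type>
class something {  // placeholder class template, for illustration only
 public:
  explicit something(const string_type& id) : id_(id) {}
  const string_type& id() const { return id_; }

 private:
  string_type id_;
};
}  // namespace SAX
}  // namespace Arabica

// Mitigation for (b): old SAX::something spellings resolve again.
namespace SAX = Arabica::SAX;

// Mitigation for (d): supply your own typedef for the std::string flavour.
typedef Arabica::SAX::something<std::string> my_something;

// Old-style and new-style spellings name the same type.
void check_spellings() {
  SAX::something<std::string> via_alias("doc.xml");
  my_something via_typedef("doc.xml");
  assert(via_alias.id() == via_typedef.id());
}
```

    In real code the alias or typedef would live in one shared header, so the rest of the code base can be updated at leisure.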

    2007-09-26 17:16:39 UTC by jez_higgins

  • Arabica September 2007 Release

    The "certainly-break-your-build-but-it'll-be-easily-sorted-out" release.

    This is the first Arabica release ever that knowingly breaks existing code,
    but the changes required are all straightforward and shouldn't take more
    than a few minutes to recover from.

    The changes are

    a) All Arabica header files now have a .hpp extension. Existing references
    to something.h will need to be updated (or mitigated by adding a forwarding
    header).

    b) The SAX namespace has been moved within the Arabica namespace. References
    to SAX::something will need to be changed to Arabica::SAX::something, or
    mitigated by a using declaration.

    c) The DOM namespace and associated namespaces, like SimpleDOM, have been moved
    within the Arabica namespace. References to DOM::something will need to be
    changed to Arabica::DOM::something, or mitigated by a using declaration.

    d) SAX classes named basic_something have been renamed something. Related typedefs
    along the lines of typedef basic_something<string> something; have been removed.
    References to SAX::something will need to be changed to SAX::something<std::string>,
    or mitigated by adding your own typedef.

    e) All SAX and DOM classes now take both string and string adaptor template
    parameters. This change should be transparent and require no changes.

    f) Some header files in the Utils/ subdirectory have been moved:
    Utils/uri.hpp -> io/uri.hpp
    Utils/socket_stream.hpp -> io/socket_stream.hpp
    Utils/convert_adaptor.hpp -> io/convert_adaptor.hpp
    Utils/convertstream.hpp -> io/convertstream.hpp
    Utils/*codecvt.hpp -> convert/*codecvt.hpp
    Utils/normalize_whitespace.hpp -> text/normalize_whitespace.hpp
    XML/UnicodeCharacters.hpp -> text/UnicodeCharacters.hpp
    Utils/StringAdaptor.hpp -> Arabica/StringAdaptor.hpp
    DOM/Utils/Stream.hpp -> DOM/io/Stream.hpp

    There are some namespace changes along with these physical changes. Any class in
    Arabica::Utils has been moved into Arabica::io or Arabica::convert.

    2007-09-19 11:39:32 UTC by jez_higgins

  • Arabica August 2007 Release

    Here's the latest in what's becoming the traditional August Arabica release. It packages a number of incremental improvements, together with a major chunk of new code.

    * Code
      o This release includes the first release of Mangle, the Arabica XSLT engine. Still actively under development, Mangle passes about 85% of the OASIS XSLT conformance test suite and covers most common cases. The Mangle code, in the Arabica::XSLT namespace, should be regarded as alpha quality.
      o There are a number of new SAX filters for whitespace stripping, tracking namespace declarations, tracking xml:base, and buffering multiple characters(...) callbacks into a single callback.
    * Build
      o Further improvements to the Autotools build. The test cases can now be built and run using 'make check'. Wide string detection has been further tweaked, as has finding libxml2. Thanks to Bob Wilkinson for that.
      o Solution and project files for Visual Studio 2005 are now included.
      o Visual Studio builds now produce distinct debug and release versions of the library. Thanks to Timo Geusch and David Grigsby who separately suggested that.

    This release has been built on a variety of platforms. Additional build reports are very welcome, particularly for non-i386 platforms and/or non-GCC compilers.

    2007-08-31 12:14:25 UTC by jez_higgins