I have some untidy XML documents with HTML-like tags that I need to tidy with JTidy. As it's not exactly (X)HTML, I need to use input-xml: yes.
But JTidy's ParserImpl.parseXMLDocument() is hardwired to use mode Lexer.IGNORE_WHITESPACE, which strips significant whitespace. E.g. <p>a <i>b</i> c</p> becomes <p>a <i>b</i>c</p>.
When tidy is parsing regular HTML input (input-xml: no), it leaves internal whitespace intact by using mode Lexer.MIXED_CONTENT for most tags.
So I've added a boolean Configuration option called 'xml-is-mixed' which, when set to yes, parses XML documents using mode Lexer.MIXED_CONTENT. This leaves leading whitespace in place when rendering the output.
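To make the difference between the two modes concrete, here is a toy model in plain Java. It is not JTidy's lexer and does not call any JTidy API; the class name WhitespaceModes and the method render are hypothetical, made up for illustration. It only assumes that the whitespace-ignoring mode drops leading whitespace from each text run, which is what the example above shows, while the mixed-content mode leaves text runs alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: NOT JTidy code. All names here are hypothetical.
final class WhitespaceModes {
    // A token is either a tag (<...>) or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    // mixedContent == false imitates the whitespace-ignoring behaviour,
    // mixedContent == true imitates the mixed-content behaviour.
    static String render(String xml, boolean mixedContent) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(xml);
        while (m.find()) {
            String token = m.group();
            boolean isText = !token.startsWith("<");
            if (isText && !mixedContent) {
                // Drop leading whitespace of the text run (assumed behaviour).
                token = token.replaceFirst("^\\s+", "");
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<p>a <i>b</i> c</p>";
        System.out.println(render(input, false)); // <p>a <i>b</i>c</p>
        System.out.println(render(input, true));  // <p>a <i>b</i> c</p>
    }
}
```

With mixedContent set to false, the space before "c" is lost, as in the report; with it set to true, the input comes back unchanged.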
I will provide a patch attached to this ticket.
Here are the test results of this patch:
mvn -Dtest=JTidyParserBugsTest test
[INFO] Scanning for projects...
[INFO]
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building JTidy 8.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ jtidy ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 6 resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.1:compile (default-compile) @ jtidy ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ jtidy ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 681 resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.1:testCompile (default-testCompile) @ jtidy ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-surefire-plugin:2.5:test (default-test) @ jtidy ---
[INFO] Surefire report directory: /Users/pciuffetti/Documents/Dev/workspace/jtidy-code/jtidy/target/surefire-reports
T E S T S
Running org.w3c.tidy.JTidyParserBugsTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.676 sec
Results :
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.649 s
[INFO] Finished at: 2015-03-06T14:25:27-05:00
[INFO] Final Memory: 7M/245M
[INFO] ------------------------------------------------------------------------