Re: [Simple-support] Invalid unicode characters in XML stream
From: Dawid W. <daw...@gm...> - 2012-09-20 07:07:04
I'm not pressing you to do it ASAP; I realize it's a borderline use case (most people will not be affected by this if they know their Strings contain valid Unicode text, not some junk).

I think the patch to Formatter could be based on the workaround I did with the transformer:

http://goo.gl/8tYjR

What happens is that every "text" to be emitted is first checked (isMappableXmlText) and, if it conforms to the spec, it is passed through. If it's not valid XML text, the offending characters are replaced with the Unicode "replacement character" (typically rendered as a boxed question mark).

Now, for the Formatter this would probably need configurable logic. Structural XML elements (element names, attribute names, namespaces, etc.) should throw exceptions if they're declared with invalid characters inside. As for attribute and text blocks, perhaps this should be configurable -- either an encoding exception (IOException) or a quiet replacement of offending characters, much like in my code above. This would be somewhat consistent with Java's built-in charset encoders.

Again, this is by no means a critical feature -- as you can see, it can be hacked around in a pretty simple way -- just something to think about and consider.

Dawid

On Thu, Sep 20, 2012 at 1:12 AM, Niall Gallagher - Yieldbroker <Nia...@yi...> wrote:
> Hi Dawid,
>
> If you can suggest a simple fix, perhaps in the Formatter, that you think will be 100% XML compliant and make the Persister more resilient, then let me know. I'll add it to the next release. I think a full-on strategy would be great; however, it is something that I probably will not get time to do for a while.
>
> Thanks,
> Niall
>
> -----Original Message-----
> From: Dawid Weiss [mailto:daw...@gm...]
> Sent: Wednesday, 19 September 2012 4:51 PM
> To: sim...@li...
> Subject: [Simple-support] Invalid unicode characters in XML stream
>
> Niall, you asked if this is a problem in practice -- see this build log, for example:
>
> http://jenkins.sd-datasolutions.de/job/Lucene-Solr-4.x-Linux/1199/consoleText
>
> it failed for exactly the reasons I mentioned -- invalid unicode character in a throwable's message, serialized via simple-xml's model for ANT's junit files. I think it should be fixed at some point and I think it should be the default behavior to always produce valid XML, no matter what the performance penalty might be.
>
> Perhaps Formatter should be a replaceable strategy? Doesn't seem like there are that many options there, but it could be fun too -- think of a binary-XML formatter :) Just a thought.
>
> Dawid
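
For readers of the archive, the check-then-replace idea described in the reply above, together with the suggested choice between an IOException and quiet replacement, could be sketched roughly as follows. This is not the code behind the linked workaround and not part of Simple's API: the class name XmlTextSanitizer, the OnInvalid enum and the sanitize method are hypothetical names made up for illustration. The validity test follows the Char production of the XML 1.0 specification, and the two policies are loosely modeled on java.nio.charset.CodingErrorAction (REPORT and REPLACE).

```java
import java.io.IOException;

// Hypothetical helper, for illustration only; not part of Simple XML.
final class XmlTextSanitizer {

    // Hypothetical policy names, loosely modeled on CodingErrorAction.
    enum OnInvalid { REPORT, REPLACE }

    // Unicode replacement character, U+FFFD.
    private static final char REPLACEMENT = (char) 0xFFFD;

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the text unchanged if it is valid XML text. Otherwise either
    // throws (REPORT) or substitutes U+FFFD for each offending code point
    // (REPLACE). Unpaired surrogates are treated as invalid.
    static String sanitize(String text, OnInvalid policy) throws IOException {
        StringBuilder out = null; // allocated only once an invalid char is seen
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else {
                if (policy == OnInvalid.REPORT) {
                    throw new IOException(String.format(
                        "Invalid XML character U+%04X at index %d", cp, i));
                }
                if (out == null) {
                    // Copy the valid prefix seen so far.
                    out = new StringBuilder(text.length()).append(text, 0, i);
                }
                out.append(REPLACEMENT);
            }
            i += Character.charCount(cp);
        }
        return out == null ? text : out.toString();
    }
}
```

Under this sketch, attribute values and text blocks would go through sanitize with whichever policy is configured, while element and attribute names would always use REPORT, since the message argues that structural names should fail loudly rather than be silently altered.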