[Simple-support] Valid XML characters and String recoding
From: Dawid W. <daw...@gm...> - 2012-09-13 08:46:50
Hi Niall, everyone.

I'd like to bring up an issue I ran into in practice. We have input data that we serialize via simple-xml to external XML files. Everything works like a charm except when the data model contains String objects with characters that are not allowed in XML. We don't have any control over these strings and need to make sure the produced XML is always valid (parseable).

The spec (and Wikipedia) say the following:

Unicode code points in the following ranges are valid in XML 1.0 documents:
- U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
- U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
- U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

Unfortunately, at the moment characters outside of these ranges are passed to the XML stream writer as-is, which results in invalid XML. There are no "simple" solutions, but there are workarounds. I used a Transform<String> to recode all the strings that had invalid characters before serialization. This seems like a sensible thing to do by default (although it does come with a performance penalty).

Looking forward to hearing your thoughts about this,
Dawid
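
For illustration, the range check quoted above could be sketched roughly as follows. This is not the code from the original message: the class name XmlCharFilter and its methods are hypothetical, and replacing each invalid code point with U+FFFD is only one possible (lossy) recoding, chosen here as an assumption because the message does not say which scheme was used. The helper depends only on the JDK; presumably it could be called from the write side of a Transform<String>, but that wiring is left out.

```java
// Hypothetical helper, not part of simple-xml: checks code points against
// the XML 1.0 ranges listed above and replaces everything else.
final class XmlCharFilter {

    // Assumption: invalid code points are replaced with U+FFFD (the Unicode
    // replacement character). A reversible escaping scheme would work too.
    private static final int REPLACEMENT = 0xFFFD;

    private XmlCharFilter() {}

    // True if the code point falls in one of the XML 1.0 ranges.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the input unchanged (same instance) when it is already clean,
    // so the common case costs one scan and no allocation.
    static String sanitize(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = null;
        int i = 0;
        while (i < input.length()) {
            // Iterate by code point so valid surrogate pairs are kept whole;
            // an unpaired surrogate comes back as itself and is rejected.
            int cp = input.codePointAt(i);
            if (!isValidXml10(cp)) {
                if (out == null) {
                    // First invalid code point: copy the clean prefix.
                    out = new StringBuilder(input.length());
                    out.append(input, 0, i);
                }
                out.appendCodePoint(REPLACEMENT);
            } else if (out != null) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out == null ? input : out.toString();
    }
}
```

With this sketch, XmlCharFilter.sanitize applied to a string containing a NUL character or an unpaired surrogate yields the same string with that character replaced, while tab, newline, carriage return and supplementary-plane characters pass through untouched. The extra scan per string is the performance penalty mentioned above.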