From: Michael W. <wes...@ja...> - 2013-07-16 02:25:38
Immanuel Normann wrote:
> But the following seems to me more tricky. I have two XML files, each with a
> single element <test/> in it, that look identical in a text editor. The only
> difference is that bad.xml has a "Byte Order Mark" in front that good.xml
> doesn't have. It becomes visible through a hexdump:
>
> $ hexdump -C good.xml
> 00000000  3c 74 65 73 74 2f 3e 0d 0a      |<test/>..|
> 00000009
>
> $ hexdump -C bad.xml
> 00000000  ef bb bf 3c 74 65 73 74 2f 3e 0d 0a  |...<test/>..|
> 0000000c

According to the Unicode tables, UTF-8 EF BB BF is U+FEFF, the ZERO WIDTH
NO-BREAK SPACE, which sits at the end of the Arabic Presentation Forms-B
block. A lot of parsers take a standard ASCII space (and other whitespace)
into account and strip it from the front and back of a document before
parsing, but not whitespace from other language encodings. I know that I've
had to trim out the Japanese '　' (U+3000 IDEOGRAPHIC SPACE) before parsing
some documents in the past, just because whoever made the documents didn't
realize that they had put such spaces in.

If the source of some XML files consistently has malformed XML or special
characters that cause the XML to fail to load, I load the file as text and
use fn:replace to either turn such characters into standard spaces or other
useful characters (like '0'), or eliminate them altogether. After the
offending characters have been replaced, I then use util:parse to get the
XML document, which may be stored and manipulated.

Hope this helps.

--
Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/
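[Editor's note: the load-as-text, replace, then parse workflow above uses
eXist-db's XQuery functions (fn:replace, util:parse). For readers outside
XQuery, here is a minimal Python sketch of the same idea; the helper name
parse_lenient and the exact character list are illustrative assumptions,
not part of the original post.]

```python
import xml.etree.ElementTree as ET


def parse_lenient(raw: bytes) -> ET.Element:
    """Decode the file, strip invisible characters that commonly break
    strict XML parsers, then parse the cleaned text.

    Python analogue of the fn:replace + util:parse approach described
    above; the character list is illustrative, not exhaustive.
    """
    text = raw.decode("utf-8")
    # U+FEFF (BOM / zero width no-break space) and U+3000 (ideographic
    # space) are invisible in most editors but reject under strict parsing.
    for ch in ("\ufeff", "\u3000"):
        text = text.replace(ch, "")
    return ET.fromstring(text)


# bad.xml from the hexdump above: BOM + "<test/>" + CRLF
root = parse_lenient(b"\xef\xbb\xbf<test/>\r\n")
print(root.tag)  # -> test
```

Note that replacing with an empty string (eliminating the character) is
usually safer for a BOM than substituting a space, since a space before the
XML declaration would itself be ill-formed.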