From: Michael W. <wes...@ja...> - 2013-07-16 02:25:38
Immanuel Normann wrote:
> But the following seems to me more tricky. I have two XML files, each with a
> single element <test/> in it, that look identical in a text editor. The only
> difference is that bad.xml has a "Byte Order Mark" in front that good.xml
> doesn't have. It becomes visible through a hexdump:
>
> $ hexdump -C good.xml
> 00000000  3c 74 65 73 74 2f 3e 0d 0a      |<test/>..|
> 00000009
>
> $ hexdump -C bad.xml
> 00000000  ef bb bf 3c 74 65 73 74 2f 3e 0d 0a  |...<test/>..|
> 0000000c

According to the Unicode tables, UTF-8 EF BB BF is U+FEFF, the ZERO WIDTH
NO-BREAK SPACE, which sits at the end of the Arabic Presentation Forms-B
block. A lot of parsers take a standard ASCII space (and other whitespace)
into account and strip it from the front and back of a document before
parsing, but not whitespace from other language encodings. I know that I've
had to trim out the Japanese '　' (U+3000 IDEOGRAPHIC SPACE) before parsing
some documents in the past, just because whoever made the documents didn't
realize that they had put such spaces in.

If the source of some XML files consistently has malformed XML or special
characters that cause the XML to fail to load, I load the file as text and
use fn:replace to either turn such characters into standard spaces or other
useful characters (like '0'), or eliminate them altogether. After the
offending characters have been replaced, I then use util:parse to get the
XML document, which may be stored and manipulated.

Hope this helps.

--
Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/
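[Editor's note: the load-as-text, replace, then parse workflow above uses
eXist-db's XQuery functions (fn:replace, util:parse). For readers outside
XQuery, here is a minimal Python sketch of the same idea; the helper name
parse_lenient and the exact character list are illustrative assumptions,
not part of the original post.]

```python
import xml.etree.ElementTree as ET


def parse_lenient(raw: bytes) -> ET.Element:
    """Decode the file, strip invisible characters that commonly break
    strict XML parsers, then parse the cleaned text.

    Python analogue of the fn:replace + util:parse approach described
    above; the character list is illustrative, not exhaustive.
    """
    text = raw.decode("utf-8")
    # U+FEFF (BOM / zero width no-break space) and U+3000 (ideographic
    # space) are invisible in most editors but reject under strict parsing.
    for ch in ("\ufeff", "\u3000"):
        text = text.replace(ch, "")
    return ET.fromstring(text)


# bad.xml from the hexdump above: BOM + "<test/>" + CRLF
root = parse_lenient(b"\xef\xbb\xbf<test/>\r\n")
print(root.tag)  # -> test
```

Note that replacing with an empty string (eliminating the character) is
usually safer for a BOM than substituting a space, since a space before the
XML declaration would itself be ill-formed.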