From: Tatu S. <cow...@ya...> - 2006-10-03 23:42:14
--- Mark Swanson <ma...@Sc...> wrote:

> Tatu Saloranta wrote:
...
> > Really? I wouldn't have thought xpp would do that, since
> > I thought it aims to be an actual xml conformant
> > parser...
> >
> > What kind of scrubbing does it do?
>
> I haven't tested it exhaustively, but I do know that it silently ignores
> 0x0c (form feed) because this is where I noticed my old code parsed some
> XML properly (contained 0x0c) and my replacement code based on vtd failed.

Ok. It is likely that it accepts all codes <= 0x0020 as
white space -- that's usually a pretty reasonable way to
do it with string tokenization.

...
> > Which industries rely on broken xml content being
> > processed? (an honest question, no sarcasm intended)
>
> It wasn't that long ago that some systems used these control characters
> and some devices/software are still using them. Some financial systems I
> work with today still use FS/GS/STX/ETX, and the 0x0c data is coming
> directly from Outlook MAPI data (event descriptions). I'm not sure
> exactly how someone is copy/pasting 0x0c characters
...
> The problem is real; it's ugly, and I hope VTD will

Ok, gotcha. Now that you describe it, I think I
understand it a bit better. And in fact this is part of
a more general problem of how to transfer binary data
with(in) xml. Control characters are the most obvious
problem, but not the only one. There are other
illegal xml characters that would likewise cause
problems... the most commonly used (but obviously not
optimal) solution is to use something like base64.

One more thing -- xml 1.1 actually does allow these
control characters (although not the null byte) to be
included, via character references. It's too bad, then,
that xml 1.1 has other problems that make it DOA,
rarely used anywhere (you can google for various
people's reasoning why xml 1.1 sucks -- I have my own
pet peeves -- but I did work on making the Woodstox parser
support xml 1.1 nonetheless).
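To make the character ranges concrete, here is a minimal Java sketch of the check a conformant xml 1.0 parser has to apply (the Char production from the spec), plus base64 wrapping for a value that fails it. This is not code from xpp, VTD or Woodstox; the class and method names (XmlCharCheck, isLegalXml10, firstIllegalIndex, toBase64) and the sample string are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, for illustration only.
final class XmlCharCheck {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD      // tab, LF, CR
            || (cp >= 0x20 && cp <= 0xD7FF)             // excludes surrogates
            || (cp >= 0xE000 && cp <= 0xFFFD)           // excludes U+FFFE, U+FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first char that XML 1.0 forbids, or -1 if all are legal. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isLegalXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    /** Base64 form of the UTF-8 bytes; safe to place in element content. */
    static String toBase64(String s) {
        return Base64.getEncoder()
                     .encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Assumed sample: an event description containing a form feed (0x0c).
        String description = "Agenda\fPage two";
        int bad = firstIllegalIndex(description);
        System.out.println(bad);  // prints 6, the position of the form feed
        if (bad >= 0) {
            System.out.println(toBase64(description));
        }
    }
}
```

The receiving side has to know the element holds base64 and decode it, which is why this only helps when you control both ends.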

Of course, since you don't control the generation of
input files this is a bit of a moot point. ;-)

-+ Tatu +-