From: Mark S. <ma...@Sc...> - 2006-10-03 05:14:09
Tatu Saloranta wrote:
> --- Mark Swanson <ma...@Sc...> wrote:
>
> ...
>> Well, allow me to try to make a stronger case:
>>
>> In the real world, data isn't perfect. One can either toss back illegal
>> data or try your best to work with it. A common best practice is to be
>> as friendly and as considerate as you can to the incoming data, and
>> produce the most accurate and conforming (to whatever standard) outgoing
>> data.
>
> This is common practice for some applications ("be
> conservative at what you send, liberal at what you
> accept"), but notably not with xml processing.

I think the need for it is greatly diminished, but that's as far as I'll go.

<snip>

> Having said that, I would think that if specific
> lenient modes could be enabled (and were disabled
> by default), that might be reasonable.

Cool.

> ...
>> 4. (at least some of) VTD's competitors already scrub the data by
>> default. The XPP (Xml Pull Parser) already does this. In fact, I was in
>> the middle of switching away from XPP when I ran into this VTD
>> limitation. For my particular use case, using VTD is now slower than XPP
>> because of this scrubbing issue.
>
> Really? I wouldn't have thought xpp would do that, since
> I thought it aims to be an actual xml conformant
> parser...
>
> What kind of scrubbing does it do?

I haven't tested it exhaustively, but I do know that it silently ignores
0x0c (form feed). That is how I noticed the problem: my old XPP-based code
parsed some XML containing 0x0c without complaint, and my replacement code
based on VTD failed on the same input.

>> A single if{} could allow the pedantic behaviour (as it is currently) or
>> a more friendly and considerate (I would argue more industry standard)
>> behaviour.
>
> Which industries rely on broken xml content being
> processed? (an honest question, no sarcasm intended)

It wasn't that long ago that some systems used these control characters,
and some devices/software are still using them.
Some financial systems I work with today still use FS/GS/STX/ETX, and the
0x0c data is coming directly from Outlook MAPI data (event descriptions).
I'm not sure exactly how someone is copy/pasting 0x0c characters into the
Outlook description field, but it happened yesterday/today.

Also, cell phones that I work with (I support any SyncML-capable cell phone
ever made) wrap data inside XML (SyncML is an XML protocol). All sorts of
control characters wind up in that XML, and so far I have had to take care
of them through other scrubbers. I wish I didn't have to do that, for the
reasons mentioned.

The problem is real; it's ugly, and I hope VTD will add this optional
feature to help developers deal with it.

Thanks for listening.

Cheers.

--
http://www.ScheduleWorld.com/tg/
Free Google Calendar synchronization with Outlook, Evolution, cell phones,
BlackBerry, PalmOS, Exchange, Mozilla, Thunderbird, Pocket PC/Windows Mobile.
Also sync tasks, notes and contacts!
WebDAV, vfreebusy, RSS, LDAP, iCalendar, iTIP, iMIP support.
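
To illustrate the kind of external scrubbing described above, here is a minimal sketch of a pre-parse pass over the raw document bytes. It is only an assumption about what such a scrubber might look like: the class and method names (XmlScrubber, isIllegalControl, scrubInPlace) are hypothetical and are not part of VTD-XML or XPP. It assumes the input is UTF-8 or another ASCII-compatible encoding, where every byte below 0x80 is a complete character, and it only handles the C0 control characters that XML 1.0 forbids (everything below 0x20 except TAB, LF and CR). Illegal bytes are overwritten with a space rather than removed, so byte offsets into the document stay unchanged.

```java
/**
 * Hypothetical helper for illustration only; not part of VTD-XML or XPP.
 * Assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1.
 */
final class XmlScrubber {
    private XmlScrubber() {}

    /** True for C0 control bytes that XML 1.0 does not allow (all below 0x20 except TAB, LF, CR). */
    static boolean isIllegalControl(byte b) {
        int v = b & 0xFF; // treat the byte as unsigned
        return v < 0x20 && v != 0x09 && v != 0x0A && v != 0x0D;
    }

    /**
     * Overwrites each illegal control byte with a space, in place.
     * The array length and all other byte positions are left untouched.
     *
     * @return the number of bytes that were replaced
     */
    static int scrubInPlace(byte[] doc) {
        int replaced = 0;
        for (int i = 0; i < doc.length; i++) {
            if (isIllegalControl(doc[i])) {
                doc[i] = (byte) ' ';
                replaced++;
            }
        }
        return replaced;
    }
}
```

The scrubbed array could then be handed to the parser as usual. A pass like this costs one extra scan of the whole document, which is the overhead complained about above, and it would not be safe for UTF-16 input, where a zero byte is a normal part of many characters.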