From: Mark S. <ma...@Sc...> - 2006-10-02 19:49:49
I did a quick search and didn't find this so I'm reporting here:

If VTD sees a control character like ^L it will throw an exception and fail to parse the document.

Please fix this to be more tolerant. At the very least allow me to specify an option to tell VTD to ignore characters that would otherwise cause parsing to stop.

Thank you.

--
http://www.ScheduleWorld.com/tg/
Free Google Calendar synchronization with Outlook, Evolution, cell phones, BlackBerry, PalmOS, Exchange, Mozilla, Thunderbird, Pocket PC/Windows Mobile. Also sync tasks, notes and contacts! WebDAV, vfreebusy, RSS, LDAP, iCalendar, iTIP, iMIP support.
From: Jimmy Z. <cra...@co...> - 2006-10-02 20:30:17
Is ^L a valid XML character? What is its value in the UCS? Does Xerces have problem with this char??

----- Original Message -----
From: "Mark Swanson" <ma...@Sc...>
To: <vtd...@li...>
Sent: Monday, October 02, 2006 12:49 PM
Subject: [Vtd-xml-users] 1.6 bug:

> _______________________________________________
> Vtd-xml-users mailing list
> Vtd...@li...
> https://lists.sourceforge.net/lists/listinfo/vtd-xml-users
From: Mark S. <ma...@Sc...> - 2006-10-02 20:59:02
Jimmy Zhang wrote:
> Is ^L a valid XML character? What is its value in the UCS?
> Does Xerces have problem with this char??

Yeah, I can see I wasn't clear; I did look up ^L and didn't find anything so was hoping you would just know :-)

Ok, I used hexdump and found the value of the offending character: 0x0C (Form Feed). This is not a valid XML character. Below 0x20, the only valid characters are 0x09, 0x0A and 0x0D; from 0x20 up, nearly everything is valid (the spec excludes the surrogate range and 0xFFFE/0xFFFF):
http://www.w3.org/TR/REC-xml/#charsets

However, I now have to create a method called removeAsciiControl() that removes every byte < 0x20 except for 0x0D, 0x0A and 0x09. Only then can I pass this cleaned-up data to VTD.

I'd like to avoid this overhead, and it would be ideal if VTD just ignored non-valid XML characters. That would save me from creating a buffer and cleaning the data manually.

Thank you.
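The scrubbing step described in this message could look roughly like the following. This is a minimal sketch, not Mark's actual code: the class name `AsciiControlScrubber` is hypothetical, only the method name `removeAsciiControl` comes from the message, and nothing here is part of VTD-XML. It assumes the input uses an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, where a byte below 0x20 can only be a control character; it would corrupt UTF-16 input.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of VTD-XML. */
final class AsciiControlScrubber {

    /** True for bytes below 0x20 other than TAB (0x09), LF (0x0A) and CR (0x0D). */
    static boolean isIllegalControl(byte b) {
        int v = b & 0xFF; // read the byte as an unsigned value
        return v < 0x20 && v != 0x09 && v != 0x0A && v != 0x0D;
    }

    /**
     * Returns a copy of the input with illegal control bytes dropped.
     * Assumes an ASCII-compatible encoding (for example UTF-8).
     */
    static byte[] removeAsciiControl(byte[] input) {
        byte[] out = new byte[input.length];
        int n = 0;
        for (byte b : input) {
            if (!isIllegalControl(b)) {
                out[n++] = b;
            }
        }
        return Arrays.copyOf(out, n); // trim to the number of bytes kept
    }
}
```

The extra buffer and the extra pass over the data are exactly the overhead the message objects to: the parser already visits every byte once.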
From: Jimmy Z. <cra...@co...> - 2006-10-02 22:40:31
A concern, if VTD-XML is to allow those chars, is that XML developers are likely to shy away from VTD-XML for fear that it does not conform closely enough to the spec... It is something that is worth thinking about...
From: Mark S. <ma...@Sc...> - 2006-10-03 01:48:19
Jimmy Zhang wrote:
> A concern, if VTD-XML is to allow those chars, is that XML developers are likely to
> shun away from VTD-XML for the fear that it is not enough conformant to the spec...
> It is something that is worth thinking about...

Well, allow me to try to make a stronger case:

In the real world, data isn't perfect. One can either toss back illegal data or try one's best to work with it. A common best practice is to be as friendly and as considerate as you can to the incoming data, and produce the most accurate and conforming (to whatever standard) outgoing data.

Currently, VTD fails here wrt the incoming data; there is no way I can tell VTD to be considerate and lenient towards the failings of the incoming data stream. As a developer, I believe it is my choice - not VTD's - whether or not I wish to be considerate and lenient to the incoming data. If I wish to consider a form feed character as harmless, then I don't want my tools to get in the way. VTD is getting in the way.

From a performance standpoint, VTD is now causing many negative consequences:

1. I must build and maintain a new additional layer to handle incoming requests, because I don't have the source to modify any of the previous layers. This new layer is required to scrub the incoming data clean - a job that VTD could have done far faster and cheaper, both in terms of time and resources.
2. I need to allocate more objects to buffer the incoming data and the new cleaned result, and spend CPU time walking and cleaning the incoming data. VTD is already doing this exact same job; it's just missing a small, insignificant extra if() statement.
3. The extra cost of memory and CPU (plus garbage collection) forced upon users of VTD (users who want to be considerate and friendly to clients sending in data) may now remove the advantages of VTD compared to other solutions.
4. (At least some of) VTD's competitors already scrub the data by default. The XPP (Xml Pull Parser) already does this. In fact, I was in the middle of switching away from XPP when I ran into this VTD limitation. For my particular use case, using VTD is now slower than XPP because of this scrubbing issue.

A single if{} could allow the pedantic behaviour (as it is currently) or a more friendly and considerate (I would argue more industry standard) behaviour.

In the real world, dirty data happens. We have to deal with it. VTD's value would be even greater if developers could count on it to help us with this problem.

Please consider it.

Thank you.
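The "single if" argued for in this message can be illustrated with a small character scanner that takes a leniency flag. This is a sketch under stated assumptions: `CharPolicyScanner`, `scan` and the `lenient` parameter are hypothetical names, this is not VTD-XML's API, and it says nothing about how VTD-XML's parser is structured internally. It copies accepted characters into a new string purely to keep the example short. The validity test follows the `Char` production of the XML 1.0 spec.

```java
/** Hypothetical illustration of a strict/lenient switch; not VTD-XML code. */
final class CharPolicyScanner {

    /** The Char production from XML 1.0, section 2.2. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Strict mode (lenient == false): throws on the first illegal character.
     * Lenient mode (lenient == true): silently skips illegal characters.
     */
    static String scan(String text, boolean lenient) {
        StringBuilder accepted = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                accepted.appendCodePoint(cp);
            } else if (!lenient) {
                // This branch is the "single if": the only difference between modes.
                throw new IllegalArgumentException(
                    String.format("illegal XML character 0x%X at offset %d", cp, i));
            }
            i += Character.charCount(cp);
        }
        return accepted.toString();
    }
}
```

With the flag off, behaviour matches a conforming parser, which must report the error; with it on, the caller has explicitly opted out of conformance for that input.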
From: Tatu S. <cow...@ya...> - 2006-10-03 04:06:30
--- Mark Swanson <ma...@Sc...> wrote:
...
> Well, allow me try to make a stronger case:
>
> In the real world, data isn't perfect. One can either toss back illegal data or try your best to work with it. A common best practice is to be as friendly and as considerate as you can to the incoming data, and produce the most accurate and conforming (to whatever standard) outgoing data.

This is common practice for some applications ("be conservative in what you send, liberal in what you accept"), but notably not with xml processing.

> Currently, VTD fails here wrt the incoming data; there is no way I can tell VTD to be considerate and lenient towards the failings of the incoming data stream. As a developer, I believe it is my choice - not VTD's - whether or not I wish to be considerate and lenient to the incoming data. If I wish to consider a form feed

You could argue this, but it is worth noting that none of the actual conformant xml parsers allow things like characters that are _illegal_ in xml content: try the same content with, say, Xerces, and see what I mean.

This is because the xml specification is very clear not only on what is considered legal for well-formed documents, but also on how conforming processing applications (parsers) are to deal with things that are not. Specifically, they are not allowed to recover from fatal errors, and must report these fatal errors to the end application.

Having said that, I would think that if specific lenient modes could be enabled (and were disabled by default), that might be reasonable.

...
> 4. (at least some of) VTDs competitors already scrub the data by default. The XPP (Xml Pull Parser) already does this. In fact, I was in the middle of switching away from XPP when I ran into this VTD limitation. For my particular use case, using VTD is now slower than XPP because of this scrubbing issue.

Really? I wouldn't have thought xpp would do that, since I thought it aims to be an actual xml conformant parser...

What kind of scrubbing does it do?

> A single if{} could allow the pedantic behaviour (as it is currently) or a more friendly and considerate (I would argue more industry standard) behaviour.

Which industries rely on broken xml content being processed? (an honest question, no sarcasm intended)

-+ Tatu +-
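The suggestion to try the same content with a conformant parser can be reproduced with the JAXP parser bundled with the JDK (derived from Xerces in common JDK builds). This is a minimal sketch; the class name `StrictParseCheck` and the method `parses` are hypothetical, and the tiny test documents are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical check: does the JDK's default JAXP parser accept a document? */
final class StrictParseCheck {

    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXParseException e) {
            return false; // the parser reported a fatal well-formedness error
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O failure
        }
    }
}
```

Calling `StrictParseCheck.parses("<a>x\fy</a>")` exercises the form feed case from this thread, while `"<a>x\ty</a>"` contains only a tab, which XML 1.0 allows.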
From: Mark S. <ma...@Sc...> - 2006-10-03 05:14:09
Tatu Saloranta wrote:
> --- Mark Swanson <ma...@Sc...> wrote:
>
> ...
>> Well, allow me try to make a stronger case:
>>
>> In the real world, data isn't perfect. One can either toss back illegal data or try your best to work with it. A common best practice is to be as friendly and as considerate as you can to the incoming data, and produce the most accurate and conforming (to whatever standard) outgoing data.
>
> This is common practice for some applications ("be conservative at what you send, liberal at what you accept"), but notably not with xml processing.

I think the need for it is greatly diminished, but that's as far as I'll go.

<snip>
> Having said that, I would think that if specific lenient modes could be enabled (and were disabled by default), that might be reasonable.

Cool.

> ...
>> 4. (at least some of) VTDs competitors already scrub the data by default. The XPP (Xml Pull Parser) already does this. In fact, I was in the middle of switching away from XPP when I ran into this VTD limitation. For my particular use case, using VTD is now slower than XPP because of this scrubbing issue.
>
> Really? I wouldn't have though xpp would do that, since I thought it aims to be an actual xml conformant parser...
>
> What kind of scrubbing does it do?

I haven't tested it exhaustively, but I do know that it silently ignores 0x0c (form feed) because this is where I noticed my old code parsed some XML properly (contained 0x0c) and my replacement code based on vtd failed.

>> A single if{} could allow the pedantic behaviour (as it is currently) or a more friendly and considerate (I would argue more industry standard) behaviour.
>
> Which industries rely on broken xml content being processed? (an honest question, no sarcasm intended)

It wasn't that long ago that some systems used these control characters and some devices/software are still using them. Some financial systems I work with today still use FS/GS/STX/ETX, and the 0x0c data is coming directly from Outlook MAPI data (event descriptions). I'm not sure exactly how someone is copy/pasting 0x0c characters into the Outlook description field, but it happened yesterday/today. Also, cell phones that I work with (I support any SyncML-capable cell phone ever made) wrap data inside XML (SyncML is an XML protocol). All sorts of control characters wind up in the XML that have been taken care of through other scrubbers. I wish I didn't have to do that for the reasons mentioned.

The problem is real; it's ugly, and I hope VTD will add this optional feature to help developers deal with it.

Thanks for listening.

Cheers.
From: Tatu S. <cow...@ya...> - 2006-10-03 23:42:14
--- Mark Swanson <ma...@Sc...> wrote:
> Tatu Saloranta wrote:
...
> > Really? I wouldn't have though xpp would do that, since I thought it aims to be an actual xml conformant parser...
> >
> > What kind of scrubbing does it do?
>
> I haven't tested it exhaustively, but I do know that it silently ignores 0x0c (form feed) because this is where I noticed my old code parsed some XML properly (contained 0x0c) and my replacement code based on vtd failed.

Ok. It is likely that it accepts all codes <= 0x0020 as white space -- that's usually a pretty reasonable way to do it with string tokenization.

...
> > Which industries rely on broken xml content being processed? (an honest question, no sarcasm intended)
>
> It wasn't that long ago that some systems used these control characters and some devices/software are still using them. Some financial systems I work with today still use FS/GS/STX/ETX, and the 0x0c data is coming directly from Outlook MAPI data (event descriptions). I'm not sure exactly how someone is copy/pasting 0x0c characters
...
> The problem is real; it's ugly, and I hope VTD will

Ok, gotcha. Now that you describe it, I think I understand it a bit better. And in fact this is part of a more general problem of how to transfer binary data with(in) xml. Control characters are the most obvious problem, but not the only one. There are other illegal xml characters that would likewise cause problems... the most commonly used (but obviously not optimal) solution is to use something like base64.

One more thing -- xml 1.1 actually does allow these control characters (although not the null byte) to be included, via character references. It's too bad, then, that xml 1.1 has other problems that make it DOA and rarely used anywhere (you can google for various people's reasoning on why xml 1.1 sucks -- I have my own pet peeves -- but I did work on making the Woodstox parser support xml 1.1 nonetheless).

Of course, since you don't control the generation of input files this is a bit of a moot point. ;-)

-+ Tatu +-
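The base64 approach mentioned in this message can be sketched with the JDK's `java.util.Base64` (Java 8 and later). This only helps when the sender can be changed, which the message itself notes is not the case here. The class name `Base64Payload`, the element name `description` and the `encoding` attribute are all hypothetical, chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical sender/receiver pair; element and attribute names are made up. */
final class Base64Payload {

    /** Sender side: encode the raw text so no control character reaches the XML. */
    static String wrap(String raw) {
        String encoded = Base64.getEncoder()
            .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
        // The Base64 alphabet (A-Z, a-z, 0-9, '+', '/', '=') needs no XML escaping.
        return "<description encoding=\"base64\">" + encoded + "</description>";
    }

    /** Receiver side: extract the element text and decode it back to the raw text. */
    static String unwrap(String element) {
        int start = element.indexOf('>') + 1;  // just past the start tag
        int end = element.lastIndexOf("</");   // start of the end tag
        byte[] decoded = Base64.getDecoder().decode(element.substring(start, end));
        return new String(decoded, StandardCharsets.UTF_8);
    }
}
```

The cost is the usual one for Base64: the payload grows by about a third and is no longer human-readable in the document, which is presumably why the message calls the approach common but not optimal.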
From: Jimmy Z. <cra...@co...> - 2006-10-03 17:32:53
Mark,

I will keep your suggestion in mind and see what can be done to accommodate it in the future...

Jimmy