[Gpsbabel-code] Yet Another Google Fix
From: Ron P. <ro...@pa...> - 2007-05-11 16:16:22
I got mail off-list from someone whose Google Maps instructions included the nbsp entity, which our google reader was choking on. Yesterday, I checked in a temporary fix that just replaced "&nbsp;" with a space. (It's really a non-breaking space, 0xA0, but for our application the difference didn't matter much.) It was only a temporary fix because, obviously, a lot of other entities could cause the same problem.

The current CVS version now contains a new module, xhtmlent.c, that declares a global string containing all of the entity declarations for XHTML 1.0. I also added a new function, xml_readprefixstring, to xmlgeneric. This function reads a constant string but does not terminate the parser, so you can feed the parser several strings in sequence by calling xml_readprefixstring for all but the last. This allows me to feed the parser a synthetic doctype declaration, including all of the entity declarations I just added, ahead of the XHTML snippet that we extract from Google Maps data. That, in turn, lets expat do all of the entity-substitution heavy lifting, which I assume it's somewhat optimized for.

The point of all this is to say that if you need to parse XHTML data, or if you need to parse any data that may contain undefined parsed entities (http://www.w3.org/TR/2006/REC-xml-20060816/#sec-physical-struct) for which you need to include a DTD, you can find code to do that in the new google.c.

While I was at it, I rewrote the part of the code that cleans up the extracted XHTML to get it into a parseable format, making it closer to O(n) than to O(n*m). (In my checkin comments, I said O(n^2), but it was really O(n*m), where n is the length of the string and m is the number of substitutions that have to be made.)
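
For anyone curious what the O(n*m) versus O(n) difference mentioned above looks like in practice, here is a minimal sketch. It is not the code from google.c: the function name clean_single_pass and the substitution table are hypothetical, made up for illustration. The idea is to walk the input once and append to an output buffer, instead of rescanning (and shifting) the whole string once per substitution.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical substitution table. The real list of cleanups in
 * google.c is not given in this message. */
struct subst {
    const char *from;
    const char *to;
};

static const struct subst table[] = {
    { "<br>", "<br/>" },  /* close an unclosed tag */
    { "\\\"", "\"" },     /* backslash-quote becomes a plain quote */
};
#define NSUBST (sizeof table / sizeof table[0])

/* Apply every substitution in a single left-to-right pass.
 * Returns a malloc'd string that the caller must free, or NULL
 * if the allocation fails. */
char *clean_single_pass(const char *in)
{
    size_t n, i, k, o, factor, from_len, to_len, f;
    char *out;
    int matched;

    assert(in != NULL);
    n = strlen(in);

    /* Worst-case growth: the largest to/from length ratio, rounded up. */
    factor = 1;
    for (k = 0; k < NSUBST; k++) {
        from_len = strlen(table[k].from);
        to_len = strlen(table[k].to);
        f = (to_len + from_len - 1) / from_len;
        if (f > factor)
            factor = f;
    }

    out = malloc(n * factor + 1);
    if (out == NULL)
        return NULL;

    i = 0;
    o = 0;
    while (i < n) {
        matched = 0;
        for (k = 0; k < NSUBST; k++) {
            from_len = strlen(table[k].from);
            /* The first-character test rejects most positions cheaply. */
            if (in[i] == table[k].from[0] &&
                strncmp(in + i, table[k].from, from_len) == 0) {
                to_len = strlen(table[k].to);
                memcpy(out + o, table[k].to, to_len);
                o += to_len;
                i += from_len;
                matched = 1;
                break;
            }
        }
        if (!matched)
            out[o++] = in[i++];  /* ordinary character: copy it through */
    }
    out[o] = '\0';
    return out;
}
```

Each input character is visited once, and nothing is ever shifted in place. The table is still consulted at each position, so this is only "closer to" O(n); with a large table you would want to dispatch on the first character rather than loop over every entry.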