Remember, too, that this is a moving target. What Maps wrote three years ago is less interesting than what it writes today. These files have a much shorter "use by" window than most of our inputs.

On Tue, Aug 6, 2013 at 9:31 PM, Conrad Meyer <> wrote:
The other problem is that some Google input files are UTF-8 and others are Latin-1. So we can't just assume Latin-1 all the time either =(.
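
One possible way to cope with the mixed inputs is to check whether the bytes are well-formed UTF-8 and fall back to Latin-1 only when they are not. This is a rough sketch, not what google.c does; `is_valid_utf8`, `guess_encoding`, and the `Encoding` enum are hypothetical names. It is only a heuristic: every byte sequence is valid Latin-1, so a Latin-1 file that happens to form valid UTF-8 would be misdetected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `bytes` is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong
// encodings, surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& bytes) {
  const std::size_t n = bytes.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char c = static_cast<unsigned char>(bytes[i]);
    if (c < 0x80) {  // plain ASCII
      ++i;
      continue;
    }
    std::size_t extra = 0;  // number of continuation bytes expected
    unsigned long cp = 0;   // decoded code point
    unsigned long min = 0;  // smallest code point legal for this length
    if ((c & 0xE0) == 0xC0) {
      extra = 1; cp = c & 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {
      extra = 2; cp = c & 0x0F; min = 0x800;
    } else if ((c & 0xF8) == 0xF0) {
      extra = 3; cp = c & 0x07; min = 0x10000;
    } else {
      return false;  // e.g. a lone 0xA9, as in google_multisegment.js
    }
    if (n - i <= extra) {
      return false;  // sequence runs past the end of the input
    }
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
      if ((cc & 0xC0) != 0x80) {
        return false;  // not a continuation byte
      }
      cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;
    }
    i += extra + 1;
  }
  return true;
}

enum class Encoding { Utf8, Latin1 };

// Hypothetical heuristic: prefer UTF-8, fall back to Latin-1.
Encoding guess_encoding(const std::string& bytes) {
  return is_valid_utf8(bytes) ? Encoding::Utf8 : Encoding::Latin1;
}
```

With this, the bytes xmllint complains about (0xA9 0x32 0x30 0x31) would be classified as Latin-1, while the same text with 0xC2 0xA9 would be classified as UTF-8.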

On Tue, Aug 6, 2013 at 7:12 PM, Conrad Meyer <> wrote:

On Tue, Aug 6, 2013 at 6:56 PM, tsteven4 <> wrote:
I tried a test where I changed google_multisegment.js to make it legal UTF-8.  The modified version of google_multisegment.js is attached.  Note that modifying the file isn't really OK, because google_multisegment.js is a web server response (see reference/google_multisegment.txt).  But it does show some interesting results:

a) everything unchanged: fail
b) only google_multisegment.js "a9" -> "c2a9": pass
c) google.c changed as shown below, original google_multisegment.js: fail
d) google.c changed as shown below, modified google_multisegment.js: pass
---   (revision 4510)
+++   (working copy)
@@ -262,7 +262,7 @@
   desc_handle = mkshort_new_handle();
   setshort_length(desc_handle, 12);

-  xml_init(fname, google_map, NULL);
+  xml_init(fname, google_map, "ISO-8859-1");

 static void

It seems to be using utf-8 encoding even though asked for ISO-8859-1.
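
For reference, the "a9" -> "c2a9" change in test (b) is exactly what a Latin-1 to UTF-8 transcode does to that byte. A minimal sketch of such a transcode, assuming the whole input really is ISO-8859-1; `latin1_to_utf8` is a hypothetical helper, not something in the tree:

```cpp
#include <string>

// Hypothetical helper: transcode ISO-8859-1 bytes to UTF-8.
// Every Latin-1 byte maps to the Unicode code point of the same value,
// so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& latin1) {
  std::string out;
  out.reserve(latin1.size());
  for (char ch : latin1) {
    const unsigned char c = static_cast<unsigned char>(ch);
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));  // ASCII passes through
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
    }
  }
  return out;
}
```

Feeding it the single byte 0xA9 yields 0xC2 0xA9, and ASCII passes through unchanged. Running it over a file that is already UTF-8 would double-encode the non-ASCII characters, so it only helps once we know the input is Latin-1.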


The problem is that QXmlStreamReader doesn't really support forcing an encoding. For well-formed XML documents, it will read the encoding="" parameter of the <?xml tag. However, it doesn't appear to find the content-type declaration in this HTML file. We can work around it by slurping the file into a QString as Latin-1, and then feeding that QString to the xml parser. This does seem to defeat the purpose of a stream-based XML parser. It may be an acceptable compromise for



On 8/6/2013 5:21 PM, Conrad Meyer wrote:

On Tue, Aug 6, 2013 at 4:00 PM, tsteven4 <> wrote:
Maybe this is a hint:
~/work/gc/broken% xmllint reference/google_multisegment.js
reference/google_multisegment.js:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA9 0x32 0x30 0x31
r route.\x3c/div\x3e\x3cdiv id=\"cprt\" class=\"legal\"\x3e\x3cspan\x3eMap data
The XML encoding isn't declared in the file, but the HTML encoding is: ISO-8859-1.  This used to be what google.c used, but it isn't that simple, i.e. declaring ISO-8859-1 in the call to xml_init doesn't fix it.

It seems to me that the reader is losing it in readElementText when it tries to read /html/head/script.  That seems to trash the reader.  Immediately after that call, reader.qualifiedName().length() is 0 instead of 6.  It fails the same way on Fedora 18.

      if (cb) {
        QString c = reader.readElementText(QXmlStreamReader::IncludeChildElements);
        // reader is trashed here.
        cb(CSTRE(c), NULL);
        current_tag.chop(reader.qualifiedName().length() + 1);

Modifying that to read:

      if (cb) {
        QString c = reader.readElementText(QXmlStreamReader::IncludeChildElements);
        if (current_tag == "/html/head/script") {
          printf("XXX got to html head script\n'''\n%s\n''' (%d)\n", CSTR(c), c.length());
        }
        cb(CSTRE(c), NULL);

I see:
XXX got to html head script
''' (2)



Gpsbabel-code mailing list