Remember, too, this is a moving target. What maps write three years ago is less interesting than what they write today.  These files have a much shorter "use by" window than most of our inputts.


On Tue, Aug 6, 2013 at 9:31 PM, Conrad Meyer <cse.cem@gmail.com> wrote:
The other problem is that some google input files are UTF-8, and others are Latin1. So we can't just assume Latin1 all the time either =(.


On Tue, Aug 6, 2013 at 7:12 PM, Conrad Meyer <cse.cem@gmail.com> wrote:



On Tue, Aug 6, 2013 at 6:56 PM, tsteven4 <tsteven4@gmail.com> wrote:
I tried a test where I changed google_multisegment.js to make it legal utf-8.  The modified version of google_multisegment.js is attached.  Note that this isn't ok, google_multisegment.js is a web server response (see reference/google_multisegment.txt).  But it does show some interesting results:

r4516:
a) everything unchanged: fail
b) only google_multisegment.js "a9" -> "c2a9": pass
c) google.c changed as shown below, original google_multisegment.js: fail
d) google.c changed as shown below, modified google_multisegment.js: pass
Index: google.cc
===================================================================
--- google.cc   (revision 4510)
+++ google.cc   (working copy)
@@ -262,7 +262,7 @@
   desc_handle = mkshort_new_handle();
   setshort_length(desc_handle, 12);

-  xml_init(fname, google_map, NULL);
+  xml_init(fname, google_map, "ISO-8859-1");
 }

 static void

It seems to be using utf-8 encoding even though asked for ISO-8859-1.

Steve


The problem is that QXmlStreamReader doesn't really support forcing an encoding. For well-formed XML documents, it will read the encoding="" parameter of the <?xml tag. However, it doesn't appear to find the content-type declaration in this HTML file. We can work around it by slurping the file into a QString as Latin-1, and then feeding that QString to the xml parser. This does seem to defeat the purpose of a stream-based XML parser. It may be an acceptable compromise for google.cc.

Conrad

 


On 8/6/2013 5:21 PM, Conrad Meyer wrote:


On Tue, Aug 6, 2013 at 4:00 PM, tsteven4 <tsteven4@gmail.com> wrote:
Maybe this is a hint:
~/work/gc/broken% xmllint reference/google_multisegment.js
reference/google_multisegment.js:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA9 0x32 0x30 0x31
r route.\x3c/div\x3e\x3cdiv id=\"cprt\" class=\"legal\"\x3e\x3cspan\x3eMap data
The xml encoding isn't declared in the file, but the html encoding is, ISO-8859-1.  This used to be what google.c used, but it isn't that simple, i.e. declaring ISO-8859-1 in the call to xml_init doesn't fix it.

It seems to me that the reader is loosing it in readElementText when it tries to read /html/head/script.  That seems to trash the reader.  Immediatedly after that call reader.qualifiedName().length() is 0 instead of 6.  It fails the same way on Fedora 18.

      if (cb) {
        QString c = reader.readElementText(QXmlStreamReader::IncludeChildElements);
reader is trashed here.
        cb(CSTRE(c), NULL);
        current_tag.chop(reader.qualifiedName().length() + 1);
      }

Modifying that to read:

221       if (cb) {
222         QString c = reader.readElementText(QXmlStreamReader::IncludeChildElements);                 
223         if (current_tag == "/html/head/script") {
224           printf("XXX got to html head script\n'''\n%s\n''' (%d)\n", CSTR(c), c.length());          
225         }
226         cb(CSTRE(c), NULL);

I see:
XXX got to html head script
'''
//
''' (2)

Hrm.

Conrad



------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite!
It's a free troubleshooting tool designed for production.
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
_______________________________________________
Gpsbabel-code mailing list  http://www.gpsbabel.org
Gpsbabel-code@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/gpsbabel-code