On Wed, Aug 7, 2013 at 10:00 AM, Robert Lipe <robertlipe@gpsbabel.org> wrote:
If it writes Latin1 today (a quite surprising choice, actually) then that's the case to make work and the UTF-8 path may just be a historical distraction.

Unlike, say, a GPX file, I don't think people have ten-year-old .html/.js files from maps.google.com lying around that they're actively babel-izing.

Also, the "streamminess" (once you start making up words in an email, why limit yourself to just one?) matters much less in this format than in others, like GPX, where the file size can be unbounded. I've worked with multi-GB GPX files and you really don't want to hold a copy in core. The Google Maps output isn't unbounded - in fact, it's meant to be held in core inside a browser window anyway; that's why it has that weird compression, for example. So if the solution is to slurp it in as a string and then make a copy of it with ::fromLatin1(), I'm hip with that.

We just need to be sure the cases that we have with UTF-8 would be written by Google Maps today as Latin1...and we need to exercise it for at least European driving directions to be sure that Maps isn't being clever and using Latin1 when it can get away with it for us Americans that are less likely to have umlauts and accent graves in our directions.

I tried wget --header='Accept-Charset: utf-8', although I'm not sure that's correct. And geoip would tell Google I'm coming from North America.

I guess I can slurp in the first 256 bytes or something as latin1, look for a charset= header to confirm or switch to another encoding, then slurp the entire file using that encoding and pass the string to xmlgeneric.
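
The sniffing step described above could be sketched roughly as follows. This is plain standard C++ rather than Qt, and the function name sniff_charset, the Latin-1 fallback, and the 256-byte window are all assumptions made for illustration, not anything that exists in google.cc.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of GPSBabel): scan the first max_bytes bytes
// of a document for "charset=" and return the lowercased encoding name that
// follows it, or the fallback when no declaration is found.
std::string sniff_charset(const std::string& data,
                          const std::string& fallback = "iso-8859-1",
                          std::size_t max_bytes = 256)
{
  // Only the head of the file is inspected; lowercase it for matching.
  std::string head = data.substr(0, max_bytes);
  std::transform(head.begin(), head.end(), head.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

  const std::string key = "charset=";
  std::size_t pos = head.find(key);
  if (pos == std::string::npos) {
    return fallback;
  }
  pos += key.size();

  // Allow an optional opening quote, as in <meta charset="utf-8">.
  if (pos < head.size() && (head[pos] == '"' || head[pos] == '\'')) {
    ++pos;
  }

  // The encoding name runs until the first character that cannot be part of it.
  std::size_t end = pos;
  while (end < head.size()) {
    unsigned char c = static_cast<unsigned char>(head[end]);
    if (!(std::isalnum(c) || c == '-' || c == '_')) {
      break;
    }
    ++end;
  }
  return end == pos ? fallback : head.substr(pos, end - pos);
}
```

The caller would then decode the whole file using whatever name comes back, treating a missing declaration as Latin-1.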



On Wed, Aug 7, 2013 at 4:08 AM, Conrad Meyer <cse.cem@gmail.com> wrote:
Nope, still Latin1 (same as google_multisegment.js).

On Wed, Aug 7, 2013 at 2:04 AM, Conrad Meyer <cse.cem@gmail.com> wrote:
Sure. What does google maps emit today? I'm hoping UTF-8...

On Tue, Aug 6, 2013 at 9:07 PM, Robert Lipe <robertlipe@gpsbabel.org> wrote:
Remember, too, this is a moving target. What Maps wrote three years ago is less interesting than what it writes today. These files have a much shorter "use by" window than most of our inputs.

On Tue, Aug 6, 2013 at 9:31 PM, Conrad Meyer <cse.cem@gmail.com> wrote:
The other problem is that some google input files are UTF-8, and others are Latin1. So we can't just assume Latin1 all the time either =(.

On Tue, Aug 6, 2013 at 7:12 PM, Conrad Meyer <cse.cem@gmail.com> wrote:

On Tue, Aug 6, 2013 at 6:56 PM, tsteven4 <tsteven4@gmail.com> wrote:
I tried a test where I changed google_multisegment.js to make it legal utf-8.  The modified version of google_multisegment.js is attached.  Note that this isn't ok, google_multisegment.js is a web server response (see reference/google_multisegment.txt).  But it does show some interesting results:

a) everything unchanged: fail
b) only google_multisegment.js "a9" -> "c2a9": pass
c) google.cc changed as shown below, original google_multisegment.js: fail
d) google.cc changed as shown below, modified google_multisegment.js: pass
Index: google.cc
--- google.cc   (revision 4510)
+++ google.cc   (working copy)
@@ -262,7 +262,7 @@
   desc_handle = mkshort_new_handle();
   setshort_length(desc_handle, 12);

-  xml_init(fname, google_map, NULL);
+  xml_init(fname, google_map, "ISO-8859-1");

 static void

It seems to be using utf-8 encoding even though asked for ISO-8859-1.


The problem is that QXmlStreamReader doesn't really support forcing an encoding. For well-formed XML documents, it will read the encoding="" parameter of the <?xml tag. However, it doesn't appear to find the content-type declaration in this HTML file. We can work around it by slurping the file into a QString as Latin-1, and then feeding that QString to the xml parser. This does seem to defeat the purpose of a stream-based XML parser. It may be an acceptable compromise for google.cc.
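
To illustrate what decoding the bytes as Latin-1 buys us, here is a minimal sketch in plain standard C++ (no Qt). The helper name latin1_to_utf8 is hypothetical. In google.cc the equivalent step would presumably be QString::fromLatin1() on the file contents, with the resulting QString handed to the reader, but that wiring is an assumption and isn't shown here.

```cpp
#include <string>

// Hypothetical helper (not part of GPSBabel): re-encode ISO-8859-1 bytes as
// UTF-8. Every Latin-1 byte maps directly to the Unicode code point of the
// same value, so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in)
{
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    if (c < 0x80) {
      // ASCII is identical in both encodings.
      out.push_back(static_cast<char>(c));
    } else {
      // Code points 0x80..0xFF: lead byte 110000xx, continuation 10xxxxxx.
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}
```

With this, the lone 0xA9 byte that xmllint complains about becomes 0xC2 0xA9, which is the same "a9" -> "c2a9" change that made case (b) pass.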



On 8/6/2013 5:21 PM, Conrad Meyer wrote:

On Tue, Aug 6, 2013 at 4:00 PM, tsteven4 <tsteven4@gmail.com> wrote:
Maybe this is a hint:
~/work/gc/broken% xmllint reference/google_multisegment.js
reference/google_multisegment.js:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA9 0x32 0x30 0x31
r route.\x3c/div\x3e\x3cdiv id=\"cprt\" class=\"legal\"\x3e\x3cspan\x3eMap data
The xml encoding isn't declared in the file, but the html encoding is, ISO-8859-1.  This used to be what google.c used, but it isn't that simple, i.e. declaring ISO-8859-1 in the call to xml_init doesn't fix it.

It seems to me that the reader is losing it in readElementText when it tries to read /html/head/script.  That seems to trash the reader.  Immediately after that call reader.qualifiedName().length() is 0 instead of 6.  It fails the same way on Fedora 18.

      if (cb) {
        QString c = reader.readElementText(QXmlStreamReader::IncludeChildElements);
        // reader is trashed here.
        cb(CSTRE(c), NULL);
        current_tag.chop(reader.qualifiedName().length() + 1);

Modifying that to read:

      if (cb) {
        QString c = reader.readElementText(QXmlStreamReader::IncludeChildElements);
        if (current_tag == "/html/head/script") {
          printf("XXX got to html head script\n'''\n%s\n''' (%d)\n", CSTR(c), c.length());
        }
        cb(CSTRE(c), NULL);

I see:
XXX got to html head script
''' (2)



Gpsbabel-code mailing list  http://www.gpsbabel.org