Conrad,

Thanks for the fix. I have found a way to control the encoding output by maps.google.com. Basically we add "&output=js&oe=ENCODING" to the URL instead of just "&output=js". I have verified that the default (ISO-8859-1), UTF-8, and UTF-16 all work now. I can check in a modified google.test to demonstrate it. I think your fix is still very desirable even though the encoding can be controlled, as your fix doesn't require users to do so.
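
As a rough illustration of the URL change described above, here is a minimal sketch. The helper name with_output_encoding and the sample base URL are hypothetical and are not part of gpsbabel; only the "&output=js" and "&oe=ENCODING" parameters come from this message.

```cpp
#include <string>

// Hypothetical helper (not from gpsbabel): appends the JavaScript output
// selector and, optionally, the requested output encoding to a
// maps.google.com URL. Assumes base_url already contains a '?' query part.
std::string with_output_encoding(const std::string& base_url,
                                 const std::string& encoding) {
  std::string url = base_url + "&output=js";
  if (!encoding.empty()) {
    // An empty encoding leaves the server default in place.
    url += "&oe=" + encoding;
  }
  return url;
}
```

Passing an empty encoding reproduces the old "&output=js" form, so existing behavior would be unchanged unless a caller asks for a specific encoding.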

Steve


On 8/7/2013 12:52 PM, Conrad Meyer wrote:
Hi,

Attached is a patch that fixes google.cc in testo.

- Nuke leftover HAVE_EXPAT garbage in google.cc
- Use file prefix to attempt to discover HTML encoding
- Slurp entire HTML input file as discovered encoding, then pass to
  QXmlStreamReader as a QString

xmlgeneric:
- Adds a new function, void xml_readunicode(const QString& str) (does
  what you'd expect -- feeds QString input into the xml parser and runs
  it)

Btw, r4517 breaks build on linux (defs.h:39:26: fatal error: Qtcore/QString: No such file or directory).

But apply this, and fix that issue, and Jenkins should be happy.

Thanks,
Conrad



On Wed, Aug 7, 2013 at 11:15 AM, Robert Lipe <robertlipe@gpsbabel.org> wrote:
It's more likely based on the route you're looking at than where you're viewing it if it's special casing it at all.

I think your proposal is fine.  Another copy of that string in memory is hardly a deal breaker.

Thanx for sticking with this.
RJL


On Wed, Aug 7, 2013 at 1:12 PM, Conrad Meyer <cse.cem@gmail.com> wrote:



On Wed, Aug 7, 2013 at 10:00 AM, Robert Lipe <robertlipe@gpsbabel.org> wrote:
If it writes Latin1 today (a quite surprising choice, actually) then that's the case to make work and the UTF-8 path may just be a historical distraction.

Unlike, say, a GPX file, I don't think people have ten year old .html/.js files from maps.google.com laying around that they're actively babel-izing.

Also, the "streamminess" (once you start making up words in an email, why limit yourself to just one?) matters much less in this format than in others, like GPX, where the file size can be unbounded. I've worked with multi-GB GPX files and you really don't want to hold a copy in core. The Google Maps output isn't unbounded - in fact, it's meant to be held in core inside a browser window anyway; that's why it has that weird compression, for example. So if the solution is to slurp it in as a string and then make a copy of it with ::fromLatin1(), I'm hip with that.

We just need to be sure the cases that we have with UTF-8 would be written by Google Maps today as Latin1...and we need to exercise it for at least European driving directions to be sure that Maps isn't being clever and using Latin1 when it can get away with it for us Americans that are less likely to have umlauts and accent graves in our directions.


I tried wget --header='Accept-Charset: utf-8', although I'm not sure that's correct. And geoip would tell Google I'm coming from North America.

I guess I can slurp in the first 256 bytes or something as latin1, look for a charset= header to confirm or switch to another encoding, then slurp the entire file using that encoding and pass the string to xmlgeneric.
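
The sniffing step proposed above might look roughly like the sketch below. It is plain C++ rather than the Qt code in the actual patch; the function name sniff_charset, the 256-byte default, and the Latin-1 fallback are illustrative assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from gpsbabel): scans the first `limit` bytes,
// treated as single-byte text, for "charset=" and returns the lowercased
// charset name. Falls back to "iso-8859-1" when no declaration is found.
std::string sniff_charset(const std::string& data, std::size_t limit = 256) {
  std::string head = data.substr(0, limit);
  std::transform(head.begin(), head.end(), head.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });

  const std::string key = "charset=";
  std::size_t pos = head.find(key);
  if (pos == std::string::npos) {
    return "iso-8859-1";
  }
  pos += key.size();
  // Skip an optional opening quote around the value.
  if (pos < head.size() && (head[pos] == '"' || head[pos] == '\'')) {
    ++pos;
  }
  // The charset name runs until the first character that is not
  // alphanumeric, '-' or '_'.
  std::size_t end = pos;
  while (end < head.size() &&
         (std::isalnum(static_cast<unsigned char>(head[end])) ||
          head[end] == '-' || head[end] == '_')) {
    ++end;
  }
  if (end == pos) {
    return "iso-8859-1";
  }
  return head.substr(pos, end - pos);
}
```

The caller would then decode the whole file with the returned encoding before handing the string to xmlgeneric.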

Thanks,
Conrad

 



On Wed, Aug 7, 2013 at 4:08 AM, Conrad Meyer <cse.cem@gmail.com> wrote:
Nope, still Latin1 (same as google_multisegment.js).


On Wed, Aug 7, 2013 at 2:04 AM, Conrad Meyer <cse.cem@gmail.com> wrote:
Sure. What does google maps emit today? I'm hoping UTF-8...


On Tue, Aug 6, 2013 at 9:07 PM, Robert Lipe <robertlipe@gpsbabel.org> wrote:
Remember, too, this is a moving target. What Maps wrote three years ago is less interesting than what it writes today. These files have a much shorter "use by" window than most of our inputs.


On Tue, Aug 6, 2013 at 9:31 PM, Conrad Meyer <cse.cem@gmail.com> wrote:
The other problem is that some google input files are UTF-8, and others are Latin1. So we can't just assume Latin1 all the time either =(.


On Tue, Aug 6, 2013 at 7:12 PM, Conrad Meyer <cse.cem@gmail.com> wrote:



On Tue, Aug 6, 2013 at 6:56 PM, tsteven4 <tsteven4@gmail.com> wrote:
I tried a test where I changed google_multisegment.js to make it legal UTF-8. The modified version of google_multisegment.js is attached. Note that editing the file like this isn't really OK, since google_multisegment.js is a web server response (see reference/google_multisegment.txt). But it does show some interesting results:

r4516:
a) everything unchanged: fail
b) only google_multisegment.js "a9" -> "c2a9": pass
c) google.cc changed as shown below, original google_multisegment.js: fail
d) google.cc changed as shown below, modified google_multisegment.js: pass
Index: google.cc
===================================================================
--- google.cc   (revision 4510)
+++ google.cc   (working copy)
@@ -262,7 +262,7 @@
   desc_handle = mkshort_new_handle();
   setshort_length(desc_handle, 12);

-  xml_init(fname, google_map, NULL);
+  xml_init(fname, google_map, "ISO-8859-1");
 }

 static void

It seems to be using utf-8 encoding even though asked for ISO-8859-1.

Steve


The problem is that QXmlStreamReader doesn't really support forcing an encoding. For well-formed XML documents, it will read the encoding="" parameter of the <?xml tag. However, it doesn't appear to find the content-type declaration in this HTML file. We can work around it by slurping the file into a QString as Latin-1, and then feeding that QString to the xml parser. This does seem to defeat the purpose of a stream-based XML parser. It may be an acceptable compromise for google.cc.
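
To make the decoding step concrete, here is a minimal standard-library sketch of what reading the bytes as Latin-1 amounts to. The function name latin1_to_utf8 is hypothetical; the actual workaround would rely on Qt's own Latin-1 decoding into a QString rather than this code.

```cpp
#include <string>

// Hypothetical helper (not from gpsbabel): re-encodes ISO-8859-1 bytes as
// UTF-8. Every Latin-1 byte maps to the Unicode code point of the same
// value, so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));  // ASCII passes through unchanged
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
    }
  }
  return out;
}
```

For the copyright sign, the single Latin-1 byte 0xA9 becomes the UTF-8 pair 0xC2 0xA9, which is the same "a9" -> "c2a9" change Steve made by hand in case (b).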

Conrad

 


On 8/6/2013 5:21 PM, Conrad Meyer wrote:


On Tue, Aug 6, 2013 at 4:00 PM, tsteven4 <tsteven4@gmail.com> wrote:
Maybe this is a hint:
~/work/gc/broken% xmllint reference/google_multisegment.js
reference/google_multisegment.js:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA9 0x32 0x30 0x31
r route.\x3c/div\x3e\x3cdiv id=\"cprt\" class=\"legal\"\x3e\x3cspan\x3eMap data
The xml encoding isn't declared in the file, but the html encoding is, ISO-8859-1.  This used to be what google.c used, but it isn't that simple, i.e. declaring ISO-8859-1 in the call to xml_init doesn't fix it.
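
The xmllint complaint above can be reproduced with a small structural check. This is only a sketch: first_invalid_utf8 is a hypothetical name, and it checks lead and continuation byte patterns only, not overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from gpsbabel or libxml2): returns the offset of
// the first byte that breaks UTF-8 structure, or std::string::npos if the
// input is structurally valid.
std::size_t first_invalid_utf8(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;  // number of continuation bytes expected
    if (c < 0x80) {
      extra = 0;
    } else if ((c & 0xE0) == 0xC0) {
      extra = 1;
    } else if ((c & 0xF0) == 0xE0) {
      extra = 2;
    } else if ((c & 0xF8) == 0xF0) {
      extra = 3;
    } else {
      return i;  // stray continuation byte or invalid lead byte
    }
    if (i + extra >= s.size()) {
      return i;  // sequence is truncated at end of input
    }
    for (std::size_t k = 1; k <= extra; ++k) {
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) {
        return i;  // expected a continuation byte
      }
    }
    i += extra + 1;
  }
  return std::string::npos;
}
```

A lone 0xA9 byte followed by "201", as in the xmllint output, is a continuation byte with no lead byte, so the check stops at that offset.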

It seems to me that the reader is losing it in readElementText when it tries to read /html/head/script. That seems to trash the reader. Immediately after that call reader.qualifiedName().length() is 0 instead of 6. It fails the same way on Fedora 18.

      if (cb) {
        QString c = reader.readElementText(QXmlStreamReader::IncludeChildElements);
        // reader is trashed here.
        cb(CSTRE(c), NULL);
        current_tag.chop(reader.qualifiedName().length() + 1);
      }

Modifying that to read:

      if (cb) {
        QString c = reader.readElementText(QXmlStreamReader::IncludeChildElements);
        if (current_tag == "/html/head/script") {
          printf("XXX got to html head script\n'''\n%s\n''' (%d)\n", CSTR(c), c.length());
        }
        cb(CSTRE(c), NULL);

I see:
XXX got to html head script
'''
//
''' (2)

Hrm.

Conrad


