Re: [Gpsbabel-code] [PATCH] Implement a custom QTextCodec for faster XML character cleaning
Brought to you by:
robertl
From: tsteven4 <tst...@gm...> - 2013-09-15 16:05:18
|
I tried something like that as well per your suggestion yesterday. The documentation says setCodec should be called before writeStartDocument. When I tried this anyway I had something else crop up like a second BOM after the xml declaration. I don't remember the exact issue. On 9/15/2013 9:55 AM, Conrad Meyer wrote: > Well, there's always this giant kludge: > > writeStartDocument() { > setCodec("UTF-8"); > QXmlStreamWriter::writeStartDocument(); > setCodec(XmlTextCodec::instance); > } > > I don't think an XML streaming writer is particularly thread-safe > anywhere else, so maybe that isn't too painful? > > Conrad > > > On Sun, Sep 15, 2013 at 11:50 AM, tsteven4 <tst...@gm... > <mailto:tst...@gm...>> wrote: > > This is hitting one of the wrinkles I mentioned. It is outputting > a BOM which we don't want with UTF-8, and which QXmlStreamWriter > would not normally output. > > Note that most of our tests ignore whitespace and BOMs, so even > though testo appears to pass there can be problems (such as > invalid xml which started all this). I have been using this as a test >> ./gpsbabel -t -r -w -i tpo3 -f reference/LineStyles.tpo -o gpx -F >> ls.gpx >> od -Ax -txC -v ls.gpx > ls.gpx.hd > (od -Ax -txC -v reference/LineStyles.gpx >LineStyles.gpx.hd) >> tkdiff ls.gpx.hd LineStyles.gpx.hd > > > > On 9/15/2013 9:03 AM, Conrad Meyer wrote: >> Okay, here's the cleaned up patch. I'm happy with it at this >> point. We just use writeProcessingInstruction() to implement >> writeStartDocument() without writing out the encoding's name, >> which no longer pretends to be "UTF-8". >> >> Thanks, >> Conrad >> >> >> On Sat, Sep 14, 2013 at 8:39 PM, Conrad Meyer <cs...@gm... >> <mailto:cs...@gm...>> wrote: >> >> We can (and should?) codify 106 as a constant somewhere (if >> we're continuing to use it) -- it's a well defined thing, >> apparently: >> http://tools.ietf.org/html/rfc3808 >> >> But we could also go the route of just overriding the >> writeStartDocument() method ... IMO that's better than >> injecting our not-quite-UTF-8 codec into the global codec >> namespace as "UTF-8". I can do this (later) or someone else >> can beat me to it. Whatever :). >> >> Thanks, >> Conrad >> >> >> On Sat, Sep 14, 2013 at 8:02 PM, Robert Lipe >> <rob...@gm... <mailto:rob...@gm...>> wrote: >> >> Cool. I'm happy to let the final call be Steve's as he's >> closer to this issue than I am, but I like it. 5% >> slower, but correct with no excuses sounds pretty >> awesome. I'm a little uncomfortable with the coupling of >> the constant "106", but if it has to be, it has to be. >> >> In unrelated news, I just fixed the format options dialog >> in the GUI to not do dumb things where this is no range >> given for numeric options (this is the bugreport from a >> day or two ago) and to link help directly to the "right" >> page. >> >> RJL >> >> On Sat Sep 14 2013 at 4:46:28 PM, Conrad Meyer >> <cs...@gm... <mailto:cs...@gm...>> wrote: >> >> This changes XmlTextCodec to use codecForMib() to >> avoid getting itself and going recursive; also >> changes the same thing anywhere else we know we'll be >> looking for UTF-8. Kind of a hack. >> >> writeStartDocument() uses private methods, so I think >> the alternative is this other hack: >> >> gpsbabel::XmlStreamWriter::writeStartDocument() { >> QTextCodec *c = codec(); >> setCodec(QTextCodec::codecForMib(106)); >> QXmlStreamWriter::writeStartDocument(this); >> setCodec(c); >> } >> >> >> >> On Sat, Sep 14, 2013 at 5:31 PM, tsteven4 >> <tst...@gm... <mailto:tst...@gm...>> wrote: >> >> probably have to mess with this as well to agree >> with gpsbabel::XmlTextCodec::name() >>> XmlTextCodec::XmlTextCodec() : QTextCodec() >>> { >>> utf8Codec = QTextCodec::codecForName("UTF-8"); >>> } >> >> I see some other small tweaks that might help >> performance slightly. >> >> >> On 9/14/2013 3:24 PM, Conrad Meyer wrote: >>> Ah, I think I know what the problem is. >>> QTextCodec::name() is probably used to look up >>> codecs in the global database, so our >>> "utf8Codec" is actually our own object, and we >>> recurse... I originally implemented >>> QTextCodec::name() as "UTF-8" because >>> QXmlStreamWriter::writeDTD() uses the text codec >>> name. However, I think the fix is this: >>> >>> 1) gpsbabel::XmlTextCodec::name() -> >>> "UTF-8-For-XML" (something other than "UTF-8", >>> anyway) >>> 2) gpsbabel::XmlStreamWriter::writeDTD() -> >>> overrides the default and just writes UTF-8. >>> >>> I can come up with a fresh patch in a minute. >>> >>> Thanks, >>> Conrad >>> >>> >>> On Sat, Sep 14, 2013 at 5:01 PM, tsteven4 >>> <tst...@gm... <mailto:tst...@gm...>> >>> wrote: >>> >>> here is what I get with Qt 4.6.2, Centos 6. >>> It also causes a segmentation fault with my >>> build of Qt 5.1.1 on Centos 6 >>> >>>> Core was generated by `./gpsbabel -i gpx -f >>>> reference/bounds-test.gpx -o gpx -F /dev/null'. >>>> Program terminated with signal 11, >>>> Segmentation fault. >>>> #0 0x000000000042f194 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=Cannot access memory at address >>>> 0x7ffffe67bff8 >>>> ) at src/core/xmlstreamwriter.cc:63 >>>> 63 { >>>> Missing separate debuginfos, use: >>>> debuginfo-install glib2-2.22.5-7.el6.x86_64 >>>> glibc-2.12-1.107.el6_4.4.x86_64 >>>> libgcc-4.4.7-3.el6.x86_64 >>>> libstdc++-4.4.7-3.el6.x86_64 >>>> libusb-0.1.12-23.el6.x86_64 >>>> qt-4.6.2-26.el6_4.x86_64 >>>> zlib-1.2.3-29.el6.x86_64 >>>> (gdb) bt(20) >>>> #0 0x000000000042f194 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=Cannot access memory at address >>>> 0x7ffffe67bff8 >>>> ) at src/core/xmlstreamwriter.cc:63 >>>> #1 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #2 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #3 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #4 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #5 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #6 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #7 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #8 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #9 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #10 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #11 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #12 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #13 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #14 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #15 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #16 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #17 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #18 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #19 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> (More stack frames follow...) >>>> (gdb) bt(-20) >>>> #196491 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #196492 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #196493 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #196494 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #196495 0x000000000042f472 in >>>> QTextCodec::toUnicode (this=0x11ecd00, >>>> in=0x11efe50 "macintosh", length=9, >>>> state=0x0) at >>>> /usr/include/QtCore/qtextcodec.h:119 >>>> #196496 0x000000000042f1c4 in >>>> gpsbabel::XmlTextCodec::convertToUnicode >>>> (this=0x11ecd00, c=0x11efe50 "macintosh", >>>> n=9, s=0x0) at src/core/xmlstreamwriter.cc:64 >>>> #196497 0x0000003a43abb9cc in >>>> QString::fromAscii_helper(char const*, int) >>>> () from /usr/lib64/libQtCore.so.4 >>>> #196498 0x000000000040f11d in >>>> QString::QString (this=0x7fffff27a480, >>>> ch=0x11efe50 "macintosh") at >>>> /usr/include/QtCore/qstring.h:413 >>>> #196499 0x0000000000425589 in >>>> cet_cs_alias_qsort_cb (a=0x11f0040, >>>> b=0x11f0050) at cet_util.cc:354 >>>> #196500 0x0000003a3ca34da4 in >>>> msort_with_tmp () from /lib64/libc.so.6 >>>> #196501 0x0000003a3ca34b78 in >>>> msort_with_tmp () from /lib64/libc.so.6 >>>> #196502 0x0000003a3ca34b78 in >>>> msort_with_tmp () from /lib64/libc.so.6 >>>> #196503 0x0000003a3ca34b78 in >>>> msort_with_tmp () from /lib64/libc.so.6 >>>> #196504 0x0000003a3ca34b78 in >>>> msort_with_tmp () from /lib64/libc.so.6 >>>> #196505 0x0000003a3ca34b78 in >>>> msort_with_tmp () from /lib64/libc.so.6 >>>> #196506 0x0000003a3ca34b78 in >>>> msort_with_tmp () from /lib64/libc.so.6 >>>> #196507 0x0000003a3ca34b78 in >>>> msort_with_tmp () from /lib64/libc.so.6 >>>> #196508 0x0000003a3ca351dc in qsort_r () >>>> from /lib64/libc.so.6 >>>> #196509 0x0000000000425a65 in cet_register >>>> () at cet_util.cc:783 >>>> #196510 0x00000000004099d3 in main (argc=9, >>>> argv=0x7fffff27abc8) at main.cc:275 >>>> (gdb) >>> >>> On 9/14/2013 2:00 PM, Conrad Meyer wrote: >>>> Hi all, >>>> >>>> Using a custom QTextCodec, I was able to >>>> lower the performance overhead of >>>> bad-character-fixing in gpsbabel. Patch >>>> attached. >>>> >>>> Tl;dr: XML character cleaning with 5% >>>> overhead instead of 55% with QRegExp. Testo >>>> passes. >>>> >>>> Benchmark: >>>> ./gpsbabel -i gpx -f big.gpx [-o gpx -F >>>> /dev/null] # repeat [] section x5 >>>> >>>> Results on my system: >>>> Baseline (emit dirty characters, no >>>> regexp, no textcodec): >>>> 17.2 user, 4.8 system, 22.1 total >>>> >>>> Regexp (Qt4) (strips dirty characters in >>>> fields w/ Regexp): >>>> 26.7 user, 4.7 system, 31.5 total >>>> From baseline: 55% slower in cpu time, >>>> 43% overall >>>> >>>> Custom QTextCodec (strips dirty >>>> characters at Utf16->Utf8 time): >>>> 18.1 user, 4.7 system, 22.8 total >>>> From baseline: 5% slower in cpu time, 3% >>>> overall >>>> (Alternatively, 33% faster than QRegExp :-).) >>>> >>>> This method seems like a clear winner. Feel >>>> free to push gpsbabel::XmlTextCodec into >>>> its own header/source files; I didn't since >>>> it is probably only useful inside >>>> xmlstreamwriter. >>>> >>>> Thanks, >>>> Conrad >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> LIMITED TIME SALE - Full Year of Microsoft Training For Just $49.99! >>>> 1,500+ hours of tutorials including VisualStudio 2012, Windows 8, SharePoint >>>> 2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library Power Pack includes >>>> Mobile, Cloud, Java, and UX Design. Lowest price ever! Ends 9/22/13. >>>> http://pubads.g.doubleclick.net/gampad/clk?id=64545871&iu=/4140/ostg.clktrk >>>> >>>> >>>> _______________________________________________ >>>> Gpsbabel-code mailing listhttp://www.gpsbabel.org >>>> Gps...@li... <mailto:Gps...@li...> >>>> https://lists.sourceforge.net/lists/listinfo/gpsbabel-code >>> >>> >>> ------------------------------------------------------------------------------ >>> LIMITED TIME SALE - Full Year of Microsoft >>> Training For Just $49.99! >>> 1,500+ hours of tutorials including >>> VisualStudio 2012, Windows 8, SharePoint >>> 2013, SQL 2012, MVC 4, more. BEST VALUE: New >>> Multi-Library Power Pack includes >>> Mobile, Cloud, Java, and UX Design. Lowest >>> price ever! Ends 9/22/13. >>> http://pubads.g.doubleclick.net/gampad/clk?id=64545871&iu=/4140/ostg.clktrk >>> _______________________________________________ >>> Gpsbabel-code mailing list >>> http://www.gpsbabel.org >>> Gps...@li... >>> <mailto:Gps...@li...> >>> https://lists.sourceforge.net/lists/listinfo/gpsbabel-code >>> >>> >> >> >> >> > > |