From: Frederik H. <fre...@ar...> - 2004-05-04 14:13:58
|
I have an XML document in the ISO-8859-1 character set. When using libxml++-2.6, the SAX parser crashes when it encounters a character with an accent (é) in the on_characters method. I'm using the Glib::ustring class.

With libxml++-1.0 on the same document, it did not crash, but the character was transformed into two (strange) characters.

With libxml++-2.6 I have this backtrace:

#0 0xffffe410 in ?? ()
#1 0xbfffeb6c in ?? ()
#2 0x00000006 in ?? ()
#3 0x00005550 in ?? ()
#4 0x40397640 in raise () from /lib/tls/libc.so.6
#5 0x40399149 in abort () from /lib/tls/libc.so.6
#6 0x403190f5 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.5
#7 0x40319132 in std::terminate() () from /usr/lib/libstdc++.so.5
#8 0x403192b2 in __cxa_throw () from /usr/lib/libstdc++.so.5
#9 0x402ca4aa in std::__throw_length_error(char const*) () from /usr/lib/libstdc++.so.5
#10 0x4030a7b0 in std::string::_Rep::_S_create(unsigned, std::allocator<char> const&) () from /usr/lib/libstdc++.so.5
#11 0x4030b8ff in std::string& std::string::_M_replace_safe<char const*>(__gnu_cxx::__normal_iterator<char*, std::string>, __gnu_cxx::__normal_iterator<char*, std::string>, char const*, char const*) () from /usr/lib/libstdc++.so.5
#12 0x40307a6c in std::string::string(char const*, unsigned, std::allocator<char> const&) () from /usr/lib/libstdc++.so.5
#13 0x401a8022 in Glib::ustring::ustring(char const*, unsigned) () from /usr/lib/libglibmm-2.4.so.1
#14 0x400415e8 in xmlpp::SaxParserCallback::characters(void*, unsigned char const*, int) () from /usr/lib/libxml++-2.6.so.1
#15 0x4010b374 in xmlParseCharDataComplex () from /usr/lib/libxml2.so.2
#16 0xbfffeec0 in ?? ()
#17 0x00000002 in ?? ()

What could be the reason for this problem?

-- 
Frederik Himpe
|
From: Daniel V. <vei...@re...> - 2004-05-04 14:18:44
|
On Tue, May 04, 2004 at 04:13:49PM +0200, Frederik Himpe wrote:
> I have an XML document in the ISO-8859-1 character set. When using libxml++-2.6,
> the SAX parser crashes when it encounters a character with an accent (é) in
> the on_characters method. I'm using the Glib::ustring class.
>
> With libxml++-1.0 on the same document, it did not crash, but the character
> was transformed into two (strange) characters.

At the libxml2 SAX level, the document is first converted to UTF-8, so all characters() callbacks should only see UTF-8, and yes, the é will be converted into 2 bytes in that encoding.

I cannot explain why this would crash though; this sounds like a serious breakage.

Daniel

-- 
Daniel Veillard | Red Hat Desktop team http://redhat.com/
vei...@re... | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
|
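To illustrate the point about é becoming 2 bytes, here is a minimal sketch of an ISO-8859-1 to UTF-8 conversion. The function name latin1_to_utf8 is made up for this example; libxml2 does the conversion internally with its own encoding handlers, so this only shows the byte arithmetic, not libxml2's code.

```cpp
#include <string>

// Hypothetical helper (not part of libxml2 or libxml++):
// convert ISO-8859-1 bytes to UTF-8.
std::string latin1_to_utf8(const std::string& latin1)
{
  std::string out;
  out.reserve(latin1.size());

  for (unsigned char c : latin1)
  {
    if (c < 0x80)
    {
      // ASCII range: one byte, unchanged.
      out.push_back(static_cast<char>(c));
    }
    else
    {
      // 0x80..0xFF: two bytes, 110000xx 10xxxxxx.
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}
```

With this sketch, the Latin-1 byte 0xE9 (é) becomes the two bytes 0xC3 0xA9, so the 4-byte Latin-1 text "égal" is 5 bytes long once it reaches the SAX callback.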
From: Christophe de V. <cde...@al...> - 2004-05-04 16:39:51
|
Hi,

I could reproduce the problem with the saxparser example and the XML sample file from the bugzilla ticket. I obtain this backtrace:

#0 0x403b91b1 in kill () from /lib/libc.so.6
#1 0x4014a9c1 in pthread_kill () from /lib/libpthread.so.0
#2 0x4014accb in raise () from /lib/libpthread.so.0
#3 0x403b8df4 in raise () from /lib/libc.so.6
#4 0x403ba5a8 in abort () from /lib/libc.so.6
#5 0x402bca74 in __cxa_call_unexpected () from /usr/lib/./libstdc++.so.5
#6 0x402bcab1 in std::terminate () from /usr/lib/./libstdc++.so.5
#7 0x402bcc21 in __cxa_throw () from /usr/lib/./libstdc++.so.5
#8 0x40276a5c in std::__throw_length_error () from /usr/lib/./libstdc++.so.5
#9 0x402af83f in std::string::_Rep::_S_create () from /usr/lib/./libstdc++.so.5
#10 0x402b03a4 in std::string::_M_replace_safe<char const*> () from /usr/lib/./libstdc++.so.5
#11 0x402ad23a in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string () from /usr/lib/./libstdc++.so.5
#12 0x401d28b4 in Glib::ustring::ustring () from /usr/lib/./libglibmm-2.3.so.2
#13 0x40024924 in xmlpp::SaxParserCallback::characters (context=0x0, ch=0xbffff0a0 "égal", len=5) at saxparser.cc:390
#14 0x4005dd0d in xmlParseCharDataComplex () from /usr/lib/./libxml2.so.2
#15 0x4005d7d2 in xmlParseCharData () from /usr/lib/./libxml2.so.2
#16 0x40066bf1 in xmlParseContent () from /usr/lib/./libxml2.so.2
#17 0x40066fa9 in xmlParseElement () from /usr/lib/./libxml2.so.2
#18 0x400682e3 in xmlParseDocument () from /usr/lib/./libxml2.so.2
#19 0x4002383d in xmlpp::SaxParser::parse (this=0xbffff560) at saxparser.cc:152
#20 0x4002398a in xmlpp::SaxParser::parse_file (this=0xbffff560, filename=@0xbffff590) at saxparser.cc:173
#21 0x08049adc in main (argc=2, argv=0x0) at main.cc:45

We see at frame #13 that the callback gives the string UTF-8 encoded and its length in characters. This is without doubt correct, and those values are given to the ustring constructor directly.

Going through the ustring sources, I see that the constructor we use does this:

    ustring::ustring(const char* src, ustring::size_type n)
    :
      string_ (src, utf8_byte_offset(src, n))
    {}

Knowing that std::__throw_length_error () is supposed to be raised if the size is greater than max_size, I presume utf8_byte_offset returned std::string::npos.

My probably stupid question is the following: Murray, in ustring::ustring(const char* src, ustring::size_type n), is "n" supposed to be the length in UTF-8 characters, or in bytes?

Regards,

Christophe
|
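The presumption about utf8_byte_offset can be made concrete with a small stand-in. This is only a sketch under the assumption stated in the message (that the helper gives up and returns npos when the string ends before n characters have been skipped); byte_offset_for_chars is a hypothetical name, not glibmm's actual function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a "character count -> byte offset" helper.
// Returns std::string::npos if the NUL terminator is reached before
// n characters have been skipped. Assumes src is valid UTF-8.
std::size_t byte_offset_for_chars(const char* src, std::size_t n)
{
  const char* p = src;

  while (n > 0)
  {
    if (*p == '\0')
      return std::string::npos; // fewer than n characters available

    ++p; // skip the lead byte
    while ((static_cast<unsigned char>(*p) & 0xC0) == 0x80)
      ++p; // skip continuation bytes (10xxxxxx)

    --n;
  }
  return static_cast<std::size_t>(p - src);
}
```

For the 5-byte UTF-8 string "égal", asking for 4 characters yields a byte offset of 5, while asking for 5 characters runs into the terminator and yields npos. Handing npos to std::string as a length is what would then end in a length error.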
From: Murray C. <mu...@mu...> - 2004-05-04 16:52:07
|
On Tue, 2004-05-04 at 18:40 +0200, Christophe de VIENNE wrote:
> Hi,
>
> I could reproduce the problem with the saxparser example and the XML sample file from the bugzilla ticket.
> I obtain this backtrace:
>
> #0 0x403b91b1 in kill () from /lib/libc.so.6
> #1 0x4014a9c1 in pthread_kill () from /lib/libpthread.so.0
> #2 0x4014accb in raise () from /lib/libpthread.so.0
> #3 0x403b8df4 in raise () from /lib/libc.so.6
> #4 0x403ba5a8 in abort () from /lib/libc.so.6
> #5 0x402bca74 in __cxa_call_unexpected () from /usr/lib/./libstdc++.so.5
> #6 0x402bcab1 in std::terminate () from /usr/lib/./libstdc++.so.5
> #7 0x402bcc21 in __cxa_throw () from /usr/lib/./libstdc++.so.5
> #8 0x40276a5c in std::__throw_length_error () from /usr/lib/./libstdc++.so.5
> #9 0x402af83f in std::string::_Rep::_S_create () from /usr/lib/./libstdc++.so.5
> #10 0x402b03a4 in std::string::_M_replace_safe<char const*> () from /usr/lib/./libstdc++.so.5
> #11 0x402ad23a in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string () from /usr/lib/./libstdc++.so.5
> #12 0x401d28b4 in Glib::ustring::ustring () from /usr/lib/./libglibmm-2.3.so.2
> #13 0x40024924 in xmlpp::SaxParserCallback::characters (context=0x0, ch=0xbffff0a0 "égal", len=5) at saxparser.cc:390
> #14 0x4005dd0d in xmlParseCharDataComplex () from /usr/lib/./libxml2.so.2
> #15 0x4005d7d2 in xmlParseCharData () from /usr/lib/./libxml2.so.2
> #16 0x40066bf1 in xmlParseContent () from /usr/lib/./libxml2.so.2
> #17 0x40066fa9 in xmlParseElement () from /usr/lib/./libxml2.so.2
> #18 0x400682e3 in xmlParseDocument () from /usr/lib/./libxml2.so.2
> #19 0x4002383d in xmlpp::SaxParser::parse (this=0xbffff560) at saxparser.cc:152
> #20 0x4002398a in xmlpp::SaxParser::parse_file (this=0xbffff560, filename=@0xbffff590) at saxparser.cc:173
> #21 0x08049adc in main (argc=2, argv=0x0) at main.cc:45
>
> We see at frame #13 that the callback gives the string UTF-8 encoded and its length in characters.
> This is without doubt correct, and those values are given to the ustring constructor directly.
>
> Going through the ustring sources, I see that the constructor we use does this:
>
>     ustring::ustring(const char* src, ustring::size_type n)
>     :
>       string_ (src, utf8_byte_offset(src, n))
>     {}
>
> Knowing that std::__throw_length_error () is supposed to be raised if the size is greater than max_size, I presume utf8_byte_offset returned std::string::npos.
>
> My probably stupid question is the following:
> Murray, in ustring::ustring(const char* src, ustring::size_type n), is "n" supposed to be the length in UTF-8 characters, or in bytes?

I think it's meant to be the number of characters. That's not what I'd expect, but I guess it makes sense in terms of translating std::string to Glib::ustring.

Actually, in my debugger, the ch parameter to SaxParserCallback::characters() seems to be 0, which is a more likely cause for the crash. I guess we should check for 0.
|
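A check for 0 could look roughly like the sketch below. The function text_from_callback is a hypothetical helper written for this note, not code from libxml++; it only takes the same (pointer, length) pair that the characters() callback receives and treats the length as a byte count.

```cpp
#include <string>

// Hypothetical guard (not libxml++ code): turn the (ch, len) pair of a
// characters() callback into a byte string, returning an empty string
// instead of touching a null pointer or using a non-positive length.
std::string text_from_callback(const unsigned char* ch, int len)
{
  if (ch == nullptr || len <= 0)
    return std::string();

  // len is used as a number of bytes here, not characters.
  return std::string(reinterpret_cast<const char*>(ch),
                     static_cast<std::string::size_type>(len));
}
```

Whether an empty result should be forwarded to on_characters() or dropped is a separate design decision that this sketch leaves open.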
From: Christophe de V. <cde...@al...> - 2004-05-04 16:57:57
|
Hi again,

Here is a little program that makes things clearer:

    #include <iostream>
    #include <glibmm/ustring.h>

    int main()
    {
      // "égal" is 4 characters but 5 bytes in UTF-8
      // (assuming this source file is saved as UTF-8).
      try
      {
        Glib::ustring test("égal", 4); // n = length in characters
        std::cout << "Success" << std::endl;
      }
      catch (...)
      {
        std::cout << "Failure" << std::endl;
      }

      try
      {
        Glib::ustring test("égal", 5); // n = length in bytes
        std::cout << "Success" << std::endl;
      }
      catch (...)
      {
        std::cout << "Failure" << std::endl;
      }
    }

Its output is:

    Success
    Failure

So, we have two possible conclusions:

1) Glib::ustring::ustring(const char *, ustring::size_type) does not behave correctly.
2) We are not using Glib::ustring::ustring(const char *, ustring::size_type) correctly.

I'd say that the constructor from a const char * should expect the length in bytes, and not in characters: if I'm manipulating a char *, I have little chance of knowing the length in characters easily. So my personal conclusion is more 1) than 2).

Anyway, I guess that since Glibmm is stable, the behavior will not change. So we should probably use the Glib::ustring::ustring(const char *) constructor instead. For that we have to be sure that libxml2 will always give us strings with a trailing '\0'. Daniel, could you confirm?

Thanks.

Regards,

Christophe
|
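If the trailing '\0' turns out not to be guaranteed, one alternative is to convert the byte length coming from libxml2 into a character count before calling the (const char *, n) constructor. The sketch below shows only that conversion; utf8_char_count is a hypothetical helper, it assumes the buffer holds valid UTF-8, and it is not presented as the fix that libxml++ should or did adopt.

```cpp
#include <cstddef>

// Hypothetical helper: count the UTF-8 characters in the first len bytes
// of data by counting every byte that is not a continuation byte
// (10xxxxxx). Assumes the buffer holds valid UTF-8.
std::size_t utf8_char_count(const unsigned char* data, int len)
{
  std::size_t count = 0;

  for (int i = 0; i < len; ++i)
  {
    if ((data[i] & 0xC0) != 0x80)
      ++count; // lead byte or ASCII byte starts a new character
  }
  return count;
}
```

For the buffer from the backtrace ("égal", len=5) this gives 4, which is the value for which the test program above reports Success.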