I am getting a segv when asking pyexpat to parse some bad xml. I'm not sure whether this is a problem with pyexpat or with expat, but I'm leaning towards expat after finding a (partial?) fix.
The stack of the segv is:
#0 0x010245f8 in normal_updatePosition (enc=0x1035fc0, ptr=0x73b000 <Address 0x73b000 out of bounds>, end=0x186cc50 "</<die", 'x' <repeats 76 times>, "</<die", 'x' <repeats 76 times>, "</<die", 'x' \ <repeats 30 times>..., pos=0x1317668) at /Volumes/extra/dvitek/trunk/third-party/python/Modules/expat/xmltok_impl.c:1748
#1 0x01011545 in XML_GetCurrentLineNumber (parser=0x13174d0) at /Volumes/extra/dvitek/trunk/third-party/python/Modules/expat/xmlparse.c:1793
#2 0x010093ca in set_error (self=0x725540, code=XML_ERROR_INVALID_TOKEN) at /Volumes/extra/dvitek/trunk/third-party/python/Modules/pyexpat.c:126
#3 0x0100c633 in get_parse_result (self=0x725540, rv=0) at /Volumes/extra/dvitek/trunk/third-party/python/Modules/pyexpat.c:918
#4 0x0100c6e0 in xmlparse_Parse (self=0x725540, args=0x7701e8) at /Volumes/extra/dvitek/trunk/third-party/python/Modules/pyexpat.c:940
#5 0x00068067 in PyCFunction_Call (func=0x770a30, arg=0x7701e8, kw=0x0) at Objects/methodobject.c:81
What appears to be going on is something like this:
There is an existing parser object whose m_positionPtr and m_eventPtr properly point somewhere inside m_buffer. Everything is fine at this point.
Someone calls XML_Parse on this parser. It proceeds to call XML_GetBuffer. XML_GetBuffer allocates a new buffer, but doesn't touch m_positionPtr or m_eventPtr. So these two pointers now point to freed memory (the old buffer).
After XML_GetBuffer returns, there is a call to XML_ParseBuffer. This function has a failure case just before it sets m_positionPtr and m_eventPtr to point somewhere better. It returns XML_STATUS_ERROR before touching m_positionPtr.
pyexpat sees this error code and proceeds to call XML_GetErrorLineNumber. XML_GetLineNumber dereferences m_positionPtr in the stack we see above, causing the crash.
I'm using the pyexpat (part of Python 2.6) interface to expat. Python 2.6 embeds a copy of expat's lib subdirectory at Modules/expat. To confirm that things were not better in 2.0.1 I just copied the 2.0.1 sources over the old ones. I was pleasantly surprised when it built 2.0.1 fine without any tinkering. Anyway, the problem still happened with 2.0.1.
My fix was to add these three lines near the end of XML_GetBuffer, under the assumption that nothing good can come of the dangling pointers:
+ positionPtr = NULL;
+ eventPtr = NULL;
+ eventEndPtr = NULL;
My test case was:
from xml.parsers.expat import ParserCreate
for abc in range(1,1000):
p = ParserCreate()
for i in xrange(100):
p.Parse('<die%s</' % ('x' * abc), False)
It only caused a crash on MacOSX; it worked OK on Linux. That's probably just because glibc's free isn't giving back pages to the OS as aggressively as the BSD one though.