Menu

#389 Byte count in large XML files fails

closed-fixed
None
5
2006-03-06
2005-09-08
Rolf Ade
No

XML_GetCurrentByteIndex(XML_Parser parser) returns a
long, which is at least on the most 32 bit Systems 32
bit long. That means, for XML input larger than 2 GByte
file size, XML_GetCurrentByteIndex() returns does not
return the right number.

Sure, such big XML files will be parsed in chunks, so
it is possbile, to keep track about the nr of overflows
by self, but come on.

It's surely a limbo dance by its own to introcude long
long in a source, so portable as expat, but that would
be it.

If you switch to long long if avaliable for this,
please consider also XML_GetCurrentLineNumber() and
XML_GetCurrentColumnNumber(). They return an int, which
is on most 32-byte systems 2 Gig. Though, I'm not
stumbled over this two limits in real life, as I in
fact did with XML_GetCurrentByteIndex().

Discussion

  • Karl Waclawek

    Karl Waclawek - 2005-11-27
    • assigned_to: nobody --> kwaclaw
     
  • Karl Waclawek

    Karl Waclawek - 2005-11-27

    Logged In: YES
    user_id=290026

    Rolf,

    should the type be 64 bit integer on all platforms,
    or 32bit on 32bit platforms and 64bit on 64bit platforms?
    I think we are talking about m_parseEndByteIndex,
    POSITION.lineNumber and POSITION.columnNumber.

    Options could be size_t, ptrdiff_t.
    MS VC++ 6.0 does not know about long long, but it knows
    about __int64. Is there an ANSI definition for 64 bit ints?

    What do you suggest that works on all platforms?

    Karl

     
  • Rolf Ade

    Rolf Ade - 2005-11-27

    Logged In: YES
    user_id=13222

    Karl,

    Most reasonable 32bit platforms have support for file sizes
    > 2 GB these days even on 32. It was in fact a 32bit
    platform, at which I stumbled over the problem. That for
    your easy question.

    Much harder is how to slove this in a portable way. I'm
    afraid that may need platform depending #defines (with
    fallback to long).

    I'll go out digging what other portable software does in
    this case and will come back with a more concrete proposal.

    rolf

     
  • Karl Waclawek

    Karl Waclawek - 2005-11-27

    Logged In: YES
    user_id=290026

    You are right, Rolf, it should be 64 bits even on a 32bit
    platform. I guess I should make a note in the docs that
    Expat supports > 2GB files, as long as each chunk passed
    to the XML_Parse routines is smaller than 2GB.

    There are also issues around compiling Expat on a 64bit
    platform, but at least for VC++, someone has provided
    a patch (bug # 1105135) which looks it should work
    on other platforms as well (just a bunch of type casts).

    One issue I have already seen is that VC++ 6.0 does not
    know about long long.

    Thanks for having a look at the cross-platform issue.

    I am trying to get Expat 2.0 released despite Fred
    not being active on Expat anymore.

    Karl

     
  • Karl Waclawek

    Karl Waclawek - 2005-11-27

    Logged In: YES
    user_id=290026

    Just some notes, so that I don't forget

    - should it be configurable?
    some may not want the overhead of a 64bit integer,
    especially for line number and column number.
    - is long long acceptable everywhere else
    (other than VC++)?
    - the byte index could be negative, but not
    line/column number and byte count, right?

     
  • Rolf Ade

    Rolf Ade - 2005-11-28

    Logged In: YES
    user_id=13222

    Configurable: No
    There is nearly no overhead: just a few variables (at max) 8
    bytes long instead of 4 bytes. Also speedwise: not mesuarable.

    long long acceptable everywhere: Probably no
    Some very old or limited embedded system may not have a long
    long (or equivalent). Therefor we probably need defines.

    Byte index could be negative: don't think so.
    How could that happen? Byte index starts at 0 and grows. Or
    do I miss something.?

     
  • Karl Waclawek

    Karl Waclawek - 2005-11-28

    Logged In: YES
    user_id=290026

    On a 32bit CPU, 64bit integer operations are considerably
    slower than 32 bit operations. On the other hand
    XMLUpdatePosition isn't called that often - mostly when
    you actually request the line/column number.
    So, I agree - no configuration necessary.

    For the other point:
    If you look at the XML_GetCurrentByteIndex() code, it can
    return -1, and it is calculated using a subtraction. So in
    practice and theory, it must be a signed integer.

    XML_GetCurrentByteCount is derived from a subtraction as
    well, but we know it will be positive because eventEndPtr
    should always be larger than eventPtr. So we could risk
    using an unsigned integer.

    Just playing around, I added this to expat_external.h:
    #ifdef XML_USE_MSC_EXTENSIONS
    typedef __int64 XML_Int64;
    typedef unsigned __int64 XML_UInt64;
    #else
    typedef long long XML_Int64;
    typedef unsigned long long XML_UInt64;
    #endif

    What do you think?

    Karl

     
  • Rolf Ade

    Rolf Ade - 2005-11-28

    Logged In: YES
    user_id=13222

    XML_GetCurrentByteIndex() could return -1:
    Of course! You're right. And it makes sense. A freshly
    created or reseted parser without the first XML_Parse()
    call returns -1 on XML_GetCurrentByteIndex(), to signal this
    fact: it is not right at the start of the document, but
    there isn't any parsing started yet. Nice detail. I should
    have looked at the implementation, before replying. Note:
    That detail isn't mentioned in the documentation.

    I'm fine with a signed long long. 2^63 should be big enough,
    for the next few weeks ;-).

    Re the defines: Basically yes. It's just, that I'm pretty
    sure, we need one round more: some configure check for long
    long and depending on that result defining XML_?Int64 as
    long long or just long.

    I'll look something up (but being on deadline catching, may
    need a bit time).

     
  • Karl Waclawek

    Karl Waclawek - 2005-11-30

    Logged In: YES
    user_id=290026

    Rolf,

    I opened a discussion about this on the discussion list.
    The main issue - IMO - is backwards compatibility.
    I don't know how many apps on Linux, for instance, rely
    on the Expat API staying binary compatible.

    Karl

     
  • Rolf Ade

    Rolf Ade - 2005-12-04

    Logged In: YES
    user_id=13222

    Karl,

    I followed it.

    I must confess, I've no clear opinion about the backward
    compatibilty problem. For me, that is not a problem. I
    distribute the expat sources with my software and link it
    statically (so no problem even for binary distribution).

    One note: I don't think, that XML_GetCurrentLineNumber() and
    XML_GetCurrentColumnNumber() are seldom used calls. Any
    resonable XML processing software should be able to deal
    with the case, that it get feeded with not well-formed XML
    input. I'd guess, most programmers use that two calls to
    provide a usefull error msg to their users.

    I'm happy with any solution. I feel, it may be to much bloat
    (as you said) to have all that functions twice with
    different return type. So, I'd favour a build flag. (With
    the default 32 bit changed to 64 bit right after the 2.0
    release). But I'm not really sure.

    Yes, I still have to provide some bits to check for a long
    long on unix plattforms, I now.

    rolf

     
  • Karl Waclawek

    Karl Waclawek - 2005-12-12

    Logged In: YES
    user_id=290026

    Rolf,

    you are right about XML_GetCurrentLineNumber() and
    XML_GetCurrentColumnNumber().

    The backwards compatibility is an issue for others, as far
    as I can tell, so I think I have it off by default.

    Attached is a patch I quickly put together today.
    Basic points: The switch is called "XML_LARGE_SIZE".
    I modified the type names to XML_Index (for signed large
    ints) and XML_Size (for unsigned large ints). If you have
    better names, let me know.

    Would you apply the patch and see how it works for you?
    (I assume long long works everywhere except Windows).

    Karl

     
  • Karl Waclawek

    Karl Waclawek - 2005-12-12

    Patch to add 64bit integer support.

     
  • Karl Waclawek

    Karl Waclawek - 2005-12-22

    Second patch for large integer support

     
  • Karl Waclawek

    Karl Waclawek - 2005-12-22

    Logged In: YES
    user_id=290026

    Rolf,

    I think I go for the simple solution: use long long, and if
    a compiler doesn't support it, too bad, the programmer will
    have to use 32 bit integers:

    #ifdef XML_LARGE_SIZE
    typedef long long XML_Index;
    typedef unsigned long long XML_Size;
    #else
    typedef long XML_Index;
    typedef unsigned long XML_Size;
    #endif

    Even if undefined, the above declarations will change the
    return values of XML_GetCurrentLine/ColumnNumber from int to
    unsigned long. Hope that is not a problem (in theory, long
    >= int).

    Anyway, I'll apply the attached patch (LargeInt2.diff)
    sometime soon and create a pre-release for download.

    Karl

     
  • Karl Waclawek

    Karl Waclawek - 2006-01-13

    Logged In: YES
    user_id=290026

    Rolf,

    I think I can close that now.
    Do you agree, or is there something you are still missing?

     
  • Karl Waclawek

    Karl Waclawek - 2006-01-13
    • status: open --> open-fixed
     
  • Karl Waclawek

    Karl Waclawek - 2006-03-06

    Logged In: YES
    user_id=290026

    Closing this issue - no complaints received.

     
  • Karl Waclawek

    Karl Waclawek - 2006-03-06
    • status: open-fixed --> closed-fixed
     
  • Nobody/Anonymous

    Logged In: NO

    Maybe you have a look what apache does for apr_size_t.

     

Log in to post a comment.

MongoDB Logo MongoDB