#479 expat memory consumption issue - advice needed

Milestone: Not a Bug
Status: closed-rejected
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-02-26
Created: 2009-03-31
Creator: Alex Manov
Private: No

We have an application that uses expat to convert an XML data file into a binary version of the file. The file is currently about 600 MB but will grow.
We have hit a blocking problem: while parsing the file, the application uses a huge amount of memory. It needs 4 GB of RAM to successfully finish a 600 MB file.
Our engineers explained that this is due to block memory management in expat when it builds the XML tree. They explained that our XML has a lot of tags, each of which in turn requires a separate 4K memory page, even for 3 bytes of actual data.

Is there any way to improve this? Could anyone suggest how we can optimize this process? Are there any settings we can use to make it work?

Here is the file structure (I am not uploading the file since it is 600 MB, but I can provide it).
<?xml version="1.0" encoding="utf-8" ?>
<Groups>
<Group>
<ID>9</ID>
<Status>Active</Status>
<EffectiveDate>196912311900</EffectiveDate>
<ExpireDate>203012301700</ExpireDate>
<Elements>
<Element>
<ID>2345737</ID>
<StartDate>20000101</StartDate>
<EndDate>20351231</EndDate>
<StartTime>00:00</StartTime>
<EndTime>00:00</EndTime>
<DayOfWeek>0,1,2,3,4,5,6</DayOfWeek>
<DayOfMonth></DayOfMonth>
<Month></Month>
<Data>1619</Data>
<Value_1>0.0000</Value_1>
<Value_2 type ="RELATIVE">0.0000</Value_2>
<Subelements>
<Subelement>
<ID>1</ID>
<Value_3>0</Value_3>
<Value_4>1</Value_4>
<Value_5>0.0000</Value_5>
<Value_6 type="FIXED">0.0000</Value_6>
</Subelement>
<Subelement>
<ID>Default</ID>
<Value_3>0</Value_3>
<Value_4>1</Value_4>
<Value_5>0.0000</Value_5>
<Value_6 type="FIXED">0.0000</Value_6>
</Subelement>
</Subelements>
</Element>
</Elements>
</Group>
</Groups>

There can be many Groups - in practice about 100
Each Group can have many elements - in practice about 100,000
Each Element can have many subelements - in practice about 4

Discussion

  • Karl Waclawek
    2009-03-31

    I think your engineers are mistaken.
    Expat does not build an in-memory tree of the XML file at all, and its memory consumption is negligible, even for multi-gigabyte files. The only exception is entity declarations in the DTD, which can use a lot of memory (search for "billion laughs attack"). If you don't have a DTD (your example suggests you don't), then I cannot see how Expat would consume much memory.

    Maybe you have a software library/layer on top of Expat which builds the tree?

    Karl

     
  • Alex Manov
    2009-04-02

    Hello Karl,

    I am attaching a small file (testxml.cxx) that we use only for the parsing. It simply loads the XML and starts the parsing process, with no other processing at all.
    The behavior is the same: the parsing application eats up to 3 GB of RAM before we kill it.

    Could you tell us what we are doing wrong? Or is it a bug in expat?

    Attached in a rar file are the following files:
    - testxml.cxx - test program
    - rbx.xml.gz - compressed XML file

     
  • Alex Manov
    2009-04-02

    Sorry, I could not attach it. Here is the C application code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include "xml.h"   // project header declaring XmlParser and XmlNodeRef

    //// Load() of CRatingEngine
    int LoadXML ( const char *szFileName)
    {
        FILE *pXMLFile = fopen ( szFileName, "rb");
        if ( !pXMLFile) {
            printf ("Load: File %s does not exist\n", szFileName);
            return 0;
        }

        if ( fseek ( pXMLFile, 0, SEEK_END) != 0) {
            printf ("Load: The file %s is corrupted\n", szFileName);
            fclose ( pXMLFile);
            return 0;
        }

        long nSize = ftell ( pXMLFile);
        if ( nSize < 0) {
            printf ("Load: Cannot determine the size of %s\n", szFileName);
            fclose ( pXMLFile);
            return 0;
        }
        rewind ( pXMLFile);

        //// Load all data at once
        char *szBuffer = (char *) malloc ( (size_t) nSize + 1);
        if ( !szBuffer) {
            printf ("Parse: Insufficient memory to alloc %ld bytes\n", nSize);
            fclose ( pXMLFile);
            return 0;
        }

        // Read the whole file and make sure nothing was truncated
        size_t nRead = fread ( szBuffer, 1, (size_t) nSize, pXMLFile);
        fclose ( pXMLFile);
        if ( nRead != (size_t) nSize) {
            printf ("Load: Could not read all of %s\n", szFileName);
            free ( szBuffer);
            return 0;
        }
        szBuffer[nSize] = 0;

        printf ("Parse: Starting ...\n");

        XmlParser xmlParser;
        XmlNodeRef xtRaps = xmlParser.Parse ( szBuffer);

        // The raw buffer is released on both the success and the error path
        free ( szBuffer);

        if ( !xtRaps) {
            printf ("Error: XML data incorrect (%s)\n", szFileName);
            return 0;
        }

        printf ("Parse: End ...\n");
        return 1;
    }

    int main(int argc, char* argv[])
    {
        // Point at the argument directly instead of copying into a fixed 20-byte array
        const char *szXMLFile = "rbx.xml";

    #ifndef WIN32
        if ( argc != 2) {
            printf ("Command format: rbxdbgen [xmlfile]\n");
            return 1;
        }
        szXMLFile = argv[1];
    #endif

        time_t start = time ( NULL);

        printf("Load from XML file...\n");
        if ( LoadXML ( szXMLFile) == 1) {
            printf("Done (%.0f s)\n", difftime ( time ( NULL), start));
        }
        else {
            printf("Failed\n");
            return 1;
        }

        return 0;
    }

     
  • Alex Manov
    2009-04-02

    Karl,

    The file is around 3MB compressed. I can send it to you via email if you need it.

     
  • Karl Waclawek
    2009-04-02

    I looked at your C code.

    The XmlNodeRef class gave it away: you are not actually using Expat directly, but apparently some DOM wrapper around it. This wrapper builds an in-memory representation of the XML file and consequently consumes several times as much memory as the file size.

    If you check the Expat reference documentation, you will see that Expat-based code uses callbacks to inform the application of each tag/attribute it encounters, but does not build a tree.

    Karl

     
  • Karl Waclawek
    2012-02-26

    By default, Expat uses the standard C runtime memory allocation functions, so if anything, the issue would be there.
    In addition, Expat can be configured to use custom memory allocators. This should be sufficient to adjust Expat to specific memory handling requirements.

    Also, this bug report does not pinpoint a specific, reproducible Expat bug. Closing this entry.

     
  • Karl Waclawek
    2012-02-26

    • status: open --> closed-rejected