We have an application that uses Expat to convert an XML data file into a binary version of the file. The file is currently about 600 MB and will grow.
We have hit a blocking problem: while parsing the file, the application's memory use grows enormously. It needs 4 GB of RAM to successfully finish a 600 MB file.
Our engineers explained that this is due to block memory management in Expat when it builds the XML tree. According to them, our XML has a lot of tags, each of which requires a separate 4K memory page even for 3 bytes of actual data.
Is there any way to improve this? Could anyone suggest how we can optimize this process? Are there any settings we can use to make it work?
Here is the file structure (I am not uploading the file since it is 600 MB, but I can provide it).
<?xml version="1.0" encoding="utf-8" ?>
<Groups>
  <Group>
    <ID>9</ID>
    <Status>Active</Status>
    <EffectiveDate>196912311900</EffectiveDate>
    <ExpireDate>203012301700</ExpireDate>
    <Elements>
      <Element>
        <ID>2345737</ID>
        <StartDate>20000101</StartDate>
        <EndDate>20351231</EndDate>
        <StartTime>00:00</StartTime>
        <EndTime>00:00</EndTime>
        <DayOfWeek>0,1,2,3,4,5,6</DayOfWeek>
        <DayOfMonth></DayOfMonth>
        <Month></Month>
        <Data>1619</Data>
        <Value_1>0.0000</Value_1>
        <Value_2 type ="RELATIVE">0.0000</Value_2>
        <Subelements>
          <Subelement>
            <ID>1</ID>
            <Value_3>0</Value_3>
            <Value_4>1</Value_4>
            <Value_5>0.0000</Value_5>
            <Value_6 type="FIXED">0.0000</Value_6>
          </Subelement>
          <Subelement>
            <ID>Default</ID>
            <Value_3>0</Value_3>
            <Value_4>1</Value_4>
            <Value_5>0.0000</Value_5>
            <Value_6 type="FIXED">0.0000</Value_6>
          </Subelement>
        </Subelements>
      </Element>
    </Elements>
  </Group>
</Groups>
There can be many Groups - in practice about 100
Each Group can have many elements - in practice about 100,000
Each Element can have many subelements - in practice about 4
I think your engineers are mistaken.
Expat does not build an in-memory tree of the XML file at all, and its memory consumption is negligible, even for multi-gigabyte files. The only exception is entity declarations in the DTD, which can use a lot of memory (search for "billion laughs attack"). If you don't have a DTD (it looks that way from your example), then I cannot see how Expat would consume much memory.
Maybe you have a software library/layer on top of Expat which builds the tree?
Karl
Hello Karl,
I am attaching a small file (testxml.cxx) that we use only for the parsing. It simply loads the XML and starts the parsing process, with no other processing at all.
The behavior is the same: the parsing application eats up to 3 GB of RAM before we kill it.
Could you tell us what we are doing wrong? Or is it a bug in Expat?
Attached in a rar file are the following files:
- testxml.cxx - test program
- rbx.xml.gz - compressed XML file
Sorry, I could not attach it. Here is the C application code:
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <string.h>
#include <time.h>
#include "xml.h"

//// Load() of CRatingEngine
int LoadXML(const char *szFileName)
{
    FILE *pXMLFile = fopen(szFileName, "rb");
    if (!pXMLFile) {
        printf("Load: File %s does not exist\n", szFileName);
        return 0;
    }
    if (fseek(pXMLFile, 0, SEEK_END) != 0) {
        printf("Load: The file %s is corrupted\n", szFileName);
        fclose(pXMLFile);
        return 0;
    }
    // ftell() returns long; keep it as long and check for failure
    long nSize = ftell(pXMLFile);
    if (nSize < 0) {
        printf("Load: Cannot determine the size of %s\n", szFileName);
        fclose(pXMLFile);
        return 0;
    }
    rewind(pXMLFile);

    //// Load all data once
    char *szBuffer = (char *) malloc((size_t) nSize + 1);
    if (!szBuffer) {
        printf("Parse: Insufficient memory to alloc %ld bytes\n", nSize);
        fclose(pXMLFile);
        return 0;
    }
    // Check that the whole file was actually read
    if (fread(szBuffer, 1, (size_t) nSize, pXMLFile) != (size_t) nSize) {
        printf("Load: Cannot read the file %s\n", szFileName);
        free(szBuffer);
        fclose(pXMLFile);
        return 0;
    }
    szBuffer[nSize] = 0;
    fclose(pXMLFile);

    printf("Parse: Starting ...\n");
    XmlParser xmlParser;
    XmlNodeRef xtRaps = xmlParser.Parse(szBuffer);
    // The buffer is no longer needed, whether or not parsing succeeded
    free(szBuffer);
    if (!xtRaps) {
        printf("Error: XML data incorrect (%s)\n", szFileName);
        return 0;
    }
    printf("Parse: End ...\n");
    return 1;
}

int main(int argc, char* argv[])
{
    // Point at the name instead of copying it into a fixed 20-byte buffer,
    // which a long command-line argument would overflow
    const char *szXMLFile = "rbx.xml";
#ifndef WIN32
    if (argc != 2) {
        printf("Command format: rbxdbgen [xmlfile]\n");
        return 0;
    }
    szXMLFile = argv[1];
#endif
    time_t start = time(NULL);
    printf("Load from XML file...\n");
    if (LoadXML(szXMLFile) == 1) {
        printf("Done in %ld seconds\n", (long) (time(NULL) - start));
    }
    else {
        printf("Failed\n");
        return 0;
    }
    return 0;
}
Karl,
The file is around 3MB compressed. I can send it to you via email if you need it.
I looked at your C code.
The XmlNodeRef class gave it away: you are not actually using Expat directly, but apparently some DOM wrapper around it. This wrapper builds an in-memory representation of the XML file and consequently consumes several times as much memory as the file size.
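To give a feel for why a tree of many tiny elements costs several times the file size, here is a rough back-of-the-envelope sketch in C. The node layout (struct DomNode) and the per-allocation overhead (ASSUMED_MALLOC_OVERHEAD) are assumptions made up for illustration; the real XmlNodeRef wrapper is not shown in this thread and may look quite different.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical DOM node layout, for illustration only. */
struct DomNode {
    char *name;                   /* separately allocated tag name */
    char *text;                   /* separately allocated character data */
    struct DomNode *parent;
    struct DomNode *first_child;
    struct DomNode *next_sibling;
    void *attributes;             /* attribute list, unused for plain leaves */
};

/* Assumed bookkeeping cost of one heap allocation, in bytes. */
#define ASSUMED_MALLOC_OVERHEAD 16

/* Estimated heap bytes for one leaf element: the node itself plus
   two string allocations (name and text), each with heap overhead. */
size_t estimate_leaf_bytes(const char *name, const char *text)
{
    size_t node_bytes = sizeof(struct DomNode) + ASSUMED_MALLOC_OVERHEAD;
    size_t name_bytes = strlen(name) + 1 + ASSUMED_MALLOC_OVERHEAD;
    size_t text_bytes = strlen(text) + 1 + ASSUMED_MALLOC_OVERHEAD;
    return node_bytes + name_bytes + text_bytes;
}

/* Bytes the same element occupies in the file: <name>text</name>,
   i.e. the name twice, the text once, and five punctuation characters. */
size_t source_bytes(const char *name, const char *text)
{
    return 2 * strlen(name) + strlen(text) + 5;
}
```

For <ID>9</ID>, source_bytes("ID", "9") is 10 bytes of file, while estimate_leaf_bytes("ID", "9") comes out several times larger under these assumptions. The exact ratio depends on pointer size, the allocator, and the wrapper's real node layout.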
If you check the Expat reference documentation, you will see that Expat-based code uses callbacks to inform the application of each tag/attribute it encounters, but does not build a tree.
Karl
By default, Expat uses the standard C runtime memory allocation functions, so if anything the issue would be there.
In addition, Expat can be configured to use custom memory allocators. This should be sufficient to adjust Expat to specific memory handling requirements.
Also, this bug report does not pinpoint a specific reproducible Expat bug. Closing this entry.