[Vtd-xml-users] VTD-XML for C# having parsing/encoding issues?


Hi there,

I tried using the C# version of VTD-XML to parse the Hebrew Wikipedia
(he-Wikipedia) XML dump. The unpacked XML is about 1 GB in size, so not having
Extended VTD-XML available for C# shouldn't be an issue.

I'm running the following simple code:

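    VTDGen vg = new VTDGen();
    AutoPilot ap = new AutoPilot();

    ap.selectXPath("//page");
    // ofd is an OpenFileDialog; 'false' disables namespace awareness
    if (vg.parseFile(ofd.FileName, false))
    {
        VTDNav vn = vg.getNav();
        ap.bind(vn);
        while (ap.evalXPath() != -1)
        {
            // empty for now; just iterating over the matches
        }
    }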

However, vg.parseFile never returns true for that file. I tried converting
the file to UTF-8 with a BOM, and that didn't help either. I also copied a few
sections (including Unicode characters) to a smaller file, and the smaller
file parsed OK.

I really have no idea what's going on, especially since VTD-XML doesn't throw
any exception. Any idea what could be going wrong, or how I can intercept any
trace data?
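
For reference, here is a minimal sketch of what I would try in order to get at
the underlying error. It assumes the C# port mirrors the Java API, i.e. that
VTDGen exposes setDoc(byte[]) and parse(bool), and that parse throws a
ParseException instead of returning a bool the way parseFile does. The class
name ParseDiagnostics and the command-line argument are just placeholders.

```csharp
using System;
using System.IO;
using com.ximpleware;

class ParseDiagnostics
{
    static void Main(string[] args)
    {
        // Placeholder: path to the XML file is passed on the command line.
        string path = args[0];
        try
        {
            // Load the whole document into memory; a ~1 GB file needs a
            // 64-bit process with enough free RAM for this to succeed.
            byte[] doc = File.ReadAllBytes(path);

            VTDGen vg = new VTDGen();
            vg.setDoc(doc);
            // Assumption: unlike parseFile, parse() lets the exception escape.
            vg.parse(false);
            Console.WriteLine("Parsed OK");
        }
        catch (ParseException e)
        {
            // The message should describe what the parser objected to.
            Console.WriteLine("Parse error: " + e.Message);
        }
        catch (Exception e)
        {
            // I/O errors, OutOfMemoryException and so on end up here.
            Console.WriteLine(e.GetType().Name + ": " + e.Message);
        }
    }
}
```

If this prints an OutOfMemoryException rather than a parse error, the problem
would be the file size rather than the encoding.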

The file is available from
http://download.wikimedia.org/hewiki/20100607/hewiki-20100607-pages-articles.xml.bz2
or
http://dumps.wikimedia.org/hewiki/latest/hewiki-latest-pages-articles.xml.bz2
(~200MB download).

Thanks.

Itamar.