From: Itamar Syn-H. <it...@di...> - 2010-06-08 22:45:57
Hi there,

I tried using csharp-VTD-XML to parse the he-Wikipedia XML dumps. The unpacked XML is about 1 GB in size, so not having Extended VTD-XML available for C# shouldn't be an issue. I'm running the following simple code:

    VTDGen vg = new VTDGen();
    AutoPilot ap = new AutoPilot();
    ap.selectXPath("//page");
    if (vg.parseFile(ofd.FileName, false))
    {
        VTDNav vn = vg.getNav();
        ap.bind(vn);
        while (ap.evalXPath() != -1)
        {
        }
    }

However, vg.parseFile never returns true for that file. I tried converting the file to UTF-8 with a BOM, and that didn't help either. I then copied a few sections (including Unicode characters) into a smaller file, and the smaller file parsed OK.

I really have no idea what's going on, especially since VX doesn't throw any exception. Any idea what could be going wrong, or how I can intercept any trace data?

The file is available from
http://download.wikimedia.org/hewiki/20100607/hewiki-20100607-pages-articles.xml.bz2
or
http://dumps.wikimedia.org/hewiki/latest/hewiki-latest-pages-articles.xml.bz2
(~200MB download).

Thanks.
Itamar.
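
One possible way to get at the underlying error is sketched below. It assumes that the C# port exposes the same setDoc/parse pair as the Java version, and that parse throws a ParseException where parseFile only returns false. The class name ParseDiagnostics and the command-line argument are hypothetical, and this is untested against the Wikipedia dump. Loading a 1 GB file into a single byte array may also fail in a 32-bit process, which the generic catch would report.

```csharp
using System;
using System.IO;
using com.ximpleware;

// Hypothetical diagnostic harness, not part of VTD-XML itself.
class ParseDiagnostics
{
    static void Main(string[] args)
    {
        string path = args[0]; // path to the unpacked XML dump

        try
        {
            // Read the whole document into memory ourselves so that
            // I/O and memory failures surface as exceptions.
            byte[] doc = File.ReadAllBytes(path);

            VTDGen vg = new VTDGen();
            vg.setDoc(doc);

            // Assumption: parse(false) throws on failure, whereas
            // parseFile only reports it as a false return value.
            vg.parse(false);

            Console.WriteLine("Parsed OK");
        }
        catch (ParseException e)
        {
            // Well-formedness, encoding or entity problems in the XML.
            Console.WriteLine("Parse error: " + e.Message);
        }
        catch (Exception e)
        {
            // Anything else, e.g. IOException or OutOfMemoryException.
            Console.WriteLine(e.GetType().Name + ": " + e.Message);
        }
    }
}
```

If this prints an out-of-memory or I/O error rather than a parse error, the problem would lie in loading the file rather than in the XML content.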