This is a Java program which converts one or more similar XML files into CSV matrices. I made it in order to extract data from big XML files and gather it in files more easily opened with a spreadsheet.

The five optimization flavors the program offers are: 'raw', 'standard', plus 'extensive variant 1', 'extensive variant 2' and 'extensive variant 3'. Raw XML -> CSV transformation consists of 'flattening' an explicit data hierarchy into as many lines as there are XML leaf elements, relying on relative row/column positioning to convey the hierarchy (--> 'raw' optimization). Data grouping may be performed alongside parsing in order to merge closely related single elements onto the same line (--> 'standard' optimization), which, in turn, may be merged back with all related repeated elements (--> 'extensive variant 1' optimization). The 'extensive variant 2' optimization flavor, which uses a slightly different approach with its own advanced grouping/merging strategy, may produce better results with XML files bearing many mono-occurrence elements. The last flavor, 'extensive variant 3', maximizes 'extensive variant 2's grouping/merging capabilities by introducing virtual mono-occurrence elements.

The program performs 'standard' optimization with the default settings, which is in my opinion the best compromise because:
- each XML element's content appears only once in the output file(s);
- data grouping reduces the number of lines in the output file(s).
Moreover, the general CSV data layout stays very close to its XML counterpart. Put together, 'standard' optimization creates fairly small CSV files and, as such, was the one I chose for the default settings.

This said, 'raw' and 'extensive' optimizations (variant 1, 2 or 3) are not gadgets, because:
- 'extensive' optimization is the closest thing to what spreadsheets do when they load native XML;
- 'raw' optimization is the only way to cope with XML bearing heavily repeated leaf elements (for instance, a <paininthe>NECK</paininthe> element repeated 123000 times in sequence). While I personally consider it bad XML practice, such a thing might happen and I felt obliged to propose something to deal with it.

I wanted the program, generic as it is, to remain simple, at least for the end user (including myself), and to avoid upsetting people who just need plain vanilla XML to CSV conversion with boring details such as repeated elements, enclosing element declarations, and so on, just like I did with you a few lines above.

Aaron Renn's GetOpt helped to create a convenient console command-line version in true Unix fashion. It's not the first time. Well, thanks again, Aaron.

Special care was paid to memory usage to ensure that the program behaves well and fast even against very big XML files, provided that the data dependency level remains under control (please refer to the documentation for further information). DOM-like access to XML files was banished from the outset in favor of more basic but far more memory-friendly SAX parsing. The tricky part was to find a way to balance the need to keep enough data in memory to display it properly according to the selected optimization mode against the urge to empty the data buffer as soon as possible to avoid memory havoc.
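To give a rough idea of what such buffer-and-flush SAX handling can look like, here is a minimal, hypothetical sketch; the class name, the 'record' element, the semicolon separator and the command-line arguments are made up for the example and are not the program's actual code. It keeps one record's leaf values in memory and writes them out as a single CSV line as soon as the record element closes, so memory never holds more than one record at a time.

import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: buffer the leaf values of one <record> element,
// flush them as one CSV line when the element closes, then clear the buffer.
public class BufferAndFlushHandler extends DefaultHandler {

    private final PrintWriter out;
    private final List<String> currentRow = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    public BufferAndFlushHandler(PrintWriter out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // reset the text buffer for the next (possible) leaf
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // leaf text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("record".equals(qName)) {                    // one CSV line per record...
            out.println(String.join(";", currentRow));
            currentRow.clear();                          // ...then flush the buffer right away
        } else if (text.length() > 0) {
            currentRow.add(text.toString().trim());      // keep leaf values until the record closes
            text.setLength(0);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try (PrintWriter out = new PrintWriter(args[1])) {
            parser.parse(new java.io.File(args[0]), new BufferAndFlushHandler(out));
        }
    }
}

This sketch ignores CSV escaping and data grouping altogether; it is only meant to show the buffer-then-flush principle described above.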
In the end, this balancing act led to devising a kind of versatile buffering routine which stuffs just enough data in memory to perform consistent data grouping, then flushes it to the output as fast as possible (something you can see for yourself on the console if you run the program in debug mode).

Regards,
Laurent
lochrann@rocketmail.com