I attach a function which converts an XML file into CSV. This is useful for importing into Excel or database from XML or web pages which have non-uniform fields. The function first queries the XML file for column names, then generates a CSV, populating each column for each record (some column values might be blank).
Let's say you scrape a web page and pull out the data in a format like this
The below code will convert this XML to CSV file like this
Title Color Author
new shirt plaid
The Selfish Gene Dawkins, Richard
Here is the Web-Harvest code that calls the xml-to-csv function.
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
<include path="xml-to-csv.xml"/>
<call name="xml-to-csv">
<!-- the XML file to read -->
<call-param name="xml_file">myXML.xml</call-param>
<!--theCSVfiletowrite--><call-paramname="data_file_name">myCSV.csv</call-param><!--ThiscontainstheXPaththatretirevesallrecords(rows).--><call-paramname="xpath_get_records">//record</call-param><!--ThiscontainstheXPaththatretirevesacolumnnames.--><call-paramname="xpath_get_column_names">distinct-values(//val/@id)</call-param><!--ThiscontainstheXPathtoretrieveeachcolumn's value for each record. For each record (row), the function xml-to-csv will loop through each column name and replace {column_name} with the current column'sname,togeneratetheXPathtoretrieveitsvalue.--><call-paramname="xpath_get_column_value">normalize-space(//val[@id='{column_name}']/child::text())</call-param><!--Delimiter--><call-paramname="delimiter">,</call-param><!--optional,defaulttonodelimiter--><!--Textqualifiertosurroundeachcolumn's value with. --> <call-param name="qualifier">"</call-param><!-- optional, default to no qualifier --> <!-- Escape each qualifier character appearing in each column value with this character (escape any special regexp characters) --> <call-param name="escape_qualifier">'</call-param><!--optional,defaulttonoescaping--><!--ShouldtheXMLfilebetidied?--><call-paramname="tidy_xml">false</call-param><!--optional,default=false--><!--Specifywhichrecordstoretrieve,e.g.1-10or-30or10---><call-paramname="filter">1-2</call-param><!--optional,defaulttoprocessallrecords--></call>
</config>
Below is the xml-to-csv.xml file itself.
<?xml version="1.0" encoding="UTF-8"?>
<config>
<function name="xml-to-csv">
<!-- Get the XML content from the XML file. Tidy if instructed to do so. --><case><if condition="${tidy_xml}"><var-def name="xml_content"><html-to-xml><file action="read" path="${xml_file}" charset="UTF-8"/></html-to-xml></var-def></if><else><var-def name="xml_content"><file action="read" path="${xml_file}" charset="UTF-8"/></var-def></else></case><!-- Query distinct column names.
Change "//val@id" below to the XPath which defines each column name. --><var-def name="column_names"><xpath expression="${xpath_get_column_names}"><var name="xml_content"/></xpath></var-def><!-- Write the headers to a new file.
Change charset if needed. --><var-def name="first_record">true</var-def><file action="write" path="${data_file_name}" charset="UTF-8"><loop item="column_name" index="col"><list><var name="column_names"/></list><body><case><if condition="${first_record}"><empty><var-def name="first_record">false</var-def></empty><template>${qualifier}${column_name}${qualifier}</template></if><else><template>${delimiter}${qualifier}${column_name}${qualifier}</template></else></case></body></loop></file><empty><var-def name="records"><xpath expression="${xpath_get_records}"><var name="xml_content"/></xpath></var-def></empty><!-- Loop thru each record. --><loop item="record" index="i" filter="${filter}"><list><var name="records"/></list><body><!-- Append each data row onto the file. --><file action="append" path="${data_file_name}" charset="UTF-8"><!-- Start new data row. --><template>${const.lf}</template><!-- Add each column's value to the CSV row. --><empty><var-def name="first_column">true</var-def></empty><loop item="column_name" index="c"><list><var name="column_names"/></list><body><empty><var-def name="this_val"><empty><var-def name="this_xpath"><regexp replace="true"><regexp-pattern><template>\{column_name\}</template></regexp-pattern><regexp-source><template>${xpath_get_column_value}</template></regexp-source><regexp-result><template>${column_name}</template></regexp-result></regexp></var-def></empty><xpath expression="${this_xpath}"><var name="record"/></xpath></var-def><!-- Escape any qualifier characters which appear in the text. --><case><if condition="${escape_qualifier==null}"><var-def name="escaped_val"><var name="this_val"/></var-def></if><else><var-def name="escaped_val"><regexp replace="true"><regexp-pattern><template>${qualifier}</template></regexp-pattern><regexp-source><template>${this_val}</template></regexp-source><regexp-result><template>${escape_qualifier}</template></regexp-result></regexp></var-def></else></case></empty><case><if condition="${first_column}"><empty><var-def name="first_column">false</var-def></empty><template>${qualifier}${escaped_val}${qualifier}</template></if><else><template>${delimiter}${qualifier}${escaped_val}${qualifier}</template></else></case></body></loop></file></body></loop></function>
</config>
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I attach a function which converts an XML file into CSV. This is useful for importing into Excel or database from XML or web pages which have non-uniform fields. The function first queries the XML file for column names, then generates a CSV, populating each column for each record (some column values might be blank).
Let's say you scrape a web page and pull out the data in a format like this
<records>
<record>
<val id="Title">new shirt</val>
<val id="Color">plaid</val>
</record>
<record>
<val id="Title">The Selfish Gene</val>
<val id="Author">Dawkins, Richard</val>
</record>
</records>
The below code will convert this XML to CSV file like this
Title Color Author
new shirt plaid
The Selfish Gene Dawkins, Richard
Here is the Web-Harvest code that calls the xml-to-csv function.
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
<include path="xml-to-csv.xml"/>
<call name="xml-to-csv">
<!-- the XML file to read -->
<call-param name="xml_file">myXML.xml</call-param>
</config>
Below is the xml-to-csv.xml file itself.
<?xml version="1.0" encoding="UTF-8"?>
<config>
<function name="xml-to-csv">
</config>
Hi, this is a great scrip - tks for sharing!
I'm having a small issue with the output. The title and results don't lineup
properly - can you help?
This is how the csv looks in excel:
Title
Color
Author
new shirt
plaid
The Selfish Gene
Dawkins, Richard
Fyi, I had to change the following property to be able to run it:
<template>${const.lf}</template>
<template>${sys.lf}</template>