Menu

Web-Harvest function to convert XML to CSV

David
2006-12-26
2012-09-04
  • David

    David - 2006-12-26

    I attach a function which converts an XML file into CSV. This is useful for importing into Excel or database from XML or web pages which have non-uniform fields. The function first queries the XML file for column names, then generates a CSV, populating each column for each record (some column values might be blank).

    Let's say you scrape a web page and pull out the data in a format like this

    <records>
    <record>
    <val id="Title">new shirt</val>
    <val id="Color">plaid</val>
    </record>
    <record>
    <val id="Title">The Selfish Gene</val>
    <val id="Author">Dawkins, Richard</val>
    </record>
    </records>

    The below code will convert this XML to CSV file like this

    Title Color Author
    new shirt plaid
    The Selfish Gene Dawkins, Richard

    Here is the Web-Harvest code that calls the xml-to-csv function.
    <?xml version="1.0" encoding="UTF-8"?>

    <config charset="ISO-8859-1">
    <include path="xml-to-csv.xml"/>
    <call name="xml-to-csv">
    <!-- the XML file to read -->
    <call-param name="xml_file">myXML.xml</call-param>

        &lt;!-- the CSV file to write --&gt;
        &lt;call-param name=&quot;data_file_name&quot;&gt;myCSV.csv&lt;/call-param&gt;
    
        &lt;!-- This contains the XPath that retireves all records (rows). --&gt;
        &lt;call-param name=&quot;xpath_get_records&quot;&gt;//record&lt;/call-param&gt;
    
        &lt;!-- This contains the XPath that retireves a column names. --&gt;
        &lt;call-param name=&quot;xpath_get_column_names&quot;&gt;distinct-values(//val/@id)&lt;/call-param&gt;
    
        &lt;!-- This contains the XPath to retrieve each column's value for each record.  
        For each record (row), the function xml-to-csv will loop through each column name and
        replace {column_name} with the current column's name, to generate the XPath to retrieve 
        its value. --&gt;
        &lt;call-param name=&quot;xpath_get_column_value&quot;&gt;normalize-space(//val[@id='{column_name}']/child::text())&lt;/call-param&gt;
    
        &lt;!-- Delimiter --&gt;
        &lt;call-param name=&quot;delimiter&quot;&gt;,&lt;/call-param&gt;&lt;!-- optional, default to no delimiter --&gt;
    
        &lt;!-- Text qualifier to surround each column's value with. --&gt;
        &lt;call-param name=&quot;qualifier&quot;&gt;&quot;&lt;/call-param&gt;&lt;!-- optional, default to no qualifier --&gt;
    
        &lt;!-- Escape each qualifier character appearing in each column value with this character (escape any special regexp characters) --&gt;
        &lt;call-param name=&quot;escape_qualifier&quot;&gt;'&lt;/call-param&gt;&lt;!-- optional, default to no escaping --&gt;
    
        &lt;!-- Should the XML file be tidied? --&gt;
        &lt;call-param name=&quot;tidy_xml&quot;&gt;false&lt;/call-param&gt;&lt;!-- optional, default=false --&gt;
    
        &lt;!-- Specify which records to retrieve, e.g. 1-10 or -30 or 10- --&gt;
        &lt;call-param name=&quot;filter&quot;&gt;1-2&lt;/call-param&gt;&lt;!-- optional, default to process all records --&gt;
    &lt;/call&gt;
    

    </config>

    Below is the xml-to-csv.xml file itself.

    <?xml version="1.0" encoding="UTF-8"?>

    <config>
    <function name="xml-to-csv">

                &lt;!-- Get the XML content from the XML file.  Tidy if instructed to do so. --&gt;
                &lt;case&gt;&lt;if condition=&quot;${tidy_xml}&quot;&gt;
                        &lt;var-def name=&quot;xml_content&quot;&gt;
                                &lt;html-to-xml&gt;
                                    &lt;file action=&quot;read&quot; path=&quot;${xml_file}&quot; charset=&quot;UTF-8&quot;/&gt;
                            &lt;/html-to-xml&gt;
                    &lt;/var-def&gt;
                &lt;/if&gt;&lt;else&gt;
                        &lt;var-def name=&quot;xml_content&quot;&gt;
                            &lt;file action=&quot;read&quot; path=&quot;${xml_file}&quot; charset=&quot;UTF-8&quot;/&gt;
                    &lt;/var-def&gt;
                &lt;/else&gt;&lt;/case&gt;
    
            &lt;!-- Query distinct column names.  
            Change &quot;//val@id&quot; below to the XPath which defines each column name. --&gt;
            &lt;var-def name=&quot;column_names&quot;&gt;
                    &lt;xpath expression=&quot;${xpath_get_column_names}&quot;&gt;
                            &lt;var name=&quot;xml_content&quot;/&gt;
                    &lt;/xpath&gt;
            &lt;/var-def&gt;
    
            &lt;!-- Write the headers to a new file.  
            Change charset if needed. --&gt;
            &lt;var-def name=&quot;first_record&quot;&gt;true&lt;/var-def&gt;
            &lt;file action=&quot;write&quot; path=&quot;${data_file_name}&quot; charset=&quot;UTF-8&quot;&gt;
                    &lt;loop item=&quot;column_name&quot; index=&quot;col&quot;&gt;
                            &lt;list&gt;&lt;var name=&quot;column_names&quot;/&gt;&lt;/list&gt;
                        &lt;body&gt;
                                &lt;case&gt;&lt;if condition=&quot;${first_record}&quot;&gt;
                                            &lt;empty&gt;&lt;var-def name=&quot;first_record&quot;&gt;false&lt;/var-def&gt;&lt;/empty&gt;
                                            &lt;template&gt;${qualifier}${column_name}${qualifier}&lt;/template&gt;
                                    &lt;/if&gt;&lt;else&gt;
                                            &lt;template&gt;${delimiter}${qualifier}${column_name}${qualifier}&lt;/template&gt;
                                    &lt;/else&gt;&lt;/case&gt;
                            &lt;/body&gt;
                    &lt;/loop&gt;
            &lt;/file&gt;
    
            &lt;empty&gt;
                    &lt;var-def name=&quot;records&quot;&gt;
                            &lt;xpath expression=&quot;${xpath_get_records}&quot;&gt;
                                    &lt;var name=&quot;xml_content&quot;/&gt;
                            &lt;/xpath&gt;
                    &lt;/var-def&gt;
            &lt;/empty&gt;
    
            &lt;!-- Loop thru each record. --&gt;
            &lt;loop item=&quot;record&quot; index=&quot;i&quot; filter=&quot;${filter}&quot;&gt;
                &lt;list&gt;&lt;var name=&quot;records&quot;/&gt;&lt;/list&gt;
                &lt;body&gt;
                        &lt;!-- Append each data row onto the file. --&gt;
                        &lt;file action=&quot;append&quot; path=&quot;${data_file_name}&quot; charset=&quot;UTF-8&quot;&gt;
    
                                &lt;!-- Start new data row. --&gt;
                              &lt;template&gt;${const.lf}&lt;/template&gt;
    
                              &lt;!-- Add each column's value to the CSV row. --&gt;
                              &lt;empty&gt;&lt;var-def name=&quot;first_column&quot;&gt;true&lt;/var-def&gt;&lt;/empty&gt;
                                &lt;loop item=&quot;column_name&quot; index=&quot;c&quot;&gt;
                                        &lt;list&gt;&lt;var name=&quot;column_names&quot;/&gt;&lt;/list&gt;
                                        &lt;body&gt;
                                                &lt;empty&gt;
                                                        &lt;var-def name=&quot;this_val&quot;&gt;
                                                                &lt;empty&gt;
                                                                        &lt;var-def name=&quot;this_xpath&quot;&gt;
                                                                                &lt;regexp replace=&quot;true&quot;&gt;
                                                                                        &lt;regexp-pattern&gt;&lt;template&gt;\{column_name\}&lt;/template&gt;&lt;/regexp-pattern&gt;
                                                                                            &lt;regexp-source&gt;&lt;template&gt;${xpath_get_column_value}&lt;/template&gt;&lt;/regexp-source&gt;
                                                                                            &lt;regexp-result&gt;&lt;template&gt;${column_name}&lt;/template&gt;&lt;/regexp-result&gt;
                                                                                &lt;/regexp&gt;
                                                                        &lt;/var-def&gt;
                                                                &lt;/empty&gt;
                                                                &lt;xpath expression=&quot;${this_xpath}&quot;&gt;
                                                                        &lt;var name=&quot;record&quot;/&gt;
                                                                &lt;/xpath&gt;
                                                        &lt;/var-def&gt;
    
                                                        &lt;!-- Escape any qualifier characters which appear in the text. --&gt;
                                                        &lt;case&gt;&lt;if condition=&quot;${escape_qualifier == null}&quot;&gt;
                                                                &lt;var-def name=&quot;escaped_val&quot;&gt;&lt;var name=&quot;this_val&quot;/&gt;&lt;/var-def&gt;
                                                        &lt;/if&gt;&lt;else&gt;
                                                                &lt;var-def name=&quot;escaped_val&quot;&gt;
                                                                        &lt;regexp replace=&quot;true&quot;&gt;
                                                                                    &lt;regexp-pattern&gt;&lt;template&gt;${qualifier}&lt;/template&gt;&lt;/regexp-pattern&gt;
                                                                                    &lt;regexp-source&gt;&lt;template&gt;${this_val}&lt;/template&gt;&lt;/regexp-source&gt;
                                                                                    &lt;regexp-result&gt;&lt;template&gt;${escape_qualifier}&lt;/template&gt;&lt;/regexp-result&gt;
                                                                            &lt;/regexp&gt;
                                                                &lt;/var-def&gt;
                                                        &lt;/else&gt;&lt;/case&gt;
                                                &lt;/empty&gt;
                                                &lt;case&gt;&lt;if condition=&quot;${first_column}&quot;&gt;
                                                        &lt;empty&gt;&lt;var-def name=&quot;first_column&quot;&gt;false&lt;/var-def&gt;&lt;/empty&gt;
                                                        &lt;template&gt;${qualifier}${escaped_val}${qualifier}&lt;/template&gt;
                                                &lt;/if&gt;&lt;else&gt;
                                                        &lt;template&gt;${delimiter}${qualifier}${escaped_val}${qualifier}&lt;/template&gt;
                                                &lt;/else&gt;&lt;/case&gt;
                                        &lt;/body&gt;
                                &lt;/loop&gt;
                        &lt;/file&gt;
                &lt;/body&gt;
            &lt;/loop&gt;
    &lt;/function&gt;
    

    </config>

     
  • triboos

    triboos - 2011-12-10

    Hi, this is a great scrip - tks for sharing!

    I'm having a small issue with the output. The title and results don't lineup
    properly - can you help?

    This is how the csv looks in excel:

    Title

    Color

    Author

    new shirt

    plaid

    The Selfish Gene

    Dawkins, Richard

    Fyi, I had to change the following property to be able to run it:

    <template>${const.lf}</template>

    <template>${sys.lf}</template>

     

Log in to post a comment.