Try to scrape a HTML page-need Xpath expressi

Forum: Help
Creator: uma
Created: 2006-10-25
Updated: 2012-09-04
  • uma

    uma - 2006-10-25

    Thanks for the help.
    I want to extract some data from a website. A sample of the HTML page is given below.
    <tr>
    <td colspan="4" align="center" valign="top">

          </td>
        </tr>
        <tr>
          <td colspan="4" align="center" valign="top"><table width="100%" border="0" cellspacing="0" cellpadding="0">
              <tr>
                <td height="25" background="images/categorybg.gif"><table width="100%" border="0" cellspacing="0" cellpadding="0">
                    <tr>
                      <td width="55%" class="addresstitle"><strong><font color="#6A0000">
                        &nbsp;
                        GOVT HOSPITAL CHROMPET
                        </font></strong></td>
                      <td width="19%" align="left" class="heading2"><font color="#666666">
                        <span class="unnamed1">
                        Chrompet
                        </span></font></td>
                      <td width="14%" align="left"><font color="#666666"><strong>
                        <span class="unnamed1">
                        22382400</span></strong></font> <span class="unnamed1">

                        </span> </td>
                      <td width="12%" align="right">

                        </div></td>
                    </tr>
                  </table></td>
              </tr>
            </table></td>
        </tr>

          </td>
        </tr>
    

    I want to scrape only GOVT HOSPITAL CHROMPET, Chrompet and 22382400. What XPath expression should I use, and can you give me some sample code?

     
    • Vladimir Nikic

      Vladimir Nikic - 2006-10-25

      For the first cell, use //td[@class='addresstitle']. That gives you one piece of the information. For Chrompet you need a different XPath expression, and so on. However, you can use XQuery to combine several XPath expressions in one Web-Harvest processor. There are a lot of tutorials on the web about XPath and XQuery. See for example

      http://www.w3schools.com/.

      When using XPath expressions, it is best not to write full paths to the desired data, like "/html/body/.....",
      but to find something specific, as in the example //td[@class='addresstitle'], where addresstitle is the crucial attribute value.
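
      A minimal sketch of this advice as a Web-Harvest configuration, assuming the cleaned page keeps the structure of the sample HTML above. The variable names page, name, area and phone are made up for illustration, urladdress is a placeholder for the real URL, and the third expression relies on the phone cell being the td right after the heading2 cell, which may not hold on the real page.

      ```xml
      <config charset="UTF-8">
          <!-- Download the page once and clean it into well-formed XML. -->
          <var-def name="page">
              <html-to-xml>
                  <http url="urladdress"/>
              </html-to-xml>
          </var-def>

          <!-- Hospital name: anchored on class="addresstitle", not on a full path. -->
          <var-def name="name">
              <xpath expression="//td[@class='addresstitle']//font/text()">
                  <var name="page"/>
              </xpath>
          </var-def>

          <!-- Area: the span inside the td with class="heading2". -->
          <var-def name="area">
              <xpath expression="//td[@class='heading2']//span[@class='unnamed1']/text()">
                  <var name="page"/>
              </xpath>
          </var-def>

          <!-- Phone: this td has no class, so step to it from the heading2 cell. -->
          <var-def name="phone">
              <xpath expression="//td[@class='heading2']/following-sibling::td[1]//strong/span/text()">
                  <var name="page"/>
              </xpath>
          </var-def>
      </config>
      ```

      The selected text nodes still carry the surrounding whitespace (and the non-breaking space before the hospital name), so they will probably need trimming afterwards.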

       
    • uma

      uma - 2006-10-27

      My XML config file looks like this.
      <?xml version="1.0" encoding="UTF-8"?>

      <config charset="UTF-8">

      <file action="write" path="myfolder/extract.xml">
          <template><![CDATA[ <extract time="${sys.datetime("dd.MM.yyyy, HH:mm:ss")}"> ]]></template>

          <loop item="row">
              <!-- list consists of all rows in the main HTML table on the page -->
              <list>
                  <xpath expression="//tr[@class='addresstitle']">
                      <html-to-xml>
                          <http url="urladdress"/>
                      </html-to-xml>
                  </xpath>
              </list>

              <!--
                  It is needed to resolve odds, date and header rows. Distinction is
                  made based on "class" attribute.
              -->
              <body>
                  <empty>
                      <var-def name="clazz">
                          <xpath expression="//td/@class">
                              <var name="row"/>
                          </xpath>
                      </var-def>
                   </empty>
                  <case>
                      <if condition='${clazz.toString() == "addresstitle"}'>
                          <xquery>
                              <xq-param name="doc"><var name="row"/></xq-param>
                              <xq-expression><![CDATA[
                                  for $row in $doc//td return
                                      <extract name="{normalize-space(data($row[1]))}">
                                      </extract>
                              ]]></xq-expression>
                          </xquery>
                      </if>
                  </case>
              </body>
          </loop>

          <![CDATA[ </extract> ]]>
      </file>
      

      </config>

      When I run this, I get the following error:
      variable 'clazz' is not defined.
      Is this the correct config file to extract the data mentioned above?

       
    • uma

      uma - 2006-10-27

      Can you give me some sample code to write the extracted data to an XML file using XQuery?

      Thanks
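
      A minimal sketch of such a configuration, not a tested one. It assumes the cleaned page keeps the structure of the sample HTML from the first post and that the xq-param value is passed to the query as a node. It reuses the urladdress placeholder and the myfolder/extract.xml path from the config above; the hospital, name, area and phone element names are introduced purely for illustration.

      ```xml
      <config charset="UTF-8">
          <file action="write" path="myfolder/extract.xml">
              <![CDATA[ <extract> ]]>

              <loop item="row">
                  <!-- One item per table row that directly contains an addresstitle cell. -->
                  <list>
                      <xpath expression="//tr[td/@class='addresstitle']">
                          <html-to-xml>
                              <http url="urladdress"/>
                          </html-to-xml>
                      </xpath>
                  </list>

                  <body>
                      <xquery>
                          <xq-param name="row"><var name="row"/></xq-param>
                          <xq-expression><![CDATA[
                              declare variable $row as node() external;

                              (: Join the text nodes, turn non-breaking spaces into plain spaces, trim. :)
                              let $name  := normalize-space(translate(string-join($row//td[@class='addresstitle']//text(), ' '), '&#160;', ' '))
                              let $area  := normalize-space(string-join($row//td[@class='heading2']//text(), ' '))
                              let $phone := normalize-space(string-join(($row//td)[3]//strong//text(), ' '))
                              return
                                  <hospital>
                                      <name>{$name}</name>
                                      <area>{$area}</area>
                                      <phone>{$phone}</phone>
                                  </hospital>
                          ]]></xq-expression>
                      </xquery>
                  </body>
              </loop>

              <![CDATA[ </extract> ]]>
          </file>
      </config>
      ```

      Here the row is selected by the class of its td child, since in the sample the class attribute sits on the td rather than on the tr, and a single XQuery expression builds the output element for each row instead of going through a separate class check.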

       
