
Try to scrape a HTML page-need Xpath expressi

  • uma

    uma - 2006-10-25

    Thanks for the help.
    I want to extract some data from a website. A sample of the HTML page is given below.
          <td colspan="4" align="center" valign="top"><table width="100%" border="0" cellspacing="0" cellpadding="0">
                <td height="25" background="images/categorybg.gif"><table width="100%" border="0" cellspacing="0" cellpadding="0">
                      <td width="55%" class="addresstitle"><strong><font color="#6A0000">
                        GOVT HOSPITAL CHROMPET
                      <td width="19%" align="left" class="heading2"><font color="#666666">
                        <span class="unnamed1">
                      <td width="14%" align="left"><font color="#666666"><strong>
                        <span class="unnamed1">
                        22382400</span></strong></font> <span class="unnamed1">
                        </span> </td>
                      <td width="12%" align="right">

    I want to scrape only GOVT HOSPITAL CHROMPET, Chrompet, and 22382400. What XPath expression should I use, and can you give me some sample code?

    • Vladimir Nikic

      Vladimir Nikic - 2006-10-25

      For the first cell use //tr[@class='addresstitle']. That gives one part of the information. For Chrompet you need a different XPath expression, and so on. However, you can use XQuery to combine several XPath expressions in one Web-Harvest processor. There are a lot of tutorials on the web about XPath and XQuery. See for example

      When using XPath expressions it is best not to write full paths to the desired data, like "/html/body/.....",
      but to find something specific, as in the example //tr[@class='addresstitle'], where addresstitle is the crucial attribute.
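The advice above can be illustrated outside Web-Harvest. This is a minimal sketch using Python's standard xml.etree.ElementTree, which supports only a subset of XPath. The names SAMPLE, text_of and extract_rows are hypothetical. SAMPLE is a simplified, well-formed stand-in for the posted fragment: the tr wrapper and closing tags are added, and placing "Chrompet" in the heading2 cell is a guess, because that text is missing from the sample as posted. In the posted fragment the class attribute addresstitle sits on a td, so the sketch matches td rather than tr. The phone cell has no class in the sample, so it is matched on its width attribute, which is also an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the posted fragment (assumption:
# the real page is not well-formed and would need cleaning first).
SAMPLE = """
<table>
  <tr>
    <td width="55%" class="addresstitle"><strong><font color="#6A0000">
      GOVT HOSPITAL CHROMPET</font></strong></td>
    <td width="19%" align="left" class="heading2"><font color="#666666">
      <span class="unnamed1">Chrompet</span></font></td>
    <td width="14%" align="left"><font color="#666666"><strong>
      <span class="unnamed1">22382400</span></strong></font></td>
  </tr>
</table>
"""


def text_of(node):
    """Join all text below node and collapse runs of whitespace."""
    if node is None:
        return ""
    return " ".join("".join(node.itertext()).split())


def extract_rows(xml_text):
    """Return (name, area, phone) for every row that has an addresstitle cell."""
    root = ET.fromstring(xml_text)
    rows = []
    for tr in root.iter("tr"):
        # Anchor each lookup on an attribute instead of a full path from the root.
        name = tr.find("td[@class='addresstitle']")
        if name is None:
            continue  # not a data row
        area = tr.find("td[@class='heading2']")
        phone = tr.find("td[@width='14%']")
        rows.append((text_of(name), text_of(area), text_of(phone)))
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

Because every lookup is anchored on an attribute, wrapping the table in extra markup (for example an enclosing div) leaves the result unchanged, whereas a full path from the document root would stop matching.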

    • uma

      uma - 2006-10-27

      My XML config file looks like this:
      <?xml version="1.0" encoding="UTF-8"?>

      <config charset="UTF-8">

      <file action="write" path="myfolder/extract.xml">
          <template><![CDATA[ <extract time="${sys.datetime("dd.MM.yyyy, HH:mm:ss")}"> ]]></template>
          <loop item="row">
              <!-- list consists of all rows in the main HTML table on the page -->
                  <xpath expression="//tr[@class=&quot;addresstitle&quot;]">
                          <http url="urladdress"/>
                  It is needed to resolve odds, date and header rows. Distinction is
                  made based on "class" attribute.
                      <var-def name="clazz">
                          <xpath expression="//td/@class">
                              <var name="row"/>
                      <if condition='${clazz.toString() == "addresstitle"}'>
                              <xq-param name="doc"><var name="row"/></xq-param>
                                  for $row in $doc//td return
                                      <extract name="{normalize-space(data($row[1]))}">
          <![CDATA[ </extract> ]]>


      When I run this, I get the following error:
      variable 'clazz' is not defined.
      Is this the correct config file for extracting the data described above?

    • uma

      uma - 2006-10-27

      Can you give me some sample code to write the extracted data to an XML file using XQuery?


