I am trying to scrape The Hindu newspaper in the same way the NYTimes sample does. My config file is given below, but the output contains only the date.
<var-def name="startUrl">http://www.hinduonnet.com/</var-def>
<file action="write" path="hindu/hindu.xml" charset="UTF-8">
    <template>
        <![CDATA[ <Hindu date="${sys.datetime("dd.MM.yyyy")}"> ]]>
    </template>
    <loop item="articleUrl" index="i">
        <!-- collects URLs of all articles from the front page -->
        <list>
            <xpath expression="//td/a[@class='topstory']/@href">
                <html-to-xml>
                    <http url="${startUrl}"/>
                </html-to-xml>
            </xpath>
            <xpath expression="//div[@class='bluebk']/a[1]/@href">
                <html-to-xml>
                    <http url="${startUrl}"/>
                </html-to-xml>
            </xpath>
        </list>
        <!-- downloads each article and extracts data from it -->
        <body>
            <xquery>
                <xq-param name="doc">
                    <html-to-xml>
                        <http url="${sys.fullUrl(startUrl, articleUrl)}?&amp;pagewanted=print"/>
                    </html-to-xml>
                </xq-param>
                <xq-expression><![CDATA[
                    let $author := data($doc//div[@class="otherstory"])
                    let $title := data($doc/font[@class="storyhead"])
                    let $text := data($doc//p/a[1])
                    return
                        <article>
                            <title>{normalize-space($title)}</title>
                            <author>{normalize-space($author)}</author>
                            <text>{normalize-space($text)}</text>
                        </article>
                ]]></xq-expression>
            </xquery>
        </body>
    </loop>
    <![CDATA[ </Hindu> ]]>
</file>
Please give the correct XPath expressions so that I can syndicate the online newspaper.
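Two things can be said from the config alone, without knowing the site's markup. First, if only the opening element with the date is written, the loop body probably never ran, which would mean that neither URL expression in the list matched anything on the front page. Second, `$doc/font[@class="storyhead"]` only matches a `font` element at the very top of the document, whereas the other two expressions use `//`; `$doc//font[@class="storyhead"]` is probably what was intended. A minimal debugging sketch, assuming the goal is just to see what `html-to-xml` actually produces for the front page (the output path `hindu/frontpage.xml` is a made-up name for illustration):

```xml
<config charset="UTF-8">
    <var-def name="startUrl">http://www.hinduonnet.com/</var-def>

    <!-- Dump the cleaned-up front page to disk (hypothetical path) so the
         real element names and class attributes can be inspected by hand -->
    <file action="write" path="hindu/frontpage.xml" charset="UTF-8">
        <html-to-xml>
            <http url="${startUrl}"/>
        </html-to-xml>
    </file>
</config>
```

Opening the dumped file and searching for one known headline should show which elements and `class` values wrap the article links; the two `xpath` expressions can then be adjusted to match. The same trick applied to a single article URL would help check the title, author and text expressions.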