How do I write a Web-Harvest .xml configuration for monster.com, using the URL http://jobsearch.monsterindia.com/searchresult.html?fts=java&ac=&submit.x=29&submit.y=13, that extracts the content of the search results and writes it to a file?
This is the one for Monster UK:
<config charset="ISO-8859-1">
<function name="download-multipage-list">
<return>
<while condition="${pageUrl.toString().length() != 0}" index="i">
<empty>
<var-def name="content">
<html-to-xml>
<http url="${pageUrl}"/>
</html-to-xml>
</var-def>
<var-def name="disabledvalue">
<xpath expression="${disabled}">
<var name="content"/>
</xpath>
</var-def>
<var-def name="nextLinkUrl">
<xpath expression="${nextXPath}">
<var name="content"/>
</xpath>
</var-def>
<var-def name="pageUrl">
<case>
<if condition="${disabledvalue.toString().length()==0}">
<template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
</if>
<else>
<template></template>
</else>
</case>
</var-def>
</empty>
<xpath expression="${itemXPath}">
<var name="content"/>
</xpath>
</while>
</return>
</function>
<var-def name="startURL"><template>http://jobsearch.monster.co.uk/Search.aspx?q=BRACKNELL&amp;cy=uk&amp;lid=193&amp;re=130</template></var-def>
<var-def name="products">
<call name="download-multipage-list">
<call-param name="pageUrl"><var name="startURL"/></call-param>
<call-param name="nextXPath">//noscript/a/@href</call-param>
<call-param name="itemXPath">//*</call-param>
<call-param name="disabled">//a/@disabled</call-param>
</call>
</var-def>
<loop item="item" index="i">
<list><var name="products"/></list>
<body>
<var-def name="jobtitle.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA/a/text())
return
$jobTitle
]]></xq-expression>
</xquery>
</var-def>
<var-def name="jobdetail.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA)
return
$JobDetail
]]></xq-expression>
</xquery>
</var-def>
<var-def name="joblink.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA/a/@href)
return
$JobLink
]]></xq-expression>
</xquery>
</var-def>
<var-def name="employer.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA)
return
$employer
]]></xq-expression>
</xquery>
</var-def>
<var-def name="location.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA)
return
$location
]]></xq-expression>
</xquery>
</var-def>
<var-def name="salary.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA]></xq-expression>
</xquery>
</var-def>
<var-def name="category.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA]></xq-expression>
</xquery>
</var-def>
</body>
</loop>
</config>
Hi. I have a similar problem. I need to write a script to scrape data from
"http://www.term4sale.com". Entering the zip code "94530" on this page takes
you to "http://www.term4sale.com/cgi-bin/cqsl.cgi". We need to extract data
such as "company name", "product name" and "health category". The script I
wrote is not able to pass the zip code to the subsequent page, so I get the
"data missing" page. What is the right way to pass the zip code so that the
second page loads with data instead of the error page?
The script goes like this (it's not complete):
<http method="post" url="http://www.term4sale.com/">
<http-param name="ZipCode">94530</http-param>
</http>
<var-def name="startUrl">http://www.term4sale.com/cgi-bin/cqsl.cgi</var-def>
<file action="write" path="C:\\Users\\cloudfi\\Desktop\\WebHarvest\\termsale.txt" charset="UTF-8">
<xpath expression="//td">
<html-to-xml>
<http url="${startUrl}"/>
</html-to-xml>
</xpath>
Thanks for your quick response.
The error still persists. The first URL takes the zip code as input and takes
us to the second URL, from which I need to scrape the data. My guess is that we
need to mention both URLs, but I don't know how to do that.
What do you mean by "takes us to the second URL"? It responds with either a
search result, which you need to scrape, or an error page if something goes
wrong. There is no "second URL" to mention (as far as I can see).
The URL "http://www.term4sale.com" is the one where we enter the data, which
takes us to the result URL "http://www.term4sale.com/cgi-bin/cqsl.cgi".
The script goes like this:
<config charset="ISO-8859-1">
<http method="post" url="http://www.term4sale.com/">
<http-param name="BirthMonth">6</http-param>
<http-param name="BirthYear">1970</http-param>
<http-param name="Birthday">15</http-param>
<http-param name="CompRating">4</http-param>
<http-param name="ErrOnMissingZipCode">ON</http-param>
<http-param name="FaceAmount">500000</http-param>
<http-param name="HTEMPLATEFILE">HTEMPLATE_TERM4SALE.HTM</http-param>
<http-param name="Health">PP</http-param>
<http-param name="ModeUsed">M</http-param>
<http-param name="NewCategory">5</http-param>
<http-param name="Sex">M</http-param>
<http-param name="Smoker">N</http-param>
<http-param name="SortOverride1">A</http-param>
<http-param name="State">0</http-param>
<http-param name="TEMPLATEFILE">TEMPLATE_TERM4SALE.HTM</http-param>
<http-param name="ZipCode">94530</http-param>
</http>
<var-def name="startUrl">http://www.term4sale.com/cgi-bin/cqsl.cgi</var-def>
<var-def name="con">
<xpath expression="//p">
<html-to-xml>
<http url="${startUrl}"/>
</html-to-xml>
</xpath>
</var-def>
<file action="write" path="C:\\Users\\cloudfi\\Desktop\\WebHarvest\\termsale.txt" charset="UTF-8">
<loop item="item" index="i">
<list><var name="con"/></list>
<body>
<var-def name="rowdata.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression>
</xq-expression>
</xquery>
</var-def>
</body>
</loop>
</file>
</config>
I have passed all the possible POST parameters as input, but I am still getting the missing field error.
I am a newbie, so it took me some time to understand your advice. It works perfectly fine.
Thanks Alex.
<html-to-xml>
<http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
<http-param name="BirthMonth">6</http-param>
<http-param name="BirthYear">1970</http-param>
<http-param name="Birthday">15</http-param>
<http-param name="CompRating">4</http-param>
<http-param name="ErrOnMissingZipCode">ON</http-param>
<http-param name="FaceAmount">500000</http-param>
<http-param name="HTEMPLATEFILE">HTEMPLATE_TERM4SALE.HTM</http-param>
<http-param name="Health">PP</http-param>
<http-param name="ModeUsed">M</http-param>
<http-param name="NewCategory">5</http-param>
<http-param name="Sex">M</http-param>
<http-param name="Smoker">N</http-param>
<http-param name="SortOverride1">A</http-param>
<http-param name="State">0</http-param>
<http-param name="TEMPLATEFILE">TEMPLATE_TERM4SALE.HTM</http-param>
<http-param name="ZipCode">94530</http-param>
</http>
</html-to-xml>
Hello Everyone,
I have a script which saves its result in XML format. I would like to save the
result in CSV format instead.
Is there any way to modify the existing script to get the desired output, or do
I need to write a separate script to convert XML to CSV?
The code goes like this:
<config charset="ISO-8859-1">
<var-def name="con">
<html-to-xml>
<http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
<http-param name="BirthMonth">6</http-param>
<http-param name="BirthYear">1970</http-param>
<http-param name="Birthday">15</http-param>
<http-param name="CompRating">4</http-param>
<http-param name="ErrOnMissingZipCode">ON</http-param>
<http-param name="FaceAmount">500000</http-param>
<http-param name="HTEMPLATEFILE">HTEMPLATE_TERM4SALE.HTM</http-param>
<http-param name="Health">PP</http-param>
<http-param name="ModeUsed">M</http-param>
<http-param name="NewCategory">5</http-param>
<http-param name="Sex">M</http-param>
<http-param name="Smoker">N</http-param>
<http-param name="SortOverride1">A</http-param>
<http-param name="State">0</http-param>
<http-param name="TEMPLATEFILE">TEMPLATE_TERM4SALE.HTM</http-param>
<http-param name="ZipCode">94530</http-param>
</http>
</html-to-xml>
</var-def>
<var-def name="allrows">
<xpath expression="//body/table">
</xpath>
</var-def>
<var-def name="col">
<xpath expression="//tr">
</xpath>
</var-def>
<file action="write" path="C:\\Users\\cloudfi\\Desktop\\WebHarvest\\termsale.xml" charset="UTF-8">
<loop item="item" index="i">
<list></list>
<body>
<var-def name="row.${i}">
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression>
<company>{normalize-space($company)}</company>
<rating>{normalize-space($rating)}</rating>
<annual>{normalize-space($annual)}</annual>
<monthly>{normalize-space($annual)}</monthly>
<product>{normalize-space($product)}</product>
<category>{normalize-space($category)}</category>
</record>
]]>
</xq-expression>
</xquery>
</var-def>
</body>
</loop>
</file>
</config>
Any help is much appreciated.
juliecloud, please do not paste those huge XMLs into your messages. It makes no sense, because people are unlikely to read through them to find a needle in a haystack. Keep your messages short and paste only the relevant piece of code, and only if it makes your post clearer. That will increase the chances of your question being answered :)
Regarding your question (if I got it right): there is no built-in XML-to-CSV
conversion in WH, but you can do it easily with XSLT processing. In any case,
the conversion algorithm depends on your custom XML structure, so there is no
truly generic converter.
Look here for an example: http://stackoverflow.com/questions/365312/xml-to-csv-using-xslt
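As a rough illustration of the XSLT approach, here is a minimal stylesheet sketch. It assumes the saved file contains `record` elements with the six child elements used in the script above (company, rating, annual, monthly, product, category), in that order, under some root element; the root element and the header row are assumptions, not taken from the original script. Values are wrapped in double quotes, but embedded quotes are not escaped, so treat it as a starting point only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Emit plain text instead of XML -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <!-- Header row (assumed column order) followed by a line feed -->
    <xsl:text>company,rating,annual,monthly,product,category&#10;</xsl:text>
    <xsl:apply-templates select="//record"/>
  </xsl:template>

  <!-- One CSV line per record -->
  <xsl:template match="record">
    <xsl:for-each select="*">
      <xsl:text>"</xsl:text>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>"</xsl:text>
      <!-- Comma between fields, none after the last one -->
      <xsl:if test="position() != last()">
        <xsl:text>,</xsl:text>
      </xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Because the inner loop walks every child of `record`, the column order in the output simply follows the element order in the XML.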
Thank you for the input, wajda. Posting code only where it makes the post
clearer is a good idea :)
I wrote an XSL file and referenced it from the XML to convert it to CSV, but it
throws the following error:
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then
click the Refresh button, or try again later.
End tag 'td' does not match the start tag 'xsl:value-of'. Error processing
resource 'file:///C:/Users/cloudfi/Desktop/WebHa...
--------------^
This is the piece of code:
<body>
<xsl:for-each select="//record"> </xsl:for-each></body>
Any thoughts?
Looks correct. Perhaps adding the missing slash would help.
Here: <xsl:value-of select="monthly">
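For illustration, this is roughly what a self-closed `xsl:value-of` looks like inside a table row. The `tr`/`td` layout is an assumption based on the error message mentioning `td`; the original stylesheet body was not preserved in the post.

```xml
<xsl:for-each select="//record">
  <tr>
    <!-- Each xsl:value-of is an empty element, so it must end with "/>" -->
    <td><xsl:value-of select="company"/></td>
    <td><xsl:value-of select="monthly"/></td>
  </tr>
</xsl:for-each>
```

Without the trailing slash, the parser treats `xsl:value-of` as still open when it reaches `</td>`, which matches the reported "End tag 'td' does not match the start tag 'xsl:value-of'" message.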
I thought the same, but adding the missing slash did not help either.
I have no idea then. The error message you mentioned says exactly that: "End
tag 'td' does not match the start tag 'xsl:value-of'".
Thanks wajda, it worked. For some reason the change had not been saved, so the
error persisted.
Hi Everyone,
I have a script which works fine with a single input value for each parameter.
How should I modify the script to loop over a range of values for a parameter?
<html-to-xml>
<http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
<http-param name="BirthMonth">month</http-param>
<http-param name="BirthYear">1970</http-param>
<http-param name="Birthday">15</http-param>
</http>
</html-to-xml>
For example, birth years from 1970 to 1985, birth months from 1 to 12, and so on.
Any thoughts?
Thanks in advance.
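One possible way to do this, offered as an untested sketch rather than a confirmed answer, is to nest two `while` processors and use their index variables as counters. It assumes that the `while` index starts at 1, that `maxloops` caps the number of iterations, and that `${...}` expressions inside `template` are evaluated by the scripting engine; the variable name `pages` is hypothetical.

```xml
<config charset="ISO-8859-1">
    <var-def name="pages">
        <!-- Assumption: y runs 1..16, mapped to birth years 1970..1985 -->
        <while condition="true" index="y" maxloops="16">
            <!-- Assumption: m runs 1..12, used directly as the birth month -->
            <while condition="true" index="m" maxloops="12">
                <html-to-xml>
                    <http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
                        <http-param name="BirthMonth"><template>${m}</template></http-param>
                        <!-- 1969 + 1 = 1970 on the first pass, 1969 + 16 = 1985 on the last -->
                        <http-param name="BirthYear"><template>${1969 + Integer.parseInt(y.toString())}</template></http-param>
                        <http-param name="Birthday">15</http-param>
                        <!-- the remaining http-param elements stay as in the working script -->
                    </http>
                </html-to-xml>
            </while>
        </while>
    </var-def>
</config>
```

This issues 16 x 12 = 192 requests, so it may be worth trying a smaller range first.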
Thank you, brothers, for the fast reply. How do I write or generate an XML file
using C# to get the result below?
<pages>http://www.jobz.pk/nawaiwaqt_jobs/
<page name="Page Name 1" url="/page-1/"/>
<page name="Page Name 2" url="/page-2/"/>
<page name="Page Name 3" url="/page-3/"/>
<page name="Page Name 4" url="/page-4/"/>
</pages>
Can I ask for help here? I wish I could extract selected sets of art-related
info from Wikipedia (http://en.wikipedia.org):
I have a list of search keywords (e.g. Mona Lisa, Picasso, Surrealism, etc.,
all about arts, artists, events and cultural heritage) and I would like to
generate two folders: one with the XML data extracted from the Wikipedia
pages, and the other with the images, linked sequentially to the related XML
files in the first folder (to be able to reconstruct the page structure).
The first folder contains XML files, and each XML file identifies:
-language
-artist name
-work of art
-title
-date of work
-description
-link to the main image
-link to high res image
-link to the wiki page
-list of links to references and copyright
The second folder contains the page's extracted images, named sequentially and
linked to the XML files from the source Wikipedia page.
Please, could you help me?
Hi,
you can remove this request. I solved the basic structure myself. Now I'll try
to clean it up and improve it :-)
Hi,
you can remove this request. I made the basic structure myself and it works
(now I can collect images). Good tool.
:-)