How to write an .xml file for monster.com

2008-06-02
2012-09-04
  • Chandra Shekher

    Chandra Shekher - 2010-04-11

    This is for Monster UK:

    <config charset="ISO-8859-1">

        <function name="download-multipage-list">
            <return>
                <while condition="${pageUrl.toString().length() != 0}" index="i">
                    <empty>
                        <var-def name="content">
                            <html-to-xml>
                                <http url="${pageUrl}"/>
                            </html-to-xml>
                        </var-def>
                        <var-def name="disabledvalue">
                            <xpath expression="${disabled}">
                                <var name="content"/>
                            </xpath>
                        </var-def>
                        <var-def name="nextLinkUrl">
                            <xpath expression="${nextXPath}">
                                <var name="content"/>
                            </xpath>
                        </var-def>
                        <var-def name="pageUrl">
                            <case>
                                <if condition="${disabledvalue.toString().length()==0}">
                                    <template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
                                </if>
                                <else>
                                    <template></template>
                                </else>
                            </case>
                        </var-def>
                    </empty>
                    <xpath expression="${itemXPath}">
                        <var name="content"/>
                    </xpath>
                </while>
            </return>
        </function>

        <var-def name="startURL"><template>http://jobsearch.monster.co.uk/Search.aspx?q=BRACKNELL&amp;cy=uk&amp;lid=193&amp;re=130</template></var-def>

        <var-def name="products">
            <call name="download-multipage-list">
                <call-param name="pageUrl"><var name="startURL"/></call-param>
                <call-param name="nextXPath">//noscript/a/@href</call-param>
                <call-param name="itemXPath">//*</call-param>
                <call-param name="disabled">//a/@disabled</call-param>
            </call>
        </var-def>

        <loop item="item" index="i">
            <list><var name="products"/></list>
            <body>
                <var-def name="jobtitle.${i}">
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[
                            ... /a/text())
                            return
                            $jobTitle
                        ]]></xq-expression>
                    </xquery>
                </var-def>
                <var-def name="jobdetail.${i}">
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[
                            ... )
                            return
                            $JobDetail
                        ]]></xq-expression>
                    </xquery>
                </var-def>
                <var-def name="joblink.${i}">
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[
                            ... /a/@href)
                            return
                            $JobLink
                        ]]></xq-expression>
                    </xquery>
                </var-def>
                <var-def name="employer.${i}">
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[
                            ... )
                            return
                            $employer
                        ]]></xq-expression>
                    </xquery>
                </var-def>
                <var-def name="location.${i}">
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[
                            ... )
                            return
                            $location
                        ]]></xq-expression>
                    </xquery>
                </var-def>
                <var-def name="salary.${i}">
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[ ... ]]></xq-expression>
                    </xquery>
                </var-def>
                <var-def name="category.${i}">
                    <xquery>
                        <xq-param name="item" type="node()"><var name="item"/></xq-param>
                        <xq-expression><![CDATA[ ... ]]></xq-expression>
                    </xquery>
                </var-def>
            </body>
        </loop>

    </config>

     
  • Julie

    Julie - 2010-10-20

    Hi. I have a similar problem. I need to write a script to scrape data from
    "http://www.term4sale.com". Entering the zip code "94530" on this page takes
    you to "http://www.term4sale.com/cgi-bin/cqsl.cgi". We need to extract data
    such as "company name", "product name" and "health category". The script I
    wrote is not able to pass the zip code to the subsequent page, so I get a
    "data missing" page. What is the right way to pass the zip code so that the
    second page loads with data instead of the error page?

    The script goes like this (it's not complete):

    <http method="post" url="http://www.term4sale.com/">
        <http-param name="ZipCode">94530</http-param>
    </http>

    <var-def name="startUrl">http://www.term4sale.com/cgi-bin/cqsl.cgi</var-def>

    <file action="write" path="C:\\Users\\cloudfi\\Desktop\\WebHarvest\\termsale.txt" charset="UTF-8">
        <xpath expression="//td">
            <html-to-xml>
                <http url="${startUrl}"/>
            </html-to-xml>
        </xpath>
    </file>

     
  • Alex Wajda

    Alex Wajda - 2010-10-20
    <html-to-xml> 
        <http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
            <http-param name="ZipCode">94530</http-param>
            ....
        </http> 
    </html-to-xml>
    
     
  • Julie

    Julie - 2010-10-20

    Thanks for your quick response.

    The error still persists. The first URL takes the zip code as input and
    takes us to the second URL, from which I need to scrape data. My guess is
    that we need to specify both URLs. I don't know how to do that, though.

     
  • Alex Wajda

    Alex Wajda - 2010-10-21

    What do you mean by "takes us to the second URL"? It responds with a search
    result, which you need to scrape, or an error page if something goes wrong.
    There is no "second URL" to mention (as far as I can see).

     
  • Julie

    Julie - 2010-10-21

    The URL "http://www.term4sale.com" is the one where we input data, which
    takes us to the result URL "http://www.term4sale.com/cgi-bin/cqsl.cgi".

    The script goes like this:

    <config charset="ISO-8859-1">

        <http method="post" url="http://www.term4sale.com/">
            <http-param name="BirthMonth">6</http-param>
            <http-param name="BirthYear">1970</http-param>
            <http-param name="Birthday">15</http-param>
            <http-param name="CompRating">4</http-param>
            <http-param name="ErrOnMissingZipCode">ON</http-param>
            <http-param name="FaceAmount">500000</http-param>
            <http-param name="HTEMPLATEFILE">HTEMPLATE_TERM4SALE.HTM</http-param>
            <http-param name="Health">PP</http-param>
            <http-param name="ModeUsed">M</http-param>
            <http-param name="NewCategory">5</http-param>
            <http-param name="Sex">M</http-param>
            <http-param name="Smoker">N</http-param>
            <http-param name="SortOverride1">A</http-param>
            <http-param name="State">0</http-param>
            <http-param name="TEMPLATEFILE">TEMPLATE_TERM4SALE.HTM</http-param>
            <http-param name="ZipCode">94530</http-param>
        </http>

        <var-def name="startUrl">http://www.term4sale.com/cgi-bin/cqsl.cgi</var-def>

        <var-def name="con">
            <xpath expression="//p">
                <html-to-xml>
                    <http url="${startUrl}"/>
                </html-to-xml>
            </xpath>
        </var-def>

        <file action="write" path="C:\\Users\\cloudfi\\Desktop\\WebHarvest\\termsale.txt" charset="UTF-8">
            <loop item="item" index="i">
                <list><var name="con"/></list>
                <body>
                    <var-def name="rowdata.${i}">
                        <xquery>
                            <xq-param name="item" type="node()"><var name="item"/></xq-param>
                            <xq-expression>
                            </xq-expression>
                        </xquery>
                    </var-def>
                </body>
            </loop>
        </file>

    </config>

    I have passed all the possible POST parameters as input, but I am still
    getting a missing-field error.
     
  • Julie

    Julie - 2010-10-21

    I am a newbie, so it took me some time to understand your advice. It works
    perfectly fine.

    Thanks Alex.

    <html-to-xml>
        <http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
            <http-param name="BirthMonth">6</http-param>
            <http-param name="BirthYear">1970</http-param>
            <http-param name="Birthday">15</http-param>
            <http-param name="CompRating">4</http-param>
            <http-param name="ErrOnMissingZipCode">ON</http-param>
            <http-param name="FaceAmount">500000</http-param>
            <http-param name="HTEMPLATEFILE">HTEMPLATE_TERM4SALE.HTM</http-param>
            <http-param name="Health">PP</http-param>
            <http-param name="ModeUsed">M</http-param>
            <http-param name="NewCategory">5</http-param>
            <http-param name="Sex">M</http-param>
            <http-param name="Smoker">N</http-param>
            <http-param name="SortOverride1">A</http-param>
            <http-param name="State">0</http-param>
            <http-param name="TEMPLATEFILE">TEMPLATE_TERM4SALE.HTM</http-param>
            <http-param name="ZipCode">94530</http-param>
        </http>
    </html-to-xml>

     
  • Julie

    Julie - 2010-10-26

    Hello Everyone,

    I have a script which saves its result in XML format. I would like to save
    the result in CSV format instead.

    Is there any way to modify the existing script to get the desired output,
    or do I need to write a separate script to convert XML to CSV?

    The code goes like this:

    <config charset="ISO-8859-1">

        <var-def name="con">
            <html-to-xml>
                <http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
                    <http-param name="BirthMonth">6</http-param>
                    <http-param name="BirthYear">1970</http-param>
                    <http-param name="Birthday">15</http-param>
                    <http-param name="CompRating">4</http-param>
                    <http-param name="ErrOnMissingZipCode">ON</http-param>
                    <http-param name="FaceAmount">500000</http-param>
                    <http-param name="HTEMPLATEFILE">HTEMPLATE_TERM4SALE.HTM</http-param>
                    <http-param name="Health">PP</http-param>
                    <http-param name="ModeUsed">M</http-param>
                    <http-param name="NewCategory">5</http-param>
                    <http-param name="Sex">M</http-param>
                    <http-param name="Smoker">N</http-param>
                    <http-param name="SortOverride1">A</http-param>
                    <http-param name="State">0</http-param>
                    <http-param name="TEMPLATEFILE">TEMPLATE_TERM4SALE.HTM</http-param>
                    <http-param name="ZipCode">94530</http-param>
                </http>
            </html-to-xml>
        </var-def>

        <var-def name="allrows">
            <xpath expression="//body/table">
                <var name="con"/>
            </xpath>
        </var-def>

        <var-def name="col">
            <xpath expression="//tr">
                <var name="allrows"/>
            </xpath>
        </var-def>

        <file action="write" path="C:\\Users\\cloudfi\\Desktop\\WebHarvest\\termsale.xml" charset="UTF-8">
            <loop item="item" index="i">
                <list><var name="col"/></list>
                <body>
                    <var-def name="row.${i}">
                        <xquery>
                            <xq-param name="item" type="node()"><var name="item"/></xq-param>
                            <xq-expression><![CDATA[
                                ...
                                <company>{normalize-space($company)}</company>
                                <rating>{normalize-space($rating)}</rating>
                                <annual>{normalize-space($annual)}</annual>
                                <monthly>{normalize-space($annual)}</monthly>
                                <product>{normalize-space($product)}</product>
                                <category>{normalize-space($category)}</category>
                                </record>
                            ]]></xq-expression>
                        </xquery>
                    </var-def>
                </body>
            </loop>
        </file>

    </config>

    Any help is much appreciated.
     
  • Alex Wajda

    Alex Wajda - 2010-10-26

    juliecloud, please do not paste those huge XMLs into your messages. It makes
    no sense, because it is hardly likely that people will read through them to
    find a needle in a haystack. Keep your messages short and paste only a
    relevant piece of code, and only if it makes your post clearer. Doing that,
    you'll increase the chances of your question being answered :)

    Regarding your question (if I got it right): there is no embedded XML-to-CSV
    conversion in WH, but you can do it easily with XSLT processing. In any
    case, the conversion algorithm depends on the custom XML structure, so there
    is no truly generic converter.

    Look here for an example:
    http://stackoverflow.com/questions/365312/xml-to-csv-using-xslt
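
    As a minimal sketch of the XSLT approach: the stylesheet below assumes the
    harvested file contains <record> elements with the children named in the
    script above (company, rating, annual, monthly, product, category). The
    header row and the use of "//record" to find records anywhere in the tree
    are illustrative assumptions, since the thread does not show the root
    element of the output file.

    ```xml
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <!-- Emit plain text instead of XML or HTML -->
        <xsl:output method="text" encoding="UTF-8"/>

        <xsl:template match="/">
            <!-- Header row (assumed column names, taken from the element names) -->
            <xsl:text>company,rating,annual,monthly,product,category&#10;</xsl:text>
            <!-- One CSV line per record element, wherever it sits in the tree -->
            <xsl:for-each select="//record">
                <xsl:value-of select="company"/><xsl:text>,</xsl:text>
                <xsl:value-of select="rating"/><xsl:text>,</xsl:text>
                <xsl:value-of select="annual"/><xsl:text>,</xsl:text>
                <xsl:value-of select="monthly"/><xsl:text>,</xsl:text>
                <xsl:value-of select="product"/><xsl:text>,</xsl:text>
                <xsl:value-of select="category"/><xsl:text>&#10;</xsl:text>
            </xsl:for-each>
        </xsl:template>
    </xsl:stylesheet>
    ```

    Fields that themselves contain commas or quotes are not escaped in this
    sketch. It should run under any XSLT 1.0 processor; Web-Harvest's own
    <xslt> processor is probably the natural place to apply it, though that is
    not tested here.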

     
  • Julie

    Julie - 2010-10-27

    Thank you for the input, wajda. Posting the code to make the post clearer is
    a good idea :)

    I wrote an XSL file and included it in the XML to convert it to CSV, but it
    throws the following error:

    The XML page cannot be displayed

    Cannot view XML input using XSL style sheet. Please correct the error and then
    click the Refresh button, or try again later.


    End tag 'td' does not match the start tag 'xsl:value-of'. Error processing
    resource 'file:///C:/Users/cloudfi/Desktop/WebHa...

    --------------^

    This is the piece of code:

    <body>

    <xsl:for-each select="//record"> </xsl:for-each>
    <xsl:value-of select="company"/>,<xsl:value-of select="rating"/>,<xsl:value-of select="annual"/>,<xsl:value-of select="monthly">,<xsl:value-of select="product"/>,<xsl:value-of select="category"/>

    </body>

    Any thoughts?

     
  • Alex Wajda

    Alex Wajda - 2010-10-27

    Looks correct. Perhaps adding the missing slash helps...

    Here: <xsl:value-of select="monthly">

     
  • Julie

    Julie - 2010-10-27

    I thought the same, but adding the missing slash did not help either.

     
  • Alex Wajda

    Alex Wajda - 2010-10-27

    I have no idea, then. The error message you mentioned says exactly that:
    "End tag 'td' does not match the start tag 'xsl:value-of'"

     
  • Julie

    Julie - 2010-10-27

    Thanks wajda, it worked. For some reason the change was not saved, so the
    error persisted.

     
  • Julie

    Julie - 2010-10-29

    Hi Everyone,

    I have a script which works fine with a single input value for each
    parameter. How should I modify the script to loop over a parameter's values?

    <html-to-xml>
        <http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
            <http-param name="BirthMonth">month</http-param>
            <http-param name="BirthYear">1970</http-param>
            <http-param name="Birthday">15</http-param>
        </http>
    </html-to-xml>

    For example, the birth year should run from 1970 to 1985, the birth month
    from 1 to 12, and so on. Any thoughts?

    Thanks in advance.
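
    One possible approach, sketched here under assumptions: wrap the request in
    nested <loop> elements and fill each parameter from the loop variable. The
    comma-separated value lists and the use of <tokenize> to split them are
    illustrative choices that are not confirmed anywhere in this thread, and
    the sketch has not been run against the site.

    ```xml
    <!-- Outer loop: one pass per birth year (list values are assumptions) -->
    <loop item="year">
        <list>
            <tokenize delimiters=",">1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985</tokenize>
        </list>
        <body>
            <!-- Inner loop: one pass per birth month -->
            <loop item="month">
                <list>
                    <tokenize delimiters=",">1,2,3,4,5,6,7,8,9,10,11,12</tokenize>
                </list>
                <body>
                    <html-to-xml>
                        <http method="post" url="http://www.term4sale.com/cgi-bin/cqsl.cgi">
                            <!-- Loop variables are substituted through templates -->
                            <http-param name="BirthMonth"><template>${month}</template></http-param>
                            <http-param name="BirthYear"><template>${year}</template></http-param>
                            <http-param name="Birthday">15</http-param>
                            <!-- remaining parameters as in the single-input script -->
                        </http>
                    </html-to-xml>
                </body>
            </loop>
        </body>
    </loop>
    ```

    Each combination issues its own POST request, so 16 years by 12 months
    means 192 requests; the result of each pass would still need to be stored
    or written out inside the inner <body>.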

     
  • katrina mughal

    katrina mughal - 2012-02-28

    Thank you, brothers, for the fast replies. How do I write or generate an XML
    file using C# to get the result below?

    <pages>http://www.jobz.pk/nawaiwaqt_jobs/

    <page name="Page Name 1" url="/page-1/"/>

    <page name="Page Name 2" url="/page-2/"/>

    <page name="Page Name 3" url="/page-3/"/>

    <page name="Page Name 4" url="/page-4/"/>

    </pages>

     
  • marco

    marco - 2012-03-10

    Can I ask for help here? I wish I could extract selected sets of art-related
    info from Wikipedia (http://en.wikipedia.org):

    I have a list of search keywords (e.g. Mona Lisa, Picasso, Surrealism; all
    about arts, artists, events and cultural heritage), and I would like to
    generate two folders: one with the XML data extracted from the Wikipedia
    pages, and the other with the images, linked sequentially to the related
    XML files in the first folder (to be able to reconstruct the page
    structure).

    The first folder contains XML files, and each XML file identifies:

    -language
    -artist name
    -work of art
    -title
    -date of work
    -description
    -link to the main image
    -link to high res image
    -link to the wiki page
    -list of links to references and copyright

    The second folder contains the page's extracted images, named sequentially
    and linked to the XML files from the source Wikipedia page.

    Please, could you help me?

     
  • marco

    marco - 2012-03-15

    Hi,

    You can remove this request. I solved the basic structure myself. Now I'll
    try to clean it up and improve it :-)

     
  • marco

    marco - 2012-03-15

    Hi,

    You can remove this request. I built the basic structure myself and it works
    (now I can collect images). Good tool.

    :-)

     
