
XML web scraping

Michael
2013-01-03
2013-01-04
  • Michael

    Michael - 2013-01-03

    Hello,

    I'm building a web scraper for certain apartment sites, and I've run into a slight problem. I was able to get an XML web scraper working on a more simply formatted HTML source, namely one that was nested properly. I've now run into a site where the format is more difficult: it uses multiple nested span elements, so I cannot simply iterate over a single span or class. Can anyone help or suggest an alternative approach? Below is the code, followed by the desired output:

    <config>
        <file action="write" path="edr/edrxml.xml">
            <template><![CDATA[ <extracted time="${sys.datetime("dd.MM.yyyy, HH:mm:ss")}" > ]]></template>
            <xquery>
                <xq-param name="doc">
                    <html-to-xml>
                        <http url="http://www.100midtown.com/index.php/prop/floor_plans"/>
                    </html-to-xml>
                </xq-param>
                <xq-expression>
                    <![CDATA[
                    declare variable $doc as node() external;
                    for $allListings in $doc//div[@id="floorPlanOneMatrix"]
                    for $plantitle in $allListings//span[@class="firstFloorPlanTitle"]
                    let $pt := $plantitle//span[@style="color: rgb(0,56,147)"]
                    let $pr := $allListings//span[@class="firstFloorPlanRates"]
                    return
                        <price> <header>100 Midtown</header><pt>{normalize-space(data($pt[1]))}</pt> <pr>{normalize-space(data($pr[1]))}</pr></price>
                    ]]></xq-expression>
            </xquery>
            <![CDATA[ </extracted> ]]>
        </file>
    </config>

    I'm trying to display all of the bedroom/bathroom types, along with the rent for each. Any and all help would be great. I can also answer questions if I'm being unclear, which is entirely possible.

    Thanks,
    M

     

    Last edit: Michael 2013-01-03
    • Michael

      Michael - 2013-01-04

      Please disregard the previous post. I know what the problem is now, but I'm still a bit short on the syntax.

      The problem: I'm trying to read through the same source twice, that is, to create two separate XML readers over the same document.

      My code:

      <xq-expression>
      <![CDATA[
          declare variable $doc as node() external;
          declare variable $doc2 as node() external;
          for $allListings in $doc//div[@id="floorPlanOneMatrix"]
          for $planrates in $allListings//span[@class="firstFloorPlanRates"]
          let $pr := $planrates
          for $allListings1 in $doc2//div[@id="floorPlanOneMatrix"]
          for $plantitle in $allListings1//span[@class="firstFloorPlanTitle"]
          let $pt := $plantitle//span[@style="color: rgb(0,56,147)"]
          return
              <price> <header>100 Midtown</header> <pt>{normalize-space(data($pt[1]))}</pt> <pr>{normalize-space(data($pr[1]))}</pr></price>
      ]]></xq-expression>

      Output:
      This creates a "squared effect", where each title is paired with each rate. I know the order would be fine if the titles and rates were each read separately into their own arrays. I believe the fix involves breaking out of the for loop that pulls each rate and then making a separate pass over the titles.

      Can anyone suggest syntax that would let me run two separate readers here? Thanks,

      Michael

       

      Last edit: Michael 2013-01-04
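
The "squared effect" described above is what nested for clauses produce in XQuery: every item of the outer sequence is combined with every item of the inner one. One possible way to pair titles and rates one-to-one is a single for clause with a positional variable (`at $i`), which is standard XQuery 1.0 syntax. The following is only a sketch. It assumes, as the post suggests, that the titles and rates appear in the same order and in equal numbers on the page. The names `$matrix`, `$titles`, `$rates`, and `$i` are made up for illustration; the expression is meant to replace the contents of the CDATA section inside `<xq-expression>`.

```xquery
declare variable $doc as node() external;

(: Bind each sequence once; both come back in document order. :)
let $matrix := $doc//div[@id="floorPlanOneMatrix"]
let $titles := $matrix//span[@class="firstFloorPlanTitle"]
let $rates  := $matrix//span[@class="firstFloorPlanRates"]

(: Walk the titles with a position counter and take the rate at the same position. :)
for $title at $i in $titles
let $pt := $title//span[@style="color: rgb(0,56,147)"]
let $pr := $rates[$i]
return
    <price>
        <header>100 Midtown</header>
        <pt>{normalize-space(data($pt[1]))}</pt>
        <pr>{normalize-space(data($pr))}</pr>
    </price>
```

Because both sequences are selected from the same `$doc`, a second external variable such as `$doc2` should not be needed. If the page has more titles than rates, `$rates[$i]` is an empty sequence for the extra titles and `<pr>` comes out empty, so it is worth checking the counts on the real page first.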
