XPath Issue

Help
2008-02-24
2012-09-04
  • parker20121

    parker20121 - 2008-02-24

    I'm trying to parse information from Yahoo's movie site and have run into an XPath issue. I'd like to think the syntax is correct and XPath simply isn't returning what I need, but more than likely I'm doing something wrong.

    I'm using the following configuration file to retrieve the web page for a movie, parse out the movie's description, and store the result in an XML file. Here is the config:

    <?xml version="1.0" encoding="UTF-8"?>
    <config>

    <var-def name="movieId">1809701422</var-def>

    <var-def name="resultsPage">
    <html-to-xml>
    <http url="http://movies.yahoo.com/movie/${movieId}/info">
    <http-header name="User-Agent">Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1</http-header>
    </http>
    </html-to-xml>
    </var-def>

    <var-def name="movieDescription">
    <xpath expression="/html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font[23]">
    <var name="resultsPage"/>
    </xpath>
    </var-def>

    <!--
       Extract the structure holding the movie information. 
       The information is enclosed in a series of font tags.
    

    <loop item="link" index="i">
    <list>
    <xpath expression="/html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font/text()">
    <var name="resultsPage"/>
    </xpath>
    </list>
    <body>
    <case>
    <if condition="${i == 4}">
    <var-def name="movieTitle">
    <var name="link"/>
    </var-def>
    </if>
    </case>
    <file action="append" path="./font-text.txt">
    <template>
    ${i}. ${link} ${sys.cr}${sys.lf}
    </template>
    </file>
    </body>
    </loop>
    -->

    <!-- Store the results on disk -->
    

    <file action="write" path="./database/${movieId}.xml">

    <template>
       <![CDATA[
       <?xml version="1.0"?>
       <movie id="${movieId}" title="" year="">
         <description>
            ${movieDescription}
         </description>
         <actors>
         </actors>
       </movie>

       ]]>
    </template>
    

    </file>

    </config>

    The movie description field is blank. But if I change the XPath query to read

    /html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font

    I get all the text values.

    I can iterate over the set of nodes and return the text one by one using a loop that is fed by an XPath query. Writing each node to a file with an index to indicate its position gives the output below. What I need is the text from node 23 as the description. So I know the original XPath query is selecting the correct set of nodes. But if I use a position higher than 1, I don't get any text from any of the child nodes at all. Any help would be appreciated.

    1
    .
    <font face="arial" size="-1">Movie Main Page</font>
    2
    .
    <font face="arial" size="-1">
    <b>Movie Overview</b>
    </font>
    3
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/details">Movie Details</a>
    </font>
    4
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/showtimes">Showtimes &amp; Tickets</a>
    </font>
    5
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/dvdinfo">DVD/Video Info</a>
    </font>
    6
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/trailer">Trailers &amp; Clips</a>
    </font>
    7
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/cast">Cast and Credits</a>
    </font>
    8
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/awards">Awards &amp; Nominations</a>
    </font>
    9
    .
    <font face="arial" size="-1">
    <b>Reviews and Previews</b>
    </font>
    10
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/critic">Critics Reviews</a>
    </font>
    11
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/user">User Reviews</a>
    </font>
    12
    .
    <font face="arial" size="-1" color="gray">Greg's Preview</font>
    13
    .
    <font face="arial" size="-1" color="gray">Movie Mom's Review</font>
    14
    .
    <font face="arial" size="-1">
    <b>Photos</b>
    </font>
    15
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/photo/premiere/stills">Premiere Photos</a>
    </font>
    16
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/photo/stills">Production Photos</a>
    </font>
    17
    .
    <font face="arial" size="-1">
    <b>Community</b>
    </font>
    18
    .
    <font face="arial" size="-1">
    <a href="http://messages.movies.yahoo.com/Movies/Films/forumview?bn=12172484-hv1809701422f0&amp;e=W5AndZVKwTRSyHgkLaLR9Gl1M42dLxmfgNVk9ig3DqCd.UZeLseE092KCacUqTZjaQ5.GlBjmMH61kQNGnJHhVWbYim1rA9uu17Oq8dcgpbqgtW6EF0U1Qp1PDzovKqNr4Xl">Message Board</a>
    </font>
    19
    .
    <font face="arial" size="-1">
    <b>Shopping</b>
    </font>
    20
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/buyvideo">Buy the DVD/Video</a>
    </font>
    21
    .
    <font face="arial" size="-1">
    <b>Other Resources</b>
    </font>
    22
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/web">Web Sites</a>
    </font>
    23
    .
    <font face="arial" size="-1">Set in West Texas, a man on the run with a suitcase full of money is pursued by a number of individuals.</font>
    24
    .
    <font face="arial">
    <b>Cast and Credits</b>
    </font>
    25
    .
    <font face="arial" size="-1">Starring:</font>
    26
    .
    <font face="arial" size="-1">
    <a href="/movie/contributor/1800016536">Tommy Lee Jones</a>,<a href="/movie/contributor/1800023079">Javier Bardem</a>,<a href="/movie/contributor/1800019611">Josh Brolin</a>,<a href="/movie/contributor/1800183279">Beth Grant</a>,<a href="/movie/contributor/1809126070">Garret Dillahunt</a>
    </font>
    27
    .
    <font face="arial" size="-1">Directed by:</font>
    28
    .
    <font face="arial" size="-1">
    <a href="/movie/contributor/1800025224">Joel Coen</a>,<a href="/movie/contributor/1800025225">Ethan Coen</a>
    </font>
    29
    .
    <font face="arial" size="-1">Produced by:</font>
    30
    .
    <font face="arial" size="-1">
    <a href="/movie/contributor/1809082023">Robert Graf (II)</a>,<a href="/movie/contributor/1808809218">Mark Roybal</a>,<a href="/movie/contributor/1800020262">Scott Rudin</a>
    </font>
    31
    .
    <font face="arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/cast">
    <b>See Full Cast and Credits</b>
    </a>
    </font>
    32
    .
    <font face="arial">
    <b>Production Photos</b>
    </font>
    33
    .
    <font face="arial">
    <b>Critical Consensus</b>
    </font>
    34
    .
    <font face="arial" size="-1">
    <b>
    <a href="http://movies.yahoo.com/movie/1809701422/critic">More Critics Reviews...</a>
    </b>
    </font>
    35
    .
    <font face="Arial" size="-1">
    <a href="http://movies.yahoo.com/movie/1809701422/user">
    <b>More User Reviews...</b>
    </a>
    <!-- ##comment TODO - need to be able to determine whether user has written

    comment a review. This is true if C_POSTED (in C++ code) is set to

    comment a nonzero value. This is in hf2k as $.already_posted -->  |  <a href="http://movies.yahoo.com/mvc/ecrv?mid=1809701422&amp;ys=HA9YfslBefY2J23hBP5y5A--">

      <b>Write Your Own Review!</b>
    

    </a>
    </font>
    36
    .
    <font face="arial" size="-1">
    <b>Yahoo! Movies</b>:<a href="http://us.rd.yahoo.com/movies/bottomtrough/intheaters/?http://movies.yahoo.com/movies/feature/thisweekend.html">In Theaters</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/showtimes/?http://movies.yahoo.com/showtimes/showtimes.html">Times &amp; Tickets</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/trailers/?http://movies.yahoo.com/trailers">Trailers</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/moviesshopping/?http://movies.yahoo.com/dvd/">DVD/Video</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/movienews/?http://movies.yahoo.com/news/main/">News &amp; Gossip</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/boxoffice/?http://movies.yahoo.com/boxoffice/latest/rank.html">Box Office</a>-<a href="http://movies.yahoo.com/browse">Browse Movies</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/moviesmain/?http://movies.yahoo.com">more...</a>
    <br/>
    <b>Yahoo! Entertainment</b>:<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/more/?http://movies.yahoo.com">Movies</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/music/?http://music.yahoo.com">Music</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/tv/?http://tv.yahoo.com">TV</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/games/?http://games.yahoo.com">Games</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/astrology/?http://astrology.yahoo.com">Astrology</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/more/?http://entertainment.tv.yahoo.com">more...</a>
    </font>

     
    • parker20121

      parker20121 - 2008-02-24

      I found the issue. The xpath query should be:

      (/html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font)[position()=23]/text()

      and not

      /html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font[23]/text()

      although I'm not exactly sure what the difference is at this point.

       
      • Vladimir Nikic

        Vladimir Nikic - 2008-02-26

        There is an important difference between those two XPath expressions:

        /html/div/font[3] returns the 3rd FONT child of each DIV, and
        (/html/div/font)[3] returns the 3rd node in the whole /html/div/font node set.

        For the following XML:

        <html>
        <div>
        <font>aaa</font>
        </div>
        <div>
        <font>bbb</font>
        </div>
        <div>
        <font>ccc</font>
        </div>
        </html>

        The first one returns nothing, because no DIV has three FONT children, and the second returns <font>ccc</font>.
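
The same behaviour can be reproduced outside Web-Harvest. The following is only a rough sketch using Python's standard xml.etree.ElementTree module, which is an assumption on my part and not the XPath engine Web-Harvest uses. ElementTree supports positional predicates such as font[3] but cannot parse the parenthesised form, so plain list indexing stands in for (…)[3] here.

```python
import xml.etree.ElementTree as ET

# The sample document from this post: three DIVs, each with exactly one FONT.
DOC = """
<html>
  <div><font>aaa</font></div>
  <div><font>bbb</font></div>
  <div><font>ccc</font></div>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element

# Per-parent position: "the 3rd FONT child of each DIV".
# No DIV has three FONT children, so nothing matches.
per_parent = root.findall("./div/font[3]")

# Whole-set position: collect every div/font first, then take the 3rd.
# ElementTree cannot parse "(./div/font)[3]", so list indexing stands in
# for the parenthesised form (index 2 is the 3rd node).
all_fonts = root.findall("./div/font")
third_overall = all_fonts[2]

print(len(per_parent))     # 0
print(third_overall.text)  # ccc
```

The original query fails for the same reason: font[23] asks for the 23rd FONT inside a single TD, while the 23 FONT elements listed earlier are spread across many TDs.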

        Regards, Vladimir.

         
    • parker20121

      parker20121 - 2008-02-28

      Thanks for the clarification.

       
