I'm trying to parse information from Yahoo's movie site and have run into an XPath issue. I would like to think my syntax is correct and XPath simply isn't returning what I need, but more than likely I'm doing something wrong.
I'm using the following configuration file to retrieve the web page for a movie, parse out the movie's description, and store the result in an XML file. Here is the config:
<?xml version="1.0" encoding="UTF-8"?>
<config>
<var-def name="movieId">1809701422</var-def>
<var-def name="resultsPage">
<html-to-xml>
<http url="http://movies.yahoo.com/movie/${movieId}/info">
<http-header name="User-Agent">Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1</http-header>
</http>
</html-to-xml>
</var-def>
<var-def name="movieDescription">
<xpath expression="/html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font[23]">
<var name="resultsPage"/>
</xpath>
</var-def>
<!--
<loop item="link" index="i">
<list>
<xpath expression="/html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font/text()">
<var name="resultsPage"/>
</xpath>
</list>
<body>
<case>
<if condition="${i == 4}">
<var-def name="movieTitle">
<var name="link"/>
</var-def>
</if>
</case>
<file action="append" path="./font-text.txt">
<template>
${i}. ${link} ${sys.cr}${sys.lf}
</template>
</file>
</body>
</loop>
-->
<file action="write" path="./database/${movieId}.xml">
</file>
</config>
The movie description field comes back blank. But if I change the XPath query to read
/html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font
I get all the text values.
I can iterate over the nodes and return the text one by one using a loop that is fed by an XPath query. Writing each result to a file, with an index to indicate the node position, gives the output below. What I need is the text from node 23 as the description. So I know the original XPath query is selecting the correct set of nodes, but if I use a position higher than 1, I don't get any text from any of the child nodes at all. Any help would be appreciated.
1
.
<font face="arial" size="-1">Movie Main Page</font>
2
.
<font face="arial" size="-1">
<b>Movie Overview</b>
</font>
3
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/details">Movie Details</a>
</font>
4
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/showtimes">Showtimes & Tickets</a>
</font>
5
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/dvdinfo">DVD/Video Info</a>
</font>
6
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/trailer">Trailers & Clips</a>
</font>
7
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/cast">Cast and Credits</a>
</font>
8
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/awards">Awards & Nominations</a>
</font>
9
.
<font face="arial" size="-1">
<b>Reviews and Previews</b>
</font>
10
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/critic">Critics Reviews</a>
</font>
11
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/user">User Reviews</a>
</font>
12
.
<font face="arial" size="-1" color="gray">Greg's Preview</font>
13
.
<font face="arial" size="-1" color="gray">Movie Mom's Review</font>
14
.
<font face="arial" size="-1">
<b>Photos</b>
</font>
15
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/photo/premiere/stills">Premiere Photos</a>
</font>
16
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/photo/stills">Production Photos</a>
</font>
17
.
<font face="arial" size="-1">
<b>Community</b>
</font>
18
.
<font face="arial" size="-1">
<a href="http://messages.movies.yahoo.com/Movies/Films/forumview?bn=12172484-hv1809701422f0&e=W5AndZVKwTRSyHgkLaLR9Gl1M42dLxmfgNVk9ig3DqCd.UZeLseE092KCacUqTZjaQ5.GlBjmMH61kQNGnJHhVWbYim1rA9uu17Oq8dcgpbqgtW6EF0U1Qp1PDzovKqNr4Xl">Message Board</a>
</font>
19
.
<font face="arial" size="-1">
<b>Shopping</b>
</font>
20
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/buyvideo">Buy the DVD/Video</a>
</font>
21
.
<font face="arial" size="-1">
<b>Other Resources</b>
</font>
22
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/web">Web Sites</a>
</font>
23
.
<font face="arial" size="-1">Set in West Texas, a man on the run with a suitcase full of money is pursued by a number of individuals.</font>
24
.
<font face="arial">
<b>Cast and Credits</b>
</font>
25
.
<font face="arial" size="-1">Starring:</font>
26
.
<font face="arial" size="-1">
<a href="/movie/contributor/1800016536">Tommy Lee Jones</a>,<a href="/movie/contributor/1800023079">Javier Bardem</a>,<a href="/movie/contributor/1800019611">Josh Brolin</a>,<a href="/movie/contributor/1800183279">Beth Grant</a>,<a href="/movie/contributor/1809126070">Garret Dillahunt</a>
</font>
27
.
<font face="arial" size="-1">Directed by:</font>
28
.
<font face="arial" size="-1">
<a href="/movie/contributor/1800025224">Joel Coen</a>,<a href="/movie/contributor/1800025225">Ethan Coen</a>
</font>
29
.
<font face="arial" size="-1">Produced by:</font>
30
.
<font face="arial" size="-1">
<a href="/movie/contributor/1809082023">Robert Graf (II)</a>,<a href="/movie/contributor/1808809218">Mark Roybal</a>,<a href="/movie/contributor/1800020262">Scott Rudin</a>
</font>
31
.
<font face="arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/cast">
<b>See Full Cast and Credits</b>
</a>
</font>
32
.
<font face="arial">
<b>Production Photos</b>
</font>
33
.
<font face="arial">
<b>Critical Consensus</b>
</font>
34
.
<font face="arial" size="-1">
<b>
<a href="http://movies.yahoo.com/movie/1809701422/critic">More Critics Reviews...</a>
</b>
</font>
35
.
<font face="Arial" size="-1">
<a href="http://movies.yahoo.com/movie/1809701422/user">
<b>More User Reviews...</b>
</a>
<!-- ##comment TODO - need to be able to determine whether user has written
comment a review. This is true if C_POSTED (in C++ code) is set to
comment a nonzero value. This is in hf2k as $.already_posted --> | <a href="http://movies.yahoo.com/mvc/ecrv?mid=1809701422&ys=HA9YfslBefY2J23hBP5y5A--">
</a>
</font>
36
.
<font face="arial" size="-1">
<b>Yahoo! Movies</b>:<a href="http://us.rd.yahoo.com/movies/bottomtrough/intheaters/?http://movies.yahoo.com/movies/feature/thisweekend.html">In Theaters</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/showtimes/?http://movies.yahoo.com/showtimes/showtimes.html">Times & Tickets</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/trailers/?http://movies.yahoo.com/trailers">Trailers</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/moviesshopping/?http://movies.yahoo.com/dvd/">DVD/Video</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/movienews/?http://movies.yahoo.com/news/main/">News & Gossip</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/boxoffice/?http://movies.yahoo.com/boxoffice/latest/rank.html">Box Office</a>-<a href="http://movies.yahoo.com/browse">Browse Movies</a>-<a href="http://us.rd.yahoo.com/movies/bottomtrough/moviesmain/?http://movies.yahoo.com">more...</a>
<br/>
<b>Yahoo! Entertainment</b>:<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/more/?http://movies.yahoo.com">Movies</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/music/?http://music.yahoo.com">Music</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/tv/?http://tv.yahoo.com">TV</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/games/?http://games.yahoo.com">Games</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/astrology/?http://astrology.yahoo.com">Astrology</a>-<a href="http://us.rd.yahoo.com/entertainment/bottomtrough/more/?http://entertainment.tv.yahoo.com">more...</a>
</font>
I found the issue. The XPath query should be:
(/html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font)[position()=23]/text()
and not
/html/body/center/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/font[23]/text()
although I'm not exactly sure what the difference is at this point.
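For anyone hitting the same thing, here is a rough way to see why the positional predicate comes back empty. This is only a sketch in Python using the standard library's ElementTree, not Web-Harvest itself, and the `PAGE` markup is a made-up, much smaller stand-in for the converted page (the names `PAGE`, `fonts_per_cell` and so on are mine). ElementTree does not support the parenthesised form, so indexing the flattened result list plays that role here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the converted page: the <font> elements are
# spread over several <td> cells instead of sitting under one parent.
PAGE = """
<table>
  <tr><td><font>Movie Main Page</font><font>Movie Overview</font></td></tr>
  <tr><td><font>Movie Details</font></td></tr>
  <tr><td><font>Description</font></td></tr>
</table>
"""

root = ET.fromstring(PAGE)  # root is the <table> element

# How many <font> children each <td> has. A step like td/font[N] can only
# match if a single <td> holds at least N <font> children.
fonts_per_cell = [len(td.findall("font")) for td in root.iter("td")]

# Per-parent form: no cell has a fourth <font>, so nothing is selected.
per_cell_fourth = root.findall("tr/td/font[4]")

# Flattened form: gather every match in document order, then take the
# fourth one (Python lists are 0-based, XPath positions are 1-based).
flattened = root.findall("tr/td/font")
fourth_overall = flattened[3].text

print(fonts_per_cell)
print(per_cell_fourth)
print(fourth_overall)
```

If the largest per-cell count on the real page is below 23, that would explain why `font[23]` selects nothing while the parenthesised query with `position()=23` works.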
There is an important difference between those two XPath expressions:
/html/div/font[3] returns the 3rd FONT child inside each DIV (the position is counted per parent), and
(/html/div/font)[3] returns the 3rd element of the whole /html/div/font node set.
For the following XML:
<html>
<div>
<font>aaa</font>
</div>
<div>
<font>bbb</font>
</div>
<div>
<font>ccc</font>
</div>
</html>
the first one returns nothing, and the second returns <font>ccc</font>
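The same XML can be checked outside Web-Harvest. The following is a small sketch in Python with the standard library's ElementTree, used here only as a convenient stand-in; its XPath subset has no parenthesised form, so taking the third item of the combined result list is my substitute for it.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <div><font>aaa</font></div>
  <div><font>bbb</font></div>
  <div><font>ccc</font></div>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element

# Per-parent form: font[3] is evaluated inside each <div> separately,
# and no <div> has a third <font> child, so the result is empty.
per_parent = root.findall("div/font[3]")

# Whole-set form: collect every div/font match first, then pick the
# third item of that combined list (index 2, since lists are 0-based).
all_fonts = root.findall("div/font")
third_overall = all_fonts[2]

print(per_parent)          # []
print(third_overall.text)  # ccc
```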
Regards, Vladimir.
Thanks for the clarification.