Anonymous - 2008-03-21
I can't seem to get this to move on to the next page.
Here is the site I am scraping to test.
http://www.blackborder.com/cgi-bin/prices/ex_sublist.cgi?sid=DwsHMHwf99&sub_id=184&page=0
Here is the HTML that contains the href.
<table border=0 cellpadding=10 cellspacing=0 width="94%">
<tr><td><table border=0 cellpadding=4 cellspacing=0 width="94%"><tr><td> </td><td align="right"> <a href="ex_sublist.cgi?sid=yffMCpZAtn&sub_id=184&page=1">Page 2>></a></td></tr></table></td></tr>
</table>
Here is my nextXPath parameter value: <call-param name="nextXPath">//a[contains(., 'Page')]/@href</call-param>
I get the data from the first page, but it never goes on to the next one.
Could the ampersands be giving me trouble?
Any help would be appreciated!
Found the solution. I was basically pulling back duplicates: the XPath matched the same href more than once.
distinct-values was the fix.
<call-param name="nextXPath">distinct-values(//a[contains(., 'Page ')]/@href)</call-param>
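For anyone wondering why this helps: the original expression returns one href per matching link, so if the page repeats its "Page 2>>" link (say, once above and once below the listing, which is only my guess about this site), the same URL comes back twice. distinct-values collapses those into a single value. Here is a rough Python sketch of the same idea. The `hrefs` list is made up to stand in for the XPath result, and `distinct_values` is a hypothetical helper, not anything from Web-Harvest. Note that it keeps first-seen order, which XPath's distinct-values does not promise.

```python
def distinct_values(values):
    """Drop repeated items while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(values))


# Hypothetical XPath result: the same "next page" href matched twice.
hrefs = [
    "ex_sublist.cgi?sid=yffMCpZAtn&sub_id=184&page=1",
    "ex_sublist.cgi?sid=yffMCpZAtn&sub_id=184&page=1",
]

next_links = distinct_values(hrefs)
print(next_links)
```

With the duplicate removed, only one next-page link is left to follow.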
It looks like the issue is due to the href being a relative path. Here is the log where it goes to download another page. I don't know why it is repeating the URL. Does the XQuery need to return only one result? How do I return only the first one?
INFO - HtmlToXmlProcessor starts processing...
INFO - HttpProcessor starts processing...
INFO - Downloaded: http://www.blackborder.com/cgi-bin/prices/ex_sublist.cgi?sid=UlGTdZEwAQ&sub_id=184&page=1
ex_sublist.cgi?sid=UlGTdZEwAQ&sub_id=184&page=1, mime type = text/html, length = 533B.
INFO - HttpProcessor processor executed in 125ms.
INFO - HtmlToXmlProcessor processor executed in 125ms.
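On the relative-path worry: a relative href like this one is normally resolved against the URL of the page it came from, and the "Downloaded:" line in the log suggests that is what happened here. To return only the first match in XPath, something like (//a[contains(., 'Page')]/@href)[1] should work. The Python sketch below shows both steps under some assumptions: `page_url` is the page=0 address with the sid from the log (my guess at the page being processed), and `first_next_url` is a hypothetical helper for illustration.

```python
from urllib.parse import urljoin


def first_next_url(page_url, hrefs):
    """Resolve the first matched href against the page URL.

    Returns None when nothing matched (no next page).
    """
    if not hrefs:
        return None
    return urljoin(page_url, hrefs[0])


# Assumed current page: page=0 with the sid shown in the log.
page_url = (
    "http://www.blackborder.com/cgi-bin/prices/"
    "ex_sublist.cgi?sid=UlGTdZEwAQ&sub_id=184&page=0"
)

# Hypothetical XPath result with the same relative href matched twice.
hrefs = ["ex_sublist.cgi?sid=UlGTdZEwAQ&sub_id=184&page=1"] * 2

next_url = first_next_url(page_url, hrefs)
print(next_url)
```

Taking the first match sidesteps the duplicate problem in a different way than distinct-values does, and either approach leaves one URL to download.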