WebHarvest - web data extraction tool / Discussion / Help: Problem using download-multipage-list

Rodrigo Rech - 2006-09-27

Hello, first let me say that I'm really enjoying Web Harvest, good job!

But I'm having a problem with the download-multipage-list function (using the google_images.xml example). If I define more than one word as the search variable, such as:
<var-def name="search">iron maiden</var-def>
the following URLs are "harvested":
1) Downloaded: http://images.google.com/images?q=iron maiden&hl=en&btnG=Search+Images&nojs=1
2) Downloaded: http://images.google.com/images?q=iron+maiden&nojs=1&svnum=10&hl=en&lr=&start=20&sa=N
3) Downloaded: http://images.google.com/images?q=iron%2Bmaiden&nojs=1&svnum=10&hl=en&lr=&start=40&sa=N
4) Downloaded: http://images.google.com/
After that, the same URL as number 4 is downloaded every time.

Notice that the space character is first encoded as + and then as %2B. Is this behavior correct? The function only returns the images from URLs 1 and 2.
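
For illustration, the step from + to %2B in the URLs above is what you would see if a query value that is already URL-encoded gets encoded a second time. Here is a minimal Python sketch of that effect. It is not Web Harvest code, and whether this is what actually happens inside the tool is only an assumption; the variable names are made up for the example.

```python
from urllib.parse import quote_plus, unquote_plus

# The search term from the example config
search = "iron maiden"

# First pass: the space is encoded as "+"
once = quote_plus(search)

# Second pass over the already-encoded value: the "+" itself is escaped
twice = quote_plus(once)

print(once)
print(twice)

# Decoding the double-encoded value gives back a literal "+", not a space,
# so the server would see a different query than the one intended
print(unquote_plus(twice))
```

If something like this is the cause, the query value should be encoded exactly once, at the point where the URL is built.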

Thanks for any help!
Rodrigo

- zoombongo - 2006-10-12
  
  Problems with google_images example. I'm using webharvest0261.jar
  
  The config file is not harvesting multiple pages. The first page is harvested fine:
  
  Downloaded: http://images.google.com/images?q=platon&hl=en&btnG=Search+Images&nojs=1, mime type = text/html, length = 25915B.
  
  but the next 4 pages are harvested as:
  
  Downloaded: http://images.google.com/, mime type = text/html, length = 4334B
  
  Any help would be appreciated.
  
  - Vladimir Nikic - 2006-10-12
    
    I tried it and found that it works in some cases and not in others. The problem is in the new version of the TagSoup library, a dependency which I added in version 0.26. I have also found some other bugs related to TagSoup. I'll consider another library for HTML cleanup.
    
- zoombongo - 2006-10-12
  
  Thank you for your consideration.
  
- Vladimir Nikic - 2006-09-27
  
  Yes, you are right. There is a bug in the URL encoding. I'll fix it as soon as possible.
  Thanks for your report.
  
- Vladimir Nikic - 2006-09-28
  
  The bug is now fixed in version 0.26.
  
- Rodrigo Rech - 2006-10-09
  
  Sorry for the delay, I was on vacation.
  
  Thanks a lot, now it's working fine :-)
  