Hello, first let me say that I'm really enjoying Web Harvest, good job!
But I'm having a problem with the download-multipage-list function (using the google_images.xml example). If I define more than one word as the search variable, such as:
<var-def name="search">iron maiden</var-def>
the following URLs are "harvested":
1) Downloaded: http://images.google.com/images?q=iron maiden&hl=en&btnG=Search+Images&nojs=1
2) Downloaded: http://images.google.com/images?q=iron+maiden&nojs=1&svnum=10&hl=en&lr=&start=20&sa=N
3) Downloaded: http://images.google.com/images?q=iron%2Bmaiden&nojs=1&svnum=10&hl=en&lr=&start=40&sa=N
4) Downloaded: http://images.google.com/
After that, the same URL as number 4 is downloaded every time.
Notice that the space character is first encoded as + and then as %2B. Is this behavior correct? The function only returns the images from URLs 1 and 2.
Thanks for any help!
Rodrigo
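The URLs above look like a double-encoding pattern: the space becomes + on the first pass, and the + itself becomes %2B when an already-encoded value is encoded again. Below is a minimal Java sketch of that effect, using only java.net.URLEncoder from the standard library. The class and method names are made up for illustration, and this is not Web Harvest's actual code:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Web Harvest.
class DoubleEncodingDemo {

    // Form-encodes a query value once, e.g. a space becomes '+'.
    static String encodeOnce(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "iron maiden";
        String once = encodeOnce(raw);    // space -> '+'
        String twice = encodeOnce(once);  // '+' -> "%2B"
        System.out.println(once);         // iron+maiden
        System.out.println(twice);        // iron%2Bmaiden
    }
}
```

If that is what is happening here, the search value would need to be encoded exactly once, at the point where the next-page URL is built.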
Problems with the google_images example. I'm using webharvest0261.jar.
The config file is not harvesting multiple pages. The first page is harvested fine:
Downloaded: http://images.google.com/images?q=platon&hl=en&btnG=Search+Images&nojs=1, mime type = text/html, length = 25915B.
but the next 4 pages are harvested as:
Downloaded: http://images.google.com/, mime type = text/html, length = 4334B
Any help would be appreciated.
I have tried it and found that it works in some cases and not in others. The problem is in the new version of the TagSoup library, a dependency I added in version 0.26. I have also found some other bugs related to TagSoup. I'll consider another library for HTML cleanup.
Thank you for your consideration.
Yes, you are right. There is a bug in URL encoding. I'll fix it as soon as possible.
Thanks for your report.
The bug is now fixed in version 0.26.
Sorry for the delay, I was on vacation.
Thanks a lot, now it's working fine :-)