I just found a problem with SiteCapturer. Line 686 says:
if (isToBeCaptured (image))
When you step into isToBeCaptured, the images are not recognized as
being part of the website. This code is the culprit:
return (
link.toLowerCase ().contains (getSource ().toLowerCase ())
&& (-1 == link.indexOf ("?"))
&& (-1 == link.indexOf ("#")));
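Here the source is http://www1.cs.columbia.edu/~shiraz/psetv001.htm
and the link may be:
http://www1.cs.columbia.edu/~shiraz/psetv001_files/image001.jpg
The link does not contain the full source URL (which ends in
psetv001.htm), so the contains() test fails and the check above keeps
returning false.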
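
To make the failure concrete, here is a minimal standalone sketch. It
copies the quoted check into a static method (with the source passed in
as a parameter instead of read from getSource ()) and compares it with
one possible alternative that tests against the directory part of the
source URL. The class CaptureCheck and the methods originalCheck and
prefixCheck are hypothetical names for illustration only; they are not
part of the library, and this is not necessarily how it should be fixed
upstream.

```java
public class CaptureCheck {

    // The check as quoted above, with the source passed in explicitly.
    static boolean originalCheck(String source, String link) {
        return link.toLowerCase().contains(source.toLowerCase())
            && (-1 == link.indexOf("?"))
            && (-1 == link.indexOf("#"));
    }

    // Hypothetical alternative (an assumption, not the library's code):
    // compare against the source URL up to and including its last '/'.
    static boolean prefixCheck(String source, String link) {
        String base = source.substring(0, source.lastIndexOf('/') + 1);
        return link.toLowerCase().startsWith(base.toLowerCase())
            && (-1 == link.indexOf("?"))
            && (-1 == link.indexOf("#"));
    }

    public static void main(String[] args) {
        String source = "http://www1.cs.columbia.edu/~shiraz/psetv001.htm";
        String image =
            "http://www1.cs.columbia.edu/~shiraz/psetv001_files/image001.jpg";

        System.out.println(originalCheck(source, image)); // false
        System.out.println(prefixCheck(source, image));   // true
    }
}
```

One caveat with this sketch: if the source has no path at all (for
example "http://host"), the last '/' is the one in "http://", so the
prefix would match nearly any link. A real fix would need to handle
that case.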
On Sat, Apr 18, 2009 at 9:22 PM, Athar Shiraz Siddiqui
<ath...@gm...> wrote:
> Good evening everyone,
>
> I am simply trying to screen scrape or download the HTML and images of
> a webpage, but I cannot do it with SiteCapturer. Could someone
> indicate why I cannot find the images when I run the following code:
>
> SiteCapturer worker = new SiteCapturer ();
> worker.setSource ("http://www1.cs.columbia.edu/~shiraz/psetv001.htm");
> worker.setTarget ("C:\\Temp\\set\\download1");
> //C:\\Temp\\set\\download1
> worker.setCaptureResources (true);
> worker.capture ();
> System.out.println("Done!");
>
> I stepped through the code and there don't seem to be any images in the
> mImages array.
>
> I would merely like to get the text of a website and a handle on the
> accompanying images. Thanks!
>
>
> --
> Shiraz
>
> 500 Riverside Drive
> #425
> New York, NY 10027
> (703) 879-8342 (skype prefer)
> (571) 276 2404 (cell)
> (212) 316 8630 (landline) (extn 8630)
>