Thread: [Htmlparser-user] SiteCapturer Not Downloading /Saving Images

From: Athar S. S. <ath...@gm...> - 2009-04-19 00:59:27

I am using the following snippet with SiteCapturer to save images, but
it won't save any. What is going on?

        SiteCapturer worker = new SiteCapturer ();
        // page to capture
        worker.setSource ("http://www1.cs.columbia.edu/~shiraz/psetv001.htm");
        // local directory that receives the captured files
        worker.setTarget ("C:\\Temp\\set\\download1");
        // also capture resources such as images
        worker.setCaptureResources (true);
        worker.capture ();
        System.out.println("Done!");
        System.exit (0);


-- 
Shiraz

500 Riverside Drive
#425
New York, NY 10027
(703) 879-8342 (skype prefer)
(571) 276 2404 (cell)
(212) 316 8630 (landline) (extn 8630)

Re: [Htmlparser-user] SiteCapturer Not Downloading /Saving Images

From: Athar S. S. <ath...@gm...> - 2009-04-19 01:23:10

Good evening everyone,

I am simply trying to screen scrape, or download, the HTML and images of
a webpage, but I cannot do it with SiteCapturer. Could someone explain
why I cannot find the images when I run the following code:

        SiteCapturer worker = new SiteCapturer ();
        worker.setSource ("http://www1.cs.columbia.edu/~shiraz/psetv001.htm");
        worker.setTarget ("C:\\Temp\\set\\download1");
        worker.setCaptureResources (true);
        worker.capture ();
        System.out.println("Done!");

I stepped through the code and there don't seem to be any images in the
mImages array.

I would merely like to get the text of a website and a handle on the
accompanying images. Thanks!



Re: [Htmlparser-user] SiteCapturer Not Downloading /Saving Images

From: Athar S. S. <ath...@gm...> - 2009-04-19 02:22:33

I just found a problem with SiteCapturer. Line 686 reads:

            if (isToBeCaptured (image))

Inside isToBeCaptured, the images are not recognized as being part of
the website. This code is the culprit:

        return (
            link.toLowerCase ().contains (getSource ().toLowerCase ())
            && (-1 == link.indexOf ("?"))
            && (-1 == link.indexOf ("#")));

Here the source is http://www1.cs.columbia.edu/~shiraz/psetv001.htm
and a link may be:
http://www1.cs.columbia.edu/~shiraz/psetv001_files/image001.jpg
The link does not contain the full source string (it has no
"psetv001.htm" in it), so the check above keeps returning false.
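
To make the failure concrete, here is a minimal standalone sketch that
mirrors the check quoted above without using the library. The class name
CaptureCheckDemo and the helper parentOf are hypothetical, introduced only
for illustration, and the source is passed in as a parameter in place of
getSource (). It compares the check against the page URL with the same
check against the page's parent directory.

```java
import java.util.Locale;

public class CaptureCheckDemo {

    // Mirrors the check quoted in the message; 'source' stands in for getSource().
    public static boolean isToBeCaptured(String source, String link) {
        return link.toLowerCase(Locale.ROOT).contains(source.toLowerCase(Locale.ROOT))
            && link.indexOf('?') == -1
            && link.indexOf('#') == -1;
    }

    // Hypothetical helper: keep everything up to and including the last '/'.
    // Assumes the URL has a path component after the host.
    public static String parentOf(String url) {
        return url.substring(0, url.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String source = "http://www1.cs.columbia.edu/~shiraz/psetv001.htm";
        String image = "http://www1.cs.columbia.edu/~shiraz/psetv001_files/image001.jpg";

        // The image URL does not contain ".../psetv001.htm", so this prints false.
        System.out.println(isToBeCaptured(source, image));

        // Against the parent directory ".../~shiraz/" the same check prints true.
        System.out.println(isToBeCaptured(parentOf(source), image));
    }
}
```

This suggests the check may behave as intended when the source names a
directory rather than a single page. Whether passing the directory URL to
setSource is a workable workaround is an assumption I have not verified,
since it would also change the page the capture starts from.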

