The OpenIMAJ FlickrCrawler tools enable you to download large collections of images using the Flickr API for experimentation purposes. The FlickrCrawler tools are implemented as simple Groovy scripts, and as such require that you have Groovy version 1.7 or later installed on your system.
The FlickrCrawler.groovy script is the main tool for downloading images using the flickr.photos.search API. It has a number of useful features, most of which are controlled through its configuration file.
The FlickrCrawler.groovy script is invoked from the command-line as follows:
groovy FlickrCrawler.groovy config_file
where config_file is the path to the configuration file that describes the parameters of your crawl, as described below.
The FlickrCrawler.groovy configuration file is a simple text file that contains the information the crawler needs to find the relevant images to download. A complete configuration file will look like the following:
crawler {
    apikey="ENTER_YOUR_FLICKR_API_KEY_HERE"     //your flickr api key
    secret="ENTER_YOUR_FLICKR_API_SECRET_HERE"  //your flickr api secret
    apihitfreq=1000           //number of milliseconds between api calls
    hitfreq=1000              //number of milliseconds between retries of failed downloads
    outputdir="crawl-data"    //name of directory to save images and data to
    maximages=-1              //limit the number of images to be downloaded; -1 is unlimited
    maxRetries=3000           //maximum number of retries after failed api calls
    force=false               //force re-download of duplicate images
    perpage=500               //number of results to request from the api per call
    queryparams {
        //the parameters describing the query
    }
    concurrentDownloads=16    //max number of concurrent image downloads
    pagingLimit=20            //max number of pages to look through
    maxretrytime=300000       //maximum amount of time between retries
    data {
        info=true             //download all the information about each image
        exif=true             //download all the exif information about each image
    }
    images {
        targetSize=["large","original"]  //preferred image sizes in order
        smallSquare=false     //should small square images be downloaded
        thumbnail=false       //should thumbnail images be downloaded
        small=false           //should small images be downloaded
        medium=false          //should medium images be downloaded
        large=false           //should large images be downloaded
        original=false        //should original size images be downloaded
    }
}
In practice, however, the crawler has sensible defaults for most of the configuration and many of the options can be omitted. For most crawls, the important parts of the configuration are:

- crawler.apikey. This is your Flickr API key; if you don't have one you can generate one here.
- crawler.secret. This is your Flickr API secret, which you got when you generated your key.
- crawler.outputdir. This specifies where you want to save the images.
- crawler.maximages. This specifies how many images you want.
- crawler.images.targetSize. This specifies which size of image you would prefer.
- crawler.data.info. This specifies whether the crawler should attempt to download all the available metadata for each image. Normally you don't want this, as the many extra API calls it creates make the crawl very slow. Even if this is set to false, a large amount of metadata will be downloaded to the images.csv file automatically (see below).
- crawler.data.exif. This specifies whether the crawler should attempt to download all the available EXIF data for each image. Normally you don't want this, as the many extra API calls it creates make the crawl very slow.
- crawler.queryparams. This is where the query to flickr.photos.search is configured. See below for some example configurations. The flickr.photos.search page describes the various search options available. Note that the parameters described on the Flickr API options page are written with underscores, whereas in the configuration file they must be written in camelCase (i.e. the content_type option would be written as contentType in the configuration file); a short illustrative fragment follows this list.
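For instance, to restrict a crawl to photos only (the content_type parameter) uploaded after a given date (the min_upload_date parameter), those flickr.photos.search options would appear in the queryparams block in their camelCase forms. The fragment below is a hypothetical illustration rather than one of the shipped sample configurations; the accepted values and formats are defined by the Flickr API documentation.

queryparams {
    contentType="1"                      //flickr.photos.search content_type (1 = photos only)
    minUploadDate="2012-01-01 00:00:00"  //flickr.photos.search min_upload_date (MySQL datetime or unix timestamp)
}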
The following examples demonstrate practical usage of FlickrCrawler.groovy. The first configuration can be used to download all of the geo-tagged images from Southampton, UK that are licensed with the Creative Commons Attribution-NonCommercial License:
crawler { apikey="..." secret="..." outputdir="southampton-cc" queryparams { woeId="35356" //from flickr.places.find license="2" //from flickr.photos.licenses.getInfo } data { info=false exif=false } images { targetSize=["large", "original", "medium"] } }
The important parts of the configuration are crawler.queryparams.woeId, which tells the crawler to find images with the specified Flickr where-on-earth identifier, and crawler.queryparams.license, which specifies the license requirements for the downloaded images. Specific woeIds can be looked up using the flickr.places.find explorer page. The mapping between actual licenses and license identifiers can be found on the flickr.photos.licenses.getInfo explorer page.
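If you prefer to look up a where-on-earth identifier from a script rather than the explorer page, the flickr.places.find method can also be called directly through the Flickr REST endpoint. The following stand-alone Groovy sketch is a hypothetical helper, not part of the FlickrCrawler tools; it assumes a valid API key and the standard Flickr REST XML response format, and simply prints the woeid of each matching place.

// Hypothetical helper: look up where-on-earth identifiers via flickr.places.find.
def apikey = "ENTER_YOUR_FLICKR_API_KEY_HERE"
def query  = URLEncoder.encode("Southampton, UK", "UTF-8")

def url = "https://api.flickr.com/services/rest/?method=flickr.places.find" +
          "&api_key=" + apikey + "&query=" + query

// The response is the standard <rsp><places><place woeid="...">name</place></places></rsp> XML.
def rsp = new XmlSlurper().parseText(new URL(url).text)
rsp.places.place.each { place ->
    println "${place.text()}  woeid=${place.@woeid}"
}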
The following configuration illustrates how the FlickrCrawler.groovy script can be used to download 100 images tagged with "city" but not "night":
crawler { apikey="..." secret="..." outputdir="city-not-night" maximages=100 queryparams { tags=["city", "-night"] tagMode="bool" } data { info=false exif=false } images { targetSize=["large", "original", "medium"] } }
The crawler.queryparams part is self-explanatory. It should be noted, however, that the Flickr API will not allow you to search only with negative terms, so it isn't possible to search for just "not night".
As the crawler runs it will download images to a directory structure inside the outputdir specified in the configuration. In addition to the images, the directory contains a number of other files which relate to the crawl:
- crawler.config contains a complete copy of the crawler configuration with all the default variables expanded. Do not edit this file.
- crawler.state contains internal information about the state of the crawl, and can be used by the crawler to resume if it is interrupted.
- crawler-info.log contains a log of the crawler's actions.
- images.csv contains a large amount of metadata about each downloaded image in CSV format, with each line corresponding to a single image. Specifically, the fields correspond to all the metadata that the flickr.photos.search API can return with each list of images.

Sometimes the FlickrCrawler will fail to download some images (for example, because of network issues). The DownloadMissingImages.groovy script will parse the images.csv file from a crawl and automatically attempt to download any missing images. Usage is simple; just run the script with the path to the crawl output directory (the outputdir specified in your original configuration):
groovy DownloadMissingImages.groovy crawldir
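If you just want a rough indication of whether a crawl is complete, something like the following stand-alone Groovy sketch can help. It is a hypothetical check, not one of the shipped tools: it assumes the common jpg/png extensions and simply compares the number of records in images.csv with the number of image files found under the crawl directory, so treat any mismatch only as a hint to investigate further.

// Hypothetical post-crawl sanity check: compare images.csv records with files on disk.
def crawldir = new File(args.length > 0 ? args[0] : "crawl-data")

// Count non-empty lines in images.csv (one record per downloaded image).
def records = 0
new File(crawldir, "images.csv").eachLine { line ->
    if (line.trim()) records++
}

// Count image files anywhere under the crawl directory (the extension list is an assumption).
def files = 0
crawldir.eachFileRecurse { f ->
    if (f.isFile() && f.name.toLowerCase() ==~ /.*\.(jpg|jpeg|png)/) files++
}

println "images.csv records:  ${records}"
println "image files on disk: ${files}"
if (files < records) {
    println "Some images appear to be missing; consider running DownloadMissingImages.groovy"
}

Save it under any name you like (the file name above is a placeholder) and run it in the same way as the other scripts, passing the crawl output directory as the argument.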
Anonymous
Hello,
I'm a PhD candidate and I want to use FlickrCrawler, but I can't. Although I installed OpenIMAJ successfully using Maven, when I try to run FlickrCrawler I get the following error: "unresolved dependency: syslogr.SysLogR class". I conducted an extensive Google search for this class but I can't find anything.
How can I solve this problem? Is there any other package for doing the same job?
Thanks in advance,
Kostas
Hi Kostas. The syslogr dependency was for something that we were experimenting with locally and isn't actually required in normal usage. I've just committed a new version of the FlickrCrawler with it removed completely, and also updated the @GrabResolver annotation to point at the correct server as the octopussy one mentioned was decommissioned a while ago. You can update your version with SVN, or get the new version here: https://sourceforge.net/p/openimaj/code/HEAD/tree/trunk/tools/FlickrCrawler/FlickrCrawler.groovy
Let us know if you have any problems.
Hello,
I am trying to use your awesome tool, but unfortunately it is not working and I am getting the following problem:
Note: I already put the public / secret keys into the file.
$ groovy FlickrCrawler.groovy sample-city-not-night.config
Caught: java.lang.IllegalAccessError: tried to access class groovyx.gpars.ThreadLocalPools from class groovyx.gpars.Parallelizer
java.lang.IllegalAccessError: tried to access class groovyx.gpars.ThreadLocalPools from class groovyx.gpars.Parallelizer
at groovyx.gpars.Parallelizer.<clinit>(Parallelizer.groovy:51)
at FlickrCrawler.run(FlickrCrawler.groovy:253)
Ending crawl
I couldn't find anything related to what the problem is or what is causing it. Your input would be appreciated.
Thanks,
Ali
Hello again,
I replaced Groovy 2.2 with 1.7 and it is now working. It seems that the script isn't compatible with the newer version of Groovy.
Sorry for the trouble, it is working now.
Thanks a lot.
This definitely will help speed up the research.