The API Crawler crawls five major social platforms (Twitter, Facebook, YouTube,
Flickr and Google+) using user-specified keywords and selectable crawl strategies.
It is implemented in Python and released under the GPLv3. It depends on
APIBlender, python-oauth2, python-warc, web.py, and the Python interface to
H2RDF.
The archive contains all the dependencies.
To launch:
cd apicrawler
source bin/activate
src/apicrawler.py
More information can be found in the source code.
The API Crawler can be driven through its web interface. Since the interface
adheres to REST principles, it can easily be scripted.
Principles: the crawler runs crawls. Each crawl has a crawl ID assigned by the
client; the client is responsible for keeping crawl IDs unique. A crawl is in
one of four states: running, stopped, being deleted, or deleted. A crawl runs
until it ends by itself or until a stop order is received. Only a stopped crawl
can be deleted.
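The lifecycle above can be sketched as a small state machine. This is an
illustration only; the class, method and state names are assumptions, not taken
from the crawler's code:

```python
# Illustrative sketch of the crawl lifecycle described above.
RUNNING, STOPPED, BEING_DELETED, DELETED = (
    "running", "stopped", "being deleted", "deleted")

class Crawl:
    def __init__(self, crawl_id):
        # The client assigns the ID and must keep it unique.
        self.crawl_id = crawl_id
        self.status = RUNNING

    def stop(self):
        # A crawl stops when it ends by itself or on a stop order.
        if self.status != RUNNING:
            raise ValueError("only a running crawl can be stopped")
        self.status = STOPPED

    def delete(self):
        # Only a stopped crawl can be deleted; deletion is asynchronous,
        # so the crawl first shows up as "being deleted".
        if self.status != STOPPED:
            raise ValueError("only a stopped crawl can be deleted")
        self.status = BEING_DELETED
```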
Below, a dollar sign ($) in a path or domain component indicates that the
component is not a literal.
In all error cases, the payload contains a JSON array of the form:
["error", term]
where the second element of the array can be any term.
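A client could detect this error convention as follows; this is a minimal
sketch, and the function name is illustrative:

```python
import json

def is_error(payload_text):
    # An error payload is a two-element JSON array whose first
    # element is the string "error".
    payload = json.loads(payload_text)
    return (isinstance(payload, list) and len(payload) == 2
            and payload[0] == "error")
```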
Starts the crawl. The crawl ID is used by the crawler to retrieve the full
crawl configuration. Note: instead of passing the ID in the URL, the request
body carries a JSON dictionary, which makes it easy to add information later.
In:
{ "crawl_id": 123 }
Out:
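As a sketch, a client could send the start request above like this. The host,
port and path are hypothetical; the actual endpoint is defined by the crawler's
web interface:

```python
import json
import urllib.request

# Hypothetical endpoint URL; substitute the crawler's real address and path.
url = "http://localhost:8080/crawl/start"
body = json.dumps({"crawl_id": 123}).encode("utf-8")
request = urllib.request.Request(
    url, data=body, headers={"Content-Type": "application/json"})
# urllib.request.urlopen(request) would send the POST request.
```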
Lists all running, stopped and "being deleted" crawls with some information
about each.
In: nothing
Out:
200 with a list of this form:
[
  {
    "campaign_id": "presidential_elections_usa_2012",
    "dates": {
      "end_date": "None",
      "start_date": "2012-09-20_14:26:16"
    },
    "id": 27118696,
    "parameters": [ "helium", "style" ],
    "platform": "youtube",
    "spiders": [
      {
        "dates": {
          "actual_end_date": "None",
          "actual_start_date": "2012-09-20_14:26:16",
          "end_date": "None",
          "start_date": "2012-09-20_14:26:16"
        },
        "id": 27118984,
        "output_warc": null,
        "running time in seconds": null,
        "statistics": {
          "total_outlinks": 3894,
          "total_responses": 3,
          "total_triples": 2747
        },
        "status": "running"
      }
    ],
    "strategy": "search"
  },
  ...
]
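For reference, a client could summarize such a listing as sketched below; the
payload here is a trimmed-down version of the example above:

```python
import json

listing = json.loads("""[
  {"campaign_id": "presidential_elections_usa_2012",
   "platform": "youtube",
   "strategy": "search",
   "spiders": [
     {"status": "running",
      "statistics": {"total_outlinks": 3894,
                     "total_responses": 3,
                     "total_triples": 2747}}]}
]""")

# Sum the triples harvested per campaign across all of its spiders.
triples = {
    crawl["campaign_id"]: sum(
        spider["statistics"]["total_triples"]
        for spider in crawl["spiders"])
    for crawl in listing
}
```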
Stops the crawl.
In: nothing
Out:
Deletes the crawl. Can be called only on a stopped crawl. Asynchronous: the
crawl will show up as "being deleted" for some time before disappearing
altogether.
In: nothing
Out:
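Since deletion is asynchronous, a client may want to poll the listing until the
crawl disappears. A sketch, where the list_crawls argument stands in for an
HTTP GET on the listing endpoint and is an assumption of this example:

```python
import time

def wait_until_deleted(list_crawls, crawl_id, poll_interval=1.0, timeout=30.0):
    # list_crawls() is expected to return the listing described above,
    # including crawls in the "being deleted" state.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ids = {crawl["id"] for crawl in list_crawls()}
        if crawl_id not in ids:
            return True  # the crawl has disappeared from the listing
        time.sleep(poll_interval)
    return False  # still listed when the timeout expired
```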