APICrawler

John Arcoman

API crawler

The API Crawler crawls five major social platforms (Twitter, Facebook, YouTube,
Flickr and Google+) using user-specified keywords and selectable strategies.

It is implemented in Python and released under the GPLv3. It depends on
APIBlender, python-oauth2, python-warc, web.py, and the Python interface to
H2RDF.

Set up

The archive contains all the dependencies.

To launch:

cd apicrawler
source bin/activate
src/apicrawler.py

Architecture

  • The API crawler starts one thread per crawled platform and one per
    output module (triples, WARCs, outlinks),
  • it receives HTTP requests, handled by the apicrawler module (which uses
    web.py); instructions are then passed to the interface module,
  • the interface module takes care of creating the crawl,
  • the crawl creates the spiders; a crawl can have several spiders, for
    instance, a crawl repeating every six hours for 24 hours will have 4
    spiders,
  • the spiders are added to the different platforms' queues,
  • the platforms start the spiders at the appropriate time (after
    start_date),
  • the data produced is sent to the different output modules,
  • the WARC module writes the full responses into WARC files,
  • the outlinks module extracts outlinks from the responses and sends them
    to the Heritrix crawler or writes them to a backup file,
  • the triples module builds triples and sends them to the triple store or
    writes them to a backup file, and
  • it logs everything into the different log files.

More information can be found in the source code.
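
As a rough illustration of the flow above, the hedged Python sketch below
models each platform as a thread that pulls spiders from a queue, waits for
each spider's start_date, and fans responses out to the output modules. All
class, attribute and queue names here are invented for the example and may
not match the actual identifiers in the apicrawler source.

# Illustrative sketch only: names are invented, not taken from the source.
import queue
import threading
import time
from datetime import datetime

class Spider:
    # One pass of a crawl on one platform.
    def __init__(self, crawl_id, keywords, start_date):
        self.crawl_id = crawl_id
        self.keywords = keywords
        self.start_date = start_date

    def run(self, output_queues):
        # A real spider would query the platform API (via APIBlender) here.
        response = {"crawl_id": self.crawl_id, "keywords": self.keywords}
        for q in output_queues.values():   # fan out to warc/outlinks/triples
            q.put(response)

class Platform(threading.Thread):
    # Consumes spiders from its queue, running each after its start_date.
    def __init__(self, name, output_queues):
        super().__init__(daemon=True)
        self.name = name
        self.spider_queue = queue.Ueue() if False else queue.Queue()
        self.output_queues = output_queues

    def run(self):
        while True:
            spider = self.spider_queue.get()
            while datetime.now() < spider.start_date:
                time.sleep(1)              # wait for the scheduled start
            spider.run(self.output_queues)

outputs = {"warc": queue.Queue(), "outlinks": queue.Queue(),
           "triples": queue.Queue()}
youtube = Platform("youtube", outputs)
youtube.start()
youtube.spider_queue.put(Spider(123, ["helium", "style"], datetime.now()))
time.sleep(2)                              # let the daemon thread process it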

API Crawler API

The API crawler can be driven through its web interface. Since it adheres to
REST principles, it can easily be used in an automated way.

Principles: a crawler runs crawls. Each crawl has a crawl ID assigned by the
client. The client ensures crawl IDs are unique. A crawl has four states:
running, stopped, being deleted, deleted. A crawl runs until it ends by itself
or until a stop order is received. Only a stopped crawl can be deleted.
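
The lifecycle can be summarized as a small state machine. The sketch below,
using Python's enum module, only restates the transitions described in this
paragraph; it is not code taken from the crawler.

# Illustration of the crawl lifecycle described above; not crawler code.
from enum import Enum

class CrawlState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    BEING_DELETED = "being deleted"
    DELETED = "deleted"

# A crawl runs until it ends or is stopped; only a stopped crawl can be
# deleted; deletion is asynchronous, hence the intermediate state.
ALLOWED_TRANSITIONS = {
    CrawlState.RUNNING: {CrawlState.STOPPED},
    CrawlState.STOPPED: {CrawlState.BEING_DELETED},
    CrawlState.BEING_DELETED: {CrawlState.DELETED},
    CrawlState.DELETED: set(),
}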

Below, a dollar sign in a path or domain component indicates a placeholder,
not a literal value.

In all error cases, the payload should contain a JSON term of the form:

['error', term]

where the second element of the list can be anything.

POST http://$apicrawler_rest_server/add_from_triple_store

Starts the crawl. The crawl ID is used by the crawler to retrieve the full
crawl configuration. Note: instead of passing the ID in the URL, we use a
JSON dictionary to make it easy to add more information later.

In:

{ "crawl_id": 123 }

Out:

  • 200: the crawl was started (the ICS was successfully read);
  • 404: the crawl was not found in the triple store;
  • 500: the crawl was not started (the ICS was not successfully read, a
    request-decoding problem, or any other problem preventing the start).
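
For example, a client can start a crawl with a plain HTTP POST. The snippet
below is a hedged sketch using the requests library; the host is a
placeholder standing in for $apicrawler_rest_server and is not part of the
documented API.

# Hedged client sketch; the host below is a placeholder, not a real endpoint.
import requests

APICRAWLER = "http://localhost:8080"   # stands in for $apicrawler_rest_server

resp = requests.post(APICRAWLER + "/add_from_triple_store",
                     json={"crawl_id": 123})
if resp.status_code == 200:
    print("crawl started")
elif resp.status_code == 404:
    print("crawl not found in the triple store")
else:                                   # 500: decoding or start-up problem
    print("crawl not started:", resp.text)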

GET http://$apicrawler_rest_server/crawls

Lists all running, stopped and "being deleted" crawls with some information
about each.

In: nothing

Out:
200 with a list of this form:

[{
"campaign_id": "presidential_elections_usa_2012"
"dates": {
    "end_date": "None",
    "start_date": "2012-09-20_14:26:16"
},
"id": 27118696,
"parameters": [
    "helium",
    "style"
],
"platform": "youtube",
"spiders": [
    {
        "dates": {
            "actual_end_date": "None",
            "actual_start_date": "2012-09-20_14:26:16",
            "end_date": "None",
            "start_date": "2012-09-20_14:26:16"
        },
        "id": 27118984,
        "output_warc": null,
        "running time in seconds": null,
        "statistics": {
            "total_outlinks": 3894,
            "total_responses": 3,
            "total_triples": 2747
        },
        "status": "running"
    }
],
"strategy": "search"
}, ... ]
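
A client can poll this endpoint to monitor progress. The hedged sketch below
(again assuming the requests library and a placeholder host) lists each crawl
and summarizes its spiders, based on the response format shown above.

# Hedged sketch; the host and use of `requests` are assumptions.
import requests

APICRAWLER = "http://localhost:8080"   # stands in for $apicrawler_rest_server

for crawl in requests.get(APICRAWLER + "/crawls").json():
    running = [s for s in crawl["spiders"] if s["status"] == "running"]
    triples = sum(s["statistics"]["total_triples"] for s in crawl["spiders"])
    print("crawl %s on %s: %d/%d spiders running, %d triples so far"
          % (crawl["id"], crawl["platform"],
             len(running), len(crawl["spiders"]), triples))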

POST http://$apicrawler_rest_server/stop/$crawl_id

Stops the crawl.

In: nothing

Out:

  • 200 'OK' or 200 'Crawl was already stopped or finished'; or
  • 404: the crawler knows of no such crawl; or
  • 500: the crawl could not be stopped; a retry is needed.

DEL http://$apicrawler_rest_server/crawl/$crawl_id

Deletes the crawl. Can be called only on a stopped crawl. Asynchronous: the
crawl will show up as "being deleted" for some time, before disappearing
altogether.

In: nothing

Out:

  • 200 if the crawl ID matches a stopped crawl and the deletion was initiated
    successfully; or
  • 404: the crawler knows of no such crawl; or
  • 400: the crawl exists but is not in stopped state; or
  • 500: any other issue.
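
Putting the last two endpoints together, a client would stop a crawl, delete
it once stopped, and then poll the crawl list until the ID disappears, since
deletion is asynchronous. The sketch below makes the same assumptions as the
previous examples (requests library, placeholder host) and assumes the DEL
method above corresponds to HTTP DELETE.

# Hedged sketch: stop, then delete, then wait for the asynchronous deletion.
import time
import requests

APICRAWLER = "http://localhost:8080"   # stands in for $apicrawler_rest_server
crawl_id = 27118696                    # a crawl ID obtained from /crawls

requests.post(APICRAWLER + "/stop/%s" % crawl_id)        # 200, 404 or 500

resp = requests.delete(APICRAWLER + "/crawl/%s" % crawl_id)
if resp.status_code == 200:
    # The crawl shows up as "being deleted" for a while before disappearing.
    while any(c["id"] == crawl_id
              for c in requests.get(APICRAWLER + "/crawls").json()):
        time.sleep(5)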

Related

Wiki: TryIt
