Dataflow kit extracts structured data from web pages, following the specified extractors.
It can be used in many ways for data mining, data processing, or archiving.
A typical use case is grabbing a list of products across several pages and following each product's details page to retrieve additional information. The parse endpoint returns the extracted data as JSON, XML, or CSV.
DFK consists of two general services for fetching and parsing web page content.
fetch.d is the daemon that downloads HTML pages. It sends requests to a Splash server, a JavaScript rendering service that renders pages so their actual content can be retrieved before it is passed to the parse.d daemon.
parse.d is the daemon that extracts data from a downloaded web page following the rules described in a JSON configuration file. The extracted data is returned in CSV, JSON, or XML format.
Using dep
dep ensure -add github.com/slotix/dataflowkit@master
or go get
go get -u github.com/slotix/dataflowkit
Install Docker and Docker Compose
Start the services:
cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose up
This command automatically pulls the required Docker images and starts the services.
Launch parsing by sending a POST request with a configuration file to the parse daemon:
curl -XPOST 127.0.0.1:8001/parse --data-binary "@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json"
Here is a sample JSON configuration file:
{
    "name": "collection",
    "request": {
        "url": "https://example.com"
    },
    "fields": [
        {
            "name": "Title",
            "selector": ".product-container a",
            "extractor": {
                "types": ["text", "href"],
                "filters": [
                    "trim",
                    "lowerCase"
                ],
                "params": {
                    "includeIfEmpty": false
                }
            }
        },
        {
            "name": "Image",
            "selector": "#product-container img",
            "extractor": {
                "types": ["alt", "src", "width", "height"],
                "filters": [
                    "trim",
                    "upperCase"
                ]
            }
        },
        {
            "name": "Buyinfo",
            "selector": ".buy-info",
            "extractor": {
                "types": ["text"],
                "params": {
                    "includeIfEmpty": false
                }
            }
        }
    ],
    "paginator": {
        "selector": ".next",
        "attr": "href",
        "maxPages": 3
    },
    "format": "json",
    "paginateResults": false
}
Read more about scraper configuration JSON files in our GoDoc reference.
Extractors and filters are described at https://godoc.org/github.com/slotix/dataflowkit/extract
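Besides curl, the same request can be made programmatically. Below is a minimal Go sketch (standard library only, not part of the project API) that POSTs the sample configuration file to the parse daemon and prints the response; it assumes parse.d is listening on 127.0.0.1:8001 as in the example above, and the "application/json" content type is an assumption as well.

package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"path/filepath"
)

func main() {
	// Path to the sample configuration shipped with the repository.
	cfgPath := filepath.Join(os.Getenv("GOPATH"),
		"src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json")

	cfg, err := ioutil.ReadFile(cfgPath)
	if err != nil {
		panic(err)
	}

	// POST the scraper configuration to the parse.d daemon
	// (content type assumed; the curl example above sends the raw file body).
	resp, err := http.Post("http://127.0.0.1:8001/parse",
		"application/json", bytes.NewReader(cfg))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Extracted data in the format set by the "format" field of the configuration.
	fmt.Println(string(body))
}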
To stop the services and remove orphaned containers and volumes, run:
cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose down --remove-orphans --volumes
To run the services without Docker Compose, start a Splash Docker container:
docker run -d -it --rm -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
Splash is used for fetching web pages to feed the Dataflow kit parser.
Then build and start the fetch.d and parse.d daemons:
cd $GOPATH/src/github.com/slotix/dataflowkit/fetch/fetch.d && go build && ./fetch.d
cd $GOPATH/src/github.com/slotix/dataflowkit/parse/parse.d && go build && ./parse.d
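Once both daemons are running, parsing can be launched with the same POST request as above (this assumes parse.d listens on 127.0.0.1:8001, as in the Docker Compose example):
curl -XPOST 127.0.0.1:8001/parse --data-binary "@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json"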
Try http://scrape.dataflowkit.org, a front end with a point-and-click interface to Dataflow kit services. It generates a JSON configuration file and sends a POST request to the DFK parser.
This is Free Software, released under the BSD 3-Clause License.
You are welcome to contribute to our project.
- Please submit your issues
- Fork the project