File Date Author Commit
 .vscode 2018-01-29 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [2cc08d] refactor fetch service. Start cli for fetch
 errs 2018-01-15 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [1de2b7] add more tests to fetcher package
 examples 2018-03-06 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [e8fe2f] add goodreads.com sample to examples
 extract 2018-02-28 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [554227] change doc for extract package
 fetch 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [e7e412] add docker files to fetch and parse services. c...
 healthcheck 2018-01-18 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [d472ab] healthcheck doc update for correcting godoc issue
 images 2018-02-28 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [222895] screenshots
 log 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [e7e412] add docker files to fetch and parse services. c...
 paginate 2018-01-26 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [89dea1] change of scrape package. Combine all helpers t...
 parse 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [e7e412] add docker files to fetch and parse services. c...
 scrape 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [a86035] remove unnecessary logging
 splash 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [1429aa] add filter requests to splash
 storage 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [45bc9e] remove unnecessary logging
 testdata 2018-02-28 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [34c031] change readme.md move payload files to examples...
 utils 2018-01-26 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [89dea1] change of scrape package. Combine all helpers t...
 .gitignore 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [c21a45] remove img_src from repo
 .travis.yml 2018-01-04 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [7c2911] added docker image to .travis.yml
 CODE_OF_CONDUCT.md 2017-06-21 Dmitry Narizhnykh Dmitry Narizhnykh [b7c1dd] Create CODE_OF_CONDUCT.md
 CONTRIBUTING.md 2018-01-14 Dmitry Narizhnykh Dmitry Narizhnykh [25c506] Update CONTRIBUTING.md
 Gopkg.lock 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [9076c0] update vendor packages
 Gopkg.toml 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [689b21] update vendor packages
 LICENSE 2018-01-01 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [8cfe29] added some docs
 README.md 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [dce999] add docker-compose steps
 _config.yml 2017-12-16 Dmitry Narizhnykh Dmitry Narizhnykh [916fbb] Set theme jekyll-theme-slate
 docker-compose.yml 2018-03-08 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [e7e412] add docker files to fetch and parse services. c...
 test.sh 2017-12-31 DMITRY NARIZHNYKH DMITRY NARIZHNYKH [4fc999] set up testing for travis CI

Read Me

Dataflow kit


Dataflow kit (DFK) extracts structured data from web pages according to the extractors you specify.

It can be used in many ways for data mining, data processing or archiving.

A typical use case is grabbing a list of products spread over several pages and following each product's details page to retrieve additional information. The parse endpoint returns the result as JSON, XML or CSV data.

DFK consists of two services: one fetches web pages and the other parses their content.

Fetch service

fetch.d is the daemon that downloads HTML pages. It sends requests to a Splash server. Splash is a JavaScript rendering service; it is used to retrieve the actual page data before it is passed to the parse.d daemon.

Parse service

parse.d is the daemon that extracts data from a downloaded web page, following the rules described in a JSON configuration file. Extracted data is returned in CSV, JSON or XML format.

Installation

Using dep

dep ensure -add github.com/slotix/dataflowkit@master

or go get

go get -u github.com/slotix/dataflowkit

Usage

Docker

  1. Install Docker and Docker Compose

  2. Start services.

cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose up

This command pulls the Docker images automatically and starts the services.

  3. Launch parsing in a second terminal window by sending a POST request to the parse daemon. Some JSON configuration files for testing are available in the /examples folder.
curl -XPOST  127.0.0.1:8001/parse --data-binary "@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json"

Here is a sample JSON configuration file:

{
    "name":"collection",
    "request":{
       "url":"https://example.com"
    },
    "fields":[
       {
          "name":"Title",
          "selector":".product-container a",
          "extractor":{
             "types":["text", "href"],
             "filters":[
                "trim",
                "lowerCase"
             ],
             "params":{
                "includeIfEmpty":false
             }
          }
       },
       {
          "name":"Image",
          "selector":"#product-container img",
          "extractor":{
             "types":["alt","src","width","height"],
             "filters":[
                "trim",
                "upperCase"
             ]
          }
       },
       {
          "name":"Buyinfo",
          "selector":".buy-info",
          "extractor":{
             "types":["text"],
             "params":{
                "includeIfEmpty":false
             }
          }
       }
    ],
    "paginator":{
       "selector":".next",
       "attr":"href",
       "maxPages":3
    },
    "format":"json",
    "paginateResults":false
}

Read more about scraper configuration JSON files in our GoDoc reference.

Extractors and filters are described at https://godoc.org/github.com/slotix/dataflowkit/extract
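
To make the shape of the configuration file easier to see, here is a minimal Go sketch that decodes a trimmed copy of the sample above and checks a few fields. The type names (Config, Field, Extractor, Paginator), the decodeConfig helper and its validation rules are hypothetical and written for illustration only; they are not taken from the Dataflow kit packages, whose actual types are documented in GoDoc.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Hypothetical types mirroring the keys of the sample configuration.
// They are NOT the types used inside Dataflow kit.
type Config struct {
	Name            string     `json:"name"`
	Request         Request    `json:"request"`
	Fields          []Field    `json:"fields"`
	Paginator       *Paginator `json:"paginator,omitempty"`
	Format          string     `json:"format"`
	PaginateResults bool       `json:"paginateResults"`
}

type Request struct {
	URL string `json:"url"`
}

type Field struct {
	Name      string    `json:"name"`
	Selector  string    `json:"selector"`
	Extractor Extractor `json:"extractor"`
}

type Extractor struct {
	Types   []string               `json:"types"`
	Filters []string               `json:"filters,omitempty"`
	Params  map[string]interface{} `json:"params,omitempty"`
}

type Paginator struct {
	Selector string `json:"selector"`
	Attr     string `json:"attr"`
	MaxPages int    `json:"maxPages"`
}

// sample is a shortened copy of the configuration shown above.
const sample = `{
  "name": "collection",
  "request": {"url": "https://example.com"},
  "fields": [
    {
      "name": "Title",
      "selector": ".product-container a",
      "extractor": {
        "types": ["text", "href"],
        "filters": ["trim", "lowerCase"],
        "params": {"includeIfEmpty": false}
      }
    }
  ],
  "paginator": {"selector": ".next", "attr": "href", "maxPages": 3},
  "format": "json",
  "paginateResults": false
}`

// decodeConfig unmarshals a configuration and applies two sanity checks.
// The checks are assumptions for this sketch, not rules enforced by parse.d.
func decodeConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return Config{}, err
	}
	if c.Request.URL == "" {
		return Config{}, errors.New("request.url is empty")
	}
	if len(c.Fields) == 0 {
		return Config{}, errors.New("no fields defined")
	}
	return c, nil
}

func main() {
	c, err := decodeConfig([]byte(sample))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%s: %d field(s), format %s\n", c.Name, len(c.Fields), c.Format)
	for _, f := range c.Fields {
		fmt.Printf("  %s <- %q %v\n", f.Name, f.Selector, f.Extractor.Types)
	}
}
```

Checking a file locally like this can catch a malformed JSON document before it is sent to the parse daemon.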

  4. To stop the services, press Ctrl+C and run
cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose down --remove-orphans --volumes

Manual way

  1. Start Splash docker container

docker run -d -it --rm -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

Splash is used for fetching web pages to feed a Dataflow kit parser.

  2. Build and run the fetch.d service
cd $GOPATH/src/github.com/slotix/dataflowkit/fetch/fetch.d && go build && ./fetch.d
  3. In a new terminal window, build and run the parse.d service
cd $GOPATH/src/github.com/slotix/dataflowkit/parse/parse.d && go build && ./parse.d
  4. Launch parsing. See step 3 of the previous section.

Front-End

Try the front-end at http://scrape.dataflowkit.org, a point-and-click interface to the Dataflow kit services. It generates a JSON config file and sends a POST request to the DFK parser.


License

This is Free Software, released under the BSD 3-Clause License.

Contributing

You are welcome to contribute to our project.
- Please submit your issues
- Fork the project
