go_spider

go_spider is a concurrent crawler (spider) framework written in Go. The crawler is flexible and modular: you can extend it into a customized crawler, or use only the default crawl components.

The Spider takes a Request from the Scheduler; each Request holds a URL to crawl. The Downloader then fetches the result of the Request (HTML, JSON, JSONP, or text) and saves it in a Page for the PageProcesser to parse. HTML parsing is based on the goquery package, and JSON parsing on the simple JSON package. JSONP is converted to JSON before parsing. Text is plain content that is handled without a parser. The PageProcesser module only parses results: it produces key-value pairs and the URLs to crawl in the next step. The key-value pairs are saved in PageItems, and the URLs are pushed to the Scheduler.
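The flow described above (Scheduler, Downloader, Page, PageProcesser, PageItems) can be illustrated with a small self-contained sketch. This is not go_spider's actual API: every type and function name below (Request, Page, Scheduler, Downloader, Processor, mapDownloader, lineProcessor, Crawl) is hypothetical and chosen only to mirror the roles named in the description. The sketch also assumes an in-memory downloader and a toy line-based page format so it runs without a network, and it uses a single goroutine for brevity, whereas the real framework is concurrent.

```go
package main

import (
	"fmt"
	"strings"
)

// Request names a URL to crawl (hypothetical type, not go_spider's API).
type Request struct{ URL string }

// Page carries the downloaded body plus whatever the processor extracts.
type Page struct {
	Request Request
	Body    string
	Items   map[string]string // key-value results (the PageItems role)
	Next    []string          // URLs to crawl in the next step
}

// Scheduler is a FIFO queue that ignores URLs it has already seen.
type Scheduler struct {
	queue []Request
	seen  map[string]bool
}

func NewScheduler() *Scheduler { return &Scheduler{seen: map[string]bool{}} }

// Push enqueues a URL unless it was pushed before.
func (s *Scheduler) Push(url string) {
	if s.seen[url] {
		return
	}
	s.seen[url] = true
	s.queue = append(s.queue, Request{URL: url})
}

// Poll returns the next request, or false when the queue is empty.
func (s *Scheduler) Poll() (Request, bool) {
	if len(s.queue) == 0 {
		return Request{}, false
	}
	r := s.queue[0]
	s.queue = s.queue[1:]
	return r, true
}

// Downloader fetches the body for a request.
type Downloader interface {
	Download(Request) (string, error)
}

// Processor parses a page, filling in Items and Next.
type Processor interface {
	Process(*Page)
}

// mapDownloader serves canned bodies (an assumption made for this sketch).
type mapDownloader map[string]string

func (m mapDownloader) Download(r Request) (string, error) {
	body, ok := m[r.URL]
	if !ok {
		return "", fmt.Errorf("no content for %s", r.URL)
	}
	return body, nil
}

// lineProcessor reads "title: X" lines as results and "link: U" lines as URLs.
type lineProcessor struct{}

func (lineProcessor) Process(p *Page) {
	for _, line := range strings.Split(p.Body, "\n") {
		switch {
		case strings.HasPrefix(line, "title: "):
			p.Items["title"] = strings.TrimPrefix(line, "title: ")
		case strings.HasPrefix(line, "link: "):
			p.Next = append(p.Next, strings.TrimPrefix(line, "link: "))
		}
	}
}

// Crawl runs the loop: poll, download, process, collect items, push new URLs.
func Crawl(start string, d Downloader, proc Processor) []map[string]string {
	s := NewScheduler()
	s.Push(start)
	var results []map[string]string
	for {
		req, ok := s.Poll()
		if !ok {
			return results
		}
		body, err := d.Download(req)
		if err != nil {
			continue // skip pages that fail to download
		}
		page := &Page{Request: req, Body: body, Items: map[string]string{}}
		proc.Process(page)
		results = append(results, page.Items)
		for _, u := range page.Next {
			s.Push(u)
		}
	}
}

func main() {
	site := mapDownloader{
		"/a": "title: A\nlink: /b\nlink: /a",
		"/b": "title: B\nlink: /missing",
	}
	for _, items := range Crawl("/a", site, lineProcessor{}) {
		fmt.Println(items["title"])
	}
}
```

Keeping the processor limited to parsing, as the description states, means the crawl loop alone decides where results are stored and which URLs are scheduled, so each component can be swapped independently.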

Features

Requires Go 1.2 or higher
Concurrent
Suited to crawling vertical communities
Flexible, Modular
Native Go implementation
Easily extended into a customized crawler

License

Mozilla Public License 1.0 (MPL)

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Go

Related Categories

Go Frameworks

Registered

2023-01-27
