HAipproxy is a distributed proxy IP pool system designed to collect, manage, and serve large numbers of proxy addresses for web crawling tasks. It automatically crawls proxy resources from the internet and aggregates them into a centralized pool that distributed spiders and scraping systems can draw from.

HAipproxy is built in Python: Scrapy handles high-performance crawling, while Redis provides data storage, communication, and task coordination between components. The system includes crawlers that discover proxy servers, validators that test proxy availability and performance, and schedulers that manage crawling and validation tasks.

The goal is to maintain a high-availability, low-latency proxy pool so that scraping frameworks can rotate proxies efficiently and avoid blocking during large-scale data collection. The architecture supports distributed deployment, allowing multiple crawler workers and validators to run across different machines.
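To make the pool-and-rotation idea concrete, here is a minimal, purely illustrative in-memory sketch. The class name, scoring scheme, and methods below are hypothetical and not HAipproxy's actual API; the real system keeps proxies and their quality scores in Redis so that many workers can share one pool.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ProxyPool:
    """Illustrative pool: each proxy carries a quality score; higher is better."""
    scores: dict = field(default_factory=dict)

    def add(self, proxy: str, score: float = 10.0) -> None:
        # Crawlers would call this when they discover a new proxy.
        self.scores[proxy] = score

    def get(self) -> str:
        # Spiders pick among the top-scored proxies, so rotation
        # favors proxies that have recently worked well.
        if not self.scores:
            raise LookupError("proxy pool is empty")
        best = max(self.scores.values())
        candidates = [p for p, s in self.scores.items() if s == best]
        return random.choice(candidates)

    def penalize(self, proxy: str, amount: float = 1.0) -> None:
        # Called when a request through `proxy` fails or gets blocked;
        # proxies whose score drops to zero are evicted from the pool.
        score = self.scores.get(proxy)
        if score is None:
            return
        score -= amount
        if score <= 0:
            del self.scores[proxy]
        else:
            self.scores[proxy] = score

pool = ProxyPool()
pool.add("http://1.2.3.4:8080", score=10)
pool.add("http://5.6.7.8:3128", score=5)
print(pool.get())  # returns the higher-scored proxy
```

In the real project the same feedback loop runs across machines: validators and spiders report successes and failures back to Redis, which plays the role of `scores` here.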
## Features
- Distributed crawler architecture powered by Scrapy and Redis
- Automatic discovery and collection of proxy IP resources
- Proxy validation system to ensure availability and reliability
- High availability design for crawler and scheduler components
- Flexible task routing and scheduling for proxy collection jobs
- Support for HTTP, HTTPS, and SOCKS5 proxy protocols
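The validation feature above can be sketched with the stdlib alone. This is a hypothetical first-step check, not HAipproxy's validator: it only tests whether a proxy's port accepts a TCP connection within a deadline, whereas the real validators also issue requests through the proxy and record response times.

```python
import socket
import time

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False

def measure_latency(host: str, port: int, timeout: float = 3.0):
    """Connect time in seconds, or None if the proxy is unreachable."""
    start = time.monotonic()
    if not is_reachable(host, port, timeout):
        return None
    return time.monotonic() - start
```

A full validator would layer protocol-specific checks on top, since an HTTP proxy, an HTTPS-capable proxy, and a SOCKS5 proxy each require a different handshake to confirm they actually relay traffic.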