Summingbird is a hybrid streaming and batch computation framework developed by Twitter. It lets developers express a data aggregation pipeline once and run the same logic either in real time (streaming) or in batch mode, with the results of the two paths merged or reconciled. To do this, Summingbird abstracts over multiple execution engines, such as Storm and Scalding: a single high-level program composes transformations and aggregations, and each engine executes it in its own runtime context. This is particularly useful in analytics and metrics systems, where counters or aggregates must be updated continuously but also recomputed periodically from historical data. Summingbird manages consistency and merging between the real-time and batch paths to avoid double-counting or data loss.
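The idea of writing the logic once and running it in either mode can be illustrated with a small model. The sketch below is plain Python and is not Summingbird's API (Summingbird itself is a Scala library); the names `to_word_counts`, `run_batch`, and `StreamRunner` are hypothetical and exist only for this illustration. It assumes the aggregation is a simple sum of counts per key.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical pipeline logic, written once: sentence -> (word, 1) pairs.
Logic = Callable[[str], Iterator[Tuple[str, int]]]


def to_word_counts(sentence: str) -> Iterator[Tuple[str, int]]:
    for word in sentence.lower().split():
        yield word, 1


def run_batch(logic: Logic, records: Iterable[str]) -> Counter:
    """Batch mode: process the whole dataset, then return the totals."""
    totals: Counter = Counter()
    for record in records:
        for key, value in logic(record):
            totals[key] += value
    return totals


class StreamRunner:
    """Streaming mode: update the totals incrementally as records arrive."""

    def __init__(self, logic: Logic) -> None:
        self.logic = logic
        self.totals: Counter = Counter()

    def on_record(self, record: str) -> None:
        for key, value in self.logic(record):
            self.totals[key] += value


sentences = ["the quick fox", "the lazy dog", "The fox"]

# Same logic, two execution modes.
batch_totals = run_batch(to_word_counts, sentences)

stream = StreamRunner(to_word_counts)
for sentence in sentences:
    stream.on_record(sentence)
```

Because the per-key sum is associative, both runners arrive at the same totals regardless of whether records are processed all at once or one at a time. That property is what makes it reasonable to share one definition between the two modes.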
Features
- Unified API so the same pipeline code can be executed in streaming mode (Storm) or batch mode (Scalding/Hadoop) depending on deployment.
- Fault tolerance and scalability inherited from the underlying platforms (Storm, Scalding) for handling large data volumes.
- Support for real-time updates (streaming sources) and batch historical processing in one system.
- Ability to combine online (streaming) and offline (batch) computation, for use cases such as iterative computation or feedback loops.
- Integration with external stores (e.g. Memcached) for intermediate state and aggregation results.
- Modular architecture, with separate modules for streaming and batch execution as well as examples and clients, and the ability to run locally or in cluster mode.
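The merging of real-time and batch results without double-counting can be modelled as follows. This is a simplified sketch under stated assumptions, not Summingbird's implementation: it assumes time is divided into numbered batches, that the batch path records for each key the first batch it has not yet covered together with the total up to that point, and that the streaming path stores one partial total per key and batch. The function `merge_counts` and both store layouts are hypothetical, and in-memory dictionaries stand in for an external store.

```python
from typing import Dict, Tuple


def merge_counts(
    batch_store: Dict[str, Tuple[int, int]],
    online_store: Dict[Tuple[str, int], int],
    key: str,
    current_batch: int,
) -> int:
    """Return the total for `key` without counting any batch twice.

    batch_store maps key -> (first_uncovered_batch, total of all earlier batches).
    online_store maps (key, batch_id) -> partial total for that batch alone.
    """
    # Start from what the batch path has already computed (nothing, if absent).
    next_batch, total = batch_store.get(key, (0, 0))
    # Add only the streaming partials for batches the batch path has not covered.
    for batch_id in range(next_batch, current_batch + 1):
        total += online_store.get((key, batch_id), 0)
    return total


# The batch path has covered batches 0 and 1 for "fox", with a total of 10.
batch_store = {"fox": (2, 10)}
# The streaming path has partials for batches 1, 2 and 3.
online_store = {("fox", 1): 4, ("fox", 2): 3, ("fox", 3): 1}

merged = merge_counts(batch_store, online_store, "fox", current_batch=3)
```

Here the streaming partial for batch 1 is ignored because the batch total already includes that period, so only batches 2 and 3 are added on top of the batch result. If the batch path has no entry for a key, the merge falls back to the streaming partials alone.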