This is a guest blog post from HPCC Systems. HPCC Systems and Hadoop are both open source projects that run on commodity hardware nodes with local storage, interconnected through IP networks, and both allow for parallel data processing and querying across this architecture. But this is where the similarities end.
HPCC Systems was designed and developed about 14 years ago (1999-2000), under a different paradigm, to provide a comprehensive, consistent, high-level and concise declarative, dataflow-oriented programming model, represented by the ECL language. You can express data workflows and data queries at a very high level, avoiding the complexities of the underlying parallel architecture of the system.
Hadoop has two scripting languages that allow for some abstraction (Pig and Hive), but they don’t compare with the formality, sophistication and maturity of the ECL language, which provides a number of benefits, such as data and code encapsulation, the absence of side effects, flexibility and extensibility through macros, functional macros and functions, and libraries of production-ready high-level algorithms.
One of the limitations of the MapReduce model used by Hadoop is that internode communication is left to the shuffle phase. This makes certain iterative algorithms that require frequent internode data exchange hard to code and slow to execute: they need to go through multiple phases of Map, Shuffle and Reduce, each of which is a barrier operation that forces the serialization of the long tails of execution (every phase must wait for its slowest tasks to finish before the next one can start).
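To make the cost of those repeated phases concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names and the toy graph are assumptions made purely for illustration. It labels connected components by repeatedly propagating the smallest node id to neighbors, and every propagation step has to be a full Map, Shuffle and Reduce round.

```python
from collections import defaultdict

# Toy MapReduce simulation (illustrative only, not the Hadoop API).

def map_phase(records, mapper):
    """Apply the mapper to each record independently and collect (key, value) pairs."""
    pairs = []
    for key, value in records:
        pairs.extend(mapper(key, value))
    return pairs

def shuffle_phase(pairs):
    """Group values by key; in this model it is the only place data moves between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key group; starts only after the shuffle has finished."""
    return [reducer(key, values) for key, values in sorted(groups.items())]

def make_mapper(graph):
    def mapper(node, label):
        yield (node, label)            # keep the node's current label
        for neighbor in graph[node]:   # offer the label to every neighbor
            yield (neighbor, label)
    return mapper

def min_reducer(node, labels):
    return (node, min(labels))

def connected_components(graph):
    """Iterate full Map/Shuffle/Reduce rounds until no label changes."""
    labels = [(node, node) for node in sorted(graph)]
    rounds = 0
    while True:
        rounds += 1
        pairs = map_phase(labels, make_mapper(graph))
        new_labels = reduce_phase(shuffle_phase(pairs), min_reducer)
        if new_labels == labels:
            return dict(labels), rounds
        labels = new_labels

# Hypothetical graph: a chain 0-1-2-3-4 and a separate pair 5-6.
GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5]}

if __name__ == "__main__":
    components, rounds = connected_components(GRAPH)
    print(components, rounds)
```

On this seven-node graph the loop needs five complete rounds (four to push label 0 down the chain, one more to detect that nothing changed). On a real cluster each of those rounds would be a separate job with its own barrier, which is the overhead described above.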
The HPCC Systems platform provides direct internode communication, which many of the high-level ECL primitives leverage. Another Hadoop disadvantage is the use of Java as the programming language for the entire platform, including the HDFS distributed file system, which adds overhead from the JVM. In contrast, HPCC and ECL are compiled into C++, which executes natively on top of the operating system, leading to more predictable latencies and overall faster execution (the HPCC Systems platform performs anywhere between 3 and 10 times faster than Hadoop on the same hardware).
The HPCC Systems platform consists of two components: a back-end, batch-oriented data workflow processing and analytics system called Thor (a data refinery engine equivalent to Hadoop MapReduce), and a front-end, real-time data querying and analytics system called Roxie (a data delivery engine which has no equivalent in the Hadoop world). Roxie allows for real-time delivery and analytics of data through parameterized ECL queries (think of them as equivalent to stored procedures in your traditional RDBMS). The closest thing to Roxie in the Hadoop ecosystem is HBase, which is a strict key/value store and, thus, provides only for very rudimentary retrieval of values by exact or partial key matching. Roxie allows for compound keys, dynamic indices, smart stepping of these indices, aggregation and filtering, and complex calculations and processing.
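The difference between the two retrieval styles can be sketched in a few lines of Python. This is a toy illustration only: it uses neither the Roxie nor the HBase API, and the class names, record layout and sample data are all hypothetical. The first class supports only exact and prefix key matching, while the second answers a parameterized query over a compound key with filtering and aggregation.

```python
from bisect import bisect_left

# Hypothetical records: (state, city, last_name, balance)
RECORDS = [
    ("FL", "Miami", "Diaz", 120.0),
    ("FL", "Miami", "Smith", 80.0),
    ("FL", "Tampa", "Jones", 200.0),
    ("GA", "Atlanta", "Smith", 50.0),
    ("GA", "Atlanta", "Lee", 75.0),
]

class KeyValueStore:
    """Strict key/value access: exact match or key-prefix scan, nothing else."""

    def __init__(self, records):
        self._data = {"/".join(r[:3]): r for r in records}

    def get(self, key):
        return self._data.get(key)

    def scan_prefix(self, prefix):
        return [v for k, v in sorted(self._data.items()) if k.startswith(prefix)]

class CompoundIndex:
    """Sorted index on (state, city, last_name) queried with parameters."""

    def __init__(self, records):
        self._rows = sorted(records)
        self._keys = [r[:3] for r in self._rows]

    def query(self, state, city=None, min_balance=0.0):
        """Seek to the leading key fields, filter on balance, then aggregate."""
        prefix = (state,) if city is None else (state, city)
        start = bisect_left(self._keys, prefix)   # jump to the first candidate row
        end = start
        while end < len(self._keys) and self._keys[end][:len(prefix)] == prefix:
            end += 1
        matches = [r for r in self._rows[start:end] if r[3] >= min_balance]
        return {"count": len(matches), "total_balance": sum(r[3] for r in matches)}

if __name__ == "__main__":
    store = KeyValueStore(RECORDS)
    index = CompoundIndex(RECORDS)
    print(store.get("FL/Miami/Diaz"))
    print(store.scan_prefix("FL/Miami"))
    print(index.query("FL", min_balance=100.0))
    print(index.query("GA", city="Atlanta"))
```

With the key/value store, any filtering or summing has to happen in the client after the values come back. With the index, the caller passes parameters and receives the computed answer, which is closer in spirit to the stored-procedure analogy above.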
Moreover, the HPCC Systems platform presents users with a homogeneous platform that is production ready and has been proven over many years in our own data services, from a company that has been in the Big Data Analytics business since before Big Data was called Big Data.