If you’re looking into Hadoop you might be interested in HPCC Systems

This is a guest blog post from HPCC Systems. HPCC Systems and Hadoop are both open source projects that leverage commodity hardware nodes with local storage, interconnected through IP networks, to process and query data in parallel across that architecture. But that is where the similarities end.

HPCC Systems was designed and developed about 14 years ago (1999-2000) under a different paradigm: a comprehensive, consistent, high-level, and concise declarative dataflow-oriented programming model, embodied in the ECL language. ECL lets you express data workflows and queries at a very high level, hiding the complexities of the underlying parallel architecture of the system.
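
To give a flavor of that model, here is a minimal sketch of an ECL dataflow (the record layout and logical file name are purely illustrative): you declare what the result should be, and the platform decides how to execute it in parallel across the cluster.

    // Minimal ECL dataflow sketch -- record layout and file name are hypothetical
    PersonRec := RECORD
        STRING30  lastname;
        STRING30  firstname;
        UNSIGNED1 age;
    END;

    people := DATASET('~demo::people', PersonRec, THOR); // a logical file on the cluster

    adults := people(age >= 18);                  // declarative filter
    byName := SORT(adults, lastname, firstname);  // global sort across all nodes

    OUTPUT(byName);                               // the action that drives execution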

Hadoop has two scripting languages that provide some abstraction (Pig and Hive), but they do not compare with the formal rigor, sophistication, and maturity of the ECL language, which offers benefits such as data and code encapsulation, the absence of side effects, flexibility and extensibility through macros, functional macros, and functions, and libraries of production-ready, high-level algorithms.
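
As a small illustration of that reuse model (the names below are our own, not part of any standard library), a side-effect-free function and a compile-time macro might look like this:

    // A reusable, side-effect-free function
    FullName(STRING first, STRING last) := TRIM(first) + ' ' + TRIM(last);

    // A macro: expanded at compile time against whatever dataset and field you pass in
    CountByField(ds, fld) := MACRO
        OUTPUT(TABLE(ds, {fld, UNSIGNED8 cnt := COUNT(GROUP)}, fld));
    ENDMACRO;

    // Usage: CountByField(people, lastname); yields a crosstab of counts per last name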

One of the limitations of the MapReduce model used by Hadoop is that internode communication is confined to the shuffle phase. This makes iterative algorithms that require frequent internode data exchange hard to code and slow to execute: they must pass through multiple Map, Shuffle, and Reduce phases, each a barrier operation that forces serialization of the long tails of execution.

The HPCC Systems platform provides direct internode communication, leveraged by many of the high-level ECL primitives. Another Hadoop disadvantage is the use of Java as the programming language for the entire platform, including the HDFS distributed file system, which adds JVM overhead. In contrast, the HPCC Systems platform is written in C++ and ECL is compiled into C++, which executes natively on top of the operating system, yielding more predictable latencies and faster overall execution (on the same hardware, the HPCC Systems platform performs anywhere between 3 and 10 times faster than Hadoop).
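
For example (a sketch only, with an invented record layout and file name), ECL's DISTRIBUTE primitive lets you place related records on the same node explicitly, so that a subsequent LOCAL aggregation needs no further internode traffic:

    TxRec := RECORD
        UNSIGNED4   customer_id;
        DECIMAL10_2 amount;
    END;

    txs := DATASET('~demo::transactions', TxRec, THOR);

    // Hash-distribute so all of a customer's rows land on the same node
    byCustomer := DISTRIBUTE(txs, HASH32(customer_id));

    // LOCAL: each node aggregates its own partition, with no shuffle-style barrier
    totals := TABLE(byCustomer,
                    {customer_id, DECIMAL18_2 total := SUM(GROUP, amount)},
                    customer_id, LOCAL);

    OUTPUT(totals);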

The HPCC Systems platform comprises two components: a back-end, batch-oriented data workflow processing and analytics system called Thor (a data refinery engine roughly equivalent to Hadoop MapReduce), and a front-end, real-time data querying and analytics system called Roxie (a data delivery engine with no equivalent in the Hadoop world). Roxie allows for real-time delivery and analytics of data through parameterized ECL queries (think of them as the equivalent of stored procedures in a traditional RDBMS). The closest thing to Roxie in the Hadoop ecosystem is HBase, which is a strict key/value store and thus provides only very rudimentary retrieval of values by exact or partial key matching. Roxie allows for compound keys, dynamic indices, smart stepping of those indices, aggregation and filtering, and complex calculations and processing.
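
To make the stored-procedure analogy concrete, a parameterized query of the kind published to Roxie might look like the sketch below (the record layout, file name, and parameter name are illustrative); the STORED workflow service turns an ECL definition into a query parameter supplied at call time:

    PersonRec := RECORD
        STRING30 lastname;
        STRING30 firstname;
        STRING20 city;
    END;

    people := DATASET('~demo::people', PersonRec, THOR);

    // The published query receives this value when it is called
    STRING30 searchLast := '' : STORED('LastName');

    OUTPUT(people(lastname = searchLast));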

Moreover, the HPCC Systems platform presents users with a homogeneous, production-ready platform that has been proven over many years in our own data services, from a company that was in the Big Data analytics business before Big Data was called Big Data.

1 comment
DSC

ECL is one of the most interesting components of HPCC.  It's a language designed around the manipulation of sets of records, much like SQL.  In fact, developers who use SQL should feel right at home with the concepts behind ECL.  ECL, however, provides a much richer set of functionality built right into the language to more easily support transformations, simple analytics, and a variety of other big data-oriented functions.  Automatic support for distributed data is included as well:  The same ECL code executes on a single node just as well as on a thousand nodes.
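
For instance, the rough ECL equivalent of a SQL join plus GROUP BY looks like this (layouts and file names are invented for the example):

    CustRec  := RECORD UNSIGNED4 id; STRING30 name; END;
    OrderRec := RECORD UNSIGNED4 cust_id; DECIMAL10_2 amount; END;

    customers := DATASET('~demo::customers', CustRec, THOR);
    orders    := DATASET('~demo::orders', OrderRec, THOR);

    // SELECT c.name, SUM(o.amount) FROM customers c JOIN orders o ON c.id = o.cust_id GROUP BY c.name
    joined := JOIN(customers, orders, LEFT.id = RIGHT.cust_id,
                   TRANSFORM({STRING30 name, DECIMAL10_2 amount},
                             SELF.name := LEFT.name, SELF.amount := RIGHT.amount));

    OUTPUT(TABLE(joined, {name, DECIMAL18_2 total := SUM(GROUP, amount)}, name));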


One of the more interesting aspects of ECL is the compiler.  ECL is a declarative language, so you really end up describing what you want to do with your data rather than how to do it, and the compiler translates that to C++ for compilation to machine code.  Included in the translation is a very cool optimization pass:  The compiler can (among other things) dead-strip unused fields, reorder operations, or even substitute functionality in order to achieve the desired result in the most efficient manner, all automatically.  The optimizations are based not only on distributed data processing theory but also on solid, we-have-actually-seen-this-in-the-real-world experience acquired over the last decade or more.  That is one of the reasons HPCC is faster than the competition in many cases, with less development effort expended.
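
A tiny sketch of what that means in practice (names invented): definitions only describe data, and only the definitions reachable from an action end up in the generated C++, so unused branches and unused fields simply vanish.

    people := DATASET('~demo::people', {STRING30 lastname, UNSIGNED1 age}, THOR);

    seniors  := people(age >= 65);   // a definition, not a command
    children := people(age < 13);    // never referenced by an action below...

    OUTPUT(COUNT(seniors));          // ...so only the work this action needs is generated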


ECL is a declarative language that evolved to describe data set manipulations easily.  There are, however, some problems that are difficult to solve in such a language.  Fortunately, ECL can be extended:  Code written in C++, Java, R, Python, or JavaScript can be embedded right into the ECL code.  More complex code (like that using third-party products or libraries) can be linked into the final executable and called from ECL.  This extensibility provides the ability not only to solve those complex data transformation problems more easily, but also to integrate with existing libraries and services.
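
A quick example of the embedding, assuming the Python embed plugin is installed (the function itself is just illustrative):

    IMPORT Python;

    // A scalar helper written in Python, callable like any other ECL definition
    REAL8 Sigmoid(REAL8 x) := EMBED(Python)
        import math
        return 1.0 / (1.0 + math.exp(-x))
    ENDEMBED;

    OUTPUT(Sigmoid(0.0));  // 0.5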


Lastly:  Because ECL is designed around data set manipulation, solutions coded in ECL are typically much smaller than solutions coded in an imperative language like Java.  A consequence is that the number of people required to implement an ECL-based solution is also much smaller.  Fewer people on the team, with commensurately fewer communication channels, plus a language tailored to data manipulation, naturally produces a project with much higher velocity.  Simply put, you can get to market faster.


Disclaimer:  I used to write tightly-coded, high-performance, multi-threaded distributed data systems.  I don't feel the need to do that any more.  All of those projects are now centered around HPCC instead.