Cascading 3.0 Adds Multiple Framework Support. Concurrent Driven Manages Big Data Apps

Boris Lublinsky, InfoQ
May 13, 2014
http://www.infoq.com/news/2014/05/driven

When it comes to implementing Big Data applications, companies today can choose from multiple frameworks ranging from Apache MapReduce to Apache Tez to Apache Spark to Apache Storm. Each one of these frameworks has its own advantages and drawbacks, and can be most appropriate for certain applications. Although it is possible to run all frameworks on a single Apache YARN cluster, each one of them has a slightly different programming model and a different set of APIs. This means that porting a given application from one framework to another might prove to be non-trivial.

Cascading 3.0, a new release of Concurrent’s flagship product, solves many of these issues. Cascading is one of the most popular Java Domain Specific Languages (DSL) initially introduced in late 2007 as a DSL to define and implement functional programming for large scale data workflows on top of the low-level MapReduce APIs. Cascading is based on a “plumbing” metaphor allowing to assemble data pipelines: supported high-level constructs allow it to split, merge, and join streams of data, and to perform operations on the streams.

Such an approach allows users to represent their Cascading applications as a directed acyclic graph (DAG), which is mapped by Cascading’s planner to the underlying framework, originally MapReduce.

Cascading 3.0 goes beyond MapReduce by allowing enterprise developers to build their data applications once, then run those applications on the framework that best meets their business needs. Cascading 3.0 will initially ship with support for: local in-memory, Apache MapReduce (support for both Hadoop 1 and 2 are provided), and Apache Tez. Soon thereafter, with community support, Apache Spark™, Apache Storm and others will be supported through its new pluggable and customizable planner.

A new planner introduced in Cascading 3.0 allows users to create rules that assert correctness of the graph and annotate the graph nodes with meta-data used at runtime based on local topology. The planner also allows for transformation of the graph in order to balance, insert, remove, or reorder nodes. It also partitions the graph to find recursively smaller sub-graphs that map to compute units (like a Map or a Reduce node, or a Tez process).

Once the appropriate compute units are defined, Cascading builds the execution configuration plan which leverages a framework specific jar (and Maven POM) that insulates all the framework’s APIs. Both the jar and POM are provided by Cascading.

The open pluggable architecture implemented by Cascading 3.0 provides easy extensibility of the product to support additional frameworks. This can be done by implementing a new set of rules for a given framework and framework-specific jar and POMs.

In addition to open source Cascading 3.0, Concurrent also recently announced its commercial product Driven, which provides real-time monitoring, operational control and performance management for Cascading applications. Driven provides a set of screens to support the following features:

Understand – Seeing a data app executing in real time and visually drilling down into each unit of work.
Diagnose – Quickly identifying failed (including failure reasons) and poorly performing applications.
Optimize – Visually breaking down vital application metrics to spot performance issues and anomalies.
Track – Viewing and comparing the history of application’s run-time performance.

The new products released by Concurrent will ease application migration to new computation frameworks like Apache Tez and other best of breed technologies. This allows enterprises to standardize on a single API to meet business challenges and solve a variety of business problems and introduce new more suitable Big Data frameworks without massive application rewrites. Driven provides more operational visibility of new and existing Big Data applications, from development to production.