Big Data: Interacting with Hadoop 2 Data Using Standard ANSI SQL

Dick Weisinger, formtek
November 19, 2013
http://formtek.com/blog/big-data-interacting-with-hadoop-2-data-using-standard-ansi-sql/

Cascading is an Apache-licensed application framework for building rich data processing and machine learning applications that run on Hadoop. Cascading applications are built with a simple API that can be called from any JVM-based language. For the past five years, the framework has been developed and commercially supported by Concurrent, Inc. A Cascading application first captures data from ‘sources’, then passes that data through ‘pipes’ where it is processed, and finally pushes the results into output files or ‘sinks’. This flow of data is known as the ‘source-pipe-sink’ paradigm. Cascading is an abstraction layer above MapReduce: Cascading applications ultimately execute as MapReduce jobs, but the developer writes no explicit MapReduce code.
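The source-pipe-sink idea can be illustrated with a small sketch. This is not the Cascading API, which is called from JVM languages. It is a plain-Python analogy, and every name in it (`source`, `tokenize`, `count`, `run_flow`) is hypothetical. It shows only how records move from a source, through a chain of pipes, into a sink, with no mention of the underlying execution engine.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List

# Hypothetical names throughout; this is an analogy, not the Cascading API.
Pipe = Callable[[Iterable], Iterator]


def source(lines: Iterable[str]) -> Iterator[str]:
    """Source: emit raw records one at a time."""
    for line in lines:
        yield line


def tokenize(records: Iterable[str]) -> Iterator[str]:
    """Pipe: split each record into lowercase words."""
    for record in records:
        for word in record.lower().split():
            yield word


def count(words: Iterable[str]) -> Iterator[tuple]:
    """Pipe: group identical words and emit (word, count) pairs in sorted order."""
    for word, n in sorted(Counter(words).items()):
        yield (word, n)


def run_flow(src: Iterable, pipes: List[Pipe], sink: list) -> list:
    """Connect source -> pipes -> sink, then run the whole flow."""
    stream = src
    for pipe in pipes:
        stream = pipe(stream)
    sink.extend(stream)  # the sink collects the processed records
    return sink


results = run_flow(source(["big data", "Big Hadoop data"]), [tokenize, count], [])
print(results)  # [('big', 2), ('data', 2), ('hadoop', 1)]
```

The flow is described declaratively as a list of pipes, and `run_flow` decides how to execute it. In Cascading, that execution step is where the MapReduce jobs would be planned and run.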

Chris Wensel, Founder and CTO of Concurrent, said that “building applications on Hadoop, despite its growing adoption in the enterprise, is notoriously difficult. We are driving the future of application development and management on Hadoop, by allowing enterprises to quickly extract meaningful information from large amounts of distributed data and better understand the business implications. We make it easy for developers to build powerful data processing applications for Hadoop, without requiring months spent learning about the intricacies of MapReduce.”

Users download Cascading more than 110,000 times every month. It is used by businesses such as Twitter, eBay, The Climate Corporation, Square and Etsy to handle some or all of their Big Data requirements. In fact, all of Twitter’s revenue-generating applications have been built with Cascading.

Today, Cascading 2.5 is being introduced, a jump in version number from the previously available 2.2 point release. The jump was intended to emphasize the significance of some of the new features in the release. Most significantly, the 2.5 release includes support for Hadoop 2 and YARN. Other highlights of Cascading 2.5 include:

Performance improvements for complex join operations, and optimizations to dynamically partition and store processed data more efficiently on HDFS.
Broad compatibility with Hadoop vendors and Hadoop-as-a-service providers, including Cloudera, Hortonworks, MapR, Intel, Altiscale, Qubole and Amazon EMR.

Coinciding with the release of Cascading 2.5, Concurrent is also making another product, Cascading Lingual, generally available. Lingual is an add-on to Cascading that provides a complete ANSI SQL interface for interacting with Hadoop data. Compatibility with standard SQL means that SQL written for traditional relational databases can be brought over and used as-is within Lingual. Like Cascading, Lingual is released under the Apache 2.0 license.
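To make “used as-is” concrete, here is a sketch of the kind of query that relies only on standard SQL constructs (`JOIN`, `GROUP BY`, `COUNT`, `SUM`, `ORDER BY`) and so should port between engines without rewriting. Lingual itself is reached over JDBC from the JVM. Python’s built-in `sqlite3` module is used here purely as a stand-in engine, and the tables and rows are hypothetical.

```python
import sqlite3

# Stand-in engine only: Lingual is accessed over JDBC, not through sqlite3.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, created just for this illustration.
conn.executescript("""
    CREATE TABLE employees (name VARCHAR(50), dept_id INTEGER, salary INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name VARCHAR(50));
    INSERT INTO employees VALUES ('Ann', 1, 100);
    INSERT INTO employees VALUES ('Bob', 1, 80);
    INSERT INTO employees VALUES ('Cy', 2, 90);
    INSERT INTO departments VALUES (1, 'Sales');
    INSERT INTO departments VALUES (2, 'Ops');
""")

# The query uses only standard constructs, with no vendor-specific syntax.
PORTABLE_QUERY = """
    SELECT d.dept_name, COUNT(*) AS headcount, SUM(e.salary) AS total_salary
    FROM employees e
    JOIN departments d ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
"""

rows = conn.execute(PORTABLE_QUERY).fetchall()
print(rows)  # [('Ops', 1, 90), ('Sales', 2, 180)]
conn.close()
```

Only the connection line is specific to the engine. The query text is the part that would carry over unchanged.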

Concurrent described the benefits of the Lingual product by saying that “Cascading Lingual provides out-of-the-box support for JDBC. Enterprises that have invested millions of dollars in business intelligence (BI) tools, such as Pentaho, Jaspersoft and Cognos, and training can now also access their data on Hadoop through standard SQL interface.”

André Kelpe, software engineer for Concurrent, summarized three of the design goals for the Cascading Lingual product:

Enable immediate ANSI SQL query access to data
Simplified system and data integration, with reads and writes against HDFS, JDBC, Memcached, HBase, and Redshift
Simplified migration of existing SQL within Cascading

An online demo of Cascading Lingual is available on YouTube.