Cascading Now Supports Tez; Spark and Storm Up Next

Alex Woodie, Datanami
May 13, 2014
http://www.datanami.com/2014/05/13/cascading-now-supports-tez-spark-storm-next

Concurrent, the company behind the open source Cascading framework, today unveiled a major update that will allow its customers to migrate their Hadoop applications from MapReduce to the new Apache Tez engine without rewriting any business logic. Spark and Storm are next up on Cascading’s radar, Concurrent CTO Chris Wensel tells Datanami.

Analysts have billed 2014 as the year that Hadoop grows up and takes on the enterprise. Anecdotal evidence suggests that big companies are indeed moving away from the tire-kicking phase and investing in production systems.

However, while Hadoop may have traded in its open source apparel for a suit and tie, that doesn’t mean that all technology questions have been settled. Everybody in the Hadoop world seems to agree that the batch-oriented nature of MapReduce is on its way out. But what’s going to replace it? Apache Tez? Apache Spark? Apache Storm? Apache next? Nobody knows.

“You gotta pick your poison,” Wensel says. “A lot of those technologies overlap with each other. But there are also tradeoffs. And this game is all about tradeoffs.”

With Cascading, Concurrent is ideally situated to help customers minimize the risk of making the wrong tradeoff. Cascading does this by presenting a layer of abstraction between the application developer and the complex Hadoop APIs. The product, which is free and is downloaded 150,000 times per month, allows developers to write their business logic once using the simpler Cascading APIs (available for Java, Python, Scala, and other languages) and then deploy the application on whichever Hadoop data fabric meets their needs.

For the last six months, Wensel has been working on the heart of Cascading, the customizable query planner, to enable it to support Apache Tez. It was a very big job, and Wensel did much of this work in collaboration with Hortonworks, which is particularly bullish on the prospects of Tez as a replacement for MapReduce. The result of that collaboration is now available in Cascading 3.0.

According to Wensel, it’s all about giving customers the flexibility to pick the Hadoop fabric that best fits their needs. “We’re seeing Tez and other technologies that are slightly more complex [than MapReduce], but give you more degrees of freedom to do more interesting things at the computation level,” he says.

Tez represents a “massive improvement” in the Hadoop model, and Wensel is excited to see how users will respond to support for Tez, which should provide an immediate performance boost upwards of 50 percent compared to MapReduce.

What’s more, Cascading will also let users dial the performance up even higher if they want, at the cost of a greater risk of the code breaking. “We can give you a conservative rule engine, for Tez, but as Tez matures, we can give you a more aggressive rule engine,” Wensel says. “If you want to turn it up to 11, go for it, but it might blow out your speakers.”

Next up on Wensel’s plate are Apache Spark and Storm. The company today also announced a partnership with Databricks, the company behind Apache Spark. Wensel and company will set out this summer to enable Cascading apps to utilize Spark within Hadoop. Some of the work he did on supporting Tez will carry over to Spark, or at least make it somewhat easier to support, he says.

While Spark seems to have a lot of momentum at the moment, Wensel still sees a bit of risk in it and doesn’t seem entirely sold. “People want to try Spark. We get it,” he says. “People want us to port Cascading to Spark so they can see if it’s better. I don’t know anybody in production with Spark, but I don’t know anybody in production on Tez either.” The timeframe? “We’re definitely going to get to that as quickly as we can,” he says. “We hope to get to it this summer.”

The way Wensel sees it, nobody can predict what technology is going to win in the end. It could be Tez, or it could be Apache Spark. “What you don’t want is the risk of learning a new API or a language on a new API just to get the tradeoff to realize the tradeoff was a bad one,” he says. “What did you do? Spend six months figuring out that was a huge mistake.”

It’s all about weighing the tradeoffs and letting people experiment with the various Hadoop fabrics to find out what works best for them and avoid those million-dollar mistakes. By allowing people to experiment with Tez, Spark, and MapReduce, Cascading will let developers make apples-to-apples comparisons among the various “Baby Bear, Mama Bear, and Papa Bear technologies,” as the colorful Wensel puts it.

“If Spark doesn’t scale, then they’ll go to Tez. But Tez might be slower,” Wensel says. “If they have a smaller application that doesn’t [need to] scale, maybe they could leave that on Spark. But they can make these decisions without having to rewrite their applications. If they’re okay with 2 percent of their jobs failing, then maybe they’ll pick a different technology that’s faster, but maybe it will fail more frequently. If they never can have it ever fail and they just need predictability, they may stick with [Papa Bear] MapReduce because it’s extremely stable and mature. People want to be able to make these choices. They don’t want just one technology.” Amen to that.