How to Escape the Dark Valley of Your Hadoop Journey

It happens to the best of us. You know your business is bursting with useful data, and you’ve only begun to scratch the surface. So you strike out to build an analytical platform, using all the great open-source tools you’ve been hearing so much about. First, you have to capture all the data that’s coming in, before you even know what you’re going to do with it. So you build a butterfly net to catch it all, using Hadoop. But as soon as the net is cast, everything goes dark. You know the data’s there, but you can’t get at it, or if you can, it comes out in unusable formats. Your current systems won’t talk to it, and you don’t have a staff of programming PhDs or the budget to buy a United Nations’ worth of translators or hire an army of consultants. A chill runs down the back of your neck. What have you done? You’ve entered the Dark Valley of Hadoop. That’s the bad news. The good news is that you’re not alone, and there’s a way out.

Warning Signs That You’re in the Dark Valley

Many data-rich companies fall into the Dark Valley for a time. You have the data, but you’re not getting the value that you expect from it. You have problems testing and deploying the applications that are supposed to extract that value. You have difficulty translating business requirements into code that will turn the great Leviathan that is the Hadoop Distributed File System into something halfway manageable. The project for which all of this effort was intended is delayed for months, with cost overruns making stakeholders nervous. Once you finally get the chance to test, you don’t get the results you were expecting. More delays ensue.

One of the cruelest tricks of the Dark Valley is the illusion that you got it right on the first try. Agile design philosophy tells us to work on small projects and test them as quickly as possible, then iterate forward. But Hadoop tends to reveal more of its manageability weaknesses the deeper you get into the adoption cycle. If you’re using tools made for programmers, such as Pig and Hive, you’re making a bet that the programmer who built that first iteration is going to be around for the second. In today’s competitive marketplace, there is no guarantee of that. Then there is the fact that MapReduce, Hadoop’s native programming model, is already on its second version, with a third, entirely new computation engine, built from the ground up, rapidly on its way. In the Hadoop ecosystem, many low-level moving parts have the nasty habit of changing every 90 to 120 days. All these moving parts mean that you have to keep up with numerous release cycles, which takes your focus off the business at hand.

So if the project mantra is “stand up, rinse and repeat,” it’s the “repeat” part that proves challenging. You find you need to expand your team, the number of systems and the scope of your project. The farther along the path you travel from “build” to “deploy and operate,” the wider the gap between programming tools and enterprise application tools becomes. The open-source tools for Hadoop were simply never intended for the mainstream data-driven enterprise. The result is a skills bottleneck, requirements confusion and maintenance complexity: a lot of bumping around in the dark.

You aren’t alone. Some of the largest and most successful data-driven companies are having or have had similar frustrations. LinkedIn, Twitter and BlueKai all started out with native interfaces and ended up with mountains of unmaintainable code. They fought their way out by investing significant time and money in better, more usable, more sustainable technologies that increased the productivity of their brainy staff. The good news is that you can learn from their experience and avoid the Dark Valley entirely.

Escape From the Dark Valley

There is a light at the end of the tunnel, but you have to know how to find it. The key lies in knowing your options, which usually means leveraging the skill sets you already have in-house so that your big data strategy can continue up into the light.

The development methodologies you have in place were established for a reason. Because Hadoop wasn’t designed for the enterprise developer, the first mistake many enterprises make is to invert their planning processes around Hadoop. The best way to protect your existing methodologies is to avoid reconfiguring them in order to use MapReduce. In other words, is it possible to find an easier way for your experienced enterprise developers to use Hadoop instead of hiring MapReduce programmers and attempting to teach them about your business?

Indeed, Hadoop can be tamed through application programming interfaces (APIs) or domain-specific languages (DSLs) that encapsulate and hide MapReduce so that your developers don’t have to master it. For example, models built in a tool such as MicroStrategy, SAS, R or SPSS can be consumed by a DSL and run on Hadoop without anyone needing to write Hive, Pig or MapReduce code. Enterprises can leverage their existing investments in Java, SQL and modeling tools to quickly start parsing Hadoop datasets without the need to learn another language.
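
To make the idea of “encapsulate and hide” concrete, here is a minimal, purely illustrative sketch in Python. It assumes nothing about any real product: the names run_map_reduce, Pipeline, split_with and count are hypothetical, and the code runs in memory rather than on Hadoop. The point is only the division of labor, with the map, shuffle and reduce plumbing written once and the application developer expressing business logic through a small fluent interface.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Toy in-memory stand-in for the low-level engine: map, shuffle, reduce."""
    grouped = defaultdict(list)
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect values that share a key.
            grouped[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}


class Pipeline:
    """Hypothetical fluent facade; not the API of Cascading or any real library."""

    def __init__(self, records):
        self._records = records
        self._split = lambda record: [record]  # default: one key per record

    def split_with(self, splitter):
        """Describe how a record breaks into keys (business logic only)."""
        self._split = splitter
        return self

    def count(self):
        """Count occurrences of each key; the map/reduce plumbing stays hidden."""
        return run_map_reduce(
            self._records,
            mapper=lambda record: ((key, 1) for key in self._split(record)),
            reducer=lambda key, values: sum(values),
        )


# The application developer never mentions mappers or reducers.
lines = ["big data big value", "data in the dark"]
counts = Pipeline(lines).split_with(str.split).count()
print(counts)
```

In this sketch, a change to the underlying engine would touch only run_map_reduce, while the one-line pipeline that carries the business logic would stay the same. That separation is the property the paragraph above attributes to APIs and DSLs.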

Here are some domain-specific languages that leverage existing development skills:

Java: Cascading is a widely used Java development framework that enables developers to quickly build workflows on Hadoop.
SQL: Cascading Lingual is an ANSI SQL DSL. This means that developers can leverage existing ETL and SQL and migrate them to Hadoop. The deeper benefit is that applications designed for SQL can access data on Hadoop without modification using a standard database driver.
MicroStrategy, SAS, R and SPSS: Cascading Pattern is a scoring engine for modeling applications. In a matter of hours, developers can test and score predictive models on Hadoop.

Newer languages like Scala and Clojure are popular with many developers these days (Clojure is especially popular with data scientists). DSLs for these languages also simplify development on Hadoop.

These APIs abstract the complexity of Hadoop, enabling enterprise developers to spend their time on business logic and creating big data applications instead of getting stuck in the maze of the Hadoop open-source projects. The best APIs let you corral the resources you already know how to use (relational databases, ERP systems and visualization tools) and use them in conjunction with Hadoop.

Conclusion

The power of big data has been established, but our understanding of how to exploit it in the most productive way is still maturing. The initial toolset that came with Hadoop didn’t anticipate the kinds of enterprise applications and powerful analyses that businesses would want to build on it. Thus, many have fallen into the Dark Valley. But a new breed of middleware (APIs and DSLs) has arrived. These tools keep track of all the variables and peculiarities of Hadoop, abstract them away from development, and offer better reliability, sustainability and operational characteristics so that enterprises can find their way back out into the light.

Nearly 20 years ago, the Web set the stage for existing enterprises and emerging companies to change and to create innovative businesses. Similarly, big data and the business opportunity that it offers are driving enterprises to extract valuable insights about their business and, in many cases, to create significant additional monetization opportunities for their existing products and services. Enterprises are transitioning from big data as a project to being in the “business of data.” If that’s not a bright light at the end of the tunnel, what is?

* * *

About Gary Nakamura
Gary Nakamura is Chief Executive Officer of Concurrent, Inc. Gary has a highly successful track record including significant contributions to the explosive growth of Terracotta, where he was SVP and General Manager.

This article was originally published in AllThingsD