Mistaking a Data Library for a Data Lake: 7 best practices for developing your Hadoop data strategy

Originally published in InsideBigData on October 2, 2015, by Daniel Gutierrez

In this special guest feature, Supreet Oberoi of Driven, Inc. (formerly Concurrent) talks about how companies should change their perspective on their data strategies and look at the process as building a data library as opposed to a data lake.

Supreet is the vice president of field engineering at Concurrent, Inc. Prior to that, he was director of Big Data application infrastructure for American Express, where he led the development of use cases for fraud, operational risk, marketing and privacy on Big Data platforms. He holds multiple patents in data engineering and has held leadership positions at Real-Time Innovations, Oracle, and Microsoft.

In my field engagements with various enterprises, I often reflect on the waste and inefficiencies that can result from their Hadoop projects. Some are naïve with their data strategies and dogmatic in their choice of tools, expending precious resources trying to turn the impossible into a reality. One of the most dangerous and common strategies companies implement is the decision to build a “data lake.” Most find that once they’ve pumped vital business data into their data lake, it is more comparable to an unknown, turbid sea.

It should be obvious that this is the wrong approach for a data strategy. A lake – whether data or an actual body of water – is not designed for efficient or sustainable exploitation of its natural resources. Moreover, it is tough to govern and control the consumption of these resources, or to develop strategies to nurture them for sustainability.

What I suggest in my engagements with enterprises is that they shift their perspective. Instead of a data lake, they should build a data library, a term originally coined by my colleague Andre Kelpe, senior software engineer at Concurrent, Inc. The analogy is certainly more relevant, and its implementation can be approached both practically and with demonstrable reliability. Here are seven data library best practices that I recommend to companies building their data strategies:

Have a librarian: A librarian controls procurement, subscription and consumption/checkout processes. In other words, not only is there a set of governance protocols for running the library efficiently, but there is also a dedicated person responsible for defining and enforcing them. In the digital world, the librarian's work need not be an entirely manual process, or even be done by a human – use tools where possible to carry out the role of the librarian.

Build a data catalog: A data catalog establishes a taxonomy for the efficient and manageable organization, retention and consumption of your data assets. It is important to understand that a data catalog cannot be developed after procuring data assets. In fact, it is all too common with data lakes that companies ultimately waste time trying to organize and apply rules after the fact, because they architected their infrastructure for failure. It is equally imperative NOT to manage a data catalog as a manual process, especially in the Hadoop environment, where thousands of new variables (the “books” of the data library) are being created every day by different job runs.

Develop protocols for subscribing to content: To establish a data library that promotes reuse of its data, you need to develop protocols to define and enforce policies for accessing the data. For example, data related to race and gender may be permissible to improve customer service for a credit card company, but it is not permissible for use in making a lending decision. To ensure compliance with such regulatory policies, it is not enough to define them; it is equally vital to provide artifacts for auditing purposes that prove that no such data was accessed in a manner incompatible with the policies.
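
As a rough illustration of the idea, a subscription protocol can be thought of as a purpose-based check that writes an audit record for every request, whether it is granted or refused. The sketch below is only that, a sketch: the policy table, the category and purpose names, and the DataLibrary class are all hypothetical and do not come from any particular governance product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical policy table: data category -> purposes allowed to read it.
POLICIES = {
    "demographics": {"customer_service"},                      # e.g. race, gender
    "transactions": {"customer_service", "lending_decision"},
}


@dataclass
class DataLibrary:
    # Every decision is appended here so auditors can inspect it later.
    audit_log: list = field(default_factory=list)

    def request_access(self, user: str, category: str, purpose: str) -> bool:
        """Decide a request and record the decision, allowed or not."""
        # Unknown categories have no allowed purposes, so they are refused.
        allowed = purpose in POLICIES.get(category, set())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "category": category,
            "purpose": purpose,
            "allowed": allowed,
        })
        return allowed


library = DataLibrary()
granted = library.request_access("analyst_a", "demographics", "customer_service")
refused = library.request_access("analyst_b", "demographics", "lending_decision")
print(granted, refused)
```

The point of logging refusals as well as grants is that the audit trail can then show both who used the data and that incompatible uses were actually blocked.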

Promote sharing and reuse of content: The library promotes the reuse of its assets through a subscription and a checkout process. Similarly, your data library and its operation should prevent each user from making a copy of the data feed just because they require a different (sub)set of variables. Even better, your library can enforce different privacy policies around the reuse of the data. In addition, your data strategy should promote the reuse of a data set instance despite the syntactic differences between the source and the consumer. For example, if the consumer wants data in a particular format or within a particular system, your data architecture should remove the need to make duplicates.

Establish lineage of your digital assets: Every book in the library has an author. In addition, it has references to the works from which it was derived. Finally, there is a record of all of its subscribers. Similarly, your data strategy should be able to answer these questions for your data assets: Which applications created these variables? Who is using these variables? What variables have been derived from the master variables? From which parent variables was the subject variable derived?
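
One way to picture these lineage questions is as lookups on a small graph of variables. The following is a minimal sketch under assumed names: the LineageRegistry class, the variable names and the job names are all made up for illustration, and a real deployment would capture this information automatically from job runs rather than by hand.

```python
from collections import defaultdict


class LineageRegistry:
    def __init__(self):
        self.created_by = {}               # variable -> application that produced it
        self.parents = defaultdict(set)    # variable -> variables it was derived from
        self.consumers = defaultdict(set)  # variable -> applications that read it

    def register(self, variable, application, derived_from=()):
        self.created_by[variable] = application
        self.parents[variable].update(derived_from)

    def record_read(self, variable, application):
        self.consumers[variable].add(application)

    def ancestors(self, variable):
        """All variables this one was derived from, directly or indirectly."""
        seen = set()
        stack = list(self.parents.get(variable, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def descendants(self, variable):
        """All registered variables derived from this one."""
        return {v for v in self.created_by if variable in self.ancestors(v)}


registry = LineageRegistry()
registry.register("txn_amount", "ingest_job")
registry.register("avg_spend_30d", "feature_job", derived_from=["txn_amount"])
registry.register("risk_score", "scoring_job", derived_from=["avg_spend_30d"])
registry.record_read("risk_score", "fraud_dashboard")

print(registry.created_by["risk_score"])             # which application created it
print(sorted(registry.consumers["risk_score"]))      # who is using it
print(sorted(registry.descendants("txn_amount")))    # what was derived from it
print(sorted(registry.ancestors("risk_score")))      # what it was derived from
```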

Develop a process to procure your data assets: Unlike a lake, where the seasonality of the storms determines how the lake is filled, you need to have a more sophisticated plan for how to develop and procure your data assets. Is there enough shelf space for a new book? In other words, is there a plan for predicting future capacity requirements? Are your (new) data resources complying with the required syntactic, compression and semantic standards before they can be part of the data library? Are your data feeds integrating with the cataloging and privacy-enforcement protocols?

Monitor the quality of your assets: A library is judged by the covers of its books. If the quality of its assets deteriorates, it is no longer a library but an ill-maintained mausoleum. To avoid such a fate for your data strategy, it is important to continually monitor and manage the quality of the data applications that produce your digital assets. Ensure that your data applications are meeting their SLAs, that no data exceptions are occurring during the creation of the data products, and that the variables are meeting or beating the quality parameters established by the librarian of your data assets.
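
To make the checks above concrete, here is a small sketch of what an automated quality gate on a single job run might look like. The JobRun fields, the quality_issues function and the 1% reject-rate threshold are assumptions chosen for illustration; in practice the thresholds would be whatever your librarian has set.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    name: str
    runtime_minutes: float
    sla_minutes: float
    rows_read: int
    rows_rejected: int  # records that raised a data exception


def quality_issues(run: JobRun, max_reject_rate: float = 0.01) -> list:
    """Return a list of problems found in one run; empty means it passed."""
    issues = []
    if run.runtime_minutes > run.sla_minutes:
        issues.append("SLA missed")
    # Guard against division by zero when a run read no rows at all.
    reject_rate = run.rows_rejected / run.rows_read if run.rows_read else 0.0
    if reject_rate > max_reject_rate:
        issues.append("reject rate above threshold")
    return issues


healthy = JobRun("daily_features", 20, 30, 1000, 0)
unhealthy = JobRun("daily_scores", 45, 30, 1000, 50)
print(quality_issues(healthy))
print(quality_issues(unhealthy))
```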

The above are considered best practices – there is no one-size-fits-all strategy when it comes to Hadoop, let alone big data. However, what is undeniable is that the big data technology world is ever changing. Moreover, every single enterprise with a big data strategy needs a solution to assure data integrity, accessibility, collaboration and, ultimately, governance.