Recduce Performance Risks of Managing Hadoop as a Shared Service

Let’s face it; ensuring reliable application performance in a Hadoop shared services environment is hard. With more organizations moving this direction, we’ve developed a clear understanding of some of the common challenges almost every organization has faced along the way and we want to share those with you. Hopefully, armed with this information you can anticipate issues and use some of the best practices we’ve developed to mitigate these challenges.

We’ve spoken with hundreds of organizations and the main drivers for implementing a shared service are: easier long-term management, better resource utilization, efficiency and lower costs. Most are either running a large Hadoop cluster as a shared service already and are planning for scale or they are planning to consolidate smaller team clusters to a single large, shared, cluster within the next 6-12 months. From an organizational perspective, it’s varies widely where these teams are located but, in general, the applications are typically owned by development teams or data scientists while the shared cluster is managed by a central team of data engineers or IT operators. Some common themes of these shared service environments include:

Multiple business teams are developing applications using the technology that best suites their use case and/or skill set
As a result, most environments are heterogeneous, meaning multiple frameworks and technologies in play (MapReduce, Hive, Pig, Scalding, Spark, Oozie, Sqoop, Control M… the list goes on and on and on)
Most business critical applications have service levels expectations
Lots of Hive queries out there, both scheduled and ad-hoc
MapReduce is still the workhorse compute fabric but others like Apache Tez, Spark, Flink, Kafka, etc. are being tested, where it makes sense, with teams eventually rewriting or porting applications to other fabrics

Ok… so now imagine you are a member of the data engineering or IT operations team who is tasked with managing this hot mess of technology coupled with applications of varying degrees of business criticality and performance expectations. You basically have all the responsibility of optimizing cluster performance but almost no visibility to the applications running on it. At scale, this lack of visibility into the data pipeline or application performance makes providing a high quality of service pretty tough. Bottom-line is lack of visibility into an applications performance is the key issue that these teams find they cannot solve with cluster management solutions. It simply requires a different level of visibility within the environment.

The symptoms manifest themselves in 5 operational challenges (there are more but these are the most common):

We can’t maintain reliable, predictable performance for scheduled jobs/queries. If there is a slowdown, it takes too long to troubleshoot the issue and we cannot easily find the root cause
We don’t know what teams are consuming what resources and if they are being consumed efficiently, especially related to ad-hoc Hive queries
We are not sure what service levels we can commit to for business critical applications
We don’t know which jobs/queries/applications are business critical when there is a slowdown, who owns them or the downstream impacts
We have policies and best practices in place but its difficult to enforce them across the organization

There are a number of excellent cluster management and monitoring solutions, but they all focus on performance metrics for the cluster i.e. CPU utilization, I/O, disk performance, disk health, etc. However, when it comes to your applications, these solutions simply do not capture the metrics to provide application level performance monitoring. This why most DevOps teams are forced to wade through the resource manager, log files and thousands of lines of code to troubleshoot a problem.

To mitigate the reliability and performance risks in a shared service environment, you need to think about application performance monitoring and not just cluster performance monitoring. To make that shift, you will need to implement most, if not all, of these best practices:

Create a collaborative environment across development, QA, and operations teams covering the lifecycle of the application. This includes technology, people and process.

Collaboration should start at the development phase with your development, data analysts and operations teams working together to build better data applications that are optimized and resilient
When things go wrong, enable notifications to quickly assemble a cross-functional team to troubleshoot the application, identify the root cause and quickly fix the issue.
In both cases, leverage technology to allow your team members access the same application performance data at the same time. This should also include access to historical runs so present and past performance can be compared.

Segment your environment by assigning applications to teams and associating other business relevant metadata such as business priority, data security level, cluster, technology, platform, etc.

To understand what is happening, where and by whom, the environment needs to be organized and segmented in ways that are meaningful and aligned to how your business operates.
Associating applications to teams, departments or business units allow your operations teams to create the necessary cross-functional support organization to troubleshoot issues.
Segmentation also makes reporting more robust as you can create chargebacks, audit and/or usage reports by multiple dimensions. This makes managing the business of Big Data easier and provides evidence of ROI for senior leadership.

Provide operations teams with a single view of all the applications running in the environment

Bouncing between multiple monitoring tools to track down issues is not sustainable for a shared service team managing potentially dozens of technologies. A single view of the status of all the applications, regardless of cluster, technology or framework is essential to enabling these teams to support the enterprise effectively.

Give cross-functional teams the ability to visualize an entire data pipeline vs. just a single step/job/task.

Some cluster monitoring tools provide some basic information about how a job/task performed but only about that job/task and only for that moment of time. By enabling your development and operations teams to visualize the entire data pipeline/data flow, two things happen: 1) everyone can validate the application is behaving as intended; and 2) teams understand the inter-app dependencies and the technologies used in the application.
This is also a great way to enable the operations teams to advise data analysts or data scientists on best practices and make their applications or queries more efficient, saving everyone headaches in the long-term.

Surface real-time and historic application performance status with the ability to drill down into performance details such as slice rate, bytes read/written, wait time, processing time, etc.

Visualizing application performance metrics will help your teams quickly troubleshoot if the application is not efficient, if recent changes impacted performance or if there are resource constraints on the cluster. This alone can save you hours of combing through log files and provides the data you need to tune the application right the first time.
Performance monitoring details should surface the application metadata tags business context can be applied to any issue. For example, identify what team owns the application/query, the business priority class, etc. This information can be invaluable when a business critical scheduled job is being starved of resources because a large ad-hoc query is submitted during run-time.
In a shared service environment, more business teams ask for service levels so they can be sure their data is available when they need it. For team to confidently commit to service levels, you need to chart historic performance trends so you can determine what service levels are achievable based on actual data not on wishful thinking.

Track data lineage to support compliance and audit requirements.

This step is often overlooked as a backend reporting problem but the reality is you need to capture this data from the start. It can be an arduous and time consuming process to retrospectively string together logs to track sources, transformations and ultimate destination of data. Your team has better things to do with their time. To simplify the reporting process, enable the compliance teams to visualize the entire data pipeline/data flow so they can see, in one view, sources transformations, joins, and final output location of the entire application.

Enable alerting and notifications for the various teams

When anomalies are detected, you want to notify all the relevant team members so cross-functional troubleshooting and impact analysis can start immediately.
Utilize existing notification systems and processes so you are not creating a separate support process for these applications

We have found that organizations that implement these best practices for application performance monitoring are up to 40% more efficient in both managing application development and production environments but it takes technology, people and process to get there.

To learn more, download our free whitepaper: 9 Best Practices to Achieve Operational Readiness on Hadoop

******

About the Author: Kim is Sr. Director of Product Marketing at Concurrent, Inc., providers of Big Data technology solutions, Cascading and Driven. Cascading is an open source data processing development platform with over 500,000 downloads per month. Driven is a application performance management solution to monitor, collaborate and control all your Big Data applications in one place.

If you are running a lot of Hive queries (scheduled and ad-hoc) and are struggling with performance, watch our 30-min webinar and learn 5 best practices to achieve operational excellence for Hive.

Take Driven for a test drive with our free 30-day trial.

Performance Management for Big Data Apps

Mitigating 5 Performance Risks of Managing Hadoop as a Shared Service