Whether you have 5 or 500 Big Data applications in production on your Hadoop cluster(s), the operational challenge is the same (though scale comes into the picture at some point), and understanding how your applications are actually behaving is key. There are a number of tools available to help you manage and monitor your Hadoop cluster. While tools like JobTracker, Cloudera Navigator and Ambari will tell you that something has happened, they will not tell you why, who is impacted or how critical the problem is to the business. What I’m suggesting is that you think about performance monitoring and management for your applications, not just your cluster. Companies like Spotify, HomeAway and Deerwalk saw a huge difference in their quality of service to the business when they changed 3 things within their Big Data development and operations environment, starting with implementing Driven, a performance management solution for Hadoop applications.
#1 – Start to monitor the performance of your applications, not just the cluster
With Driven, you know how the app is supposed to behave, and by looking at the run-time application performance you can quickly see where to start researching. For example, I can quickly see from the visualization of my application (below) that there are 3 steps that are supposed to run in parallel. However, from the run-time stats, I can see these steps are actually executing sequentially because they are waiting. This could be because the cluster is too small or because something else is consuming the resources.
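The parallel-versus-sequential check described above can be sketched in a few lines of Python. This is only an illustration and not Driven's API: the step names, the timestamps and the `overlaps` and `ran_sequentially` helpers are all hypothetical, and it assumes you can obtain a start and end time for each step from your own run-time stats.

```python
from itertools import combinations

# Hypothetical run-time stats: (step name, start second, end second).
# These numbers are invented for illustration only.
steps = [
    ("step_a", 0, 120),
    ("step_b", 120, 250),
    ("step_c", 250, 400),
]

def overlaps(a, b):
    """Return True if two (name, start, end) steps were running at the same time."""
    return a[1] < b[2] and b[1] < a[2]

def ran_sequentially(steps):
    """Return True if no pair of steps overlapped in time."""
    return not any(overlaps(a, b) for a, b in combinations(steps, 2))

if ran_sequentially(steps):
    print("Steps expected to run in parallel ran one after another")
```

If the check fires for steps the application plan marks as parallel, that is the cue to look at cluster capacity or competing workloads.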
I’ll just take a sec to explain the Driven DAG view (above/below), which shows how the application is behaving. By visualizing your entire data pipeline, you can quickly see if something unintended is happening. The DAG is interactive, so you can track the data lineage of the entire pipeline from the data source through all the transforms to the target. Many customers have incorporated the Driven DAG into their development process to ensure their data pipelines are properly constructed, which helps reduce the chance of downstream operational issues.
Back to tracking down performance issues… For this application, I can see one step is taking a long time to execute.
In both of these cases, I know exactly where to start digging deeper. Driven drills down into the details of the application and the cluster performance and surfaces historical and current run-time performance data. In most cases, you can pinpoint exactly what else was happening on the cluster to determine whether it was a one-time event or whether something has changed. For an application performance problem, you can drill into each step and go all the way to the stack trace to pinpoint the line of code that needs to be optimized.
We hear from many customers with applications in production that when things go wrong, ops often blames dev and vice versa. The problem isn’t a lack of willingness to fix things but rather that there is no definitive performance analytics data both teams can view, at the same time, that shows exactly what happened and why. To fix this, Driven lets you create cross-functional teams and segment applications into relevant groups. Each team has access to shared application views, which promotes collaboration and accelerates issue resolution.
For example, when the application above was moved to production, it was assigned to a team, a cluster, an application type, and given a priority grouping. As a result, I know who owns this application, who needs to be notified and how critical it is to the business. I also know what version of the application is running and when the last change was made.
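As a rough sketch of the kind of metadata described above, the record below ties an application to a team, a cluster, an application type, a priority and a version. It is not Driven's data model: the `AppRecord` class, the `lookup` helper and every value in the registry are hypothetical and only show how such tags answer "who owns this and how critical is it".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppRecord:
    """Hypothetical metadata attached to an application at deployment time."""
    name: str
    team: str
    cluster: str
    app_type: str
    priority: str
    version: str

# Invented registry entries, for illustration only.
registry = [
    AppRecord("clickstream_etl", "data-eng", "prod-a", "ETL", "critical", "1.4.2"),
    AppRecord("weekly_report", "analytics", "prod-b", "reporting", "low", "0.9.0"),
]

def lookup(registry, name):
    """Return the record for an application, or None if it is not registered."""
    return next((r for r in registry if r.name == name), None)

app = lookup(registry, "clickstream_etl")
print(f"{app.name} v{app.version}: owned by {app.team}, priority {app.priority}")
```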
Development and operations team members use a shared view to drill into each step, historic performance, cluster activity, etc. The key is everyone is looking at the same data at the same time so everyone can focus on finding the right path to resolution.
From our experience and that of our customers, this level of visibility is exactly what you need to track down the root cause of a problem, fast, and it goes beyond what cluster monitoring tools provide.
#3 – Start to think proactively vs. reactively
To move from a reactive state to a more proactive one, you need to understand what the business expectation is for data delivery. This means managing service levels (SLAs) and applying business context to the applications running in your cluster(s). Obviously, not everything is critical, but for those applications that are, it is important that you give them special attention. With everything going on, it’s hard to do that without some help from a monitoring system. Most of you already know that cluster monitoring tools simply don’t go there.
Below is an example of a simple Driven dashboard that shows all the applications with an SLA of 5 min. You can quickly see the state of each application and set warning thresholds so you are aware of potential issues before they occur. In the event of a problem, Driven sends automated notifications to team members so everyone can quickly address the issue.
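To make the idea of a warning threshold concrete, here is a minimal Python sketch that classifies applications against the 5-minute SLA. It is an assumption-laden illustration rather than how Driven evaluates SLAs: the 80% warning fraction, the `sla_state` function and the application names and runtimes are all hypothetical.

```python
SLA_SECONDS = 5 * 60   # the 5-minute SLA from the dashboard example
WARN_FRACTION = 0.8    # assumption: warn once a run reaches 80% of the SLA

def sla_state(runtime_seconds, sla=SLA_SECONDS, warn_fraction=WARN_FRACTION):
    """Classify a run as 'ok', 'warning' or 'breached' relative to its SLA."""
    if runtime_seconds > sla:
        return "breached"
    if runtime_seconds >= sla * warn_fraction:
        return "warning"
    return "ok"

# Invented runtimes in seconds, for illustration only.
runtimes = {"daily_report": 140, "clickstream_etl": 260, "billing_export": 330}

for app, seconds in runtimes.items():
    print(app, sla_state(seconds))
```

The warning band is what makes this proactive: a run that lands in it has not missed its SLA yet, but it is close enough that someone should look before the next run does.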
Additionally, this customer set up another dashboard (below) to monitor applications using certain datasets that contain PII. By drilling down into the DAG, compliance teams can see what is happening at each step in detail and create reports. This enables operations and compliance teams to do “anytime auditing” and make sure applications are not manipulating the data improperly or delivering it to unauthorized systems.
In conclusion, you need cluster monitoring solutions to monitor your Hadoop or Spark platforms, but when it comes to managing your Big Data applications, you need more. Without this next level of visibility into application performance metrics, it is very time-consuming to track down issues. Driven captures a huge amount of application metadata at run-time. It uses that data, along with metadata you add, to provide the deep application performance analytics you need to deliver the quality of service your business expects. It works with MapReduce, Hive, Pig, Cascading, Scalding, Cascalog and Spark applications.