Author Archives: Kim Loughead

Whether you have 5 or 500 Big Data applications in production on your Hadoop cluster(s), the operational challenge is the same (though scale comes into the picture at some point), and understanding how your applications are actually behaving is key. There are a number of tools available to help you manage and monitor your Hadoop cluster. While tools like JobTracker, Cloudera Navigator and Ambari will tell you that something has happened, they will not tell you why it happened, who is impacted, or how critical the problem is to the business. What I’m suggesting is that you think about performance monitoring and management for your applications, not just your cluster. Companies like Spotify, HomeAway and Deerwalk saw a huge difference in their quality of service to the business when they changed 3 things within their Big Data development and operations environment, starting with implementing Driven, a performance management solution for Hadoop applications.

#1 – Start to monitor the performance of your applications, not just the cluster

With Driven, you know how the app is supposed to behave, and by looking at the run-time application performance you can quickly see where to start researching. For example, I can quickly see from the visualization of my application (below) that there are 3 steps that are supposed to run in parallel. However, from the run-time stats, I can see these steps are actually executing sequentially because they are waiting for resources. This could be because the cluster is too small or because something else is consuming the resources.

custerslow_dag

custerslow
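To make the "parallel on paper, sequential at run time" symptom concrete, here is a minimal sketch in Python. It is not Driven's API: the `StepRun` record, the step names, and the timings are all hypothetical, and it assumes you can export a start and end time for each step from whatever monitoring data you have.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class StepRun:
    """Hypothetical record of one step's run-time window (seconds since app start)."""
    name: str
    start: float
    end: float


def overlaps(a: StepRun, b: StepRun) -> bool:
    # Two intervals overlap when each one starts before the other ends.
    return a.start < b.end and b.start < a.end


def serialized_pairs(steps):
    """Return the name pairs of steps whose run windows never overlapped."""
    return [(a.name, b.name) for a, b in combinations(steps, 2) if not overlaps(a, b)]


# Made-up timings for three steps that the plan says should run in parallel.
runs = [
    StepRun("step_a", 0, 120),
    StepRun("step_b", 120, 250),
    StepRun("step_c", 250, 390),
]

for first, second in serialized_pairs(runs):
    print(f"{first} and {second} were expected to run in parallel but did not overlap")
```

If every pair shows up in the result, the steps ran strictly one after another, which is the pattern described above and a hint to look at cluster capacity or competing workloads.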

I’ll just take a sec to explain the Driven DAG view (above/below), which shows how the application is behaving. By visualizing your entire data pipeline, you can quickly see if something unintended is happening. The DAG is interactive, so you can track the data lineage of the entire pipeline from the data source through all the transforms to the target. Many customers have incorporated the Driven DAG into their development process to ensure their data pipelines are properly constructed, which helps reduce the chance of downstream operational issues.

Back to tracking down performance issues…  For this application, I can see one step is taking a long time to execute.

slowapp

In both of these cases, I know exactly where to start digging deeper. Driven drills down into the details of the application and the cluster performance and surfaces historical and current run-time performance data. In most cases, you can pinpoint exactly what else was happening on the cluster to determine whether it was a one-time event or whether something has changed. For a performance problem in your app, you can drill into each step and go all the way to the stack trace to pinpoint the line of code that needs to be optimized.

#2 – Foster collaboration across dev/ops teams

We hear from many customers who have applications in production that when things go wrong, ops often blames dev and vice versa. The problem isn’t a lack of willingness to fix the problem; rather, there is no definitive performance analytics data that both teams can view, at the same time, that shows exactly what happened and why. To fix this, Driven lets you create cross-functional teams and segment applications into relevant groups. Each team has access to shared application views, which promotes collaboration and accelerates issue resolution.

For example, when the application above was moved to production, it was assigned to a team, a cluster, an application type, and given a priority grouping.  As a result, I know who owns this application, who needs to be notified and how critical it is to the business.  I also know what version of the application is running and when the last change was made.

stepview

Development and operations team members use a shared view to drill into each step, historic performance, cluster activity, etc.  The key is everyone is looking at the same data at the same time so everyone can focus on finding the right path to resolution.

From our experience and that of our customers, this level of visibility is exactly what you need to track down the root cause of a problem, fast, and it goes beyond what cluster monitoring tools provide.

#3 – Start to think proactively vs. reactively

To move from a reactive state to a more proactive one, you need to understand what the business expectation is for data delivery. This means managing service levels (SLAs) and applying business context to the applications running in your cluster(s). Obviously, not everything is critical, but for those applications that are, it is important that you give them special attention. With everything going on, it’s hard to do that without some help from a monitoring system. Most of you already know that cluster monitoring tools simply don’t go there.

Below is an example of a simple Driven dashboard that shows all the applications with an SLA of 5 min.  You can quickly see the state of each application and set warning thresholds so you are aware of potential issues before they occur.  In the event of a problem, Driven sends automated notifications to team members so everyone can quickly address the issue.

sla5min
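The idea of a warning threshold that fires before the SLA is actually missed can be sketched in a few lines of Python. This is an illustration only, not how Driven implements it: the function name, the 80% warning fraction, and the application names and runtimes are assumptions; only the 5-minute SLA comes from the dashboard described above.

```python
def sla_status(runtime_s: float, sla_s: float = 300.0, warn_fraction: float = 0.8) -> str:
    """Classify a run against an SLA (default 5 minutes).

    warn_fraction is a hypothetical setting: the share of the SLA budget
    after which a run is flagged as at risk, before it is actually late.
    """
    if runtime_s > sla_s:
        return "violated"
    if runtime_s >= warn_fraction * sla_s:
        return "warning"
    return "ok"


# Made-up runtimes (in seconds) for three hypothetical applications.
latest_runs = {"daily_rollup": 95.0, "fraud_scoring": 250.0, "partner_feed": 340.0}

for app, runtime in latest_runs.items():
    print(f"{app}: {sla_status(runtime)}")
```

The point of the middle band is that the team hears about `fraud_scoring` while it is still inside its SLA, rather than after the business has already noticed a late delivery.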

Additionally, this customer set up another dashboard (below) to monitor applications using certain datasets that contain PII. By drilling down into the DAG, compliance teams can see what is happening at each step in detail and create reports. This enables operations and compliance teams to do “anytime auditing” quickly and make sure the applications are not improperly manipulating the data or delivering it to unauthorized systems.

piiviewdatalineage
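As a rough illustration of what a lineage check on a DAG looks like, the sketch below walks a pipeline graph from a PII source and lists any final targets that are not on an approved list. Everything here is hypothetical (the graph, the node names, the `approved` set); it is a generic breadth-first traversal, not Driven's implementation.

```python
from collections import deque


def downstream(graph, source):
    """Return every node reachable from source in a DAG given as {node: [children]}."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def unauthorized_sinks(graph, source, approved):
    """Return the sorted final targets fed by source that are not in approved."""
    reached = downstream(graph, source)
    sinks = {node for node in reached if not graph.get(node)}  # nodes with no children
    return sorted(sinks - set(approved))


# Hypothetical pipeline: one PII dataset feeding two branches.
pipeline = {
    "customers_pii": ["mask_fields", "join_orders"],
    "mask_fields": ["warehouse"],
    "join_orders": ["marketing_export"],
}

print(unauthorized_sinks(pipeline, "customers_pii", approved={"warehouse"}))
```

In this made-up pipeline the check reports `marketing_export`, because the PII reaches it through `join_orders` without passing through an approved target.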

In conclusion, you need cluster monitoring solutions to monitor your Hadoop or Spark platforms, but when it comes to managing your Big Data applications, you need more. Without this next level of visibility into application performance metrics, tracking down issues is very time-consuming. Driven captures a huge amount of application metadata at run-time. It uses that data, along with metadata you add, to provide the deep application performance analytics you need to deliver the quality of service your business expects. It works with MapReduce, Hive, Pig, Cascading, Scalding, Cascalog and Spark applications.

If you want to check out Driven for yourself, go to our self-guided tour or try Driven for free.

Traditionally, a lot of time is spent collecting and preparing data, and then, eventually, you get around to building an app that makes use of the data. You create the right views and get the insights you need.

Life is awesome – all that data you collected has value, and the Hadoop project is a success.

But then something goes wrong: data stops flowing, processing times shoot up, cluster resources are maxed out, and in some extreme circumstances, the app breaks and comes to a halt.

Life is no longer awesome. It totally sucks.

So what do you do next?

  • As a developer, what tools do you have that can easily pinpoint code-level issues?
  • As a member of an operations team, how do you know which app is hogging all the resources and impacting the shared Hadoop infrastructure?
  • If you’re a compliance manager, how do you view the data lineage? What data was ingested? Where was it output?

All these are burning questions that need an immediate answer. But how and with what do you answer them?

The answer is simple.

  • How: With a comprehensive Hadoop application management platform
  • With what: Driven

Driven is the only Hadoop application performance management platform that delivers unprecedented intelligence into what your Hadoop app is really doing, allowing organizations to build better apps that run reliably and use resources effectively.

With Driven:

  • Developers can pinpoint where an app broke, down to the specific line of code;
  • Operations teams can monitor processing times, implement chargeback models, or track SLAs;
  • Compliance teams can track data lineage to ensure that they are adhering to external regulatory requirements.

When something goes wrong with your Hadoop app, Driven is the solution you need to get back on track.

With that being said, I’m delighted to announce the latest version of Driven – Driven v1.2.

Driven 1.2 delivers the ability to:

Achieve better governance and compliance for Hadoop apps

  • Create and export saved views of the metadata repository to support governance/compliance requirements.
  • Visualize lineage – see exactly how your Hadoop app ingests, manipulates, and outputs data.
  • Easily detect apps that violate SLAs and policies.

Nurture a culture of operational excellence around Hadoop

  • Create Jira issues with views and data for quickly collaborating to resolve performance problems.
  • Integrate alerts with popular notification platforms like HipChat, PagerDuty, and Nagios.
  • Segment performance by team or department, or create custom tags for role-based views, chargeback models, and capacity planning.

If you’re building Hadoop apps without Driven, you’re flying blind.

Download the 30-day trial today, and see why so many organizations have increased the reliability and availability of their Hadoop apps. Click here to start a free trial.