Author Archives: Kim Loughead

In our last post, we talked about the possible options for organizing to achieve operational excellence on Hadoop and the pros and cons of each model. A couple of key points surfaced during our conversations with customers who have successful production deployments: 1) be conscious of the fact that data applications are rarely developed and managed in a vacuum, so setting up a collaborative environment is important, and 2) there is an inverse relationship between scale and performance visibility. In other words, the more you scale, the less visibility you have into what is happening in your environment.

In this post, we will dive deeper into why fostering a collaborative environment and application performance visibility are key factors to achieving Hadoop production readiness and operational excellence with your Big Data applications.

Before you ask… yes, there are a number of excellent cluster management and monitoring solutions and you absolutely need them. They focus on performance metrics for the cluster, i.e. CPU utilization, I/O, disk performance, disk health, etc. However, when it comes to your applications, these solutions simply do not capture the metrics needed to provide application-level performance monitoring. This is why most operations teams are forced to wade through Resource Manager, log files and thousands of lines of code to troubleshoot a performance problem. Needless to say, this can take hours and, at scale, is not sustainable.

To mitigate reliability and performance risks as you scale, you need to think about application performance monitoring as well as cluster performance monitoring. So what does that mean? These are the common challenges you will face as you scale that will make it difficult to maintain operational excellence:

  • We can’t maintain reliable, predictable performance for scheduled jobs/queries. If there is a slowdown, it takes too long to troubleshoot the issue and we cannot easily find the root cause
  • We don’t know which teams are consuming which resources, or whether they are being consumed efficiently, especially for ad-hoc Hive queries
  • We are not sure what service levels we can commit to for business critical applications
  • We don’t know which jobs/queries/applications are business critical when there is a slowdown, who owns them or the downstream impacts
  • We have policies and best practices in place but it’s difficult to enforce them across the organization

To address these challenges, we’ve compiled this list of 7 things to do to ensure your Hadoop production implementation is sustainable and meets business expectations:

Create a collaborative environment across development, QA, and operations teams covering the lifecycle of the application. This includes technology, people, and process.

 


  • Collaboration should start at the development phase with your development, data analysts and operations teams working together to build better data applications that are optimized and resilient
  • When things go wrong, enable notifications to quickly assemble a cross-functional team to troubleshoot the application, identify the root cause and quickly fix the issue.
  • In both cases, leverage technology to allow your team members access to the same application performance data at the same time. This should also include access to historical runs so present and past performance can be compared.

Segment your environment by assigning applications to teams and associating other business-relevant metadata such as business priority, data security level, cluster, technology, platform, etc.

 

  • To understand what is happening, where and by whom, the environment needs to be organized and segmented in ways that are meaningful and aligned to how your business operates.
  • Associating applications with teams, departments or business units allows your operations teams to create the necessary cross-functional support organization to troubleshoot issues.
  • Segmentation also makes reporting more robust, as you can create chargeback, audit and/or usage reports across multiple dimensions. This makes managing the business of Big Data easier and provides evidence of ROI for senior leadership.
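To make this concrete, here is a minimal sketch of how such business metadata might be attached to a data application, based on the Cascading AppProps tagging API as we understand it (package names follow Cascading 2.x; the application name and tag values are illustrative assumptions, not a prescribed taxonomy):

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.property.AppProps;

    public class TaggedAppLauncher
      {
      public static Properties businessMetadata()
        {
        Properties properties = new Properties();

        // Name the application so it is recognizable in monitoring views (illustrative name)
        AppProps.setApplicationName( properties, "nightly-revenue-rollup" );

        // Illustrative tags: owning team, business priority, data security level
        AppProps.addApplicationTag( properties, "team:revenue-analytics" );
        AppProps.addApplicationTag( properties, "priority:business-critical" );
        AppProps.addApplicationTag( properties, "security:internal" );

        return properties;
        }

      public static void main( String[] args )
        {
        // The same properties are handed to the planner, so every run carries its business context
        FlowConnector connector = new HadoopFlowConnector( businessMetadata() );
        // ... define taps and pipes, then connector.connect( flowDef ).complete();
        }
      }

Even if you are not using Cascading, the same idea applies: carry the owning team, priority and security level with the job submission (for example as scheduler tags or job properties) so the operations view can group and filter by them.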

Provide operations teams with a single view of all the applications running in the environment

  • Bouncing between multiple monitoring tools to track down issues is not sustainable for operations teams managing potentially dozens of technologies. A single view of the status of all the applications, regardless of cluster, technology or framework is essential to enabling these teams to support the enterprise effectively.

Give cross-functional teams the ability to visualize an entire data pipeline vs. just a single step/job/task.

 

  • Some cluster monitoring tools provide basic information about how a job/task performed, but only about that job/task and only for that moment in time. By enabling your development and operations teams to visualize the entire data pipeline/data flow, two things happen: 1) everyone can validate the application is behaving as intended; and 2) teams understand the inter-app dependencies and the technologies used in the application.
  • This is also a great way to enable the operations teams to advise data analysts or data scientists on best practices and make their applications or queries more efficient, saving everyone headaches in the long-term.

Surface real-time and historic application performance status with the ability to drill down into performance details such as slice rate, bytes read/written, wait time, processing time, etc.

  • Visualizing application performance metrics will help your teams quickly troubleshoot if the application is not efficient, if recent changes impacted performance or if there are resource constraints on the cluster. This alone can save you hours of combing through log files and provides the data you need to tune the application right the first time. This is a key tactic to achieving operational excellence.
  • Performance monitoring details should surface the application metadata tags so business context can be applied to any issue. For example, identify which team owns the application/query, its business priority class, etc. This information can be invaluable when a business critical scheduled job is being starved of resources because a large ad-hoc query is submitted during run-time.
  • In a shared resource environment, more business teams will ask for service levels so they can be sure their data is available when they need it. For teams to confidently commit to service levels, you need to chart historic performance trends so you can determine what service levels are achievable based on actual data, not wishful thinking.
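As a point of reference, here is a hypothetical, minimal record of the kind of application-level detail worth surfacing; the field names are illustrative and are not Driven's actual data model:

    /** Hypothetical application-level metrics record; field names are illustrative. */
    public class AppRunMetrics
      {
      public String appId;            // the logical application, not an individual MapReduce job
      public String owningTeam;       // carried over from the metadata tags described above
      public long bytesRead;
      public long bytesWritten;
      public double sliceRate;        // units of work (slices) completed per second
      public long waitTimeMs;         // time spent waiting on cluster resources
      public long processingTimeMs;   // time spent actually executing

      /** Flag a run whose total wall-clock time drifts well beyond its historical baseline. */
      public boolean isSlowerThanBaseline( long baselineTotalMs, double tolerance )
        {
        long totalMs = waitTimeMs + processingTimeMs;
        return totalMs > baselineTotalMs * ( 1.0 + tolerance );
        }
      }

Comparing a run against its own history, as in the last method, is what makes service-level commitments defensible rather than wishful.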

Track data lineage to support compliance and audit requirements.

  • This step is often overlooked as a backend reporting problem, but the reality is you need to capture this data from the start. It can be an arduous and time-consuming process to retrospectively string together logs to track the sources, transformations and ultimate destination of data. Your team has better things to do with their time. To simplify the reporting process, enable the compliance teams to visualize the entire data pipeline/data flow so they can see, in one view, the sources, transformations, joins, and final output location of the entire application.
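For illustration, the sketch below shows the kind of lineage record you would want captured automatically for every step while the application runs; the schema is an assumption, not a specific product's format:

    import java.time.Instant;
    import java.util.List;

    /** Illustrative per-step lineage record captured at run time; the schema is an assumption. */
    public class LineageRecord
      {
      public String appId;             // which application produced this step
      public String stepName;          // e.g. "join-orders-with-customers"
      public List<String> sources;     // input paths or tables read by this step
      public List<String> sinks;       // output paths or tables written by this step
      public String transformation;    // e.g. "inner join on customer_id, then daily aggregation"
      public Instant recordedAt;       // when the step ran, for audit timelines
      }

Appending one of these per step to an audit store turns compliance reporting into a simple query instead of a log archaeology exercise.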

Enable alerting and notifications for the various teams

  • When anomalies are detected, you want to notify all the relevant team members so cross-functional troubleshooting and impact analysis can start immediately.
  • Utilize existing notification systems and processes so you are not creating a separate support process for these applications.
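As a minimal sketch of reusing an existing channel, the snippet below posts an anomaly to a generic incident webhook using the standard Java 11 HTTP client; the endpoint URL and payload fields are assumptions standing in for whatever notification system you already run:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class AppAlertNotifier
      {
      // Hypothetical endpoint for the notification system you already run (PagerDuty, Slack gateway, email bridge, ...)
      private static final String WEBHOOK_URL = "https://ops.example.com/hooks/bigdata-alerts";

      public static void notifySlaAtRisk( String appId, String owningTeam, long minutesLate ) throws Exception
        {
        String payload = String.format(
            "{\"app\":\"%s\",\"team\":\"%s\",\"event\":\"sla-at-risk\",\"minutesLate\":%d}",
            appId, owningTeam, minutesLate );

        HttpRequest request = HttpRequest.newBuilder( URI.create( WEBHOOK_URL ) )
            .header( "Content-Type", "application/json" )
            .POST( HttpRequest.BodyPublishers.ofString( payload ) )
            .build();

        // Reuse the existing channel so Big Data alerts follow the same support process as everything else
        HttpClient.newHttpClient().send( request, HttpResponse.BodyHandlers.ofString() );
        }
      }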

We have found that organizations that implement these best practices for application performance monitoring are up to 40% more efficient in both managing application development and production environments. Implementing these tactics will help you ensure you can sustain Hadoop operational excellence and meet business needs, but it takes technology, people and process to get there.

To learn more, download our free whitepaper: 9 Best Practices to Achieve Operational Readiness on Hadoop

******

About the Author: Kim is Sr. Director of Product Marketing at Driven, Inc., providers of Big Data technology solutions, Cascading and Driven. Cascading is an open source data processing development platform with over 500,000 downloads per month. Driven is an application performance management solution to monitor, troubleshoot and manage all your Big Data applications in one place.

Take Driven for a test drive with our free 30-day trial.

First off, there is no right answer to how to organize to ensure Hadoop production readiness – only the right answer for you (sorry). The key is to understand the implications, the potential limitations and when to rethink how you are organized. Hadoop, like any other environment, is not static. It will change, mature and grow over time, so what works for you today may not work in 6 months or a year; stay flexible in your thinking.

What we will discuss is the different ways we have seen enterprises organize their teams for Hadoop Production Readiness – aka the process of moving their Hadoop and/or Spark applications into production and then scaling.

This includes some pros and cons of each approach and what to look for as indicators that it may be time to make a change. It’s also important to note that we have seen multiple models implemented in the same organization when a number of different divisions have active Big Data initiatives.

The Lone Wolf Approach

We all know at least one person in our lives who seems like they can do just about anything. Typically located in a line of business IT team, the lone wolf is the ultimate wearer of many hats. With flexibility and agility on their side, this developer is tasked with doing everything from development to QA to operations.
Pros – With little to no process or red tape to contend with, the developer has the independence to move at the speed of the business team. Technology decisions are left to them, and their degree of accountability is considerably high because of their independence, i.e. if they break it, they have to fix it.

Cons – There are some extreme limitations when you’re working alone. Once you’re beyond a few applications, scaling becomes more and more difficult. The Lone Wolf is wearing multiple hats, so new development can be bogged down and slowed as monitoring production applications takes more and more of his/her time.

Typically, there are multiple Lone Wolves spread around the organization, each making their own technology choices to best suit the needs of their line of business and each running their own dedicated cluster (potentially using different distributions). As multiple clusters and technologies are supported, the lone wolf approach quickly results in higher overall costs and lower resource efficiency.

Considerations to change to another model – This is probably somewhat obvious but scalability and costs are the primary considerations.

When a line of business scales beyond a handful of Hadoop applications or when multiple lines of business, with similar needs, are building data processing applications, you should consider moving away from this model and into a more team-based approach. Keep in mind, you need to balance the needs of the different lines of business to ensure there is an appropriate allocation of resources across teams or risk a business team reverting back to the lone wolf approach.

DevOps Team(s)

Similar to the lone wolf developer, a DevOps team is a small collection of developers, QA, and operations staff, also typically located within a line of business IT team. The difference is that each role is ideally spread across the team’s members, although the same people can cover multiple roles. A DevOps team could also support multiple business analytics teams within a department or business division.
Pros – In many ways, the DevOps team(s) are like the lone wolf developer. Only here they’re more like a pack of wolves hunting in sync. They have the independence to move quickly without being bogged down by the dreaded red tape. Again, DevOps teams are making technology decisions based on the needs of their line of business and there’s no denying the benefits of easily collaborating across teams to identify and troubleshoot issues.

Cons – Despite having greater numbers, DevOps teams will still have a hard time scaling beyond even a few dozen applications, and development can be slowed as resources may be wearing multiple hats. And just like the lone developer, there are higher overall costs as multiple lines of business stand up their own dedicated clusters and acquire specialized resources to support their technology choices.

Considerations to change to another model – Increased costs and poor resource efficiency are typically the primary considerations as organizations see multiple DevOps teams pop up to support different projects. We see a number of organizations start here, but with an eye towards a time when they will consolidate resources, both technology and people. Once they have proven the value of their Hadoop production applications at the line of business level, they move to another model to contain costs and increase resource utilization and development efficiency.

Shared Resource Environment with a Dedicated Cluster Management Team

Where a DevOps team has the benefit of a single team managing everything, organizing a dedicated cluster management team essentially separates Hadoop application development from Hadoop operations. In this model, one or more clusters are managed as a shared resource environment. Developers own the applications while ops owns the cluster and is responsible for its performance. Together, they are responsible for troubleshooting issues. To mitigate the risk of issues, the operations team will often set design and operations best-practice policies that the development teams are supposed to follow. However, the development teams, located within line of business IT, still have the authority to select the technology that best suits the needs of their project, assuming it adheres to policy (and sometimes even if it doesn’t).

The success of this model is dependent on a high degree of collaboration between the operations and development teams so fostering that type of environment is critical. This requires a much higher degree of application performance visibility and transparency throughout the life cycle of a data application.

Pros – Developers have the ability to focus their efforts mostly on development and subsequently can work faster. Unlike DevOps or the lone wolf, scalability to tens of thousands of applications is no longer an issue. In addition, costs are lower because the cluster is shared, enabling organizations to leverage their buying power. Resource efficiency also increases, as the cluster is less likely to be underutilized with multiple teams running applications. Typically, the cluster management team leverages the existing operations technology and processes that IT operations teams already use, but the resources are specialized and, therefore, not normally part of the corporate IT operations teams.

Cons – With operations teams managing multiple technologies, there also tends to be some finger pointing between development and ops teams when things go wrong. While policies exist, getting development teams to adhere to them is a challenge unless there is a review process as data applications are moved from development to production.

Poorly formed applications can easily kill the cluster and researching problems can take an exorbitant amount of time when combing through the endless sea of logs to see where the applications went wrong. Planning for resources and capacity can feel impossible with a mix of technologies and a mix of scheduled and ad-hoc jobs.

Without the right level of visibility into application performance, these environments can become extremely complex to manage, which risks a higher number of performance issues, such as slowdowns, that both the development teams and the operations team must manage.

Considerations to change to another model – Full integration into existing IT operations procedures and policies, because of security or regulatory compliance, along with support cost containment, are the driving factors for moving away from this model.

Corporate IT Operations Team Manages the Cluster

Armed with some specialized knowledge, IT operations manages the cluster and treats Hadoop like any other IT environment. Fully integrated into IT operations’ standard operating procedures and systems, this model brings some additional items to consider before you implement.

Pros – IT operations is often very similar to a dedicated team with regard to technology decisions, faster development, easier scalability, and lower costs from sharing the cluster. The difference here is that because the cluster is fully integrated, support costs are lower.

Cons – Most problems with a dedicated cluster team are also true here. Only now there is less independence to move at the speed of the business team due to operational policy adherence. Additionally, innovation is potentially slowed as the IT operations team must now deal with more processes and red tape.

Conclusion

Even if you decide to implement what may work now, you always have to be in tune with what your organization will need next. If you can anticipate the evolution, then you can plan better for the organizational shift and minimize disruption and anxiety for the line of business teams.

Regardless of how you organize, a common theme we hear is that the need for application execution and performance visibility increases exponentially as Hadoop matures and the number of production apps increases. This is because there are far more variables to consider to ensure you meet service levels and don’t slow down the cluster.

The reality is that visibility and your ability to control your environment actually diminish as you scale. Wading through log files and Resource Manager, when hundreds or even thousands of applications are running per day, simply takes too long and doesn’t provide the information needed to actually diagnose and solve problems. This is where solutions like Driven, Application Performance Management for Big Data, come in to provide the necessary application-level monitoring and performance visibility.

Success means your environment will become increasingly more chaotic to manage. How you organize to manage that chaos while creating a collaborative environment between development, operations, and the business becomes increasingly more important to maintaining the success of your Big Data implementation.

In the next post, we’ll discuss why this happens and start to detail what leading enterprises are doing to mitigate these risks.

******

About the Author: Kim is Sr. Director of Product Marketing at Concurrent, Inc., providers of Big Data technology solutions, Cascading and Driven. Cascading is an open source data processing development platform with over 500,000 downloads per month. Driven is an application performance management solution to monitor, collaborate and control all your Big Data applications in a single solution.

Take Driven for a free test drive and sign-up for our 30-day trial.

Ensuring reliable application performance in a production Hadoop environment is really hard. We feel your pain and after speaking with dozens of organizations with mature implementations, some common themes emerge about the nature of these environments:

  • Multiple business teams are developing applications using the technology that best suits their use case and/or skill set
  • As a result, most environments are heterogeneous, meaning multiple frameworks and technologies in play (MapReduce, Hive, Pig, Scalding, Spark, Oozie, Sqoop, Control M… the list goes on and on and on)
  • Most business-critical applications have service-level expectations
  • Lots of Hive queries out there, both scheduled and ad-hoc
  • MapReduce is still the workhorse compute fabric but others like Apache Tez, Spark, Flink, Kafka, etc. are being tested, where it makes sense, with teams eventually rewriting or porting applications to other fabrics

Ok… so now imagine you are a member of the data engineering or IT operations team tasked with managing this hot mess of technology, coupled with applications of varying degrees of business criticality and performance expectations. You basically have all the responsibility for optimizing cluster performance but almost no visibility into the applications running on it. At scale, this lack of visibility into the data pipeline or application performance makes providing a high quality of service pretty tough. The bottom line: lack of visibility into an application’s performance is the key issue that these teams find they cannot solve with cluster management solutions. It simply requires a different level of visibility within the environment.

The symptoms of a lack of application performance visibility manifest themselves in the following common operational challenges:


    1. We can’t maintain reliable, predictable performance for scheduled jobs/queries. If there is a slowdown, it takes too long to troubleshoot the issue and we cannot easily find the root cause
    2. We don’t know which teams are consuming which resources, or whether they are being consumed efficiently, especially for ad-hoc Hive queries
    3. We are not sure what service levels we can commit to for business critical applications
    4. We don’t know which jobs/queries/applications are business critical when there is a slowdown, who owns them or the downstream impacts
    5. We have policies and best practices in place but it’s difficult to enforce them across the organization

There are a number of excellent cluster management and monitoring solutions, but they all focus on performance metrics for the cluster, i.e. CPU utilization, I/O, disk performance, disk health, etc. However, when it comes to your applications, these solutions simply do not capture the metrics needed to provide application-level performance monitoring. This is why most DevOps teams are forced to wade through the resource manager, log files and thousands of lines of code to troubleshoot a problem.

To mitigate the reliability and performance risks in a shared service environment, you need to think about application performance monitoring and not just cluster performance monitoring. This means understanding performance metrics at the application level, not the job or task level. The context is actually quite different because you first need to understand what constitutes an application, i.e. what jobs, queries, etc. make up an end-to-end data flow. Then you need to understand which teams own those applications and their respective business priority. Only then can you understand which applications and teams are consuming the most resources, which applications are not running optimally or not following good design practices, and which applications should be prioritized over others. Ensuring the right visibility is both a tooling and an organizational issue, so both need to be considered as you scale up your production environment.
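To make “application level, not job level” concrete, here is a hypothetical descriptor that groups the individual jobs and queries of one end-to-end data flow under a single owner and priority; all names and fields are illustrative:

    import java.util.List;

    /** Hypothetical descriptor for an end-to-end data application; names and fields are illustrative. */
    public class DataApplication
      {
      public String name;              // e.g. "customer-preference-pipeline"
      public String owningTeam;        // who gets paged when it slows down
      public int businessPriority;     // 1 = business critical ... 5 = best effort
      public List<String> unitIds;     // the Hive queries, MapReduce jobs, Spark stages, etc. that make up the flow
      public List<String> downstream;  // applications that consume this application's output
      }

With that grouping in place, questions like “which team is consuming the most resources?” or “what breaks downstream if this run is late?” become simple lookups rather than log forensics.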

In the next post in this series, we will discuss the different organizational models we see and the pros and cons of each when it comes to scaling up and managing large production environments. We will also start to explore, step by step, how other organizations like HomeAway, Expedia, D&B and The Commonwealth Bank of Australia have improved their application performance visibility using readily available solutions.

******

About the Author: Kim is Sr. Director of Product Marketing at Concurrent, Inc., providers of Big Data technology solutions, Cascading and Driven. Cascading is an open source data processing development platform with over 500,000 downloads per month. Driven is an application performance management solution to monitor, collaborate and control all your Big Data applications in one place.

Take Driven for a free test drive and sign-up for our 30-day trial.

Hadoop is hard and if you’ve ever tried to code in MapReduce or manage a Hadoop cluster running production jobs, you know what I mean. Apache Spark has promise and simplifies some of Hadoop’s complexities but also brings its own challenges.

The good news is that regardless of technology, there are more and more successful Big Data implementations, where organizations of various sizes and industries are in production with Hadoop and Spark applications. We are in a somewhat unique position: not only do we have years of Hadoop technology experience as the team behind Cascading, but we are also exposed to how our 10,000+ customers are building, managing, monitoring and operationalizing their environments and the challenges they have faced.

Now, we get to share the journey many of them took, and are taking, to achieve operational readiness on Hadoop and meet business demands for service quality. In this blog series, we will share the 6 “a-ha moments” we see almost every organization go through on their journey to production readiness on Hadoop. With any luck, sharing this will help you avoid some common operational pitfalls.

Discovery #1: We Have a Significant Visibility Problem

Yes, there are lots of top-notch cluster management and monitoring solutions out there. With them you can look into performance metrics for the cluster like CPU utilization, disk performance, I/O, etc. As far as these solutions are concerned, there is no question about the value they provide when examining cluster performance metrics.

But what about application execution performance? Do these tools provide that kind of insight?

Framework and cluster monitoring solutions don’t address the overwhelming lack of visibility into the performance of Hadoop applications once they are placed into production. Specifically, they fail to surface or even measure the information necessary to appropriately troubleshoot application performance. Without the right visibility, teams are left to accept the reality that their applications may not perform as expected in production, and they probably won’t know until it’s too late.

Discovery #2: Organizing Correctly Ensures Production Readiness

The nature of Hadoop means it often starts its life as a series of science experiments. Once it has proven successful in either lowering costs or opening up new revenue-generating opportunities, those experiments are moved to production. Simple. Right? Well, not always. If you have a handful of applications, maybe it is this simple, but what we hear is that production problems with quality of service begin to arise quickly if roles and responsibilities are left to chance. Does the development team just manage everything? Should we set up a special operations team to manage the cluster, or should IT operations do it? Who owns the applications if the cluster is managed centrally? We heard this regardless of whether it was a handful of teams running hundreds or thousands of jobs per day or just a handful of business critical jobs.

Discovery #3: Our Success Means Our Environment Is Increasingly Chaotic and Unreliable

The proliferation of fit-for-purpose technologies means most environments will be a heterogeneous hotbed of different technologies. Almost everyone we speak to has some combination of MapReduce, Hive, Cascading, Scalding, Pig and Spark. They use Oozie, Control-M, cron, etc. for scheduling. They want to move some applications to Apache Tez, Apache Flink and, of course, Apache Spark. The list goes on. It seems every week there is a new project announced, and there is no indication this is slowing down any time soon.

Different technologies inherently mean each one has its own set of monitoring tools. In a homogeneous environment, that’s great. In a heterogeneous technology environment, it sucks. It just adds to the confusion and chaos about what is actually happening to the performance of your applications.

Some applications could be behaving differently in production than expected, scheduled jobs are processing more data than usual, or maybe someone has just submitted a massive ad-hoc query. Because you can’t see what is happening and by whom, it can be extremely difficult to maintain reliable/predictable performance, and slowdowns become more frequent.

Discovery #4: Wow! It Takes a Really Long Time to Troubleshoot a Problem

The truth is that no app will maintain perfect performance in its lifetime. Unfortunately, current methods for troubleshooting in the Hadoop environment can be a painstaking, time-intensive task.

Think about the amount of time it can take teams to essentially sift through job tracker and log files. Hours of mind-numbing investigation just to find what could be a single line of code, or many. Finding these needles in the haystack can be done, but there is no getting around how much time it can eat up.

Discovery #5: We are Not Properly Aligned to Business Priorities

Things will eventually go wrong. The important thing to keep in mind when things go wrong is comprehending the downstream impact across the organization and adapting. Right now, businesses are failing in this regard and have a poor understanding of these impacts.

Submitting an application to run means it’s then broken up into small jobs and tasks so those units of work can be distributed across the cluster for processing.

From a performance perspective, this is fantastic. From a monitoring perspective? Not so great.

Adding to the list of shortcomings associated with other tools, modern monitoring solutions just fail to provide critical insights into where your jobs or tasks went wrong. So while these monitoring tools will most certainly say which jobs failed, teams are left staring at a daunting pile of job failures. Not knowing where in the application the failure occurred, all that can be done is to once again sift through log files to try to find the culprit.

Depending on where the failure occurred, all subsequent jobs will also fail. This blindness means that teams won’t be able to identify who owns that application, what service level agreement will be missed or what data-centric business process won’t be carried out.

Teams are left wondering things like, “What happened to our data set?”

Pondering such a mystery will, over time, undermine the confidence of the team, increasing hesitation toward adoption as well as delaying production readiness.

Discovery #6: Reporting is Even Harder than Troubleshooting

As many teams do, they’ll often relegate data lineage and audit reporting to a backend reporting task to be thought about later. This causes headaches down the road because the information you need for audit reporting, data lineage tracking and regulatory compliance is spread across logs and systems. It’s up to you to piece everything back together in a way that will meet requirements.

What you’ll ideally do is set up the necessary mechanisms to capture and store the required application metadata and performance metrics on the front end. This will enable data lineage and audit reporting to be a simple two-click reporting task on the backend. The challenge, once again, is getting access to the right level of visibility to capture the right metadata, as existing tools are of little value here.

Over the next several posts, we will go into details about each of these discoveries and what leading organizations are doing to address these operational challenges.

HomeAway is the world’s leading online marketplace for the vacation rental industry.

On November 10, 2015, we hosted a technical discussion with the HomeAway Big Data architects and engineering team. At the end of this session, you’ll know how the HomeAway team implemented a shared Hadoop environment to meet revenue and growth objectives. You’ll gain insight into their lessons learned and resulting best practices for development (including code examples), management and monitoring of their Hadoop applications in a shared environment.

If you have read any of the industry reports lately, you’ll see mixed reviews about the success of Big Data initiatives. With any new technology, there is always a period where leading organizations try (and fail) to implement it. It’s not surprising that some organizations have struggled while others find success. That’s how we find out what are the best use cases for the new technology and what types of projects should be avoided. This is the path to broad adoption.

Organizations just starting their Big Data initiatives can already take advantage of many lessons learned by others. For example, it is important to set a goal and define an achievable project, with clear success criteria, right from the beginning.

As part of the definition phase, be sure to consider things beyond the technology. For example, identify the people and how their existing processes will be impacted when the project goes live. Without these steps in the beginning, you will only find frustration as you try to move from experimenting to production implementation. We’ve heard this time and again from organizations, large and small, who have successfully implemented their Big Data projects.

Starting with a realistic plan is the first step towards success but other factors are also important. In our conversations with enterprises moving their projects into production, they quickly identified three key factors that helped them ensure success:

    1. The initiative is deemed business critical by senior leadership
    • With cost-savings initiatives, you need to ensure that the first set of processes moved are critical to the business for analytics, reporting, etc., and resource-intensive. With revenue-generating initiatives, securing senior leadership sponsorship is a little easier, but ROI needs to be proven faster. Most organizations try a mix of cost-savings and revenue-generating initiatives to ensure continued senior leadership sponsorship, and set an expectation of 1-2 years, on average, to achieve ROI.
    2. The development framework(s) used leverage existing resources’ skills
    • Most of the industry reports still list a lack of availability of skilled resources as a barrier to adoption. There are still a lot of Pig, Hive and MapReduce jobs being developed directly in Hadoop, and this does require new skills. Organizations that achieved ROI quickly have largely minimized the number of applications they are developing directly in Hadoop and have moved to some sort of abstraction layer, beyond Hive and Pig.
    • You can use GUI-based tools like Informatica or SnapLogic for ETL processes, but they are limited beyond that use case. Cascading is a Java-based open source API framework (a minimal flow sketch follows this list). Scalding, also open source and contributed by Twitter, supports Scala-based development. More recently, of course, there is Apache Spark. These are examples… there are more.
    3. There is a high level of operational transparency
    • Many Big Data projects start as science experiments in one or more business teams. As you take your Hadoop infrastructure from experiment to a production environment, be sure it has the required operational transparency and is integrated into existing operational support systems. This provides business teams with confidence in the production readiness of the environment and that their data will be delivered within service levels.
    • To do this, provide operations and business teams with visibility into application performance, not just cluster performance. This will let you know who is using resources and what is consuming them. Without application-level performance visibility, it is difficult to maintain reliable service levels at scale.
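To give a feel for the abstraction-layer point above, here is a deliberately tiny Cascading flow, a pass-through copy, written against the Cascading 2.x Hadoop planner as we recall it; the HDFS paths and flow name are placeholders:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class CopyFlow
      {
      public static void main( String[] args )
        {
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, CopyFlow.class );

        // Placeholder source and sink locations on HDFS, tab-delimited text with headers
        Tap input = new Hfs( new TextDelimited( true, "\t" ), "hdfs:/data/input" );
        Tap output = new Hfs( new TextDelimited( true, "\t" ), "hdfs:/data/output", SinkMode.REPLACE );

        // A pass-through pipe; real flows chain functions, filters, joins and aggregations here
        Pipe copy = new Pipe( "copy" );

        FlowDef flowDef = FlowDef.flowDef()
            .setName( "copy-flow" )
            .addSource( copy, input )
            .addTailSink( copy, output );

        Flow flow = new HadoopFlowConnector( properties ).connect( flowDef );
        flow.complete();  // plans the flow to MapReduce and runs it
        }
      }

The point is not the copy itself but that the developer works with taps, pipes and flows rather than hand-written mappers and reducers, which is what keeps the learning curve short for existing Java teams.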

HomeAway is a great example of an organization that has found value from their Big Data investment because of these three factors. For example, one initiative gathers customer preference data from dozens of websites and uses it to refine their marketing and, in turn, increase bookings.

To hear more about HomeAway’s big data projects, join us on Nov. 10th, 2015 at 11:00 AM PT for a webinar with Rene, Austin, Michael and Francois from the HomeAway team. Learn how they successfully implemented a shared services Hadoop environment, are rapidly on-boarding new developers with a short learning curve, and have achieved operational excellence on Hadoop. Register here: http://info.cascading.io/webinar-homeaway-bigdata-increases-bookings-registration-0

Let’s face it: ensuring reliable application performance in a Hadoop shared services environment is hard. With more organizations moving in this direction, we’ve developed a clear understanding of some of the common challenges almost every organization has faced along the way, and we want to share those with you. Hopefully, armed with this information, you can anticipate issues and use some of the best practices we’ve developed to mitigate these challenges.

We’ve spoken with hundreds of organizations, and the main drivers for implementing a shared service are: easier long-term management, better resource utilization, efficiency and lower costs. Most are either running a large Hadoop cluster as a shared service already and are planning for scale, or they are planning to consolidate smaller team clusters into a single large, shared cluster within the next 6-12 months. From an organizational perspective, it varies widely where these teams are located but, in general, the applications are typically owned by development teams or data scientists while the shared cluster is managed by a central team of data engineers or IT operators. Some common themes of these shared service environments include:

  • Multiple business teams are developing applications using the technology that best suits their use case and/or skill set
  • As a result, most environments are heterogeneous, meaning multiple frameworks and technologies in play (MapReduce, Hive, Pig, Scalding, Spark, Oozie, Sqoop, Control M… the list goes on and on and on)
  • Most business-critical applications have service-level expectations
  • Lots of Hive queries out there, both scheduled and ad-hoc
  • MapReduce is still the workhorse compute fabric but others like Apache Tez, Spark, Flink, Kafka, etc. are being tested, where it makes sense, with teams eventually rewriting or porting applications to other fabrics

Ok… so now imagine you are a member of the data engineering or IT operations team tasked with managing this hot mess of technology, coupled with applications of varying degrees of business criticality and performance expectations. You basically have all the responsibility for optimizing cluster performance but almost no visibility into the applications running on it. At scale, this lack of visibility into the data pipeline or application performance makes providing a high quality of service pretty tough. The bottom line: lack of visibility into an application’s performance is the key issue that these teams find they cannot solve with cluster management solutions. It simply requires a different level of visibility within the environment.

The symptoms manifest themselves in 5 operational challenges (there are more but these are the most common):

    1. We can’t maintain reliable, predictable performance for scheduled jobs/queries. If there is a slowdown, it takes too long to troubleshoot the issue and we cannot easily find the root cause
    2. We don’t know which teams are consuming which resources, or whether they are being consumed efficiently, especially for ad-hoc Hive queries
    3. We are not sure what service levels we can commit to for business critical applications
    4. We don’t know which jobs/queries/applications are business critical when there is a slowdown, who owns them or the downstream impacts
    5. We have policies and best practices in place but it’s difficult to enforce them across the organization

There are a number of excellent cluster management and monitoring solutions, but they all focus on performance metrics for the cluster, i.e. CPU utilization, I/O, disk performance, disk health, etc. However, when it comes to your applications, these solutions simply do not capture the metrics needed to provide application-level performance monitoring. This is why most DevOps teams are forced to wade through the resource manager, log files and thousands of lines of code to troubleshoot a problem.

To mitigate the reliability and performance risks in a shared service environment, you need to think about application performance monitoring and not just cluster performance monitoring.  To make that shift, you will need to implement most, if not all, of these best practices:

    Create a collaborative environment across development, QA, and operations teams covering the lifecycle of the application. This includes technology, people and process.

    • Collaboration should start at the development phase with your development, data analysts and operations teams working together to build better data applications that are optimized and resilient
    • When things go wrong, enable notifications to quickly assemble a cross-functional team to troubleshoot the application, identify the root cause and quickly fix the issue.
    • In both cases, leverage technology to allow your team members access to the same application performance data at the same time. This should also include access to historical runs so present and past performance can be compared.

    Segment your environment by assigning applications to teams and associating other business relevant metadata such as business priority, data security level, cluster, technology, platform, etc.

    • To understand what is happening, where and by whom, the environment needs to be organized and segmented in ways that are meaningful and aligned to how your business operates.
    • Associating applications with teams, departments or business units allows your operations teams to create the necessary cross-functional support organization to troubleshoot issues.
    • Segmentation also makes reporting more robust as you can create chargebacks, audit and/or usage reports by multiple dimensions. This makes managing the business of Big Data easier and provides evidence of ROI for senior leadership.

    Provide operations teams with a single view of all the applications running in the environment

    • Bouncing between multiple monitoring tools to track down issues is not sustainable for a shared service team managing potentially dozens of technologies. A single view of the status of all the applications, regardless of cluster, technology or framework is essential to enabling these teams to support the enterprise effectively.

    Give cross-functional teams the ability to visualize an entire data pipeline vs. just a single step/job/task.

    • Some cluster monitoring tools provide some basic information about how a job/task performed, but only about that job/task and only for that moment in time. By enabling your development and operations teams to visualize the entire data pipeline/data flow, two things happen: 1) everyone can validate the application is behaving as intended; and 2) teams understand the inter-app dependencies and the technologies used in the application.
    • This is also a great way to enable the operations teams to advise data analysts or data scientists on best practices and make their applications or queries more efficient, saving everyone headaches in the long-term.

    Surface real-time and historic application performance status with the ability to drill down into performance details such as slice rate, bytes read/written, wait time, processing time, etc.

    • Visualizing application performance metrics will help your teams quickly troubleshoot if the application is not efficient, if recent changes impacted performance or if there are resource constraints on the cluster. This alone can save you hours of combing through log files and provides the data you need to tune the application right the first time.
    • Performance monitoring details should surface the application metadata tags so business context can be applied to any issue. For example, identify which team owns the application/query, its business priority class, etc. This information can be invaluable when a business critical scheduled job is being starved of resources because a large ad-hoc query is submitted during run-time.
    • In a shared service environment, more business teams ask for service levels so they can be sure their data is available when they need it. For teams to confidently commit to service levels, you need to chart historic performance trends so you can determine what service levels are achievable based on actual data, not on wishful thinking.

    Track data lineage to support compliance and audit requirements.

    • This step is often overlooked as a backend reporting problem but the reality is you need to capture this data from the start. It can be an arduous and time-consuming process to retrospectively string together logs to track sources, transformations and ultimate destination of data. Your team has better things to do with their time. To simplify the reporting process, enable the compliance teams to visualize the entire data pipeline/data flow so they can see, in one view, the sources, transformations, joins, and final output location of the entire application.

    Enable alerting and notifications for the various teams

    • When anomalies are detected, you want to notify all the relevant team members so cross-functional troubleshooting and impact analysis can start immediately.
    • Utilize existing notification systems and processes so you are not creating a separate support process for these applications

We have found that organizations that implement these best practices for application performance monitoring are up to 40% more efficient in both managing application development and production environments but it takes technology, people and process to get there. 

To learn more, download our free whitepaper: 9 Best Practices to Achieve Operational Readiness on Hadoop

******

About the Author: Kim is Sr. Director of Product Marketing at Concurrent, Inc., providers of Big Data technology solutions, Cascading and Driven. Cascading is an open source data processing development platform with over 500,000 downloads per month. Driven is an application performance management solution to monitor, collaborate and control all your Big Data applications in one place.

If you are running a lot of Hive queries (scheduled and ad-hoc) and are struggling with performance, watch our 30-min webinar and learn 5 best practices to achieve operational excellence for Hive.

Take Driven for a test drive with our free 30-day trial.

New Release Delivers Comprehensive Monitoring and Management for Hadoop and Spark Applications Across the Enterprise

SAN FRANCISCO – Oct. 20, 2015 – Concurrent, Inc., the leader in data application infrastructure, today announced the latest release of Driven, the industry’s leading application performance management solution for monitoring and managing enterprise-scale Big Data applications. Driven 2.0 represents an industry milestone by enabling application performance monitoring and management across heterogeneous Hadoop and Spark environments within a single, comprehensive solution.

As industries identify and continue to refine Big Data use cases, new development frameworks, tools, technologies and compute engines proliferate to allow project teams to implement the best solution for their specific use case. However, with more choice comes greater complexity. This is creating new requirements for DevOps, data operations, development teams and data professionals, who need visibility, measurement and control of data processing performance at scale.

Driven 2.0 ensures the highest fidelity and richest detail for application performance monitoring, troubleshooting, dependency tracking and service-level adherence of Apache Hive, MapReduce, Cascading and Scalding, and – with today’s new release – Spark applications. No other single solution on the market delivers this level of coverage to empower teams with continuous visibility and traceability from development through production.

Key features of Driven 2.0 include:

Support for Apache Spark: Enterprises can now seamlessly and transparently collect all the operational intelligence for Apache Spark applications in Driven. Currently in beta, new Spark support provides the comprehensive performance management required to deliver and maintain production Spark data processes.

Redesigned application analytics and custom views: Driven delivers new capabilities to segment operational metadata and create customized views and dashboards for more concise information delivery to the enterprise user. Additionally, Driven features new and more comprehensive visualization and navigation of applications for a highly intuitive view of all applications and transaction history. Users now have the ability to drill down in real time or to specific time periods in history and view the health of an application, a cluster, or the processes associated with a specific organization.

Deeper search capabilities: Because data processes can be unwieldy and complex, composed of hundreds of steps, pinpointing where something was executed or went wrong in an application can be time-consuming and expensive. Whether fulfilling an audit request, debugging an application, looking for a slowdown or searching for dependencies, the new search capabilities in Driven enable enterprise users to quickly find specific units of work and their progress to satisfy their needs.

A proven application performance management solution for enterprise-scale Big Data, Driven is relied upon by enterprises to deliver against today’s complex data strategies. Eight of the top 10 financial services organizations use Driven to manage their Big Data applications, and Driven monitors applications responsible for hundreds of millions in revenue. Driven delivers benefits to the enterprise including accelerated application development cycles, immediate application failure diagnosis, improved application performance, easier audit reporting and reduced cluster utilization costs.

To learn more about Driven and how customers, such as HomeAway, Inc., are using Driven to optimize performance and time-to-market, register for a webinar, titled “Learn How HomeAway Uses Big Data to Increase Bookings Revenue,” taking place at 11 a.m. PT/2 p.m. ET on Tuesday, Nov. 10.

Driven is available at http://www.driven.io/choose-trial, and Driven 2.0 will be generally available in November. For pricing and more information, email sales@concurrentinc.com.

Supporting Quotes

“Enterprise needs have not changed. They want a comprehensive solution to monitor and manage their data processes. They want technical, operational, organizational and business-level context on every process. They want to measure how these processes are performing, how they are consuming resources and whether they are delivering or not – and if they aren’t, where is the issue? Driven equips enterprises with this and more, and is playing a critical role in the success of big data initiatives in the enterprise.”
– Gary Nakamura, CEO, Concurrent, Inc.

Supporting Resources

About Concurrent, Inc.

Concurrent, Inc. is the leader in data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications with more than 500,000 user downloads a month. Used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and online at http://concurrentinc.com.

Media Contact
Danielle Salvato-Earl
Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com

 

The good news is your Big Data investment is paying off and multiple teams are now using Hadoop to run hundreds of Apache Hive queries and MapReduce jobs. The bad news is everyone is using the cluster. Existing cluster monitoring solutions lack the application performance visibility to enable operations teams to manage the chaos and enforce policies in a multi-tenant Hadoop implementation. This makes it difficult to commit to and deliver reliable performance for these now business-critical applications.

On October 29, 2015 at 11:00 AM PT, please join us for a 30 min live demo of Driven, a performance management solution for Hadoop applications. See how you can get the right level of visibility and control the chaos.

Register Now

 

Communities Join Forces to Give Users the Power and Simplicity of Cascading on Apache Flink with More Choices of Compute Platforms to Better Address Business Needs

SAN FRANCISCO – Sept. 22, 2015 – Concurrent, Inc., the leader in data application infrastructure, and data Artisans, developer of the next-generation large-scale data analysis technology, today announced a strategic partnership to bring support for Cascading on Apache Flink.

The data application technology landscape continues to rapidly evolve with new, state-of-the-art compute fabrics. Enterprises that have invested in Hadoop are seeking pragmatic solutions that provide the freedom to solve simple to complex problems, all while ensuring their investments are protected against change and technical uncertainty. As enterprise architects make decisions on which compute engine to run their big data technology stack, Apache Flink, a general-purpose engine for fast, reliable and expressive analysis of Hadoop data, is a robust option for new and existing Cascading users.

This partnership is a community-driven effort and represents yet another milestone for Cascading and Apache Flink, allowing risk-averse users to make investments in next-generation compute fabrics without rewriting their data processing applications. Tens of thousands of production Cascading users can seamlessly move from legacy compute engines to Apache Flink, and easily test and benefit from its speed, scale and resiliency. Users have unique portability across programming languages (Java, SQL, Scala), Hadoop distributions (Cloudera, Hortonworks, MapR) and now compute fabrics.

Apache Flink pioneered several unique designs in open source compute fabrics, such as memory management, program optimization and operating on binary data. At its core, Apache Flink is a data streaming engine, surfacing batch and streaming APIs, while putting just as much attention on batch analytics by treating batch as a special case of streaming. As an Apache Software Foundation project with a rapidly growing community and ecosystem of libraries and add-ons, including graph processing, machine learning and notebook functionality, Apache Flink is a viable, fast and reliable alternative to other computation engines.

Concurrent provides the proven, data application platform for reliably building and deploying data applications on Hadoop. With more than 500,000 downloads a month, Cascading is the enterprise data application platform of choice and has become the de-facto standard for building data-centric applications. Driven is the industry’s leading application performance management product for the data-centric enterprise, built to provide business and operational visibility and control to organizations that need to deliver operational excellence. Together, Cascading and Driven deliver a one-two punch to knock out the complexity and provide a proven reliable solution for enterprises to execute their big data strategies.

Supporting Quotes

“At data Artisans, we are extremely excited to contribute to the community effort of bringing Apache Flink to Cascading users, who will be able to continue building their data applications in the platform of their choice, while enjoying the speed and reliability that Flink provides. We also look forward to the further evolution of the Flink platform seamlessly benefiting Cascading users. Users will be able to work within their data applications, knowing their infrastructure will be able to evolve with them – simply and rapidly. ”

– Kostas Tzoumas, co-founder & CEO, data Artisans

“At Concurrent, one of our main goals is to provide customers with a wide range of options for compute engines, without making them feel locked in to one technology. We’re pleased by this community-driven effort to bring Cascading to Apache Flink, as it broadens the ecosystem and allows users the choice to deploy on the compute platform that best fits their needs. With the growing popularity surrounding Flink, we look forward to continuing to work with data Artisans and expanding our community efforts.”

– Chris Wensel, founder & CTO, Concurrent, Inc.

Supporting Resources

About data Artisans

data Artisans was founded by the original creators of the Apache Flink project and develops the next-generation large-scale data analysis technology under the umbrella of Apache Flink. Apache Flink is a new open source general-purpose engine for real-time, reliable and expressive analysis of Hadoop data and beyond.

About Concurrent, Inc.

Concurrent, Inc. is the leader in data application infrastructure, delivering products that help enterprises create, deploy, run and manage data applications at scale. The company’s flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications with more than 500,000 user downloads a month. Used by thousands of businesses including eBay, Etsy, The Climate Corp and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and online at http://concurrentinc.com.

Media Contact
Danielle Salvato-Earl
Kulesa Faul for Concurrent, Inc.
(650) 922-7287
concurrent@kulesafaul.com