First off, there is no right answer to how to organize to ensure Hadoop production readiness – only the right answer for you (sorry). The key is to understand the implications and potential limitations, and to recognize when to rethink how you are organized. Hadoop, like any other environment, is not static. It will change, mature, and grow over time, so what works for you today may not work in six months or a year. Stay flexible in your thinking.
What we will discuss are the different ways we have seen enterprises organize their teams for Hadoop production readiness – that is, the process of moving Hadoop and/or Spark applications into production and then scaling them.
This includes some pros and cons of each approach and what to look for as indicators that it may be time to make a change. It’s also important to note that we have seen multiple models implemented in the same organization when a number of different divisions have active Big Data initiatives.
The Lone Wolf Approach
We all know at least one person in our life who seems like they can do just about anything. Typically located in a line of business IT team, the lone wolf is the ultimate wearer of many hats. With flexibility and agility on their side, this developer is tasked with doing everything from development to QA to operations.
Pros – With little to no process or red tape to contend with, the developer has the independence to move at the speed of the business team. Technology decisions are left to them, and their degree of accountability is considerably high because of that independence, i.e. if they break it, they have to fix it.
Cons – There are some extreme limitations when you’re working alone. Once you’re beyond a few applications, scaling only becomes more difficult. Because the lone wolf wears multiple hats, new development can bog down and slow as monitoring production applications takes more and more of his or her time.
Typically, there are multiple lone wolves spread around the organization, each making their own technology choices to best suit the needs of their line of business and each running their own dedicated cluster (potentially using different distributions). As multiple clusters and technologies are supported, the lone wolf approach quickly results in higher overall costs and lower resource efficiency.
Considerations to change to another model – This is probably somewhat obvious, but scalability and costs are the primary considerations.
When a line of business scales beyond a handful of Hadoop applications, or when multiple lines of business with similar needs are building data processing applications, you should consider moving away from this model and into a more team-based approach. Keep in mind that you need to balance the needs of the different lines of business to ensure an appropriate allocation of resources across teams, or you risk a business team reverting to the lone wolf approach.
The DevOps Approach
Similar to the lone wolf developer, a DevOps team is a small collection of developers, QA, and operations staff, also typically located within a line of business IT team. The difference here is that each role would ideally be spread across the team’s members, although the same people can cover multiple roles. A DevOps team could also support multiple business analytics teams within a department or business division.
Pros – In many ways, DevOps teams are like the lone wolf developer – only here they’re more like a pack of wolves hunting in sync. They have the independence to move quickly without being bogged down by the dreaded red tape. Again, DevOps teams make technology decisions based on the needs of their line of business, and there’s no denying the benefit of easily collaborating across roles to identify and troubleshoot issues.
Cons – Despite having greater numbers, a DevOps team will still have a hard time scaling beyond even a few dozen applications, and development can be slowed as team members wear multiple hats. And just like the lone developer, there are higher overall costs as multiple lines of business stand up their own dedicated clusters and acquire specialized resources to support their technology choices.
Considerations to change to another model – Increased costs and poor resource efficiency are typically the primary considerations as organizations see multiple DevOps teams pop up to support different projects. We see a number of organizations start here but with an eye toward a time when they will consolidate resources, both technology and people. Once they have proven out the value of their Hadoop production applications at the line of business level, they move to another model to contain costs and increase resource utilization and development efficiency.
Share Resource Environment with a Dedicated Cluster Management Team
Where a DevOps team has the benefit of a single team managing everything, organizing a dedicated cluster management team essentially separates Hadoop application development from Hadoop operations. In this model, one or more clusters are managed in a shared resource environment. Developers own the applications while ops own the cluster and are responsible for its performance. Together, they are responsible for troubleshooting issues. To mitigate the risk of issues, the operations team will often set design and operations best-practice policies that the development teams are supposed to follow. However, the development teams, located within line of business IT, still have the authority to select the technology that best suits the needs of their project, assuming it adheres to the policy (and sometimes even if it doesn’t).
The success of this model is dependent on a high degree of collaboration between the operations and development teams so fostering that type of environment is critical. This requires a much higher degree of application performance visibility and transparency throughout the life cycle of a data application.
Pros – Developers can focus their efforts mostly on development and subsequently work faster. Unlike DevOps or the lone wolf, scaling to tens of thousands of applications is no longer an issue. In addition, costs are lower because the cluster is shared, enabling organizations to leverage their buying power. Resource efficiency also increases, as the cluster is less likely to be underutilized with multiple teams running applications. Typically, the cluster management team leverages the operations technology and processes that IT operations teams already use, but its resources are specialized and, therefore, not normally part of the corporate IT operations team.
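On a shared cluster, the allocation of resources across teams is usually expressed as scheduler queues. As a minimal sketch, a YARN Capacity Scheduler configuration might guarantee each line of business its own share (the queue names and percentages here are hypothetical, chosen only for illustration):

```xml
<!-- capacity-scheduler.xml: hypothetical per-team queues on a shared cluster -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>marketing,finance,adhoc</value>
  </property>
  <property>
    <!-- guaranteed share for the marketing team's applications -->
    <name>yarn.scheduler.capacity.root.marketing.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- guaranteed share for the finance team's applications -->
    <name>yarn.scheduler.capacity.root.finance.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- leftover share for ad-hoc jobs -->
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>20</value>
  </property>
  <property>
    <!-- ad-hoc jobs may elastically borrow idle capacity up to this cap -->
    <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```

A queue layout like this is one way the cluster management team can enforce the balance across lines of business discussed above without per-application negotiation.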
Cons – With operations teams managing multiple technologies, there tends to be some finger-pointing between development and ops teams when things go wrong. While policies exist, getting development teams to adhere to them is a challenge unless there is a review process as data applications move from development to production.
Poorly formed applications can easily kill the cluster and researching problems can take an exorbitant amount of time when combing through the endless sea of logs to see where the applications went wrong. Planning for resources and capacity can feel impossible with a mix of technologies and a mix of scheduled and ad-hoc jobs.
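To make the log-combing problem concrete, here is a small sketch of the kind of triage ops teams end up scripting. In practice the input would come from `yarn logs -applicationId <app_id>`; here a hypothetical sample log stands in so the filtering step is self-contained:

```shell
# Stand-in for aggregated container logs fetched via `yarn logs -applicationId <app_id>`.
# These log lines are fabricated examples of common failure signatures.
cat > /tmp/app_logs.txt <<'EOF'
2016-05-10 12:01:03 INFO  mapreduce.Job: map 100% reduce 87%
2016-05-10 12:01:09 ERROR cascading.flow.FlowStep: step failed, with id: 7A3B
2016-05-10 12:01:09 WARN  yarn.YarnAllocator: Container killed by YARN for exceeding memory limits
2016-05-10 12:01:10 INFO  mapreduce.Job: Job failed with state FAILED
EOF

# Keep only the lines that usually explain a failure, dropping routine INFO noise.
grep -E 'ERROR|Container killed|OutOfMemory' /tmp/app_logs.txt
```

Even with filters like this, the scripts only tell you *that* a container died, not which upstream application or data change caused it, which is why this approach stops scaling.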
Without the right level of visibility into application performance, these environments can become extremely complex to manage, which risks a higher number of performance issues, such as slowdowns, that both the development teams and the operations team must manage.
Considerations to change to another model – Full integration into existing IT operations procedures and policies, driven by security or regulatory compliance along with support cost containment, are the primary factors for moving away from this model.
Corporate IT Operations Team Manages the Cluster
Armed with some specialized knowledge, IT operations manages the cluster and treats Hadoop like any other IT environment. With Hadoop fully integrated into IT operations’ standard operating procedures and systems, there are some additional items to consider before you implement.
Pros – This model is very similar to a dedicated cluster management team with regard to technology decisions, faster development, easier scalability, and lower costs from sharing the cluster. The difference here is that because the cluster is fully integrated, support costs are lower.
Cons – Most problems with a dedicated cluster management team also apply here, only now there is less independence to move at the speed of the business team due to operational policy adherence. Additionally, innovation can be slowed as the IT operations team must now deal with more processes and red tape.
Even if you decide to implement what works now, you always have to be in tune with what your organization will need next. If you can anticipate the evolution, you can plan better for the organizational shift and minimize disruption and anxiety for the line of business teams.
Regardless of how you organize, a common theme we hear is that the need for application execution and performance visibility increases exponentially as Hadoop matures and the number of production applications grows. This is because there are many more variables to consider to ensure you meet service levels and don’t slow down the cluster.
The reality is that visibility, and your ability to control your environment, actually diminishes as you scale. Wading through log files and the Resource Manager, when hundreds or even thousands of applications run per day, simply takes too long and doesn’t provide the information needed to actually diagnose and solve problems. This is where solutions like Driven, Application Performance Management for Big Data, come in to provide the necessary application-level monitoring and performance visibility.
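As an illustration of why teams script around the Resource Manager rather than browse it, the RM exposes application state over its REST API (`GET /ws/v1/cluster/apps`). The sketch below flags long-running applications from that data; the response shown is a trimmed, hypothetical example of the payload shape, with made-up application IDs and names:

```python
import json

# Trimmed, hypothetical example of a ResourceManager /ws/v1/cluster/apps
# response; a real payload carries many more fields per application.
sample_response = json.loads("""
{"apps": {"app": [
  {"id": "application_1462830303368_0001", "name": "nightly-etl",
   "state": "RUNNING", "elapsedTime": 7500000},
  {"id": "application_1462830303368_0002", "name": "adhoc-query",
   "state": "RUNNING", "elapsedTime": 120000}
]}}
""")

THRESHOLD_MS = 60 * 60 * 1000  # flag anything running for over an hour

def slow_apps(response, threshold_ms=THRESHOLD_MS):
    """Return (id, name) for RUNNING apps whose elapsed time exceeds the threshold."""
    apps = response.get("apps") or {}
    return [(a["id"], a["name"])
            for a in apps.get("app", [])
            if a["state"] == "RUNNING" and a["elapsedTime"] > threshold_ms]

for app_id, name in slow_apps(sample_response):
    print(f"long-running: {name} ({app_id})")
```

Scripts like this answer "what is slow right now," but they still say nothing about why an application is slow or what it depends on, which is the gap application-level monitoring aims to fill.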
Success means your environment will become increasingly chaotic to manage. How you organize to manage that chaos, while creating a collaborative environment between development, operations, and the business, becomes increasingly important to maintaining the success of your Big Data implementation.
In the next post, we’ll discuss why this happens and start to detail what leading enterprises are doing to mitigate these risks.
About the Author: Kim is Sr. Director of Product Marketing at Concurrent, Inc., providers of Big Data technology solutions, Cascading and Driven. Cascading is an open source data processing development platform with over 500,000 downloads per month. Driven is an application performance management solution to monitor, collaborate and control all your Big Data applications in a single solution.
Take Driven for a free test drive and sign up for our 30-day trial.