Archives

Key Takeaways

  • By adopting Cascading, AdMobius accelerated their application development for their critical enterprise data applications
  • The development team utilized Cascading workflows to handle the data processing, incorporate a custom scoring algorithm, and write out results to HDFS and MySQL
  • With Cascading’s APIs, AdMobius created custom Taps for data interoperability between their Hadoop-based application and their existing infrastructure
  • Cascading enabled AdMobius to achieve quality testing for their new code through test-driven development best practices and APIs

Background

AdMobius is the industry’s first Mobile Audience Management Platform (MAMP) and enables publishers and advertisers to discover and target relevant audiences at scale. By organizing and interpreting unique demographic and interest-based information, AdMobius unlocks the rich value of mobile data.

The MAMP platform helps advertisers identify mobile audiences by demographics and interests through standard, custom, and private customer segments. A core component of their platform is their internal “Device Graph” application. This application aggregates data and pairs that information to individual mobile devices, processing terabytes of data each day. This information allows AdMobius to construct segments that are useful for advertisers to execute targeted marketing.

Challenge

To build their Device Graph application, AdMobius needed to aggregate data from various sources, normalize that data, process it by applying a proprietary scoring algorithm, and then write out the results concurrently to HDFS and MySQL data sources.

AdMobius began building their application with Apache Hive, a mechanism for querying and managing large datasets residing in distributed storage, and Apache Oozie, a workflow scheduler system to manage Apache Hadoop jobs. As they began to build an early version of their application, they encountered adversity in their development cycles. The team at AdMobius couldn’t easily incorporate business logic into their application. Specifically, there was no aggregation or filtering functionality that they could easily use. Hive was too primitive and it was difficult to build the customized function types they needed to implement their scoring algorithms. Also, their application was not easy to test, which added to the frustration in building and maintaining their application.

AdMobius quickly realized that the combination of Hive and Oozie were not enough to build out a robust data-driven application. The development team needed a framework that could handle robust workflows, incorporate a complex proprietary algorithm, all while simplifying application development.

Solution

To solve their problem, AdMobius chose Cascading as their application’s development framework. Cascading is an application development framework for Java developers to easily develop robust data applications on Apache Hadoop. Because Cascading is a Java API, the engineering team at AdMobius was able to quickly adopt the framework with a minimal learning curve.

With Cascading, the development team was able to easily write complex operations, custom joins and aggregations with a performance improvement over other MapReduce (MR) frameworks. The data-centric concepts in Cascading make it possible to write a series of MR jobs in just a few lines of code. Also, Cascading provides the functionality to build custom sink and source Taps for data interoperability between existing data sources.

Now, AdMobius still uses Hive to aggregate and normalize its data, but now have built Cascading workflows to perform their application’s data processing and incorporate their algorithm used for their Device Graph application. After data processing, Cascading workflows then write out the results to their Hadoop Distributed File System (HDFS) and a MySQL database. AdMobius also built custom sink Taps to achieve reliable data interoperability with its existing infrastructure by using Cascading’s APIs.

Benefits

In addition to being able to realize their Device Graph application, AdMobius experienced several tangible benefits.

Fast Development Cycles and Visibility into Workflows

The Cascading framework made it easy for new developers to write reliable MR jobs. Instead of thinking about their problem in terms of MR, AdMobius developers used the Cascading metaphor and did not need to know much about the underlying MR framework. The Java engineers were able to quickly understand the Cascading framework and become familiar with its Java API. Cascading jobs can easily be visualized using the Driven application, making it useful for understanding and debugging workflows. Cascading also provided customized logging levels, which had more logging detail that those available at the job level through the Hadoop dashboard, to help identify data anomalies, errors and exceptions in the data and logic.

Easy Integration with the AdMobius Stack (i.e. HBase, Giraph, Hive)

Data can seamlessly be moved in and out from various data sources to Cascading workflows with the framework’s source and sink Taps. The development and extensibility for Taps made it easy to achieve data interoperability. As mentioned above, AdMobius built its own sink Taps to achieve reliable data interoperability with its existing infrastructure. AdMobius also built a custom Tap, CombinedFileInputFormats, to help combine smaller input files together before submitting to the mappers to reduce the amount of resources consumed.

Test-Driven Development

With Cascading, it was easy to adopt AdMobius’s test-driven development processes, which gave the team enough confidence to move their applications into production. AdMobius used Cascading’s local or in-memory mode to efficiently test code and process local files before they deployed their application on a cluster. They also were able to incorporate inline data assertions and define results at any point in their timeline, analyzing failed assertions when they occurred.