Etsy

Key Takeaways

Etsy runs over 50 Cascading applications daily to study customer behavior and product sales. Programming in JRuby, Etsy can quickly create and test new applications for its e-commerce site that help it acquire new customers and sell more products.

Solution

Etsy chose Cascading to abstract standard data processing operations away from the underlying map/reduce tasks. Cascading combines the scalability of Hadoop with an easy way to perform deep dives on data. Etsy also extended the Cascading API to create a domain-specific language (DSL), cascading.jruby [1]. DSLs generally provide a simpler code structure and a cleaner syntax for common, problem-specific data analysis tasks, and JRuby gave the team a base language in which it felt more productive than in Java.

Benefits

With Cascading and Hadoop running on short-lived clusters on Amazon EMR, Etsy can quickly build and launch data-driven products such as its gift recommender, suggested shops recommender, and “taste test”. Each night, over 50 Cascading jobs extract data from web logs and database snapshots, then aggregate metrics used to monitor and understand the behavior of the site’s visitors. These jobs also aggregate the results of all the A/B tests running on the site, helping teams make product decisions. Finally, engineers can answer one-off questions and explore data easily to gain insights that help improve the site and its community.
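To make the A/B test aggregation concrete, here is a minimal sketch in plain Ruby of the kind of group-and-count step such a nightly job performs. It does not use Cascading or cascading.jruby, and the record fields (`variant`, `purchased`), the sample data, and the `summarize_ab_test` helper are all assumptions made for illustration, not Etsy's actual schema or code.

```ruby
# Hypothetical visit records, as they might look after being extracted
# from web logs. Field names are assumptions made for this sketch.
VISITS = [
  { variant: "control",   purchased: false },
  { variant: "control",   purchased: true  },
  { variant: "treatment", purchased: true  },
  { variant: "treatment", purchased: true  },
  { variant: "treatment", purchased: false },
  { variant: "control",   purchased: false },
].freeze

# Group visits by test variant (comparable to a group-by step in a data
# pipeline), then count visits and purchases within each group.
def summarize_ab_test(visits)
  visits.group_by { |v| v[:variant] }.map do |variant, rows|
    purchases = rows.count { |r| r[:purchased] }
    [variant, { visits: rows.size,
                purchases: purchases,
                conversion_rate: purchases.to_f / rows.size }]
  end.to_h
end

# Print one summary line per variant.
summarize_ab_test(VISITS).each do |variant, stats|
  puts format("%-10s visits=%d purchases=%d rate=%.2f",
              variant, stats[:visits], stats[:purchases], stats[:conversion_rate])
end
```

On a Hadoop cluster the same grouping and counting would be spread across many machines; the sketch only shows the shape of the computation on a small in-memory sample.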

In the future, Etsy plans to rapidly increase the number of data-driven apps to help improve conversion rates, enhance the community areas of the site, and accelerate the company’s growth. Cascading and Hadoop on Amazon EMR will allow the engineering team to build and scale these applications much more easily than it could with raw map/reduce on an in-house cluster.

1 More on the original author’s version of cascading.jruby here: https://github.com/gmarabout/cascading.jruby