Stripe Open Sources Tools For Apache Hadoop

December 9, 2014
Alex Giamas
http://www.infoq.com/news/2014/12/Stripe-Open-Source-Tools-Hadoop

Stripe, the internet payments infrastructure company, recently announced that it is open sourcing a set of internally developed tools built on Apache Hadoop.

Timberlake is a dashboard for Hadoop jobs. Written in Go with a React.js frontend, it improves on existing Hadoop job trackers by providing waterfall and boxplot visualizations of jobs, which make it easier to see what is slowing down a MapReduce job. Timberlake works well with Scalding and Cascading and can visualize their flows. It works only with the YARN Resource Manager API and has been tested on v2.4.x and v2.5.x.
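To illustrate why a boxplot helps with diagnosing slow jobs, here is a minimal Python sketch of the five-number summary a boxplot is drawn from, applied to per-task durations. It is not Timberlake's code: the function names `boxplot_summary` and `stragglers`, the sample durations, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration.

```python
import statistics


def boxplot_summary(durations_ms):
    """Return the five-number summary that a boxplot is drawn from."""
    if len(durations_ms) < 2:
        raise ValueError("need at least two durations")
    q1, median, q3 = statistics.quantiles(durations_ms, n=4, method="inclusive")
    return {
        "min": min(durations_ms),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(durations_ms),
    }


def stragglers(durations_ms, k=1.5):
    """Return durations above q3 + k * IQR (the usual boxplot outlier fence)."""
    summary = boxplot_summary(durations_ms)
    iqr = summary["q3"] - summary["q1"]
    fence = summary["q3"] + k * iqr
    return [d for d in durations_ms if d > fence]


# Hypothetical task durations in milliseconds: one task is far slower than the rest.
task_durations = [10, 12, 11, 13, 12, 90]
print(boxplot_summary(task_durations))
print(stragglers(task_durations))
```

A single task far above the upper fence, as in the sample data, is the kind of straggler that holds up an entire job and that a boxplot makes visible at a glance.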

Brushfire is a framework, written in Scala, for distributed supervised learning of decision tree ensemble models. Based on Google’s PLANET, it’s built on top of Hadoop and Scalding. Brushfire runs classification tree learning algorithms in a scalable way on commodity hardware, and it can build and validate random forests from large training data sets.
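The general idea behind learning a decision tree over distributed data can be sketched in a few lines of Python. This is only an illustration of the map/reduce pattern, not Brushfire's API (which is Scala on Scalding): every function name, the `amount` feature, and the sample rows are hypothetical. Each shard counts labels on either side of a candidate split, the partial counts are summed, and the split with the highest information gain wins.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a label histogram."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c)


def map_shard(rows, feature, threshold):
    """Map step: count labels left and right of the split within one shard."""
    left, right = Counter(), Counter()
    for features, label in rows:
        side = left if features[feature] <= threshold else right
        side[label] += 1
    return left, right


def reduce_histograms(partials):
    """Reduce step: sum the per-shard histograms."""
    left, right = Counter(), Counter()
    for part_left, part_right in partials:
        left.update(part_left)
        right.update(part_right)
    return left, right


def information_gain(left, right):
    """Entropy of the parent node minus the weighted entropy of its children."""
    parent = left + right
    n = sum(parent.values())
    sizes = ((sum(left.values()), left), (sum(right.values()), right))
    children = sum((m / n) * entropy(c) for m, c in sizes if m)
    return entropy(parent) - children


def best_split(shards, candidates):
    """Score every candidate (feature, threshold) pair and return the best one."""
    scored = []
    for feature, threshold in candidates:
        partials = (map_shard(rows, feature, threshold) for rows in shards)
        left, right = reduce_histograms(partials)
        scored.append((information_gain(left, right), feature, threshold))
    return max(scored)


# Hypothetical training data spread over two shards.
shards = [
    [({"amount": 5}, "ok"), ({"amount": 50}, "fraud")],
    [({"amount": 7}, "ok"), ({"amount": 70}, "fraud")],
]
print(best_split(shards, [("amount", 6), ("amount", 10)]))
```

Because only small histograms leave each shard, the training data itself never has to fit on one machine, which is what makes this style of tree learning practical on commodity hardware.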

Sequins is a dead-simple static database. It indexes and serves SequenceFiles over HTTP, so it’s perfect for serving data created with Hadoop. It’s a simple way to provide low-latency access to key/value entries generated by Hadoop jobs.
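The serving model is simple enough to sketch in Python. The following is not Sequins itself (which is written in Go and reads SequenceFiles); it is a hypothetical stand-in that serves an in-memory dictionary, and the URL layout (`/<key>`), the port, and the names `lookup`, `make_handler`, and `serve` are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote


def lookup(index, path):
    """Map a request path such as '/user42' to an (HTTP status, body) pair."""
    key = unquote(path.lstrip("/"))
    value = index.get(key)
    if value is None:
        return 404, b""
    return 200, value


def make_handler(index):
    """Build a request handler class bound to a read-only key/value index."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = lookup(index, self.path)
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(index, port=8000):
    """Serve the index on localhost until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), make_handler(index)).serve_forever()


# Hypothetical data standing in for the output of a Hadoop job.
index = {"user42": b'{"purchases": 3}'}
print(lookup(index, "/user42"))
print(lookup(index, "/missing"))
```

Since the data set is static, the server needs no write path or locking: a new batch of job output simply replaces the old index.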

Finally, Herringbone is a suite of tools for working with Parquet files on HDFS, and with Cloudera Impala and Apache Hive. Stripe uses Apache Parquet extensively for efficient columnar storage, together with Cloudera Impala, and Herringbone is essentially a set of command-line tools that make development with them more productive.

With Apache Hadoop 2.6 having just been released and several big technology companies either contributing to Hadoop development or open sourcing tools from their internal development stacks, the future looks bright for Apache Hadoop.