Cascading is the core component of Trulia’s data processing pipeline for data cleansing, entity resolution, and metadata extraction. This helps Trulia provide higher quality content with the newest listings and the best deals in their customers’ markets.
Trulia selected Apache Hadoop and Cascading. Cascading is used to manage execution of data processing that cleanses incoming data, identifies relationships between items and creates meta-data. Because the data processing logic is very complex, Cascading plays a critical role in Trulia’s main data processing pipeline, providing an easy and higher-level abstraction for running multiple MapReduce jobs.
With Cascading, Trulia is able to deliver higher-quality content to be shown on the site and make development cycles much more efficient. The ability to manage complex data processing logic allows them to connect users with people who live nearby, slice and dice local data into useful bits, and show the newest listings and best deals in the user’s market. Cascading also allows Trulia’s engineering team to move much faster developing data processing applications than they could with direct MapReduce development.
“With Cascading, we’re able to provide the high-quality local real estate information our users demand,” noted Daniele Farnedi, Vice President, Engineering, Trulia, Inc. “Cascading’s easy-to-use framework allows engineering to deliver new features much more quickly and manage the complex data processes required to support those features very efficiently.”