Testing Big Data Pipelines Made Easy With Apache Falcon

Pavan Kolamuri on November 12, 2015

At InMobi, we see more than 10 billion events arriving per day. Analysis, reporting, and inference over these requests (and the responses served) are key to serving the right ad to the right person at the right time. We run nearly 200 complex big data pipelines against various data sources. Managing so many pipelines and their associated data was becoming a challenge, so we created an orchestration and data lifecycle management framework called Falcon. After benefiting immensely from it, we open sourced the product, and it is now a top-level Apache project.

Pipelines are developed and deployed by multiple teams across InMobi. Testing them before deploying to production was cumbersome: components such as Apache Hadoop, Apache Oozie (the scheduler), and Apache Falcon had to be deployed on QA/Dev boxes. Some of the challenges we faced while testing these pipelines were:

  • User errors, such as incorrectly resolved input/output paths and misconfigured properties, were identified only at deployment time. This delayed deployments, because any issue meant going through multiple QA cycles.
  • Testing required setting up and tearing down multiple components, and we used deploy scripts to set up our environments. Buggy scripts caused various issues.
  • Debugging was cumbersome because there were multiple components, and logs were scattered across different machines and locations.
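
The first kind of error in the list, a path that does not resolve, is the sort of thing a plain unit check can catch long before deployment. The following is a minimal sketch, assuming paths are written as templates with `${NAME}` variables; the class name, the `resolve` helper, and the `/data/...` paths are hypothetical and are not part of Falcon.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is not Falcon's resolver.
class PathTemplateCheck {
    // Matches variables written as ${NAME}.
    private static final Pattern VAR = Pattern.compile("\\$\\{([A-Za-z_]+)\\}");

    // Replaces every ${NAME} with its value and fails fast on unknown names.
    static String resolve(String template, Map<String, String> values) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException(
                        "Unresolved variable ${" + m.group(1) + "} in " + template);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> instance = new LinkedHashMap<>();
        instance.put("YEAR", "2015");
        instance.put("MONTH", "11");
        instance.put("DAY", "12");

        // All variables are known, so the path resolves.
        System.out.println(resolve("/data/in/${YEAR}/${MONTH}/${DAY}", instance));

        // HOUR is not defined, so the mistake surfaces here instead of at deployment.
        try {
            resolve("/data/out/${YEAR}/${HOUR}", instance);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A check like this runs in milliseconds on a developer machine, which is the property the rest of this post is after: moving failures from deployment time to development time.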

These issues made testing Falcon pipelines difficult, which led us to develop Falcon Unit, a tool that significantly simplifies the testing of big data pipelines. With Falcon Unit, pipelines can be tested locally, which helps find issues during development and saves QA cycles. Falcon Unit has the following capabilities:

  • Identifies user errors and pipeline errors before deployment and helps users fix them in less time.
  • Provides setup, teardown, and other utility methods, so that test code is minimal and readable.
  • Supports both local mode and cluster mode. You can run the entire pipeline using local Oozie, LocalJobRunner, and the local file system, or you can point it to a cluster definition and it will use an external Oozie and Hadoop.
  • The Falcon Unit APIs are designed so that the same APIs can be used in both local mode and cluster mode.
  • Tests run in local mode finish faster, which reduces build and test time.
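
To illustrate the "same API, two modes" idea from the list above, here is a minimal sketch of how one test body can run against either backend. Every name in it (`PipelineClient`, `LocalClient`, `ClusterClient`, `PipelineClients`, the XML file names) is hypothetical and stubbed for illustration; it is not the Falcon Unit API, and the stubs do not talk to Oozie or Hadoop.

```java
// Hypothetical interface: the only surface that test code is written against.
interface PipelineClient {
    String submit(String entityType, String definitionPath);
    String mode();
}

// Stub for local mode; a real one would drive in-process components.
final class LocalClient implements PipelineClient {
    public String submit(String entityType, String definitionPath) {
        return definitionPath.isEmpty() ? "FAILED" : "SUCCEEDED";
    }
    public String mode() {
        return "local";
    }
}

// Stub for cluster mode; a real one would call the external services.
final class ClusterClient implements PipelineClient {
    private final String clusterDefinition;

    ClusterClient(String clusterDefinition) {
        this.clusterDefinition = clusterDefinition;
    }
    public String submit(String entityType, String definitionPath) {
        return definitionPath.isEmpty() ? "FAILED" : "SUCCEEDED";
    }
    public String mode() {
        return "cluster:" + clusterDefinition;
    }
}

final class PipelineClients {
    // No cluster definition means local mode; otherwise use the given cluster.
    static PipelineClient forClusterDefinition(String path) {
        return (path == null || path.isEmpty())
                ? new LocalClient()
                : new ClusterClient(path);
    }
}

class ModeSelectionSketch {
    // The test body is written once and does not know which mode it runs in.
    static boolean submitsFeedsAndProcess(PipelineClient client) {
        return "SUCCEEDED".equals(client.submit("feed", "input-feed.xml"))
                && "SUCCEEDED".equals(client.submit("feed", "output-feed.xml"))
                && "SUCCEEDED".equals(client.submit("process", "process.xml"));
    }

    public static void main(String[] args) {
        PipelineClient local = PipelineClients.forClusterDefinition(null);
        PipelineClient cluster = PipelineClients.forClusterDefinition("cluster.xml");
        System.out.println(local.mode() + " -> " + submitsFeedsAndProcess(local));
        System.out.println(cluster.mode() + " -> " + submitsFeedsAndProcess(cluster));
    }
}
```

Keeping the mode decision in a single factory is one way to let developers run a test locally for fast feedback and then rerun it unchanged against a cluster.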

The following code snippet tests a process. Placeholders in angle brackets stand for values you supply.

// submit with default properties
submitCluster();
// submitting feeds
APIResult result = submit(EntityType.Feed, <Input Feed Path>);
assertStatus(result);
result = submit(EntityType.Feed, <Output Feed Path>);
assertStatus(result);
// submitting process
result = submit(EntityType.Process, <Path to Process Definition>);
assertStatus(result);
// scheduling process with start time and number of instances
schedule(EntityType.Process, <processName>, <clusterName>, <startTime>, <noOfInstances>);
// checking the status of instances of given process
status = getInstanceStatus(EntityType.Process, <processName>, <scheduleTime>);
assertStatus(status);

As the example shows, with its simple API and the ability to run the same tests locally or on a real cluster, Falcon Unit simplifies the testing of big data pipelines. As a next step, we want to improve data injection and support sampling of production data for testing.