Validating Hadoop Jobs at InMobi using Strider


Swamynathan S on November 04, 2015

Several technology companies run their analytics pipeline jobs on Hadoop. Though it is easy to get started, managing a scalable pipeline presents challenges across multiple dimensions. In this article, we focus on one important set of those challenges -- the ones arising from the functional testing perspective. We describe the challenges in detail, along with the initial attempts made to address them and their shortcomings. We then explain Strider, a test framework designed explicitly to address these challenges, and demonstrate how it helped improve the overall efficiency of the validation effort.

Functional Testing Challenges

Some of the challenges that arise during the functional testing phase in the Hadoop world are:

a) Lack of a well-defined deployment unit or artifact.

b) Diverse input and output types, which may vary from a simple CSV file to complex Thrift structures.

c) The high cost of implementing every test case.

d) Time wasted in running regression tests, especially when new fields are added or deleted.

e) Data generation.

f) Shortfall of reusable or open source frameworks and tools.

Initial testing solutions and motivation for Strider

At InMobi, we use Apache Falcon to create and manage pipelines, and hence a Falcon process is the de facto basic unit of testing.

The first step towards a disciplined automation framework is to know the type of the deployable artifact/unit and how it is structured internally. Fortunately, the entire Hadoop pipeline in our project was logically segmented into Debian packages ("debians"), which were the unit of deployment. The debians contained the Falcon processes and feeds that were part of the deployment. The deployable unit and its structure were therefore handled well, so the key challenges to efficient testing came from items b) through d) of the functional testing challenges listed above.

Though there were tools to validate the basic sanity of deployable units, scaling with respect to data types was a concern. During the initial stages, when the number of Thrift types was small, the primary concern was to prepare functional data in a readable format that could be converted to Thrift before testing. The testable Pig scripts internally used Thrift serializer/de-serializer UDFs. We used these deserializer UDFs to convert the data to JSON, so that we could prepare readable functional data sets and serialize them back to their respective Thrift types. These steps gave us the flexibility to control the minute details of the test data for functional testing. However, to achieve this we had to go through a lengthy course (a plain-Java sketch of the JSON round trip appears after the list):

a) Creating Pig scripts that use these de-serializing UDFs.

b) Running the script against a small chunk of serialized functional data.

c) Getting the output and converting it to JSON.

d) Modifying the JSON.

e) Converting the modified JSON back to Thrift objects.

f) Serializing them back and using the result as functional data.
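
To make the round trip concrete, the conversion between serialized Thrift and editable JSON can be sketched in plain Java using Thrift's protocol factories. This is only a minimal sketch: AdEvent is a hypothetical Thrift-generated class standing in for whichever struct a feed actually uses, and in our pipeline the equivalent conversion was done through Pig UDFs rather than a standalone program.

    import org.apache.thrift.TDeserializer;
    import org.apache.thrift.TSerializer;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TJSONProtocol;

    public class ThriftJsonRoundTrip {

        // AdEvent is a hypothetical Thrift-generated class, used here only for illustration.
        public static String toEditableJson(byte[] binaryRecord) throws Exception {
            AdEvent event = new AdEvent();
            // De-serialize the binary Thrift record into an object.
            new TDeserializer(new TBinaryProtocol.Factory()).deserialize(event, binaryRecord);
            // Re-serialize it with the JSON protocol so testers can read and edit it.
            return new TSerializer(new TJSONProtocol.Factory()).toString(event);
        }

        public static byte[] toFunctionalData(String editedJson) throws Exception {
            AdEvent event = new AdEvent();
            // Read the hand-edited JSON back into a Thrift object ...
            new TDeserializer(new TJSONProtocol.Factory()).deserialize(event, editedJson.getBytes("UTF-8"));
            // ... and serialize it to binary Thrift, ready to be used as functional data.
            return new TSerializer(new TBinaryProtocol.Factory()).serialize(event);
        }
    }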

Though this served the basic purpose of testing, it was not scalable. We soon had several Thrift types as our input or output types, along with CSV, JSON, and heterogeneous data types (Thrift inside a CSV). This situation demanded a generic solution for handling these proliferating types. Also, the whole test cycle (listed below) was not seamless and hence could not be integrated with CI.

  1. Downloading the artifacts
  2. Deploying the debians
  3. Transferring the available test data to Hadoop (illustrated below)
  4. Submitting the right process to Falcon with its dependent feeds
  5. Running the process
  6. Confirming its completion, and
  7. Validating the output data against the expected output
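
As a rough illustration of step 3 alone, pushing prepared test data into a feed path on HDFS with the Hadoop FileSystem API looks roughly like the following sketch; the namenode address and feed path here are placeholders, not our actual cluster layout.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TestDataUploader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder namenode address; in practice this comes from the test host config.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);
            // Copy locally prepared functional data into the input feed's partition path.
            fs.copyFromLocalFile(new Path("target/testdata/input"),
                                 new Path("/data/feeds/input-feed/2015/11/04/00"));
            fs.close();
        }
    }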

These issues, along with the lack of an existing framework or tools to handle them, motivated the development of Strider.

Strider

Strider is a TestNG based framework that functionally tests Falcon based processes. The framework orchestrates the whole flow of setting up, running, and cleaning up a test without any manual intervention. It contains test executors, which encapsulate the logical test flow of a process. We only need to provide inputs such as the process name, its start and end times, test host information, build artifact URL, input data, and expected output; Strider takes care of the rest.
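
A minimal sketch of what such a test looks like from the tester's side is shown below, assuming a hypothetical ProcessTestExecutor entry point; the real executor class names and method signatures inside Strider may differ.

    import org.testng.Assert;
    import org.testng.annotations.Parameters;
    import org.testng.annotations.Test;

    public class DailySummaryProcessTest {

        @Test
        @Parameters({"processName", "startTime", "endTime", "buildUrl", "testHost"})
        public void validateDailySummary(String processName, String startTime, String endTime,
                                         String buildUrl, String testHost) throws Exception {
            // Hypothetical executor: deploys the build, submits the feeds and the process,
            // schedules it, waits for completion and fetches the actual output as JSON.
            ProcessTestExecutor executor = new ProcessTestExecutor(buildUrl, testHost);
            String actualJson = executor.run(processName, startTime, endTime,
                                             "testdata/daily-summary/input.json");

            String expectedJson = executor.readGoldenSet("testdata/daily-summary/expected.json");
            Assert.assertEquals(actualJson, expectedJson, "Process output did not match the golden set");
        }
    }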

End to end flow of Strider

The following picture depicts the end-to-end flow of Strider. There are two actors in the system -- the user (a tester), who performs a few configuration activities, and the Strider framework, which automates the rest of the testing, including running the processes and validating the results, based on the submitted configuration.

Strider.png

Tester’s Role

Consider a scenario where Tester U wants to test Falcon Process P.

1) The tester starts by creating functional cases, each with a test name. A case consists of two parts:

a) Inputs to the feeds - in JSON

b) Golden set (expected output) of the process - in JSON

2) The user specifies the build URL, HDFS host/test host configs, process name, and the start and end times of the process through testng.xml (a sample is shown after this list). This can be repeated for any number of tests the user wants to run.

3) The tester runs the test.
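
A trimmed-down testng.xml along these lines illustrates step 2; the parameter names, values, and class name here are illustrative rather than Strider's exact configuration keys.

    <!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
    <suite name="strider-functional-tests">
      <test name="daily-summary-process">
        <parameter name="buildUrl"    value="http://jenkins.example.com/job/pipeline/42/artifact/pipeline.deb"/>
        <parameter name="testHost"    value="test-gateway.example.com"/>
        <parameter name="processName" value="daily-summary-process"/>
        <parameter name="startTime"   value="2015-11-04T00:00Z"/>
        <parameter name="endTime"     value="2015-11-04T01:00Z"/>
        <classes>
          <class name="com.inmobi.strider.tests.DailySummaryProcessTest"/>
        </classes>
      </test>
    </suite>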

Strider’s Role

1) Downloads the build from the Jenkins URL and deploys the debian.

2) For the process under test, finds its dependent feeds and submits them.

3) Edits the process based on the custom start and end times and submits it.

4) Converts the data to the expected type and puts it in HDFS.

5) Makes a few custom modifications to the workflow or Pig scripts to reduce the runtime.

6) Schedules the process and monitors its completion.

7) Converts the output to JSON and validates it against the expected output provided (a sketch of this comparison follows).
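
The last step, comparing the converted output with the golden set, is essentially a structural JSON comparison. A minimal sketch using Jackson, which tolerates differences in field ordering and whitespace, is shown below; the method and file paths are placeholders rather than Strider's actual validation API.

    import java.io.File;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class OutputValidator {

        // Returns true when the actual process output matches the expected golden set,
        // ignoring formatting differences such as field order and whitespace.
        public static boolean matchesGoldenSet(File actualOutputJson, File goldenSetJson) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            JsonNode actual = mapper.readTree(actualOutputJson);
            JsonNode expected = mapper.readTree(goldenSetJson);
            return actual.equals(expected);
        }
    }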

Further, Strider allows parallelism across the tests of a process. It can also be extended to build pipeline tests and custom or process-specific test executors inside the framework. The data conversion component of Strider (called The-Ring) is evolving at a high pace to represent Thrift and non-Thrift data types in a mature fashion, which will allow building a generic data generator tool at the next level. Watch out for a deep dive on “The-Ring” in the next blog in this series.

Addressing the challenges through Strider

Through Strider, we were able to make every test step seamless. Its generic design for data conversion and validation makes Strider extensible to multiple input and output data types. Further, the presence of generic test executors, and the ability to extend them, means our test cases start and end with providing only the inputs and expected outputs, without much test code implementation. This considerably reduced our regression testing and test implementation costs.

Impact

Strider reduced our effort to a great extent. Some of the complex processes (such as processes with many input files of different types), which earlier took several days to test, were completed in less than a day. This is because, for a positive functional case, we do not have to implement any test code; we only provide the build URL and the input and expected output in JSON. Strider thus significantly reduced our test implementation and regression costs.

The chart below provides a performance comparison between the initial testing methodology and Strider.

Perf_Chart.png

Future Work

The future roadmap of Strider consists of:

  1. “The-Ring” Version 2, to represent data in a more mature fashion, which can eventually help with data generation.
  2. Making Strider cater to wider audiences by making it more generic in how it handles deployment artifacts.
  3. With these changes, CI can be done by letting quality engineers prepare or modify the test data and expected output in parallel, while the developer works on writing or modifying a Falcon process/Pig script. A code commit from the developer would then download the Strider code and test data in the Jenkins environment and fire the tests.