As applications evolve, they tend to become more complex. Understanding in real time how each part of the system is behaving becomes increasingly important. It not only makes the application easier to debug but also makes it much more predictable. At InMobi, our adserver runs on a large cluster of machines spread across multiple colos. The adserver updates its data about various entities, such as ads and publisher sites, based on events happening in the customer portal, which reach it through a data ingestion pipeline. It is important to understand how each instance of the adserver behaves with respect to these events and whether it is correctly updating the entities present in the system.
Each adserver instance continuously loads and unloads these entities into and out of memory based on their expiry, state changes, server restarts and so on. We needed to audit these changes and to track when the updates reach the adserver. This would help us find problems such as missed updates and measure the overall latency of the data ingestion pipeline.
We considered building our own solution for this and also looked at open source projects we could use. Specifically, we were looking for a solution that:
- Can be customized to support any vocabulary of events
- Offers a good programmatic handle for analysing the data
- Is easily extensible to other similar use cases
- Scales up and down based on need
- Has good community support and industry adoption
After considering many options, including message queues, which tend to add maintenance overhead to the application, we decided to build this solution on Logstash as a central logging system. Logstash supports a variety of filters, codecs and outputs that can be used to process almost any kind of data. Every server now writes a log entry whenever it processes an update to an entity, with the context of the update published in the MDC (Mapped Diagnostic Context). We created a separate logger for this use case, so that our routine application logs are not shipped along with the audit logs, and attached a GELF appender to it that sends the logs to the Logstash server. The Logstash server is configured to feed this data into an Elasticsearch cluster, where all we needed to do was upload our custom template. The data in Elasticsearch can then be analysed through its APIs or visualized with tools like Kibana.
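The logging side of this setup can be sketched with the JDK alone. This is a minimal illustration and not our production code: the `CONTEXT` thread-local stands in for a logging framework's MDC, `CapturingHandler` stands in for the GELF appender (it keeps lines in memory instead of sending them to Logstash), and the logger name `entity.audit` and the field names `entityType`, `entityId` and `action` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class EntityAudit {
    // Stand-in for an MDC: a per-thread map of context key/value pairs.
    static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    // Stand-in for a GELF appender: records "message + context" in memory.
    // A real appender would serialize these fields and send them to Logstash.
    static final class CapturingHandler extends Handler {
        final List<String> shipped = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            shipped.add(record.getMessage() + " " + CONTEXT.get());
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    static final CapturingHandler HANDLER = new CapturingHandler();

    // Dedicated audit logger (hypothetical name), kept apart from routine logs.
    static final Logger AUDIT = Logger.getLogger("entity.audit");

    static {
        // Do not forward audit records to the root logger's handlers, and
        // attach only the audit handler, so the two log streams never mix.
        AUDIT.setUseParentHandlers(false);
        AUDIT.addHandler(HANDLER);
        AUDIT.setLevel(Level.INFO);
    }

    // Publish the update's context, log one audit line, then clear the context.
    static void recordUpdate(String entityType, String entityId, String action) {
        Map<String, String> ctx = CONTEXT.get();
        ctx.put("entityType", entityType);
        ctx.put("entityId", entityId);
        ctx.put("action", action);
        try {
            AUDIT.info("entity update processed");
        } finally {
            ctx.clear(); // avoid leaking context into the next update on this thread
        }
    }

    public static void main(String[] args) {
        recordUpdate("ad", "ad-123", "LOADED");
        System.out.println(HANDLER.shipped);
    }
}
```

Clearing the context in a `finally` block matters on pooled threads, where a stale entity id would otherwise be attached to an unrelated update.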
With this in place, we are able to:
- Calculate the end-to-end latency of entity ingestion for our adserver
- Figure out where an entity update is stuck in the ingestion pipeline
- Audit the historical lifecycle of an entity in the adserver's memory
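The first two items in the list above reduce to simple arithmetic over the audit events once they are collected in one place. The sketch below assumes each event carries an entity id, a pipeline stage name and a timestamp, and that an entity appears at most once per stage in the batch being analysed. The `Event` class and the stage names `PORTAL_PUBLISHED` and `ADSERVER_LOADED` are hypothetical and only illustrate the calculation; in practice the same questions could be asked of Elasticsearch directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class IngestionLatency {
    // One audit event: which entity, which pipeline stage saw it, and when.
    static final class Event {
        final String entityId;
        final String stage;
        final long timestampMillis;

        Event(String entityId, String stage, long timestampMillis) {
            this.entityId = entityId;
            this.stage = stage;
            this.timestampMillis = timestampMillis;
        }
    }

    // Latency per entity = timestamp at lastStage minus timestamp at firstStage.
    // Entities that never reached lastStage are left out of the result.
    static Map<String, Long> endToEndLatencyMillis(
            List<Event> events, String firstStage, String lastStage) {
        Map<String, Long> started = new HashMap<>();
        Map<String, Long> finished = new HashMap<>();
        for (Event e : events) {
            if (e.stage.equals(firstStage)) {
                started.put(e.entityId, e.timestampMillis);
            } else if (e.stage.equals(lastStage)) {
                finished.put(e.entityId, e.timestampMillis);
            }
        }
        Map<String, Long> latency = new TreeMap<>();
        for (Map.Entry<String, Long> s : started.entrySet()) {
            Long end = finished.get(s.getKey());
            if (end != null) {
                latency.put(s.getKey(), end - s.getValue());
            }
        }
        return latency;
    }

    // Entities seen at firstStage but never at lastStage: stuck or missed updates.
    static List<String> stuckEntities(
            List<Event> events, String firstStage, String lastStage) {
        Set<String> reachedEnd = new HashSet<>();
        for (Event e : events) {
            if (e.stage.equals(lastStage)) {
                reachedEnd.add(e.entityId);
            }
        }
        Set<String> stuck = new TreeSet<>();
        for (Event e : events) {
            if (e.stage.equals(firstStage) && !reachedEnd.contains(e.entityId)) {
                stuck.add(e.entityId);
            }
        }
        return new ArrayList<>(stuck);
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("ad-1", "PORTAL_PUBLISHED", 1000L),
                new Event("ad-1", "ADSERVER_LOADED", 2500L),
                new Event("ad-2", "PORTAL_PUBLISHED", 1200L));
        System.out.println("latency ms: "
                + endToEndLatencyMillis(events, "PORTAL_PUBLISHED", "ADSERVER_LOADED"));
        System.out.println("stuck: "
                + stuckEntities(events, "PORTAL_PUBLISHED", "ADSERVER_LOADED"));
    }
}
```

With the sample events in `main`, `ad-1` shows a latency of 1500 ms and `ad-2` is reported as stuck because it was published but never loaded.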
The beauty of this solution is that we were able to build such a strong tool using only free and open source software. The code changes required to get the whole setup up and running were minimal. Best of all, the solution is generic enough to extend to many other scenarios where we may need insight into the internals of an application.