[This is the first post in a blog series on our production infrastructure for metrics collection and monitoring. In this post, we discuss why metrics are critical to our engineering function, and the need for a robust infrastructure supporting metrics collection and monitoring.]
Terminology & Taxonomy
Let’s start with a clarification: many technical discussions use the term “metric” to describe properties of a system when they actually mean a measure of some aspect of that system. A measure might be the response latency of a web service, the load on a CPU, and so on. A metric, on the other hand, expresses a derived quality of the system, based on one or more measures, against an expectation. For instance, cost per request served is a metric one might track with the expectation that it must be no more than ‘x’ cents. Despite this distinction, the term “metric” is commonly understood in technical conversation to refer to measures, so we’ll continue to use the term broadly for the rest of the article.
This article focuses on “tech” metrics: those pertaining to software services and the underlying technology infrastructure. It’s important to distinguish pure tech metrics from business-aware metrics such as “transaction dollar volumes by time of day”. Business metrics tend to have different cardinality characteristics and different requirements for data consistency and query patterns, so combining them with tech metrics often results in unnecessary complexity. Of specific interest to us in this discussion are the tech metrics that we categorize as follows:
- System metrics: These include a computing system’s internal metrics and encompass the hardware and the operating system. Examples include CPU load, memory utilization, paging activity, disk IOPS, disk usage etc.
- Application internal metrics: Many components of an application do not manifest as features or functionality understood by consumers (e.g. connections to the database). They also don’t appear directly in service level measurements, unlike, say, web serving latency. However, they are critical to the overall functionality of the application and typically reflect the “white box” state of the application. Metrics related to web containers and middleware (e.g. Tomcat and web server internal metrics, or JVM metrics such as garbage collection rates and the occupancy rate of the erstwhile PermGen space) are also considered application-level, but “internal”, metrics.
- Externally visible metrics: Metrics that are of direct interest to an external consumer (including programmatic consumers like API clients), and that often reflect end user experience, fall under this category. Examples include the response latencies and response codes of a web server.
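To make the earlier distinction between a measure and a metric concrete, here is a minimal Python sketch of the cost-per-request example. The field names, the sample numbers and the 0.03 cent budget are illustrative assumptions, not values from our systems.

```python
from dataclasses import dataclass


@dataclass
class Measures:
    """Raw measures for one observation window (hypothetical field names)."""
    requests_served: int
    infrastructure_cost_cents: float


def cost_per_request_cents(m: Measures) -> float:
    """Derive a single value from two underlying measures."""
    if m.requests_served == 0:
        raise ValueError("no requests served in this window")
    return m.infrastructure_cost_cents / m.requests_served


def meets_expectation(value: float, max_allowed: float) -> bool:
    """In the strict sense, a metric is a derived value judged against an expectation."""
    return value <= max_allowed


# Illustrative numbers only.
window = Measures(requests_served=2_000_000, infrastructure_cost_cents=50_000.0)
cpr = cost_per_request_cents(window)
print(f"cost per request: {cpr:.4f} cents; within budget: {meets_expectation(cpr, 0.03)}")
```

The two raw fields are measures; only the derived value compared against its budget is a metric in the strict sense.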
Having looked at this categorization, let’s now see how metrics play a key role in the way we engineer systems.
The Role of Metrics in Engineering
Like several other consumer Internet companies, we at InMobi believe in responding fast to changes in consumer behavior, partner requirements and market expansion needs. This requires our engineering teams to be able to launch products and features rapidly while safeguarding the experience of our end users and partners. Hence, while executing such changes at a fast pace, it is critical for our engineers to be able to verify the impact of each change on the underlying system. This verification is carried out in addition to A/B testing the product features. A key to enabling such visibility of a running system is to first measure every important aspect of the underlying system. Measurement of key tech metrics is therefore imperative for an effective product engineering organization. Metrics have hence been playing an integral role in every stage of our software development lifecycle as follows:
- Feature design and implementation: Each design specification and implementation includes the list of new metrics defined, corresponding measures to be gathered and the impact (either change or discontinuation) on existing metrics.
- Quality Engineering: While conventional test suites typically focus on verifying functionality, our tests also confirm that the observed metrics reflect the functionality accurately.
- Launch verification: This covers both limited-launch (canary) environments and our wider production fleet after launch. Canary validation covers the new metrics as well as the expected impact (or absence of impact) on existing metrics over the observation window. Any deviation is considered a pre-launch validation failure.
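A canary check of the kind described in the launch verification step could be sketched roughly as follows. This is a simplified illustration, not our actual tooling: it assumes that no change is expected on existing metrics, that each metric is summarized as a single number over the observation window, and that a 5% relative tolerance is acceptable. The function and metric names are hypothetical.

```python
def validate_canary(baseline, canary, new_metrics, tolerance=0.05):
    """Return a list of failures; an empty list means the canary passed.

    baseline / canary: dicts of metric name -> value over the observation window.
    new_metrics: names the change under test is expected to start emitting.
    tolerance: allowed relative deviation on existing metrics (assumed value).
    """
    failures = []

    # Every newly defined metric must actually show up on the canary.
    for name in new_metrics:
        if name not in canary:
            failures.append(f"new metric missing: {name}")

    # Existing metrics must still be emitted and stay within tolerance.
    for name, base_value in baseline.items():
        if name not in canary:
            failures.append(f"existing metric disappeared: {name}")
            continue
        if base_value == 0:
            deviated = canary[name] != 0
        else:
            deviated = abs(canary[name] - base_value) / abs(base_value) > tolerance
        if deviated:
            failures.append(
                f"unexpected change in {name}: {base_value} -> {canary[name]}"
            )
    return failures


baseline = {"http.latency_ms.p99": 120.0, "http.error_rate": 0.01}
canary = {"http.latency_ms.p99": 123.0, "http.error_rate": 0.02, "cache.hit_ratio": 0.9}
for failure in validate_canary(baseline, canary, new_metrics=["cache.hit_ratio"]):
    print(failure)
```

When a change is expected to move an existing metric, the same structure would apply with a per-metric expected range in place of the single tolerance.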
Production and Infrastructure Engineering
Metrics are also central to how we conduct regular reviews of our production systems, wherein engineering teams jointly review the overall health, SLO compliance, errors, trends and resource utilization across production services. These reviews have proved to be a great mechanism to unearth emergent behavior, validate quality of service experienced by consumers of a service, and identify indirect causal effects across services and interactions with the underlying infrastructure. Metrics form the basis for conducting these discussions in an objective and effective manner.
Periodically, the infrastructure engineering function needs to assess the utilization of resources (including hardware and shared software infrastructure) across our production fleet. We also analyse the resource efficiency of production services over time and identify emerging utilization trends. Similarly, we need to effectively estimate capacity growth across our production services, as precisely as possible. Performing these functions requires that every aspect of our infrastructure is captured and measured.
Our core monitoring focus has been to speed up the “OODA” loop in production systems: to quickly detect, assess, diagnose and remedy errors. Our engineering teams must be able to detect service degradation or impending outages and respond before our users are affected. Similarly, the infrastructure teams need to quickly detect and isolate or remedy errors in our production infrastructure (hosts, network connectivity, system services like DNS) that can have a widespread impact on multiple application services. An effective monitoring and alerting service is necessary to achieve all of this.
As part of our move to metrics-based engineering, a paradigm shift for us has been to re-orient our monitoring around a metrics-first view of system health: a healthy service and its underlying infrastructure will always emit critical metrics (internal and externally observable) at expected intervals. We have since moved away from merely monitoring the liveness and progress of systems through legacy methods like ping checks against web endpoints or process ID checks, and have instead focused on monitoring each of the above categories of tech metrics. Metrics-centric monitoring helps ensure that our visualization and alerting systems are aligned. A metrics-based monitoring system also helps us detect anomalies, quickly highlight other impacted metrics, correlate errors with their causes, and validate whether a remedy actually addressed an issue completely. Stay tuned for more on these capabilities in a subsequent post.
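The metrics-first view of health, in which a healthy service always emits its critical metrics at expected intervals, can be illustrated with a small staleness check. This is a sketch under stated assumptions rather than a description of our monitoring service: the (source, metric) keys, the intervals and the grace factor of two missed intervals are all hypothetical.

```python
import time


def find_silent_sources(last_seen, expected_interval_s, now=None, grace_factor=2.0):
    """Return the (source, metric) keys whose critical metrics are overdue.

    last_seen: dict of (source, metric) -> Unix timestamp of the latest data point.
    expected_interval_s: dict of (source, metric) -> expected emission interval in seconds.
    grace_factor: how many intervals may pass before the source counts as silent (assumed).
    """
    now = time.time() if now is None else now
    silent = []
    for key, interval in expected_interval_s.items():
        seen = last_seen.get(key)
        # A metric that was never seen, or is older than the grace window, is unhealthy.
        if seen is None or now - seen > grace_factor * interval:
            silent.append(key)
    return sorted(silent)


now = 1_000_000.0
expected = {
    ("web-01", "http.requests"): 60,
    ("web-02", "http.requests"): 60,
    ("db-01", "connections.active"): 30,
}
last_seen = {
    ("web-01", "http.requests"): now - 45,   # on time
    ("web-02", "http.requests"): now - 300,  # overdue
    # db-01 has never reported
}
print(find_silent_sources(last_seen, expected, now=now))
```

Unlike a ping or process ID check, this flags a source that is still running but has stopped emitting its critical metrics.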
Streamlining the Infrastructure
When we started our journey to transform the way we look at metrics and production systems, our metrics infrastructure resembled that of most startups at the time: each engineering group captured and collected metrics using different methods, at different frequencies, while applying different aggregations. The tools used to visualize and monitor production services, and to respond to production errors, were also unique to some of these teams. While this initial proliferation gave the teams the necessary autonomy and speed early on, it quickly became an operational overhead due to redundancy and inconsistency in deployments. It was also difficult to discover or understand metrics across teams or services for troubleshooting and analyses.
It was evident that, while instituting a metrics-based production engineering methodology, it would be critical for us to build the corresponding infrastructure services, for several reasons. A stable, centrally owned metrics and monitoring service would eliminate redundant deployments in each product engineering group and help engineers focus on building customer-facing products. Uniform infrastructure and tooling for metrics and monitoring would also be a great mechanism to achieve consistent instrumentation, and to ensure that proven methods are followed to instrument, monitor and sustain production services. For instance, unified metrics tooling would ensure a consistent metrics naming convention across product services. A single reliable source of truth for metrics would also simplify query and analysis across multiple services and multiple types of metrics, to understand correlation and other effects.
As mentioned earlier, as part of our regular production reviews, we often unearth cross-service effects and effects between the infrastructure and application services. Such wide-span views of tech metrics have proven to be particularly useful due to the dependencies between microservices and the feedback loops that are typical of consumer-oriented systems. A unified system of record for metrics is necessary to perform these functions. Considering these needs, we set out to design and build a stable, multi-tenant system that would meet our end goals.
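As one illustration of how unified tooling can enforce the consistent naming convention mentioned above, the sketch below validates metric names at registration time. The convention itself (lowercase, dot-separated `team.service.metric` segments) is a hypothetical example chosen for this sketch, not the convention we adopted.

```python
import re

# Hypothetical convention: at least three dot-separated segments, each starting
# with a lowercase letter and containing only lowercase letters, digits or underscores.
METRIC_NAME = re.compile(r"[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2,}")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole name matches the assumed convention."""
    return METRIC_NAME.fullmatch(name) is not None


for candidate in ["ads.serving.request_latency_ms", "AdsServing-Latency", "latency"]:
    print(candidate, is_valid_metric_name(candidate))
```

Rejecting non-conforming names in shared client tooling, rather than in each team's own code, keeps names discoverable across services.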
In the next blog in the series, we describe the requirements of such a system, the corresponding architecture and the choice of component subsystems. We then delve deeper into the implementation of one of the subsystems - the metrics service, and discuss the learnings and challenges, our vision for the end state of the overall system and our continued efforts in this area. Stay tuned for more!
- Bharath Ravi Kumar and Sudhakar B.G, Production Infrastructure Engineering