A Converged System for Metrics and Monitoring

Sudhakar BG and Bharath Ravi Kumar on April 27, 2016

Overview

As discussed in part 1 of the blog series, metrics are central to InMobi’s engineering methodology. Based on our requirements, it was clear that we needed a single source of reliable data for production metrics, to achieve convergence of monitoring, visualization, and analysis of production service metrics. The logical view of a design that achieves this convergence is as follows:

[Figure: Converged Metrics and Monitoring (logical view)]

In the above model, hosts and application services are instrumented to emit metrics to the metrics service, which is in turn queried by the visualization and monitoring services. Data from the metrics service is periodically propagated to a unified analytics system to enable ad-hoc analysis of compatible business metrics and service metrics over the same time range. The monitoring service dispatches alerts to an alert processing system, which further dispatches them to the alert gateway.

A key aspect of this architecture is the logical (and, as we shall see later, physical) segregation of the metrics collection and query, visualization, and monitoring systems. While these capabilities were bundled together in a few commercial products we evaluated in late 2014, which are now part of F/OSS projects, we believe it is important to segregate these functions into logical subsystems considering scale, fault tolerance, code maintainability and system operability. We also believe in the principle that a sophisticated, sustainable system can be achieved by combining simpler building blocks that can be reasoned about individually and as a composite.

Since the metrics service is central to this architecture, we will focus on it for the rest of this blog series, with a brief mention of implementation choices for the other subsystems.

Requirements of the Metrics Service

Considering that we needed a central service that enables the above capabilities, it was obvious that the solution had to be highly available, scalable, and well integrated with the rest of our production toolchain. The detailed requirements of such a service are outlined below:

Functional Requirements

At a high level, the service should support ingestion and query of time series data. On the ingestion front, a dominating factor is the choice of the wire format used to propagate metrics to the point of ingestion. The wire format must be compact when serialized, sufficiently expressive to represent the measurement, and have widespread ecosystem support for out-of-the-box metrics reporting. Sufficiency of expression implies that the format can describe what aspect of a specific system/process is being measured, what the value is, and the time of measurement. While arbitrary annotation of the measurement (e.g. by attaching labels) is useful, it wasn't a mandatory capability. Considering these requirements, we evaluated the metrics formats used by Riemann, Netflix's Turbine and Graphite, and chose the Graphite format based on the tradeoffs between expressiveness, compactness and ecosystem integration amongst these options. Having chosen the format, we next had to lock down the metric naming convention, requiring metrics to be prefixed by the deployment environment (e.g. production/staging/dev), the region of deployment, the service name, the category of the metric (sys/app), the host/container, and the actual component being measured. Such a format would allow us to efficiently segregate (and shard when required) services and service groups.

Continuing with ingestion requirements, the metrics service must also support simultaneous ingestion of metrics measured over arbitrary time intervals. This is important since metrics from various sources can arrive out of order due to factors like network outages, batching and pre-aggregation. Backfilling past data in case of temporary write unavailability is another important capability that requires out-of-order data ingestion.
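To make the wire format and naming convention above concrete, here is a minimal sketch (in Python) of emitting a single measurement using Graphite's plaintext protocol. The relay endpoint and the individual name segments are placeholders chosen for illustration, not our actual values:

    import socket
    import time

    # Name segments follow the convention described above:
    # <environment>.<region>.<service>.<category>.<host>.<component>
    # All concrete values here are hypothetical.
    METRIC_NAME = "prod.us-east.adserver.app.host42.requests.latency.p99"
    RELAY_HOST = "metrics-relay.example.com"   # placeholder ingestion endpoint
    RELAY_PORT = 2003                          # Graphite's default plaintext port

    def send_metric(name, value, timestamp=None):
        """Ship one measurement as a Graphite plaintext line:
        '<metric.path> <value> <epoch-seconds>\n'."""
        timestamp = timestamp or int(time.time())
        line = f"{name} {value} {timestamp}\n"
        with socket.create_connection((RELAY_HOST, RELAY_PORT), timeout=1) as sock:
            sock.sendall(line.encode("ascii"))

    send_metric(METRIC_NAME, 12.5)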

On the metrics query front, we needed the service to support a rich set of statistical functions and transforms applied to one or more time series on the server side, so that consumers (i.e. visualization and monitoring clients) can be optimized to process relatively primitive time series data. The decision on the actual query format was left open, since there were fewer integration points in our production toolchain and the effort required for custom integration was expected to be minimal if out-of-the-box ecosystem integration proved insufficient.
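For illustration, the snippet below shows what such a server-side query with composed functions could look like against a Graphite-style render endpoint, returning JSON datapoints that a visualization or monitoring client can plot or threshold directly. The endpoint URL and metric paths are hypothetical, and, as noted above, the query format itself had not been locked down at this point:

    import json
    import urllib.parse
    import urllib.request

    RENDER_URL = "http://metrics.example.com/render"   # hypothetical query endpoint

    def query(target, lookback="-15min"):
        """Ask the server to evaluate a time-series expression and return JSON datapoints."""
        params = urllib.parse.urlencode({"target": target, "from": lookback, "format": "json"})
        with urllib.request.urlopen(f"{RENDER_URL}?{params}", timeout=5) as resp:
            return json.load(resp)

    # The server sums latency across all hosts of one service and smooths it with a
    # 5-point moving average; the client only receives primitive datapoints.
    series = query("movingAverage(sumSeries(prod.us-east.adserver.app.*.requests.latency.p99), 5)")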

Non-Functional Requirements

Horizontal Scalability

It must be possible to scale the service with an increased number (and complexity) of reads, or to ingest more updates to existing metrics, whether due to an increase in the number of measures or in the number of monitored hosts/services. Scaling the service for such cases should be possible by provisioning more instances of the metrics service rather than allocating additional resources to existing instances.

Note: Considering that the service would apply only to tech metrics, it is assumed that the metrics space (i.e. the set of all metrics) would remain stable. In other words, we don't expect new metric names to be created or removed more than once in a couple of days. Hence, the requirement is to support a high update rate rather than a high rate of create/delete operations.
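As a rough illustration of how the naming convention supports this kind of scale-out, the sketch below routes each metric to one of several metrics-service instances by hashing its environment/region/service prefix, so adding capacity means adding instances. The instance pool and hashing scheme are purely hypothetical and not a description of our implementation:

    import hashlib

    # Hypothetical pool of metrics-service instances; the pool grows when more
    # capacity is needed, rather than any single instance growing.
    INSTANCES = ["metrics-01:2003", "metrics-02:2003", "metrics-03:2003"]

    def shard_for(metric_name):
        """Route a metric by its <environment>.<region>.<service> prefix so one
        service's metrics stay together and can be rebalanced as a group."""
        prefix = ".".join(metric_name.split(".")[:3])
        digest = hashlib.md5(prefix.encode("ascii")).hexdigest()
        return INSTANCES[int(digest, 16) % len(INSTANCES)]

    print(shard_for("prod.us-east.adserver.app.host42.requests.latency.p99"))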

Multi-Tenant Capable

The metrics service is expected to be shared across multiple engineering groups and run over a common underlying infrastructure to achieve operational and infrastructure efficiency. It is hence imperative that the service guarantees fault isolation between tenants and resource capping per tenant. The former ensures that a malfunction triggered by a single tenant does not impact others. Resource capping is required to provide SLO guarantees (related to ingestion, query, and data completeness) against an allocated quota of resources to each tenant. It also ensures that accidental spikes in utilization caused by one or more tenants do not impact others.
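One simple way to picture per-tenant resource capping on the write path is a fixed-window quota check per tenant, sketched below. The tenant names and quota numbers are made up, and this is only an illustration of the capping requirement, not a description of the mechanism we chose:

    import time
    from collections import defaultdict

    # Hypothetical per-tenant ingest quotas, in metric updates per minute.
    TENANT_QUOTAS = {"ads-serving": 300000, "data-platform": 150000}

    class TenantQuota:
        """Reject a tenant's writes once it exceeds its own quota, so a spike from
        one tenant cannot consume capacity allocated to others."""

        def __init__(self, quotas):
            self.quotas = quotas
            self.windows = defaultdict(lambda: [0, 0.0])   # tenant -> [count, window_start]

        def allow(self, tenant):
            count, start = self.windows[tenant]
            now = time.time()
            if now - start >= 60:                          # start a fresh one-minute window
                self.windows[tenant] = [1, now]
                return True
            if count < self.quotas.get(tenant, 0):
                self.windows[tenant][0] += 1
                return True
            return False                                   # over quota: shed this tenant's load only

    limiter = TenantQuota(TENANT_QUOTAS)
    assert limiter.allow("ads-serving")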

Admission Control and Validation

The metrics service should allow service owners to whitelist and/or blacklist ingestion of metrics to achieve admission control. The service should also allow validation of incoming metrics for well-formedness.
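A minimal sketch of what such admission control and validation could look like, assuming the Graphite plaintext format and the naming convention described earlier (the whitelist entries and the exact regular expression are illustrative assumptions):

    import re

    # Hypothetical whitelist of metric prefixes a tenant is allowed to ingest.
    WHITELIST = ("prod.us-east.adserver.", "prod.us-east.reporting.")

    # Well-formedness: "<env>.<region>.<service>.<sys|app>.<host>.<component...> <value> <epoch>"
    LINE_RE = re.compile(
        r"^(?P<name>(prod|staging|dev)\.[\w-]+\.[\w-]+\.(sys|app)\.[\w-]+(\.[\w-]+)+)"
        r"\s+(?P<value>-?\d+(\.\d+)?)\s+(?P<ts>\d+)$"
    )

    def admit(line):
        """Accept a metric line only if it is well-formed and its name is whitelisted."""
        match = LINE_RE.match(line.strip())
        return bool(match) and match.group("name").startswith(WHITELIST)

    print(admit("prod.us-east.adserver.app.host42.requests.latency.p99 12.5 1461715200"))   # True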

Expected Service Level Objectives

The following SLOs were expected to be supported for each tenant:

  • Latency and Throughput
    • Connection latency: A client connecting to the metrics service should experience a 75th percentile connect time of less than 10 milliseconds, with the 99th percentile being 15 milliseconds.
    • Metrics creation: A peak burst rate of 400 unique metrics created per minute must be supported.
    • Metrics update: The service will support a peak rate of 300K unique metric updates per minute (through a maximum of 600 write clients).
    • Metrics read: A maximum of 150 unique metric reads per second, with a data lookback of 15 minutes and no more than 5 reading clients, will be supported. The median response latency across such queries will be 10ms, with the 99th percentile latency being 20ms. This implies that a query (e.g. one involving wildcards) cannot result in a fanout of more than 150 metrics, and no more than one such query will be supported per second.
  • Availability

    Availability of 99.9% over a 24-hour window is required. Note that violation of any of the above latency and throughput SLOs for a tenant is considered unavailability of the service. Since the service is expected to be region-local, availability of the service is to be measured with respect to (i.e. within) the region.

  • Data Staleness, Consistency, and Precision
    • A maximum turnaround time for a metric, i.e. the time between a metric being published successfully and it becoming available for subsequent reads, should be 500 ms, with the 75th percentile being 100 ms.
    • Data can be expected to be complete (i.e. without gaps in the time series data) and globally consistent in sync with the write availability SLO. Specifically, this means that if the service is write-available in a particular time window, there will be no gaps in the data during that window.
    • Up to a lookback window of 3 weeks, data should have no loss of precision. Beyond 3 weeks, data may be down-sampled and must be considered representative (as per the down-sampling methods pre-configured per tenant), but not exact; a simple sketch of such down-sampling follows this list.
  • Resource Efficiency

As is the case for most shared infrastructure services at scale, it is important to have an acceptable price:performance ratio to achieve the above SLOs. This means that the implementation should not require specialized hardware to work on, and should be amenable to hardware heterogeneity within a deployment if the need arises. We will discuss actual numbers related to this in greater detail in a subsequent post.

  • Operational Simplicity and Predictability

From an operational standpoint, it's important that the service has well-understood failure modes, is composed of simple building blocks, and has as few moving parts as possible to reduce operational complexity. In addition, it must be possible for an operator to reason about system behavior in the face of various workloads, failures and resource constraints in production. The service itself should be well instrumented to enable monitoring and diagnostics.
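Coming back to the precision SLO above, the sketch below illustrates one hypothetical way older data could be down-sampled beyond the 3-week precise window: recent points are kept as-is, while older points are collapsed into 5-minute averages. The step size and aggregation method are assumptions for illustration; in practice the down-sampling methods are pre-configured per tenant:

    import time
    from statistics import mean

    # Hypothetical policy: full precision for 3 weeks, 5-minute averages beyond that.
    PRECISE_WINDOW_SECONDS = 21 * 24 * 3600
    COARSE_STEP_SECONDS = 300

    def downsample_old_points(points, now=None, method=mean):
        """Leave recent (timestamp, value) points untouched; collapse older ones
        into one representative value per coarse step."""
        now = now or time.time()
        cutoff = now - PRECISE_WINDOW_SECONDS
        recent = [(ts, v) for ts, v in points if ts >= cutoff]
        buckets = {}
        for ts, v in points:
            if ts < cutoff:
                buckets.setdefault(ts - ts % COARSE_STEP_SECONDS, []).append(v)
        coarse = [(bucket_ts, method(vals)) for bucket_ts, vals in buckets.items()]
        return sorted(coarse) + sorted(recent)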

Conclusion

In this blog, we have discussed the logical architecture of a system that achieves convergence of metrics and monitoring, and delved into the detailed requirements of the metrics service that is central to this architecture. In the final blog of the series, we will discuss the architecture of the metrics service, our implementation choices and our experience running this service. Until then, stay tuned!