Building A Scalable Multitenant Metrics Service

Sudhakar BG and Bharath Ravi Kumar on May 16, 2016

As we discussed in part 2 of this blog series, a robust metrics service is critical to achieving convergence of metrics and monitoring. Given the requirements discussed, we were clear that the service had to reside in our data centers for reasons of cost and control. Once we arrived at the architecture of the metrics service (which we describe in detail later), we set out to evaluate whether we needed to build the service from scratch or could build on top of an existing F/OSS time series database to meet our requirements.

Choice of Time Series Database

The following alternatives were considered while choosing the time series database:

  • InfluxDB: At the time of evaluation, InfluxDB was (and still is) a popular time series database that offered reasonable implementation choices for storage and application runtime. It also integrated with the graphite wire protocol. However, the project was in its infancy and was undergoing rapid development. We therefore deferred adopting it and decided to reconsider it in the medium to long term.
  • OpenTSDB: OpenTSDB is a proven time series database that can process a very large number of metrics and can scale horizontally. It can use one of several popular distributed data stores for metrics storage. At the time of evaluation, it required a custom bridge to be implemented to receive metrics in the graphite wire protocol. In addition, the complexity of operating distributed data stores (see this video describing a similar experience) could be justified only if the need for such data scale and throughput were to arise. However, based on the architecture of the overall system (e.g. if one were to choose a federated and/or sharded model in a multi-tenant deployment), there may be no need for a single shard/namespace of the metrics service to process more than a few million metrics per minute. We hence deferred the move to a time series database that required a distributed data store.
  • Prometheus: Prometheus was not live at the time of our evaluation, but might be a good option to consider now, with its time-series-optimized storage scheme, resource efficiency and robustness.
  • Graphite: We then re-evaluated Graphite, which was already being used for metrics collection in several of our engineering groups. While Graphite has wide community adoption, good ecosystem integration and is mature, we were dissatisfied with the single-thread performance of the graphite daemons and the (comparative) limitations of its storage library. While there were alternative storage libraries, they weren’t well maintained or tested in production environments. Our previous experience also indicated that it performed poorly on spinning disks and didn’t work well under high metric creation rates. However, our current requirement didn’t involve a large number of creates, and solid state drives were ubiquitous in our deployments. From prior experience, we also liked the predictability and stability of Graphite. When we tested Graphite’s performance against the above Service Level Objectives (SLOs) on our current hardware, we realized that we could comfortably meet our goals with a few minor modifications to Graphite. Trading off the various functional and nonfunctional requirements, we decided to use Graphite.

Architecture of the Metrics Service

Using Graphite as the building block, we designed a multi-tenant capable, scalable metrics service using Docker and LVM as follows (a sketch of how a pod might be launched follows the list):

  • Each Docker container runs all the carbon daemons (relay, aggregator and cache) along with graphite web and an nginx instance fronting graphite web. Every such container instance is called a “pod”.
  • For redundancy, at least two such pods exist per datacenter, with the relay of each pod replicating metrics to its peer pods.
  • The redundant pods are placed behind a load balancer VIP created for metrics reads and writes.
  • The docker containers have explicit CPU and RAM quotas allocated for equitable resource sharing.
  • The containers each mount a logical volume from the host machine for metrics data storage; the logical volumes provide disk space isolation between pods. We also mount one logical volume per container from the host machine for logging.
  • Metric whitelists are in place inside each pod as a primitive form of admission control.
  • The containers run with host-mode networking to minimize the networking overhead of the default Docker bridge mode.
  • Since individual pods tend to be less network intensive (compared to the capacity of the NIC on the servers), the network interfaces have not been isolated. When required, we intend to move network-intensive containers to their own isolated interfaces by configuring our SR-IOV capable NICs.
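For illustration, a pod launch might look roughly like the sketch below. The image name, mount paths, pod name and quota values are placeholders rather than our exact production configuration, and the CPU pinning shown is just one way to express a CPU quota.

# Illustrative sketch of launching a "pod" (carbon daemons + graphite-web + nginx in
# one container) with CPU/RAM quotas, host networking and per-pod logical volumes.
# Image name, mount points, pod name and quota values are hypothetical placeholders.
import subprocess

POD_NAME = "metrics-pod-a"

docker_cmd = [
    "docker", "run", "-d",
    "--name", POD_NAME,
    "--net", "host",                      # host-mode networking to avoid bridge overhead
    "--cpuset-cpus", "0,1",               # restrict the pod to 2 cores (CPU quota)
    "--memory", "12g",                    # RAM quota for equitable sharing
    # LVM logical volumes mounted from the host: one for whisper data, one for logs
    "-v", "/mnt/lvm/%s/whisper:/opt/graphite/storage/whisper" % POD_NAME,
    "-v", "/mnt/lvm/%s/logs:/opt/graphite/storage/log" % POD_NAME,
    "example/graphite-pod:latest",        # hypothetical image with carbon + graphite-web + nginx
]

subprocess.check_call(docker_cmd)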

The following schematic illustrates the overall architecture:

[Schematic: Hawkeye 2.0 metrics service architecture]

The above design achieves complete isolation across pods along the read and write paths and has helped us consolidate, streamline and stabilize our metrics infrastructure.

Modifications to Graphite

Graphite supports a few different metrics drain strategies (max, naive, sorted, etc.) which determine the order in which metrics get flushed to persistent storage. All of these, however, share the property of doing greedy writes to disk, regardless of how efficient those writes may (or may not) be.

We wanted to experiment with a more deterministic approach to writing metrics down to disk, and so developed a new strategy - “batched” - which allows a configurable number of data points to accumulate in carbon cache’s internal MetricCache data structure before being flushed to disk in batches at configurable intervals. The batched drain strategy allows us to serve recent metrics directly from carbon cache RAM instead of hitting disk, which makes read workloads (graphing, alerting rules) faster. The deterministic approach also makes it easy to size RAM correctly, since we can model memory needs for a given incoming (unique) metrics rate and a given in-memory retention.
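As a rough illustration of that sizing model, the sketch below estimates cache memory for a given unique-metric count and in-memory retention. The per-point overhead constant is an assumption for illustration, not a measured value.

# Back-of-the-envelope RAM sizing for the batched drain strategy.
# BYTES_PER_CACHED_POINT is an assumed, illustrative overhead per (timestamp, value)
# entry in carbon cache's MetricCache, including object overhead.
BYTES_PER_CACHED_POINT = 100  # assumption for illustration

def cache_ram_bytes(unique_metrics, points_per_min, retention_minutes):
    """Estimate RAM needed to hold `retention_minutes` of data in the cache."""
    cached_points = unique_metrics * points_per_min * retention_minutes
    return cached_points * BYTES_PER_CACHED_POINT

# Example: 400K unique metrics, one data point per metric per minute,
# one hour retained in memory.
if __name__ == "__main__":
    needed = cache_ram_bytes(unique_metrics=400000, points_per_min=1, retention_minutes=60)
    print("Approximate cache RAM: %.1f GB" % (needed / 1e9))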

Default Graphite Cache Drain Behavior

Carbon cache by default drains its cache greedily (i.e., if possible at all, it will drain everything in its cache down to disk). This approach makes sure data is persisted as quickly as possible. The flip side of the greedy drain behavior is that data integrity/redundancy/performance tradeoffs become harder to make. In a Graphite setup where carbon relay replicates metrics to two metric sinks, for instance, we could (in theory) trade off potential loss of cached metric data for better metrics read performance.

Batched Cache Drain Strategy

By having the administrator explicitly tell Graphite how many data points per metric should be cached in memory before writing to disk, sizing and tuning become less of a fuzzy exercise. Additionally, carbon cache’s internal metrics predictably reflect the white box state of the system. For example, pointsPerUpdate or committedPoints is no longer a floating metric as it is with the other drain strategies - it is exactly what you want it to be at all times, with known, understood and bounded variance based on the time it takes to build up the cache and to drain it in its entirety.
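The following is a minimal sketch of the idea behind the batched strategy, written in the spirit of carbon’s pluggable drain strategies. It is not the actual carbon-cache code; the class, setting names and thresholds are illustrative.

# Simplified sketch of a "batched" drain strategy: hold points in the cache until a
# configurable per-metric threshold is reached, then drain the whole cache in batched
# writes. Not the actual carbon code; names and values are illustrative.
import time
from collections import defaultdict

MIN_POINTS_PER_METRIC = 360   # e.g. six hours of minutely data before draining
DRAIN_CHECK_INTERVAL = 60     # seconds between drain checks

class BatchedCache(object):
    def __init__(self):
        self.cache = defaultdict(list)   # metric name -> list of (timestamp, value)

    def store(self, metric, datapoint):
        self.cache[metric].append(datapoint)

    def ready_to_drain(self):
        # Drain only once enough points have accumulated; until then the disk stays
        # quiescent for writes and recent reads are served straight from RAM.
        return bool(self.cache) and \
            max(len(points) for points in self.cache.values()) >= MIN_POINTS_PER_METRIC

    def drain(self, writer):
        for metric, points in list(self.cache.items()):
            writer(metric, points)       # one batched write per metric (e.g. an update_many call)
            del self.cache[metric]

def writer_loop(cache, writer):
    while True:
        if cache.ready_to_drain():
            cache.drain(writer)
        time.sleep(DRAIN_CHECK_INTERVAL)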

The operator of the system can decide how many minutes or hours worth of data should stay in memory. Since the disk is quiescent in terms of writes for the duration of cache buildup, reads no longer have to contend with potentially inefficient writes. At best there is no contention at all (during cache buildup), and at worst reads contend with efficient writes (during cache drain).

For example, in our tests, six hours worth of cached metrics took about half an hour to drain for a million unique metrics (a total of 360 million accumulated metric points drained) on an experimental setup using SSDs - see the exact machine configuration below. The rate of cache drain can be controlled fairly accurately via the MAX_UPDATES_PER_SECOND Graphite config parameter if writes during drain-down end up causing read contention. The rate of cache drain in this scheme reflects the overall efficiency of the write subsystems - the Graphite writer thread code, the OS filesystem caches (or an alternative data storage system) and the disk controller. The ratio of the time taken to write metrics to disk to the number of cached metric points serves as a good proxy for detecting when the system’s write scaling limits are being reached.
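Working through the quoted numbers as a quick sanity check (the per-second figures below are simply derived from those totals, and the file-update rate assumes each metric’s cached points are written in a single batched update):

# Rough arithmetic for the drain test described above.
unique_metrics = 1000000           # 1M unique metrics
points_per_metric = 360            # six hours of minutely data
drain_seconds = 30 * 60            # observed drain time of about half an hour

total_points = unique_metrics * points_per_metric        # 360M points, as quoted
points_per_second = total_points / drain_seconds         # ~200K points/sec written
file_updates_per_second = unique_metrics / drain_seconds # ~555 file updates/sec, assuming one
                                                         # batched update per metric; this is the
                                                         # rate MAX_UPDATES_PER_SECOND caps
print(points_per_second, file_updates_per_second)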

In short, we found that being able to predictably control the write-down behavior allows proper evaluation of the performance characteristics of alternative storage backends and, in general, makes sizing and tuning for performance easier.

Capacity and Resource Consumption

We benchmarked a pod with our modifications and found that with 2 cores and 12GB RAM, a single container can conservatively handle 350-400K incoming metric updates/min and serve read traffic of about 150 qps of unique metric reads with a 15m lookback window, while caching an hour’s worth of data in memory. A single server (configuration below) can comfortably handle about 4 million incoming metrics/min while serving about 1500 qps of unique metric reads, with headroom to spare. The system can continue to scale horizontally beyond these limits by simply adding more servers into the mix and by sharding/shuffling/relocating services between them.
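As a quick check on how the per-pod and per-server figures line up (pure arithmetic on the numbers above; the pod count per server is an implied estimate, not a quoted one):

# Per-pod (benchmarked): 2 cores, 12GB RAM, ~400K metric updates/min, ~150 qps reads.
# Per-server (quoted): ~4M metric updates/min, ~1500 qps reads, 24 vCPUs, 160GB RAM.
pods_per_server = 4000000 // 400000      # ~10 pods per server (estimate)
cores_needed = pods_per_server * 2       # 20 of the 24 available vCPUs
ram_needed_gb = pods_per_server * 12     # 120 of the 160GB available
print(pods_per_server, cores_needed, ram_needed_gb)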

Isolation

Dockerizing our Graphite setup also provides isolation, so errors or misconfiguration of clients (it happens occasionally!) resulting in incoming metrics spikes or read traffic spikes stay isolated and contained, and do not affect the health or performance of other co-tenants of the system.

Monitoring the Monitoring

White Box

Existing carbon internal metrics (e.g., cache.queue, cache.queries, cache.size, metricsReceived, cpuUsage, memUsage, relayMaxQueueLength) for all Graphite components are monitored and have alerting rules configured.
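As an illustration of what such a rule might look like, the sketch below polls one of the carbon internal metrics through the graphite-web render API and flags a breach. The URL, metric path pattern and threshold are assumptions for illustration, not our production values; with the batched drain strategy, the threshold would sit above the maximum cache size expected for the configured in-memory retention.

# Illustrative white-box check: poll a carbon internal metric via the graphite-web
# render API and alert if it crosses a threshold. URL and threshold are illustrative.
import json
import urllib.request

RENDER_URL = ("http://metrics-pod.example.com/render"
              "?target=carbon.agents.*.cache.size&from=-5min&format=json")
CACHE_SIZE_THRESHOLD = 500000000  # above the expected max for the configured retention (illustrative)

def check_cache_size():
    with urllib.request.urlopen(RENDER_URL, timeout=10) as resp:
        series = json.load(resp)
    for s in series:
        # Render API returns datapoints as [value, timestamp] pairs; take the latest non-null value.
        latest = next((v for v, _ in reversed(s["datapoints"]) if v is not None), None)
        if latest is not None and latest > CACHE_SIZE_THRESHOLD:
            print("ALERT: %s cache.size=%s exceeds threshold" % (s["target"], latest))

if __name__ == "__main__":
    check_cache_size()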

Black Box

To ensure high availability of the system, we developed additional synthetic probes which write test metrics into the system at periodic intervals and read them back out. Injecting test metrics at regular intervals into various points in the write path (e.g., through the VIP, through the carbon relay, through the carbon cache) allows us to quickly and correctly diagnose and isolate faults.
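A minimal sketch of such a probe is shown below, assuming the plaintext carbon protocol on port 2003 for writes and the graphite-web render API for reads; the hostnames, port and metric name are placeholders.

# Illustrative synthetic probe: write a test metric through the write path using the
# plaintext carbon protocol, then read it back through the graphite-web render API.
# Hostnames, port and metric name are placeholders.
import json
import socket
import time
import urllib.request

WRITE_HOST, WRITE_PORT = "metrics-vip.example.com", 2003   # carbon plaintext listener
READ_URL = "http://metrics-vip.example.com/render?target=%s&from=-5min&format=json"
PROBE_METRIC = "synthetic.probe.vip.heartbeat"

def write_probe():
    now = int(time.time())
    line = "%s %d %d\n" % (PROBE_METRIC, now, now)   # metric value timestamp; value = timestamp
    with socket.create_connection((WRITE_HOST, WRITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return now

def read_probe(expected_value):
    with urllib.request.urlopen(READ_URL % PROBE_METRIC, timeout=10) as resp:
        series = json.load(resp)
    values = [v for s in series for v, _ in s["datapoints"] if v is not None]
    return expected_value in values

if __name__ == "__main__":
    sent = write_probe()
    time.sleep(60)   # allow the data point to become queryable
    print("probe ok" if read_probe(sent) else "ALERT: probe metric not readable")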

Hardware Configuration

The following is a snapshot of the hardware configuration of the servers running the metrics service:

CPU:

{
  product: intel(r) xeon(r) cpu e5-2640 0 @ 2.50ghz,
  speed_verbose: 2500.012 mhz,
  vendor: genuineintel,
  cores_per_socket: 6,
  vcpus: 24,
  cpu_modes: [ 32-bit, 64-bit ],
  architecture: x86_64,
  speed: 2500012000,
  threads_per_core: 2
}

RAM: 160GB

Disk: Intel SSD drives - no RAID configured.

Choice of Monitoring and Alerting Subcomponents

Based on similar efforts to understand our (then) needs and engineering trade-offs, we built a redundant Nagios 4-backed service for monitoring, a shared Grafana service for metrics visualization, CitoEngine for metrics aggregation/dispatch and PagerDuty for alert dispatch. In a subsequent blog series, we will discuss how and what changes we made to each of the above subsystems as we moved to a container-based, hosting-agnostic deployment model.

Conclusion

In this blog series, we discussed how we arrived at a robust, centralized service for metrics collection and monitoring. These services have improved the availability of our business services, streamlined our production engineering practices, helped us consolidate and better manage our infrastructure, and allowed product engineering teams to run faster and more reliable services.