Micrometer with Prometheus for Spring Boot Applications

Aman Gupta
15 min read
Posted on November 07, 2022

Monitoring allows developers to see various metrics of a system both in real-time and in retrospect to help with decision-making during maintenance and outage handling. It also represents the overall health of the system. Therefore, it's very important for an org to have a stable monitoring system in place.

InMobi Affiliate Platform (IAP) has multiple components that support the business and product at high scale, including HTTP-based services, event-driven systems, etc. To keep such a large-scale system up and running, it's important to monitor system health, analyze traffic patterns for better system optimization, and identify, act on, and mitigate any untoward incident.

Therefore, IAP relies on Prometheus as a monitoring system to monitor and understand the performance of its applications and infrastructure.
 

A Brief on Prometheus

Prometheus is an open-source monitoring system with its own built-in Time Series Database (TSDB) that scales well, allowing it to collect metrics from thousands of endpoints. It supports a powerful query language, PromQL, and various integrations with other systems.

To better understand how Prometheus works, the Prometheus documentation provides an architecture diagram and details about the essential elements of Prometheus and how the pieces connect together. You can refer to it for a glance at the sub-components.

It works on a pull-based model, collecting metrics from targets by "scraping" their metrics HTTP endpoints.


Spring Boot Applications ↔︎ Prometheus

So far we have discussed why monitoring systems are important and how Prometheus works in a nutshell. Let’s take a detailed look at how to monitor a Spring Boot application with Prometheus.

An application needs to expose all its metrics at an HTTP endpoint, in a data format that Prometheus understands, from which Prometheus can scrape them.

Spring Boot Actuator: Provides production-ready endpoints for your application. These endpoints fall under a common prefix of /actuator and, by default, most of them are not exposed over HTTP.

It provides Prometheus metrics at the /actuator/prometheus endpoint.

Micrometer: Micrometer converts and exposes Actuator metrics to external monitoring systems such as Prometheus.

It is a dimensional metrics and monitoring façade to help developers integrate their application metrics to various monitoring systems while keeping the application independent from the actual monitoring implementation.

This means that you, as a developer, can use a single interface to ship your metrics to a variety of monitoring systems. For the scope of this blog, that system is Prometheus.

Let’s take a look at the implementation-level details of the setup required in a Spring Boot application, along with some conventions and best practices for creating and naming metrics.


Getting Started

  • Dependencies required in the pom.xml:

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>

    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-core</artifactId>
    </dependency>

    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>

  • Expose the actuator endpoint for Prometheus in application.properties:
    management.endpoints.web.exposure.include=prometheus

  • Spring Boot autoconfigures a composite MeterRegistry and adds a registry to the composite for each of the supported implementations that it finds on the CLASSPATH. Having a dependency on micrometer-registry-{system} in your runtime CLASSPATH is enough for Spring Boot to configure the registry.
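
For instance, here is a minimal sketch (assuming standard Spring constructor injection; the class name PublisherMetrics is a hypothetical example) of application code getting hold of this autoconfigured registry:

    import io.micrometer.core.instrument.MeterRegistry;
    import org.springframework.stereotype.Component;

    @Component
    public class PublisherMetrics {

        private final MeterRegistry meterRegistry;

        // Spring injects the autoconfigured (composite) MeterRegistry
        public PublisherMetrics(MeterRegistry meterRegistry) {
            this.meterRegistry = meterRegistry;
        }
    }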


Autoconfigured Metrics Provided by Spring Boot Actuator

Spring Boot autoconfigures and registers some useful metrics for applications, like JVM metrics, system metrics, logger metrics, etc.
For more information, check out the Spring Boot Reference Documentation.

To disable any of the autoconfigured metrics, add the following in application.properties file:
management.metrics.enable.jvm=false

This will disable the JVM-related metrics for the application.
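
Alternatively (a hedged sketch; the helper class here is hypothetical), the same effect can be achieved programmatically with Micrometer's MeterFilter:

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.config.MeterFilter;

    public class MetricsFilters {

        // deny every meter whose name starts with "jvm"
        public static void disableJvmMetrics(MeterRegistry registry) {
            registry.config().meterFilter(MeterFilter.denyNameStartsWith("jvm"));
        }
    }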


Setting Up Custom Metrics

Micrometer employs a naming convention that separates lowercase words with a . (dot) character. Different monitoring systems follow different recommendations regarding naming convention, and some of them may be incompatible with one another. Each Micrometer implementation for a monitoring system comes with a naming convention that transforms lowercase dot notation names to the monitoring system’s recommended naming convention.

You can override the default naming convention for a registry by implementing NamingConvention and setting it on the registry:

registry.config().namingConvention(myCustomNamingConvention);
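
A minimal sketch of such a custom convention (illustrative only; the "iap." prefix is a hypothetical example, and the implementation simply delegates to Micrometer's built-in dot notation):

    import io.micrometer.core.instrument.Meter;
    import io.micrometer.core.instrument.config.NamingConvention;

    NamingConvention myCustomNamingConvention = new NamingConvention() {
        @Override
        public String name(String name, Meter.Type type, String baseUnit) {
            // prefix every metric name, then apply the default dot convention
            return NamingConvention.dot.name("iap." + name, type, baseUnit);
        }
    };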

For example, with the Prometheus registry's default convention:

registry.timer("http.server.requests");

will translate to:

http_server_requests_duration_seconds in Prometheus.

By adhering to Micrometer’s lowercase dot notation convention, you guarantee the maximum degree of portability for your metric names across monitoring systems.

 

Naming Metrics

With the dotted.string.style of metrics, it is usual to prefix metrics with the host they are on, the environment, datacenter, application, and so on. This is, however, not the way to handle things in Prometheus, whose labels provide a more powerful data model. In addition, Prometheus has a top-down way of handling service discovery.

Prometheus Labels

Use labels to differentiate the characteristics that are being measured:

  • api_http_requests_total - differentiate request types: operation="create|update|delete"

  • api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"

Do not put the label names in the metric name, as this introduces redundancy and will cause confusion if the respective labels are aggregated away.

Labels in Prometheus are arbitrary in nature, and they can convey much more than just the service or instance that exposes a metric. With the simple example of http_requests_total, services can be more descriptive about the requests being counted and can expose elements like the endpoint being used or the status code returned.

# Number of unauthorised GET requests to the /user/:id endpoint of the Users Directory service
sum(http_requests_total{service="users-directory", method="GET", endpoint="/user/:id", status="403"})


Filtering based on labels

# Only consider the Users Directory service
http_requests_total{service="users-directory"}

# Filter out successful requests
http_requests_total{status!="200"}

# Matching `region` label with regular expression
failed_logins_attempts_total{region=~"us-west-.*"}


Aggregating labels

If no label is specified, the query will return as many time series as there are combinations of labels and label values. To collapse them, an aggregation operator can be used in the query:

# Total number of requests regardless of any labels
sum(http_requests_total)

# Average memory usage for each service
avg(memory_used_bytes) by (service)

# Startup time of the oldest instance of each service in each datacenter
min(startup_time_milliseconds) by (service, datacenter)

Prometheus considers each unique combination of labels and label values as a different time series. As a result, if a label has an unbounded set of possible values, Prometheus will have a very hard time storing all of these time series. To avoid performance issues, labels should not be used for high-cardinality data sets (for example, unique customer IDs).


Micrometer Tags to Create Prometheus Labels for Metrics

You should follow the same lowercase dot notation described for meter names when naming tags.

Using this consistent naming convention for tags allows better translation into the respective monitoring system’s idiomatic naming schemes.

registry.counter("<metric-name>", "<tag-key1>", "<tag-value1>", ...., "<tag-keyN>", "<tag-valueN>")

is equivalent to:

registry.counter("<metric-name>", Arrays.asList(Tag.of("key1", "value1"), ...., Tag.of("keyN", "valueN")))

For example, suppose that we are trying to measure the number of HTTP requests and the number of database calls, then the following is recommended:

registry.counter("database.calls", "db", "users")
registry.counter("http.requests", "uri", "/api/users")

This variant provides enough context so that, if only the name is selected, the value can be reasoned about and is at least potentially meaningful. For example if we select database.calls, we can see the total number of calls to all databases. Then we can group by or select by db to drill down further or perform comparative analysis on the contribution of calls to each database.


Bad Approach

registry.counter("calls",
    "class", "database",
    "db", "users");

registry.counter("calls",
    "class", "http",
    "uri", "/api/users");

In this approach, if we select calls we get a value that is an aggregate of the number of calls to the database and to our API endpoint. This time series is not useful without further dimensional drill-down.

Prometheus will ignore meters that have the same metric name but a different set of tag keys, as they fail its validation check. For example:

cache_errors{name="cache_1",type="test"}
cache_errors{name="cache_1"}

Of the above two metrics, only the one that gets registered first will be emitted to Prometheus; all others bearing the same metric name cache_errors but a different set of tag keys will not be emitted.
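
For illustration, a sketch of the kind of registration that produces this clash (assuming an injected meterRegistry):

    // both counters share the name "cache.errors" but declare different tag keys;
    // only the shape registered first is emitted by the Prometheus registry
    meterRegistry.counter("cache.errors", "name", "cache_1", "type", "test"); // registered first, emitted
    meterRegistry.counter("cache.errors", "name", "cache_1");                 // different tag keys, not emitted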


Common Tags

You can define common tags at the registry level and add them to every metric reported to the monitoring system. This is generally used for dimensional drill-down on the operating environment, such as host, instance, region, stack, and others.

In Spring Boot, this can be done in two different ways:

  • Define the tags in the application.properties file:

    management.metrics.tags.region=us-east-1
    management.metrics.tags.stack=prod
    management.metrics.tags.app=napp

  • If you need more flexibility (for example, you have to add common tags to a registry defined in a shared library), register a MeterRegistryCustomizer callback interface as a bean to add your common tags, as sketched below. See the Spring Boot Reference Documentation for more information.
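
A minimal sketch of such a customizer (using the same example tags; the configuration class name is hypothetical):

    import io.micrometer.core.instrument.MeterRegistry;
    import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class MetricsConfig {

        @Bean
        public MeterRegistryCustomizer<MeterRegistry> commonTags() {
            // every meter reported through this registry will carry these tags
            return registry -> registry.config()
                    .commonTags("region", "us-east-1", "stack", "prod", "app", "napp");
        }
    }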

Continuing with the above example, after the common tags are configured, all metrics will carry these tags.

For example, a registry.counter("request.received") metric will be translated to:

request_received_total{app="napp",region="us-east-1",stack="prod",}


Best Practices

  • Don't put the names of the labels in the metric name, such as http_server_requests_by_method_total. When the label is aggregated away, this will cause confusion. 

  • Don't put the type of the metric in the name, such as gauge, counter, or summary. This is already implied by the existing naming scheme, adds no information, and will cause confusion as you process and write rules based on the metrics.

  • Don't use colons in metric names. These are for the use of a person writing recording rules on the Prometheus server.

  • Prometheus favors snake_case; avoid camelCase when creating your own metrics.


Usage in Applications

Micrometer provides the following set of Meters: Timer, Counter, Gauge, DistributionSummary, LongTaskTimer, FunctionCounter, FunctionTimer, and TimeGauge.

For the scope of this blog, we will only discuss Timers, Counters, and Gauges.


Counter

It is the most basic type of Meter, used to report a single count. It can only be incremented, and only by a positive amount.

There are two possible ways to use counters:

  • Directly using the registry to create the counter

    meterRegistry.counter("<metric-name>" /*, optional tags */);   // 1st Type

    // Eg:
    Counter counter1 = meterRegistry.counter("publisher.events", "type", "download");  // with tags
    meterRegistry.counter("publisher.events");  // without tags

    // to increment
    counter1.increment();  // will increment the counter by 1
    counter1.increment(n); // will increment the counter by n

  • Using Counter Builder (which provides additional options like description)

    Counter counter = Counter // 2nd Type
        .builder("counter")
        .baseUnit("beans") // optional
        .description("a description of what this counter does") // optional
        .tags("region", "test") // optional
        .register(registry);

    // Eg:
    Counter counter = Counter // 2nd Type
        .builder("publisher.events")
        .baseUnit("event") // optional
        .description("count of the events published") // optional
        .tags("type", "download") // optional
        .register(registry);  

    // to increment
    counter.increment();  // will increment the counter by 1
    counter.increment(n); // will increment the counter by n

If you open the actuator Prometheus endpoint in your browser (for example, http://localhost:8080/actuator/prometheus), you will see something like the following (this is an example; the actual values will vary):

# HELP publisher_events_total  
# TYPE publisher_events_total counter
publisher_events_total{type="download",} 5.0

In most cases, while plotting visualizations and setting up alerts, it makes sense to look at the count over a period of time rather than the absolute count value (for example, rate(publisher_events_total[5m]) in PromQL). In these cases, where you want the rate and you are already using a Timer, there is no need to add a separate Counter, as the Timer records the count of timed events as part of its metrics.


Timer

It is used for measuring the time taken for certain events to happen (for example, method execution). Additionally, it keeps a record of the frequency of these events.

The basic use of timers is to measure the latencies of execution or response.

We can use Timers in our code in the following ways:

  • Using Timer builder

    Timer timer = Timer // Using the builder
        .builder("my.timer")
        .publishPercentiles(0.50, 0.75, 0.90, 0.95, 0.98, 0.99, 0.999) // optional (supports multiple values)
        .description("a description of what this timer does") // optional
        .tags("region", "test") // optional
        .register(registry);

    timer.record(() -> dontCareAboutReturnValue());
    timer.recordCallable(() -> returnValue());

    // Eg:
    Timer latency = Timer
                    .builder("eh_producer.send.requests")
                    .tags("eventhub", "s2s")
                    .publishPercentiles(0.50, 0.75, 0.90, 0.95, 0.98, 0.99, 0.999)
                    .description("Time to send batch of events to eh")
                    .register(meterRegistry);
                    
    latency.record(() -> {
                    try {
                       // processing logic
                    } catch (Exception e) {
                        log.error("Exception: ", e);
                    }
                });

    If you open the actuator Prometheus endpoint in your browser (for example, http://localhost:8080/actuator/prometheus), you will see something like the following (this is an example; the actual values will vary):
    # HELP eh_producer_send_requests_seconds_max Time to send batch of events to eh
    # TYPE eh_producer_send_requests_seconds_max gauge
    eh_producer_send_requests_seconds_max{eventhub="s2s",} 0.0
    # HELP eh_producer_send_requests_seconds Time to send batch of events to eh
    # TYPE eh_producer_send_requests_seconds summary
    eh_producer_send_requests_seconds{eventhub="s2s",quantile="0.5",} 0.0
    eh_producer_send_requests_seconds{eventhub="s2s",quantile="0.75",} 0.0
    eh_producer_send_requests_seconds{eventhub="s2s",quantile="0.9",} 0.0
    eh_producer_send_requests_seconds{eventhub="s2s",quantile="0.95",} 0.0
    eh_producer_send_requests_seconds{eventhub="s2s",quantile="0.98",} 0.0
    eh_producer_send_requests_seconds{eventhub="s2s",quantile="0.99",} 0.0
    eh_producer_send_requests_seconds{eventhub="s2s",quantile="0.999",} 0.0
    eh_producer_send_requests_seconds_count{eventhub="s2s",} 0.0
    eh_producer_send_requests_seconds_sum{eventhub="s2s",} 0.0

  • Using Timer Resource Sample (more convenient to use)

    private Timer.ResourceSample getTimer(final MeterRegistry meterRegistry, final String metricName, final Tags tags) {
          return Timer.resource(meterRegistry, metricName) // using Timer Resource 
                        .tags(tags)
                        .publishPercentiles(0.50, 0.75, 0.90, 0.95, 0.98, 0.99, 0.999);
    }  
                    
    try (Timer.ResourceSample sendTimer = getTimer(meterRegistry, "eh_producer.send.requests", Tags.of("eventhub", "eh1"))) {
          try {
              // processing logic
          } catch (Exception e) {
              log.error("Exception: ", e);
          }
    }

 

Gauge

It is used to represent the current value of the state as a single numerical value.

As mentioned and recommended in the official Micrometer document as well:

Gauges should be sampled and not be set, so there is no information about what might have occurred between samples. Any intermediate values set on a gauge are lost by the time the gauge value is reported to a metrics backend, so there is little value in setting those intermediate values in the first place.

Basic uses of gauges include measuring the lag in consuming events from a message broker like EventHub, the size of a collection, or the number of threads in an active state.

Gauges and Counters might seem similar at first glance, but there are significant differences:
      • Counters store increasing values, while Gauge values can decrease as well.
      • Gauge values change only when they are observed; unlike Counters, intermediate values are not stored.


We can use Gauges in our code in the following ways:

  • Manually updating the gauge value (increment/decrement):

    // maintain a reference to myGauge
    AtomicInteger myGauge = registry.gauge("numberGauge", tags, new AtomicInteger(0));

    // ... elsewhere you can update the value it holds using the object reference
    myGauge.set(27);
    myGauge.set(11);

    // Eg:
    AtomicLong lag = meterRegistry.gauge("ehconsumer.lag", Tags.of("eventhub", "s2s"), new AtomicLong(0));
    lag.set(n);

    A gauge can be made to track any java.lang.Number subtype that can be set, such as AtomicInteger and AtomicLong found in java.util.concurrent.atomic and similar types.

    If you open the actuator Prometheus endpoint in your browser (for example, http://localhost:8080/actuator/prometheus), you will see something like the following (this is just an example; the actual value will vary):
    # HELP ehconsumer_lag  
    # TYPE ehconsumer_lag gauge
    ehconsumer_lag{eventhub="s2s",} 1.0

  • Passing a method to observe for change as a supplier

    // Directly using the registry to build and monitor a gauge metric
    List<String> list = registry.gauge("listGauge", Collections.emptyList(), new ArrayList<>(), List::size);

    // Using Gauge Builder
    Gauge gauge = Gauge
        .builder("gauge", myObj, myObj::gaugeValue)
        .description("a description of what this gauge does") // optional
        .tags("region", "test") // optional
        .register(registry);
        
    // Eg: Let's say you want to monitor the size of the map advInfoMap

    Map<String, Object> advInfoMap = new ConcurrentHashMap<>();

    private Integer getRepoSize() { // this method returns the size of advInfoMap
        return advInfoMap.size();
    }

    Gauge.builder(metricName, this::getRepoSize)
        .tags(tags)
        .register(meterRegistry);


Gauge Reporting NaN or Disappearing?
Micrometer does not create a strong reference to the state object that you are measuring with a Gauge. So, once the object being gauged is de-referenced and garbage collected, Micrometer will start reporting NaN, or nothing at all, for the gauge. Therefore, it’s your responsibility to hold a strong reference to the state object being gauged.
If you see your gauge report for a few minutes and then disappear or report NaN, it almost certainly means that the underlying object being gauged has been garbage collected.
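
A minimal sketch of holding such a strong reference (assuming a Spring bean; the class and field names are hypothetical):

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Tags;
    import java.util.concurrent.atomic.AtomicLong;
    import org.springframework.stereotype.Component;

    @Component
    public class LagMetrics {

        // the bean holds a strong reference for its lifetime, so the gauged
        // object is never garbage collected and the gauge never reports NaN
        private final AtomicLong lag;

        public LagMetrics(MeterRegistry meterRegistry) {
            this.lag = meterRegistry.gauge("ehconsumer.lag",
                    Tags.of("eventhub", "s2s"), new AtomicLong(0));
        }

        public void updateLag(long value) {
            lag.set(value);
        }
    }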

