Monitoring big data systems at a granular level is crucial. On the Central Data Platform team, we run an array of tech stacks that require monitoring across many areas: Kubernetes, the Control Plane, the Data Plane, Spark, Cosmos DB, SLAs, business use cases, and more. This calls for robust and scalable monitoring frameworks.
Here, we explore the solutions we used to improve monitoring of our applications. Let's walk through the different mechanisms used for capturing metrics, including Prometheus, Push Gateway, and Thanos, which addresses the challenges faced by traditional Prometheus deployments.
We deployed the Prometheus, Push Gateway, and Grafana stack using the Helm charts kube-prometheus-stack and prometheus-pushgateway to scrape Kubernetes cluster metrics. The Push Gateway solves the challenge of collecting metrics from jobs that are short-lived or do not expose an endpoint that Prometheus can scrape.
Spring Boot provides several managed endpoints that offer health checks, traceability, and other metrics for monitoring and managing applications. To enable these endpoints, we used the following configuration:
management.security.enabled = false
The health of the services and endpoints is monitored using the io.micrometer dependency.
The endpoints exposed by the Spring Boot applications are instrumented with the @Timed annotation, which has helped in publishing latency quantiles per exposed API.
The parameters used with the annotation are as follows, with a usage sketch after them:
histogram = true
percentiles = {0.50, 0.90, 0.95, 0.99}
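Below is a minimal sketch of how these parameters might be applied on a Spring Boot REST endpoint; the controller, route, and metric name are illustrative, not taken from our codebase.

import io.micrometer.core.annotation.Timed;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CatalogController {

    // Publishes p50/p90/p95/p99 latency quantiles and a histogram for this API.
    @Timed(value = "catalog.fetch", histogram = true,
           percentiles = {0.50, 0.90, 0.95, 0.99})
    @GetMapping("/catalog/{id}")
    public String fetchCatalog(@PathVariable String id) {
        return "catalog-" + id; // placeholder response
    }
}

With histogram = true, Micrometer also publishes the bucket counts, so quantiles can additionally be computed across instances in Prometheus using histogram_quantile.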
To monitor the Aerospike cache services deployed on VMSS, we incorporated the Aerospike Prometheus exporter. The exporter exposes metrics on an HTTP endpoint, allowing Prometheus to scrape the data.
To enable scraping of metrics from specific IP addresses, we added the necessary configuration to the Prometheus values.yaml file. A sample configuration snippet is shown below:
additionalScrapeConfigs:
  - job_name: aerospike
    dns_sd_configs:
      - refresh_interval: 3600s
        port: 9145
        type: A
        names:
          - metrics.env1.dnsdomain.com
          - metrics.env2.dnsdomain.com
          - metrics.env3.dnsdomain.com
This creates a new job in Prometheus that scrapes metrics from the IP addresses resolved from the DNS names in the config. If the VMSS instances change, we only need to update the IP addresses for the given name in the DNS zone.
For scraping metrics from the Spark jobs deployed on Databricks and Airflow, we had two options:
Native pull-based mechanism of Prometheus.
Push-based mechanism via the Push Gateway.
With the Prometheus pull-based mechanism, we identified both technical and security challenges. Every time the job cluster restarts, the driver IP changes, making static IPs impractical; the Prometheus scraping endpoint would have to be updated after every restart. In addition, security restrictions allow internal calls only to the cluster's driver node, preventing metrics exposed on other nodes from being scraped.
This forced us to move to the push-based mechanism.
To enable jobs to publish metrics to the Push Gateway server, we built a generic Java client. This client is used as a dependency by the Spark jobs, which are deployed across multiple subscriptions.
Once the jobs publish their metrics to the Push Gateway, Prometheus scrapes them from the Push Gateway and stores them in its time-series database for further analysis and monitoring.
The Push Gateway client developed internally uses the io.prometheus client dependencies. From the jobs, we publish quantitative metrics via this client to the Push Gateway.
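For illustration, here is a minimal sketch of how a short-lived job could push a metric using the io.prometheus simpleclient Pushgateway API; the gateway address, job name, and metric name are placeholders, and our internal client adds scheduling and labelling on top of this.

import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

public class PushGatewayExample {
    public static void main(String[] args) throws Exception {
        // Use a dedicated registry so only this job's metrics are pushed.
        CollectorRegistry registry = new CollectorRegistry();

        Gauge recordsProcessed = Gauge.build()
                .name("spark_job_records_processed") // illustrative metric name
                .help("Number of records processed by the job run")
                .register(registry);
        recordsProcessed.set(125000);

        // Push all metrics in the registry under a job name.
        PushGateway pushGateway = new PushGateway("push-gateway.example.com:9091"); // placeholder address
        pushGateway.pushAdd(registry, "sample-spark-job");
    }
}

pushAdd (HTTP POST) updates only the pushed metrics for this job's grouping key, whereas push (HTTP PUT) replaces all metrics under that key.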
Sample config used by the jobs for the internal Push Gateway client:
{
  "reporterName": "prometheus-reporter",
  "prometheusURI": "${push-gateway-url}",
  "jobName": "{job-name}",
  "initialDelay": 30,
  "timeInterval": 30,
  "configTimeUnit": "SECONDS",
  "shutdownExecutorOnStop": true,
  "uniqueLabel": "start_time",
  "labels": {
    "app": "${app-name}",
    "environment": "qa",
    "region": "eastus"
  }
}
Clarity on the resources utilized and a breakdown of the cost incurred is critical. Since Prometheus is a time-series database and a relational database was required for application health data, we used MySQL.
We configured separate health-monitoring jobs primarily for the following purposes; a sketch of the first such job is shown after the list:
Collate cost data from the Azure API to generate cost sheets. These sheets are sent to the relevant audience and stored in the MySQL monitoring DB.
Generate business-level metrics on the data processed by the Spark jobs and capture them in the monitoring DB.
Generate metrics and trends from Control Plane metadata.
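As an illustration of the first purpose, here is a minimal sketch of a job writing one cost record into the MySQL monitoring DB over JDBC; the connection details, table name, and columns are hypothetical, and the cost value stands in for data fetched from the Azure cost APIs.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.LocalDate;

public class CostSheetJob {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for the MySQL monitoring DB.
        String url = "jdbc:mysql://monitoring-db.example.com:3306/monitoring";
        try (Connection conn = DriverManager.getConnection(url, "monitor_user", System.getenv("DB_PASSWORD"));
             PreparedStatement stmt = conn.prepareStatement(
                     // Hypothetical table capturing daily cost per subscription and service.
                     "INSERT INTO azure_cost_daily (report_date, subscription, service, cost_usd) VALUES (?, ?, ?, ?)")) {
            stmt.setObject(1, LocalDate.now());
            stmt.setString(2, "central-data-platform-sub");                 // placeholder subscription
            stmt.setString(3, "Databricks");                                // placeholder service
            stmt.setBigDecimal(4, new java.math.BigDecimal("1234.56"));     // cost value retrieved from the Azure API
            stmt.executeUpdate();
        }
    }
}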
In the sections above, we discussed how various system-level and application-level metrics are scraped and stored. We predominantly use Grafana and Superset to visualize these metrics. For Kubernetes health, we used open-source Grafana dashboards, and we built dashboards as needed for the Control Plane and Data Plane services on top of Thanos. In Superset, we added the MySQL monitoring DB as a database, added the relevant tables as datasets, and created dashboards.
Monitoring systems play a pivotal role in ensuring the health and performance of any infrastructure. With a plethora of available solutions, it is crucial to identify the most suitable monitoring approach for a specific system. There is no one-size-fits-all method, so a combination of different monitoring solutions is often necessary to achieve comprehensive system health monitoring.