Monitoring Apache Kafka with Prometheus

TL;DR, show me the code: kafka-prometheus-monitoring (GitHub)

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It is scalable, durable and distributed by design, which is why it is currently one of the most popular choices of messaging broker for high-throughput architectures.

One of the major differences with Kafka is the way it manages consumer state. This is itself distributed: each client is responsible for keeping track of the messages it has consumed (the high-level consumer in later versions of Kafka abstracts this away, with offsets stored in ZooKeeper). In contrast to more traditional MQ messaging technologies, this inversion of control takes considerable load off the server.

The scalability, speed and resiliency of Kafka are why it was chosen for a project I worked on for my most recent client, Sky. Our use case was processing real-time user actions in order to provide personalised recommendations for end users of NowTV, a popular web streaming service available on multiple platforms. We needed a reliable way to monitor our Kafka cluster to help inform key performance indicators during non-functional testing (NFT).

Prometheus JMX Collector

Prometheus is our monitoring tool of choice, and Apache Kafka metrics are exposed by each broker in the cluster via JMX, so we need a way to extract these metrics and expose them in a format suitable for Prometheus. Fortunately, prometheus.io provides an exporter for this. The Prometheus JMX Exporter is a lightweight web service which exposes Prometheus metrics via an HTTP GET endpoint. On each request it scrapes the configured JMX server and transforms the JMX MBean query results into Prometheus-compatible time series data, which is then returned to the caller over HTTP.

The MBeans to scrape are controlled by a YAML configuration in which you can provide a whitelist or blacklist of metrics to extract and specify how to represent them in Prometheus, for example as a GAUGE or COUNTER. The configuration can be tuned for your specific requirements; a list of all metrics can be found in the Kafka Operations documentation. Here is what our configuration looked like:

lowercaseOutputName: true
jmxUrl: service:jmx:rmi:///jndi/rmi://{{ getv "/jmx/host" }}:{{ getv "/jmx/port" }}/jmxrmi
rules:
- pattern : kafka.network<type=Processor, name=IdlePercent, networkProcessor=(.+)><>Value
- pattern : kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+)><>OneMinuteRate
- pattern : kafka.network<type=SocketServer, name=NetworkProcessorAvgIdlePercent><>Value
- pattern : kafka.server<type=ReplicaFetcherManager, name=MaxLag, clientId=(.+)><>Value
- pattern : kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>OneMinuteRate
- pattern : kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate
- pattern : kafka.server<type=Produce><>queue-size
- pattern : kafka.server<type=ReplicaManager, name=(.+)><>(Value|OneMinuteRate)
- pattern : kafka.server<type=controller-channel-metrics, broker-id=(.+)><>(.*)
- pattern : kafka.server<type=socket-server-metrics, networkProcessor=(.+)><>(.*)
- pattern : kafka.server<type=Fetch><>queue-size
- pattern : kafka.server<type=SessionExpireListener, name=(.+)><>OneMinuteRate
- pattern : kafka.controller<type=KafkaController, name=(.+)><>Value
- pattern : kafka.controller<type=ControllerStats, name=(.+)><>OneMinuteRate
- pattern : kafka.cluster<type=Partition, name=UnderReplicated, topic=(.+), partition=(.+)><>Value
- pattern : kafka.utils<type=Throttler, name=cleaner-io><>OneMinuteRate
- pattern : kafka.log<type=Log, name=LogEndOffset, topic=(.+), partition=(.+)><>Value
- pattern : java.lang<type=(.*)>
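
The rules above leave metric naming and typing to the exporter's defaults. As a rough sketch of how a rule can also set the name, type and labels explicitly, something like the following could be used. The metric names on the name lines are illustrative choices and were not part of our configuration, and the whitelistObjectNames key is the spelling I know from older exporter releases, so check it against the version you run.

```yaml
lowercaseOutputName: true
# Assumption: only query MBeans in these domains (key name as in older exporter releases)
whitelistObjectNames: ["kafka.server:*", "kafka.log:*"]
rules:
# A value that can go up and down is exposed as a GAUGE
- pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
  name: kafka_server_replicamanager_underreplicatedpartitions  # illustrative name
  type: GAUGE
# The Count attribute of a meter only ever increases, so it is exposed as a COUNTER;
# the regex capture group becomes a Prometheus label
- pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=(.+)><>Count
  name: kafka_server_brokertopicmetrics_messagesin_total  # illustrative name
  type: COUNTER
  labels:
    topic: "$1"
```

Turning capture groups into labels rather than baking them into the metric name keeps a single time series family per metric, which tends to make aggregating across topics easier in queries.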

In summary:

  • Prometheus JMX Exporter – scrapes the configured JMX server, transforms the JMX MBean query results into Prometheus-compatible time series data and exposes the result via HTTP
  • JMX Exporter Configuration – a configuration file that filters the JMX properties to be transformed (see the example Kafka configuration above)
  • Prometheus – Prometheus itself is configured to poll the JMX Exporter /metrics endpoint
  • Grafana – allows us to build rich dashboards from collected metrics
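
To make the Prometheus side of this concrete, here is a minimal sketch of a prometheus.yml that polls one exporter. The job name, the host name jmx-exporter, the port 5556 and the 15 second interval are all assumptions for illustration; substitute whatever your exporter is actually listening on.

```yaml
global:
  scrape_interval: 15s          # assumption: how often Prometheus polls its targets

scrape_configs:
  - job_name: kafka             # hypothetical job name
    metrics_path: /metrics      # the path Prometheus scrapes (this is also its default)
    static_configs:
      - targets: ['jmx-exporter:5556']   # hypothetical exporter host and port
```

Because the exporter queries JMX on every request, the scrape interval also determines how often the broker's JMX server is queried.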

Viewing Kafka Metrics

Once metrics have been scraped into Prometheus they can be browsed in the Prometheus UI. Alternatively, richer dashboards can be built using Grafana.

[Figure: Prometheus Graph Builder]
[Figure: Grafana Dashboard]

To try this out locally, a fully dockerised example has been provided on GitHub: kafka-prometheus-monitoring. This project is for demonstration purposes only and is not intended to be run in a production environment. It only scratches the surface of monitoring and fine-tuning Kafka brokers, but it is a good place to start in order to enable performance analysis of the cluster.

A note on monitoring a cluster of brokers: Prometheus metrics will include a label which denotes the broker's IP address, allowing you to distinguish metrics per broker. To get this, a JMX exporter needs to be run for each broker, and Prometheus should be configured to poll each deployed JMX exporter.
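
For a cluster, the scrape job might list one exporter per broker, roughly as sketched here. The IP addresses and port are hypothetical, and the sketch assumes each exporter is reachable at its broker's address. Prometheus attaches an instance label containing the target address to every scraped series by default, which is what lets you tell the brokers apart.

```yaml
scrape_configs:
  - job_name: kafka             # hypothetical job name
    static_configs:
      - targets:
          # one JMX exporter per broker (hypothetical addresses)
          - '10.0.0.1:5556'
          - '10.0.0.2:5556'
          - '10.0.0.3:5556'
```

A query can then filter or group on the instance label to compare brokers side by side.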

 
