A Deep Dive into Compression in Kafka

Anurag Goel
Posted on September 14, 2023

Apache Kafka, a distributed streaming platform, offers a robust solution for handling high-volume, real-time data streams. When running Kafka at high scale, we usually run into data storage and bandwidth constraints. To ease these, we can use compression in Kafka.

Let’s delve into the world of compression in Kafka, and explore its benefits, types of compression algorithms, and benchmarks.

Why Compression?

By enabling compression, we can reduce network utilization and storage, both of which are often bottlenecks when sending messages to Kafka. A compressed batch has the following advantages:

  • Much smaller producer request size (compression ratio up to 4x!)

  • Faster data transfer over the network, and therefore lower latency

  • Better throughput

  • Better disk utilization in Kafka (stored messages on disk are smaller)

Compression in Kafka

Kafka supports two types of compression: producer-side and broker-side.

  • Producer side: Enabling compression on the producer requires no configuration change on the brokers or the consumers. Producers choose a codec with the compression.type setting.

    • Compression options are none, gzip, lz4, snappy, and zstd.

    • If enabled, compression is performed by the producer client.

    • This is particularly efficient when the producer batches messages together (high throughput).

  • Broker side: Compression can also be configured on the broker, at the topic level.

    • By default, the topic-level setting is compression.type=producer, which means the broker keeps the codec chosen by the producer. We can change this setting as required.
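To make the topic-level setting concrete, here is a minimal Python sketch of how the codec a batch ends up stored with could be resolved. The helper name stored_codec is hypothetical and only models the behavior of compression.type described above; it is not Kafka code.

```python
# Illustrative model only: stored_codec is a hypothetical helper, not a Kafka API.
TOPIC_COMPRESSION_TYPES = {"producer", "uncompressed", "gzip", "snappy", "lz4", "zstd"}
PRODUCER_CODECS = {"none", "gzip", "snappy", "lz4", "zstd"}


def stored_codec(topic_compression_type: str, producer_codec: str) -> str:
    """Return the codec a batch is stored with on the broker."""
    if topic_compression_type not in TOPIC_COMPRESSION_TYPES:
        raise ValueError(f"unknown topic compression.type: {topic_compression_type}")
    if producer_codec not in PRODUCER_CODECS:
        raise ValueError(f"unknown producer codec: {producer_codec}")

    if topic_compression_type == "producer":
        # Default: the broker keeps whatever codec the producer used.
        return producer_codec
    if topic_compression_type == "uncompressed":
        # The topic stores batches without compression.
        return "none"
    # A specific codec on the topic overrides the producer's choice.
    return topic_compression_type


print(stored_codec("producer", "snappy"))
print(stored_codec("zstd", "gzip"))
```

When the topic names a codec that differs from the producer's, the broker has to recompress the batch, which costs broker CPU. That is one reason the default of producer is usually left in place.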

Producer-side compression is the more popular option and the simpler one to implement, so we will focus on it here.
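As a rough sketch of what enabling producer-side compression looks like, the snippet below builds a producer configuration as a plain Python dictionary. The property names (compression.type, linger.ms, batch.size) are standard Kafka producer settings, but the helper name producer_config, the broker address, and the chosen values are illustrative assumptions, not recommendations.

```python
# Hypothetical helper for illustration; values are assumptions, not tuned settings.
SUPPORTED_CODECS = ("none", "gzip", "snappy", "lz4", "zstd")


def producer_config(bootstrap_servers: str, codec: str = "snappy") -> dict:
    """Build a producer configuration with compression enabled."""
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported compression.type: {codec}")
    return {
        "bootstrap.servers": bootstrap_servers,
        "compression.type": codec,  # codec applied by the producer client
        "linger.ms": 20,            # wait briefly so batches can fill up
        "batch.size": 65536,        # upper bound on batch size, in bytes
    }


config = producer_config("localhost:9092", codec="zstd")
for key, value in config.items():
    print(f"{key}={value}")
```

Raising linger.ms and batch.size gives the compressor larger batches to work with, at the cost of a little extra latency per message. Check your client's documentation for the exact keys it accepts.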

Reference: https://www.conduktor.io/kafka/kafka-message-compression

Compression Algorithms

Kafka supports four compression codecs: Gzip, LZ4, Snappy, and Zstd.

Gzip

Gzip compression is CPU-intensive and offers several compression levels. Higher levels produce smaller output but use more CPU. It is based on the DEFLATE algorithm, which combines LZ77 and Huffman coding.
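The effect of Gzip compression levels can be tried out with Python's standard gzip module. This is a small sketch on a made-up, repetitive JSON payload (the field names are assumptions); real sizes depend entirely on your data.

```python
import gzip

# Made-up payload: 1000 similar JSON lines, as a stand-in for a batch of messages.
payload = b"".join(
    b'{"user_id": %d, "event": "page_view", "page": "/home"}\n' % i
    for i in range(1000)
)

sizes = {}
for level in (1, 6, 9):
    compressed = gzip.compress(payload, compresslevel=level)
    # Gzip is lossless: decompressing must give back the original bytes.
    assert gzip.decompress(compressed) == payload
    sizes[level] = len(compressed)

for level, size in sizes.items():
    print(f"level {level}: {len(payload)} -> {size} bytes")
```

Level 1 favors speed and level 9 favors output size. Timing each call, for example with time.perf_counter, shows the CPU side of the trade-off.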

Snappy

Snappy is a fast data compression and decompression library written in C++ and based on ideas from LZ77. It does not aim for maximum compression or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression.

Zstd

Zstandard, commonly known by the name of its reference implementation, Zstd, is a lossless data compression algorithm. It was designed to give a compression ratio comparable to that of the DEFLATE algorithm, but faster, especially for decompression. It is tunable, with compression levels ranging from negative 7 (fastest) to 22 (slowest compression, but best compression ratio).

Gzip vs Snappy

Compared to the fastest mode of zlib (the library behind Gzip), Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger.

Reference: GitHub - google/snappy: A fast compressor/decompressor

Compression Algorithm Comparison

Metric              Gzip      LZ4       Snappy    Zstd
Compression ratio   Highest   Low       Medium    Medium
CPU usage           Highest   Lowest    Moderate  Moderate
Compression speed   Slowest   Fastest   Moderate  Moderate

Benchmarks

Here are some benchmark metrics for Gzip (zlib) vs Snappy vs Zstd.

Reference: GitHub - facebook/zstd: Zstandard - Fast real-time compression algorithm

Compressor name       Ratio   Compression   Decompression
zstd 1.5.1 -1         2.887   530 MB/s      1700 MB/s
zlib 1.2.11 -1        2.743   95 MB/s       400 MB/s
zstd 1.5.1 --fast=1   2.437   600 MB/s      2150 MB/s
zstd 1.5.1 --fast=3   2.239   670 MB/s      2250 MB/s
zstd 1.5.1 --fast=4   2.148   710 MB/s      2300 MB/s
snappy 1.1.9          2.073   550 MB/s      1750 MB/s
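To get a feel for what these numbers mean, the sketch below turns three rows of the benchmark above into back-of-the-envelope estimates for 1000 MB of input. It assumes the speeds are measured per MB of uncompressed data and that your data compresses like the benchmark corpus, which is unlikely to hold exactly. The function name estimate is hypothetical.

```python
# (ratio, compression MB/s, decompression MB/s), copied from the benchmark above.
BENCHMARKS = {
    "zstd 1.5.1 -1": (2.887, 530, 1700),
    "zlib 1.2.11 -1": (2.743, 95, 400),
    "snappy 1.1.9": (2.073, 550, 1750),
}


def estimate(input_mb: float, ratio: float, comp_mbps: float, decomp_mbps: float) -> dict:
    """Rough size and time estimates, assuming speeds refer to uncompressed MB."""
    return {
        "compressed_mb": input_mb / ratio,
        "compress_s": input_mb / comp_mbps,
        "decompress_s": input_mb / decomp_mbps,
    }


results = {name: estimate(1000, *row) for name, row in BENCHMARKS.items()}
for name, r in results.items():
    print(
        f"{name}: {r['compressed_mb']:.0f} MB on the wire, "
        f"{r['compress_s']:.1f} s to compress, {r['decompress_s']:.1f} s to decompress"
    )
```

On these figures, zlib and zstd reach similar sizes, but zlib needs several times longer to get there, while Snappy is about as fast as zstd with a larger output.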

Kafka Compression Flow

Producers group messages into a batch before sending in order to save network round trips. If the producer is sending compressed messages, all the messages in a single producer batch are compressed together and sent as the "value" of a "wrapper message". The bigger the batch of messages being sent to Kafka, the more effective compression is, because the compressor can exploit repetition across messages.

Messages Batched → Compressed Batch → Send to Kafka
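The benefit of compressing a whole batch rather than individual messages can be seen in a small experiment. This sketch uses Python's zlib as a stand-in for whichever codec the producer is configured with, and joins messages with newlines, which is not Kafka's actual wire format. The message fields are made up.

```python
import json
import zlib

# Made-up messages that share most of their structure, as real event streams often do.
messages = [
    json.dumps({"order_id": i, "status": "CREATED", "currency": "USD"}).encode()
    for i in range(200)
]

raw_total = sum(len(m) for m in messages)

# Option 1: compress each message on its own.
individual_total = sum(len(zlib.compress(m)) for m in messages)

# Option 2: compress the whole batch at once, as a producer does.
batch = b"\n".join(messages)
batched_total = len(zlib.compress(batch))

# The batch is lossless: the original messages can be recovered.
assert zlib.decompress(zlib.compress(batch)).split(b"\n") == messages

print(f"raw:                     {raw_total} bytes")
print(f"compressed individually: {individual_total} bytes")
print(f"compressed as one batch: {batched_total} bytes")
```

Short messages barely shrink on their own because there is little repetition inside a single message. The batch compresses well because field names and values repeat across messages.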

Compression does, however, add a small CPU overhead on both ends:

  • Producers must commit some CPU cycles to compression.

  • Consumers must commit some CPU cycles to decompression.

Conclusion

Compression is a powerful feature in Apache Kafka that can optimize data transmission, improve network utilization, and enhance overall system performance. By carefully choosing the appropriate compression algorithm and configuration, you can strike a balance between compression efficiency, CPU usage, and latency requirements.