#3-2019-Aug-“Kafka lightweight consumer: How to limit the network bandwidth (by a factor of 100)”

Suppose you have a Kafka cluster up and running. And you see these kinds of stats:

[Figure: Kafka cluster usage before]

You would expect there to be many consumers. However, in my case, I had only a few, and they were reading from almost idle topics. How is this possible, reading @ 36KB/s without consuming any messages?

Bandwidth performance factors for idle consumers

I am using the confluent-kafka-python client, a lightweight wrapper around librdkafka (the C client). The configuration options of this C client are quite extensive.

Enable logging

To get a better understanding of what is going on, you should enable 'debug': 'all' on the Python client.

import logging

from confluent_kafka import Consumer

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# Remember to add a handler, formatted as you like; you probably want a FileHandler
logger.addHandler(logging.StreamHandler())

config = {
    'bootstrap.servers': 'MY_KAFKA_CLUSTER.eu-west-1.aws.confluent.cloud:9092',  # placeholder
    'group.id': 'LOCAL-CONSUMER',  # required by the Consumer; name taken from the logs below
    'debug': 'all',
}
consumer = Consumer(config, logger=logger)
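
With the consumer created, a minimal poll loop will start generating the debug output. This is only a sketch; the topic name is taken from the log excerpts below, so substitute your own:

consumer.subscribe(['ci.pipeline.build_complete'])  # topic name from the logs below
try:
    while True:
        msg = consumer.poll(timeout=1.0)  # block up to 1s waiting for a message
        if msg is None:
            continue  # idle: librdkafka still logs FETCH/OFFSET/HEARTBEAT traffic
        if msg.error():
            logger.error(msg.error())
            continue
        logger.info('message at offset %d', msg.offset())
finally:
    consumer.close()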

How to limit the bandwidth

After scanning through the log you might see these three main types of messages:

1. DEBUG 2019-08-12 16:40:24 ea_kafka_confluent.consumer FETCH [rdkafka#consumer-2] [thrd:sasl_ssl://MY_KAFKA_CLUSTER.eu-west-1.aws.confluent.cloud:9092/3]: sasl_ssl://MY_KAFKA_CLUSTER.eu-west-1.aws.confluent.cloud:9092/3: Topic ci.pipeline.build_complete [0] MessageSet size 0, error "Success", MaxOffset 0, Ver 2/2  MainProcess consumer.py _inner_consume_by_blocking 56 consumer MainThread
2. DEBUG 2019-08-12 16:40:27 ea_kafka_confluent.consumer OFFSET [rdkafka#consumer-2] [thrd:main]: Topic ci.pipeline.build_complete [0]: stored offset -1001, committed offset -1001: not including in commit  MainProcess consumer.py _inner_consume_by_blocking 56 consumer MainThread
3. DEBUG 2019-08-12 16:40:28 ea_kafka_confluent.consumer HEARTBEAT [rdkafka#consumer-2] [thrd:main]: GroupCoordinator/1: Heartbeat for group "LOCAL-CONSUMER" generation id 5  MainProcess consumer.py _inner_consume_by_blocking 56 consumer MainThread
  1. FETCH

    1. What is a FETCH? It is how consumers get messages from a Kafka topic; the broker replies with one MessageSet per topic-partition.

    2. fetch.wait.max.ms=100 and fetch.min.bytes=1 are important configuration parameters for the fetch behavior

    3. The default of 100ms is very frequent for a topic without much activity, and this causes a lot of traffic.

    4. I changed the defaults to fetch.wait.max.ms=20000 and fetch.min.bytes=10. This slowed down the FETCH traffic, and I believe it will not impact performance: fetch.min.bytes is still low, so if any message arrives it will be fetched anyway. (All of the changed settings are collected in a config sketch after this list.)

  2. OFFSET

    1. What is OFFSET? This is how the consumer group tracks which messages it has processed: one offset per partition.

    2. enable.auto.commit=true and auto.commit.interval.ms=3000 are important configuration parameters for the offset behavior

    3. Committing every 3s might be unnecessarily frequent, especially if you have only a single consumer.

    4. I changed the default to auto.commit.interval.ms=30000.

  3. HEARTBEAT

    1. What is HEARTBEAT? This message is how the consumer tells the broker that it is still active, and it is used to coordinate consumers within a consumer group.

    2. heartbeat.interval.ms=3000 and session.timeout.ms=10000 are important configuration parameters for the heartbeat behavior

    3. The consumer sends one heartbeat per connection. It is therefore less important than the FETCH and OFFSET messages, which are sent per topic-partition. E.g., consider a consumer that reads four topics with three partitions per topic: there will be about 12x more FETCH and OFFSET messages than HEARTBEAT messages.

    4. I changed the defaults to session.timeout.ms=60000 and heartbeat.interval.ms=20000. Remember that the heartbeat interval should be no more than about 1/3 of the session timeout. Otherwise, the broker might think that the consumer is down!
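
Putting the three changes together, a complete low-bandwidth configuration might look like the sketch below. This is my reading of the librdkafka settings discussed above, not a drop-in recipe; bootstrap.servers and group.id are placeholders taken from the log excerpts.

low_bandwidth_config = {
    'bootstrap.servers': 'MY_KAFKA_CLUSTER.eu-west-1.aws.confluent.cloud:9092',  # placeholder
    'group.id': 'LOCAL-CONSUMER',  # placeholder, group name from the logs above
    # FETCH: let the broker wait up to 20s instead of the default 100ms
    'fetch.wait.max.ms': 20000,
    'fetch.min.bytes': 10,  # still low, so any arriving message is fetched promptly
    # OFFSET: commit every 30s instead of the default 3s
    'enable.auto.commit': True,
    'auto.commit.interval.ms': 30000,
    # HEARTBEAT: keep the interval at about 1/3 of the session timeout
    'session.timeout.ms': 60000,
    'heartbeat.interval.ms': 20000,
}
consumer = Consumer(low_bandwidth_config, logger=logger)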

Important

Be sure to test the performance after making these changes to see if your consumers behave as expected. I am still a newbie with Kafka, so I am not aware of all the side effects of these adjustments.

Result

[Figure: Kafka cluster usage after]

Importantly, I have some other consumers and producers running, which account for about 500B/s. The network read is down from 36KB/s to 300B/s, a reduction of about 100x! That feels good for my $$$, since I pay US$ 0.143 / GB of network reading on confluent.cloud.
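
As a back-of-the-envelope check of those savings (assuming the rates are sustained around the clock, and counting 1KB as 1024 bytes):

GB = 1024 ** 3
PRICE_PER_GB = 0.143  # US$ per GB of network read on confluent.cloud
SECONDS_PER_MONTH = 30 * 24 * 3600

def monthly_read_cost(bytes_per_second):
    """Estimated monthly read cost at a constant rate."""
    return bytes_per_second * SECONDS_PER_MONTH / GB * PRICE_PER_GB

print(f'before: ${monthly_read_cost(36 * 1024):.2f} per month')  # about $12.73
print(f'after:  ${monthly_read_cost(300):.2f} per month')        # about $0.10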