:snake: Kafka in Python: Share Your Wisdom! :rocket:
# random
p
🐍 Kafka in Python: Share Your Wisdom! 🚀

Hey there, Python + Kafka enthusiasts! 👋

TL;DR: Curious about your Kafka-Python adventures - throughput, troubles, and which tooling you're using. I'm trying to understand how Python consumers cope with Kafka, Python's GIL, and parallel reading from multiple partitions. I have experience with Flink, Kafka Streams, and Spring consumers, and I'm keen to discuss how these technologies benefit us (e.g., committing only after sink, transactions, parallelism, multi-partition thread reading).

I've been diving into the Python-Kafka realm lately and thought, why not tap into this awesome community's expertise? 🤓

1. Throughput: What kind of throughput have you achieved with Kafka and Python? Any tips for optimizing it? 📈
2. Troubles: Have you encountered any particular challenges or common errors when working with Kafka in a Python environment? How is the consumer heartbeat doing?
3. Tooling: Are you wielding fluvio or rocking kafka-python? Share your tool tales! 🛠️
4. GIL Concerns: Python's Global Interpreter Lock (GIL) can be a hindrance for parallelism. How have you dealt with this limitation, especially when reading from multiple partitions in parallel? 🐍
5. How are you handling AVRO serialization and decompression time? (I'm currently using lz4) 📦

Having experience with Flink, Kafka Streams, and Spring consumers, I've found them to be incredibly powerful for stream processing. For instance:
• Committing Only After Sink: This can ensure message integrity and prevent data loss (see the sketch after this message). 🤝
• Transactions: Implementing transactional processing can help maintain data consistency and reliability. 💼
• Parallelism: These technologies often offer parallelism, making it easier to scale and improve performance. 🚀
• Multi-Partition Thread Reading: Efficiently reading from multiple partitions can be a game-changer. Have you explored this aspect in the Python Kafka ecosystem? 📚

Your insights will be very welcome, and I'm all ears. Can't wait to learn from your experiences! 🙌
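Since "commit only after sink" is the pattern I keep coming back to, here's a minimal sketch of how it can look with kafka-python; the topic name and write_to_sink() are hypothetical placeholders, not a reference implementation:

```python
from kafka import KafkaConsumer

# Minimal sketch: commit offsets only after the sink write succeeds.
# "orders" and write_to_sink() are hypothetical placeholders.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="sink-writer",
    enable_auto_commit=False,       # take over commit responsibility
    auto_offset_reset="earliest",
)

def write_to_sink(records):
    """Placeholder for the actual sink write (DB upsert, S3 put, ...)."""

try:
    while True:
        batch = consumer.poll(timeout_ms=1000)  # {TopicPartition: [records]}
        if not batch:
            continue
        records = [r for partition_records in batch.values() for r in partition_records]
        write_to_sink(records)      # if this raises, nothing is committed
        consumer.commit()           # commit only after the sink succeeded
finally:
    consumer.close()
```

If the process crashes between the sink write and the commit, the batch is redelivered, so the sink still needs to tolerate duplicates (at-least-once, not exactly-once).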
👀 1
j
Background

I had a client project to sync data from an on-premise data warehouse into a PostgreSQL DB on AWS. They used the Debezium connector to push the data into a single topic - records of all tables were routed into that topic, mainly because transformations had to be performed as soon as data was ingested into the source tables. The transformations were made by stored procedures within the DB. A Kafka consumer app was created using the kafka-python package, and basically it did the following (a rough sketch follows this message):
(1) poll messages and take the latest record per table
(2) upsert/delete records based on the database operation (c, u or d)
(3) send transformation details (stored procedure execution info) to multiple SQS queues so that the transformations could be performed by Lambda functions (the stored procedures took seconds to minutes, or longer in rare cases, and couldn't be processed by the consumer app)
The source topic had three partitions, and the same number of instances of the consumer app were deployed on ECS.

1. Throughput: What kind of throughput have you achieved with Kafka and Python? Any tips for optimizing it?
I don't have an exact number, but the consumer app was able to process multi-million records a day. Actually, the latency was mainly due to the database performance. We didn't put much optimisation into the consumer app; we focused on the transformation logic instead.

2. Troubles: Have you encountered any particular challenges or common errors when working with Kafka in a Python environment? How is the consumer heartbeat doing?
Initially the consumer instances rebalanced often whenever the stored procedures took longer than the heartbeat limit. Although an instance got excluded from the consumer group, it was not terminated, so we had to kill it manually. After we moved the stored procedures out to run separately, however, we didn't have that issue.

3. Tooling: Are you wielding fluvio or rocking kafka-python? Share your tool tales!
We used the kafka-python package, and there is one by Confluent as well. Basically, the Python Kafka client packages implement the admin, consumer and producer APIs. I didn't know fluvio; it looks to be related to stream processing. In the Python community, https://quix.io/ and https://bytewax.io/ are mentioned more often.

4. GIL Concerns: Python's Global Interpreter Lock (GIL) can be a hindrance for parallelism. How have you dealt with this limitation, especially when reading from multiple partitions in parallel?
We can deploy multiple consumer instances, and partitions are dynamically assigned. Also, I saw a couple of examples that achieve parallelism using a Ray cluster - https://www.ray.io/.

5. How are you handling AVRO serialization and decompression time? (I'm currently using lz4)
The confluent-kafka package supports AVRO records that are associated with the Confluent Schema Registry (a sketch follows below). I think the (de)serialization can also be used in an app that is built with the kafka-python package. Moreover, it is possible to (de)serialize records against the Glue Schema Registry in Python as well.
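To make the flow above concrete, here's a rough sketch of that kind of loop, assuming kafka-python and boto3; the topic name, the Debezium JSON envelope handling, the queue URLs, and the DB helpers are simplified assumptions, not the actual project code:

```python
import json
import boto3
from kafka import KafkaConsumer

# Rough sketch of the pipeline described above: poll -> keep the latest
# record per table/key -> upsert or delete -> enqueue stored-procedure
# jobs on SQS. Topic name, queue URLs and DB helpers are assumptions.

def upsert(table, row):
    """Placeholder for the actual DB upsert."""

def delete(table, row):
    """Placeholder for the actual DB delete."""

def queue_for(table):
    """Placeholder mapping a table name to its SQS queue URL."""
    return f"https://sqs.us-east-1.amazonaws.com/123456789012/{table}"

consumer = KafkaConsumer(
    "cdc-topic",
    bootstrap_servers="broker:9092",
    group_id="cdc-sync",
    enable_auto_commit=False,
    value_deserializer=lambda v: json.loads(v) if v else None,
)
sqs = boto3.client("sqs")

while True:
    batch = consumer.poll(timeout_ms=1000, max_records=500)
    latest = {}  # (table, key) -> newest Debezium payload in this batch
    for records in batch.values():
        for rec in records:
            if rec.value is None:  # tombstone record
                continue
            payload = rec.value["payload"]
            table = payload["source"]["table"]
            latest[(table, rec.key)] = payload  # later records win
    for (table, _key), payload in latest.items():
        if payload["op"] in ("c", "u"):   # Debezium create/update
            upsert(table, payload["after"])
        elif payload["op"] == "d":        # Debezium delete
            delete(table, payload["before"])
        sqs.send_message(
            QueueUrl=queue_for(table),
            MessageBody=json.dumps({"table": table}),
        )
    if latest:
        consumer.commit()  # commit only after DB writes and SQS sends
```

And for reference on point 5, reading Avro records against the Confluent Schema Registry with confluent-kafka might look roughly like this (registry URL, brokers, and topic are placeholders). lz4 decompression itself happens inside the client library, so it never shows up in application code:

```python
from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import MessageField, SerializationContext

# Sketch of Avro deserialization via the Confluent Schema Registry.
# The registry URL, brokers, and topic are hypothetical placeholders.
schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
avro_deserializer = AvroDeserializer(schema_registry)

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "avro-reader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["cdc-topic"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    value = avro_deserializer(
        msg.value(), SerializationContext(msg.topic(), MessageField.VALUE)
    )
    print(value)  # a plain dict decoded from the Avro record
```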
p
Hey! Thank you for the descriptive response. Did you have, in that case, one consumer per partition, or one consumer consuming more than one partition?
Initially the consumer instances rebalanced often whenever the stored procedures took longer than the heartbeat limit. Although an instance got excluded from the consumer group, it was not terminated, so we had to kill it manually. After we moved the stored procedures out to run separately, however, we didn't have that issue.
By "separately", do you mean one consumer per partition? How often was the rebalancing happening? I've also heard about bytewax and quix. Both of them work somewhat like Flink, right?
j
We deployed the consumer application as an ECS service with three tasks (instances), matching the number of partitions. The kafka-python library dynamically assigns partitions. At the beginning, there was a lot of rebalancing whenever a task failed to finish a transformation within the heartbeat limit. Once we offloaded the transformations, it was fixed. We didn't even monitor how often they were rebalanced. Yes, that's right. In the Kafka ecosystem, they aim to support the Kafka Streams API. Another Python package that I have to mention is Faust - https://faust.readthedocs.io/en/latest/
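For anyone hitting the same rebalancing behaviour, these are the kafka-python consumer settings that govern it; the values below are illustrative, not the ones from the project described:

```python
from kafka import KafkaConsumer

# Illustrative kafka-python settings relevant to the rebalancing issue
# discussed above. Topic, brokers, group id and values are examples only.
consumer = KafkaConsumer(
    "cdc-topic",
    bootstrap_servers="broker:9092",
    group_id="cdc-sync",
    session_timeout_ms=30_000,      # how long without heartbeats before the
                                    # broker declares the consumer dead
    heartbeat_interval_ms=10_000,   # frequency of background heartbeats
    max_poll_interval_ms=600_000,   # max time allowed between poll() calls;
                                    # long per-batch work (e.g. stored
                                    # procedures) must finish within this
                                    # window or the consumer is kicked out
                                    # of the group and a rebalance starts
    max_poll_records=100,           # smaller batches -> less work per poll
                                    # -> shorter gaps between poll() calls
)
```

In kafka-python, heartbeats run on a background thread, so the usual culprit for the rebalances described above is exceeding max_poll_interval_ms with slow per-batch processing, which is why offloading the stored procedures fixed it.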