# troubleshooting
o
Hey everyone. I'm using Flink 1.17.1 and have run into an issue with a KafkaSink that uses the EXACTLY_ONCE delivery guarantee and a KafkaSource with isolation.level="read_committed". With read_uncommitted I can see messages in the consumer, but with read_committed I don't get any messages. The only explanation I can think of is that the producer never commits its transactions, so they all get aborted even though the checkpoint completed (this is a theory; I have no way of confirming it). Does anyone have any idea what I can do to fix this?
m
If the checkpoint has completed successfully, all of the KafkaSink transactions are committed as part of Flink's two-phase commit protocol.
o
The checkpoint has completed successfully, but I get this error message in the logs:
org.apache.flink.connector.kafka.sink.KafkaCommitter - Unable to commit transaction (org.apache.flink.streaming.runtime.operators.sink.committables.CommitRequestImpl@7df51821) because it's in an invalid state. Most likely the transaction has been aborted for some reason. Please check the Kafka logs for more details.
I don't see any errors in the Kafka logs, despite what this message suggests.
m
That sounds like the Kafka transaction timeout has elapsed,
at least from the Kafka broker's point of view.
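For reference, here is a minimal sketch of the two client settings involved. It uses only `java.util.Properties`, and the class and method names (`KafkaTxnProps`, `producerProps`, `consumerProps`) are made up for illustration. As far as I know, you would pass these to the connector via `KafkaSinkBuilder.setKafkaProducerConfig(...)` and `KafkaSourceBuilder.setProperties(...)`. Note that the broker rejects a producer `transaction.timeout.ms` larger than its own `transaction.max.timeout.ms`, which defaults to 15 minutes.

```java
import java.util.Properties;

// Hypothetical helper, not part of Flink or Kafka: it only collects the two
// client-side settings discussed in this thread.
class KafkaTxnProps {

    // Producer side: how long the broker keeps a transaction open before aborting it.
    static Properties producerProps(long transactionTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("transaction.timeout.ms", Long.toString(transactionTimeoutMs));
        return props;
    }

    // Consumer side: only return records from committed transactions.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        Properties producer = producerProps(15 * 60 * 1000L); // 15 minutes
        Properties consumer = consumerProps();
        System.out.println(producer.getProperty("transaction.timeout.ms")); // 900000
        System.out.println(consumer.getProperty("isolation.level"));        // read_committed
    }
}
```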
o
I thought so as well, but it is set to 15 minutes, and the checkpoint completes in a matter of milliseconds.
I know this setting takes effect, because when I tried a larger value the job crashed for exceeding the maximum timeout allowed by the Confluent broker.
m
Can you verify that your KafkaSink is set up along the lines described at https://docs.immerok.cloud/docs/how-to-guides/development/exactly-once-with-apache-kafka-and-apache-flink/ ?
o
It seems the checkpoints were taking too long. When a checkpoint takes around 5 minutes, the transaction hits its timeout (which is 15 minutes) for some reason.
I will try to reduce the checkpoint duration, but it seems this shouldn't happen when the checkpoint takes 10 minutes less than the transaction timeout.
m
Are you using aligned checkpoints? You could consider enabling unaligned checkpoints.
Perhaps there's data skew that is causing checkpoints to fail.
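On why a 5-minute checkpoint can still hit a 15-minute timeout: as I understand it, the sink opens a transaction when one checkpoint starts and only commits it once the next checkpoint has completed. The transaction therefore stays open for roughly the checkpoint interval plus the checkpoint duration, not just the duration. Below is a rough sketch of that budget. The helper names are made up, and the 10-minute interval is a hypothetical value, since the actual interval wasn't mentioned here.

```java
import java.time.Duration;

// Hypothetical back-of-the-envelope check, not a Flink API.
class TxnBudget {

    // Approximate worst case for how long a single sink transaction stays open:
    // one full checkpoint interval of writing plus the time the checkpoint needs to complete.
    static Duration worstCaseOpenTime(Duration checkpointInterval, Duration checkpointDuration) {
        return checkpointInterval.plus(checkpointDuration);
    }

    // True only if the worst case stays strictly below the transaction timeout.
    static boolean fitsWithinTimeout(Duration checkpointInterval,
                                     Duration checkpointDuration,
                                     Duration transactionTimeout) {
        return worstCaseOpenTime(checkpointInterval, checkpointDuration)
                .compareTo(transactionTimeout) < 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMinutes(15);  // transaction.timeout.ms from this thread
        Duration duration = Duration.ofMinutes(5);  // observed checkpoint duration
        Duration interval = Duration.ofMinutes(10); // ASSUMPTION: interval not given in the thread

        System.out.println(fitsWithinTimeout(interval, duration, timeout));              // false
        System.out.println(fitsWithinTimeout(Duration.ofMinutes(1), duration, timeout)); // true
    }
}
```

If the numbers for your job come out close to the timeout, then either shortening the interval or getting the checkpoint duration down (which is where unaligned checkpoints may help) should leave more headroom.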
o
Do you have any suggestion on what metrics to look at and how to act according to each metric? Thanks for your replies!