# troubleshooting
j
Hi all, it's me again! Based on the previous post, I've been diving deep into the docs, reading almost every blog post I can find, and even got my hands on a Kafka book. I've come up with a scenario that I think has happened. Can someone please confirm (or disprove πŸ˜„) the following steps?
1. The KafkaSource operator reads data at very high speed and sends it to the KafkaSink operator.
2. Let's assume that checkpointing under optimal conditions lasts 50ms and is triggered every 30s, and the job is configured with the AT_LEAST_ONCE delivery guarantee.
3. On a checkpoint, KafkaSource writes its state somewhere (but the checkpoint is not yet complete; it must wait for the state from KafkaSink).
4. On a checkpoint, KafkaSink (the producer) waits for all outstanding records in the Kafka buffers to be acknowledged (similar to a flush).
5. If networking to Kafka is very slow and delivery.timeout.ms is set to 2 minutes, the checkpoint will block for up to 2 minutes and then continue because of the Kafka timeout.
6. The Kafka delivery will time out and some records will expire, but the Kafka buffers will be considered acknowledged.
7. The job manager takes the states from both KafkaSource and KafkaSink, writes them into the checkpoint, and consumer offsets are committed to Kafka afterwards.
8. This results in data loss.
Flink will assume that we are fine with the data loss since we didn't increase the delivery timeout (the maximum time we are willing to wait for messages to be delivered to Kafka), and it will continue with processing and checkpointing. I was hoping that the checkpoint would fail in case of a delivery timeout, but it looks like it wouldn't. (But I'm not sure about anything now. πŸ˜…)
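For reference, this is roughly how I picture such a job being wired up (just a sketch to make the steps above concrete; the broker address, topic names, group id, and timeout values are placeholders, not real config):
```java
import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ScenarioSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Step 2: checkpoint every 30s with an at-least-once job.
        env.enableCheckpointing(30_000L, CheckpointingMode.AT_LEAST_ONCE);

        // Step 1: KafkaSource reading from the input topic.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")              // placeholder
                .setTopics("input-topic")                        // placeholder
                .setGroupId("my-consumer-group")                 // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Step 5: producer-side timeout that bounds how long the Kafka client
        // keeps retrying a record before expiring it.
        Properties producerProps = new Properties();
        producerProps.setProperty("delivery.timeout.ms", "120000"); // 2 minutes

        // Steps 4-6 happen inside this sink when a checkpoint flushes its buffers.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")              // placeholder
                .setKafkaProducerConfig(producerProps)
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")                // placeholder
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "KafkaSource")
           .sinkTo(sink);
        env.execute("at-least-once scenario sketch");
    }
}
```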
m
Synchronization point failures like flushing will fail the checkpoint
πŸ‘ 1
j
Thanks for the answer Martijn πŸ™Œ Then it looks like Flink is not the reason for the data loss. Do you maybe know if it is possible that Kafka somehow drops messages when networking is slow (e.g. when brokers are rebalancing a large data set)? If yes, I will try to investigate it further. (There is data loss somewhere, but I have no clue what it could be 🀷)
Hi @Martijn Visser! After spending some more time debugging, I am now pretty sure that there is a bug in the Flink 1.16.1 implementation. This is what I did: I created a sink topic with 8 partitions, a replication factor of 3, and min.insync.replicas of 2. The consumer properties are set to their default values. For the producer, I changed the delivery.timeout.ms and request.timeout.ms properties, setting them to 5000ms and 4000ms respectively. (acks is set to -1 by default, which I believe is equivalent to all.) KafkaSink is configured with an AT_LEAST_ONCE delivery guarantee. The job parallelism is set to 1 and the checkpointing interval is set to 2000ms.
I started the Flink job and monitored its logs. Additionally, I was consuming the __consumer_offsets topic in parallel to track when offsets are committed for my consumer group. The problematic part occurs during checkpoint 5. Its duration was 5009ms, which exceeds the delivery timeout for Kafka (5000ms). Although it was marked as completed, I believe that the output buffer of KafkaSink was not fully acknowledged by Kafka. As a result, Flink proceeded to trigger checkpoint 6 but immediately encountered a Kafka TimeoutException: Expiring N records. I suspect that this exception originated from checkpoint 5 and that checkpoint 5 should not have been considered successful. The job then failed but recovered from checkpoint 5. Some time after checkpoint 7, consumer offsets were committed to Kafka, and this process repeated once more at checkpoint 9. Since the offsets of checkpoint 5 were committed to Kafka, but the output buffer was only partially delivered, there has been data loss. I confirmed this when sinking the topic to the database.
Sorry for reaching out to you directly, but I noticed that you are quite active here and influential in the Flink community. I hope that you can help me get in touch with the right people regarding this matter. πŸ™ (p.s. attached job logs and Kafka consumer offsets, they are not too long)
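To make the test setup concrete, here is a rough sketch of the producer overrides and checkpointing settings described above (just an illustration, not my actual job code; the class name is a placeholder):
```java
import java.util.Properties;

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TestSetupSketch {

    // Producer overrides from the test run described above.
    static Properties testProducerConfig() {
        Properties props = new Properties();
        props.setProperty("delivery.timeout.ms", "5000");
        props.setProperty("request.timeout.ms", "4000");
        // acks is left at the client default (-1, i.e. "all").
        return props;
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // Checkpoint every 2s; with delivery.timeout.ms at 5000ms, a single slow
        // flush during a checkpoint can exceed the producer timeout, which is the
        // suspicious situation around checkpoint 5.
        env.enableCheckpointing(2_000L, CheckpointingMode.AT_LEAST_ONCE);

        // testProducerConfig() would then be passed to the KafkaSink builder via
        // setKafkaProducerConfig(...), with DeliveryGuarantee.AT_LEAST_ONCE.
    }
}
```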
m
Additionally, I was consuming the __consumer_offsets topic in parallel to track when offsets are committed for my consumer group.
Flink doesn't rely on committed consumer offsets for its fault tolerance... πŸ™‚
j
Yes, but checkpoint 5 completed successfully, and I'm pretty sure it shouldn't have
m
Although it was marked as completed, I believe that the output buffer of KafkaSink was not fully acknowledged by Kafka.
If the output buffer isn't acknowledged by Kafka, that shouldn't succeed the checkpoint. But I do think that the duration of 5009ms (which I think is the duration of the entire checkpoint) is directly correlated with the delivery timeout for Kafka (5000ms), because in that 5009ms there's more activity happening than just delivering towards Kafka. I don't think the KafkaClient reported an error, so the checkpoint is marked as successful
From what I can find on a quick search, people indicate this as a misconfiguration on the Kafka broker side, not on the Flink side. If Flink doesn't get a signal that something is wrong, then it can't fail
j
I just don't understand how checkpoint 6 can fail 1 second after being triggered because of an exception from Kafka saying 5000ms has passed 🀷
21:46:10,761 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 6
...
21:46:11,739 WARN  org.apache.flink.runtime.taskmanager.Task
...
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 9 record(s) for dataops.test.jasmin.ticket-base-sb-pl-online.v1-0:5001 ms has passed
Forgot to mention that the source topic I used for this test had only ~4000 records, and I was running the job locally on an M2 Mac
I did one more test: I ran everything with Flink version 1.15.4. There were 0 issues, 0 exceptions, the max checkpoint duration was ~100ms, and all the data was there: 0 missing records. I repeated it a few times and got the same results. So Kafka is the same, the configurations are the same, and the only difference is the Flink version. Running with version 1.16.1 again caused Kafka timeout exceptions, long checkpoints, and missing records. Do you know if something related to checkpointing and the Kafka connector might have changed since 1.15.4?
m
Kafka is the same,
No, because that uses a different version of the Kafka Client πŸ˜…
j
ahh πŸ˜…
m
I do find it interesting though and I want to get it double checked
But I'll have to sync with some people offline
j
I appreciate your help, thanks a lot
Hi @Martijn Visser, any news on this?
m
No, I've had other priorities to look at unfortunately
πŸ‘ 1
j
Should I maybe put this on Jira, as Slack has limited message retention?
m
That would be a good idea
t
hey @Jasmin Redzepovic thanks for looking into this and filing the ticket.
I am now pretty sure that there is a bug in the Flink 1.16.1 implementation
Was this ever tested against 1.16.2 or 1.17.1? I'm asking because, at first glance, the reported issue of checkpoint 5 exceeding delivery.timeout.ms (meaning the Kafka buffer was likely not fully flushed) yet still succeeding should already be addressed by https://issues.apache.org/jira/browse/FLINK-31305. That fix is merged in 1.16.2 and 1.17.1.
If you have also reproduced this / can reproduce this in those versions, then we should definitely look a bit deeper.
j
Hi @Tzu-Li (Gordon) Tai, thank you for the quick response. These are the Flink versions I've tested the issue against:
β€’ 1.15.4: worked today without data loss 🟒
β€’ 1.16.1: experienced data loss today πŸ”΄
β€’ 1.16.2: couldn't reproduce it today (it worked without data loss), but I'm pretty sure I did reproduce it ~3 weeks ago 🟑
β€’ 1.17.1: experienced data loss today πŸ”΄
If you need any additional input or help from my side, I will gladly help πŸ™‚
edit: p.s. we could maybe transfer this conversation completely to Jira πŸ˜…
m
So with Flink 1.17.1, you didn't test the Kafka 3.0.0-1.17 version?
j
No, I've tested this one:
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka</artifactId>
  <version>1.17.1</version>
</dependency>
(didn't specify 3.0.0- prefix though)
m
Hmm then I'll defer back to @Tzu-Li (Gordon) Tai for his opinion on this
πŸ‘ 1