# troubleshooting
h
Have folks ever run into a situation where a task slot "pauses" on a Kafka source and doesn't make any forward progress without a restart?
d
yes, this can happen for several reasons. First, check the logs of both the connector and the Kafka broker, and look at the Kafka metrics. One possibility is that the Kafka brokers are not reachable for some reason.
You can see if there is a Prometheus integration with Kafka for insights
h
We have full metrics from Strimzi. What we are seeing is while the broker is healthy, and we can do requests from Flink to Kafka, there is a lot of locking that's happening around Pekko/Kafka inside of the thread dump.
d
there could be an ongoing consumer group rebalance on the Kafka side
what's Pekko?
h
It used to be called Akka, then got forked during the licensing dispute and merged/updated into Flink 1.18.
d
oh ok
with Akka I think that tuning
akka.actor.default-dispatcher.fork-join-executor.parallelism-min
and also parallelism-max, or configuring a dedicated dispatcher for Kafka operations, might help alleviate the locking.
also you can look into adjusting the fetch.min.bytes and fetch.max.bytes settings on the connector.
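A minimal sketch of what those two fetch settings could look like as a properties map, assuming the stock Kafka consumer defaults (fetch.min.bytes = 1, fetch.max.bytes = 52428800). The helper name `fetch_properties` is hypothetical, and `fetch.max.wait.ms` is not mentioned above; it is included only because it caps how long the broker waits for fetch.min.bytes to fill up.

```python
# Kafka consumer defaults, as listed in the Kafka consumer configuration docs.
DEFAULT_FETCH_MIN_BYTES = 1
DEFAULT_FETCH_MAX_BYTES = 52_428_800  # 50 MiB
DEFAULT_FETCH_MAX_WAIT_MS = 500


def fetch_properties(min_bytes=DEFAULT_FETCH_MIN_BYTES,
                     max_bytes=DEFAULT_FETCH_MAX_BYTES,
                     max_wait_ms=DEFAULT_FETCH_MAX_WAIT_MS):
    """Hypothetical helper: build consumer fetch properties as strings.

    Rejects obviously inconsistent combinations before they reach the client.
    """
    if min_bytes < 0 or max_bytes < 0 or max_wait_ms < 0:
        raise ValueError("fetch settings must be non-negative")
    if min_bytes > max_bytes:
        raise ValueError("fetch.min.bytes must not exceed fetch.max.bytes")
    return {
        "fetch.min.bytes": str(min_bytes),
        "fetch.max.bytes": str(max_bytes),
        "fetch.max.wait.ms": str(max_wait_ms),
    }


if __name__ == "__main__":
    # Example: ask the broker to batch at least 64 KiB per fetch response.
    for key, value in fetch_properties(min_bytes=64 * 1024).items():
        print(f"{key}={value}")
```

With the Flink Kafka connector, each key/value pair would typically be passed through the source builder's `setProperty` call; the values here are placeholders, not recommendations.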
also check the ratio of Flink parallel tasks to Kafka partitions to make sure it's reasonable
and check for resource contention on the Flink application side (memory/CPU usage)
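To make the ratio check concrete, here is a rough sketch that spreads partitions over source subtasks and reports how many subtasks end up idle. It assumes a plain round-robin assignment, which is a simplification and not necessarily how the connector actually places partitions; both function names are hypothetical.

```python
def partitions_per_subtask(num_partitions, parallelism):
    """Count partitions per subtask under an assumed round-robin assignment."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    if num_partitions < 0:
        raise ValueError("num_partitions must be non-negative")
    counts = [0] * parallelism
    for partition in range(num_partitions):
        counts[partition % parallelism] += 1
    return counts


def idle_subtasks(num_partitions, parallelism):
    """Number of subtasks that receive no partition at all."""
    return partitions_per_subtask(num_partitions, parallelism).count(0)


if __name__ == "__main__":
    # More subtasks than partitions: some subtasks never read anything.
    print(partitions_per_subtask(3, 5), idle_subtasks(3, 5))
    # Fewer subtasks than partitions: load is uneven unless it divides evenly.
    print(partitions_per_subtask(6, 4), idle_subtasks(6, 4))
```

A subtask with no partitions is not the same thing as a stalled one, but a lopsided ratio makes it harder to tell the two apart in the metrics.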
Also if none of this works you might have to look into thread dump or into the akka actor system settings.
these are some of the things that can be set
akka {
  actor {
    default-dispatcher {
      type = "Dispatcher"
      executor = "fork-join-executor"
      fork-join-executor {
        parallelism-min = 8
        parallelism-factor = 3
        parallelism-max = 64
      }
      throughput = 100
    }
  }
}
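For intuition on how the three parallelism values above interact: the fork-join-executor sizes its pool as ceil(available processors * parallelism-factor), then clamps the result between parallelism-min and parallelism-max. A small sketch of that calculation follows; the function name is hypothetical and the core counts are just examples.

```python
import math


def fork_join_parallelism(cores, parallelism_min, parallelism_factor, parallelism_max):
    """Pool size: ceil(cores * factor), bounded by [parallelism_min, parallelism_max]."""
    scaled = math.ceil(cores * parallelism_factor)
    return min(max(scaled, parallelism_min), parallelism_max)


if __name__ == "__main__":
    # Settings from the snippet above: min = 8, factor = 3, max = 64.
    for cores in (2, 4, 32):
        print(cores, "cores ->", fork_join_parallelism(cores, 8, 3, 64), "threads")
```

On a small TaskManager the minimum dominates, and on a large one the maximum does, so changing parallelism-factor alone may have no visible effect.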
there are also mailbox settings
if it's a bounded mailbox
my-bounded-mailbox {
  mailbox-type = "akka.dispatch.BoundedMailbox"
  mailbox-capacity = 1000
  mailbox-push-timeout-time = 10s
}
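As a rough illustration of what mailbox-capacity and mailbox-push-timeout-time mean, the sketch below imitates a bounded mailbox with Python's `queue.Queue`: a sender blocks for up to the push timeout when the mailbox is full, and the message is diverted to a dead-letter list if no room appears. This is a toy model under those assumptions, not Akka or Pekko code, and the class name is only borrowed for readability.

```python
import queue


class BoundedMailbox:
    """Toy model of a bounded mailbox with a push timeout."""

    def __init__(self, capacity, push_timeout_s):
        self._queue = queue.Queue(maxsize=capacity)
        self._push_timeout_s = push_timeout_s
        self.dead_letters = []  # messages that could not be enqueued in time

    def enqueue(self, message):
        """Try to enqueue; return False if the push timed out."""
        try:
            self._queue.put(message, timeout=self._push_timeout_s)
            return True
        except queue.Full:
            self.dead_letters.append(message)
            return False

    def dequeue(self):
        """Take the oldest message; raises queue.Empty if there is none."""
        return self._queue.get_nowait()


if __name__ == "__main__":
    mailbox = BoundedMailbox(capacity=2, push_timeout_s=0.01)
    for msg in ("a", "b", "c"):
        print(msg, "accepted" if mailbox.enqueue(msg) else "dead-lettered")
```

A long push timeout combined with a slow consumer means senders sit blocked for that long, which can look like locking in a thread dump.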
and there is also an unbounded mailbox, which one fits depends on your setup
check for backpressure. that’s about it
h
Thanks Draco! I'll start running down your list.
d
ok, let us know how it goes. Some things are faster to check than others.