Joris Basiglio
09/19/2024, 9:55 PM

D. Draco O'Brien
09/20/2024, 4:11 AM
taskmanager.job.<job_id>.total.backpressure-time, operator.<subtask_index>.backpressured-time-ns
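One way to use these backpressure metrics is to compare subtasks against each other. A minimal sketch (the sample values and the helper name are hypothetical; in practice the numbers would come from your metrics reporter or Flink's REST API):

```python
# Sketch: flag subtasks whose backpressured time is far above the median,
# which can point at skewed (hot-key) partitions.
from statistics import median

def backpressure_outliers(ns_by_subtask: dict[int, int], factor: float = 3.0) -> list[int]:
    """Return subtask indices whose backpressured-time-ns exceeds
    `factor` times the median across all subtasks."""
    med = median(ns_by_subtask.values())
    if med == 0:
        # No baseline backpressure: flag any subtask backpressured at all.
        return [i for i, v in ns_by_subtask.items() if v > 0]
    return [i for i, v in ns_by_subtask.items() if v > factor * med]

samples = {0: 1_200, 1: 900, 2: 15_000, 3: 1_100}  # hypothetical samples
print(backpressure_outliers(samples))  # subtask 2 stands out
```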
You can also monitor buffer usage, especially if you’re using network shuffling. Uneven distribution of data across keys can cause some tasks to have a lot more buffered data waiting to be processed than others.
taskmanager.network.<task_id>.outbound.<channel_index>.buffers-in-use, taskmanager.network.total.num-network-buffers
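To make the buffer metrics actionable, you can reduce the per-channel samples to a single skew number. A rough sketch (function name and sample values are hypothetical):

```python
# Sketch: estimate buffer-usage skew across output channels from
# buffers-in-use samples (one value per channel index).
def buffer_skew(buffers_in_use: list[int]) -> float:
    """Ratio of the busiest channel to the average channel.
    A value near 1.0 means buffers are evenly used; a large value
    suggests some channels (and hence keys) carry far more data."""
    avg = sum(buffers_in_use) / len(buffers_in_use)
    return max(buffers_in_use) / avg if avg else 0.0

print(buffer_skew([4, 5, 4, 35]))  # heavily skewed toward one channel
```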
Large differences between event time and processing time can hint at inefficiencies, possibly caused by skewed data processing.
operator.<subtask_index>.latency
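The event-time/processing-time gap itself is simple to compute; what matters is comparing it across subtasks. A tiny sketch with hypothetical epoch-millisecond timestamps:

```python
# Sketch: lag between when an event occurred (event time) and when it
# was processed (processing time). Consistently higher lag on some
# subtasks can indicate skewed load.
def processing_lag_ms(event_ts_ms: int, processing_ts_ms: int) -> int:
    return processing_ts_ms - event_ts_ms

# hypothetical (event_ts, processing_ts) pairs, one per subtask
lags = {i: processing_lag_ms(e, p)
        for i, (e, p) in enumerate([(1_000, 1_050), (1_000, 1_040), (1_000, 9_000)])}
print(lags)  # subtask 2 lags far behind the others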
It’s not a direct measure of hot keys, but monitoring the size of keyed state can give you a rough idea of which keys are retaining more state, which might correlate with processing load.
state.backend.*
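Note that RocksDB's native metrics are off by default because they add some overhead. A sketch of what enabling a few of them might look like in flink-conf.yaml (option names follow the Flink RocksDB native-metrics docs, but verify them against your Flink version):

```
# flink-conf.yaml — enable selected RocksDB native metrics
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.block-cache-hit: true
state.backend.rocksdb.metrics.block-cache-miss: true
```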
So these are some of the related metrics you can sample and aggregate.

D. Draco O'Brien
09/20/2024, 4:17 AM

Joris Basiglio
09/20/2024, 1:23 PM
I disabled block-cache on RocksDB due to a memory leak. After doing that, I noticed some performance degradation, which led me to think that we have a hot key issue (but the block-cache feature was probably hiding the symptoms).
I like your idea of aggregating those metrics in Flink; I'll look more deeply into that.
I've also noticed that Flink publishes state.backend.rocksdb.metrics.block-cache-hit
which could actually be a good proxy for hot keys in my case. I might revert the block-cache disabling change and see what this metric looks like for a while.

D. Draco O'Brien
09/20/2024, 1:28 PM
state.backend.rocksdb.metrics.block-cache-hit could definitely provide some additional insights.
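To read the cache counters as a hot-key signal, the usual move is to turn raw hit/miss counts into a hit ratio. A minimal sketch (the counter values and function name are hypothetical):

```python
# Sketch: derive a block-cache hit ratio from hit/miss counters.
# A hit ratio that stays high even with a small cache can indicate
# that a few hot keys dominate the reads.
def cache_hit_ratio(hits: int, misses: int) -> float:
    total = hits + misses
    return hits / total if total else 0.0

print(cache_hit_ratio(9_500, 500))  # 0.95
```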