# random
o
Does anyone have good pointers on how to debug issues with checkpointing not progressing? I'm using Table API w/ Iceberg source + DataStream API on AWS KDA / Managed Flink.
j
o
Thanks! Checkpointing was stuck and never made any progress; all the stats in the Flink Web UI dashboard show 0.
The doc is helpful. From the thread dump, I was able to figure out that checkpointing seems to be waiting on a lock forever:
"Channel state writer <operator name> (73/128)#0" Id=128 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5b5fd993
	at java.base@11.0.18/jdk.internal.misc.Unsafe.park(Native Method)
	-  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5b5fd993
	at java.base@11.0.18/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
	at java.base@11.0.18/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
	at java.base@11.0.18/java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:483)
	at java.base@11.0.18/java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:671)
	at app//org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:96)
	at app//org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:75)
	at app//org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl$$Lambda$956/0x0000000800b8a440.run(Unknown Source)
	at java.base@11.0.18/java.lang.Thread.run(Thread.java:829)
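With 128 subtasks the dump gets long, so here is a rough sketch of how the threads in a dump like the one above could be grouped by state and by the first non-JDK frame. It assumes the header and frame layout shown in the trace above (`"name" Id=N STATE ...` followed by tab-indented `at ...` lines). The function names and the `java.base@` prefix check are my own hypothetical choices, not anything Flink provides.

```python
import re
from collections import Counter

# Thread header as it appears in the dump above, e.g.
#   "Channel state writer ... (73/128)#0" Id=128 WAITING on java.util...
HEADER = re.compile(r'^"(?P<name>.*)" Id=(?P<id>\d+) (?P<state>[A-Z_]+)')
# Stack frame line, e.g. "\tat app//org.apache.flink...loop(File.java:96)"
FRAME = re.compile(r'^\s+at (?P<frame>[^\s(]+)\(')


def parse_threads(dump):
    """Return a list of (thread name, state, [frames]) tuples."""
    threads = []
    for line in dump.splitlines():
        header = HEADER.match(line)
        if header:
            threads.append((header["name"], header["state"], []))
            continue
        frame = FRAME.match(line)
        if frame and threads:
            threads[-1][2].append(frame["frame"])
    return threads


def first_non_jdk_frame(frames):
    """First frame outside the JDK (assumption: JDK frames start with 'java.base@')."""
    for frame in frames:
        if not frame.startswith("java.base@"):
            return frame
    return frames[0] if frames else None


def count_by_state_and_frame(threads):
    """Count threads per (state, first non-JDK frame) pair."""
    return Counter(
        (state, first_non_jdk_frame(frames)) for _, state, frames in threads
    )
```

Calling `count_by_state_and_frame(parse_threads(dump_text)).most_common(10)` on the full dump text shows where most threads are parked. One caveat: a thread parked in `LinkedBlockingDeque.take` is waiting for an item to arrive on a queue, so on its own this trace may just mean the writer has nothing to do, not necessarily that it is deadlocked.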
j
Is it some kind of networking issue?
o
I'm watching your talk at Flink Forward on this topic and realized that I never replied to this thread. I was able to get rid of the issue by avoiding batch read (i.e. using streaming read instead) on Iceberg tables w/ the old Iceberg source implementation (not the FLIP-27 source). Batch read works fine with the FLIP-27 source implementation in Iceberg, but we found that implementation to be less reliable. I didn't try to understand what happened under the hood though.
j
Oh I see, very interesting! Thanks for following up on it.