Victor Babenko
10/09/2025, 8:22 PM
Royston
10/10/2025, 3:08 PM
Trystan
10/10/2025, 5:19 PM
Event | Info | SCALINGREPORT | Scaling execution enabled, begin scaling vertices:
{ Vertex ID 5591cebca2a45b74833df01196d1a431 | Parallelism 74 -> 80 | Processing capacity 6671.45 -> 7131.00 | Target data rate 3889.81}
{ Vertex ID 162a29c968c94eac6fbdfc2f8ccc2080 | Parallelism 180 -> 128 | Processing capacity Infinity -> Infinity | Target data rate 50.67}
{ Vertex ID 009b60dbc7737a8a2e91e7b3c30d9949 | Parallelism 128 -> 169 | Processing capacity 70.68 -> 93.00 | Target data rate 50.67}
these autoscaling decisions don't make a lot of sense to me. why on earth is it scaling UP 009b?
here's the output log from right before:
ue1d.FlinkDeployment.AutoScaler.jobVertexID.009b60dbc7737a8a2e91e7b3c30d9949.TRUE_PROCESSING_RATE.Average: 71.325
ue1d.FlinkDeployment.AutoScaler.jobVertexID.009b60dbc7737a8a2e91e7b3c30d9949.SCALE_DOWN_RATE_THRESHOLD.Current: 176.0
ue1d.FlinkDeployment.AutoScaler.jobVertexID.009b60dbc7737a8a2e91e7b3c30d9949.SCALE_UP_RATE_THRESHOLD.Current: 56.0
is this a catchup buffer problem? maybe it thinks it would be unable to catch up within the desired window?
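Not an authoritative answer, but here is a rough sketch (all numbers and option names are made up for illustration, not taken from the report above) of how the catch-up buffer can produce exactly this kind of scale-up: when a vertex has accumulated lag, the autoscaler sizes it for the steady-state input rate at the utilization target plus the extra rate needed to drain that lag within the configured catch-up window, so the required capacity can exceed the measured processing rate even though the plain input rate (50.67 here) is well below it. As far as I understand, the real implementation also folds in restart time and bounds on the scale factor.

public class CatchUpBufferSketch {
    public static void main(String[] args) {
        // All values are hypothetical and only illustrate the shape of the calculation.
        double avgInputRate = 50.0;         // steady-state records/s arriving at the vertex
        double trueProcessingRate = 70.0;   // measured vertex-level processing rate at current parallelism
        double targetUtilization = 0.7;     // e.g. kubernetes.operator.job.autoscaler.target.utilization
        double backlogRecords = 40_000;     // accumulated lag the autoscaler wants drained
        double catchUpSeconds = 1_800;      // e.g. a 30-minute catch-up window

        // Simplified model: required capacity = steady-state rate at the utilization target
        // plus the extra rate needed to drain the backlog within the catch-up window.
        double requiredCapacity = avgInputRate / targetUtilization + backlogRecords / catchUpSeconds;

        int currentParallelism = 128;
        int newParallelism = (int) Math.ceil(currentParallelism * requiredCapacity / trueProcessingRate);

        System.out.printf("requiredCapacity=%.2f newParallelism=%d%n", requiredCapacity, newParallelism);
        // requiredCapacity ~= 71.4 + 22.2 = 93.6 records/s, newParallelism ~= 172, i.e. a scale-up
        // even though the steady-state input (50/s) is below what the vertex already processes (70/s).
    }
}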
Barak Ben-Nathan
10/12/2025, 7:35 AM
State Processor API to generate both a keyed state and a broadcast state for the same operator?
ilililil
10/15/2025, 4:19 AM
// Asynchronously read the current value of the keyed state (async state API).
final var stateFuture = state.asyncValue();
stateFuture.thenAccept(value -> {
    // Once the value arrives: register an event-time timer and emit a derived record.
    ctx.timerService().registerEventTimeTimer(timestamp + someValue);
    out.collect(foo(value));
});
Elad
10/15/2025, 6:18 AM
Sergio Morales
10/15/2025, 12:01 PM
Royston
10/16/2025, 8:18 AM
George Leonard
10/16/2025, 11:46 AM
Flink SQL> CREATE CATALOG kafka_catalog WITH (
> 'type' = 'hive',
> 'hive-conf-dir' = './conf/'
> );
[INFO] Execute statement succeeded.
Flink SQL> use catalog kafka_catalog;
>
[INFO] Execute statement succeeded.
Flink SQL> CREATE DATABASE IF NOT EXISTS kafka_catalog.inbound;
>
[INFO] Execute statement succeeded.
Flink SQL> CREATE DATABASE IF NOT EXISTS kafka_catalog.outbound;
>
[INFO] Execute statement succeeded.
Flink SQL> CREATE CATALOG postgres_catalog WITH (
> 'type' = 'hive',
> 'hive-conf-dir' = './conf/'
> );
[INFO] Execute statement succeeded.
Flink SQL> use catalog kafka_catalog;
[INFO] Execute statement succeeded.
Flink SQL> show databases;
+---------------+
| database name |
+---------------+
| default |
| inbound |
| outbound |
+---------------+
3 rows in set
Flink SQL> use catalog postgres_catalog;
[INFO] Execute statement succeeded.
Flink SQL> show databases;
+---------------+
| database name |
+---------------+
| default |
| inbound |
| outbound |
+---------------+
3 rows in set
Flink SQL> use catalog kafka_catalog;
[INFO] Execute statement succeeded.
Flink SQL> create database test;
[INFO] Execute statement succeeded.
Flink SQL> use catalog postgres_catalog;
[INFO] Execute statement succeeded.
Flink SQL> show databases;
+---------------+
| database name |
+---------------+
| default |
| inbound |
| outbound |
| test |
+---------------+
vasanth loka
10/16/2025, 6:52 PM
Itamar Weiss
10/21/2025, 6:04 AM
Royston
10/21/2025, 6:13 AM
Vikas Patil
10/21/2025, 2:37 PM
// ApplicationReconciler.java
} else if (requireHaMetadata && flinkService.atLeastOneCheckpoint(deployConfig)) {
    // Last state deployment, explicitly set a dummy savepoint path to avoid accidental
    // incorrect state restore in case the HA metadata is deleted by the user
    deployConfig.set(SavepointConfigOptions.SAVEPOINT_PATH, LAST_STATE_DUMMY_SP_PATH);
    status.getJobStatus().setUpgradeSavepointPath(LAST_STATE_DUMMY_SP_PATH);
} else {
    deployConfig.removeConfig(SavepointConfigOptions.SAVEPOINT_PATH);
• Flink, given any savepoint/checkpoint pointer, immediately resolves that path and fails fast on the dummy before any HA path is considered:
public static FsCompletedCheckpointStorageLocation resolveCheckpointPointer(String checkpointPointer) throws IOException {
    final Path path;
    try { path = new Path(checkpointPointer); } catch (Exception e) { /* invalid URI */ }
    final FileSystem fs = path.getFileSystem();
    final FileStatus status;
    try { status = fs.getFileStatus(path); } catch (FileNotFoundException e) {
        throw new FileNotFoundException("Cannot find checkpoint or savepoint file/directory '"
                + checkpointPointer + "' on file system '" + fs.getUri().getScheme() + "'.");
    }
    // ...
}
Why is this the case? Does anyone have any context on how to solve this?
FAS
10/22/2025, 4:45 AM
assume-role is handled by the Iceberg catalog.
Scenario
1. Account #1: Runs a Flink (1.19.1) job on an EKS cluster.
2. Account #2: Hosts Iceberg tables in an S3 bucket (s3://account-2-bucket-iceberg/dbstore1/) and manages metadata using the AWS Glue Catalog (awsAccount2Id).
3. Permissions:
◦ The Flink EKS pod in Account #1 has a Service Account configured with OIDC.
◦ This Service Account assumes a cross-account role (arn:aws:iam::awsAccount2Id:role/cross-account-role) in Account #2.
4. Verification:
◦ I have `exec`'d into the running Flink pod.
◦ From the pod, I can successfully use the AWS CLI to assume the cross-account role.
◦ After assuming the role, I can successfully list the Glue databases and tables in Account #2.
◦ This confirms the underlying EKS OIDC, IAM roles, and network access are all correctly configured.
The Challenge
In my Flink job, I first define the catalog for Account #2.
1. Create Catalog (Success) This SQL statement executes successfully, and the Flink logs confirm it: 2025-10-22 03:57:00,929 INFO ... - SQL statement executed successfully. sql=CREATE CATALOG `awsAccount2Id` ...
SQL
CREATE CATALOG `awsAccount2Id`
WITH (
  'type' = 'iceberg',
  'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
  'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
  'warehouse' = 's3://account-2-bucket-iceberg/dbstore1/',
  'client.assume-role.arn' = 'arn:aws:iam::awsAccount2Id:role/cross-account-role',
  'glue.catalog-id' = 'awsAccount2Id',
  'client.region' = 'us-east-1',
  'client.credentials-provider' = 'software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider'
);
2. Select from Catalog (Failure) Immediately after the catalog is created, my Flink job executes the following SELECT query:
SQL
SELECT
....
FROM `awsAccount2Id`.`dbstore1`.table1
/*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */;
This query fails with a validation error:
2025-10-22 03:57:06,710 ERROR ... - Failed to execute SQL statement:
SELECT ...
FROM `awsAccount2Id`.`dbstore1`.table1 ...
;
org.apache.flink.table.api.ValidationException: SQL validation failed. From line 11, column 6 to line 11, column 59: Object 'dbstore1' not found within 'awsAccount2Id'
I also noticed that when Flink logs the list of available databases, it only shows databases from Account #1, not the cross-account ones from Account #2.
My Question
My expectation was that by defining client.assume-role.arn and glue.catalog-id in the CREATE CATALOG statement, any subsequent Flink SQL operations referencing the awsAccount2Id catalog (like my SELECT query) would automatically use those settings to assume the role and query the Glue catalog in Account #2.
Why is Flink reporting that the database dbstore1 is "not found," even though the catalog was created successfully and configured to assume a role that can see that database? I can see tables from this database when I manually assume-role using aws-cli from that pod.
It seems the SELECT query is not honoring the catalog's assume-role configuration and is somehow still querying the default Glue catalog in Account #1. Is this expected, or am I missing a configuration step for Flink to correctly use the assumed role for metadata discovery after the catalog is created?
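Not a definitive answer, but one thing to check: in Iceberg, client.assume-role.arn is (as far as I know) only honored when the catalog is also told to build its AWS clients via the assume-role client factory; with the default factory, or with a client.credentials-provider override, the Glue and S3 clients can still be created from the pod's base credentials, i.e. Account #1's default Glue catalog. Below is a hedged sketch of the catalog definition I would try. The property names client.factory, client.assume-role.region and glue.id are from my recollection of the Iceberg AWS docs (the key may be glue.id rather than glue.catalog-id), so please verify them against your Iceberg version.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CrossAccountGlueCatalogSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Sketch only: 'client.factory' switches Iceberg to the assume-role client factory so that
        // 'client.assume-role.arn' is actually used when creating the Glue and S3 clients.
        // 'glue.id' is (I believe) the property selecting the Glue catalog of the other account.
        tEnv.executeSql(
                "CREATE CATALOG `awsAccount2Id` WITH (\n"
                        + "  'type' = 'iceberg',\n"
                        + "  'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',\n"
                        + "  'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',\n"
                        + "  'warehouse' = 's3://account-2-bucket-iceberg/dbstore1/',\n"
                        + "  'client.factory' = 'org.apache.iceberg.aws.AssumeRoleAwsClientFactory',\n"
                        + "  'client.assume-role.arn' = 'arn:aws:iam::awsAccount2Id:role/cross-account-role',\n"
                        + "  'client.assume-role.region' = 'us-east-1',\n"
                        + "  'glue.id' = 'awsAccount2Id'\n"
                        + ")");

        // If the assumed role is picked up for metadata discovery, dbstore1 should show up here.
        tEnv.executeSql("USE CATALOG `awsAccount2Id`");
        tEnv.executeSql("SHOW DATABASES").print();
    }
}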
Jaya Ananthram
10/22/2025, 9:10 AM
מייקי בר יעקב
10/22/2025, 11:18 PM
Elad
10/23/2025, 8:43 AM
Eric Huang
10/23/2025, 1:36 PM
2025-10-23 14:05:32,327 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Match[7] -> Calc[8] -> SinkConversion[9] (21/96) (dcdc4daa8ced8ca9d2b8fc6c58e26129_310a79d541a763d57b050aae3bf30f0a_20_1) switched from INITIALIZING to RUNNING.
2025-10-23 14:05:32,327 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Match[7] -> Calc[8] -> SinkConversion[9] (90/96) (dcdc4daa8ced8ca9d2b8fc6c58e26129_310a79d541a763d57b050aae3bf30f0a_89_1) switched from INITIALIZING to RUNNING.
2025-10-23 14:05:32,327 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Match[7] -> Calc[8] -> SinkConversion[9] (59/96) (dcdc4daa8ced8ca9d2b8fc6c58e26129_310a79d541a763d57b050aae3bf30f0a_58_1) switched from INITIALIZING to RUNNING.
2025-10-23 14:05:32,327 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Match[7] -> Calc[8] -> SinkConversion[9] (17/96) (dcdc4daa8ced8ca9d2b8fc6c58e26129_310a79d541a763d57b050aae3bf30f0a_16_1) switched from INITIALIZING to RUNNING.
2025-10-23 14:05:32,400 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - DelayableMessageProcess -> Sink: xiaoxiang_reach_system (1/1) (dcdc4daa8ced8ca9d2b8fc6c58e26129_0a53a086337bb3f8a33ad689643a92fc_0_1) switched from INITIALIZING to RUNNING.
2025-10-23 14:10:28,717 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 191 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1761199828716 for job ffffffffbd30e5570000000000000001.
2025-10-23 14:10:30,023 INFO org.apache.flink.runtime.state.SharedStateRegistryImpl [] - state self-sustained:true, lastCompletedCheckpoint:191, earliestDependent:9223372036854775807, highestNotClaimedCheckpointID:-1
2025-10-23 14:10:30,023 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - job ffffffffbd30e5570000000000000001 checkpoint 191 completed, job is state-sustained
2025-10-23 14:10:30,207 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 191 for job ffffffffbd30e5570000000000000001 (280739389 bytes, checkpointDuration=1425 ms, finalizationTime=66 ms).
2025-10-23 14:15:28,717 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 192 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1761200128716 for job ffffffffbd30e5570000000000000001.
2025-10-23 14:15:29,030 INFO org.apache.flink.runtime.state.SharedStateRegistryImpl [] - state self-sustained:true, lastCompletedCheckpoint:192, earliestDependent:9223372036854775807, highestNotClaimedCheckpointID:-1
2025-10-23 14:15:29,030 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - job ffffffffbd30e5570000000000000001 checkpoint 192 completed, job is state-sustained
2025-10-23 14:15:29,096 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 192 for job ffffffffbd30e5570000000000000001 (317081932 bytes, checkpointDuration=335 ms, finalizationTime=45 ms).
2025-10-23 14:16:37,533 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Match[7] -> Calc[8] -> SinkConversion[9] (12/96) (dcdc4daa8ced8ca9d2b8fc6c58e26129_310a79d541a763d57b050aae3bf30f0a_11_1) switched from RUNNING to FAILED on session-2123414-1761142270-taskmanager-1-10 @ hldy-data-k8s-flink-ssd-node03895.mt (dataPort=23347).
java.lang.NullPointerException: null
at org.apache.flink.cep.nfa.sharedbuffer.SharedBufferAccessor.materializeMatch(SharedBufferAccessor.java:213) ~[flink-cep-1.16.1.jar:1.16.1]
at org.apache.flink.cep.nfa.NFA.processMatchesAccordingToSkipStrategy(NFA.java:474) ~[flink-cep-1.16.1.jar:1.16.1]
at org.apache.flink.cep.nfa.NFA.advanceTime(NFA.java:337) ~[flink-cep-1.16.1.jar:1.16.1]
at org.apache.flink.cep.operator.CepOperator.advanceTime(CepOperator.java:429) ~[flink-cep-1.16.1.jar:1.16.1]
at org.apache.flink.cep.operator.CepOperator.onEventTime(CepOperator.java:325) ~[flink-cep-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.advanceWatermark(InternalTimerServiceImpl.java:302) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.api.operators.InternalTimeServiceManagerImpl.advanceWatermark(InternalTimeServiceManagerImpl.java:180) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:599) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitWatermark(OneInputStreamTask.java:239) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.watermarkstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:200) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.watermarkstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:105) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:136) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:552) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:843) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:792) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:969) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:948) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:751) ~[flink-dist-1.16.1.jar:1.16.1]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:571) ~[flink-dist-1.16.1.jar:1.16.1]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_312]
2025-10-23 14:16:37,534 INFO org.apache.flink.runtime.executiongraph.failover.flip1.TaskManagerRestartStrategy [] - Received failure event: TaskFailureEvent{taskManagerId=session-2123414-1761142270-taskmanager-1-10, timestamp=1761200197533, cause=NullPointerException: null}, excluded: false
2025-10-23 14:16:37,534 INFO org.apache.flink.runtime.executiongraph.failover.flip1.TaskManagerRestartStrategy [] - Resetting restart strategy state due to stable running period
2025-10-23 14:16:37,536 INFO org.apache.flink.runtime.executiongraph.failover.flip1.ContinuousRestartLimitation [] - Earliest failure timestamp: 1761199499926, max continuous restart duration: 28800000 ms
2025-10-23 14:16:37,536 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - 423 tasks will be restarted to recover the failed task dcdc4daa8ced8ca9d2b8fc6c58e26129_310a79d541a763d57b050aae3bf30f0a_11_1.
2025-10-23 14:16:37,536 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job rt_scene_51697_staging (ffffffffbd30e5570000000000000001) switched from state RUNNING to RESTARTING.
Mrugesh Kadia
10/24/2025, 9:48 AM
Mohsen Rezaei
10/27/2025, 11:12 PM
GROUP BY clause. I filed an issue for this to show some details on what's going on here, but I was curious if anyone else has run into this since it's a very basic test against Flink 2.1? Running that scenario in a sync state works fine, but is not going to be ideal for more complex scenarios
Arman shakeri
10/28/2025, 10:09 AM
Tiago Pereira
10/28/2025, 12:11 PM
Manish Jain
10/28/2025, 1:49 PM
stakater/reloader:v1.0.29 to reload the pods when a config changes. But the annotations that work with other pods are not working with the Flink components.
Is anyone using a similar setup and has run into such a problem? We don't want to create a custom solution for job restarts, and manual restarts are not optimal.
Francisco Morillo
10/28/2025, 7:29 PM
Noufal Rijal
10/29/2025, 5:59 AM
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-session
  namespace: flink-test
spec:
  image: pyflink-session-test:v1.0
  flinkVersion: v1_20
  imagePullPolicy: Always
  serviceAccount: flink
  mode: standalone
  flinkConfiguration:
    # All Flink runtime config keys go here
    fs.allowed-fallback-filesystems: "file"
    io.tmp.dirs: "/tmp"
    taskmanager.numberOfTaskSlots: "4"
    # ===== OPERATOR AUTOSCALER =====
    kubernetes.operator.job.autoscaler.enabled: "true"
    kubernetes.operator.job.autoscaler.target.utilization: "0.7"
    kubernetes.operator.job.autoscaler.target.utilization.boundary: "0.2"
    kubernetes.operator.job.autoscaler.stabilization.interval: "1m"
    kubernetes.operator.job.autoscaler.metrics.window: "5m"
    kubernetes.operator.job.autoscaler.scale-up.grace-period: "1m"
    kubernetes.operator.job.autoscaler.scale-down.grace-period: "5m"
  # # 💡 MOVED: jobManager must be a direct child of 'spec'
  jobManager:
    replicas: 1
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    replicas: 2
    resource:
      # memory: "10240m"
      memory: "2048m"
      cpu: 2
---
apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: ppe-kafka-streaming
  namespace: flink-test
spec:
  deploymentName: flink-session
  job:
    jarURI: /opt/flink/opt/flink-python-1.20.3.jar
    # entryClass: org.apache.flink.client.python.PythonDriver
    # args:
    #   # 1. Main PyFlink Script
    #   - "--python"   # Changed from -py
    #   - "/opt/flink/usrlib/streaming_test.py"
    #   # 2. Python Archives
    #   - "--pyArchives"   # Changed from -pyarch
    #   - "blob_path#flink_venv"
    #   # 3. Python Executable
    #   - "-pyexec"   # This is correct!
    #   - "flink_venv/bin/python"
    args:
      - "-py"
      - "/opt/flink/usrlib/streaming_test.py"
      - "-pyarch"
      - "blob_path#flink_venv"
      - "-pyexec"
      - "flink_venv/bin/python"
    parallelism: 2
Request for your help if you have faced and tackled a similar issue.
#C065944F9M2 #C03G7LJTS2G #C03GV7L3G2C
Saketh
10/29/2025, 7:09 AM
Mohamed Galal
10/29/2025, 7:19 AM
André Santos
10/29/2025, 7:13 PM
AbstractFlinkService.submitJobToSessionCluster() - it bypasses rest pod leader discovery entirely.
Royston
10/30/2025, 11:43 AM
Iain Dixon
10/30/2025, 1:23 PM
inPoolUsage and outPoolUsage are good metrics with which to assess the presence of backpressure. To test things out I built a really simple setup, as seen in the picture below: records are generated in the generator (via a loop to create the rate and a Thread.sleep at the end to buff out the rest of the second, based on the DS2 wordcount here https://github.com/strymon-system/ds2/blob/master/flink-examples/src/main/java/ch/ethz/systems/strymon/ds2/flink/wordcount/sources/RateControlledSourceFunction.java), sent to a pipeline workload simulator (a single operator which counts the number of received records and runs a Thread.sleep at different frequencies in order to simulate pipeline workload), and finally to a sink where records are received but not saved or sent onwards. I bound the parallelism of each operator to 1 (to create the minimal possible pipeline). The generator produces a constant workload of 1000 records per second, and the workload simulator produces a constant amount of work for every n records.
[image: experimental_setup]
I expected that outPoolUsage would trend up to some value (as roughly 500ms of "work" should be created every second) and remain relatively constant at that value, rather than dipping back down and jumping up as seen in the graph. I'm not sure what mechanism in Flink would be responsible for this behaviour if the workload is constant, and I was wondering if anyone working on Flink could explain what's occurring or point me in the right direction. I'm aware (from the linked blog post above) that the outPoolUsage metric is an aggregation of the floatingBuffersUsage and exclusiveBuffersUsage metrics, so the dropping to 10% would be one of the exclusiveBuffers, but why would floating buffers come and go if the pipeline workload and arrival rates are constant?
[image: buffer_question]
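In case it helps anyone reproduce this, below is a minimal sketch of the kind of rate-controlled generator described above, modeled loosely on the linked DS2 RateControlledSourceFunction (class and record names here are made up): emit a fixed number of records in a loop, then sleep out the remainder of the second.

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class RateControlledSource implements SourceFunction<String> {
    private final int recordsPerSecond;
    private volatile boolean running = true;

    public RateControlledSource(int recordsPerSecond) {
        this.recordsPerSecond = recordsPerSecond;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            long start = System.currentTimeMillis();
            // Emit the per-second batch as fast as possible...
            for (int i = 0; i < recordsPerSecond && running; i++) {
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect("record-" + i);
                }
            }
            // ...then sleep out ("buff out") whatever is left of the second.
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed < 1000) {
                Thread.sleep(1000 - elapsed);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}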