# troubleshooting
g
Hi all, I wanted to understand a few things in Flink's Session clusters.

1. We observed that the JVM Metaspace keeps increasing as we deploy / cancel jobs on the cluster. Is this expected? Is there any documentation that explains this behavior? We tend to deploy new jobs almost every day and they are long-running streaming pipelines. We want to capacity plan and make sure that enough memory is allocated on the job manager so it doesn't crash (it has crashed before due to a metaspace issue).
2. When we restart a particular pipeline, the checkpointed state is lost and we send duplicate data again, because the `MapState` we use is empty on restart and starts building up again. We are saving checkpoints on S3. Below is my checkpointing config:
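```java
// Checkpoint every 30 seconds (30000 ms) with exactly-once semantics.
this.env.enableCheckpointing(30000, CheckpointingMode.EXACTLY_ONCE);
// Keep working state as objects on the JVM heap.
this.env.setStateBackend(new HashMapStateBackend());
final CheckpointConfig config = env.getCheckpointConfig();
// Keep the externalized checkpoint when the job is cancelled.
config.setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// Write checkpoint data to the configured bucket path.
config.setCheckpointStorage(checkPointBucket);
config.setTolerableCheckpointFailureNumber(applicationConfiguration.getCheckpointFailureTolerance());
```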
3. Are there ways to optimize the pipelines so there is less data transfer across task managers as streaming data progresses through to further stages in the pipeline?
j
For 1: If you are using JDBC drivers, this is kind of expected; see this to solve it
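
To make the JDBC point a bit more concrete: as far as I understand it, `java.sql.DriverManager` lives in the JVM's own classloader and keeps a reference to every registered driver. If the driver class was loaded by a job's user-code classloader, that reference can keep the whole classloader (and its metaspace) alive after the job is cancelled. Below is a minimal sketch of one possible mitigation, deregistering those drivers on shutdown. The class and method names (`JdbcDriverCleanup`, `deregisterDriversLoadedBy`) are made up for illustration, and this is not necessarily what the link above describes.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Enumeration;

// Hypothetical helper, not part of Flink or of any JDBC driver.
final class JdbcDriverCleanup {

    private JdbcDriverCleanup() {}

    /**
     * Deregisters every JDBC driver whose class was loaded by the given
     * classloader and returns how many drivers were removed.
     */
    static int deregisterDriversLoadedBy(ClassLoader loader) {
        int removed = 0;
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver driver = drivers.nextElement();
            // Only touch drivers that belong to this classloader; drivers
            // loaded elsewhere (e.g. from a shared lib directory) stay registered.
            if (driver.getClass().getClassLoader() == loader) {
                try {
                    DriverManager.deregisterDriver(driver);
                    removed++;
                } catch (SQLException e) {
                    // Best effort: report and continue with the remaining drivers.
                    System.err.println("Could not deregister " + driver + ": " + e.getMessage());
                }
            }
        }
        return removed;
    }
}
```

One place this could be called is the `close()` method of a rich function, passing `getRuntimeContext().getUserCodeClassLoader()`. The other option I'm aware of is to keep the driver jar out of the job jar and put it in Flink's `lib/` folder, so it is loaded once by the parent classloader rather than once per deployment.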