# troubleshooting
m
I'm using Apache Flink 1.15 with RocksDB state backend. Is the Compaction filter strategy enabled by default?
d
No, in Flink 1.15 it’s not on by default; you need to apply a RocksDBOptionsFactory to the RocksDBStateBackend
```java
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class CustomRocksDBOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        // Database-wide RocksDB options can be adjusted here if needed
        return currentOptions;
    }

    @Override
    public ColumnFamilyOptions createColumnFamilyOptions(ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        // Each Flink state lives in its own column family; a custom
        // compaction filter factory would be set on currentOptions here
        return currentOptions;
    }
}
```
It’s applied like this:
```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

// Inside your Flink job setup
RocksDBStateBackend rocksDBBackend = new RocksDBStateBackend(
    "file:///path/to/checkpoint_directory", // your checkpoint directory URI
    true                                    // enable incremental checkpoints
);

// Apply the custom RocksDB configuration
rocksDBBackend.setRocksDBOptions(new CustomRocksDBOptionsFactory());
```
m
I'm using AWS's Managed Apache Flink, so I think I can't configure RocksDB in code
Instead, can I ask the AWS support team to configure the property `state.backend.rocksdb.ttl.compaction.filter.enabled`?
d
hmm. that might be a question for AWS team. They might be applying it already or have another way to apply it
I would think they have some way to activate this. But with Flink 1.15 I believe it requires a custom implementation
m
Why is it more difficult to activate in Flink 1.15 than in Flink 1.8? 🤯
d
It’s not necessarily more difficult; it has to do with the requirement for a custom compaction filter. There may be other ways in 1.15 to achieve your objectives.
You can use StateTtlConfig to set some things
Starting with Flink 1.11 you can also use a more aggressive periodic cleanup strategy vs. the default lazy cleanup
Also if you use incremental checkpoints (assuming this is an option for you), expired state is not included in new checkpoint snapshots
That will reduce the size of the checkpoint data and thus the cleanup effort
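The StateTtlConfig route can be sketched roughly like this (the state name, the one-hour TTL, and the compaction-filter cleanup interval are illustrative, not taken from your job):
```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Illustrative TTL of one hour; expired entries are never returned and are
// additionally dropped during RocksDB compaction (filter checks the time
// every 1000 processed entries)
StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.hours(1))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .cleanupInRocksdbCompactFilter(1000)
    .build();

ValueStateDescriptor<String> descriptor =
    new ValueStateDescriptor<>("my-state", String.class); // hypothetical state name
descriptor.enableTimeToLive(ttlConfig);
```
As far as I can tell, in 1.15 the `cleanupInRocksdbCompactFilter` strategy uses the TTL compaction filter built into the Flink-bundled FRocksDB, so no custom filter code is needed for plain TTL expiry.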
m
mmm no. I use incremental checkpoints but the checkpoint size is always growing
d
well, are you also using periodic cleanup at the same time?
Make sure TTL is correctly set for all relevant state descriptors. And you are enabling background periodic cleanup
```
StateTtlConfig#enableCleanupInBackground(true)
```
Beyond that make sure that all TTL state is configured correctly for all relevant state descriptors.
You should use Flink metrics to monitor `state.backend.rocksdb.total-delayed-state-size` and also `num-delayed-keys`
It can also be indirectly helpful to optimize RocksDB config like `write-buffer-size` and `max-write-buffer-number`
Look into any long-lived records that are exceeding their TTL and not getting expired
This could be an issue with timestamps (the type of timestamp used) or with how the watermark strategy is implemented
There are many possibilities for what causes the buildup. You need to monitor and systematically eliminate the possibilities. Let us know what you find. I am quite sure that you do not depend on the compaction filter to achieve your goals, although it's a viable option if you can configure it on AWS.
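For the monitoring side, Flink can expose RocksDB's native metrics through its metric system; they are off by default and are switched on per metric in `flink-conf.yaml`. A minimal sketch using option keys I believe exist in Flink 1.15 (pick whichever metrics you care about):
```yaml
# Enable a few RocksDB native metrics (reported per column family, i.e. per state)
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.total-sst-files-size: true
state.backend.rocksdb.metrics.num-running-compactions: true
```
Watching `total-sst-files-size` per state over time should tell you which state is actually growing.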
m
I'm not currently using a periodic cleanup strategy
thanks for the help!
The method enableCleanupInBackground doesn't exist; disableCleanupInBackground exists instead
d
oh ok. I wonder if that means it's on by default. Not sure.
m
The answer is here
d
Yes, so since 1.10 it appears it's on by default. I would say then you need to look closer at what's happening with the state. Log at debug level and set the TTL on some state. Take snapshots before and after to see if it's being cleaned up
I think we should conclude that the TTL config on some state descriptors is probably incorrect, or that there is an issue with timestamps/watermarks. Enable debug logs to take a closer look
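Enabling debug logs for the relevant classes could look like this in a log4j2 `log4j.properties` (the logger names are my assumption about which Flink packages emit the state/TTL logs):
```properties
# Raise verbosity for Flink's TTL state support and the RocksDB state backend
logger.ttl.name = org.apache.flink.runtime.state.ttl
logger.ttl.level = DEBUG
logger.rocksdb.name = org.apache.flink.contrib.streaming.state
logger.rocksdb.level = DEBUG
```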
m
What log should I see when a cleanup is done?
d
you should see TTL initialization logs like “Starting cleanup of expired state”
“cleaned up x bytes of state for key Y in descriptor Z”. Total cleaned up state xMB etc
With debug you might see more granular as well
It’s also important to watch the progression of watermarks to make sure thats working correctly.
This is a good use of time, as watermarks are a fundamental aspect of Flink stream processing. You want to know how to verify that your watermarks are implemented correctly
“Observed Watermark X for Stream Y” etc
For ending you might see “Finish State cleanup task” or similar.
m
I'm using ProcessingTime, not watermarks
d
hmm …
That could likely be the problem right there
It's difficult to get ProcessingTime to work correctly and deal with delays etc.