# troubleshooting
m
Hi all, according to the documentation, the state backend is ignored in batch mode. We wanted to run a batch every day instead of using streaming mode, but we would need to initialise the state from the previous batch. Is it possible to do that? Otherwise, I guess we will have to use streaming mode. We could also initialise the state with an initial batch and then switch to streaming mode using a savepoint, but ideally we would prefer to stick with batch mode.
d
Apache Flink differentiates quite fundamentally between batch processing and stream processing modes. In batch mode, jobs are expected to process a finite amount of data and terminate, whereas in streaming mode, jobs continuously process unbounded data streams, often requiring state management for windows, aggregations, etc. over time. When Flink operates in batch mode, it optimizes for processing static data sets efficiently, ignoring state backends, because the expectation is that all required data is available upfront and results are computed before the job terminates. Persisting and restoring state across batch job executions is therefore not directly supported the way it is in streaming mode.

There are a few strategies you can consider to achieve daily "batch" processing that leverages state from the previous run:

1. Use streaming mode with daily windows: Instead of running a pure batch job daily, set up a streaming job that processes data as it arrives but uses time windows (e.g., daily tumbling windows) to perform the daily computations. This way, you can leverage Flink's state backend to maintain state between windows, and you can initialize the job with a savepoint containing the state from the last window computation (see the sketch at the end of this message).

2. Initial batch + streaming continuation: As you mentioned, you could start with an initial batch job to process historical data and initialize the state, then switch to streaming mode using a savepoint taken from this initial run. The challenge here is transitioning smoothly from batch to stream without downtime or data duplication, which may require careful orchestration.

3. Periodic savepoints for state recovery: If you decide on streaming mode, you can schedule periodic savepoints to capture the state at regular intervals. In case of failures, or if you want to restart your job with a specific state, you can restore from these savepoints. While this doesn't mimic a pure batch setup, it lets you retain and reuse state.

4. Custom state initialization logic: Although not straightforward and not natively supported by Flink, you could design your application to read an initial state from external storage at the beginning of each daily batch execution. This requires manually managing and updating that external state store after each batch run to prepare for the next day.

5. Revisit the job design: Depending on your exact use case, it might be worth reconsidering whether a strict batch processing model is required. Since Flink can handle both batch and stream processing effectively, leveraging its streaming capabilities with appropriate windowing and state management might provide a more elegant solution, especially when continuity of state is crucial.

Given your preference for batch mode, the closest you might get to maintaining state between runs is through external means (option 4), but this adds complexity to your system. Most likely, adopting a streaming approach with careful windowing and state management (option 1 or 2) would align better with Flink's design principles and offer more robustness for such recurring daily processing scenarios.
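To make option 1 concrete, here is a minimal sketch of a streaming job with daily tumbling windows. It assumes a keyed Tuple2<String, Long> stream; the source, keys, timestamp assignment, and checkpoint interval are all placeholders for whatever your actual pipeline uses:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class DailyWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint regularly so the job can always be stopped with a savepoint
        // and later resumed from it, keeping all keyed state.
        env.enableCheckpointing(60_000);

        // Placeholder source: in a real job this would be Kafka, files, etc.,
        // with event timestamps coming from the records themselves.
        DataStream<Tuple2<String, Long>> events = env
                .fromElements(Tuple2.of("key-a", 1L), Tuple2.of("key-b", 2L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                                .withTimestampAssigner((event, ts) -> System.currentTimeMillis()));

        // One tumbling window per day: the daily result is emitted when the window
        // fires, while keyed state lives in the state backend between windows.
        events.keyBy(e -> e.f0)
              .window(TumblingEventTimeWindows.of(Time.days(1)))
              .sum(1)
              .print();

        env.execute("daily-window-job");
    }
}
```

Stopping this job with a savepoint and restarting from it later is the standard way to carry keyed state across restarts, so the "state from the previous run" comes for free from the savepoint mechanism.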
m
What I explored is using the State Processor API to:
• read a savepoint to initialise the state at the beginning of the batch
• create a savepoint at the end of the batch
But I haven't found a way to initialise the state from a savepoint using the State Processor API. So basically option 4, with the State Processor API. The reason to consider batch mode is that it feels cheaper than doing real-time computation. For some scenarios where we don't have that much data, it feels better to just compute, once a day, all the data we have received so far. Another option, I guess, would be to stop/resume the job in streaming mode, as we are using the Flink K8s operator. Doing that would just reuse the last savepoint.
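For reference, this is roughly the shape of what I was attempting. It is a sketch only: it targets the DataStream-based State Processor API of recent Flink versions (roughly 1.17+; older releases expose a DataSet-based Savepoint.load instead), and the uid "counter-op", the savepoint paths, and the ValueState<Long> layout are all made up for illustration:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.SavepointReader;
import org.apache.flink.state.api.SavepointWriter;
import org.apache.flink.state.api.StateBootstrapTransformation;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class DailyStateRollForward {

    // Reads one ValueState<Long> per key out of yesterday's savepoint.
    static class CounterReader extends KeyedStateReaderFunction<String, Tuple2<String, Long>> {
        transient ValueState<Long> counter;

        @Override
        public void open(Configuration parameters) {
            counter = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("counter", Types.LONG));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<Tuple2<String, Long>> out)
                throws Exception {
            out.collect(Tuple2.of(key, counter.value()));
        }
    }

    // Writes the merged per-key totals into keyed state for the new savepoint.
    static class CounterBootstrapper extends KeyedStateBootstrapFunction<String, Tuple2<String, Long>> {
        transient ValueState<Long> counter;

        @Override
        public void open(Configuration parameters) {
            counter = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("counter", Types.LONG));
        }

        @Override
        public void processElement(Tuple2<String, Long> value, Context ctx) throws Exception {
            counter.update(value.f1);
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The State Processor API runs on the DataStream API in batch execution
        // mode, so the keyed aggregation below emits one final value per key.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // 1. Read yesterday's state; "counter-op" must match the uid() of the
        //    stateful operator in the job that produced the savepoint.
        SavepointReader reader = SavepointReader.read(
                env, "s3://bucket/savepoints/yesterday", new HashMapStateBackend());
        DataStream<Tuple2<String, Long>> previous =
                reader.readKeyedState("counter-op", new CounterReader());

        // 2. Merge with today's input (placeholder source) and re-aggregate per key.
        DataStream<Tuple2<String, Long>> today = env.fromElements(Tuple2.of("key-a", 5L));
        DataStream<Tuple2<String, Long>> merged =
                previous.union(today).keyBy(t -> t.f0).sum(1);

        // 3. Write the merged totals out as a brand-new savepoint for the next run
        //    (128 is the max parallelism recorded in the new savepoint).
        StateBootstrapTransformation<Tuple2<String, Long>> bootstrap =
                OperatorTransformation.bootstrapWith(merged)
                        .keyBy(t -> t.f0)
                        .transform(new CounterBootstrapper());

        SavepointWriter.newSavepoint(env, new HashMapStateBackend(), 128)
                .withOperator("counter-op", bootstrap)
                .write("s3://bucket/savepoints/today");

        env.execute("daily-state-roll-forward");
    }
}
```

Note that this reads the savepoint into a regular DataStream rather than initialising operator state in place, which as far as I can tell is the closest the State Processor API gets; the daily job becomes a read-merge-write roll-forward. That said, suspending/resuming the streaming job via the Flink K8s operator is probably simpler operationally, since stop-with-savepoint and restore are first-class operations there.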