Apache Flink differentiates between batch processing and stream processing modes quite fundamentally. In batch mode, jobs are expected to process a finite amount of data and terminate, whereas in stream processing mode, jobs continuously process unbounded data streams, often requiring state management for handling windows, aggregations, and similar operations over time.
When Flink operates in batch mode, it optimizes for processing static data sets efficiently and does not checkpoint state, since the expectation is that all required data is available upfront and results are computed before the job terminates. Persisting and restoring state across batch job executions is therefore not directly supported the way it is in streaming mode.
There are a few strategies you can consider to achieve a daily “batch” processing that leverages state from the previous run:
1. Use Streaming Mode with Daily Windows: Instead of running a pure batch job daily, you can set up a streaming job that processes data as it comes in but uses time windows (e.g., daily tumbling windows) to perform daily computations. This way, you can leverage Flink’s state backend to maintain state between windows. You can initialize the job with a savepoint containing the state from the last window computation.
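To make the windowing idea concrete, here is a plain-Python illustration (not Flink API) of daily tumbling-window semantics: each event is assigned to exactly one epoch-aligned one-day window, and a per-key aggregate is kept per window, which is essentially the state Flink's window operator maintains for you. The event shape and the count aggregate are illustrative only:

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000

def window_start(ts_ms: int) -> int:
    # Tumbling windows of one day: each event falls into exactly one
    # window [start, start + DAY_MS), aligned to the epoch.
    return ts_ms - (ts_ms % DAY_MS)

def daily_counts(events):
    """events: iterable of (key, timestamp_ms) pairs.

    Returns {(key, window_start_ms): count}, mimicking a per-key
    daily tumbling-window count aggregation.
    """
    counts = defaultdict(int)
    for key, ts in events:
        counts[(key, window_start(ts))] += 1
    return dict(counts)
```

In a real Flink job the same grouping would be expressed with a keyed stream and a one-day tumbling event-time window, and Flink's state backend would hold the per-window aggregates for you, surviving restarts via checkpoints and savepoints.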
2. Initial Batch + Streaming Continuation: As you mentioned, you could start with an initial batch job to process historical data and initialize the state, then switch to streaming mode from a savepoint. Note that jobs running in batch execution mode cannot take savepoints directly; the usual workaround is Flink's State Processor API, which lets a batch-style program write out a savepoint that the streaming job then starts from. The challenge is transitioning smoothly from batch to stream without downtime or data duplication, which may require careful orchestration.
3. Periodic Savepoints for State Recovery: If you decide on streaming mode, you can schedule periodic savepoints to capture the state at regular intervals. In case of failures or if you want to restart your job with a specific state, you can restore from these savepoints. While this doesn’t mimic a pure batch setup, it allows you to retain and reuse state.
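One way to schedule periodic savepoints from outside the job is Flink's monitoring REST API, which exposes an asynchronous savepoint trigger at `POST /jobs/:jobid/savepoints`. A minimal sketch in Python, assuming a JobManager reachable at `localhost:8081`; the endpoint and request body follow Flink's REST API, but the host, the target directory, and how you poll the returned trigger id are up to your setup:

```python
import json
import urllib.request

# Assumed JobManager address; adjust for your cluster.
FLINK_REST = "http://localhost:8081"

def savepoint_request(job_id: str, target_dir: str):
    """Build the URL and JSON body for Flink's savepoint trigger endpoint."""
    url = f"{FLINK_REST}/jobs/{job_id}/savepoints"
    body = {"target-directory": target_dir, "cancel-job": False}
    return url, body

def trigger_savepoint(job_id: str, target_dir: str) -> str:
    """Ask the JobManager to take a savepoint; returns the trigger id,
    which can then be polled for completion at
    /jobs/:jobid/savepoints/:triggerid."""
    url, body = savepoint_request(job_id, target_dir)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["request-id"]
```

A daily cron job (or any scheduler) could call `trigger_savepoint` to capture state at regular intervals; restoring is then a matter of resubmitting the job with the savepoint path.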
4. Custom State Initialization Logic: Although not straightforward and not natively supported by Flink, you could potentially design your application to read an initial state from an external storage at the beginning of each daily batch execution. This would require manually managing and updating this external state store after each batch run to prepare for the next day.
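A minimal sketch of that pattern, using a local JSON file as the hypothetical external state store (in practice this might be S3, HDFS, or a database); the file name and the running-total "state" are illustrative only:

```python
import json
from pathlib import Path

# Hypothetical external state store: a local JSON file.
STATE_PATH = Path("daily_job_state.json")

def load_state() -> dict:
    """Read the state left behind by the previous day's run
    (empty on the very first run)."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {}

def save_state(state: dict) -> None:
    """Persist state for tomorrow's run. Write to a temp file and
    rename, so a crash mid-write never leaves a corrupt state file."""
    tmp = STATE_PATH.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(STATE_PATH)

def run_daily_batch(todays_records) -> int:
    state = load_state()
    # Illustrative "state": a running total carried across daily runs.
    # A real setup would seed the actual Flink batch job with `state`
    # here and fold the job's results back in before saving.
    state["running_total"] = state.get("running_total", 0) + sum(todays_records)
    save_state(state)
    return state["running_total"]
```

The orchestration burden is exactly what this sketch hints at: you are responsible for atomicity, for cleaning up stale state, and for keeping the external store in sync with what the batch job actually committed.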
5. Revisit Job Design: Depending on your exact use case, it might be worth reconsidering whether a strict batch processing model is required. Since Flink can handle both batch and stream processing effectively, leveraging its streaming capabilities with appropriate windowing and state management might provide a more elegant solution, especially when continuity of state is crucial.
Given your preference for batch mode, the closest you might get to maintaining state between runs would be through external means (option 4), but this adds complexity to your system. Most likely, adopting a streaming approach with careful windowing and state management (option 1 or 2) would align better with Flink’s design principles and offer more robustness for such recurring daily processing scenarios.