# troubleshooting
m
Hello team, two quick questions on best practices.

1. I'd like to run a Flink job that polls/runs pipelines on a custom schedule. I've seen a few SO posts such as this, but I can't find any docs, examples, or blogs showing a more prod-ready solution. My main concern is fault tolerance: e.g. if I poll once per minute on time-bound data and our app goes down for 10 minutes, I'd want to recover those 10 minutes in their original 1-minute intervals. We use Airflow for this currently, which seems more naturally suited to the task, but it would simplify our workflow if we could run this within Flink.

2. Is it an anti-pattern to store data longish-term within Flink? My idea is a pipeline that reduces daily data to a few metrics and stores the last x days in a FIFO data structure: once a new day's worth of data is in, we drop the oldest day's data and add the new data. I believe this should be possible with a keyed stream and checkpointing, but I'm concerned about fault tolerance and whether it would be better to store this in a database and just query the db to get the data.

Thanks
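(For reference, a minimal sketch of what question 2 could look like with keyed state: a KeyedProcessFunction that keeps the last N days of reduced metrics in checkpointed ListState and drops the oldest entry once the limit is exceeded. `DailyMetric` and `MAX_DAYS` are hypothetical stand-ins for the real metric type and retention.)

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

// Hypothetical record type: one reduced metric per key per day.
public class RollingDailyMetrics
        extends KeyedProcessFunction<String, DailyMetric, List<DailyMetric>> {

    private static final int MAX_DAYS = 30;          // keep the last 30 days (assumption)

    private transient ListState<DailyMetric> lastDays;

    @Override
    public void open(Configuration parameters) {
        // Keyed ListState is included in checkpoints/savepoints, so it survives restarts.
        lastDays = getRuntimeContext().getListState(
                new ListStateDescriptor<>("last-days", DailyMetric.class));
    }

    @Override
    public void processElement(DailyMetric metric, Context ctx,
                               Collector<List<DailyMetric>> out) throws Exception {
        List<DailyMetric> window = new ArrayList<>();
        Iterable<DailyMetric> existing = lastDays.get();
        if (existing != null) {
            existing.forEach(window::add);
        }
        window.add(metric);
        while (window.size() > MAX_DAYS) {
            window.remove(0);                        // FIFO: drop the oldest day
        }
        lastDays.update(window);
        out.collect(window);                         // downstream sees the rolling window
    }
}
```

Since keyed state is part of every checkpoint and savepoint, the rolling window survives restarts as long as the job can restore from them; whether that is acceptable as a long-term store is the business-requirements question discussed below.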
m
What is your use case? Are you talking about batch or streaming jobs?
m
Streaming; in both cases the data is unbounded.
m
So something like: your business logic is external, and you want to poll it every X minutes to see if it has changed and then use the new config?
m
We're running pipelines doing anomaly detection; latency is important, so we want to streamline everything as much as possible, and we have to get frequent updates. Business logic is within the Flink job.

For 1, we need to hit an endpoint to retrieve new data. Currently we have an Airflow job that hits the endpoint and forwards the data to Kafka (which we then read from Flink). Our use case is expanding, so I wanted to look into eliminating Airflow.

For 2, we're reading from Kafka.
m
Why not create a custom connector for Flink that calls your endpoint? That would eliminate Airflow and Kafka for that use case
That would probably work better with Flink's fault tolerance overall.
On 2, it's not an anti-pattern question but a business-requirements one, imho. If you treat Flink state as your system of record, it probably means you have additional requirements in terms of stability, backup and restore, etc. What happens if you can't upgrade Flink because it would break savepoint compatibility and you're stuck on an older version, etc.?
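(For context, a minimal sketch of the custom connector suggested above, polling the endpoint once per minute. `PollResult` and `fetchFromEndpoint()` are hypothetical stand-ins for the real record type and HTTP call; note that `SourceFunction` is deprecated in recent Flink releases in favor of the newer `Source` API (SplitEnumerator/SourceReader), but it shows the shape of the idea compactly.)

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Hypothetical polling source that replaces the Airflow job: call the endpoint
// once per minute and emit the result directly into the Flink pipeline.
public class MinutePollingSource implements SourceFunction<PollResult> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<PollResult> ctx) throws Exception {
        while (running) {
            PollResult result = fetchFromEndpoint();   // the HTTP call Airflow makes today
            synchronized (ctx.getCheckpointLock()) {   // emit under the checkpoint lock
                ctx.collect(result);
            }
            Thread.sleep(60_000);                      // poll once per minute
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private PollResult fetchFromEndpoint() {
        // hypothetical: call the REST endpoint and parse the response
        throw new UnsupportedOperationException("wire up your HTTP client here");
    }
}
```

On its own this gives no recovery of intervals missed while the job was down; the checkpointed-cursor variant further down sketches that part.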
m
Yeah, those are good points. Had a feeling my solution was trying to be too clever.

My thought process was to add a custom connector. I've seen stuff like this, but I'm worried our use case doesn't apply in the same way. Not having a default connector for hitting an API on a schedule (or something similar) made me concerned there was something I was missing. Since missing any amount of data isn't acceptable (eventually catching up is fine, just not completely skipping data), I'm not fully sure how to implement this in a way that guarantees we don't miss anything.

Anyway, any information you have that can help us would be fantastic.
Essentially, implementing the connector in such a way that it meets all the expectations we'd have of an out-of-the-box connector.
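(To make the missed-interval recovery concrete: extending the polling sketch above, the source could keep a "next interval" cursor in operator state via CheckpointedFunction. After a 10-minute outage the restored cursor is 10 minutes behind, so the catch-up loop re-fetches each missed 1-minute interval before resuming normal polling. `PollResult` and `fetchRange()` are again hypothetical.)

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Hypothetical source that polls in fixed 1-minute intervals and checkpoints
// how far it has gotten, so missed intervals are replayed after a failure.
public class CatchUpPollingSource
        implements SourceFunction<PollResult>, CheckpointedFunction {

    private static final long INTERVAL_MS = 60_000;    // 1-minute intervals

    private volatile boolean running = true;
    private long nextIntervalStart;                     // the cursor we checkpoint
    private transient ListState<Long> cursorState;

    @Override
    public void run(SourceContext<PollResult> ctx) throws Exception {
        while (running) {
            long now = System.currentTimeMillis();
            // Catch-up loop: if the job was down for 10 minutes, this replays
            // each missed 1-minute interval individually before resuming.
            while (running && nextIntervalStart + INTERVAL_MS <= now) {
                PollResult result =
                        fetchRange(nextIntervalStart, nextIntervalStart + INTERVAL_MS);
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(result);
                    nextIntervalStart += INTERVAL_MS;   // advance only after emitting
                }
            }
            Thread.sleep(1_000);                        // wait for the next interval to close
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        cursorState.clear();
        cursorState.add(nextIntervalStart);             // cursor goes into the checkpoint
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        cursorState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("next-interval", Long.class));
        nextIntervalStart = System.currentTimeMillis(); // default for a fresh start
        for (Long restored : cursorState.get()) {       // restore the cursor after a failure
            nextIntervalStart = restored;
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private PollResult fetchRange(long startMs, long endMs) {
        // hypothetical: query the endpoint for data in [startMs, endMs)
        throw new UnsupportedOperationException("wire up your HTTP client here");
    }
}
```

With checkpointing enabled, a restart restores the cursor from the last completed checkpoint, so intervals emitted after that checkpoint are simply fetched again rather than lost; end-to-end guarantees then depend on the sinks being idempotent or transactional.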