Sandrine Bédard
08/26/2024, 4:34 PM
trackable_job). It then triggers another workflow, called Sync (also in Cadence), which sends data to downstream services.
3. The Sync workflow receives a store ID as input (among other things) and waits 5 minutes for an aggregation window. After 5 minutes, it reads all items from trackable_job with the given store ID and calls external APIs to sync the data.
The issues with the current architecture are:
1. The 5-minute aggregation window results in a lot of items being read at once from trackable_job (up to 170K), making the DB query to fetch items very long and inefficient
2. Multiple Cadence workflows read from trackable_job, making things worse from a DB perspective
3. We track success/failed syncs in the trackable_job table, but we don't do anything with it (e.g., publishing events to clients), so this table is used as a queue only
In my design, I'm considering 2 options:
• Option 1: Kafka + Cadence
◦ Replace trackable_job with a proper Kafka queue (partitioned by store ID)
◦ Modify the Sync workflow to pull from that queue. I've read Cadence isn't super easy to set up with Kafka. For example, if queues are partitioned, we need to define which Cadence worker reads from which queue. Is that true?
◦ We must have grouping logic to group items by store ID, and then make the API call downstream (requirement from external dependencies)
• Option 2: Kafka + Flink
◦ Replace trackable_job with a proper Kafka queue (partitioned by store ID)
◦ Move the business logic of the Sync workflow into Flink. Set up Flink tasks to maximize parallelism with Kafka
◦ Add a time/count window in Flink to control the aggregation window
◦ Group events in Flink by store ID
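[Editor's aside: the per-store grouping in Option 2 can be illustrated with plain Python standing in for Flink's keyBy(store_id) plus a 5-minute tumbling event-time window. SyncItem, window_start, and group_into_windows are illustrative names, not from the actual codebase.]

```python
from collections import defaultdict
from dataclasses import dataclass, field

WINDOW_SECONDS = 300  # the 5-minute aggregation window


@dataclass
class SyncItem:
    store_id: str
    event_time: float  # seconds since epoch
    payload: dict = field(default_factory=dict)


def window_start(ts: float) -> float:
    # Align a timestamp to the start of its tumbling window
    return ts - (ts % WINDOW_SECONDS)


def group_into_windows(items):
    # One batch per (window, store): each batch would become a single
    # downstream API call, per the external-dependency requirement
    batches = defaultdict(list)
    for item in items:
        batches[(window_start(item.event_time), item.store_id)].append(item)
    return batches
```

In real Flink this grouping is what keyBy + TumblingEventTimeWindows gives you for free, including firing on watermarks rather than waiting a fixed 5 minutes of wall-clock time.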
Do you think my problem is a good use-case for Flink, and what do you think of Option 2? Thanks a lot!

Ken Krugler
08/26/2024, 4:41 PM

Sandrine Bédard
08/26/2024, 5:59 PM

Sandrine Bédard
08/26/2024, 6:07 PM

Ken Krugler
08/26/2024, 10:08 PM
trackable_job table, and feed that stream of updates to a Flink workflow that calculates the 5-minute tumbling window aggregation (keyed by store). This assumes the DB you’re using for that table is supported by Flink CDC. Since you mentioned “Multiple Cadence workflows read from” that same table, this seems easiest.
If you wanted to get rid of the trackable_job table, you could write all inventory updates directly to a Paimon table, and have a Flink job that handles the “decoration” (I assume enrichment) currently being done by the Cadence InventoryUpdate workflow. This could write results to another Paimon table that’s the source for the aggregation Flink workflow, or this same job could also do the aggregation work if you didn’t need an enriched version of the inventory updates stored for other workflows.
There is some latency between when a Paimon table is updated and when that update becomes visible to a Flink workflow that’s reading from this same table, and that’s the “1 minute” I was referring to. You can go lower; that just feels like a reasonably safe value to use.

Sandrine Bédard
08/26/2024, 11:36 PM

Ken Krugler
08/27/2024, 2:07 AM

Sandrine Bédard
08/27/2024, 4:30 PM

Ken Krugler
08/27/2024, 4:31 PM