Hi team,
I'm trying to perform streaming upserts into Apache Hudi using Flink (a rough sketch of the job follows the requirements below).
My requirements -
* Data freshness - 2-3 minutes
* Response time - under 5 seconds
* Historical data - spans a few years
* Table size - ~500 TB in total, growing by ~1 TB a day, partitioned by hour
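To make the setup concrete, here is roughly the shape of the job I'm experimenting with. This is only a sketch - the table, columns and path are placeholders, and the connector options and the choice of MERGE_ON_READ are assumptions on my part.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HudiUpsertSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Hudi commits on Flink checkpoints, so the checkpoint interval is
        // effectively the freshness bound (2-3 minutes in my case).
        env.enableCheckpointing(2 * 60 * 1000L);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Placeholder sink table, keyed on a record id and partitioned by hour.
        tEnv.executeSql(
            "CREATE TABLE events_hudi ("
                + "  event_id STRING,"
                + "  payload STRING,"
                + "  event_ts TIMESTAMP(3),"
                + "  part_hour STRING,"
                + "  PRIMARY KEY (event_id) NOT ENFORCED"
                + ") PARTITIONED BY (part_hour) WITH ("
                + "  'connector' = 'hudi',"
                + "  'path' = 'hdfs:///warehouse/events_hudi',"   // placeholder path
                + "  'table.type' = 'MERGE_ON_READ',"             // assuming MOR so writes go to logs and compaction runs async
                + "  'write.operation' = 'upsert'"
                + ")");

        // The pipeline itself is then a streaming INSERT INTO from the source, e.g.
        // tEnv.executeSql("INSERT INTO events_hudi SELECT ... FROM kafka_source");
    }
}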
I have some questions about streaming updates into a table this large.
My assumption is that every row streamed by Flink is micro-batched and only applied on commit, so effectively we're looking at batched updates every few minutes.
My throughput is nearly 1,080,000,000 rows per hour, a mix of updates and appends. Even if the work is split by partition, having each update look up its key among that many rows will surely run into memory issues.
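To put numbers on that, here is my rough math (the 3-minute commit interval and ~3 years of history are assumptions just for the estimate):

public class ThroughputMath {
    public static void main(String[] args) {
        long rowsPerHour = 1_080_000_000L;
        long rowsPerSecond = rowsPerHour / 3600;   // 300,000 rows/s sustained
        long rowsPerCommit = rowsPerSecond * 180;  // ~54,000,000 rows landing in each 3-minute commit
        long hourlyPartitions = 3L * 365 * 24;     // ~26,000 hourly partitions over ~3 years
        System.out.printf("rows/s=%d, rows/commit=%d, partitions=%d%n",
                rowsPerSecond, rowsPerCommit, hourlyPartitions);
    }
}

So every commit would have to reconcile tens of millions of keys against a multi-hundred-terabyte table, which is what worries me.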
My question is: how can I perform these updates efficiently given these factors? Even if I were to stream into a different sink, say a MySQL database, wouldn't this problem still persist?