Ankam Praveen
06/09/2023, 8:19 AM
sap1ens
06/09/2023, 4:20 PM
Ankam Praveen
06/09/2023, 9:38 PM
sap1ens
06/09/2023, 10:14 PM
Ankam Praveen
06/09/2023, 10:25 PM
sap1ens
06/10/2023, 4:37 PM
sap1ens
06/10/2023, 4:38 PM
"Moreover, since our data is partitioned by day as well as hour, it means we need to construct the list of Paths to scan each day dynamically."
Not sure I fully understand that… Flink is able to recursively find all “sub folders” as well.
Ankam Praveen
06/10/2023, 6:52 PM
The layout is s3://test-data/<yyyy-MM-dd>/<event>/<HH>. We have data stored in this format dating back to 2019, and for each day we have approximately 50 different event types, each in its own directory. Of these, my use case involves processing only 4-5 specific event types.
I need to process current data. My strategy is to initially provide the path s3://test-data/<current date>/<topic> to Flink’s FileSource. This allows the FileSource to continuously monitor that directory for new files for the current date. Once a day’s data has been processed and the day has ended, I would need to update the FileSource to process data for the new date.
Is there a way to configure FileSource to do this efficiently? Does creating a custom FileEnumerator make sense?
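For illustration, here is a rough sketch in plain Java (no Flink dependency) of the two pieces the question touches on: building the per-day, per-event directory list, and a filter that accepts only the wanted event types. The class name DailyPaths, the methods pathsFor and isWanted, and the event names are all hypothetical. In a real job the strings would presumably be wrapped in Flink Path objects and handed to the FileSource builder, and the filter logic could live inside a custom enumerator, but that wiring is an assumption and is not shown here.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DailyPaths {
    // Bucket root taken from the layout described in the thread.
    static final String ROOT = "s3://test-data/";
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    /** One directory per wanted event for the given day; hour subfolders are left to recursive listing. */
    static List<String> pathsFor(LocalDate day, List<String> events) {
        String prefix = ROOT + DAY.format(day) + "/";
        return events.stream().map(e -> prefix + e).collect(Collectors.toList());
    }

    /** True if the path lies under ROOT/<yyyy-MM-dd>/<event>/ for one of the wanted events. */
    static boolean isWanted(String path, Set<String> events) {
        if (!path.startsWith(ROOT)) {
            return false;
        }
        // After the root, segment 0 is the day and segment 1 is the event type.
        String[] parts = path.substring(ROOT.length()).split("/");
        return parts.length >= 2 && events.contains(parts[1]);
    }

    public static void main(String[] args) {
        List<String> events = List.of("click", "purchase"); // hypothetical event names
        pathsFor(LocalDate.of(2023, 6, 10), events).forEach(System.out::println);
        System.out.println(isWanted("s3://test-data/2023-06-10/click/07/part-0", Set.copyOf(events)));
    }
}
```

With a filter like isWanted, a single source rooted at the bucket could in principle skip the other event types without being rebuilt each day, at the cost of listing more directories. Whether that is cheaper than regenerating the path list daily depends on how expensive the S3 listing turns out to be.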