# ask-community-for-troubleshooting
m
Hi all, I’m a little confused about how the S3 source works under different sync modes. I have a connection set to “Full Refresh | Overwrite” that runs every hour and matches .csv files in a certain bucket. It seems to load data from those CSV files regardless of whether they’ve been updated: it says it’s emitted 139K records every time. The files CAN and SHOULD be overwritten when the data needs to change, and I’d like the data in my target to be overwritten every time a file is overwritten. How can I do this? If I switch to “Incremental | Deduped” it’s also going to read all the file contents. I’m worried that this process won’t scale as the files pile up.
s
Hey, yes, for this use case an incremental sync would be appropriate 🙂 We do our best to keep things performant; if there are any issues in the future, please report them on our GitHub repo and we will address the concerns. There is also some info on performance considerations in this doc.
m
The problem with the incremental sync is that it still seems to read through every file and do a data comparison based on the update key. I’m looking for a solution where the sync would just skip any file that hasn’t been updated. Think of a scenario where there are 1,000 files in an S3 bucket but only one got updated. I was hoping it’d check the file timestamp in that case. Sorry for the late reply. Any ideas for me?
u
Hello Matt Webster, it’s been a while without an update from us. Are you still having problems, or did you find a solution?
m
I’m still having a problem. I’d like to know if there’s a way to have Airbyte skip reading the contents of an S3 file altogether based on its timestamp. My preference is to keep writing new files to S3, but I fear the process won’t work when there are 100s or 1000s of files…
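For anyone reading along: the timestamp-based skip being asked for can be sketched outside Airbyte. This is NOT how the Airbyte S3 source behaves internally; it’s a hypothetical pre-filter, assuming you track the last successful sync time yourself and can list object metadata (e.g. via boto3’s `list_objects_v2`, whose responses include a `Key` and a `LastModified` per object). The filtering function and example data below are purely illustrative.

```python
from datetime import datetime, timezone

def files_to_sync(objects, last_sync):
    """Return the keys of objects modified after `last_sync`.

    `objects` is a list of dicts shaped like the 'Contents' entries
    returned by S3 ListObjectsV2 (each has 'Key' and a timezone-aware
    'LastModified' datetime). Everything older than `last_sync` is
    skipped entirely, so unchanged files are never re-read.
    """
    return [o["Key"] for o in objects if o["LastModified"] > last_sync]

# Hypothetical object metadata, standing in for a real bucket listing.
objs = [
    {"Key": "a.csv", "LastModified": datetime(2022, 1, 1, tzinfo=timezone.utc)},
    {"Key": "b.csv", "LastModified": datetime(2022, 3, 1, tzinfo=timezone.utc)},
]
last_sync = datetime(2022, 2, 1, tzinfo=timezone.utc)

print(files_to_sync(objs, last_sync))  # → ['b.csv']
```

In practice you’d run something like this before the sync (a small script or Lambda) to copy only changed files into a staging prefix that the Airbyte connection points at, so the connector only ever sees files that actually changed.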