Hi everyone :) My name is Tom Griffin - I’m experimenting with Airbyte to build out a vaccination warehouse for Albany County in New York.
One of the data sources is the state’s immunization registry. Each day they cut a CSV of the prior day’s vaccination events. We pull that file, ingest it, etc.
We ended up writing a small Python script that syncs their SFTP directory with a local directory on our end (matching by filename). We went this route because some days they dropped more than one file and we didn’t want to risk missing anything.
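For context, the script is roughly along these lines (a minimal sketch, not our exact code - the host, credentials, and paths are placeholders, and it assumes paramiko for the SFTP connection):

```python
import os
import paramiko

# Placeholder connection details - swap in the registry's real host/credentials.
HOST = "sftp.example.org"
USERNAME = "user"
PASSWORD = "secret"
REMOTE_DIR = "/directory"
LOCAL_DIR = "/data/staging"


def sync_new_files():
    """Download any remote CSVs whose filenames we don't already have locally."""
    os.makedirs(LOCAL_DIR, exist_ok=True)
    existing = set(os.listdir(LOCAL_DIR))  # filenames we already pulled on prior runs

    transport = paramiko.Transport((HOST, 22))
    transport.connect(username=USERNAME, password=PASSWORD)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        for name in sftp.listdir(REMOTE_DIR):
            # "New" = a filename we haven't seen before this run,
            # so multiple drops in one day are all picked up.
            if name.endswith(".csv") and name not in existing:
                sftp.get(f"{REMOTE_DIR}/{name}", os.path.join(LOCAL_DIR, name))
    finally:
        sftp.close()
        transport.close()


if __name__ == "__main__":
    sync_new_files()
```

We run it on a schedule before ingestion, so the local directory always mirrors everything the registry has ever dropped.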
I experimented with the SFTP connector and was able to download specific files I defined in the source configuration, but I couldn’t find a way to match all files in the directory with a pattern like /directory/filename2021*.csv.
What would be the preferred way to mimic what I’m doing now, i.e. the directory synchronization followed by processing only the new files (with “new” meaning files I didn’t have before the job ran)?
For example, is there a way that I could trigger a script to run beforehand that would stage the data locally?
Any ideas would be appreciated :)