# ask-community-for-troubleshooting
t
Hi :) I'm a volunteer working on vaccine surveillance and modeling for one of the counties in New York. Each day we have to retrieve an extract from the New York State Immunization Information System (NYSIIS) of the previous day's vaccinations. The files are named with the date they were cut (this was yesterday's file): 20211202_nysiis.csv.zip. The plan is to dump the CSVs into Postgres, do some transforms, then shift the data into Elasticsearch for analysis. I'm hoping someone would be willing to help me understand how best to use the Files source, given that it does not yet support multiple files. I'm looking for guidance on how to handle the initial load plus the ongoing daily incrementals. I could do preprocessing before the Airbyte process, but I was hoping to use this as an opportunity to learn best practices inside the Airbyte ecosystem. Thanks! :) 🙂
✅ 1
s
Is loading to S3 an option? The S3 connector supports globbing files based on file name.
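For illustration, a minimal sketch of how the daily extracts could be staged in S3 so a single glob pattern covers them all; the bucket name and prefix below are assumptions, not details from the thread:

```python
# Sketch only: push each daily NYSIIS extract to S3 under one prefix so the
# Airbyte S3 source can match every file with a single glob.
# "my-vaccine-data" and "nysiis/" are hypothetical placeholders.
import boto3

BUCKET = "my-vaccine-data"
PREFIX = "nysiis/"

def upload_daily_extract(local_path: str) -> None:
    """Upload a file like 20211202_nysiis.csv.zip to s3://BUCKET/PREFIX."""
    s3 = boto3.client("s3")
    key = PREFIX + local_path.rsplit("/", 1)[-1]
    s3.upload_file(local_path, BUCKET, key)

# With this layout, a pattern such as "nysiis/*_nysiis.csv.zip" matches both
# the initial backfill files and each new daily incremental.
upload_daily_extract("20211202_nysiis.csv.zip")
```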
t
ok - this looks like a solid workaround. 🙂 And if I did want to do some preprocessing on the incoming data (I do know there are data quality issues in some of the early files from when we first started vaccinating...) - would the best practice be to handle those outside of Airbyte beforehand, and then trigger the Airbyte job once the files are prepped?
🎉 1
s
What kind of data quality issues? You could also load them to Postgres and clean up there, but if that's a hassle then preprocessing might be the way to go.
t
Offset columns in some cases... they're few and far between, and the plan is to just drop those records. Could I use constraints in Postgres to keep myself from ingesting garbage rows?
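On the constraint idea, here is a rough sketch of what that could look like, assuming the data lands in a typed staging table; the table name, column name, date range, and connection string are all hypothetical. Note that in Postgres a violated constraint makes the offending INSERT fail rather than silently dropping the bad row, which is one reason cleaning up front is often simpler:

```python
# Hypothetical example: a CHECK constraint on a staging table to reject rows
# whose date column is obviously wrong (a common symptom of offset columns).
# Table, column, date range, and DSN are placeholders, not values from the thread.
import psycopg2

DDL = """
ALTER TABLE nysiis_staging
    ADD CONSTRAINT vaccination_date_sane
    CHECK (vaccination_date BETWEEN DATE '2020-12-01' AND CURRENT_DATE);
"""

with psycopg2.connect("dbname=vaccines user=etl") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```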
u
You have two paths: either clean the data before ingesting it with Airbyte, or apply some logic to the data after it has been loaded into your destination. For your situation, cleaning the data beforehand is probably the better option.
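For completeness, a minimal pre-cleaning sketch along those lines; the column name and output path are assumptions, since the real NYSIIS layout isn't shown in the thread:

```python
# Sketch of cleaning a daily extract before handing it to Airbyte.
# "vaccination_date" is a hypothetical column name.
import pandas as pd

def clean_extract(path: str) -> pd.DataFrame:
    # pandas reads the zipped CSV directly; rows with the wrong number of
    # fields (one symptom of offset columns) are skipped outright.
    df = pd.read_csv(path, on_bad_lines="skip")

    # Drop rows whose date column does not parse, another symptom of
    # shifted columns.
    df["vaccination_date"] = pd.to_datetime(df["vaccination_date"], errors="coerce")
    return df.dropna(subset=["vaccination_date"])

clean_extract("20211202_nysiis.csv.zip").to_csv(
    "20211202_nysiis_clean.csv", index=False
)
```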