# ask-community-for-troubleshooting
t
Hi :) I'm a volunteer working on vaccine surveillance and modeling for one of the counties in New York. Each day we have to retrieve an extract from the New York State Immunization Information System (NYSIIS) of the previous day's vaccinations. The files are named with the date they were cut (this was yesterday's file): 20211202_nysiis.csv.zip. The plan is to dump the CSVs into Postgres, do some transforms, then shift the data into Elasticsearch for analysis. I'm hoping someone would be willing to help me understand how best to use the Files source, given that it does not yet support multiple files. I'm looking for guidance on how to handle the initial load plus the ongoing daily incrementals. I could do preprocessing before the Airbyte process, but I was hoping to use this as an opportunity to learn best practices inside the Airbyte ecosystem. Thanks! :) 🙂
✅ 1
s
Is loading to S3 an option? The S3 connector supports globbing files based on file name.
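For illustration, a minimal sketch of how the daily extracts could be staged in S3 so a single glob pattern covers them all; the bucket name and prefix below are assumptions, not details from the thread:

```python
# Sketch only: push each daily NYSIIS extract to S3 under one prefix so the
# Airbyte S3 source can match every file with a single glob.
# "my-vaccine-data" and "nysiis/" are hypothetical placeholders.
import boto3

BUCKET = "my-vaccine-data"
PREFIX = "nysiis/"

def upload_daily_extract(local_path: str) -> None:
    """Upload a file like 20211202_nysiis.csv.zip to s3://BUCKET/PREFIX."""
    s3 = boto3.client("s3")
    key = PREFIX + local_path.rsplit("/", 1)[-1]
    s3.upload_file(local_path, BUCKET, key)

# With this layout, a pattern such as "nysiis/*_nysiis.csv.zip" matches both
# the initial backfill files and each new daily incremental.
upload_daily_extract("20211202_nysiis.csv.zip")
```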
t
ok - this looks like a solid workaround. 🙂 And if I did want to do some preprocessing on the incoming data (I do know there are data quality issues in some of the early files from when we first started vaccinating...) - would the best practice be to handle those outside of Airbyte beforehand, and then trigger the Airbyte job once the files are prepped?
🎉 1
s
What kind of data quality issues? You could also load them to Postgres and clean up there, but if that's a hassle then preprocessing might be the way to go.
t
Offset columns in some cases... they're few and far between, and the plan is to just drop those records. Could I use constraints in Postgres to keep myself from ingesting garbage rows?
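On the constraint idea, here is a rough sketch of what that could look like, assuming the data lands in a typed staging table; the table name, column name, date range, and connection string are all hypothetical. Note that in Postgres a violated constraint makes the offending INSERT fail rather than silently dropping the bad row, which is one reason cleaning up front is often simpler:

```python
# Hypothetical example: a CHECK constraint on a staging table to reject rows
# whose date column is obviously wrong (a common symptom of offset columns).
# Table, column, date range, and DSN are placeholders, not values from the thread.
import psycopg2

DDL = """
ALTER TABLE nysiis_staging
    ADD CONSTRAINT vaccination_date_sane
    CHECK (vaccination_date BETWEEN DATE '2020-12-01' AND CURRENT_DATE);
"""

with psycopg2.connect("dbname=vaccines user=etl") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```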
u
You have two paths: either clean the data before ingesting it with Airbyte, or apply some logic to the data after it has been loaded into your destination. For your situation, cleaning the data beforehand is probably the better option.
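For completeness, a minimal pre-cleaning sketch along those lines; the column name and output path are assumptions, since the real NYSIIS layout isn't shown in the thread:

```python
# Sketch of cleaning a daily extract before handing it to Airbyte.
# "vaccination_date" is a hypothetical column name.
import pandas as pd

def clean_extract(path: str) -> pd.DataFrame:
    # pandas reads the zipped CSV directly; rows with the wrong number of
    # fields (one symptom of offset columns) are skipped outright.
    df = pd.read_csv(path, on_bad_lines="skip")

    # Drop rows whose date column does not parse, another symptom of
    # shifted columns.
    df["vaccination_date"] = pd.to_datetime(df["vaccination_date"], errors="coerce")
    return df.dropna(subset=["vaccination_date"])

clean_extract("20211202_nysiis.csv.zip").to_csv(
    "20211202_nysiis_clean.csv", index=False
)
```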