I work with a bunch of public energy system data, much of which is published by government agencies in Excel spreadsheets with multiple tabs, whose filenames, headers, and organizational structure change from year to year. It's kind of a bespoke mess to get them all integrated into unified, well-normalized database tables. I'm wondering if anyone knows of any public projects using Airbyte to manage this kind of data source effectively, that we might be able to learn from? Where does one put all of the logic required to get this kind of semi-structured data to the point where it makes sense in a database? Right now we use Python and pandas to do it all, and then load it into SQLite locally, but we want to do a better job of separating out the "Transform" stuff (calculating derived values, creating denormalized views, entity matching between datasets) and only doing that after there's a well-defined structure to start with. Right now it's mixed in with the work of getting all the different years of data into a coherent whole.
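(A minimal sketch of the kind of pipeline described above, with hypothetical column names and table names: per-year column maps harmonize each spreadsheet vintage to one canonical schema before anything derived is calculated, and the unified result is loaded into SQLite.)

```python
import sqlite3

import pandas as pd

# Hypothetical per-year column maps: each publication year names the same
# quantity differently, so every vintage is renamed to one canonical schema.
COLUMN_MAPS = {
    2019: {"Plant Name": "plant_name", "Net Gen (MWh)": "net_generation_mwh"},
    2020: {"plant": "plant_name", "net_generation": "net_generation_mwh"},
}

def harmonize(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Rename one year's raw columns to the canonical schema and keep
    only the columns we know how to interpret."""
    out = df.rename(columns=COLUMN_MAPS[year])
    out = out[list(dict.fromkeys(COLUMN_MAPS[year].values()))]
    out["report_year"] = year
    return out

def load(frames: list[pd.DataFrame], db_path: str = "energy.db") -> None:
    """Concatenate the harmonized years and load one unified table,
    leaving derived values and denormalized views for a later step."""
    unified = pd.concat(frames, ignore_index=True)
    with sqlite3.connect(db_path) as conn:
        unified.to_sql("generation", conn, if_exists="replace", index=False)
```

In practice each `harmonize` call would be fed by `pd.read_excel(path, sheet_name=...)` per tab; the point is only that the year-to-year reconciliation lives apart from the later "Transform" work.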
10/28/2021, 2:23 AM
dbt (getdbt.com) sounds like what you’re describing (separate T from EL).
dbt brings things like modularity, change control, and lineage to the transform layer.
This lets you reduce your semi-structured data problem to a basic extract-load step, and you can build in complexity from there.
These deployments work best when they're orchestrated (via Airflow/Dagster/etc.) so that your data models are refreshed right after the EL processes finish
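(A minimal sketch of that ordering in plain Python: `extract_load.py` is a hypothetical script name, `dbt run` is dbt's real CLI entry point, and a production setup would use Airflow or Dagster sensors rather than a script like this.)

```python
import subprocess
import sys

def run_step(cmd: list[str]) -> None:
    """Run one pipeline step; abort the whole pipeline if it fails."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(f"step failed: {' '.join(cmd)}")

def main() -> None:
    # 1. Extract-load: land the raw spreadsheet tabs as-is (hypothetical script).
    run_step(["python", "extract_load.py"])
    # 2. Transform: refresh the dbt models only after EL has finished.
    run_step(["dbt", "run"])
```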
12/06/2021, 11:07 PM
I think I must not understand exactly what roles these different tools play. I thought that dbt was primarily aimed at making SQL more modular and reusable, getting it under version control, and then being able to run those SQL transformations on the fly when the outputs are needed, rather than pre-loading the derived outputs into a DB. Is it also the right tool for taking a bunch of spreadsheets with different structures and turning them into a single coherent table? Would an Airbyte Excel extract-load step just copy hundreds of spreadsheet tabs into hundreds of different database tables without aligning their columns, or knowing what to do with header and footer rows and merged cells or data types or NA values?
12/07/2021, 12:47 AM
Yes, your Airbyte assumption is correct here: it would move the data point-to-point.
From my viewpoint, the best-practice solution to your problem is to fix your source data into something more usable. I'm not naive enough to suggest that's always possible, so I'll recognize it as a limitation.
Airbyte will move the table point-to-point; you *can* do a transform mid-flight, but I'd advise against this.
This suggestion assumes a modern workflow: Extract > Load > Transform
At its base dbt does what you describe, but there are also programmatic possibilities:
• It can allow for open-ended column definitions, or at least a macro that puts the various definitions of the same column into one bucket, so you can treat differently-named columns across multiple tables in the same fashion, much like the star() function in dbt-utils (column auto-discovery)
• Given you’re able to reconcile the multiple tables into standard streams, you can compile the disparate files into a unified dataset.
• Validation checks can also be written against loosely referenced definitions.
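(The bucketing-and-validation idea in the bullets above could be sketched outside dbt too; here's a hypothetical pandas version, where several raw spellings map to one canonical column and a loose validation check runs against the unified result. The bucket contents are made up for illustration.)

```python
import pandas as pd

# Hypothetical "buckets": one canonical column name, many raw spellings
# seen across years and tabs (analogous to the macro idea above).
BUCKETS = {
    "plant_id": {"Plant ID", "plant_code", "PLANT_ID_EIA"},
    "capacity_mw": {"Capacity (MW)", "nameplate_capacity", "cap_mw"},
}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename any recognized raw column to its canonical name."""
    renames = {
        raw: canonical
        for canonical, aliases in BUCKETS.items()
        for raw in df.columns
        if raw in aliases
    }
    return df.rename(columns=renames)

def validate(df: pd.DataFrame) -> list[str]:
    """Loose checks against the canonical definitions: flag missing
    canonical columns and nulls in the ones that are present."""
    problems = []
    for canonical in BUCKETS:
        if canonical not in df.columns:
            problems.append(f"missing column: {canonical}")
        elif df[canonical].isna().any():
            problems.append(f"nulls in: {canonical}")
    return problems
```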
To approach from another angle, my alternative suggestion would be to check out Google’s Document AI
It may be possible to train a model to mine the values from these tables as needed (keying on the values themselves, rather than on what someone's named them), specifically its table-parsing function.