# ingestion
b
How do y'all handle multiple runs of the same DBT project? We have one DBT project, and multiple departments own seeds and models within it. Rather than running everything in one run once per day, we have each department running their pipelines (via `--select`) on their own schedule, which may be daily or hourly. So scheduling DataHub to pull on a schedule means our assertions (run_results.json) may not be complete. Ideas we're considering are in the thread.
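For illustration, a minimal sketch of what one department's scoped run could look like, using dbt's programmatic runner (dbt-core 1.5+; the tag name is a hypothetical example, not from the thread):

```python
# Illustration only: each department scopes its scheduled run with --select,
# so a single run's run_results.json only covers that department's nodes.
from dbt.cli.main import dbtRunner

# hypothetical department tag; any valid dbt selector works here
result = dbtRunner().invoke(["build", "--select", "tag:finance"])
print(result.success)
```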
b
Ideas we'd had:
• **Trigger after each run** (see the sketch below). We use Airflow to schedule DBT, so store the dbt JSON files on S3 and trigger a DataHub DBT ingestion at the end of each DBT run... but now we have concurrency issues, because we might trigger an ingestion while the previous one is still running.
• **Ignore the DBT recipe and push via API.** Push the output via the API instead... but I think we'd have to reinvent the wheel and redo the DBT transforms that DataHub can do natively (e.g. the sibling with the destination db object). Potentially we could do this just for assertions, since the manifest and catalog should be the same between runs (dbt compiles the entire project for these files, not just the executed models).
• **Custom ingestion, piggybacking on dbt_artifacts.** We already have the dbt manifest and run results pushed to Snowflake tables by the dbt_artifacts plugin. But we'd have to reinvent the wheel again, because we'd have to reproduce the DBT ingestion transforms DataHub does out of the box.
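For the first idea, a minimal sketch of triggering the dbt ingestion programmatically as the last step of a run, reading the artifacts from S3. The paths, GMS address, and target platform are placeholders, and the run-results field name varies by DataHub version, so treat this as a sketch rather than a working recipe:

```python
# Sketch: trigger a DataHub dbt ingestion right after a dbt run finishes.
# Paths and the GMS address are placeholders; exact field names can differ
# between DataHub versions (older releases used test_results_path), and the
# dbt source also takes an aws_connection block when reading from S3.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": "s3://my-bucket/dbt/finance/manifest.json",
                "catalog_path": "s3://my-bucket/dbt/finance/catalog.json",
                "run_results_paths": ["s3://my-bucket/dbt/finance/run_results.json"],
                "target_platform": "snowflake",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```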
g
On the “Trigger after each run” approach: what sort of concurrency issues do you foresee? One other possibility on the trigger-after-each-run approach: we support `entities_enabled` and `node_name_pattern` filters, which you can use to make sure that your DataHub ingestion runs only bring in the relevant stuff.
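A sketch of what those two filters could look like in the dbt source config; the regex and entity choices are illustrative, not from the thread:

```python
# Illustrative dbt source config using the two filters mentioned above,
# scoping a scheduled ingestion to one department's nodes and entity types.
# Regexes, paths, and platform are hypothetical examples.
filtered_dbt_source_config = {
    "manifest_path": "s3://my-bucket/dbt/manifest.json",
    "catalog_path": "s3://my-bucket/dbt/catalog.json",
    "target_platform": "snowflake",
    # only emit the entity types this scheduled run should own
    "entities_enabled": {
        "models": "Yes",
        "seeds": "Yes",
        "test_definitions": "Yes",
        "test_results": "Yes",
    },
    # restrict the ingestion to nodes whose names match this department's prefix
    "node_name_pattern": {
        "allow": ["^finance_.*"],
        "deny": [],
    },
}
```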
b
Thanks, I did some experimentation with these. I think we can do the following (sketched below):
• On deployment of a new DBT image (where models have changed), we do a `dbt docs generate` so we can do a stateful ingestion of the model, snapshot, test, etc. definitions here.
• On the execution of a DBT build/test, we can ingest only the test_result entities, using the manifest and catalog from the earlier deployment and the run_results from the execution.
Edit: since run results are appended to the timeseries index and DataHub picks the most recent test (not the most recent ingestion) per definition, it's no problem if they happen to run out of order.
I think that solves my problem (and helps anyone else looking for how to orchestrate their DBT ingestion).