# ingestion
q
Hello, I have a question regarding the DBT connector. I see a platform_instance parameter was added recently in two changes by @green-football-43791: https://github.com/datahub-project/datahub/pull/4926 https://github.com/datahub-project/datahub/pull/5028. I set the platform_instance parameter and then ran the DBT connector, first with the 5028 commit and then with that commit reverted; both runs produced target datasets without a platform instance. How is this feature supposed to work?
g
Hey Piotr - what you should see is the platform_instance showing up in the dbt urns, but not in the other data platform urns.
What system are you persisting your dbt datasets to?
e.g. snowflake/bigquery/redshift/something else?
q
I see, since I use disable_dbt_node_creation: True, I don't see dbt urns, right?
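For context, a minimal sketch of the kind of recipe being described here (file paths and the instance name are illustrative, not from the thread):

```yaml
source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json   # illustrative path
    catalog_path: ./target/catalog.json     # illustrative path
    target_platform: snowflake
    platform_instance: my_instance          # per the PRs above, this only lands on dbt urns
    disable_dbt_node_creation: True         # dbt urns are suppressed, so the instance never shows up
```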
g
ah yes
that would be correct. I see, so you want the platform_instance applied to your source system in that case, then?
q
I have snowflake tables; they have an instance set and are ingested via the snowflake connector with platform_instance set
g
got it got it
q
the problem is that the dbt manifest/catalog/etc. contain information about pipelines between the very same tables (and some intermediate ones), but the instance information is missing
so all datasets created by the dbt connector lack instance information; moreover, even if a table was already ingested by the snowflake connector with proper instance info, dbt creates a clone of it (without an instance) to show lineage...
I was thinking of something resembling a target_instance parameter here, wdyt? Do you think such a feature could be merged if it was implemented?
I am not necessarily asking somebody to implement it, I can do it, but I wanted to consult your vision for the dbt connector
g
yes, I think that makes a lot of sense. target_platform_instance would be an appropriate name
if you put up a PR, I can help review & shepherd it in 🙂
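A rough sketch of how the proposed option might look in a recipe; target_platform_instance is the hypothetical parameter being discussed here, and the urn in the comment assumes DataHub's usual convention of prefixing the dataset name with the platform instance:

```yaml
source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json
    catalog_path: ./target/catalog.json
    target_platform: snowflake
    target_platform_instance: prod_sf        # hypothetical new parameter from this discussion
    disable_dbt_node_creation: True
# intended effect: lineage would point at instance-qualified target urns, e.g.
# urn:li:dataset:(urn:li:dataPlatform:snowflake,prod_sf.db.schema.table,PROD)
```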
q
thank you!
will take some time though 😉
g
sure 🙂
have you contributed to an ingestion source before?
q
no, my understanding is that I should write a unit test involving a golden MCE file, besides the actual code implementing the logic
g
yep!
you can also refer to this to get your dev env set up: https://datahubproject.io/docs/metadata-ingestion/developing/
in your case I would add another test in metadata-ingestion/tests/integration/dbt/test_dbt.py, like I had added in 4926
then, you can run pytest tests/integration/dbt/test_dbt.py --update-golden-files to generate the golden files
m
for systems like this we've used one of two patterns:
1. platform_instance_map: map of platform name to platform_instance (e.g. https://datahubproject.io/docs/generated/ingestion/sources/kafka-connect)
2. connection_to_platform_map: map of system-local connection name to platform name, instance (e.g. https://datahubproject.io/docs/generated/ingestion/sources/looker)
I would recommend we use one of these patterns for dbt.
q
@mammoth-bear-12532 good point, will implement it this way
@mammoth-bear-12532 I think a config similar to kafka-connect fits here better than the one for looker. I just have a question, given this config:
```yaml
platform_instance_map:               # optional
  mysql: test_mysql                  # optional
connect_to_platform_map:             # optional
  postgres-connector-finance-db:     # optional - Connector name
    postgres: core_finance_instance  # optional - Platform to instance map
```
It is quite easy to understand platform_instance_map, I think: whenever we see a dataset belonging to the mysql platform it will be assigned the test_mysql instance. But I am a bit puzzled by connect_to_platform_map (btw. I understand these 2 options are mutually exclusive): is postgres-connector-finance-db the kafka connect app name? Then I understand the mapping of postgres platform datasets to core_finance_instance will be done only for datasets originating from the kafka connect app named postgres-connector-finance-db - is that right?
I would propose a similar approach for dbt. We could go with two settings: target_platform_instance, mapping all datasets to a single instance, and target_platform_instance_mapping, a per-schema mapping:
```yaml
target_platform_instance_mapping:
- schema_name: instance_name
```
Both settings could be set at the same time, and target_platform_instance_mapping would take precedence over target_platform_instance. Sounds reasonable, doesn't it?
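A sketch of how the two proposed settings could look side by side (both names are hypothetical at this point; the per-schema mapping would override the global one):

```yaml
source:
  type: dbt
  config:
    target_platform: snowflake
    target_platform_instance: default_instance   # proposed: fallback instance for all datasets
    target_platform_instance_mapping:            # proposed: per-schema overrides, taking precedence
      finance: core_finance_instance
      marketing: marketing_instance
```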
m
the two are exclusive... if you use connect_to_platform_map then you are basically mapping the kafka-connect connection id to a specific platform_name, instance, etc... similar to the looker option
and if you don't need fine-grained mapping and just need to provide platform-level defaults, then you can use platform_instance_map
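For reference, the same kafka-connect snippet from above, annotated according to this explanation (the two options are meant as alternatives and are only shown together here to contrast their shapes):

```yaml
# coarse: platform-level default, applied to every dataset of that platform
platform_instance_map:
  mysql: test_mysql
# fine-grained: keyed by the kafka-connect connector name; the platform -> instance
# mapping applies only to datasets originating from that connector
connect_to_platform_map:
  postgres-connector-finance-db:
    postgres: core_finance_instance
```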
q
I see, thanks for the clarification. I have a very nice PoC working; it needs some polishing, so I should be creating a PR by Tue/Wed. Thank you for the help!
btw. is it ok that while the snowflake connector creates Table-type datasets, the dbt connector creates just a generic Dataset? Would that be something that could be added? (a separate thing from this problem, I think)
m
I think dbt produces subtypes like: View, Incremental, Ephemeral, etc.
q
@mammoth-bear-12532 @green-football-43791 I prepared a PR here: https://github.com/datahub-project/datahub/pull/5129 - it seems somebody needs to approve running the workflows; I will appreciate your comments!
^ I made some linting fixes, would it be possible to let the PR run the workflow again?
hmm, on second thought, I see that after merging with master I got some problems with tests; I need to fix those first
g
Yes, sorry I recently merged some changes here
q
any particular commit?
g
yep- this was what I did:
q
ok thanks
will adjust to that
ok, how about now? can we run the pipeline?
I pushed some changes, most notably the check for the dbt platform is now done in a separate function used to determine the platform instance
g
reran!
q
thanks, let's see 😉
for some reason, linting on my machine gives different results than the workflow...
sorry to bother you, can we rerun again?
@green-football-43791 can we run the workflow again today?
g
will run now!
q
ok, if you find time I will highly appreciate a review as well (assuming the code passes CI, of course) 😉
thank you
g
left a few comments
also ran ci
q
@green-football-43791 thanks, I made some changes according to the comments
CI ran and somehow it fails on hive tests, which I didn't touch: https://github.com/datahub-project/datahub/pull/5129
@green-football-43791 now I see it's all green; was it retriggered manually or did some magic happen? WDYT about the changes I made in response to your comments?
g
Hey @quick-pizza-8906 - looks good from my end. I retriggered the build!
Going to let Shirshanka take one last pass to make sure we're being consistent with our configurations
& then it should be good to merge! thanks for the contribution 🙂
q
thank you
m
@quick-pizza-8906 I just realized that dbt only supports connecting to one platform per manifest... so my advice to you to create a map of platform -> instance etc... was plain wrong 😞
fear not! I will fix up your PR
and merge it in later today
q
I think if it allows just a single platform per manifest, then having a platform->instance mapping doesn't make sense (I cannot provide multiple manifests, can I?). I understood that a single catalog/manifest pair refers to a single platform, but the instances might differ depending on where the particular datasets come from; hence I proposed assigning them based on schema if such granularity is needed, and if not, the instance can be set "globally" for all datasets ingested by the connector instance
m
right... since one dbt recipe only supports one manifest + catalog, and one dbt catalog only supports one adapter, it makes no sense to have another map of platform_name -> instance
instead, just having a parallel target_platform_instance to go along with the current target_platform makes more sense
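So the resulting shape is presumably just a parallel pair, roughly sketched here (assuming the option lands under that name):

```yaml
source:
  type: dbt
  config:
    target_platform: snowflake             # the single adapter the manifest/catalog were built for
    target_platform_instance: my_instance  # new option, parallel to target_platform; no per-platform map needed
```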
I pushed a commit to your PR
q
@mammoth-bear-12532 thank you for merging!