# ingestion
q
Hello, I have a question regarding the DBT connector. I see a platform_instance parameter was added recently in two changes by @green-football-43791: https://github.com/datahub-project/datahub/pull/4926 https://github.com/datahub-project/datahub/pull/5028. I set the platform_instance parameter and then ran the DBT connector, first with the 5028 commit and then with that commit reverted; both runs produced target datasets without a platform instance. How is this feature supposed to work?
g
Hey Piotr - what you should see is the platform_instance showing up in the dbt urns, but not in the other data platform urns.
What system are you persisting your dbt datasets to?
e.g. snowflake/bigquery/redshift/something else?
q
I see, since I use disable_dbt_node_creation: True, I don't see dbt urns, right?
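For context, a minimal sketch of the kind of recipe being described here (file paths and the instance name are illustrative, not from the thread):

```yaml
source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json   # illustrative path
    catalog_path: ./target/catalog.json     # illustrative path
    target_platform: snowflake
    platform_instance: my_instance          # per the PRs above, this only lands on dbt urns
    disable_dbt_node_creation: True         # dbt urns are suppressed, so the instance never shows up
```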
g
ah yes
that would be correct. I see, so you want the platform_instance applied to your source system in that case, then?
q
I have snowflake tables; they have an instance set and are ingested via the snowflake connector with platform_instance set
g
got it got it
q
the problem is that the dbt manifest/catalog/etc. contain information about pipelines between the very same tables (and some intermediate ones), but the instance information is missing
so all datasets created by the dbt connector lack instance information; moreover, even if a table was already ingested by the snowflake connector with proper instance info, dbt creates a clone of it (without an instance) to show lineage...
I was thinking of something resembling a target_instance parameter here, wdyt? Do you think such a feature could be merged if it was implemented?
I am not necessarily asking somebody to implement it, I can do it, but I wanted to consult your vision for the dbt connector
g
yes, I think that makes a lot of sense. target_platform_instance would be an appropriate name
if you put up a PR, I can help review & shepherd it in 🙂
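A rough sketch of how the proposed option might look in a recipe; target_platform_instance is the hypothetical parameter being discussed here, and the urn in the comment assumes DataHub's usual convention of prefixing the dataset name with the platform instance:

```yaml
source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json
    catalog_path: ./target/catalog.json
    target_platform: snowflake
    target_platform_instance: prod_sf        # hypothetical new parameter from this discussion
    disable_dbt_node_creation: True
# intended effect: lineage would point at instance-qualified target urns, e.g.
# urn:li:dataset:(urn:li:dataPlatform:snowflake,prod_sf.db.schema.table,PROD)
```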
q
thank you!
will take some time though 😉
g
sure 🙂
have you contributed to an ingestion source before?
q
no, my understanding is that I should write a unit test involving a golden MCE file, besides the actual code implementing the logic
g
yep!
you can also refer to this to get your dev env set up: https://datahubproject.io/docs/metadata-ingestion/developing/
in your case I would add another test in metadata-ingestion/tests/integration/dbt/test_dbt.py, like I had added in 4926
then, you can run pytest tests/integration/dbt/test_dbt.py --update-golden-files to generate the golden files
m
for systems like this we've used one of two patterns:
1. platform_instance_map: map of platform name to platform_instance (e.g. https://datahubproject.io/docs/generated/ingestion/sources/kafka-connect)
2. connection_to_platform_map: map of system-local connection name to platform name, instance (e.g. https://datahubproject.io/docs/generated/ingestion/sources/looker)
I would recommend we use one of these patterns for dbt.
q
@mammoth-bear-12532 good point, will implement it this way
@mammoth-bear-12532 I think a config similar to kafka-connect fits here better than the one for looker. I just have a question, given this config:
```yaml
platform_instance_map:               # optional
  mysql: test_mysql                  # optional
connect_to_platform_map:             # optional
  postgres-connector-finance-db:     # optional - Connector name
    postgres: core_finance_instance  # optional - Platform to instance map
```
It is quite easy to understand platform_instance_map, I think: whenever we see a dataset belonging to the mysql platform it will be assigned the test_mysql instance. But I am a bit puzzled by connect_to_platform_map (btw. I understand these 2 options are mutually exclusive): is postgres-connector-finance-db the kafka connect app name? Then I understand the mapping of postgres platform datasets to core_finance_instance will be done only for datasets originating from the kafka connect app named postgres-connector-finance-db - is that right?
I would propose a similar approach for dbt. We could go with two settings: target_platform_instance, mapping all datasets to a single instance, and target_platform_instance_mapping, a per-schema mapping:
```yaml
target_platform_instance_mapping:
- schema_name: instance_name
```
Both settings could be set at the same time, and target_platform_instance_mapping would take precedence over target_platform_instance. Sounds reasonable, doesn't it?
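A sketch of how the two proposed settings could look side by side (both names are hypothetical at this point; the per-schema mapping would override the global one):

```yaml
source:
  type: dbt
  config:
    target_platform: snowflake
    target_platform_instance: default_instance   # proposed: fallback instance for all datasets
    target_platform_instance_mapping:            # proposed: per-schema overrides, taking precedence
      finance: core_finance_instance
      marketing: marketing_instance
```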
m
the two are exclusive... if you use connect_to_platform_map then you are basically mapping the kafka-connect connection id to a specific platform_name, instance, etc... similar to the looker option
and if you don't need fine-grained mapping and just need to provide platform-level defaults, then you can use platform_instance_map
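For reference, the same kafka-connect snippet from above, annotated according to this explanation (the two options are meant as alternatives and are only shown together here to contrast their shapes):

```yaml
# coarse: platform-level default, applied to every dataset of that platform
platform_instance_map:
  mysql: test_mysql
# fine-grained: keyed by the kafka-connect connector name; the platform -> instance
# mapping applies only to datasets originating from that connector
connect_to_platform_map:
  postgres-connector-finance-db:
    postgres: core_finance_instance
```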
q
I see, thanks for the clarification. I have a very nice PoC working; it needs some polishing, so I should be creating a PR by Tue/Wed. Thank you for the help!
btw. is it ok that while the snowflake connector creates Table-type datasets, the dbt connector creates just a generic Dataset? Would that be something that could be added? (a separate thing from this problem, I think)
m
I think dbt produces subtypes like: View, Incremental, Ephemeral, etc.
q
@mammoth-bear-12532 @green-football-43791 I prepared a PR here: https://github.com/datahub-project/datahub/pull/5129 - it seems somebody needs to approve running the workflows; I will appreciate your comments!
^ I made some linting fixes, would it be possible to let the PR run the workflow again?
hmm, on second thought, I see that after merging with master I got some problems with tests; I need to fix those first
g
Yes, sorry I recently merged some changes here
q
any particular commit?
g
yep- this was what I did:
q
ok thanks
will adjust to that
ok, how about now? can we run the pipeline?
I pushed some changes, most notably the check for the dbt platform is now done in a separate function used to determine the platform instance
g
reran!
q
thanks, let's see 😉
for some reason, linting on my machine gives different results than the workflow...
sorry to bother you, can we rerun again?
@green-football-43791 can we run the workflow again today?
g
will run now!
q
ok, if you find time I will highly appreciate a review as well (assuming the code passes CI, of course) 😉
thank you
g
left a few comments
also ran ci
q
@green-football-43791 thanks, I made some changes according to the comments
CI ran and somehow it fails on hive tests, which I didn't touch: https://github.com/datahub-project/datahub/pull/5129
@green-football-43791 now I see it's all green; was it retriggered manually or did some magic happen? WDYT about the changes I made in response to your comments?
g
Hey @quick-pizza-8906 - looks good from my end. I retriggered the build!
Going to let Shirshanka take one last pass to make sure we're being consistent with our configurations
& then it should be good to merge! thanks for the contribution 🙂
q
thank you
m
@quick-pizza-8906 I just realized that dbt only supports connecting to one platform per manifest... so my advice to you to create a map of platform -> instance etc... was plain wrong 😞
fear not! I will fix up your PR
and merge it in later today
q
I think if it allows just a single platform per manifest, then having a platform->instance mapping doesn't make sense (I cannot provide multiple manifests, can I?). I understood that a single catalog/manifest pair refers to a single platform, but the instances might differ depending on where the particular datasets come from; hence I proposed assigning them based on schema if such granularity is needed, and if not, the instance can be set "globally" for all datasets ingested by the connector instance
m
right... since one dbt recipe only supports one manifest + catalog, and one dbt catalog only supports one adapter, it makes no sense to have another map of platform_name -> instance
instead, just having a parallel target_platform_instance to go along with the current target_platform makes more sense
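So the resulting shape is presumably just a parallel pair, roughly sketched here (assuming the option lands under that name):

```yaml
source:
  type: dbt
  config:
    target_platform: snowflake             # the single adapter the manifest/catalog were built for
    target_platform_instance: my_instance  # new option, parallel to target_platform; no per-platform map needed
```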
I pushed a commit to your PR
q
@mammoth-bear-12532 thank you for merging!