Hello All! We have snowflake/dbt/looker tech stack...
# ingestion
h
Hello All! We have snowflake/dbt/looker tech stack. We've ingested snowflake data as well as dbt json files - however datahub created 2 separate datasets - rather than linking the snowflake assets to the dbt models. Has anyone tried this? Any thoughts/ideas to de-duplicate / clean it up?
l
@acoustic-printer-83045 @gray-shoe-75895 ^
a
That doesn’t overly surprise me. Can you provide the URN of a DBT-ingested dataset and a Snowflake-ingested dataset that should match but don’t? Each metadata ingestion tool makes some assumptions about how to craft a URN, and if they don’t match across tools we’ll see some disconnects. Also, what do you want from Snowflake metadata ingestion vs. what do you want from DBT-based ingestion? I’m asking because which assets we should load from DBT is an open question right now.
h
Thanks Gary - I'll get you the URN numbers here shortly. For the latter question
• DBT
  ◦ Lineage
  ◦ Will have column descriptions
  ◦ New 0.19 allows some tagging and meta information that would be great if it could be parsed
• Snowflake - if all the above from DBT is available it may be limited - however we also have Looker which we would ideally map dashboards/looks and I presume we would need Snowflake appropriately modeled / ingested in order to map these assets together
m
Great question.. catalog de-dup / reconciliation is definitely something we need to do well. Looking at the dbt generated ids should help with stitching together.
h
Here are the URN values:
• dbt: urn:li:dataset:(urn:li:dataPlatform:dbt,analytics.analytics.fct_bookings,PROD)
• Snowflake: urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.analytics.fct_bookings,PROD)
a
Ok, so the disconnect is in dataplatform. I can think of a few ways to fix that. If DBT is going to be a supplementary datasource, then that needs to be defined in the config file.
I’m +1 on it being supplementary and that was how I used it when I first instrumented our setup. @mammoth-bear-12532 / @gray-shoe-75895 make sense?
I’m also wondering if we should even pull in schema data if it’s supplementary, i.e. the primary source for that would be the other pipeline.
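To make the disconnect concrete: the two URNs posted in this thread differ only in the dataPlatform segment. Below is a minimal Python sketch, assuming the URN shape shown in this thread. The helper names `parse_dataset_urn` and `same_table_different_platform` are hypothetical and are not part of DataHub; this only illustrates the mismatch and is not a fix for the ingestion source.

```python
import re
from typing import NamedTuple


class DatasetUrn(NamedTuple):
    platform: str
    name: str
    env: str


# Assumed shape: urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,]+)\)$"
)


def parse_dataset_urn(urn: str) -> DatasetUrn:
    """Split a dataset URN into platform, name and environment."""
    match = _URN_RE.match(urn.strip())
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return DatasetUrn(*match.groups())


def same_table_different_platform(urn_a: str, urn_b: str) -> bool:
    """True when two URNs share name and env but differ in platform."""
    a, b = parse_dataset_urn(urn_a), parse_dataset_urn(urn_b)
    return a.platform != b.platform and (a.name, a.env) == (b.name, b.env)


dbt_urn = "urn:li:dataset:(urn:li:dataPlatform:dbt,analytics.analytics.fct_bookings,PROD)"
snowflake_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.analytics.fct_bookings,PROD)"

print(parse_dataset_urn(dbt_urn))
print(same_table_different_platform(dbt_urn, snowflake_urn))
```

If dbt were treated as supplementary, the dbt-side URN would presumably need to carry the warehouse platform instead of `dbt` so that both pipelines resolve to a single dataset.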
g
Yup - for the superset integration, we also map the database types into the data platform
I think it makes sense to have the schema data generation optional via config
👍 1
a
Ok, I’ll take a look at the superset implementation and try to punch out a PR this weekend (more likely next weekend)
👍 1
g
Awesome, thanks @acoustic-printer-83045
a
@gray-shoe-75895 I notice that the readme on datahub master doesn't mention superset. I was going to read it to get a handle on how superset handles the dataplatform configuration. I'm assuming it's the provider field and will check the code. Just wondering why superset is there in examples / source but not in the readme 🙂
m
@acoustic-printer-83045: yeah looks like a miss.. there is a recipe here
datahub/metadata-ingestion/examples/recipes/superset_to_rest.yml
looks like you found the recipe and the source already. nvm 🙂
a
Looks like superset returns the dataplatform part of the URN, which is great. 🤔 on how DBT should do it. I'll have to poke at it a bit more to see how the other ingestion sources set dataplatform.
b
@handsome-airplane-62628 @loud-island-88694 @acoustic-printer-83045 @mammoth-bear-12532 was this issue resolved? I am facing the same issue now and I was searching the slack channel for a possible fix or to see if my ingestion recipe needs any corrections.
m
@blue-plastic-11088: what sort of urns is the dbt connector generating for you?
b
@mammoth-bear-12532
urn:li:dataset:(urn:li:dataPlatform:snowflake,DG_SANDBOX.WEB_SALES_CURATED.customer,PROD)
urn:li:dataset:(urn:li:dataPlatform:snowflake,dg_sandbox.web_sales_curated.customer,PROD)
case difference
m
ah casing in snowflake names came up earlier today as well. Are both urns generated by dbt? or one by dbt and the other by the snowflake connector?
b
one by dbt (upper case) and one by Snowflake (lower case).
c
I am having a similar issue with redash and mssql. 2 different dataset urns but with different cases.
urn:li:dataset:(urn:li:dataPlatform:mssql,GensuiteWeb.dbo.at_ltbdeveloper,PROD)

urn:li:dataset:(urn:li:dataPlatform:mssql,gensuiteweb.dbo.AT_ltbDeveloper,PROD)
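One way to spot these case-only duplicates before they are fixed upstream is to group URNs by a lowercased key. Below is a minimal Python sketch, assuming dataset names can be compared case-insensitively for the platforms involved (that is an assumption, and quoted identifiers may break it). The function name `find_case_collisions` is hypothetical and not a DataHub API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_case_collisions(urns: Iterable[str]) -> Dict[str, List[str]]:
    """Group URNs that are identical except for letter case.

    The lowercased URN is used only as a grouping key; the original
    strings are returned untouched.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for urn in urns:
        groups[urn.lower()].append(urn)
    # Keep only keys that more than one distinct URN maps to.
    return {
        key: sorted(set(members))
        for key, members in groups.items()
        if len(set(members)) > 1
    }


# URNs copied from this thread.
urns = [
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,DG_SANDBOX.WEB_SALES_CURATED.customer,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,dg_sandbox.web_sales_curated.customer,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:mssql,GensuiteWeb.dbo.at_ltbdeveloper,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:mssql,gensuiteweb.dbo.AT_ltbDeveloper,PROD)",
]

for key, members in find_case_collisions(urns).items():
    print(key)
    for member in members:
        print("   ", member)
```

This only detects collisions. Which casing should win is a separate decision that both ingestion sources would need to agree on.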
l
@blue-plastic-11088 @careful-insurance-60247 We are looking into this and will fix this soon
👍 1