Hi, we recently upgraded to DataHub 0.10.0, we are...
# troubleshoot
w
Hi, we recently upgraded to DataHub 0.10.0. We are ingesting metadata for Glue + dbt and noticed that the list of columns is now duplicated:
• one set (coming from Glue) has no description
• the second, repeated set has the description coming from dbt
After a bit of digging, I noticed that in the GraphQL response the Glue datasets have fieldPaths as version 2:
fields: 
  0:
    description: null
    fieldPath: "[version=2.0].[type=string].a_column_x"
    globalTags: null
    glossaryTerms: null
    isPartOfKey: false
    jsonPath: null
    label: null
    nativeDataType: "string"
    nullable: true
    recursive: false
    type: "STRING"
    __typename: "SchemaField"
While the dbt sibling entity has fieldPaths version 1(?):
siblings: 
  isPrimary: false
  siblings: 
    0:
      ... 
      schemaMetadata: 
        fields: 
          ...
          60: 
            description: "A description for column x"
            fieldPath: "a_column_x"
            globalTags: null
            glossaryTerms: null
            isPartOfKey: false
            jsonPath: null
            label: null
            nativeDataType: "varchar"
            nullable: false
            recursive: false
            type: "STRING"
            __typename: "SchemaField"
I'm not really sure how it was before, but I found that interesting. It might also be worth mentioning that we have enabled stateful ingestion for both dbt and Glue, and that it worked well before the upgrade. Any ideas what might be going wrong?
a
Hi @white-shampoo-69122, @dazzling-judge-80093 might be able to help you out here!
w
We thought it might be similar to https://datahubspace.slack.com/archives/CUMUWQU66/p1676401053834339 but there doesn’t seem to be any difference in uppercase vs lowercase. The only thing we can find atm is the fieldPaths difference.
After some more digging: as I understand it, the Glue source records the fieldPaths as version=2.0 while dbt only emits v1. At the moment I'm unsure whether the UI should deal with merging different versions (I saw some comments against that, but also saw some code that is meant to deal with downgrading version=2.0 to v1), or whether there should be a config for dbt to set version=2.0 for fieldPaths?
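To make the mismatch concrete, here is a minimal sketch of what "downgrading" a version=2.0 fieldPath to v1 could look like, assuming the v2 annotations are always bracketed segments like the ones in the GraphQL response above. The helper names simplify_field_path and group_by_simple_path are hypothetical and are not DataHub APIs; the real merging code may behave differently.

```python
import re
from collections import defaultdict

# Matches one bracketed v2 segment such as "[version=2.0]" or "[type=string]",
# plus the dot that follows it (if any).
_BRACKET_SEGMENT = re.compile(r"\[[^\]]*\]\.?")


def simplify_field_path(field_path: str) -> str:
    """Hypothetical helper: strip bracketed v2 segments to get a v1-style path."""
    if not field_path.startswith("[version=2.0]"):
        return field_path  # assume it is already a v1-style path
    return _BRACKET_SEGMENT.sub("", field_path).rstrip(".")


def group_by_simple_path(field_paths):
    """Group raw fieldPaths that refer to the same column once simplified."""
    groups = defaultdict(list)
    for path in field_paths:
        groups[simplify_field_path(path)].append(path)
    return dict(groups)


if __name__ == "__main__":
    glue_path = "[version=2.0].[type=string].a_column_x"  # from the Glue dataset
    dbt_path = "a_column_x"                               # from the dbt sibling
    print(group_by_simple_path([glue_path, dbt_path]))
```

Any group with more than one entry is a column that would show up twice if the two siblings are merged on the raw fieldPath string.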
@gray-shoe-75895 Sorry to tag you personally, but would you be able to provide a direction to take with this? or am I missing something?
g
Huh this is a tricky one. I’m not sure where the merging logic should go (or if we should instead just try to upgrade everything to fieldpaths v2). I’ll defer to @bulky-soccer-26729 on if this is something the UI can handle
a
ah yeah this would be a tough one to handle on the UI side if fieldPaths are what's different. I can look into it but not trivial
w
It’d be interesting to know which approach will be taken. I think https://datahubspace.slack.com/archives/C02R2NBJXD1/p1680023359973639 is somewhat related, since we are building some extensions for our custom tools to extend metadata on fields, and we are running into questions on how to generate v2 fieldPaths.
I’ve created a Github issue regarding this so we don’t lose track 🙂
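For flat (non-nested) columns, the v2 shape seen in the Glue response earlier in this thread can be sketched as below. This is only an illustration based on that single example: the function name to_v2_flat_field_path is hypothetical, the choice of type token is an assumption, and nested or complex types need additional segments that this does not attempt to produce.

```python
def to_v2_flat_field_path(column_name: str, type_token: str) -> str:
    """Hypothetical helper: build a v2-style fieldPath for a flat column.

    Mirrors the shape "[version=2.0].[type=string].a_column_x" from the thread.
    Which type token a given source should use is an assumption here.
    """
    return f"[version=2.0].[type={type_token}].{column_name}"


if __name__ == "__main__":
    print(to_v2_flat_field_path("a_column_x", "string"))
```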
g
Thanks
m
This issue is still relevant: even the latest version (0.12.0) still produces v1 paths for dbt, which leads to duplicated fields in siblings that use v2 paths.
r
Hey there! 👋 Make sure your message includes the following information if relevant, so we can help more effectively! 1. Which DataHub version are you using? (e.g. 0.12.0) 2. Please post any relevant error logs on the thread!
b
We are also running into this issue (using the Iceberg and dbt sources) and I agree! It looks like @white-shampoo-69122's GitHub issue was automatically closed as stale. And as far as I can tell, the dbt source still hasn't been updated to emit v2 fieldPaths. Is there anything holding it back on the v1 spec? Is there some sort of blocker preventing this change? Or is it just a matter of priorities? I (or one of my teammates) may be able to tackle this, if it's just a matter of getting it done. 🙂
g
Mainly a matter of priorities; we haven't gotten around to it. The one thing to call out here is that we should probably emit v2 fieldPaths only for Iceberg and similar sources, since we still need them to be v1 for things like Snowflake/BigQuery/etc.