Hi all, I’m at the stage of doing a proof of concept to learn DataHub #getting-started

bland-wolf-37286

10/27/2021, 10:01 AM

Hi all, I’m at the stage of doing a proof of concept to learn how DataHub can satisfy our data catalogue needs. So far I’ve spun up the quickstart deployment and have ingested metadata from Hive tables. That was all quite straightforward thanks to good documentation and examples. I’m now looking at taking this further to explore how we can associate extended information about schema fields over and above `name`, `type` and `description`. For example, we might want to add information about the `format` (perhaps ‘UUID’, ‘ISO8601 date’ or some other free text), `source` (where does data in the field originate from) and perhaps other attributes we might define. This extended information will need to be editable from within the UI as well as via the API. I’ve been looking at doing this by extending the metadata model, adding attributes to `SchemaField.pdl` and `EditableSchemaFieldInfo.pdl` then chasing the changes through, but it looks like I need to make changes in quite a lot of other places (so far I have edits in 10 different `pdl`, `graphql`, `json` and `java` files). I thought it best to pause at this point and ask the community on here whether this is the right way to go about this or if there’s a better way that I have overlooked?

mammoth-bear-12532

10/27/2021, 2:50 PM

Hi @bland-wolf-37286 welcome! For the requirements you have, have you considered using either a controlled vocabulary of terms (via business glossary) or free form tags to attach to the fields?

mammoth-bear-12532

10/27/2021, 2:51 PM

For free form text to enhance the description of the fields, you can do that through the ui and api already using the “editableSchemaMetadata” aspect.
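For readers following along, here is a minimal sketch of what an `editableSchemaMetadata` payload for a single field might look like, built as a plain Python dict. The key names `editableSchemaFieldInfo`, `fieldPath` and `description` follow the aspect's model as I understand it. The helper `build_editable_schema_metadata`, the dataset URN, the field name and the simplified envelope are all hypothetical and not checked against any particular DataHub version or its exact wire format.

```python
import json


def build_editable_schema_metadata(descriptions):
    """Build an editableSchemaMetadata-style dict from {fieldPath: description}."""
    return {
        "editableSchemaFieldInfo": [
            {"fieldPath": path, "description": text}
            for path, text in sorted(descriptions.items())
        ]
    }


# Hypothetical dataset and field, for illustration only.
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:hive,example_db.events,PROD)"

aspect = build_editable_schema_metadata(
    {"event_id": "Unique event identifier. Format: UUID."}
)

# Simplified envelope (an assumption, not the exact API payload).
proposal = {
    "entityUrn": DATASET_URN,
    "aspectName": "editableSchemaMetadata",
    "aspect": aspect,
}

print(json.dumps(proposal, indent=2))
```

Note that the only free-text slot per field here is `description`, which is why extra attributes such as a format end up inside the description text unless the model itself is extended.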

bland-wolf-37286

10/27/2021, 3:09 PM

Hi @mammoth-bear-12532, thanks for replying. At the moment we’re exploring the different possibilities for how to attach additional metadata to schema fields. I suspect that a controlled vocabulary of terms may not meet the needs of the owning teams, but it’s certainly something for me to look into more closely. With regard to the `editableSchemaMetadata` aspect, am I right in thinking that out of the box it will allow me to edit the field description in the UI, but that if I want to add additional attributes (rather than just including these in the description text itself) I will need to either extend `editableSchemaFieldInfo` or define something similar for the custom attributes (say `editableSchemaFieldCustomInfo`) and add an array of it to `editableSchemaMetadata`? I would then need to regenerate the code and make a number of other code changes to pass the values through and use them in the GraphQL API, and finally make changes to the UI to expose the fields for editing. Part of the work I am currently doing is looking at whether we need to extend DataHub to fit our needs (which are not yet fully defined) and if so, how much effort is required in doing so.

big-carpet-38439

10/27/2021, 9:25 PM

@bland-wolf-37286 Yes, that is correct. We are pretty strongly typed from the model to the UI.

big-carpet-38439

10/27/2021, 9:26 PM

a) Add new field to PDL
b) Update GraphQL mapper to include new field
c) Fetch new field by extending gql in UI
d) Display new field in UI
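To illustrate why one new attribute touches every layer, here is a toy sketch of steps (a) to (d) in Python. It is not DataHub code: the classes `SchemaFieldModel` and `SchemaFieldGql`, the functions, and the `data_format` attribute are hypothetical stand-ins for the generated PDL record, the GraphQL type and mapper, the UI query and the UI component.

```python
from dataclasses import dataclass
from typing import Optional


# (a) Model layer: stands in for a record generated from a PDL schema.
@dataclass
class SchemaFieldModel:
    field_path: str
    description: Optional[str] = None
    data_format: Optional[str] = None  # hypothetical new attribute


# (b) API layer: stands in for the GraphQL type and its mapper.
@dataclass
class SchemaFieldGql:
    fieldPath: str
    description: Optional[str] = None
    format: Optional[str] = None  # the new attribute must be declared here too


def map_to_gql(model: SchemaFieldModel) -> SchemaFieldGql:
    # Fields are copied explicitly, so a forgotten field is silently dropped.
    return SchemaFieldGql(
        fieldPath=model.field_path,
        description=model.description,
        format=model.data_format,
    )


# (c) UI query: stands in for the field selection in the front-end query.
UI_SELECTION = ("fieldPath", "description", "format")


def fetch_for_ui(gql: SchemaFieldGql) -> dict:
    return {name: getattr(gql, name) for name in UI_SELECTION}


# (d) UI display: stands in for the component that renders one row.
def render_row(row: dict) -> str:
    return " | ".join(str(row[name] or "-") for name in UI_SELECTION)


model = SchemaFieldModel("event_id", "Unique event identifier", "UUID")
print(render_row(fetch_for_ui(map_to_gql(model))))
```

Removing `format` from any one of the four places makes the value disappear from the rendered row, which mirrors the "chase the change through every layer" experience described in this thread.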

mammoth-bear-12532

10/28/2021, 12:03 AM

@bland-wolf-37286: would definitely like to understand if the combo of tags (free form) and glossary terms (controlled vocab) are not enough .. and what are the factors getting in the way of using them as a way to layer on semantics on fields

big-carpet-38439

10/28/2021, 8:31 PM

+1!

bland-wolf-37286

11/01/2021, 10:55 AM

Apologies for the delayed response - I’ve had a couple of days off. We might want to associate with a schema field attributes such as the `format` and `source` of the data the field contains. The `format` attribute might contain something like ‘UUID’, ‘ISO8601 date’, ‘4 digit year followed by 2 letter period code’, etc. The `source` attribute might contain something like ‘HTTP request payload’, ‘extracted from HTTP request path’, ‘derived from X’, etc. There might be other attributes that need to be included.

We could incorporate all of this into the field description, but that potentially makes the field description long and unwieldy, making the content less usable. Glossary terms could work well for any attributes that have a limited set of well known values, but would be less useful for attributes such as `format` and `source` which would be rather more freeform in nature. Tags don’t intuitively feel like the right solution for attributes such as `format` - when I think of a tag I have in mind a short but meaningful label, possibly backed with a lengthier description to give additional context. The potential problem with this is that it may be difficult to derive a suitable tag label for a `format` - shortening it too much may lose meaning whereas using the full text doesn’t feel the right way to go. That’s not to say that terms and tags couldn’t be made to work, it’s just that it may not be a solution that gives the best experience for our users.

It’s why I was looking at what’s involved in extending the model to support additional attributes - to feed into the internal discussion here about how we want to set up and use DataHub. Extending the model and feeding those extensions through to GraphQL and the UI isn’t well documented at the moment so we wanted to gain an understanding of the effort involved in doing so. I tried two different approaches, the first being to add two new attributes, but I couldn’t work out all the places I needed to change to make the new attributes ingestable, queryable through GraphQL and visible/editable in the UI. I then changed tack and tried renaming an existing attribute, the idea being to see what breaks and fix the code all the way through until it works again as a means of identifying all the places that changes are required, then using the knowledge gained to reset and add new attributes. With changes in 26 different files I still hadn’t got to the point of fixing what I had broken so I called time on the exercise. Though I wasn’t successful at extending the model, the exercise was still valuable in terms of understanding the potential effort required.
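As a rough illustration of the "put it all in the description" option weighed above, the sketch below appends attributes to a description as trailing `key: value` lines and parses them back out. The helpers `pack_description` and `unpack_description` and the footer convention are hypothetical, not anything DataHub provides, and the sketch assumes the description text itself contains no blank line.

```python
def pack_description(text, attributes):
    """Append attributes to a description as trailing 'key: value' lines."""
    text = text.strip()
    if not attributes:
        return text
    footer = "\n".join(f"{key}: {value}" for key, value in attributes.items())
    return f"{text}\n\n{footer}"


def unpack_description(packed):
    """Split a packed description back into (text, attributes).

    Assumes the first blank line separates the text from the footer.
    """
    text, _, tail = packed.partition("\n\n")
    attributes = {}
    for line in tail.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # ignore lines that are not 'key: value'
            attributes[key] = value
    return text, attributes


packed = pack_description(
    "Unique event identifier.",
    {"format": "UUID", "source": "HTTP request payload"},
)
print(packed)
print(unpack_description(packed))
```

This keeps everything editable through the existing description field, but the attributes are still shown to users as one block of text, so it does not remove the usability concern raised in the message above.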

loud-island-88694

11/02/2021, 4:07 PM

@bland-wolf-37286 We're planning on addressing the effort involved in extending the model. Please stay tuned.

loud-island-88694

11/02/2021, 4:08 PM

separately, your use-case probably warrants the introduction of something like "reference data" which we have heard from other people in the community too.

bland-wolf-37286

11/02/2021, 4:08 PM

I certainly will stay tuned 👍

loud-island-88694

11/02/2021, 4:08 PM

we can then associate the right value from reference data with the columns

cool-television-57439

05/31/2023, 7:14 AM

Hello, has there been any progress in incorporating reference data into DataHub?
