# getting-started
b
Hi all, I’m at the stage of doing a proof of concept to learn how DataHub can satisfy our data catalogue needs. So far I’ve spun up the quickstart deployment and have ingested metadata from Hive tables. That was all quite straightforward thanks to good documentation and examples. I’m now looking at taking this further to explore how we can associate extended information about schema fields over and above `name`, `type` and `description`. For example, we might want to add information about the `format` (perhaps ‘UUID’, ‘ISO8601 date’ or some other free text), `source` (where the data in the field originates from) and perhaps other attributes we might define. This extended information will need to be editable from within the UI as well as via the API. I’ve been looking at doing this by extending the metadata model, adding attributes to `SchemaField.pdl` and `EditableSchemaFieldInfo.pdl` then chasing the changes through, but it looks like I need to make changes in quite a lot of other places (so far I have edits in 10 different `pdl`, `graphql`, `json` and `java` files). I thought it best to pause at this point and ask the community on here whether this is the right way to go about this or if there’s a better way that I have overlooked?
m
Hi @bland-wolf-37286 welcome! For the requirements you have, have you considered using either a controlled vocabulary of terms (via business glossary) or free form tags to attach to the fields?
For free form text to enhance the description of the fields, you can do that through the UI and API already using the “editableSchemaMetadata” aspect.
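As a rough illustration of the suggestion above, the sketch below builds a payload shaped like the `editableSchemaMetadata` aspect, attaching a description, a free form tag and a glossary term to one field. It is simplified: the key names reflect an assumed reading of the aspect's schema and should be checked against the PDL in your DataHub version, audit stamps and other bookkeeping fields are omitted, and the field path, tag name and term name are made up for the example.

```python
import json


def editable_field_info(field_path, description=None, tags=(), terms=()):
    """Build one per-field entry of the aspect (assumed, simplified shape)."""
    info = {"fieldPath": field_path}
    if description is not None:
        info["description"] = description
    if tags:
        # Free form tags are referenced by URN.
        info["globalTags"] = {"tags": [{"tag": f"urn:li:tag:{t}"} for t in tags]}
    if terms:
        # Glossary terms (controlled vocabulary) are also referenced by URN.
        info["glossaryTerms"] = {
            "terms": [{"urn": f"urn:li:glossaryTerm:{t}"} for t in terms]
        }
    return info


# Hypothetical field, tag and term names, for illustration only.
aspect = {
    "editableSchemaFieldInfo": [
        editable_field_info(
            "request_id",
            description="Identifier assigned to each inbound request.",
            tags=["uuid"],
            terms=["HttpRequestPayload"],
        )
    ]
}

print(json.dumps(aspect, indent=2))
```

Anything beyond description, tags and terms (for example a dedicated `format` attribute) has no slot in this shape, which is what motivates the model-extension question in the thread.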
b
Hi @mammoth-bear-12532, thanks for replying. At the moment we’re exploring the different possibilities for how to attach additional metadata to schema fields. I suspect that a controlled vocabulary of terms may not meet the needs of the owning teams, but it’s certainly something for me to look into more closely. With regard to the `editableSchemaMetadata` aspect, am I right in thinking that out of the box it will allow me to edit the field description in the UI, but if I want to add additional attributes (rather than just including these in the description text itself) I will need to either extend `editableSchemaFieldInfo` or define something similar for the custom attributes (say `editableSchemaFieldCustomInfo`) and add an array of it to `editableSchemaMetadata`? I would then need to regenerate the code, make a number of other code changes to pass the values through and use them in the GraphQL API, and finally make changes to the UI to expose the fields for editing. Part of the work I am currently doing is looking at whether we need to extend DataHub to fit our needs (which are not yet fully defined) and, if so, how much effort is required in doing so.
b
@bland-wolf-37286 Yes that is correct. We are pretty strongly typed from the model to the UI:
a) Add the new field to the PDL
b) Update the GraphQL mapper to include the new field
c) Fetch the new field by extending the GraphQL (gql) query in the UI
d) Display the new field in the UI
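To make steps a) to d) concrete, here is a toy model of the chain in plain Python. It is not DataHub code, and every class and function name in it is hypothetical; in DataHub the corresponding layers would be the PDL record, the GraphQL mapper, the GraphQL query issued by the UI and the UI component. The point it tries to show is that a new attribute (here `format`) only reaches the screen if every layer is updated to carry it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# (a) Model layer: stands in for the PDL record with the new field added.
@dataclass
class FieldInfoModel:
    field_path: str
    description: Optional[str] = None
    format: Optional[str] = None  # hypothetical new attribute


# (b) Mapper layer: stands in for the GraphQL mapper. It copies fields
# explicitly, so a field it does not mention never reaches the API.
def map_to_graphql(model: FieldInfoModel) -> Dict[str, Optional[str]]:
    return {
        "fieldPath": model.field_path,
        "description": model.description,
        "format": model.format,
    }


# (c) Query layer: stands in for the UI's GraphQL selection set. Only
# the requested fields are returned.
def run_query(api_object: Dict[str, Optional[str]], selection: List[str]) -> Dict[str, Optional[str]]:
    return {name: api_object[name] for name in selection}


# (d) Display layer: stands in for the UI component rendering the field.
def render(row: Dict[str, Optional[str]]) -> str:
    fmt = row.get("format") or "n/a"
    return f"{row['fieldPath']} (format: {fmt})"


model = FieldInfoModel("request_id", "Inbound request id", format="UUID")
api_object = map_to_graphql(model)

# Query extended to select the new field: the value is displayed.
print(render(run_query(api_object, ["fieldPath", "format"])))
# Query not extended: the value exists in the model but never shows up.
print(render(run_query(api_object, ["fieldPath"])))
```

Skipping any one layer has the same visible effect as the second call: the attribute is stored but silently missing downstream, which is why the edits fan out across so many files.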
m
@bland-wolf-37286: would definitely like to understand if the combo of tags (free form) and glossary terms (controlled vocab) are not enough .. and what are the factors getting in the way of using them as a way to layer on semantics on fields
b
+1!
b
Apologies for the delayed response - I’ve had a couple of days off. We might want to associate with a schema field attributes such as the `format` and `source` of the data the field contains. The `format` attribute might contain something like ‘UUID’, ‘ISO8601 date’, ‘4 digit year followed by 2 letter period code’, etc. The `source` attribute might contain something like ‘HTTP request payload’, ‘extracted from HTTP request path’, ‘derived from X’, etc. There might be other attributes that need to be included. We could incorporate all of this into the field description, but that potentially makes the field description long and unwieldy, making the content less usable.
Glossary terms could work well for any attributes that have a limited set of well known values, but would be less useful for attributes such as `format` and `source` which would be rather more freeform in nature. Tags don’t intuitively feel like the right solution for attributes such as `format` - when I think of a tag I have in mind a short but meaningful label, possibly backed with a lengthier description to give additional context. The potential problem with this is that it may be difficult to derive a suitable tag label for a `format` - shortening it too much may lose meaning whereas using the full text doesn’t feel the right way to go. That’s not to say that terms and tags couldn’t be made to work, it’s just that it may not be a solution that gives the best experience for our users.
It’s why I was looking at what’s involved in extending the model to support additional attributes - to feed into the internal discussion here about how we want to set up and use DataHub. Extending the model and feeding those extensions through to GraphQL and the UI isn’t well documented at the moment so we wanted to gain an understanding of the effort involved in doing so. I tried two different approaches, the first being to add two new attributes, but I couldn’t work out all the places I needed to change to make the new attributes ingestible, queryable through GraphQL and visible/editable in the UI. I then changed tack and tried renaming an existing attribute, the idea being to see what breaks and fix the code all the way through until it works again as a means of identifying all the places where changes are required, then using the knowledge gained to reset and add new attributes. With changes in 26 different files I still hadn’t got to the point of fixing what I had broken so I called time on the exercise. Though I wasn’t successful at extending the model, the exercise was still valuable in terms of understanding the potential effort required.
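The tag-label concern can be illustrated with a small sketch. The `to_tag_name` helper and the `format:` prefix convention are hypothetical and not a DataHub feature; the sketch simply turns a free-text value into a tag-style label to show how short values stay readable while longer ones become unwieldy.

```python
import re


def to_tag_name(attribute: str, value: str) -> str:
    """Turn a free-text attribute value into a tag-style label (hypothetical convention)."""
    # Lower-case the value and collapse every run of non-alphanumerics to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    return f"{attribute}:{slug}"


# Example values taken from the message above.
for value in ["UUID", "ISO8601 date", "4 digit year followed by 2 letter period code"]:
    print(to_tag_name("format", value))
```

The first two values give compact labels, while the third yields a label as long as the sentence it came from, which is the trade-off between losing meaning and using the full text.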
l
@bland-wolf-37286 We're planning on addressing the effort involved in extending the model. Please stay tuned.
Separately, your use case probably warrants the introduction of something like "reference data", which we have heard about from other people in the community too.
b
I certainly will stay tuned 👍
l
we can then associate the right value from reference data with the columns
c
Hello, has there been any progress in incorporating reference data into DataHub?