# advice-metadata-modeling
Hi all, I've just locally deployed and started playing around with Datahub. At our company we're looking for a Data Discovery tool that might help us to push forward a Data Mesh initiative we launched last year. We'd like to somehow use this Data Discovery tool to give more visibility to the Data Products that the different domains will be creating. After ingesting the example data in Datahub and taking a look at the main entities you can see in the UI (Datasets, Dashboards, Pipelines, Feature tables...) I was wondering if someone has tried to map the concept of a Data Product within Datahub's metamodel. Ideally I'd like to create a new entity called Data Product that will somehow include the components and metadata that make up a DP (pipelines, output ports...). Any ideas/thoughts?
👍 2
You could try using the glossary to get started quickly. I've used it for grouping data products. I added a quick example to the demo page (it might get reset, dunno how long the link works) https://demo.datahubproject.io/glossaryTerm/urn:li:glossaryTerm:9bfcdf84-cac1-4a5c-8e2e-10c3608fe30a/Documentation?is_lineage_mode=false It's not perfect but like 80% of the way there in my experience
👍 1
thanks for the answer @ambitious-piano-33685. From my understanding the Glossary section in a Data Catalog helps to better understand the business terms. Exposing the data products through the glossary in Datahub will definitely help when it comes to better understanding the DP, but I think it doesn't cover all the things that I have in mind. For instance I'd like to better understand the processes or workflows that are implemented within the data product, and I can't see an easy way to extract this info from the Glossary section.
Even though the feature is named "glossary", what it really allows you to do is to 1) create custom pages with titles and documentation, 2) organize those pages into various groups, 3) link those pages to any dataset or almost any other object. I use it not just for business terms but also for policies, KPI definitions, rules, guides and other documentation related to data. I'm not a fan of how the feature is named "glossary" because it's really useful for so much more than just business terms 😂
For data-product-related processes or workflows we have used either plain documentation, or have linked it to the relevant other objects in datahub. So far it has been working well enough, though as we get more experience with the "data as product" thing I can imagine developing dedicated features and entities for products. For now the main features that are missing are IMO social features (like commenting on data products) and being able to easily track who is using your data product.
thank you for the hints @ambitious-piano-33685. I agree, the social aspect of the data products is a very important one. Another feature that might be interesting: rating the data products. Anyway, I'll keep playing around with datahub to better understand its features.
👍 1
Thank you for this thread @sticky-twilight-17476 and @ambitious-piano-33685! I too think Glossary is a good low-effort start to prove Data Product concepts. The Data Product in the Glossary can then be linked to multiple Dataset implementations as files, tables, streams, etc. I want to play with the "Properties" tab in the demo by adding a few properties, do you know how i can configure Property Names and Values from the UI?
it is not possible to modify properties from ui now. Not sure if @bulky-soccer-26729 will add it anytime soon, i recall they wanted to expand the UI capabilities.
🤔 1
thanks @better-orange-49102! and if i want to add property name&value pairs programmatically, do you know where i am supposed to do that? i have been reading the GlossaryTerm doco, it doesn't explicitly say what UI elements correspond to what code components.. i think it's the "customProperties" in the schemaMetadata, but i'm not sure... all guidance much appreciated!
this code creates a term from scratch
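from datahub.emitter.mcp import MetadataChangeProposalWrapper

# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import GlossaryTermInfoClass

gms_endpoint = "http://localhost:8080"
token = "xx"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint, token=token))

termUrn = "specifyyourownurn"  # a full URN, e.g. "urn:li:glossaryTerm:<id>"
# the original snippet was cut off here; the aspect fields below are an assumed completion
mcp = MetadataChangeProposalWrapper(
    entityUrn=termUrn,
    aspect=GlossaryTermInfoClass(
        name="Name of term",
        definition="Definition of term",  # placeholder value
        termSource="INTERNAL",
        customProperties={"property_name": "property value"},  # placeholder pair
    ),
)
graph.emit(mcp)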
to update the properties of an existing term, you can use the DataHubGraph client to fetch the existing term, then insert the properties into it and emit it back. there are examples in metadata-ingestion/examples/library
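The read-modify-write flow described above could look roughly like the sketch below. It is not taken from the thread: the helper names `merge_custom_properties` and `add_term_properties` are made up for illustration, and it assumes a recent `acryl-datahub` release where `DataHubGraph.get_aspect(entity_urn=..., aspect_type=...)` is available and `MetadataChangeProposalWrapper` can infer the aspect name from the aspect object.

```python
from typing import Dict, Optional


def merge_custom_properties(
    existing: Optional[Dict[str, str]], updates: Dict[str, str]
) -> Dict[str, str]:
    """Return a new dict holding the existing properties overlaid with the updates."""
    merged = dict(existing or {})
    merged.update(updates)
    return merged


def add_term_properties(graph, term_urn: str, updates: Dict[str, str]) -> None:
    """Fetch the term's glossaryTermInfo aspect, merge in new properties, emit it back.

    `graph` is expected to be a DataHubGraph instance (assumption: recent acryl-datahub).
    """
    # Imported lazily so the pure helper above can be used without acryl-datahub installed.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import GlossaryTermInfoClass

    # Read: get the current aspect so name, definition, etc. are preserved.
    info = graph.get_aspect(entity_urn=term_urn, aspect_type=GlossaryTermInfoClass)
    if info is None:
        raise ValueError(f"No glossaryTermInfo aspect found for {term_urn}")

    # Modify: overlay the new key/value pairs on whatever is already there.
    info.customProperties = merge_custom_properties(info.customProperties, updates)

    # Write: emit the whole aspect back; this replaces the stored aspect.
    graph.emit(MetadataChangeProposalWrapper(entityUrn=term_urn, aspect=info))
```

With a `graph` built as in the snippet above, a call would be something like `add_term_properties(graph, termUrn, {"output_port": "s3"})`, where the property name and value are placeholders. Reading the aspect first matters because emitting a fresh `GlossaryTermInfoClass` would overwrite the term's existing name and definition.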
👍 1
thanks heaps @better-orange-49102!
hi @eager-australia-69729, my colleague @future-helmet-59694 has managed to programmatically add new properties to our very first attempt to create a data product in datahub
🤩 1
I've just checked out her code and it's very similar to the snippet provided by @better-orange-49102
👍 1
excellent, thank you very much @future-helmet-59694 and @sticky-twilight-17476!
Hi @sticky-twilight-17476, it would be great if you could share the code here as well, since we have the same requirement.