# advice-data-governance
c
Hello. This is a "before I get started" topic. I have some "philosophical" questions to get me grounded regarding where the metadata is "mastered", i.e., which tool acts as the "source of truth". I'm currently using dbt and Snowflake and looking at metadata frameworks including DataHub. I'm trying to wrap my head around the ownership/mastership question. I understand the tech pretty well. Curious to know how others think about metadata ownership and where the source of truth lives: is it dbt? Is it DataHub (or data.world or Collibra or Atlan or Alation)? Some combination of the two (obviously), but where are the lines drawn? I love the idea of keeping docs, tests, and metadata right next to my models in dbt, managed in GitHub. If I take the "dbt as source of truth" approach, then I would obviously need to push metadata changes to the data catalog or enterprise metadata tool/framework. But there's way more to it than that ... metadata for the same objects is likely available in other forms from other sources (e.g., crowdsourcing for one, or a separate glossary of business terms, or some higher-fidelity lineage data, profiling, not to mention data contracts, access policies, etc.). How do you all think about and solve this? Thanks!
m
DataHub "should" be the source of truth. dbt metadata should be periodically ingested into DataHub, so in that sense dbt becomes a source of metadata. For us, we also enrich datasets with ownership metadata from Azure Directory etc., so dbt cannot be the source of truth. DataHub should be your centralised repo of metadata.
c
@modern-artist-55754 thanks for the response. That makes sense. To clarify further ... if DataHub is the source of truth, one might interpret that to mean that ALL metadata should have its "genesis" in DataHub. That is, for example, the definition of a table/column - its name, description, custom metadata, declarative tests, etc - should be authored in DataHub - via a web form or some such means. And that tools like dbt, NOT being the source of this metadata, should not be declaring that metadata directly (e.g., via a `schema.yml` file in GitHub) but should first PULL that data from the source of truth (DataHub), then execute `dbt build` from there. I am not suggesting this is a good thing to do, rather, that it is or could be the logical deduction from a strict application of "DataHub-as-source-of-truth". Is it rather more appropriate to say something like "DataHub is a window to the truth, a truth that is the union of various 'bits' of truth that come from multiple places (e.g., dbt and Azure Directory)." Thanks again!
m
Oh, I didn't suggest that all the metadata originates in DataHub, but as its name suggests, it should be the hub of all metadata. It should be the central repo for all metadata.
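
To make the "hub, not origin" idea from this thread concrete, here is a minimal Python sketch of a field-level merge: each metadata field is mastered by exactly one upstream source, and the hub view is the union of those pieces. Everything in it is an assumption for illustration only: the `FIELD_OWNERS` mapping, the source names (`dbt`, `directory`, `glossary`), the sample records, and the `merge_metadata` helper are hypothetical and are not part of dbt's or DataHub's APIs.

```python
# Hypothetical mapping: which upstream system "masters" each metadata field.
# In practice this is the policy decision the thread is discussing.
FIELD_OWNERS = {
    "description": "dbt",          # authored in schema.yml next to the model
    "tests": "dbt",                # declarative tests live with the model
    "owner": "directory",          # ownership comes from the company directory
    "glossary_terms": "glossary",  # business terms come from a separate glossary
}


def merge_metadata(records_by_source):
    """Build the hub's view of one dataset from per-source records.

    Each field is taken only from the source that masters it; values that
    other sources supply for the same field are ignored.
    """
    merged = {}
    for field_name, source in FIELD_OWNERS.items():
        record = records_by_source.get(source, {})
        if field_name in record:
            # Keep provenance so readers of the hub can see where a value came from.
            merged[field_name] = {"value": record[field_name], "source": source}
    return merged


# Illustrative records for a single dataset (made-up values).
dbt_record = {
    "description": "One row per order.",
    "tests": ["not_null", "unique"],
    "owner": "set-in-dbt",  # ignored: dbt does not master 'owner' in this policy
}
directory_record = {"owner": "analytics-team"}

hub_view = merge_metadata({"dbt": dbt_record, "directory": directory_record})
```

Under this policy `hub_view["owner"]` comes from the directory even though dbt also supplied a value, and `glossary_terms` is simply absent because no glossary record was provided. The hub is the place to read the combined result, while authorship stays with whichever system masters each field.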