Hi team, I see that ownership can only be declared...
# ui
b
Hi team, I see that ownership can only be declared for Hive tables via UI, but not for Hive schemas. Is there any way we can declare ownership for Hive schemas?
m
Hi Harvey, do you manage your Hive schemas separately from the Hive tables?
b
Hi Shirshanka, thanks for the reply! In our case, a Hive schema can be owned by Person A while the tables under that schema are owned by different people. So we need ownership at both the schema and the table level. Do you have suggestions on how we can declare schema ownership via DataHub (if it’s supported)?
b
Currently this is unsupported. This would require modeling "schema" as a separate top-level entity. It's something we've considered in the past, but haven't prioritized. I'm assuming this is something you folks need? Can you classify this as a "must have" vs "nice to have"?
m
Also just to make sure we are talking about the same thing, @boundless-student-48844 in your example, is the schema the "database" name (as commonly referred to in RDBMS world)? e.g. if `customer.account` and `customer.marketing_preferences` are the Hive table names, `customer` is the schema name?
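To make the naming convention in the question concrete, here is a minimal sketch that groups fully qualified Hive table names by their schema (database) prefix. The helper `group_by_schema` is hypothetical and only illustrates the `schema.table` split; it is not part of DataHub.

```python
from collections import defaultdict


def group_by_schema(table_names):
    """Group 'schema.table' names by their schema (database) prefix."""
    groups = defaultdict(list)
    for full_name in table_names:
        # Split on the first dot only: the left part is the schema,
        # the right part is the table name.
        schema, _, table = full_name.partition(".")
        groups[schema].append(table)
    return dict(groups)


tables = ["customer.account", "customer.marketing_preferences"]
grouped = group_by_schema(tables)
print(grouped)
```

Under this convention both tables fall under the single schema `customer`, which is the unit the ownership question is about.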
b
Hey @mammoth-bear-12532, yes! The “schema” I referred to is `database` in RDBMS world.
Hi @big-carpet-38439, thanks for sharing! In our case, “schemas” are associated with some important metadata, such as access groups and data stewards. (Alation even provides top users at schema level 😄) This is a must-have for us. For now I found a workaround: model the “schema” as a `Dataset` and ingest it into DataHub using the emitter interfaces directly. But we would definitely love to see this feature in a future DataHub release. We will explore on our end to see if we can contribute as well.
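A rough sketch of what this workaround could look like, assuming the schema `customer` is registered as if it were a Hive dataset and given an owner. The sketch uses only the standard library and builds the URNs and an ownership record by hand. The helper names mirror similar helpers in DataHub's Python library, but the functions below are local stand-ins, the `payload` layout is illustrative rather than the exact wire format, and the owner `person_a` is a made-up username. Nothing is sent to a server.

```python
import json


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN (local stand-in helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def make_user_urn(username: str) -> str:
    """Build a DataHub-style corp user URN (local stand-in helper)."""
    return f"urn:li:corpuser:{username}"


def schema_ownership_payload(schema: str, owners: list) -> dict:
    """Describe a Hive schema as a dataset with an ownership aspect.

    The dict layout is an assumption made for illustration only.
    """
    return {
        "urn": make_dataset_urn("hive", schema),
        "ownership": {
            "owners": [
                {"owner": make_user_urn(user), "type": "DATAOWNER"}
                for user in owners
            ]
        },
    }


# Hypothetical example: schema "customer" owned by "person_a".
payload = schema_ownership_payload("customer", ["person_a"])
print(json.dumps(payload, indent=2))
```

One caveat with this approach: the schema shows up alongside real tables as just another dataset, so there is no built-in link from the schema to the tables under it.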
m
yeah this makes sense @boundless-student-48844 and I think this definitely counts as a valid feature ask 🙂
b
Cool, lemme know if I need to raise a feature request ticket for the record 🙇‍♂️
Hey @mammoth-bear-12532, I’ve a related question if you don’t mind. 😅 On the topic of data ownership, what do you think of assigning ownership at schema (aka database) level vs table level? Is it uncommon or discouraged at big tech firms like LinkedIn to keep metadata such as ownership at “schema” / database level in addition to table level? We are trying to learn from industry practice to inform our internal data governance as the data landscape at our company scales rapidly.
m
Hey @boundless-student-48844, I think it is uncommon (at least in the big data space, from what I've seen) ... I think @loud-island-88694 in his past life at Airbnb also didn't do "database" level ownership.
However, setting replication policies, retention policies etc at database level is definitely common from what I've seen.
b
Interesting. How about access policy? Is there a need to grant access at schema/database level? Without a schema/database owner, how do we manage the approval workflow? And even for replication/retention policies, don't we need a schema owner to make the call?
m
@boundless-student-48844: Would be great if you could file a feature request here for supporting "schemas" or maybe we should call them "dataset containers"? (https://github.com/linkedin/datahub/issues/new?assignees=&labels=feature-request&t[…]ure-request.md&title=A+short+description+of+the+feature+request)
Yeah, @brainy-planet-58935 I think there are definitely cases where databases/schemas can be used to simplify access policy / retention policy / replication policy specification, especially when they align with the unit of ownership / stewardship. In my (decidedly skewed) experience, we had central platform teams who were responsible for infrastructure that implemented "central policies" across large classes of datasets (e.g. behavioral tracking data, database data, third-party data etc.) So there wasn't a "schema owner" per se, just owners (subject matter experts) for the individual tables.
b
I see, makes sense. In light of data mesh, we will have many owners from different BUs or entities. Federated data governance will require us to distribute ownership. A schema owner becomes important for coordinating policy within the schema, which could be a sub-domain.
m
Yes that makes complete sense... if you align ownership to data product definitions. Would be great to chat about how you envision this showing up in the metadata model and the UI.
b
Hey @mammoth-bear-12532, here’s the feature request for modelling schemas / databases as separate top-level entities (Dataset Container) https://github.com/linkedin/datahub/issues/2838 Thank you!
In terms of UI, this is what Alation has for its schema / database metadata page. It links from the schema to its tables and allows declaring ownership, access groups, and other custom metadata associated with that schema. From a UX point of view, I think this is great.
b
Thanks for the great product feedback! Let us digest and think more about how to prioritize this one