# contribute-code
b
Anddddd another RFC! Here's the first iteration of the Fine-grained Access Control RFC. The proposed design was heavily inspired by Elasticsearch's take on the same problem. Have a look and let me know your feedback!
t
Does it make sense to add the environment in/from which the actor wishes to consume the resource? This may become particularly important with the recent changes to data transfers out of the EU.
s
@big-carpet-38439 How would DataHub know which user belongs to which group? As we are using GSuite, I have not been able to ingest our users yet, so I am not sure whether there is something in there that will add this mapping. Access given to groups makes a lot of sense, but I am not sure how access given to a group would be mapped to access for the users in that group.
b
@better-hydrogen-26228 Unfortunately, not all assets are currently associated with an environment, so it'd be sort of inconsistent at this moment. In the future, there should be no reason we can't extend the ABAC model to filter the resource based on this type of attribute.
@square-activity-64562 Great question - for now we will leverage the CorpGroup model we already have. There will be 2 ways to create users and groups in DataHub from external systems:
1. Ad hoc - When a user logs into DataHub, we attempt to retrieve the groups they are in from the IdP claims. If a group they are in is not already in DataHub, one will be created automatically. The downside here is that a periodic sync is still needed to clean up removed users and groups.
2. Batch Sync - A periodic batch job pulls users and groups from the IdP using an API with admin privileges (to read all users and groups) and synchronizes them with the users / groups found in DataHub. The downside here is that you must schedule a recurring job for it.
Today, we do not have a Batch source compatible with the Google Identity API. However, we would absolutely love a contribution here! I think it could benefit others in the community as well. Additionally, we are working to implement #1 using OIDC claims (we should check to make sure Google Identity can return Groups claims).
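To make the trade-off between the two paths concrete, here is a minimal sketch. It assumes a hypothetical in-memory `GroupStore` standing in for CorpGroup storage; the class, its methods, and the user and group names are illustrative only and are not DataHub APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class GroupStore:
    """Hypothetical in-memory stand-in for CorpGroup storage (not a DataHub API)."""

    members: Dict[str, Set[str]] = field(default_factory=dict)

    def on_login(self, user: str, claimed_groups: List[str]) -> None:
        """Ad hoc path: create any missing group and add the user to it."""
        for group in claimed_groups:
            self.members.setdefault(group, set()).add(user)

    def batch_sync(self, idp_groups: Dict[str, Set[str]]) -> None:
        """Batch path: replace local state with a full snapshot from the IdP."""
        self.members = {group: set(users) for group, users in idp_groups.items()}


store = GroupStore()
store.on_login("alice", ["analytics", "finance"])
store.on_login("bob", ["analytics"])

# Suppose alice is later removed from "finance" in the IdP. The ad hoc path
# never sees that removal; only a full sync drops the stale membership.
store.batch_sync({"analytics": {"alice", "bob"}})
```

The ad hoc path only ever adds memberships, which is why the cleanup sync mentioned above is still needed.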
We also intend to provide the ability for admins to create users and groups using the UI. However, this will come after the RBAC work is implemented, so that not just anyone can add or remove them.
s
What would the process of creating a policy look like? So far we have around 1.4k datasets in DataHub (from ~10 databases), and we have not ingested all sources yet. Access in most cases is supposed to be at the database level. As there isn't any database entity in DataHub, I am wondering how we would create policies for 1.4k datasets. As I ingest more, we will have a few hundred dashboards, a few thousand charts, and probably much more that I am missing here. So we need some easy way to create these policies for our teams.
b
What would be the ideal way to generate these for you? By default, we will have read access to everything for everyone, and write access limited to owners.
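As a rough illustration of that default, a check along these lines would grant reads to everyone and writes only to owners. The function name, the action strings, and the deny-by-default handling of unknown actions are assumptions made for the sketch, not part of the RFC.

```python
from typing import Set


def is_allowed(actor: str, action: str, owners: Set[str]) -> bool:
    """Hypothetical default policy: anyone may read, only owners may write."""
    if action == "read":
        return True
    if action == "write":
        return actor in owners
    # Assumption: any action the sketch does not know about is denied.
    return False


owners_of_dataset = {"alice"}
reader_ok = is_allowed("bob", "read", owners_of_dataset)
writer_denied = is_allowed("bob", "write", owners_of_dataset)
owner_ok = is_allowed("alice", "write", owners_of_dataset)
```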
s
During ingestion, I'd like to be able to add 2 policies at the database level, so that for every database we have 2 policies in the system: one for read-only and one for read-write.
b
So policies would be metadata attached to a new dataset container concept? What would the policy say? Would it list exactly who should have each of those permissions?
s
I was thinking something like
source:
  ...

sink:
  ...

policy:
  policyA: ALLOWS_READ
  policyB: ALLOWS_READ_WRITE
Something like that, so that for every dataset sent to the sink we also add permissions to the 2 policies, policyA and policyB, which allow read and read_write respectively. This will enable database-level policies for group access. For our use case I don't want a policy per dataset (we have more than 1k datasets so far). I want 2 policies per database which I can give to groups.
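One way to picture the proposal is the sketch below. It assumes a hypothetical in-memory model in which each policy from the recipe collects the datasets seen at ingestion and is later granted to groups. The dataset names, group names, and helper functions are made up for illustration; only `policyA`, `policyB`, and their permission values come from the recipe above.

```python
from collections import defaultdict

# Policy names and permissions as written in the recipe's "policy" section.
PERMISSIONS = {"policyA": "ALLOWS_READ", "policyB": "ALLOWS_READ_WRITE"}

policy_resources = defaultdict(set)  # policy name -> datasets it covers
policy_groups = defaultdict(set)     # policy name -> groups granted the policy


def ingest(dataset: str) -> None:
    """Register a dataset under every policy listed in the recipe."""
    for policy in PERMISSIONS:
        policy_resources[policy].add(dataset)


def can(group: str, action: str, dataset: str) -> bool:
    """Return True if some policy granted to the group covers the action."""
    for policy, permission in PERMISSIONS.items():
        # Both permissions cover reads; only ALLOWS_READ_WRITE covers writes.
        covers = action == "read" or permission == "ALLOWS_READ_WRITE"
        if (
            covers
            and group in policy_groups[policy]
            and dataset in policy_resources[policy]
        ):
            return True
    return False


# Hypothetical datasets from one database, ingested with the same recipe.
for name in ["sales_db.orders", "sales_db.customers"]:
    ingest(name)

# Hypothetical grant, e.g. made later by an admin through a UI.
policy_groups["policyA"].add("team-de")
```

With this shape, granting a group access to a whole database is a single membership change on one of its two policies, regardless of how many datasets were ingested.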
b
Awesome feedback. And do you want to associate users / groups at ingestion time also? Or would you prefer to do this using an API / UI?
s
Would prefer a UI. But would at least need to be able to get in some admin users so they can use the UI.
b
Got it. Is this type of access control (dataset level, push based) a must-have for your organization?
s
We have country teams which should not under any circumstances be able to see other countries' data. This is checked by external auditors. We cannot roll out DataHub to the whole company without these access controls.
I am not sure what you mean by push based.