DataHub

Appreciate your help with this John :slightly_smiling_face:

So basically what we'll need to do is inject calls to our Authorizer engine for each entity in the resulting page, in a post-filtering step that runs on the search, searchAcrossLineage, searchAcrossEntities, listX, and browse GraphQL queries
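A minimal sketch of that post-filtering step, for illustration only. `SearchPage`, `FilteredPage`, `post_filter`, and the `is_authorized` callback are hypothetical names standing in for the real search result type and Authorizer call, not DataHub's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchPage:
    entities: List[str]  # entity URNs returned by the index for this page
    total: int           # total hits reported by the index, before filtering


@dataclass
class FilteredPage:
    entities: List[str]   # only the entities the actor may view
    total: int            # still the unfiltered total from the index
    filtered_count: int   # how many entities were dropped from this page


def post_filter(
    page: SearchPage,
    actor: str,
    is_authorized: Callable[[str, str], bool],  # assumed signature: (actor, entity_urn) -> bool
) -> FilteredPage:
    """Ask the authorizer about each entity in the page and drop the denied ones."""
    visible = [urn for urn in page.entities if is_authorized(actor, urn)]
    return FilteredPage(
        entities=visible,
        total=page.total,
        filtered_count=len(page.entities) - len(visible),
    )
```

Returning the filtered count next to the unfiltered total gives the client what it needs to decide how to display the page.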

Anyway, I'll send a reference to the place where we already do some of this (we validate that each entity in a list response actually exists as a real entity, not just a reference, which is somewhat similar) once I have that PR completed (hopefully very soon)

I think this will definitely be very useful going forward

Will we be able to merge our changes into open source DataHub, or will we have to keep them in our fork?

As you pointed out, the post-filtering will result in somewhat non-deterministic counts on lists (for example the total). But all we can do is provide all the information to the client, which can then decide on the most elegant way to display it

So then I'm assuming the way we can do this is like: a page request returns x results, of which y were filtered out. The number of items in the page will be equal to x - y. Then we can keep paginating until we fill our page size. Is that correct?
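A rough sketch of that "keep paginating until the page is full" loop. The `fetch` callback and its return shape (visible items, number of raw results scanned, unfiltered total) are assumptions made for illustration, not an existing DataHub client API:

```python
from typing import Callable, List, Tuple

# Hypothetical fetch(start, count) -> (visible_items, scanned, total), where
# `scanned` is how many raw results the server examined (x) and
# len(visible_items) is what survived post-filtering (x - y).
Fetch = Callable[[int, int], Tuple[List[str], int, int]]


def fill_page(fetch: Fetch, start: int, page_size: int) -> Tuple[List[str], int]:
    """Keep requesting results until page_size visible items are collected
    or the result set is exhausted. Returns (items, next_offset)."""
    items: List[str] = []
    offset = start
    while len(items) < page_size:
        # Only ask for what is still missing, so the page never overshoots.
        visible, scanned, total = fetch(offset, page_size - len(items))
        items.extend(visible)
        offset += scanned
        if scanned == 0 or offset >= total:
            break  # nothing left to scan
    return items, offset
```

The number of calls depends on how many entities get filtered out, so it cannot be known ahead of time. The returned offset is where the next page would start.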

Cool, I'll run that by my stakeholders but I expect that should work. Thank you again :slightly_smiling_face:

there is one additional option, which we've taken less consideration of up until this point: virtual offsets

wherein the server itself attempts to fill the page size requested by the client, using similar logic, and then returns a virtual offset to the client

but it doesn't help with the "total" count being off

total we will never know without traversing the entire set

that being said, we are open to alternative ideas which can make the user experience more seamless. if anyone has experience doing this in other systems the insight would be very much appreciated

<@U01GCJKA8P9> Any update on this :slightly_smiling_face:

<@U01GCJKA8P9> I work with <@U03TX1VKXCJ> . Our stakeholder doesn't like 2 aspects of the post filter:
1. total count off - unfixable
2. pagination requires unpredictable call count
We have been brainstorming and came up with several alternatives. The one we think is best is to translate the current user's policies into Elasticsearch query criteria. We aren't sure how difficult that will be. If we were to implement this feature and provide a configuration flag to enable/disable it, would DataHub be interested in merging that in? We are trying not to introduce forks
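To make the policy-translation idea concrete, here is a heavily simplified sketch. The `Policy` shape and the index field names used with it are made up for illustration and do not reflect DataHub's real policy model or search mappings. Only the `bool`, `terms`, `match_all`, and `match_none` constructs are standard Elasticsearch query DSL:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Policy:
    # Hypothetical simplified policy: the listed actors may view entities whose
    # indexed fields match ALL of the given values. An empty filter = everything.
    actors: List[str]
    resource_filter: Dict[str, List[str]]  # index field -> allowed values


def policies_to_filter(actor_ids: List[str], policies: List[Policy]) -> dict:
    """Build an Elasticsearch filter matching what any applicable policy grants."""
    should = []
    for policy in policies:
        if not set(policy.actors) & set(actor_ids):
            continue  # policy does not apply to this user or their groups
        if not policy.resource_filter:
            return {"match_all": {}}  # unrestricted grant, no filter needed
        should.append({
            "bool": {
                "filter": [
                    {"terms": {field_name: values}}
                    for field_name, values in policy.resource_filter.items()
                ]
            }
        })
    if not should:
        return {"match_none": {}}  # no applicable policy, nothing is visible
    return {"bool": {"should": should, "minimum_should_match": 1}}
```

Each applicable policy adds one clause, so the query grows with the number of policies, which is where the query complexity risk comes from.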

<@U01GCJKA8P9> Another idea we had was that an external source could precompute the permissions per document, and search would then add clause(s) that match on the permissions of the current user. For example, each document has a `read_permissions` field listing every group and user that has been computed as having access, and DataHub search adds a clause matching the user and each of the user's groups. This is our stakeholder's preferred solution, because it carries less risk of query complexity increasing substantially and affecting performance. What do you think of that approach?
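A small sketch of what the extra clause could look like, assuming `read_permissions` is indexed as a keyword field holding user and group identifiers. The function name and the example URNs are illustrative. `bool` and `terms` are standard Elasticsearch query DSL:

```python
from typing import List


def with_read_permissions(query: dict, user: str, groups: List[str]) -> dict:
    """Wrap an existing query so only documents readable by the user are returned.

    A single `terms` clause matches when the field contains ANY of the listed
    values, so the user and all of their groups fit in one clause.
    """
    principals = [user, *groups]
    return {
        "bool": {
            "must": [query],
            # Filter context: not scored, and cacheable by Elasticsearch.
            "filter": [{"terms": {"read_permissions": principals}}],
        }
    }


# Example: restrict a plain text search to what this user can read.
restricted = with_read_permissions(
    {"match": {"name": "orders"}},
    user="urn:li:corpuser:jdoe",
    groups=["urn:li:corpGroup:analytics"],
)
```

Because the filter runs inside the query, the total count and page sizes come back correct without any post-filtering. The cost moves to keeping `read_permissions` up to date whenever policies or group memberships change.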