view ACLs: enable admins to block view of sample d...
# feature-requests
s
view ACLs: enable admins to block view of sample data values given by profiling runs in Stats view
• Aggregated statistics in profiling runs should be fine.
• Being able to see column descriptions and all datasets might allow data discovery.
• Being able to see sample data in profiling runs amounts to people getting access to data that they should not have access to.
There could be sensitive info that a team without access should not be able to view. I would rather block everyone from seeing sample values than have them accidentally see data that they should not have access to. We are rolling out datahub to the whole company slowly. But if we had profiling on datahub it would show sample values. This can be a big concern during an external audit. This is one big reason we have not enabled profiling for most of our databases.
b
@square-activity-64562 On your third bullet point: is it sufficient that the data is disassociated from other columns? I have heard from some folks that because the sample values are random and not associated with other column values, this is less of a concern in most situations. I'm assuming that's not true for you guys?
s
@big-carpet-38439 One specific example is customer names. For our B2B clients, being able to see any customer name basically tells a person that this is a client. There can be other such examples where a single column value gives away important information. This becomes an external audit issue. For some other cases we wouldn't want to have sample values at all, e.g. Korea has some strict data storage laws. For Korea, before doing processing in another country we have to remove all PII. PII cannot be stored outside Korea. Our datahub is hosted in Singapore. We would like to have profiling (nulls, row counts, distinct etc.) but not any sample values. If we do get sample values then we may need to get another instance of datahub in Korea to comply with the laws, or we will have to skip the RDBMSs which are in Korea. We would prefer skipping sample values over missing out on profiling. This is a legal issue. If I start discussing this with my team in detail we will probably find many other examples. I think the simplest solution is to be able to block viewing of sample values.
b
Okay sounds good. So to clarify: is collecting the information during profiling still okay (i.e. you just need to restrict display)? Or is collection also a problem? Trying to determine where the flag to disable should live: either in the ingestion source itself or as an application setting.
s
Restricting display solves the first problem. Not storing sample values in the database solves both. So ideally two things: • view restriction permissions for sample values • storage restriction for sample values for databases where there are stricter requirements
I understand that doing both increases work. So maybe just disabling sample values should be enough.
Now that I have written this down, this is just a change in metadata ingestion. I can easily send a PR without getting into auth.
Let me try and send a PR for the 2nd case, where the profiling run does not send sample values. If you want you can do the first one. I think others have also asked for view restrictions.
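A rough sketch of what the 2nd case could look like on the ingestion side, to make the idea concrete. Everything here is an assumption for illustration: the `include_sample_values` flag, the `strip_sample_values` helper, and the dict shape of the profile (`fieldProfiles` / `sampleValues`) are hypothetical and not taken from the actual ingestion code.

```python
from copy import deepcopy
from typing import Any, Dict


def strip_sample_values(profile: Dict[str, Any], include_sample_values: bool) -> Dict[str, Any]:
    """Return the profile unchanged if the (hypothetical) flag is on,
    otherwise a copy with per-column sample values removed."""
    if include_sample_values:
        return profile
    cleaned = deepcopy(profile)  # leave the caller's profile untouched
    for field in cleaned.get("fieldProfiles", []):
        # Aggregates (null counts, distinct counts, ...) stay; only samples go.
        field.pop("sampleValues", None)
    return cleaned


# Assumed profile shape, for illustration only.
profile = {
    "rowCount": 1200,
    "fieldProfiles": [
        {
            "fieldPath": "customer_name",
            "uniqueCount": 340,
            "nullCount": 0,
            "sampleValues": ["Acme Corp", "Globex"],
        },
    ],
}

safe = strip_sample_values(profile, include_sample_values=False)
```

Because the samples are dropped before anything is emitted, they never reach storage, which covers both the display concern and the storage concern.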
q
Can you not just skip profiling for certain datasets by just using a deny and/or allow list under profile_pattern?
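For reference, a minimal sketch of how allow/deny matching of this kind typically behaves, assuming regex patterns where deny wins over allow. The function name and the dataset names are made up for illustration; this is not the actual ingestion implementation.

```python
import re
from typing import List


def is_profiling_allowed(dataset: str, allow: List[str], deny: List[str]) -> bool:
    """Assumed semantics: a dataset is profiled only if it matches some allow
    pattern and no deny pattern. Patterns are anchored at the start of the name."""
    if any(re.match(pattern, dataset) for pattern in deny):
        return False
    return any(re.match(pattern, dataset) for pattern in allow)


# Hypothetical patterns: profile everything except one database.
allow = [r".*"]
deny = [r"korea_db\..*"]

korea_allowed = is_profiling_allowed("korea_db.customers", allow, deny)
sg_allowed = is_profiling_allowed("sg_db.orders", allow, deny)
```

Note that under this model a denied dataset gets no profile at all, aggregated statistics included.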
s
We have ~2000 datasets so far in datahub. If we try to do this for individual datasets it will take very long. Let me assume we do decide to take the time and do that. First, we lose statistics on number of rows, distinct counts, mean, median etc. for all datasets which have sensitive information. That is useful information even without sample values. The second problem is all the what-if scenarios. E.g. what if a country team added a column to a dataset on which profiling is enabled, and the new column has sensitive information? We made the decision to enable profiling when it didn't have sensitive data. Now it has sensitive data and sample values are shown in datahub. Country teams should not be able to see another country's data, but because of this new column they are now able to see sensitive data of another country. Masking/redacting sensitive data by maintaining a list of what to deny is a disaster waiting to happen. That was one scenario. There can be many other scenarios if I sit down and think it through. When it is a legal as well as an external audit concern (both of which investors consider) we cannot take such chances. Not sending sample values eliminates all what-if scenarios. You cannot accidentally leak what you don't store.
View access restrictions are a different way to solve this problem. That will take time, and it will happen only if the datahub team decides to add View ACLs.
q
Makes sense. Could be tedious if you have to manually inspect datasets for sensitive information. I was hoping you could generate the deny list based on an existing role within your db. But either way, it seems like you need stats on the dataset, just without sample values being displayed. I probably misread your feature request.
b
yeah i think it makes sense for now to allow a feature flag at the ingestion source itself saying whether sample values are extracted. this should solve most of your use case very quickly. @helpful-optician-78938 can we add this to our backlog on profiling?
👍 1
👀 1
c
@big-carpet-38439 I really agree with Aseem; for the same couple of reasons, we didn’t enable profiling. If you guys can add something like “sample-data: false” in the source recipe, that would be really great!
👍 1
b
cc @mammoth-bear-12532 @little-megabyte-1074
l
Please head over & upvote if this is still relevant to you all!