Automated/Machine-learning Personal Data Discovery (a.k.a. PII). We have data warehouses with over 100,000 columns, which means it is no longer possible for a human to manually review and tag each column to identify whether it contains personal data. I see some commercial data catalogues have added a machine-learning module to help with this task. My question is: is anybody using this same technique with DataHub? Any insight much appreciated.
high-hospital-85984
01/12/2022, 8:46 AM
We had this problem, but just bit the bullet once and had the owner of each table tag each column with a proper level of sensitivity. After that, we enforce this tagging on all future changes to the tables in a CI pipeline. It's up to the implementer and the reviewer to make sure the tag is correct. Been working reasonably well.
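For readers wondering what such a CI check could look like, here is a minimal sketch. It is not the pipeline described above: the schema layout (a list of tables, each with a `columns` list and an optional `sensitivity` field), the level names in `ALLOWED_LEVELS`, and the function names are all hypothetical and only illustrate the idea of failing a build when a column is untagged.

```python
# Hypothetical sensitivity levels; the thread does not say which levels are used.
ALLOWED_LEVELS = {"public", "internal", "confidential", "pii"}


def find_untagged_columns(tables):
    """Return (table, column, reason) tuples for columns without a valid tag.

    `tables` is assumed to be a list of dicts such as
    {"name": "users", "columns": [{"name": "email", "sensitivity": "pii"}]}.
    """
    problems = []
    for table in tables:
        for column in table.get("columns", []):
            level = column.get("sensitivity")
            if level is None:
                problems.append((table["name"], column["name"], "missing sensitivity tag"))
            elif level not in ALLOWED_LEVELS:
                problems.append((table["name"], column["name"], f"unknown level {level!r}"))
    return problems


def exit_code(tables):
    """Print each problem and return 1 if any were found, else 0."""
    problems = find_untagged_columns(tables)
    for table, column, reason in problems:
        print(f"{table}.{column}: {reason}")
    return 1 if problems else 0
```

In a CI job, one would load the table definitions touched by the change and pass the result of `exit_code(...)` to `sys.exit`, so a missing or unknown tag blocks the merge. Whether the tag is actually correct still has to be judged by the implementer and the reviewer.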
bumpy-dream-34113
01/12/2022, 10:10 AM
Thanks for the reply. Roughly how many columns are involved?
high-hospital-85984
01/12/2022, 11:39 AM
About 10,000, so not on your scale, but still too much for a single person to govern (willingly).