Automated/Machine-learning Personal Data Discovery...
# advice-data-governance
b
Automated/Machine-learning Personal Data Discovery (a.k.a. PII). We have data warehouses with over 100,000 columns, which means it is no longer possible for a human to manually review and tag each column to identify whether it contains personal data. I see some commercial data catalogues have added a machine-learning module to help with this task. My question is: is anybody using this same technique with DataHub? Any insight much appreciated.
h
We had this problem, but just bit the bullet once and had the owner of each table tag each column with a proper level of sensitivity. After that, we enforce this tagging on all future changes to the tables in a CI pipeline. It's up to the implementer and the reviewer to make sure the tag is correct. Been working reasonably well.
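For anyone wondering what a CI check like that could look like, here is a minimal sketch. It is not our actual pipeline: the schema format (a dict of table -> column -> tag), the tag names in `ALLOWED_TAGS`, and the function names are all assumptions made up for illustration.

```python
# Hypothetical schema format: {table: {column: sensitivity_tag_or_None}}.
# The allowed sensitivity levels below are assumed, not taken from any real setup.
ALLOWED_TAGS = {"public", "internal", "confidential", "pii"}


def find_untagged_columns(schema):
    """Return sorted 'table.column' names whose tag is missing or not allowed."""
    problems = []
    for table, columns in schema.items():
        for column, tag in columns.items():
            if tag not in ALLOWED_TAGS:
                problems.append(f"{table}.{column}")
    return sorted(problems)


def exit_code(schema):
    """Print each offending column and return 1 if any exist, else 0.

    A CI job would pass this value to sys.exit() so that a non-zero
    result fails the build.
    """
    problems = find_untagged_columns(schema)
    for name in problems:
        print(f"missing or invalid sensitivity tag: {name}")
    return 1 if problems else 0
```

In a real pipeline the schema dict would be loaded from whatever files define the tables, and the job would end with `sys.exit(exit_code(schema))`. The check only guarantees that a tag is present and valid; whether it is the right tag is still up to the implementer and the reviewer.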
b
Thanks for the reply. Roughly how many columns are involved?
h
About 10,000, so not on your scale, but still too many for a single person to govern (willingly).
But we're looking forward to Snowflake's data classification feature.
b
@high-hospital-85984 Do you mean that downstream tables inherit tags from the table the data is coming from?