# advice-data-governance

bumpy-dream-34113

01/12/2022, 8:20 AM
Automated/machine-learning personal data discovery (a.k.a. PII). We have data warehouses with over 100,000 columns, which means it is no longer possible for a human to manually review and tag each column to identify whether it contains personal data. I see some commercial data catalogues have added a machine-learning module to help with this task. My question is: is anybody using this same technique with DataHub? Any insight much appreciated.
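For readers facing the same scale problem, one cheap first pass before any machine-learning module is a rule-based shortlist. The sketch below is not ML and is not a DataHub feature; it simply flags columns whose names or sampled values look like personal data so that a human only reviews the candidates. The pattern table, the 80% threshold, and the function name `suggest_pii_tags` are all hypothetical choices for illustration.

```python
import re

# Hypothetical name patterns; a real deployment would tune these per warehouse.
PII_NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile|msisdn", re.IGNORECASE),
    "name": re.compile(r"(first|last|full|sur)[-_]?name", re.IGNORECASE),
    "birth_date": re.compile(r"birth|(^|_)dob(_|$)", re.IGNORECASE),
}

# Deliberately loose check for "looks like an email address".
EMAIL_VALUE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def suggest_pii_tags(column_name, sample_values=()):
    """Return a sorted list of suggested PII labels for one column.

    Suggestions are hints for a human reviewer, not a final classification.
    """
    suggestions = {
        label
        for label, pattern in PII_NAME_PATTERNS.items()
        if pattern.search(column_name)
    }

    # Value-based hint: flag the column if most sampled strings look like emails.
    values = [v for v in sample_values if isinstance(v, str) and v]
    if values:
        email_share = sum(bool(EMAIL_VALUE.match(v)) for v in values) / len(values)
        if email_share >= 0.8:  # assumed threshold
            suggestions.add("email")

    return sorted(suggestions)
```

Running this over column names (and optionally a small sample of values per column) would give a candidate list that table owners can confirm or reject, which is far smaller than 100,000 columns of manual review.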

high-hospital-85984

01/12/2022, 8:46 AM
We had this problem, but just bit the bullet once and had the owner of each table tag each column with a proper level of sensitivity. After that, we enforce this tagging on all future changes to the tables in a CI pipeline. It's up to the implementer and the reviewer to make sure the tag is correct. Been working reasonably well.
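The CI enforcement described above could look roughly like the following sketch. It assumes table schemas are available to the pipeline as a nested dictionary (table name, then column name, then metadata) and that each column carries a `sensitivity` key; that layout, the allowed levels, and the function names are assumptions for illustration, not the poster's actual setup.

```python
# Hypothetical set of sensitivity levels; substitute your organisation's own.
ALLOWED_SENSITIVITY = {"public", "internal", "confidential", "pii"}


def find_untagged_columns(tables):
    """Return one message per column that lacks a valid sensitivity tag.

    `tables` maps table name -> column name -> metadata dict.
    """
    problems = []
    for table_name, columns in tables.items():
        for column_name, metadata in columns.items():
            level = metadata.get("sensitivity")
            if level not in ALLOWED_SENSITIVITY:
                problems.append(
                    f"{table_name}.{column_name}: "
                    f"missing or invalid sensitivity tag ({level!r})"
                )
    return problems


def ci_exit_code(tables):
    """Print every problem and return 1 if any exist, otherwise 0."""
    problems = find_untagged_columns(tables)
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

In a pipeline, the job would load the changed schema files, call `sys.exit(ci_exit_code(tables))`, and so block any merge that adds a column without a tag. Whether the tag is the right one still rests with the implementer and the reviewer.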

bumpy-dream-34113

01/12/2022, 10:10 AM
Thanks for the reply. Roughly how many columns are involved?

high-hospital-85984

01/12/2022, 11:39 AM
About 10,000, so not on your scale, but still too many for a single person to govern (willingly).
But we're looking forward to Snowflake's data classification feature.

busy-dentist-64466

03/10/2022, 2:20 PM
@high-hospital-85984 Do you mean that downstream tables inherit tags from the table the data is coming from?