A complete solution for open data platforms, enterprise data catalogs, data lakes and data management. Open source, mature, fully-featured and production ready.

DataHub

Automated/machine-learning personal data discovery (a.k.a. PII). We have data warehouses with over 100,000 columns, which means it is no longer possible for a human to manually review and tag each column to identify whether it contains personal data. I see some commercial data catalogues have added a machine-learning module to help with this task. My question is: is anybody using this same technique with DataHub? Any insight much appreciated.

We had this problem, but just bit the bullet once and had the owner of each table tag each column with a proper level of sensitivity. After that, we enforce this tagging on all future changes to the tables in a CI pipeline. It's up to the implementer and the reviewer to make sure the tag is correct. Been working reasonably well.
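For anyone wanting to try the same approach, the CI check described above might look roughly like the sketch below. It is only an illustration under assumptions: the `ALLOWED_LEVELS` set, the `sensitivity` key, and the in-memory layout of the table definitions are hypothetical and not taken from DataHub or from the setup described in this thread.

```python
# Hypothetical sensitivity levels; substitute whatever your organisation uses.
ALLOWED_LEVELS = {"public", "internal", "confidential", "pii"}


def find_untagged_columns(tables):
    """Return sorted 'table.column' names with a missing or invalid sensitivity tag.

    `tables` maps table name -> {column name -> metadata dict}. This layout is
    an assumption for the sketch; in practice it would be parsed from the
    schema files that live in the repository.
    """
    problems = []
    for table_name, columns in tables.items():
        for column_name, meta in columns.items():
            if meta.get("sensitivity") not in ALLOWED_LEVELS:
                problems.append(f"{table_name}.{column_name}")
    return sorted(problems)


def check(tables):
    """Print each offending column and return a CI-style exit code (0 = pass)."""
    problems = find_untagged_columns(tables)
    for name in problems:
        print(f"missing or invalid sensitivity tag: {name}")
    return 1 if problems else 0


# Example table definitions (made up for illustration).
EXAMPLE_TABLES = {
    "customers": {
        "id": {"sensitivity": "internal"},
        "email": {"sensitivity": "pii"},
        "signup_source": {},  # no tag: should fail the check
    },
    "orders": {
        "order_id": {"sensitivity": "internal"},
        "notes": {"sensitivity": "secret"},  # not an allowed level
    },
}
```

In a pipeline, the return value of `check(...)` would be passed to `sys.exit` so that a change introducing an untagged column fails the build. The check only verifies that a tag exists and is valid; whether the tag is correct is still up to the implementer and the reviewer.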

Thanks for the reply. Roughly how many columns are involved?

About 10,000, so not on your scale, but still too many for a single person to govern (willingly).

But we're looking forward to Snowflake's <https://www.snowflake.com/blog/bringing-the-worlds-data-together-announcements-from-snowflake-summit/|data classification feature>

<@U01AFJB5M9C> Do you mean that downstream tables inherit tags from the table the data is coming from?