Hi, Does Datahub allow any automatic (ML based?) c...
# ingestion
w
Hi, does DataHub allow any automatic (ML-based?) classification of the data assets it discovers? If not, is there a plan to incorporate such a feature in the future?
o
Hi Cristian! When you say automatic classification, are you referring to the entity type of the data asset, or something else? Generally we know what kind of data asset we are getting metadata about from the source of the information, and we automatically pull the corresponding information according to our internal models for that source. This doesn't use any ML algorithms, though. (E.g., if the source is MySQL, we know it is an RDBMS that holds datasets, and we can query its system metadata tables.)
w
Hey Ryan - I am thinking of something like this: I am ingesting a data asset, let's say a MySQL table. Based on some "intelligent" mechanism, DataHub can "identify" that my "customerAddress" field is a "Sensitive/PII" field, based on either the column name or the actual distribution of the values
and it would place an automatic "PII" tag on that column
I've seen this in one of Microsoft's offerings for Azure
o
We don't currently support intelligent introspection of fields for tagging, although similar behavior can be achieved by creating a set of regular expressions for the tag transformer: https://datahubproject.io/docs/metadata-ingestion/transformers/#adding-tags-by-schema-field-pattern
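To give a rough idea of what the regex approach amounts to, here is a minimal standalone Python sketch. It is not DataHub's transformer API; the `TAG_RULES` mapping, the `tags_for_field` helper, and the `urn:li:tag:PII` tag are all made up for illustration. It only shows the idea of matching field names against patterns and attaching tags to the ones that match.

```python
import re

# Hypothetical rules (an assumption for this sketch, not DataHub config):
# each regex on the field name maps to the tag URNs to attach.
TAG_RULES = {
    r".*(address|email|phone|ssn).*": ["urn:li:tag:PII"],
}


def tags_for_field(field_name, rules=TAG_RULES):
    """Return the tag URNs whose pattern matches the whole field name."""
    tags = []
    for pattern, urns in rules.items():
        # Case-insensitive so "customerAddress" matches "address".
        if re.fullmatch(pattern, field_name, flags=re.IGNORECASE):
            for urn in urns:
                if urn not in tags:  # avoid duplicate tags
                    tags.append(urn)
    return tags


if __name__ == "__main__":
    for name in ["customerAddress", "order_total"]:
        print(name, tags_for_field(name))
```

With these rules, "customerAddress" would pick up the PII tag and "order_total" would not. Note this only looks at column names, so it won't catch a sensitive column with an uninformative name.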
w
OK, got it. I'm not really sure that you can achieve with a regex what a model will give you, but it's definitely a good start, thanks!