I have a number of related use cases for what I'd describe a DataHub #getting-started

swift-account-97627

09/07/2020, 12:40 PM

I have a number of related use cases for what I'd describe as "Data Profile" and/or "Data Quality" attributes, most of which are per-field. For example: completeness (~non-null percentage), distinct values, or histograms of distinct values in enumeration-like fields. I've been doing some quick-and-dirty prototyping by adding these attributes to the SchemaField model, but that feels wrong. It seems like "Data Profile" is really a separate aspect to "Schema", but they both contain per-field information, so I'm not sure how best to model this. I could add "DataProfile" to the set of dataset aspects. But then I'd have two aspects containing field-level information (SchemaMetadata.SchemaFields and something like DataProfile.FieldProfiles). If this is the correct model, what would be a good way to associate each particular FieldProfile with a particular SchemaField? Or is there a different model that would be better?

More generally, it seems like there's a tension between two models for field-level aspects:

1. Dataset has many Aspects, some of which have metadata for many Fields
2. Dataset has many Fields, some of which have multiple Aspects

I don't have an opinion on which of these models is more "correct", but the current implementation only seems to really support one aspect per field, and pushes any extensions to favour model (1) above. Is this a conscious design decision, or has this question just not come up yet?
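To make model (1) concrete, here is a minimal Python sketch of two dataset-level aspects that both carry per-field entries and are associated through a shared field path. Everything in it is hypothetical: the `SchemaField` and `FieldProfile` classes are trimmed-down stand-ins, not DataHub's actual models, and joining on a `field_path` string is only one possible answer to the association question.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SchemaField:
    # Hypothetical stand-in for one entry of SchemaMetadata.SchemaFields.
    field_path: str
    native_type: str


@dataclass(frozen=True)
class FieldProfile:
    # Hypothetical entry of a DataProfile.FieldProfiles aspect.
    field_path: str       # assumed join key, same value as SchemaField.field_path
    completeness: float   # fraction of non-null values, 0.0 to 1.0
    distinct_count: int


def join_by_field_path(
    schema_fields: List[SchemaField],
    field_profiles: List[FieldProfile],
) -> Dict[str, Tuple[SchemaField, Optional[FieldProfile]]]:
    """Pair each schema field with its profile, or None if it was not profiled."""
    profiles = {p.field_path: p for p in field_profiles}
    return {f.field_path: (f, profiles.get(f.field_path)) for f in schema_fields}


# Made-up example data: two schema fields, only one of which has a profile.
schema = [SchemaField("user.id", "bigint"), SchemaField("user.country", "string")]
profile = [FieldProfile("user.country", completeness=0.97, distinct_count=42)]
joined = join_by_field_path(schema, profile)
```

The two aspects stay independently versioned here, at the cost of a join that can silently miss when a field is renamed in one aspect but not the other.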

bumpy-keyboard-50565

09/07/2020, 2:30 PM

This is an excellent question and shows that you have a deep understanding of DataHub's modeling framework. @ambitious-battery-33996 is currently working on an RFC which faced a similar issue. The proposed solution is to create a DatasetFieldUrn that can be used as a pointer without fully materialized field entities. Fields will be created as nodes in the graph to answer questions like "Give me all the PII-containing fields of this dataset based on annotations supplied to its upstream datasets" (aka fine-grained metadata propagation).
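As a rough illustration of the idea, the Python sketch below treats a field URN as a lightweight pointer (dataset URN plus field path) and uses it as a graph node to answer the PII question by walking field-level upstream edges. The class layout, the `urn:li:datasetField:(...)` string form, the simplified dataset URNs, and the `pii_fields` helper are all assumptions made for this example; the RFC may define these differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DatasetFieldUrn:
    # Hypothetical pointer type: no field entity is materialized behind it.
    dataset_urn: str
    field_path: str

    def __str__(self) -> str:
        # Assumed string form, for illustration only.
        return f"urn:li:datasetField:({self.dataset_urn},{self.field_path})"


def pii_fields(
    dataset_urn: str,
    upstreams: Dict[DatasetFieldUrn, List[DatasetFieldUrn]],
    annotated_pii: Set[DatasetFieldUrn],
) -> Set[DatasetFieldUrn]:
    """Fields of dataset_urn that are PII directly or via any upstream field."""

    def is_pii(field: DatasetFieldUrn, seen: Set[DatasetFieldUrn]) -> bool:
        if field in annotated_pii:
            return True
        if field in seen:  # already explored, or part of a cycle
            return False
        seen.add(field)
        return any(is_pii(up, seen) for up in upstreams.get(field, []))

    candidates = {f for f in set(upstreams) | annotated_pii
                  if f.dataset_urn == dataset_urn}
    return {f for f in candidates if is_pii(f, set())}


# Made-up lineage: clean_users.contact_email derives from raw_users.email.
raw_email = DatasetFieldUrn("urn:li:dataset:raw_users", "email")
raw_id = DatasetFieldUrn("urn:li:dataset:raw_users", "id")
clean_email = DatasetFieldUrn("urn:li:dataset:clean_users", "contact_email")
clean_id = DatasetFieldUrn("urn:li:dataset:clean_users", "id")
upstreams = {clean_email: [raw_email], clean_id: [raw_id]}

result = pii_fields("urn:li:dataset:clean_users", upstreams, {raw_email})
```

Because the URN is just a value, annotations and lineage edges can reference fields without a separate field entity having to exist first.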

swift-account-97627

09/07/2020, 3:45 PM

Ah, yes, I'd temporarily forgotten someone was working on field-level lineage. That would also need to address this general problem.

swift-account-97627

09/07/2020, 3:45 PM

Thanks!

swift-account-97627

09/07/2020, 3:45 PM

Field URN sounds like a good idea to me. I'll have a proper read of the RFC.

bumpy-keyboard-50565

09/07/2020, 3:48 PM

Feel free to comment on the RFC as well.

swift-account-97627

09/07/2020, 3:49 PM

Yep, will do.

swift-account-97627

09/07/2020, 3:49 PM

to answer questions like "Give me all the PII-containing fields of this dataset based on annotation supplied to its upstream datasets (aka fine-grained metadata propagation)"

Incidentally, is there already a model for the single-dataset part of this (i.e. ignoring lineage, just directly annotating individual fields within a dataset as containing PII)? It looks like the PII flag is also dataset-level only at the moment.

bumpy-keyboard-50565

09/07/2020, 3:51 PM

This is also on the roadmap. Unfortunately, we're unable to open source the internal models as-is since they're highly customized for LinkedIn.

swift-account-97627

09/07/2020, 8:30 PM

Sure, no problem. Just wanted to check I wasn't missing something that was already present.
