# getting-started
s
I have a number of related use cases for what I'd describe as "Data Profile" and/or "Data Quality" attributes, most of which are per-field. For example: completeness (roughly, the non-null percentage), distinct values, or histograms of distinct values in enumeration-like fields. I've been doing some quick-and-dirty prototyping by adding these attributes to the SchemaField model, but that feels wrong.
It seems like "Data Profile" is really a separate aspect from "Schema", but they both contain per-field information, so I'm not sure how best to model this. I could add "DataProfile" to the set of dataset aspects, but then I'd have two aspects containing field-level information (`SchemaMetadata.SchemaFields` and something like `DataProfile.FieldProfiles`). If this is the correct model, what would be a good way to associate each FieldProfile with its SchemaField? Or is there a different model that would be better?
More generally, it seems like there's a tension between two models for field-level aspects:
1. Dataset has many Aspects, some of which have metadata for many Fields
2. Dataset has many Fields, some of which have multiple Aspects
I don't have an opinion on which of these models is more "correct", but the current implementation only really seems to support one aspect per field, and pushes any extensions to favour model (1) above. Is this a conscious design decision, or has this question just not come up yet?
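To make the association question concrete, here is a minimal Python sketch of model (1), assuming the two aspects are joined on a shared field path. The classes are simplified, hypothetical stand-ins and not DataHub's actual model definitions; `FieldProfile`, `profile_field`, and `join_profiles` are names invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SchemaField:
    """Simplified stand-in for an entry in SchemaMetadata.SchemaFields."""
    field_path: str
    native_type: str


@dataclass
class FieldProfile:
    """Hypothetical entry in DataProfile.FieldProfiles."""
    field_path: str          # assumed join key back to SchemaField.field_path
    completeness: float      # fraction of rows with a non-null value
    distinct_count: int      # number of distinct non-null values
    histogram: Dict[Any, int] = field(default_factory=dict)


def profile_field(field_path: str, rows: List[Dict[str, Any]]) -> FieldProfile:
    """Compute the per-field profile attributes from sample rows."""
    values = [row.get(field_path) for row in rows]
    non_null = [v for v in values if v is not None]
    completeness = len(non_null) / len(values) if values else 0.0
    counts = Counter(non_null)
    return FieldProfile(field_path, completeness, len(counts), dict(counts))


def join_profiles(
    schema_fields: List[SchemaField], profiles: List[FieldProfile]
) -> Dict[str, Optional[FieldProfile]]:
    """Associate each schema field with its profile (None if not profiled)."""
    by_path = {p.field_path: p for p in profiles}
    return {f.field_path: by_path.get(f.field_path) for f in schema_fields}


rows = [
    {"country": "GB", "age": 30},
    {"country": "GB", "age": None},
    {"country": "US", "age": 41},
    {"country": None, "age": 41},
]
schema = [SchemaField("country", "string"), SchemaField("age", "int")]
profiles = [profile_field("country", rows)]
joined = join_profiles(schema, profiles)
```

Joining on a field path keeps the two aspects independent, at the cost of a string key that both aspects must agree on. A field that was never profiled (here `age`) simply maps to `None`.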
b
This is an excellent question and shows that you have a deep understanding of DataHub's modeling framework. @ambitious-battery-33996 is currently working on an RFC that faces a similar issue. The proposed solution is to create a `DatasetFieldUrn` that can be used as a pointer without fully materializing field entities. Fields will be created as nodes in the graph to answer questions like "Give me all the PII-containing fields of this dataset, based on annotations supplied to its upstream datasets" (aka fine-grained metadata propagation).
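As a rough illustration of the idea (not the RFC's actual design), the Python sketch below treats a field URN as a lightweight key made of a dataset URN plus a field path, then walks upstream field-level edges to answer the PII question. The `urn:li:datasetField:(...)` string form, the edge map, and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DatasetFieldUrn:
    """Pointer to a field: dataset URN plus field path, no field entity needed."""
    dataset_urn: str
    field_path: str

    def __str__(self) -> str:
        # Hypothetical serialization; the real format is up to the RFC.
        return f"urn:li:datasetField:({self.dataset_urn},{self.field_path})"


def is_pii(
    field_urn: DatasetFieldUrn,
    upstreams: Dict[DatasetFieldUrn, List[DatasetFieldUrn]],
    annotated: Set[DatasetFieldUrn],
) -> bool:
    """True if the field, or any field upstream of it, is annotated as PII."""
    stack, seen = [field_urn], set()
    while stack:
        current = stack.pop()
        if current in annotated:
            return True
        if current in seen:
            continue  # guard against cycles in the lineage graph
        seen.add(current)
        stack.extend(upstreams.get(current, []))
    return False


def pii_fields(
    candidates: List[DatasetFieldUrn],
    upstreams: Dict[DatasetFieldUrn, List[DatasetFieldUrn]],
    annotated: Set[DatasetFieldUrn],
) -> List[DatasetFieldUrn]:
    """Filter a dataset's fields down to those that inherit a PII annotation."""
    return [f for f in candidates if is_pii(f, upstreams, annotated)]


RAW = "urn:li:dataset:(urn:li:dataPlatform:hive,raw.users,PROD)"
CLEAN = "urn:li:dataset:(urn:li:dataPlatform:hive,clean.users,PROD)"

raw_email = DatasetFieldUrn(RAW, "email")
raw_id = DatasetFieldUrn(RAW, "id")
clean_email = DatasetFieldUrn(CLEAN, "email")
clean_id = DatasetFieldUrn(CLEAN, "id")

# Field-level lineage: downstream field -> its upstream fields.
upstreams = {clean_email: [raw_email], clean_id: [raw_id]}
annotated = {raw_email}  # only the upstream field carries the PII annotation

result = pii_fields([clean_email, clean_id], upstreams, annotated)
```

Because the URN is just a hashable value, it can key any number of per-field aspects (profile, PII flag, lineage) without each one embedding its own copy of the field list.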
s
Ah, yes, I'd temporarily forgotten someone was working on field-level lineage. That would also need to address this general problem.
Thanks!
Field URN sounds like a good idea to me. I'll have a proper read of the RFC.
b
Feel free to comment on the RFC as well.
s
Yep, will do.
> to answer questions like "Give me all the PII-containing fields of this dataset based on annotation supplied to its upstream datasets (aka fine-grained metadata propagation)"
Incidentally, is there already a model for the single-dataset part of this (i.e. ignoring lineage, just directly annotating individual fields within a dataset as containing PII)? It looks like the PII flag is also dataset-level only at the moment.
b
This is also on the roadmap. Unfortunately, we're unable to open-source the internal models as-is, since they're highly customized for LinkedIn.
s
Sure, no problem. Just wanted to check I wasn't missing something that was already present.