# advice-metadata-modeling
r
Hey team, any suggestions on how to model a PII tag? For example, I want to annotate a field in a schema as a particular PII type, like `SSN`. Does the community suggest using `Tag` or `GlossaryTerm`?
g
The general recommendation is to use GlossaryTerms. For more explanation, see this post: https://blog.datahubproject.io/tags-and-terms-two-powerful-datahub-features-used-in-two-different-scenarios-b5b4791e892e
r
Thanks for the info!
Looks like I should use a term over a tag. But what extra benefit will a term give me over a tag?
g
Terms include the ability to describe related terms and to represent them in a hierarchy. PII is typically one term in a broader hierarchy/taxonomy of terms. Tags, on the other hand, are independent of each other; they have no relationship to other tags.
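To make the contrast concrete, here is a minimal sketch in plain Python. It does not use any DataHub API; the `tags` set, the `term_parent` mapping, and the helper functions are hypothetical names invented for illustration, and the `Classification` > `PII` layout is just one possible taxonomy.

```python
from __future__ import annotations

# Hypothetical illustration only; none of these names are DataHub APIs.

# Tags: a flat set of independent labels with no links between them.
tags = {"pii", "ssn", "email"}

# Terms: each entry records its parent group (None marks the root).
term_parent = {
    "Classification": None,
    "PII": "Classification",
    "SSN": "PII",
    "EMAIL": "PII",
}


def ancestors(name: str) -> list[str]:
    """Walk parent links upward and return the chain, nearest first."""
    chain = []
    parent = term_parent[name]
    while parent is not None:
        chain.append(parent)
        parent = term_parent[parent]
    return chain


def is_under(name: str, group: str) -> bool:
    """Return True if `name` sits anywhere below `group` in the hierarchy."""
    return group in ancestors(name)
```

With a hierarchy, a question like "is `SSN` a kind of `PII`?" becomes `is_under("SSN", "PII")`. The flat `tags` set carries no information that could answer it.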
r
Ah, thanks for the explanation!
One more question: will terms also be a filter type?
b
Dataset level terms are filterable in search
r
Looks like there are two related entities, GlossaryTerm and GlossaryNode. Are there any docs about the difference between them?
b
iirc, node is the "topic" and term is the actual tag. All the terms need a parent topic
r
Let's say I would like to model PII-related terms. I need one node, like `PII`, and the `PII` node will be the parentNode of terms like `IP_ADDRESS`, `SSN`, etc.?
g
There is not one way to do it. However, you might start with a parent of Classification with PII under it, like Classification.PII. Or you might choose a hierarchy like Classification.RESTRICTED.PII; if you want to be more specific, you can go further. It depends on your needs.
r
Oh, my main question is what should be a GlossaryNode and what should be a GlossaryTerm.
I'm not quite sure what I should put in a GlossaryNode and what data should be in a GlossaryTerm.
g
It’s not clear to me from the documentation: GlossaryNode seems like a container of elements, but GlossaryTerms are also containers of other terms.
r
@mammoth-bear-12532 could you help clarify the conceptual difference between GlossaryNode and GlossaryTerm? It would be great if you could provide some examples.
e
Nodes are how we organize the terms (terms do not contain other terms). Each term has a parent node associated with it to determine the groupings of terms. Nodes can also have a parent node, so we have hierarchical groupings of terms.
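The node/term split described above can be sketched with two small dataclasses. This is a plain-Python illustration, not DataHub's actual model classes: the class fields, the `path` helper, and the dotted path format are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class GlossaryNode:
    """A grouping; it may itself sit under another node."""
    name: str
    parent: Optional[GlossaryNode] = None


@dataclass(frozen=True)
class GlossaryTerm:
    """A leaf label; its parent is always a node, never another term."""
    name: str
    parent: GlossaryNode


def path(item: Union[GlossaryNode, GlossaryTerm]) -> str:
    """Join the names from the root node down to `item` with dots."""
    parts = [item.name]
    node = item.parent
    while node is not None:
        parts.append(node.name)
        node = node.parent
    return ".".join(reversed(parts))


# Hypothetical taxonomy: Classification > PII > SSN
classification = GlossaryNode("Classification")
pii = GlossaryNode("PII", parent=classification)
ssn = GlossaryTerm("SSN", parent=pii)
```

Here `path(ssn)` walks term, then node, then parent node, which mirrors the "hierarchical groupings of terms" wording: only nodes nest, and terms hang off them.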
r
Hey Dexter, thanks for chiming in and for the explanation. Sounds like my understanding is correct. In my situation, I would like to label fields to show what kind of PII they include. Therefore I created one node, `PII`, and created multiple terms, `EMAIL`, `SSN`, etc. The terms will be linked to the `PII` node.
e
Sounds good to me!
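As a rough sketch of the setup agreed on here (one `PII` node, with `EMAIL` and `SSN` terms under it, attached to individual fields), the snippet below maps field names to term identifiers. It is plain Python and does not call DataHub. The `urn:li:glossaryTerm:<node>.<term>` shape is an assumption about how the identifiers might look in a given deployment, and the field names are made up.

```python
from __future__ import annotations

# Hypothetical names throughout; this does not talk to DataHub.
PII_NODE = "PII"
PII_TERMS = ("EMAIL", "SSN")


def term_urn(node: str, term: str) -> str:
    """Build a term identifier; the URN shape here is an assumption."""
    return f"urn:li:glossaryTerm:{node}.{term}"


def annotate_fields(field_to_term: dict[str, str]) -> dict[str, str]:
    """Map each schema field to the identifier of its PII term."""
    annotations = {}
    for field_name, term in field_to_term.items():
        if term not in PII_TERMS:
            raise ValueError(f"unknown PII term: {term}")
        annotations[field_name] = term_urn(PII_NODE, term)
    return annotations


# Example: two made-up fields labeled with the terms from the chat.
labels = annotate_fields({"contact_email": "EMAIL", "tax_id": "SSN"})
```

Rejecting unknown terms up front keeps the label set closed, so a typo such as `SNN` fails loudly instead of producing an identifier that points at nothing.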
g
@early-lamp-41924 - this comment, "terms do not contain other terms", seems to contradict the documentation. Is this a documentation error, or does it perhaps need to be clarified?
g
Hey @gentle-night-56466 - there are a few ways we can organize terms. Nodes are the primary method for organizing terms. However, terms can also `contain` other terms.
The `contain` relationship is not about the taxonomy. Instead, it's a way to express, for example, that a Customer contains Address, FirstName, LastName, etc.
There is also a more structured way to express these relationships, by associating a schema with a glossary term via the `SchemaMetadata` aspect.
g
I do think the documentation is misleading; it's not `contains` but more of a `relates to`.
g
That’s a good point, we can update the wording there. `contains` is more of a `hasA` relationship.
m
There are two relationship types that terms can participate in with respect to each other: `isA` and `hasA`.
`contains` and `inherits` are shorthands in the YAML for generating those relationships,
and they also show up in the UI that way.
We felt that `isA` and `hasA` seemed too technical.
It's probably worth updating the docs to say what the implications of `contains` and `inherits` in the YAML are.
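A small sketch of that mapping, assuming a term definition has already been loaded from YAML into a plain dict: `inherits` expands to `isA` edges and `contains` expands to `hasA` edges. The function name, the dict layout, and the edge tuple format are hypothetical and chosen only for illustration; this is not DataHub's ingestion code.

```python
from __future__ import annotations

# Shorthand key in the glossary file -> relationship it stands for.
SHORTHAND_TO_RELATIONSHIP = {"inherits": "isA", "contains": "hasA"}


def expand_relationships(term: dict) -> list[tuple[str, str, str]]:
    """Turn shorthand lists into (source, relationship, target) edges."""
    edges = []
    for key, relationship in SHORTHAND_TO_RELATIONSHIP.items():
        for target in term.get(key, []):
            edges.append((term["name"], relationship, target))
    return edges


# Example from the thread: a Customer has an Address, FirstName, LastName.
customer = {
    "name": "Customer",
    "contains": ["Address", "FirstName", "LastName"],
}
customer_edges = expand_relationships(customer)
```

Reading `contains` as `hasA` makes it clear that the relationship describes composition between terms, while the node hierarchy remains the way terms are grouped.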