# ingestion
i
Hello, is anyone working on a way to persist manual field descriptions if the underlying databases do not have them in the table definitions?
m
Yup, @green-football-43791 added the "foundation" for that capability through `EditableSchemaFieldInfo`; there is some work left in connecting up the UX and GraphQL layers to it, obviously
i
How will this work? Will there be a merge of documents in GMS between UI-submitted changes (already persisted in the DataHub store) & MCEs from the crawlers?
Is there a precedence-like logic here to ensure correct merges?
m
we'll keep them separate for now
so "raw schema" attached metadata can keep getting ingested
and UI edits are stored separately
For now, not planning any "auto-resolution" capabilities in GMS ... UX will always show "raw schema description" and overlaid "user-edits" .. we'll think through the design elements there...
long term - want to take a more principled approach towards this kind of metadata versioning that does not require pulling out an `Editable` aspect per aspect ...
Let us know if you want to participate in the design of that!
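(A toy sketch of the precedence described above, not actual DataHub code: the crawled description is kept as a fallback and a user edit, when present, wins at display time; the real UI may also show both values side by side.)

```python
from typing import Optional

def resolve_field_description(
    raw_description: Optional[str],     # from the ingested SchemaMetadata, refreshed on every crawl
    edited_description: Optional[str],  # from the Editable* aspect, written via the UI
) -> Optional[str]:
    # User edits take precedence; the raw description is never overwritten in storage.
    return edited_description if edited_description is not None else raw_description

assert resolve_field_description("from Druid crawl", None) == "from Druid crawl"
assert resolve_field_description("from Druid crawl", "curated in UI") == "curated in UI"
```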
i
I can give you my context if it helps.
Druid does not store field descriptions of its datasources (which the ingestion crawler maps to datasets)
I.e. you can have a datasource which is an aggregation of two Kafka streams or the contents of a blob storage.
Druid forces a user to define the schema of that datasource, but only the field names & types. This is what appears in DataHub. If there happened to be descriptions for these fields in the datasources themselves (Kafka stream, blob storage, local files, etc...) those are not crawled because Druid doesn't store them.
I'm looking for a way to add field descriptions (which exist internally at my company) such that they are not overridden every time the crawl executes and they appear in the correct place in the DataHub UI.
I'm happy to take a look & discuss any designs you may want feedback on 🙂
m
got it... are you crawling the upstream datasets' field descriptions as well?
i
The current solution on the table on our side is to crawl Druid with the existing crawler, save the results in a JSON file, and then use the `jq` bash tool to merge those MCEs with a file of field descriptions that we have control over.
We are not crawling the upstream datasources because they do not exist in DataHub. For historical reasons they are schemaless Kafka streams which my team has no control over, meaning that they would appear as a dataset with a single field of type string holding the JSON payload of the record.
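(For concreteness, a rough Python equivalent of that jq merge step; the file names, the layout of the descriptions file, and the MCE JSON key paths below are assumptions and will need adjusting to whatever the crawler actually emits.)

```python
import json

# Assumed inputs: "druid_mces.json" is the crawler's file-sink output, and
# "field_descriptions.json" maps dataset URN -> fieldPath -> curated description.
with open("field_descriptions.json") as f:
    descriptions = json.load(f)
with open("druid_mces.json") as f:
    mces = json.load(f)

for mce in mces:
    snapshot = mce["proposedSnapshot"]["com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot"]
    urn = snapshot["urn"]
    for aspect in snapshot["aspects"]:
        schema = aspect.get("com.linkedin.pegasus2avro.schema.SchemaMetadata")
        if not schema:
            continue
        for field in schema.get("fields", []):
            # Patch in the curated description, if we have one for this field.
            curated = descriptions.get(urn, {}).get(field["fieldPath"])
            if curated:
                field["description"] = curated

with open("druid_mces_with_descriptions.json", "w") as f:
    json.dump(mces, f, indent=2)
```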
m
got it... I have a few ideas for how to make it less painful in the short term
1. From a metadata storage perspective, you can actually emit the "additional field descriptions" as `EditableSchemaFieldInfo` and DataHub will store them separately
so you won't need to maintain this additional file on the side anymore
We will just need to hook up the schema view to also pull in the descriptions from editable info ... since tags are already coming in that way, I'm hopeful that descriptions will be very little work
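(If it helps, a minimal sketch of emitting those descriptions as `EditableSchemaFieldInfo` with the DataHub Python REST emitter; the class and helper names come from the current acryl-datahub SDK and may not match the version discussed in this thread, and the URN, field path, and GMS address are made-up example values.)

```python
import time

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    EditableSchemaFieldInfoClass,
    EditableSchemaMetadataClass,
)

dataset_urn = make_dataset_urn(platform="druid", name="my_datasource", env="PROD")
now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion")

# The editable aspect lives alongside the raw SchemaMetadata, so re-running the
# Druid crawl does not wipe these descriptions out.
editable_schema = EditableSchemaMetadataClass(
    created=now,
    lastModified=now,
    editableSchemaFieldInfo=[
        EditableSchemaFieldInfoClass(
            fieldPath="user_id",
            description="Internal user identifier (curated, not available in Druid).",
        )
    ],
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=editable_schema))
```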
i
So we are just missing the frontend work for the PR you referenced in this thread, is that it?
m
right, that is my understanding ... but @green-football-43791 can confirm
i
I'll wait for his confirmation then, thank you @mammoth-bear-12532 🙏
g
For field level descriptions, yep - Shirshanka is correct. We have the model already, and we just need to enable reading and writing from the UI
and you could populate this from your crawler as well, although if you aren't careful you would overwrite changes made from the UI
For dataset descriptions, we don't yet have the model, but it would follow a similar pattern