# ingestion
i
Hello, is anyone working on a way to persist manual field descriptions if the underlying databases do not have them in the table definitions?
m
Yup, @green-football-43791 added the "foundation" for that capability through `EditableSchemaFieldInfo`; there is some work left in connecting up the UX and GraphQL layers to it, obviously
i
How will this work? Will there be a merge of documents in GMS between UI-submitted changes (already persisted in the DataHub store) & MCEs from the crawlers?
Is there a precedence-like logic here to ensure correct merges?
m
we'll keep them separate for now
so "raw schema" attached metadata can keep getting ingested
and UI edits are stored separately
For now, not planning any "auto-resolution" capabilities in GMS ... UX will always show "raw schema description" and overlaid "user-edits" .. we'll think through the design elements there...
long term - want to take a more principled approach towards this kind of metadata versioning that does not require pulling out an `Editable` aspect per aspect ...
Let us know if you want to participate in the design of that!
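(A toy sketch of the precedence described above, not actual DataHub code: the crawled description is kept as a fallback and a user edit, when present, wins at display time; the real UI may also show both values side by side.)

```python
from typing import Optional

def resolve_field_description(
    raw_description: Optional[str],     # from the ingested SchemaMetadata, refreshed on every crawl
    edited_description: Optional[str],  # from the Editable* aspect, written via the UI
) -> Optional[str]:
    # User edits take precedence; the raw description is never overwritten in storage.
    return edited_description if edited_description is not None else raw_description

assert resolve_field_description("from Druid crawl", None) == "from Druid crawl"
assert resolve_field_description("from Druid crawl", "curated in UI") == "curated in UI"
```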
i
I can give you my context if it helps.
Druid does not store field descriptions of its datasources (which the ingestion crawler maps to datasets)
I.e. you can have a datasource which is an aggregation of two Kafka streams or the contents of a blob storage.
Druid forces a user to define the schema of that datasource, but only the field names & types. This is what appears in DataHub. If there happened to be descriptions for these fields in the datasources themselves (Kafka stream, blob storage, local files, etc...) those are not crawled because Druid doesn't store them.
I'm looking for a way to add field descriptions (which exist internally at my company) such that they are not overridden every time the crawl executes and they appear in the correct place in the DataHub UI.
I'm happy to take a look & discuss any designs you may want feedback on 🙂
m
got it... are you crawling the upstream datasets' field descriptions as well?
i
The current solution on the table on our side is to crawl Druid with the existing crawler, save the results in a JSON file, and then use the `jq` bash tool to merge those MCEs with a file of field descriptions that we have control over.
We are not crawling the upstream datasources because they do not exist in DataHub. For historical reasons they are schemaless Kafka streams which my team has no control over, meaning that they would appear as a dataset with a single field of type string holding the JSON payload of the record.
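(For concreteness, a rough Python equivalent of that jq merge step; the file names, the layout of the descriptions file, and the MCE JSON key paths below are assumptions and will need adjusting to whatever the crawler actually emits.)

```python
import json

# Assumed inputs: "druid_mces.json" is the crawler's file-sink output, and
# "field_descriptions.json" maps dataset URN -> fieldPath -> curated description.
with open("field_descriptions.json") as f:
    descriptions = json.load(f)
with open("druid_mces.json") as f:
    mces = json.load(f)

for mce in mces:
    snapshot = mce["proposedSnapshot"]["com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot"]
    urn = snapshot["urn"]
    for aspect in snapshot["aspects"]:
        schema = aspect.get("com.linkedin.pegasus2avro.schema.SchemaMetadata")
        if not schema:
            continue
        for field in schema.get("fields", []):
            # Patch in the curated description, if we have one for this field.
            curated = descriptions.get(urn, {}).get(field["fieldPath"])
            if curated:
                field["description"] = curated

with open("druid_mces_with_descriptions.json", "w") as f:
    json.dump(mces, f, indent=2)
```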
m
got it... I have a few ideas for how to make it less painful in the short term
1. From a metadata storage perspective, you can actually emit the "additional field descriptions" as `EditableSchemaFieldInfo` and DataHub will store them separately
so you won't need to maintain this additional file on the side anymore
We will just need to hook up the schema view to also pull in the descriptions from editable info ... since tags are already coming in that way, I'm hopeful that descriptions will be very little work
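(If it helps, a minimal sketch of emitting those descriptions as `EditableSchemaFieldInfo` with the DataHub Python REST emitter; the class and helper names come from the current acryl-datahub SDK and may not match the version discussed in this thread, and the URN, field path, and GMS address are made-up example values.)

```python
import time

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    EditableSchemaFieldInfoClass,
    EditableSchemaMetadataClass,
)

dataset_urn = make_dataset_urn(platform="druid", name="my_datasource", env="PROD")
now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion")

# The editable aspect lives alongside the raw SchemaMetadata, so re-running the
# Druid crawl does not wipe these descriptions out.
editable_schema = EditableSchemaMetadataClass(
    created=now,
    lastModified=now,
    editableSchemaFieldInfo=[
        EditableSchemaFieldInfoClass(
            fieldPath="user_id",
            description="Internal user identifier (curated, not available in Druid).",
        )
    ],
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=editable_schema))
```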
i
So we are just missing the frontend work for the PR you referenced in this thread, is that it?
m
right, that is my understanding ... but @green-football-43791 can confirm
i
I'll wait for his confirmation then, thank you @mammoth-bear-12532 🙏
g
For field level descriptions, yep - Shirshanka is correct. We have the model already, and we just need to enable reading and writing from the UI
and you could populate this from your crawler as well, although if you aren't careful you would overwrite changes made from the UI
For dataset descriptions, we don't yet have the model, but it would follow a similar pattern