Redshift connector just as example sets `dataPlatform` as `r DataHub #ingestion

witty-butcher-82399

08/09/2021, 2:50 PM

The Redshift connector (just as an example) sets `dataPlatform` as `redshift`. This is noted here and here. Since I want to ingest tables from multiple Redshift clusters, I would like to differentiate them by having different values for `dataPlatform`. I have thought of changing this with a custom transform, but since `dataPlatform` is part of the URN, a custom transform wouldn’t work, so this would have to be managed from the connector itself; please correct me if I’m wrong. Actually, the model is what prevents this. The current approach seems to model `dataPlatform` as a sort of platform categorization. Are there any plans to model `dataPlatform` as platform instances instead?

square-activity-64562

08/09/2021, 3:26 PM

you mean they should show up under different paths in the UI? Or do you specifically want to change dataPlatform?

big-carpet-38439

08/09/2021, 5:15 PM

Yes - this is a fundamental limitation of our models today. We are indeed working on a design for modeling data platform instance in a more robust fashion!

big-carpet-38439

08/09/2021, 5:15 PM

The short of it will be:

big-carpet-38439

08/09/2021, 5:16 PM

Each data asset (dataset, chart, dashboard) will have as its key 2 things:
• data platform instance (type and unique identifier)
• coordinates within the data platform (differs for each platform type, will be standardized into a string format)
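A minimal sketch of what such a two-part key might look like, purely for illustration. The class and field names (`PlatformInstance`, `AssetKey`, `platform_type`, `instance_id`, `coordinates`) are assumptions and not part of any DataHub model; the example values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformInstance:
    """Hypothetical: a platform type plus a unique identifier for one deployment."""
    platform_type: str  # e.g. "redshift"
    instance_id: str    # e.g. "cluster-a" (made-up identifier)


@dataclass(frozen=True)
class AssetKey:
    """Hypothetical: the key of a data asset as described in the proposal."""
    instance: PlatformInstance
    coordinates: str    # platform-specific location as a string, e.g. "db.schema.table"


# Two tables with the same name in different clusters get different keys.
key_a = AssetKey(PlatformInstance("redshift", "cluster-a"), "dev.public.orders")
key_b = AssetKey(PlatformInstance("redshift", "cluster-b"), "dev.public.orders")
```

With the instance folded into the key, `key_a` and `key_b` compare as different even though their coordinates are identical.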

big-carpet-38439

08/09/2021, 5:19 PM

@witty-butcher-82399 Would love your input as we design this. Multiple folks in the community share the same use case. The models should have included this from the start, but unfortunately they did not.

bland-orange-95847

08/10/2021, 5:03 AM

@witty-butcher-82399 It’s an interesting point and we have had some similar thoughts. As a hack you could indeed use a custom transformer and manipulate the URN string to replace redshift with a string of your choice. For example, pass in a dict- or file-based mapping for it. Not the prettiest way, but it would work for now.
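A rough sketch of the string manipulation such a hack would need, assuming dataset URNs of the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. The function name `replace_platform` and the mapping contents are hypothetical, and wiring this into an actual custom transformer is left out.

```python
import re
from typing import Mapping

# Matches a dataset URN and captures platform, dataset name, and environment.
DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,)]+)\)$"
)


def replace_platform(urn: str, platform_map: Mapping[str, str]) -> str:
    """Swap the platform part of a dataset URN using a dict-based mapping.

    URNs that do not match the expected shape, or whose platform is not
    in the mapping, are returned unchanged.
    """
    match = DATASET_URN.match(urn)
    if match is None:
        return urn
    platform, name, env = match.groups()
    new_platform = platform_map.get(platform, platform)
    return f"urn:li:dataset:(urn:li:dataPlatform:{new_platform},{name},{env})"


# Hypothetical mapping: one ingestion run per cluster, each with its own mapping.
example = replace_platform(
    "urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.orders,PROD)",
    {"redshift": "redshift_cluster_a"},
)
```

Here `example` is `urn:li:dataset:(urn:li:dataPlatform:redshift_cluster_a,dev.public.orders,PROD)`.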

witty-butcher-82399

08/10/2021, 6:20 AM

Would love your input as we design this.

Well, our use case is quite simple actually. Our organization manages many AWS accounts, so there are many Redshift clusters, Glue catalogs, etc. For every data asset we want to keep track of the Redshift cluster or Glue catalog it belongs to. This should be noted in the URN itself (so we can differentiate two Redshift tables with the same name in different clusters) and ideally in the browse paths in the UI too. Also, it would be nice if it could become a search item too: find data assets belonging to a data platform instance.

@big-carpet-38439 From the description of your proposal, I think it perfectly fits our use case. Any estimate for this?

Additionally, I was thinking of using data platform instance for RBAC: a group owns a data platform instance and so is exclusively granted permission to manage metadata for entities belonging to that platform. From what I read about RBAC, this would be possible as soon as the data platform instance is part of the URN.

Thanks @bland-orange-95847 for your suggestion, it makes total sense! I naively assumed that the URN couldn’t be updated in the transform.

square-activity-64562

08/10/2021, 6:23 AM

@witty-butcher-82399 be aware that @mammoth-bear-12532 recently told me that platform is associated with a few things, so it cannot be just any random string. It might have some side effects.

mammoth-bear-12532

08/10/2021, 6:27 AM

Yup, platform == a well understood tool / technology, a platform instance == a deployment / instance of this tool / technology.

bland-orange-95847

08/10/2021, 6:28 AM

@square-activity-64562 interesting side note. I just played around with DataHub and a custom transformer and did not notice issues when replacing the data platform part with a custom string. But it totally makes sense that, for the product as a whole, it might have side effects I did not notice, because I do not have production-like usage yet.

mammoth-bear-12532

08/10/2021, 6:31 AM

@bland-orange-95847: if you really want to work around this issue until we roll out the new models with platform instances in the URN, it might be better to prefix the platform instance to the dataset name, e.g. postgres:DB.Schema.Table -> postgres:Instance.DB.Schema.Table
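A minimal sketch of that prefixing workaround, assuming dataset URNs of the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. The helper name `prefix_instance` is hypothetical; the platform part is left untouched so it stays a well-understood technology name.

```python
import re

# Matches a dataset URN and captures platform, dataset name, and environment.
DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,)]+)\)$"
)


def prefix_instance(urn: str, instance: str) -> str:
    """Prefix the platform instance to the dataset name, keeping the platform as-is.

    URNs that do not match the expected shape are returned unchanged.
    Note: calling this twice on the same URN will add the prefix twice.
    """
    match = DATASET_URN.match(urn)
    if match is None:
        return urn
    platform, name, env = match.groups()
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{instance}.{name},{env})"


prefixed = prefix_instance(
    "urn:li:dataset:(urn:li:dataPlatform:postgres,DB.Schema.Table,PROD)",
    "Instance",
)
```

Here `prefixed` is `urn:li:dataset:(urn:li:dataPlatform:postgres,Instance.DB.Schema.Table,PROD)`, so two tables with the same name in different instances no longer collide.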
