# ingestion
w
The Redshift connector (just as an example) sets `dataPlatform` to `redshift`. This is noted here and here. Since I want to ingest tables from multiple Redshift clusters, I would like to differentiate them by giving each a different value for `dataPlatform`. I have thought of changing this with a custom transform, but since `dataPlatform` is part of the URN, a custom transform wouldn’t work, so this would need to be handled in the connector itself; please correct me if I’m wrong. Actually, it is the model that prevents this: the current approach seems to model `dataPlatform` as a sort of platform categorization. Are there any plans to model `dataPlatform` as platform instances instead?
s
you mean they should show up under different paths in the UI? Or do you specifically want to change dataPlatform?
b
Yes - this is a fundamental limitation of our models today. We are indeed working on a design for modeling data platform instance in a more robust fashion!
The short of it will be:
Each data asset (dataset, chart, dashboard) will have two things as its key:
• data platform instance (type and unique identifier)
• coordinates within the data platform (differs for each platform type, will be standardized into a string format)
@witty-butcher-82399 Would love your input as we design this. Multiple folks in the community share the same use case. The models should have included this from the start, but unfortunately they did not.
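One possible reading of the two-part key described above, as a minimal Python sketch. The class and field names (`PlatformInstance`, `DataAssetKey`, `platform_type`, `instance_id`, `coordinates`) are hypothetical and are not part of any DataHub model; the thread only says the design is still in progress.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformInstance:
    """Hypothetical: a specific deployment of a platform."""
    platform_type: str  # the tool / technology, e.g. "redshift"
    instance_id: str    # unique identifier of the deployment, e.g. "cluster-a"


@dataclass(frozen=True)
class DataAssetKey:
    """Hypothetical: the two-part key of a data asset."""
    instance: PlatformInstance
    coordinates: str  # location within the platform, as a string


# Two tables with the same name in different clusters get distinct keys.
key_a = DataAssetKey(PlatformInstance("redshift", "cluster-a"), "dev.public.orders")
key_b = DataAssetKey(PlatformInstance("redshift", "cluster-b"), "dev.public.orders")
print(key_a == key_b)
```

Because the instance is part of the key, assets with identical coordinates no longer collide across deployments.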
b
@witty-butcher-82399 It’s an interesting point, and we have had some similar thoughts. As a hack you could indeed use a custom transformer and manipulate the URN string to replace `redshift` with a string of your choice, for example by passing in a dict- or file-based mapping. Not the prettiest way, but it would work for now.
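The string manipulation suggested above could look roughly like the following. This is only a sketch of the URN rewrite, not a DataHub transformer: the function name `replace_platform` and the mapping values are made up for illustration, and wiring it into a custom transformer class is left out. It assumes dataset URNs of the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`.

```python
import re
from typing import Dict

# Matches the platform portion of a dataset URN, e.g. the "redshift" in
# urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.orders,PROD)
_PLATFORM_RE = re.compile(r"(urn:li:dataPlatform:)([^,)]+)")


def replace_platform(urn: str, mapping: Dict[str, str]) -> str:
    """Return the URN with its platform name swapped according to mapping.

    Platforms that are not in the mapping are left unchanged.
    """
    def _swap(match):
        prefix, platform = match.group(1), match.group(2)
        return prefix + mapping.get(platform, platform)

    # Only the first occurrence is the dataset's own platform.
    return _PLATFORM_RE.sub(_swap, urn, count=1)


# Hypothetical mapping, one per ingestion run (i.e. one per cluster).
PLATFORM_MAPPING = {"redshift": "redshift_cluster_a"}

urn = "urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.orders,PROD)"
print(replace_platform(urn, PLATFORM_MAPPING))
```

Since each cluster would be ingested by its own run, the mapping can be supplied per run, so the same platform name maps to a different custom string for each cluster.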
w
Would love your input as we design this.
Well, our use case is actually quite simple. Our organization manages many AWS accounts, so there are many Redshift clusters, Glue catalogs, etc. For every data asset we want to keep track of the Redshift cluster or Glue catalog it belongs to. This should be recorded in the URN itself (so we can differentiate two Redshift tables with the same name in different clusters) and ideally in the browse paths in the UI too. It would also be nice if it could be searchable: find data assets belonging to a data platform instance.
@big-carpet-38439 From the description of your proposal, I think it fits our use case perfectly. Any estimate for this? Additionally, I was thinking of using the data platform instance for RBAC: a group owns a data platform instance and so is exclusively granted permission to manage metadata for entities belonging to that platform. From what I read about RBAC, this would be possible as soon as the data platform instance is part of the URN.
Thanks @bland-orange-95847 for your suggestion, it makes total sense! I naively assumed that the URN couldn’t be updated in the transform.
s
@witty-butcher-82399 be aware that @mammoth-bear-12532 recently told me that the platform is associated with a few things, so it cannot be any random string. Changing it might have some side effects.
m
Yup, platform == a well understood tool / technology, a platform instance == a deployment / instance of this tool / technology.
b
@square-activity-64562 interesting side note. I just played around with DataHub and a custom transformer and did not notice any issues when replacing the data platform part with a custom string. But it totally makes sense that, in the context of the whole product, it might have side effects I did not notice, because I do not have production-like usage yet.
m
@bland-orange-95847: if you really want to work around this issue for now, until we roll out the new models with platform instances in the URN, it might be better to prefix the platform instance to the dataset name, e.g. `postgres:DB.Schema.Table` -> `postgres:Instance.DB.Schema.Table`
👍 2
👌 1
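The prefixing workaround from the last message can be sketched as a plain string rewrite. The function name `prefix_instance` is hypothetical, and the sketch assumes dataset URNs of the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`; the platform itself is left untouched, only the dataset name changes.

```python
import re

# Splits a dataset URN into platform, dataset name and environment.
_DATASET_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:(?P<platform>[^,]+),"
    r"(?P<name>.+),(?P<env>[^,)]+)\)$"
)


def prefix_instance(urn: str, instance: str) -> str:
    """Return the URN with `instance.` prepended to the dataset name."""
    match = _DATASET_URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    return (
        "urn:li:dataset:(urn:li:dataPlatform:{platform},"
        "{instance}.{name},{env})"
    ).format(instance=instance, **match.groupdict())


urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,DB.Schema.Table,PROD)"
print(prefix_instance(urn, "Instance"))
```

Compared with replacing the platform string, this keeps the platform a well-understood value (`postgres`, `redshift`), which avoids the side effects mentioned earlier in the thread.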