# advice-metadata-modeling
g
Hi team, is it possible to update/modify a dataset URN? That would be in the context of renaming a dataset. If so, will this change be reflected in all the relationships this dataset may have with other entities like DataJobs and other Datasets, for example? Thanks 🙏 🤞
d
To rename the dataset, you can use the `name` attribute in the DatasetProperties (`datasetProperties`) aspect. This will not affect the urn of the dataset, but it will change how the dataset is displayed in the UI (so there’s no need to change lineage)
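The rename-via-aspect suggestion can be sketched with the acryl-datahub Python SDK roughly like this (the GMS URL and dataset urn below are placeholders, and the imports are kept inside the function so the sketch reads standalone):

```python
def rename_dataset(gms_url: str, dataset_urn: str, new_name: str) -> None:
    """Update the display name of a dataset by emitting a datasetProperties
    aspect. The urn itself is left untouched, so lineage is unaffected."""
    # Imports local to the function so the sketch can be read without
    # acryl-datahub installed.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server=gms_url)
    mcp = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(name=new_name),  # urn stays unchanged
    )
    emitter.emit(mcp)
```

Called against a live GMS, e.g. `rename_dataset("http://localhost:8080", "urn:li:dataset:(urn:li:dataPlatform:glue,mydb.mytable,PROD)", "My Renamed Table")`, only the UI display name changes.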
g
Thanks @delightful-ram-75848 🙏 I would actually like to change the URN of the dataset, since we use that to retrieve datasets programmatically. If we change the name but not the urn, it becomes harder to reconstruct the identifier for that dataset and retrieve it successfully.
m
Hey Antonio, migrating an urn like that is not something that we support natively as a bullet-proof feature. It would definitely be useful if we built it. Curious to hear what use case you have where you are facing this need to migrate the urn.
g
Thanks @mammoth-bear-12532 🙏 We’re working with datasets in a quite dynamic research environment, and sometimes we need to rename a dataset to reflect a new convention from researchers and avoid ambiguities. It’s not something that we expect to be doing all the time, but it’d be nice to be able to do it when we need to. Since we use the urn to programmatically retrieve metadata from DataHub, if we rename a dataset but not its urn, we kind of lose the ability to construct its urn and retrieve metadata from our codebase. I guess the main challenge is updating the urn in all other entities that might refer to that dataset; I'm not sure the underlying data model would allow that to be done easily.
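The urn-reconstruction concern above comes down to the fact that a DataHub dataset urn embeds the physical name. A minimal sketch of the standard pattern (the SDK helper `datahub.emitter.mce_builder.make_dataset_urn` does essentially the same; the platform and name below are illustrative):

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset urn of the form
    urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>).
    Because the name is baked in, renaming the dataset without
    migrating the urn breaks this reconstruction."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

print(make_dataset_urn("glue", "research_db.customers"))
# urn:li:dataset:(urn:li:dataPlatform:glue,research_db.customers,PROD)
```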
a
We faced the same challenge during experimentation with DataHub in our early days of ingesting custom data sources. The "cleanest" approach we came up with is deleting the existing datasets and ingesting them again with the new urns. We had a similar issue with the PowerBI ingestion last week that makes urns based on configuration (which we got wrong the first time). Deleting all datasets in a platform (via the datahub CLI) and re-ingesting solved it for us.
b
Hi team! I have a similar use case / question about our need to rename dataset URNs. Our problem is that we opened up the usage of DataHub in "preprod" mode, so our Glue database names are `pre_*`. Users have added documentation / links in the UI. We are getting ready to move to production, where our database names would be something like `prod_*`. Going into production, I don't want to lose all of the manual edits our users made in the past couple of weeks. A way I was thinking of doing the migration was:
• snapshot the preprod PostgreSQL DB
• load the snapshot in prod
• find and replace in PostgreSQL to change all occurrences of pre_ -> prod_
• run the Elasticsearch re-index command
Would that get me a "clean" version of DataHub in my new environment? Is there another way to go about this?
r
Good question. Doing such a tailored migration is possible but a bit tricky. You'll need to effectively extract, transform, and re-write all the aspects from pre-prod assets to your new prod instance. You can use an API like scrollAcrossEntities to first get the URNs in bulk pages, but you will need some Python-side logic to map the pre-prod urns into their prod formats, and then write back using the MCP "emit" API in Python to store them into your new instance. Not for the faint-hearted, but certainly doable
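The extract/transform/re-emit flow above can be sketched like this. The exact GraphQL shape of scrollAcrossEntities may vary by DataHub version, and the pre_/prod_ mapping rule is this thread's specific case; the HTTP call and the MCP emit step (MetadataChangeProposalWrapper + DatahubRestEmitter in the Python SDK) are left as comments:

```python
# GraphQL query to page through all dataset urns in bulk (shape assumed;
# check your DataHub version's GraphQL schema).
SCROLL_QUERY = """
query scroll($scrollId: String) {
  scrollAcrossEntities(
    input: {types: [DATASET], query: "*", count: 1000, scrollId: $scrollId}
  ) {
    nextScrollId
    searchResults { entity { urn } }
  }
}
"""

def preprod_to_prod_urn(urn: str) -> str:
    """Map a pre-prod dataset urn to its prod form. Anchoring the match on
    ",pre_" limits the rewrite to the dataset-name field of the urn rather
    than any arbitrary occurrence of "pre_"."""
    return urn.replace(",pre_", ",prod_")

# For each page of urns returned by SCROLL_QUERY you would then:
#   1. read each aspect from the pre-prod instance,
#   2. re-target it with preprod_to_prod_urn(urn),
#   3. emit it to the prod instance via the MCP "emit" API.

print(preprod_to_prod_urn(
    "urn:li:dataset:(urn:li:dataPlatform:glue,pre_sales.orders,PROD)"
))
# urn:li:dataset:(urn:li:dataPlatform:glue,prod_sales.orders,PROD)
```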
plus1 1
b
so, what I really had in mind was to go do a big old find and replace in the postgresql table itself 🙂
my understanding of the PostgreSQL table that powers the application is that it's really one big key-value store, and so I could do something like: look at all of the rows, look at the metadata column, and whenever you see "pre_*", replace it with "prod_*"
and then reindex the elasticsearch engine
r
You absolutely could achieve it in this way. This would be the backdoor way to do it. Just a script that lists all rows from that table in batches and replaces the URNs in both the "metadata" and the "urn" columns of the table, followed by a reindex!
b
Well, wonderful! I'll let you know once we get to this, and if it worked "smoothly"!
r
Benefit of having everything in a single table :p
😄 1