Hi all! I’m sure that people here have already enc...
# advice-metadata-modeling
b
Hi all! I’m sure that people here have already encountered a similar problem before. I’m curious if there are any recommendations for naming datasets in Datahub. Specifically, do people have “rules of thumb” for names, such that:
• names can be auto-generated
• names are human-readable
• there’s a 1:1 correspondence between files and Datasets in Datahub
In addition to the above, I’m curious how people deal with moving dataset files to another location and maintaining a link between Dataset on Datahub and the actual files. One way I can think of is preventing any modifications to a dataset that has been marked as “production”. Are there other approaches that you have found successful in practice?
g
I'm not sure I fully understand the problem, but I'm trying to, since we may have these problems too and just haven't hit them yet in our deployment! Our dataset names are derived from the table name in the SOR. Generally I've found that modifying names in any way leads to a lot more confusion: people familiar with the SOR tables say "use table X", data consumers search for X, and if the name isn't exactly the same there's back and forth on whether both are talking about the same thing. By human readable, do you mean remove underscore in favor of spaces, or doing other transforms?
moving dataset files to another location and maintaining a link between Dataset on Datahub and the actual files
Why would this cause issues?
b
By human readable, do you mean remove underscore in favor of spaces, or doing other transforms?
Ah sorry, I should’ve been more specific. In my case, some of the datasets are CSV/parquet S3 files. One idea that I’ve had for naming them, was to translate the S3 object paths like this, e.g.:
(s3 path) s3://foo/bar/baz.csv -> (dataset name) foo.bar.baz_csv
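A rough sketch of that translation in Python, in case it helps make the idea concrete. The function name `s3_path_to_dataset_name` is made up for illustration and is not part of Datahub. It assumes the bucket and each path segment become dot-separated parts, and that any dot inside a segment (such as the file extension) is replaced with an underscore:

```python
from urllib.parse import urlparse


def s3_path_to_dataset_name(s3_uri: str) -> str:
    """Translate an S3 URI into a dotted dataset name (hypothetical helper)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    # Bucket name first, then every non-empty segment of the object key.
    parts = [parsed.netloc] + [p for p in parsed.path.split("/") if p]
    # Dots inside a segment would clash with the separator, so swap them out.
    return ".".join(p.replace(".", "_") for p in parts)


print(s3_path_to_dataset_name("s3://foo/bar/baz.csv"))
```

One caveat with this scheme: it is not strictly reversible, since a key segment that already contains an underscore (e.g. `baz_csv` with no extension) would map to the same name as `baz.csv`, which matters for the 1:1 requirement.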