Encountered an issue where I have the same data that is acce DataHub #advice-metadata-modeling

better-orange-49102

11/08/2022, 3:17 AM

I've run into an issue where the same data is accessible through multiple means. The JSON file is stored on a NAS that can emulate HDFS for people accessing it via Spark; people can also access it through a CIFS mount of the folder, and it is accessible via Hive using Hue. I'm not sure which platform type is correct for modeling this data. Is there a way to show multiple platform types for a dataset? Any suggestions would be welcome.

modern-artist-55754

11/10/2022, 8:37 AM

I suppose you can clone and modify the S3 data lake connector to get it to work with a CIFS mount. In the end, DataHub will just use Spark to read the data, so whether it's S3 or a mount point, it should work.

better-orange-49102

11/10/2022, 10:58 AM

No, the question is which platform type this dataset should use...

modern-artist-55754

11/10/2022, 11:06 AM

Ideally there would be an equivalent of the “s3 data lake” platform for this, but we don’t have one. In its absence, maybe Hive is easier to reason about.

astonishing-answer-96712

11/10/2022, 5:09 PM

@mammoth-bear-12532 any thoughts here?

mammoth-bear-12532

11/11/2022, 2:42 PM

One way is to have multiple dataset entities with different URNs in this case (e.g. s3 - /bucket/foo/bar.json and hive - nas.foo) and connect them together using a siblings aspect. You can mark one of the dataset URNs as the primary in the sibling relationship (the UI will default to showing that one as the main dataset while combining in metadata from the other).
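
To make the sibling suggestion concrete, here is a rough sketch of what the two dataset URNs and their sibling payloads might look like. It uses only the Python standard library and plain dicts. The helper names `dataset_urn` and `sibling_payloads`, the `PROD` environment, and the choice of the Hive dataset as primary are assumptions for illustration. The dataset names come from the example above. Actually sending these aspects to DataHub would go through the DataHub Python SDK or REST API, which is not shown here.

```python
from typing import Dict


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string in DataHub's usual URN layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def sibling_payloads(primary_urn: str, secondary_urn: str) -> Dict[str, dict]:
    """Map each dataset URN to the sibling payload that would be attached to it.

    Hypothetical helper: each dataset points at the other one, and only
    the primary dataset carries primary=True.
    """
    return {
        primary_urn: {"primary": True, "siblings": [secondary_urn]},
        secondary_urn: {"primary": False, "siblings": [primary_urn]},
    }


# Names taken from the example in the thread; the platforms are illustrative.
hive_urn = dataset_urn("hive", "nas.foo")
file_urn = dataset_urn("s3", "bucket/foo/bar.json")

# Assumption: treat the Hive view of the data as the primary dataset.
payloads = sibling_payloads(primary_urn=hive_urn, secondary_urn=file_urn)

for urn, aspect in payloads.items():
    print(urn, aspect)
```

Which dataset to mark as primary is a judgment call; it is the one the UI would show by default, so it probably makes sense to pick the access path most users recognize.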
