DataHub

Hi Team,
We are trying to create lineage where the data sources could be files. From the example provided for DataHub, it looks like only a single copy of a file can exist on the file platform. In reality, the same file could exist across different servers and environments. How can this be handled in DataHub?

Hi! This might be better suited to #advice-metadata-modeling, but it sounds like you've added some models to support file as a type? Or are you modeling it as a Dataset?

From a high-level perspective, what you're really talking about are different files. Unless you can guarantee that these files are not modified in the different locations (i.e., this is really just a centralized file that gets pulled from git or another repo, in which case that location should be considered the actual source), each one is truly a different source: it could contain different information from the true source, and treating them as the same file is not recommended. If you need to record that the source file has different instances, I recommend taking a look at the platform instance docs (https://datahubproject.io/docs/platform-instances/) for inspiration.

Hey <@UV5UEC3LN>, I guess we assumed DataHub would have an out-of-the-box model for files. Apparently that assumption is incorrect?

It depends on whether you're treating files as a separate concept from datasets or other entities. "File" is an inherently broad term, and I don't think it's particularly useful in a metadata model without further context about what the file actually is. Is it a CSV file containing schema rows? Is it an ad hoc text document? Etc.

The value in DataHub is in logically categorizing data. Pretty much anything can be a file, and a category that doesn't narrow things down isn't a particularly useful category.

I agree. So if we were to bring in CSV, XML, or JSON datasets for lineage and discoverability, what would your recommendation be?

Are they stored in a central location like S3 or GitHub? We have an S3 source that pulls in schema files, and I would recommend using that, or something similar that pulls from a centralized file store.

If you need information about particular instances of that schema file in use, I'd recommend leveraging the platform instance concept linked above.
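
To make the platform instance suggestion concrete, here is a minimal sketch of how the same file path on two servers could end up with two distinct dataset URNs. It assumes the URN layout described in the platform instance docs, where the instance name is prefixed onto the dataset name. The helper `file_dataset_urn`, the `file` platform name, the server names, and the path are all hypothetical and only for illustration; the DataHub Python SDK has its own URN builders, which this plain-Python version only imitates.

```python
from typing import Optional


def file_dataset_urn(
    path: str,
    platform_instance: Optional[str] = None,
    platform: str = "file",  # hypothetical platform name, for illustration only
    env: str = "PROD",
) -> str:
    """Build a dataset URN string, prefixing the platform instance when given.

    Assumption: the instance is joined to the dataset name with a dot, as
    described in the platform instance docs.
    """
    name = f"{platform_instance}.{path}" if platform_instance else path
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# The same path on two (hypothetical) servers becomes two separate datasets.
servers = ["server_a", "server_b"]
urns = [file_dataset_urn("/data/exports/orders.csv", s) for s in servers]

for urn in urns:
    print(urn)
```

Each copy of the file then has its own identity, so lineage can point at the specific copy a job actually read, which matches the advice above to treat each location as a different source.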