Hello there!
We’ve been trying out DataHub for about a week, ingesting metadata from 3 different sources: Hive, Spark, and Metabase, and we came across one issue:
- we scanned the datasets from Hive (the data is stored in S3)
- we have our pipelines interacting with those datasets in Spark (Spark is connected to the same Hive)
- we explore the datasets from Metabase (Metabase connects to Trino, which connects to the same Hive)
The thing is that DataHub doesn’t realize that the datasets are all the same, so each dataset appears 3 times, once per ingestion source.
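For illustration, what we suspect is happening is that each connector emits its own dataset URN with a different platform, so the same underlying table ends up as three separate entities (the database/table name below is hypothetical, and the exact platform each connector uses is our assumption):

```
urn:li:dataset:(urn:li:dataPlatform:hive,mydb.mytable,PROD)    <- from the Hive ingestion
urn:li:dataset:(urn:li:dataPlatform:spark,mydb.mytable,PROD)   <- from the Spark lineage emitter
urn:li:dataset:(urn:li:dataPlatform:trino,mydb.mytable,PROD)   <- from the Metabase/Trino ingestion
```

Since DataHub treats datasets with different URNs as different entities, this would explain the triplication we see.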
Is there a way to fix this?