# ingestion
w
Hi! With some scheduling, connectors keep the metadata for a given source system up-to-date in DataHub. However, what if a dataset is removed in the source system? How are you managing this scenario, to set `Status.removed=true` in particular and to prevent stale metadata in general?
b
This has been the best way we've found. Currently we do not have any sanctioned sources to do this, but I'm starting to think we should. For example, a "snowflake gc source" that basically compares the tables in Snowflake to those found in DataHub and issues the corresponding deletes.
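A minimal sketch of what such a gc source could look like, assuming the DataHub Python emitter SDK; `snowflake_table_names` and `datahub_dataset_names` are hypothetical helpers standing in for the Snowflake information_schema query and the DataHub-side listing, and older SDK versions may also need `entityType`, `changeType`, and `aspectName` set explicitly on the wrapper:

```python
# Hedged sketch of a "snowflake gc source": compare the tables that exist in
# Snowflake against the datasets DataHub already knows about, and soft-delete
# (Status.removed=True) anything no longer present in the source.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass


def snowflake_table_names() -> set:
    """Hypothetical helper: fully-qualified tables (db.schema.table) read from
    Snowflake's information_schema via the Snowflake connector."""
    raise NotImplementedError


def datahub_dataset_names() -> set:
    """Hypothetical helper: dataset names DataHub currently holds for the
    snowflake platform (see the GMS scan sketch further down)."""
    raise NotImplementedError


def run_gc(gms_server: str = "http://localhost:8080") -> None:
    emitter = DatahubRestEmitter(gms_server=gms_server)
    stale = datahub_dataset_names() - snowflake_table_names()
    for name in stale:
        urn = make_dataset_urn(platform="snowflake", name=name, env="PROD")
        # Soft delete: flip the status aspect instead of hard-deleting, so the
        # entity drops out of search/browse but its history is retained.
        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityUrn=urn,
                aspect=StatusClass(removed=True),
            )
        )
```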
w
With the initial global garbage collector proposal, how do you suggest performing such a scan over the entities? Using the GMS API? What about directly scanning Elasticsearch, or even Spark+ES; has anyone tried this? Regarding the per-source garbage collector approach, it could be tricky when deploying multiple instances of a connector (e.g. for different accounts or namespaces): the connector would somehow need to know which parent source connector an entity came from. Is that information stored in the entity?
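For the GMS-API option, one possible shape of the scan (a sketch, assuming the `DataHubGraph` client and its `get_urns_by_filter` helper available in newer acryl-datahub releases; the server URL is a placeholder):

```python
# Hedged sketch of a GMS-API scan: list all dataset URNs DataHub holds for a
# given platform, which the gc comparison sketched above could consume.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

snowflake_urns = list(
    graph.get_urns_by_filter(
        entity_types=["dataset"],
        platform="snowflake",
    )
)
# Each urn looks like urn:li:dataset:(urn:li:dataPlatform:snowflake,<name>,PROD);
# the dataset name can be parsed out to compare against Snowflake's tables.
```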
b
Right... It'd need some way to say "give me all entities stored in DataHub that are associated with this source", which would indeed require a mapping from entity to the source instance
Which right now is not persisted, but likely should be
in fact we've been talking a lot about the need to more clearly model "data source instances" instead of just data platforms
and then link those to entities themselves, which would permit exactly this type of thing
w
Yes, that sounds great! 👍 The way I see these relationships is:
• 1 dataset belongs to 1 data platform and was ingested by 1 data source instance
• however, the relationship between connector and data platform is not 1:1; there can be multiple data source instances (connectors) ingesting data from 1 data platform
Thanks for the comments and response!
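For what it's worth, a sketch of how that "data source instance" link could be stamped onto a dataset, assuming the dataPlatformInstance aspect found in more recent DataHub versions; the instance name "snowflake-prod-account" is made up for illustration:

```python
# Hedged sketch: attach a platform-instance aspect to a dataset so that all
# entities ingested by one connector instance can later be found together.
from datahub.emitter.mce_builder import (
    make_data_platform_urn,
    make_dataplatform_instance_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataPlatformInstanceClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = make_dataset_urn(platform="snowflake", name="db.schema.table", env="PROD")

emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DataPlatformInstanceClass(
            # 1 dataset -> 1 data platform ...
            platform=make_data_platform_urn("snowflake"),
            # ... and 1 data source instance; many instances can share a platform.
            instance=make_dataplatform_instance_urn("snowflake", "snowflake-prod-account"),
        ),
    )
)
```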