# ingestion
w
Hi! With some scheduling, connectors keep the metadata for a given source system up-to-date in DataHub. However, what if a dataset is removed in the source system? How are you managing this scenario, to set `Status.removed=true` in particular and to prevent stale metadata in general?
b
This has been the best way we've found. Currently we do not have any sanctioned sources to do this, but I'm starting to think we should. For example, a "snowflake gc source" that basically compares the tables in Snowflake to those found in DataHub and issues the corresponding deletes.
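A minimal sketch of what such a gc source could look like, assuming the DataHub Python emitter SDK; `snowflake_table_names` and `datahub_dataset_names` are hypothetical helpers standing in for the Snowflake information_schema query and the DataHub-side listing, and older SDK versions may also need `entityType`, `changeType`, and `aspectName` set explicitly on the wrapper:

```python
# Hedged sketch of a "snowflake gc source": compare the tables that exist in
# Snowflake against the datasets DataHub already knows about, and soft-delete
# (Status.removed=True) anything no longer present in the source.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass


def snowflake_table_names() -> set:
    """Hypothetical helper: fully-qualified tables (db.schema.table) read from
    Snowflake's information_schema via the Snowflake connector."""
    raise NotImplementedError


def datahub_dataset_names() -> set:
    """Hypothetical helper: dataset names DataHub currently holds for the
    snowflake platform (see the GMS scan sketch further down)."""
    raise NotImplementedError


def run_gc(gms_server: str = "http://localhost:8080") -> None:
    emitter = DatahubRestEmitter(gms_server=gms_server)
    stale = datahub_dataset_names() - snowflake_table_names()
    for name in stale:
        urn = make_dataset_urn(platform="snowflake", name=name, env="PROD")
        # Soft delete: flip the status aspect instead of hard-deleting, so the
        # entity drops out of search/browse but its history is retained.
        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityUrn=urn,
                aspect=StatusClass(removed=True),
            )
        )
```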
w
With the initial global garbage collector proposal, how do you suggest performing such a scan over the entities? Using the GMS API? What about directly scanning Elasticsearch, or even Spark+ES; has anyone tried this? Regarding the per-source garbage collector approach, it could be tricky when deploying multiple instances of a connector (e.g. for different accounts or namespaces): the connector would somehow need to know which parent source connector an entity came from. Is that information stored in the entity?
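For the GMS-API option, one possible shape of the scan (a sketch, assuming the `DataHubGraph` client and its `get_urns_by_filter` helper available in newer acryl-datahub releases; the server URL is a placeholder):

```python
# Hedged sketch of a GMS-API scan: list all dataset URNs DataHub holds for a
# given platform, which the gc comparison sketched above could consume.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

snowflake_urns = list(
    graph.get_urns_by_filter(
        entity_types=["dataset"],
        platform="snowflake",
    )
)
# Each urn looks like urn:li:dataset:(urn:li:dataPlatform:snowflake,<name>,PROD);
# the dataset name can be parsed out to compare against Snowflake's tables.
```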
b
Right... It'd need some way to say "give me all entities stored in DataHub that are associated with this source", which would indeed require a mapping from entity to the source instance
Which right now is not persisted, but likely should be
in fact we've been talking a lot about the need to more clearly model "data source instances" instead of just data platforms
and then link those to entities themselves, which would permit exactly this type of thing
w
Yes, that sounds great! 👍 The way I see these relationships is:
• 1 dataset belongs to 1 data platform and was ingested by 1 data source instance
• however, the relationship between connector and data platform is not 1:1; there can be multiple data source instances (connectors) ingesting data from 1 data platform
Thanks for the comments and response!
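For what it's worth, a sketch of how that "data source instance" link could be stamped onto a dataset, assuming the dataPlatformInstance aspect found in more recent DataHub versions; the instance name "snowflake-prod-account" is made up for illustration:

```python
# Hedged sketch: attach a platform-instance aspect to a dataset so that all
# entities ingested by one connector instance can later be found together.
from datahub.emitter.mce_builder import (
    make_data_platform_urn,
    make_dataplatform_instance_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataPlatformInstanceClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = make_dataset_urn(platform="snowflake", name="db.schema.table", env="PROD")

emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DataPlatformInstanceClass(
            # 1 dataset -> 1 data platform ...
            platform=make_data_platform_urn("snowflake"),
            # ... and 1 data source instance; many instances can share a platform.
            instance=make_dataplatform_instance_urn("snowflake", "snowflake-prod-account"),
        ),
    )
)
```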