#ingestion

bland-balloon-48379

09/26/2022, 7:14 PM
Hey team, I wanted to start a discussion about how datasets get represented over time. In particular I'm thinking of a situation where a dataset is created, ingested, edited in DataHub, and deleted at some point. Then some time in the future, a new dataset with the same name and in the same database/schema is created. The same urn would be generated in both these cases, so internally they would be treated as the same dataset. And if the original dataset was soft-deleted, then the subsequent one would inherit its custom documentation even if it doesn't apply to the new dataset. Should these be considered two separate datasets? Would the second one be considered a new version of the original? If the original dataset had its status set to removed when it was deleted from the database, would there be some indication that it had been removed and then reingested? I think there is a lot to consider here. I'm interested to hear what DataHub's and the community's perspective is on this scenario, both from a stateful and stateless ingestion standpoint, and if there is a particular direction DataHub has in mind. Thanks!
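
To make the urn collision concrete, here is a minimal sketch. It assumes the standard dataset urn string layout (platform, name, environment); `build_dataset_urn` is a local stand-in written for this example rather than a DataHub API, and the platform and table name are made up.

```python
def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Illustrative helper: format a dataset urn from platform, name, and env."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# The original table, later dropped from the database.
original = build_dataset_urn("snowflake", "analytics.public.orders")

# An unrelated table created later with the same database/schema/name.
recreated = build_dataset_urn("snowflake", "analytics.public.orders")

# The urn depends only on platform, name, and env, so the two are indistinguishable.
same_urn = original == recreated
print(original)
print(same_urn)
```

Nothing about creation time or table contents feeds into the urn, which is why both tables end up as one entity.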

gray-shoe-75895

09/27/2022, 1:02 AM
The current behavior would be that the soft-deleted flag would get flipped back to false, and hence the new table would inherit data from the previous one, and there wouldn’t be any indication that it had been removed and reingested.
imo the best thing to do in this case would be to hard-delete the old dataset before ingesting the new one - is that a workable solution?
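
The behavior described above can be illustrated with a toy in-memory model. This is only a sketch of the semantics discussed in this thread, not DataHub code: `ToyCatalog`, `ToyEntity`, and their methods are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEntity:
    removed: bool = False          # stands in for the soft-delete status flag
    documentation: str = ""        # stands in for documentation edited in the UI
    schema_fields: list = field(default_factory=list)


class ToyCatalog:
    """Hypothetical store keyed by urn; mimics upsert-on-ingest semantics."""

    def __init__(self) -> None:
        self.entities = {}

    def ingest(self, urn: str, schema_fields: list) -> ToyEntity:
        # Ingestion upserts: an existing entity is reused and un-deleted.
        entity = self.entities.setdefault(urn, ToyEntity())
        entity.schema_fields = list(schema_fields)
        entity.removed = False
        return entity

    def soft_delete(self, urn: str) -> None:
        self.entities[urn].removed = True   # hidden, but nothing is erased

    def hard_delete(self, urn: str) -> None:
        self.entities.pop(urn, None)        # all stored data for the urn is gone


catalog = ToyCatalog()
urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)"

catalog.ingest(urn, ["order_id", "amount"])
catalog.entities[urn].documentation = "Orders from the legacy billing system."
catalog.soft_delete(urn)

# Re-ingesting a new table under the same urn flips the flag and keeps the old docs.
inherited = catalog.ingest(urn, ["id", "customer", "total"])
after_reingest = (inherited.removed, inherited.documentation)

# Hard-deleting first gives the new table a clean slate.
catalog.soft_delete(urn)
catalog.hard_delete(urn)
fresh = catalog.ingest(urn, ["id", "customer", "total"])
after_hard_delete = (fresh.removed, fresh.documentation)

print(after_reingest)
print(after_hard_delete)
```

In the first case the new table silently picks up the old documentation; in the second it starts empty, at the cost of losing the old entity's history.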

bland-balloon-48379

09/27/2022, 3:46 PM
Thanks for the reply, Harshal. This is a feasible solution, but not a preferred or practical one imo. There are a couple of issues I see:
1. Scalability: A hard delete is easy when you have a specific dataset in mind, but what if you have a process that regularly ingests databases with tens of thousands of datasets each?
2. Historical data: By hard-deleting you lose the historical data of the original dataset. It may or may not be useful to keep that historical data on hand.
On that second point, I'm interested to know if there's a particular philosophy regarding historical data for stale datasets in DataHub. Maybe preservation of that data isn't in the scope of DataHub's current trajectory, and if so it'd be good to know. I'm partly looking for a solution with this thread, but also just asking whether there have been any thoughts or considerations around this problem space.

gray-shoe-75895

09/28/2022, 12:36 AM
In general, we do try to preserve version history - we actually retain a history of every edit made to every entity. When a dataset is soft-deleted, no data is actually deleted - we just stop surfacing that dataset in search/browse, but you can still access it via a direct URL as you would any other dataset.
Now, if a dataset that was previously marked as soft-deleted is then ingested again, we treat that as if the previous dataset was un-deleted. This behavior is usually the right default - people regularly misconfigure their permissions (and hence an asset “disappears”) or accidentally delete something only to restore it soon after. The unfortunate side effect of picking this behavior by default is that it makes your use case more complex - we don’t have a great way of distinguishing something that is genuinely new from the reappearance of an old thing.
A decent compromise might be to back up and hard-delete entities that have been marked as soft-deleted for more than N weeks/months. Actually implementing such a rule won’t cause performance issues - we should be able to handle that scale just fine.
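
A rough sketch of the "soft-deleted for more than N weeks" rule follows. It assumes you can obtain, by some means outside this example, a mapping from urn to the time the entity was soft-deleted; that mapping, the sample urns, and `select_for_hard_delete` are all hypothetical. The backup and the hard delete themselves are left out.

```python
from datetime import datetime, timedelta, timezone


def select_for_hard_delete(soft_deleted_at: dict, now: datetime, max_age: timedelta) -> list:
    """Return urns whose soft-delete timestamp is at least max_age old."""
    cutoff = now - max_age
    return sorted(
        urn for urn, deleted_at in soft_deleted_at.items() if deleted_at <= cutoff
    )


# Hypothetical input: urn -> when the entity was marked as soft-deleted.
soft_deleted_at = {
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.old_orders,PROD)":
        datetime(2022, 3, 1, tzinfo=timezone.utc),
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.tmp_export,PROD)":
        datetime(2022, 9, 20, tzinfo=timezone.utc),
}

now = datetime(2022, 9, 28, tzinfo=timezone.utc)
stale = select_for_hard_delete(soft_deleted_at, now, max_age=timedelta(weeks=12))

for urn in stale:
    # Back up the entity, then hard-delete it, using whatever tooling you prefer.
    print("candidate for backup + hard delete:", urn)
```

Running this on its own schedule keeps the decision out of the ingestion path: ingestion only soft-deletes, and the cleanup job decides later what is old enough to remove for good.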

bland-balloon-48379

09/28/2022, 12:55 PM
I appreciate the input. I think the last point is a solid one and something we were considering. And you're right, I think implementing that as a separate blanket rule with its own timetable would be much more manageable than determining what stays and what goes at ingestion time.
I do have one additional question related to soft-delete. I understand that it's not a real delete and the entity just gets hidden from the UI, but I want to confirm that there aren't any processes or timeframes by which soft-deleted entities get automatically hard-deleted (i.e., if I soft-delete something, will it always be retained until I hard-delete it?). This is the behavior I would expect, but I just want to confirm. Thanks for engaging!

gray-shoe-75895

09/28/2022, 7:03 PM
Yep that’s correct!
We do support retention policies on aspect versions (e.g. if a description is overwritten, how long do we keep the old one around), as documented here: https://datahubproject.io/docs/advanced/db-retention/#what-type-of-retention-policies-are-supported, but this is disabled by default. We don’t automatically delete soft-deleted entities.