Hey team,
I wanted to start a discussion about how datasets are represented over time. In particular, I'm thinking of a situation where a dataset is created, ingested, edited in Datahub, and deleted at some point. Then, some time in the future, a new dataset with the same name is created in the same database/schema. The same urn would be generated in both cases, so internally they would be treated as the same dataset. And if the original dataset was only soft-deleted, the new one would inherit any custom documentation added in Datahub, even if it doesn't apply to the new dataset.
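To make the collision concrete, here is a rough sketch in plain Python. It assumes the usual dataset urn shape (platform, name, environment); `make_dataset_urn` here is a hypothetical stand-in written for illustration, the `snowflake` platform and `analytics.public.orders` table are made-up examples, and the `store` dict is only a toy model of aspects keyed by urn, not how Datahub actually persists metadata.

```python
# Hypothetical helper (not Datahub's own code): it mirrors the usual
# dataset urn shape of platform + fully qualified name + environment.
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# The table is created, dropped, and later recreated under the same name.
first_urn = make_dataset_urn("snowflake", "analytics.public.orders")
second_urn = make_dataset_urn("snowflake", "analytics.public.orders")

# Toy stand-in for the metadata store: aspects keyed by urn.
# Assumption: a soft delete only marks the status as removed and
# leaves every other aspect in place.
store = {
    first_urn: {
        "status": {"removed": True},
        "editableDatasetProperties": {
            "description": "Docs written for the original table",
        },
    }
}

# Re-ingesting the new table lands on the same key and flips the status back.
store[second_urn]["status"] = {"removed": False}

# The description written for the old table is now attached to the new one.
inherited_docs = store[second_urn]["editableDatasetProperties"]["description"]
print(first_urn == second_urn, inherited_docs)
```

Nothing in the urn distinguishes the two incarnations of the table, which is why the old documentation carries over in this sketch.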
Should these be considered two separate datasets? Would the second one be considered a new version of the original? If the original dataset had its status set to removed when it was deleted from the database, would there be some indication that it had been removed and then re-ingested? I think there is a lot to consider here.
I'm interested to hear what the Datahub team's and the community's perspectives are on this scenario, from both a stateful and a stateless ingestion standpoint, and whether there is a particular direction Datahub has in mind. Thanks!